Order of JSON data is messed up when scraping multiple URLs (Scrapy)

Posted by 9o685dep on 2022-11-09 in Other


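I am new to Scrapy. I wrote a script to scrape data from a website; it works well, and the resulting JSON file looks perfect. When I use the same script to scrape multiple URLs (all from the same website), it still works and I get the data for every URL in the JSON file, but there is a bug. For a single URL, the output is structured like this (as produced by the code in the script below):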

[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:},  #URL1
{attribute:} #URL1
]

When I put two URLs into the spider, I get this:

[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:},#URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{titleDesc:,,,Content:}, #URL2
{attribute:} #URL2
]

That is still fine, but when I add more URLs the structure gets mixed up and becomes this:

[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:}, #URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{Title:,,,Description:,,,Brochure:}, #URL3
{titleDesc:,,,Content:}, #URL2
{attribute:}, #URL2
{titleDesc:,,,Content:}, #URL3
{attribute:}
]

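If you look closely, you will see that the title of the third URL comes directly after the title of the second URL, ahead of the second URL's remaining items. Can anyone help?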

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "attributes"
    start_urls = [
        "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/161/",
        "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/162/",
    ]

    def parse(self, response):
        # First item: page title and description.
        yield {
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": "brochure",
        }

        # One item per collapsible section: header text plus its paragraphs.
        for post in response.css(".el-collapse"):
            for i in range(len(post.css(".el-collapse-item__header"))):
                res = ""
                lst = post.css(".value-el-desc")
                x = lst[i].css(".value-el-desc p::text").extract()
                for y in x:
                    res += y.strip() + "&&"
                try:
                    yield {
                        "descTitle": post.css(".el-collapse-item__header::text")[i].get().strip(),
                        "desc": res,
                    }
                except IndexError:
                    # No header text node at this index; skip the section.
                    continue

        # One item per attribute entry.
        for post in response.css(".lie-one-canshu"):
            try:
                yield {"attribute": post.css(".lie-one-canshu::text")[0].get().strip()}
            except IndexError:
                # Entry has no text node; skip it.
                continue

Update: I noticed that the bug is not permanent; sometimes I run the script and the result is fine.

scrapy

Source: https://stackoverflow.com/questions/73420200/order-of-json-data-is-messed-up-when-scraping-multiple-urls-scrapy

1 Answer


siv3szwd1#

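Scrapy is asynchronous, so there is no guarantee about the order in which items are processed or written to the output, at least not out of the box. Responses for different URLs can arrive and be parsed in an interleaved fashion, which is also why the result sometimes happens to look correct. If you want all of the items from one URL to be output together, I suggest yielding only one item per call to the parse method.
For example: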

def parse(self, response):
    # Collect everything scraped from this response into a single item,
    # so that one URL always produces exactly one entry in the output.
    results = {
        "items": [{
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": "brochure",
        }]
    }

    for post in response.css(".el-collapse"):
        for i in range(len(post.css(".el-collapse-item__header"))):
            res = ""
            lst = post.css(".value-el-desc")
            x = lst[i].css(".value-el-desc p::text").extract()
            for y in x:
                res += y.strip() + "&&"
            try:
                results["items"].append({
                    "descTitle": post.css(".el-collapse-item__header::text")[i].get().strip(),
                    "desc": res,
                })
            except IndexError:
                # No header text node at this index; skip the section.
                continue

    for post in response.css(".lie-one-canshu"):
        try:
            results["items"].append({
                "attribute": post.css(".lie-one-canshu::text")[0].get().strip()
            })
        except IndexError:
            # Entry has no text node; skip it.
            continue

    # A single yield per response keeps each URL's data together.
    yield results
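
If you would rather keep yielding flat items, another option is to regroup them after the crawl. The sketch below is only an illustration and makes one assumption that is not in the original script: every yielded dict also carries a "url" key (for instance by adding "url": response.url to each dict in parse). The helper name group_items_by_url and the sample data are hypothetical.

```python
import json
from collections import defaultdict


def group_items_by_url(items, url_order):
    """Return the items reordered so that each URL's items are contiguous.

    items     -- flat list of dicts, each with a "url" key (assumed to be
                 added by the spider; the original script does not do this)
    url_order -- the desired order of URLs, e.g. the spider's start_urls
    """
    grouped = defaultdict(list)
    for item in items:
        grouped[item["url"]].append(item)
    # Within one URL, items stay in the order they had in the input list.
    return [item for url in url_order for item in grouped.get(url, [])]


if __name__ == "__main__":
    # Interleaved input, shaped like the mixed-up output in the question.
    flat = [
        {"url": "u1", "title": "A"},
        {"url": "u2", "title": "B"},
        {"url": "u3", "title": "C"},
        {"url": "u2", "descTitle": "B-desc"},
        {"url": "u3", "descTitle": "C-desc"},
    ]
    print(json.dumps(group_items_by_url(flat, ["u1", "u2", "u3"]), indent=2))
```

Compared with yielding one item per response, this keeps the flat output format but needs an extra post-processing step, so the single-item approach above is usually the simpler choice.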

Answered on 2022-11-09
