Why is Scrapy skipping some URLs but not others?

wtlkbnrh posted on 2023-03-08 in Other

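I'm writing a Scrapy crawler to scrape shirt information from Amazon. The crawler starts from an Amazon search results page for a query such as "funny shirts" and collects all the result item containers. It then parses each result item and collects the shirt's data.
I use ScraperAPI and Scrapy-user-agents to avoid being blocked by Amazon. The code for my spider is: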

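import re

import scrapy

from ..items import ScrapeAmazonItem  # assumed location of the item class; adjust to your project


class AmazonSpiderSpider(scrapy.Spider):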
    name = 'amazon_spider'
    page_number = 2

    keyword_file = open("keywords.txt", "r+")
    all_key_words = keyword_file.readlines()
    keyword_file.close()
    all_links = []
    keyword_list = []

    for keyword in all_key_words:
        keyword_list.append(keyword)
        formatted_keyword = keyword.replace('\n', '')
        formatted_keyword = formatted_keyword.strip()
        formatted_keyword = formatted_keyword.replace(' ', '+')
        all_links.append("http://api.scraperapi.com/?api_key=mykeyd&url=https://www.amazon.com/s?k=" + formatted_keyword + "&ref=nb_sb_noss_2")

    start_urls = all_links

    def parse(self, response):
        print("========== starting parse ===========")

        all_containers = response.css(".s-result-item")
        for shirts in all_containers:
            next_page = shirts.css('.a-link-normal::attr(href)').extract_first()
            if next_page is not None:
                if "https://www.amazon.com" not in next_page:
                    next_page = "https://www.amazon.com" + next_page
                yield scrapy.Request('http://api.scraperapi.com/?api_key=mykey&url=' + next_page, callback=self.parse_dir_contents)

        second_page = response.css('li.a-last a::attr(href)').get()
        if second_page is not None and AmazonSpiderSpider.page_number < 3:
            AmazonSpiderSpider.page_number += 1
            yield response.follow(second_page, callback=self.parse)

    def parse_dir_contents(self, response):
        items = ScrapeAmazonItem()

        print("============= parsing page ==============")

        temp = response.css('#productTitle::text').extract()
        product_name = ''.join(temp)
        product_name = product_name.replace('\n', '')
        product_name = product_name.strip()

        temp = response.css('#priceblock_ourprice::text').extract()
        product_price = ''.join(temp)
        product_price = product_price.replace('\n', '')
        product_price = product_price.strip()

        temp = response.css('#SalesRank::text').extract()
        product_score = ''.join(temp)
        product_score = product_score.strip()
        product_score = re.sub(r'\D', '', product_score)

        product_ASIN = re.search(r'(?<=/)B[A-Z0-9]{9}', response.url)
        product_ASIN = product_ASIN.group(0)

        items['product_ASIN'] = product_ASIN
        items['product_name'] = product_name
        items['product_price'] = product_price
        items['product_score'] = product_score

        yield items

The crawl looks like this:
https://i.stack.imgur.com/UbVUt.png
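I get a 200 response, so I know I'm getting data from the page, but sometimes the spider never enters parse_dir_contents, or it scrapes only a few shirts and then moves on to the next keyword without following pagination.
With two keywords: for the first keyword in my file (keywords.txt), it might find 1-3 shirts and then move on to the next keyword. The second keyword then succeeds completely, finding all shirts and following pagination. With a keyword file of 5+ keywords, the first 2-3 keywords are skipped, then the next keyword is loaded and only 2-3 shirts are found before it moves on to the following keyword, which again succeeds completely. With a file of 10+ keywords, I get very sporadic behavior.
I have no idea why this is happening. Can anyone explain?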

scrapy

Source: https://stackoverflow.com/questions/57760431/why-is-scrapy-skipping-some-urls-but-not-others

2 Answers

wfauudbj1#

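First, check whether robots.txt is being ignored. From what you describe, I assume you already have that covered.
Sometimes the HTML returned in the response is not the same as what you see when you view the product in a browser. I don't know exactly what is happening in your case, but you can check what the spider is actually "reading":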

scrapy shell 'yourURL'

After that:

view(response)

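If the request succeeded, you can see the page the spider actually sees.
Sometimes the request does not succeed (maybe Amazon is redirecting you to a CAPTCHA or something similar).
You can also check the response while scraping (please double-check the code below, I wrote it from memory):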

import requests

# inside your parse method

r = requests.get("url")
print(r.content)

If I remember correctly, you can get the URL from Scrapy itself (something like response.url).
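The same check can also be done on the response Scrapy already fetched, without a second HTTP request. Below is a minimal sketch of that idea. The helper names `looks_blocked` and `describe_response` are hypothetical, and the marker strings in `CAPTCHA_MARKERS` are only guesses at what a block page might contain, so adjust them to whatever you actually see in `view(response)`.

```python
# Hypothetical helpers for spotting a blocked or CAPTCHA page.
# The marker strings are assumptions, not a documented Amazon format.
CAPTCHA_MARKERS = (
    "captcha",
    "robot check",
    "enter the characters you see below",
)


def looks_blocked(status, body_text):
    """Return True if the status or the page text suggests a block page."""
    if status != 200:
        return True
    lowered = body_text.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


def describe_response(status, url, body_text):
    """Build a one-line diagnostic summary for a fetched page."""
    verdict = "possibly blocked" if looks_blocked(status, body_text) else "looks normal"
    return f"{status} {url} ({len(body_text)} chars): {verdict}"
```

Inside a callback such as parse_dir_contents this could be called as `self.logger.info(describe_response(response.status, response.url, response.text))`, so every product page that came back as a block page shows up in the crawl log even though its status is 200.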

9rnv2umw2#

Try using dont_filter=True in your Scrapy requests. I had the same problem; it looked like the Scrapy crawler was ignoring some URLs because it considered them duplicates.

dont_filter=True

This tells Scrapy's dupefilter not to filter out that request.
