Scrapy does not use all start_urls

Asked by mrwjdhj3 on 2022-11-09, in: Other

I have been struggling with this for quite a while and have not been able to solve it. The problem is that I have a start_urls list of a couple of hundred URLs, but only a part of those URLs is consumed by my spider's start_requests().

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):

    #SETTINGS
    name = 'example'
    allowed_domains = []
    start_urls = []

    #set rules for links to follow        
    link_follow_extractor = LinkExtractor(allow=allowed_domains,unique=True) 
    rules = (Rule(link_follow_extractor, callback='parse', process_request = 'process_request', follow=True),) 

    def __init__(self,*args,**kwargs):
        super(MySpider, self).__init__(* args,**kwargs)

        #urls to scrape
        self.start_urls = ['https://example1.com','https://example2.com']
        self.allowed_domains = ['example1.com','example2.com']          

    def start_requests(self):

        #create initial requests for urls in start_urls        
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse,priority=1000,meta={'priority':100,'start':True})

    def parse(self, response):
        print("parse")

I have read a lot of posts about this on StackOverflow, as well as a few threads on GitHub (going all the way back to 2015), but have not been able to get it to work.
As far as I understand, the problem is that while the initial requests are still being created, earlier requests have already returned responses, which get parsed and spawn new requests that fill up the queue. I confirmed this is my problem: when I used a middleware to limit the number of downloaded pages per domain to 2, the issue seemed to go away. That makes sense, because the first requests created would then only generate a few new requests, leaving room in the queue for the rest of the start_urls list.
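For illustration, here is a minimal sketch of a per-domain page-limit downloader middleware of the kind described above. The class name, the PAGES_PER_DOMAIN setting, and the default limit of 2 are assumptions for this sketch, not the actual middleware used:

from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest


class DomainPageLimitMiddleware:
    # Hypothetical downloader middleware: drops further requests for a domain
    # once that domain has reached its page limit. It would be enabled via
    # DOWNLOADER_MIDDLEWARES in settings.py.

    def __init__(self, limit):
        self.limit = limit
        self.counts = {}

    @classmethod
    def from_crawler(cls, crawler):
        # PAGES_PER_DOMAIN is an assumed custom setting; 2 matches the limit mentioned above
        return cls(crawler.settings.getint('PAGES_PER_DOMAIN', 2))

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        if self.counts.get(domain, 0) >= self.limit:
            raise IgnoreRequest(f'page limit reached for {domain}')
        self.counts[domain] = self.counts.get(domain, 0) + 1
        return None  # let the request proceed normally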
I also noticed that when I reduced the number of concurrent requests from 32 to 2, again only a small part of the start_urls list was consumed. Increasing the number of concurrent requests to several hundred is not an option, as that leads to timeouts.
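For reference, the concurrency knobs involved are plain settings in settings.py; the values below simply restate the numbers mentioned above (Scrapy's defaults are 16 in total and 8 per domain):

CONCURRENT_REQUESTS = 2              # reduced from 32 as described above
# CONCURRENT_REQUESTS_PER_DOMAIN = 8   # Scrapy's default, shown for context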
It is still unclear to me why the spider behaves this way and simply does not keep consuming the start_urls. I would really appreciate any suggestions for a potential solution to this problem.

2wnc66cl (Answer 1)

I was struggling with the same problem: my crawler would never get past page 1 of any of the start_urls I defined.
The documentation does say that the CrawlSpider class internally uses its own parse method for every response, so you should never define a custom parse at the risk of the spider no longer working. What it does not mention is that the parser CrawlSpider uses does not parse the start_urls (even though it expects the start_urls to be parsed), so the spider works at first and then fails when it tries to crawl to the next page/start_url, with a "there's no parse in the callback" error.
Long story short, try this (it worked for me): add a parse function for the start_urls. It does not need to do anything, just like mine:

def parse(self, start_urls):
    for i in range(1, len(start_urls)):
        print('Starting to scrape page: ' + str(i))
    self.start_urls = start_urls

Here is my whole code (the user agent is defined in the project's settings):

from urllib.request import Request
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class PSSpider(CrawlSpider):
    name = 'jogos'
    allowed_domains = ['meugameusado.com.br']
    start_urls = ['https://www.meugameusado.com.br/playstation/playstation-3/jogos?pagina=1', 'https://www.meugameusado.com.br/playstation/playstation-4/jogos?pagina=1',
    'https://www.meugameusado.com.br/playstation/playstation-2/jogos?pagina=1', 'https://www.meugameusado.com.br/playstation/playstation-5/jogos?pagina=1',
    'https://www.meugameusado.com.br/playstation/playstation-vita/jogos?pagina=1'] 

    def parse(self, start_urls):
        for i in range(1, len(start_urls)):
            print('Starting to scrape page: ' + str(i))
        self.start_urls = start_urls

    rules = (
        Rule(LinkExtractor(allow=([r'/playstation/playstation-2/jogos?pagina=[1-999]',r'/playstation/playstation-3/jogos?pagina=[1-999]',
         r'/playstation/playstation-4/jogos?pagina=[1-999]', r'/playstation/playstation-5/jogos?pagina=[1-999]', r'/playstation/playstation-vita/jogos?pagina=[1-999]', 'jogo-'])
         ,deny=('/jogos-de-','/jogos?sort=','/jogo-de-','buscar?','-mega-drive','-sega-cd','-game-gear','-xbox','-x360','-xbox-360','-xbox-series','-nes','-gc','-gbc','-snes','-n64','-3ds','-wii','switch','-gamecube','-xbox-one','-gba','-ds',r'/nintendo*', r'/xbox*', r'/classicos*',r'/raridades*',r'/outros*'))
         ,callback='parse_item'
         ,follow=True),
    )

    def parse_item(self, response):
        yield {
            'title': response.css('h1.nome-produto::text').get(),
            'price': response.css('span.desconto-a-vista strong::text').get(),
            'images': response.css('span > img::attr(data-largeimg)').getall(),
            'video': response.css('#playerVideo::attr("src")').get(),
            'descricao': response.xpath('//*[@id="descricao"]/h3[contains(text(),"ESPECIFICAÇÕES")]/preceding-sibling::p/text()').getall(),
            'especificacao1': response.xpath('//*[@id="descricao"]/h3[contains(text(),"GARANTIA")]/preceding-sibling::ul/li/strong/text()').getall(),
            'especificacao2': response.xpath('//*[@id="descricao"]/h3[contains(text(),"GARANTIA")]/preceding-sibling::ul/li/text()').getall(),
            'tags': response.xpath('//*[@id="descricao"]/h3[contains(text(),"TAGS")]/following-sibling::ul/li/a/text()').getall(),
            'url': response.url,
        }
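Assuming this spider sits in a regular Scrapy project, it can then be run from the project directory in the usual way (the output file name here is just an example):

scrapy crawl jogos -o jogos.json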
