Scrapy: crawl all pages of a website following the spider's rules

Asked by cgyqldqp on 2022-11-09

I wrote a simple spider and I want it to follow all the links within a single domain (amazon.com in this example). This is the code I have so far:


# -*- coding: utf-8 -*-

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from urllib.parse import urlparse
from scrapy.utils.response import open_in_browser
class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['amazon.com']
    rules = (
        Rule(LinkExtractor(allow='',
            deny_extensions=['7z', '7zip', 'apk', 'bz2', 'cdr', 'dmg', 'ico', 'iso', 'tar', 'tar.gz', 'pdf', 'docx'],
            ), callback='parse_item', follow=True,
            ),
    )
    custom_settings = {'LOG_ENABLED':True}
    def start_requests(self):
        #print(self.website)
        url = 'https://www.amazon.com/s?k=water+balloons'
        yield scrapy.Request(url,callback=self.parse_item,)

    def parse_item(self,response):
        #open_in_browser(response)
        print(response.url)

I checked this question, but its answer did not work: scrapy follow all the links and get status. I also tried replacing allow='' with restrict_xpaths='//a', but that did not solve it either. Any help is appreciated.
Note: the crawler must stay within the "amazon.com" domain.

Answer 1, by yb3bgrhw:

You have specified the rules correctly; the problem with your code is that start_requests does not send the first request to the right callback.
For the rules to be triggered, the initial request has to go to the built-in parse method of CrawlSpider,
something like this:

def start_requests(self):
    #print(self.website)
    url = 'https://www.amazon.com/s?k=water+balloons'
    yield scrapy.Request(url, callback=self.parse)
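
Putting the fix together, a minimal corrected version of the spider could look like the sketch below. This is only a sketch based on the code in the question (the deny_extensions list is shortened for brevity): routing the first request to self.parse is what lets the CrawlSpider rules run on the response and schedule the follow-up links, which are then handed to parse_item.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = 'example'
    # The offsite filter keeps scheduled requests inside this domain.
    allowed_domains = ['amazon.com']
    rules = (
        # Follow every link and hand each crawled page to parse_item.
        Rule(LinkExtractor(deny_extensions=['7z', 'apk', 'bz2', 'dmg', 'iso', 'tar', 'pdf', 'docx']),
             callback='parse_item', follow=True),
    )
    custom_settings = {'LOG_ENABLED': True}

    def start_requests(self):
        url = 'https://www.amazon.com/s?k=water+balloons'
        # Route the first response through CrawlSpider's built-in parse
        # so the rules above are applied to it.
        yield scrapy.Request(url, callback=self.parse)

    def parse_item(self, response):
        print(response.url)

An equivalent option is to delete start_requests entirely and declare start_urls = ['https://www.amazon.com/s?k=water+balloons']; CrawlSpider then routes the initial request through its own parse method by default, and the rules fire without any extra code.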
