Scrapy won't crawl when start_requests is overridden

Asked by vmdwslir on 2022-11-09

I'm trying to attach the start URL to each request as metadata by overriding start_requests, but the spider then refuses to crawl any pages beyond the start URLs. Does anyone know how to include the metadata in the requests while still crawling pages beyond the start URLs?
Thanks.

from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TSpider(CrawlSpider):
    name = 't'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        Rule(LinkExtractor(allow=[r'.*page.*']), callback='parse_item', follow=True),
    )

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse_item, meta={'start_url': url})

    def parse_item(self, response):
        item = {}
        item['title'] = response.xpath('//head/title/text()').extract()
        item['url'] = response.url
        item['start_url'] = response.meta['start_url']
        yield item

Answer 1 (ohtdti5x):

Your problem is the callback you pass in start_requests; remove it. CrawlSpider only applies its rules when a start response is handled by the spider's built-in callback, so pointing the start requests at parse_item means the link-extraction rules never run and no pages beyond the start URLs are scheduled.
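
In other words, the minimal fix to the spider above is to drop the callback; a sketch of the corrected start_requests:

def start_requests(self):
    for url in self.start_urls:
        # No explicit callback here: CrawlSpider's default callback handles the
        # start responses and applies the rules, so crawling continues past the
        # start URLs.
        yield Request(url, meta={'start_url': url})
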
If you want to add the start URL to every request, you can do one of the following:

**Method 1:** use process_request (preferable to Method 2).

from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

def process_request(request, response):
    # Propagate the start_url meta from the response that produced this link
    # onto the newly extracted request.
    request.meta['start_url'] = response.request.meta.get('start_url')
    return request

class TSpider(CrawlSpider):
    name = 't'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        Rule(LinkExtractor(allow=[r'.*page.*']), callback='parse_item', follow=True, process_request=process_request),
    )

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, meta={'start_url': url})

    def parse_item(self, response):
        item = dict()
        item['title'] = response.xpath('//head/title/text()').extract()
        item['url'] = response.url
        item['start_url'] = response.request.meta.get('start_url')
        yield item
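
As a small variation on Method 1 (just a sketch, not part of the original answer), process_request can also be given as a string naming a spider method, which Scrapy resolves on the spider instance; the method name tag_with_start_url below is illustrative:

from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TSpider(CrawlSpider):
    name = 't'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        # The string is looked up as a method on the spider instance.
        Rule(LinkExtractor(allow=[r'.*page.*']), callback='parse_item',
             follow=True, process_request='tag_with_start_url'),
    )

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, meta={'start_url': url})

    def tag_with_start_url(self, request, response):
        # Hypothetical helper: copy the start_url meta onto each extracted request.
        request.meta['start_url'] = response.meta.get('start_url')
        return request

    def parse_item(self, response):
        yield {
            'title': response.xpath('//head/title/text()').extract(),
            'url': response.url,
            'start_url': response.meta.get('start_url'),
        }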

**Method 2:** override the _requests_to_follow method.

from scrapy import Request
from scrapy.http import HtmlResponse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TSpider(CrawlSpider):
    name = 't'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        Rule(LinkExtractor(allow=[r'.*page.*']), callback='parse_item', follow=True),
    )

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, meta={'start_url': url})

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                request.meta['start_url'] = response.meta.get('start_url')  # I added just this one line
                yield rule.process_request(request, response)

    def parse_item(self, response):
        item = dict()
        item['title'] = response.xpath('//head/title/text()').extract()
        item['url'] = response.url
        item['start_url'] = response.meta.get('start_url')
        yield item
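
If you want to try either version quickly, here is a minimal runner sketch using CrawlerProcess; the FEEDS output file name and the settings are assumptions, not part of the original answer:

from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    process = CrawlerProcess(settings={
        # Hypothetical feed export: write the yielded items to a local JSON file.
        'FEEDS': {'items.json': {'format': 'json'}},
    })
    process.crawl(TSpider)
    process.start()  # blocks until the crawl finishes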
