scrapy Spider.start_requests() takes 1 positional argument but 3 were given

Asked by wz1wpwve on 2022-11-09

I'm trying to scrape a website using a CrawlSpider. When I run the crawl from the command line, I get TypeError: start_requests() takes 1 positional argument but 3 were given. I checked my middleware settings, where def process_start_requests(self, start_requests, spider) does take three parameters. I have already looked at the question "scrapy project middleware - TypeError: process_start_requests() takes 2 positional arguments but 3 were given", but it didn't resolve my issue.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy import Request

class FpSpider(CrawlSpider):
    name = 'fp'
    allowed_domains = 'foodpanda.com.bd'

    rules = (Rule(LinkExtractor(allow=('product', 'pandamart')),
             callback='parse_items', follow=True, process_request='start_requests'),)

    def start_requests(self):
        yield Request(
            url='https://www.foodpanda.com.bd/darkstore/vbpl/pandamart-gulshan-2',
            meta=dict(playwright=True),
            headers={
                'sec-ch-ua': '"Google Chrome";v="105", "Not)A;Brand";v="8", "Chromium";v="105"',
                'Accept': 'application/json, text/plain, */*',
                'Referer': 'https://www.foodpanda.com.bd/',
                'sec-ch-ua-mobile': '?0',
                'X-FP-API-KEY': 'volo',
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
                'sec-ch-ua-platform': '"macOS"',
            },
        )

    def parse_items(self, response):
        item = {}
        item['name'] = response.css('h1.name::text').get()
        item['price'] = response.css('div.price::text').get()
        item['original_price'] = response.css('div.original-price::text').get()
        yield item

The error in the traceback: TypeError: start_requests() takes 1 positional argument but 3 were given.
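
For reference, the three-parameter method mentioned in the question belongs to Scrapy's spider-middleware interface and is a different thing from Spider.start_requests(self). A minimal sketch of such a middleware (the class name PlaywrightStartMiddleware is hypothetical):

class PlaywrightStartMiddleware:
    def process_start_requests(self, start_requests, spider):
        # Scrapy calls this hook with the iterable returned by
        # Spider.start_requests() and the spider instance -- hence
        # the three parameters (self, start_requests, spider).
        for request in start_requests:
            yield request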

Answer from wkyowqbh:

The problem is this argument: process_request='start_requests'.
start_requests is reserved; Scrapy uses it to produce the very first requests. CrawlSpider invokes the Rule's hook as process_request(request, response), so pointing it at the bound method start_requests(self) delivers three arguments in total to a function whose signature accepts only one, which is exactly the TypeError you see. If you want to enable Playwright for the subsequent requests, which I assume is what you are trying to do with process_request, give that hook a different name.
See the code below:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

def enable_playwright(request, response):
    # Rule.process_request hooks receive (request, response) and must
    # return the (possibly modified) request, or None to drop it.
    request.meta["playwright"] = True
    return request

class FpSpider(CrawlSpider):
    name = "fp"
    allowed_domains = ["foodpanda.com.bd"]

    rules = (Rule(LinkExtractor(allow=('product', 'pandamart')),
            callback='parse_items',
            follow=True,
            process_request=enable_playwright  # note a different function name
            # process_request='start_requests'  # THIS was the problem
            ),)
    # Rest of the code here

Also note that allowed_domains is a list, not a string.
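
If you prefer to keep the hook on the spider itself, Rule.process_request also accepts the name of a spider method as a string; any name works as long as it is not the reserved start_requests. A minimal sketch under that assumption:

class FpSpider(CrawlSpider):
    name = "fp"
    allowed_domains = ["foodpanda.com.bd"]

    rules = (Rule(LinkExtractor(allow=('product', 'pandamart')),
            callback='parse_items',
            follow=True,
            process_request='enable_playwright'  # string = name of the method below
            ),)

    def enable_playwright(self, request, response):
        # Same hook as above, but defined as a spider method.
        request.meta["playwright"] = True
        return request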
