scrapy: overriding the default User-Agent in a CrawlSpider

ct2axkht · posted 2023-03-23

I'm having trouble overriding the default User-Agent in the CrawlSpider template.

    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating",
            headers={'User-Agent': self.user_agent},
        )

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//h3[@class="lister-item-header"]/a'),
            callback="parse_item",
            follow=True,
            process_request='set_user_agent',
        ),
    )

    def set_user_agent(self, request):
        request.headers['User-Agent'] = self.user_agent
        return request

    def parse_item(self, response):
        yield {
            'title': response.xpath('//div[@class="sc-b5e8e7ce-1 kNhUtn"]/h1[@class="sc-b73cd867-0 gLtJub"]/text()').get()
        }

I'm getting the following error:

File "/mnt/c/Users/asib0/OneDrive/scrapy_project1/scrapy-env/lib/python3.10/site-packages/scrapy/spidermiddlewares/depth.py", line 35, in process_spider_output_async
    async for r in result or ():
  File "/mnt/c/Users/asib0/OneDrive/scrapy_project1/scrapy-env/lib/python3.10/site-packages/scrapy/core/spidermw.py", line 116, in process_async
    async for r in iterable:
  File "/mnt/c/Users/asib0/OneDrive/scrapy_project1/scrapy-env/lib/python3.10/site-packages/scrapy/spiders/crawl.py", line 129, in _parse_response
    for request_or_item in self._requests_to_follow(response):
  File "/mnt/c/Users/asib0/OneDrive/scrapy_project1/scrapy-env/lib/python3.10/site-packages/scrapy/spiders/crawl.py", line 105, in _requests_to_follow
    yield rule.process_request(request, response)
TypeError: BestMovieSpider.set_user_agent() takes 2 positional arguments but 3 were given
2023-03-06 17:56:58 [scrapy.core.engine] INFO: Closing spider (finished)

How do I correctly set the User-Agent in the CrawlSpider template?


cngwdvgl1#

Use custom_settings to set the User-Agent for all requests... it's much easier.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BestMovieSpider(CrawlSpider):
    name = "best_movie"
    allowed_domains = ["www.imdb.com"]
    start_urls = ["https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating"]

    # custom_settings overrides the project-level settings for this spider only,
    # so USER_AGENT here applies to every request the spider makes.
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
    }

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//h3[@class="lister-item-header"]/a'),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath('//div[@class="sc-b5e8e7ce-1 kNhUtn"]/h1[@class="sc-b73cd867-0 gLtJub"]/text()').get()
        }

wd2eg0qa2#

You also need to pass the request through your method, since set_user_agent takes the request as an argument. Check the code below; this will fix the error for the start request.

def start_requests(self):
    request = scrapy.Request(url="https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating")
    request = self.set_user_agent(request)
    yield request
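
For the requests extracted by the rule, the traceback points at the actual cause of the TypeError: CrawlSpider calls rule.process_request(request, response) with two arguments, so the hook named in process_request='set_user_agent' has to accept a response parameter as well. A minimal sketch of that signature, keeping the rest of the spider from the question unchanged:

def set_user_agent(self, request, response):
    # Rule.process_request hooks receive both the request and the response
    # (hence "takes 2 positional arguments but 3 were given" above).
    request.headers['User-Agent'] = self.user_agent
    return request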
