Scrapy: how do I set default cookies for a SitemapSpider?

Posted by jexiocij on 2022-12-18 in Other

I am trying to send my own headers and cookies when crawling with a SitemapSpider:

class MySpider(SitemapSpider):
    name = 'myspider'
    sitemap_urls = ['https://www.sitemap-1.xml']
    headers = {'pragma': 'no-cache',}
    cookies = {"sdsd": "23234",}

    def _request_sitemaps(self, response):
        for url in self.sitemap_urls:
            yield scrapy.Request(url=url,headers=self.headers,cookies=self.cookies,callback=self._parse_sitemap)

    def parse(self, response, **cb_kwargs):
        print(response.css('title::text').get())

...but it does not work (the cookies and headers are not sent). How can I achieve this?


eoigrqb6 #1

My solution:

from scrapy import Request
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    name = 'spider'
    sitemap_urls = ['https://www.sitemap-1.xml']
    headers = {'authority': 'www.example.com',}
    cookies = {"dsd": "jdjsj233",}

    def start_requests(self):
        # Send the custom headers and cookies with the sitemap requests as well.
        for url in self.sitemap_urls:
            yield Request(url, self._parse_sitemap, headers=self.headers, cookies=self.cookies)

    def _parse_sitemap(self, response):
        # Let SitemapSpider build the requests, then attach headers and cookies.
        # Callbacks are left untouched so that nested sitemap indexes still go
        # to _parse_sitemap and page URLs still go to parse (the default rule).
        for request in super()._parse_sitemap(response):
            yield request.replace(headers=self.headers, cookies=self.cookies)

    def parse(self, response, **cb_kwargs):
        print(response.css('title::text').get())

rbl8hiat #2

Judging by the source code of SitemapSpider, I think renaming _request_sitemaps to start_requests should do the trick.
