Scrapy: passing a parameter in the cookies

Asked by 31moq8wy on 2022-11-23

I need to crawl all the locations of the mkm site. If I understand correctly, the geolocation is set both by the REGION_ID parameter in the URL (https://mkm-metal.ru/?REGION_ID=141) and by an ID in the cookies ({'BITRIX_SM_CITY_ID': loc_id}).

import scrapy
import re

class Mkm(scrapy.Spider):
    name = 'mkm'

    def start_requests(self, **cb_kwargs):
        for loc_id in ['142', '8', '12', '96']:
            url = f"https://mkm-metal.ru/?REGION_ID={loc_id}"
            cb_kwargs['cookies'] = {'BITRIX_SM_CITY_ID': loc_id}
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                # meta={'cookiejar': loc_id},
                cookies=cb_kwargs['cookies'],
                cb_kwargs=cb_kwargs,
            )

    def parse(self, response, **cb_kwargs):
        yield scrapy.Request(
            url='https://mkm-metal.ru/catalog/',
            callback=self.parse_2,
            # meta={'cookiejar': response.meta['cookiejar']},
            cookies=cb_kwargs['cookies'],
        )

    def parse_2(self, response, **cb_kwargs):
        city = response.css('a.place span::text').get().strip()
        print(city, response.url)

But in my case the parse_2 method returns only one city (for the first ID, 142).
Here is the log:

2022-06-05 17:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=142> (referer: None)
2022-06-05 17:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=8> (referer: None)
2022-06-05 17:32:46 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://mkm-metal.ru/catalog/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2022-06-05 17:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/catalog/> (referer: https://mkm-metal.ru/?REGION_ID=142)
Бугульма https://mkm-metal.ru/catalog/
2022-06-05 17:32:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=12> (referer: None)
2022-06-05 17:32:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=96> (referer: None)
2022-06-05 17:32:47 [scrapy.core.engine] INFO: Closing spider (finished)
Answer from zphenhs4:

In the parse function you request the same URL for every set of cookies. Scrapy filters out duplicate requests, so only the first one gets through and the rest are ignored. Add dont_filter=True:

import scrapy

class Mkm(scrapy.Spider):
    name = 'mkm'

    def start_requests(self):
        for loc_id in ['142', '8', '12', '96']:
            url = f"https://mkm-metal.ru/?REGION_ID={loc_id}"
            cookies = {'BITRIX_SM_CITY_ID': loc_id}
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                cookies=cookies,
                # hand the cookies on to the callback
                cb_kwargs={'cookies': cookies},
            )

    def parse(self, response, cookies):
        yield scrapy.Request(
            url='https://mkm-metal.ru/catalog/',
            callback=self.parse_2,
            cookies=cookies,
            # the catalog URL is the same for every region, so the
            # duplicate filter must be bypassed explicitly
            dont_filter=True,
        )

    def parse_2(self, response):
        city = response.css('a.place span::text').get().strip()
        print(city, response.url)
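
The commented-out meta={'cookiejar': ...} lines hint at another option. By default the built-in CookiesMiddleware keeps a single cookiejar per spider, so with several regions being crawled concurrently the BITRIX_SM_CITY_ID values can overwrite one another between requests. A minimal sketch of the two request-producing methods using one jar per region (the rest of the spider stays as above); this relies on the documented cookiejar meta key, and dont_filter is still required:

    def start_requests(self):
        for loc_id in ['142', '8', '12', '96']:
            yield scrapy.Request(
                url=f"https://mkm-metal.ru/?REGION_ID={loc_id}",
                callback=self.parse,
                cookies={'BITRIX_SM_CITY_ID': loc_id},
                # keep each region's cookies in their own jar
                meta={'cookiejar': loc_id},
            )

    def parse(self, response):
        yield scrapy.Request(
            url='https://mkm-metal.ru/catalog/',
            callback=self.parse_2,
            # reuse the same jar so the region cookie is sent again
            meta={'cookiejar': response.meta['cookiejar']},
            dont_filter=True,  # still the same URL for every region
        )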
