Scrapy rotating proxies (STORM, SMART) do not provide a unique IP on each Scrapy request

nbysray5 posted on 2022-11-09 in Storm

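How can I make sure I get a new IP on every Scrapy request? I tried StormProxy and SmartProxy, but the IP they give is the same for the whole session.
Each run does get a new IP, but within a single session the IP stays the same.
My code is below: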

import json
import uuid
import scrapy
from scrapy.crawler import CrawlerProcess

class IpTest(scrapy.Spider):
    name = 'IP_test'
    previous_ip = ''
    count = 1
    ip_url = 'https://ifconfig.me/all.json'

    def start_requests(self,):
        yield scrapy.Request(
            self.ip_url,
            dont_filter=True,
            meta={
                'cookiejar': uuid.uuid4().hex,
                'proxy': MY_ROTATING_PROXY # either stormproxy or smartproxy
            }
        )

    def parse(self, response):
        ip_address = json.loads(response.text)['ip_addr']
        self.logger.info(f"IP: {ip_address}")
        if self.count < 10:
            self.count += 1
            yield from self.start_requests()

settings = {
    'DOWNLOAD_DELAY': 1,
    'CONCURRENT_REQUESTS': 1,
}

process = CrawlerProcess(settings)
process.crawl(IpTest)
process.start()

Output log:

2020-12-27 21:15:52 [scrapy.core.engine] INFO: Spider opened
2020-12-27 21:15:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-12-27 21:15:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-12-27 21:15:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: None)
2020-12-27 21:15:55 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:15:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:15:56 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:15:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:15:57 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:15:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:15:59 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:00 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:01 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:03 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:04 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:06 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:07 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:07 [scrapy.core.engine] INFO: Closing spider (finished)

What am I doing wrong here? I even tried disabling cookies (COOKIES_ENABLED = False) and removing cookiejar from request.meta, but with no luck.

kx1ctssn (answer #1)

This was tricky, but I found the answer. For Storm, you need to pass a header with 'Connection': 'close'. With that header you get a new proxy for each request. For example:

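from scrapy import Request

HEADERS = {'Connection': 'close'}  # ask for the connection to be closed after each response
yield Request(url=url, callback=self.parse, body=body, headers=HEADERS)  # inside a spider method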

In this case, Storm closes the connection and gives you a new IP on each request.
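To apply this to the spider in the question, one option is a small helper that builds the request arguments with the header already set. This is only a sketch: `rotating_request_kwargs` is a hypothetical name (not part of Scrapy), and the proxy address below is a placeholder for whatever gateway your provider gives you. I have only seen this reported to work for Storm; other providers may behave differently.

```python
import uuid


def rotating_request_kwargs(url, proxy, extra_headers=None):
    """Build keyword arguments for scrapy.Request that ask for a fresh connection.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    headers = dict(extra_headers or {})
    # The fix from the answer: ask for the connection to be closed after the
    # response, so the next request has to open a new one through the proxy.
    headers['Connection'] = 'close'
    return {
        'url': url,
        'headers': headers,
        'dont_filter': True,  # the same URL is requested repeatedly
        'meta': {
            'cookiejar': uuid.uuid4().hex,  # separate cookie session per request
            'proxy': proxy,
        },
    }


if __name__ == '__main__':
    # Placeholder proxy address (assumption); substitute your own gateway.
    kwargs = rotating_request_kwargs(
        'https://ifconfig.me/all.json', 'http://proxy.example.com:8000'
    )
    print(kwargs['headers'])
```

In the spider, `start_requests` would then do `yield scrapy.Request(callback=self.parse, **rotating_request_kwargs(self.ip_url, MY_ROTATING_PROXY))`, and the logged IP should change between requests if the proxy rotates per connection.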
