Scrapy: how to fix "Resetting dropped connection"

kgqe7b3p  posted on 2022-11-09 in Other

I keep getting an error when I try to scrape several URLs with Scrapy through a Selenium middleware.
Middleware.py:

# Imports assumed from usage: uc is taken to be undetected_chromedriver
import undetected_chromedriver as uc
from scrapy.http import HtmlResponse

class SeleniumMiddleWare(object):

    def __init__(self):
        path = "G:/Downloads/chromedriver.exe"
        options = uc.ChromeOptions()
        options.headless = True
        # Preferences that block image loading
        chrome_prefs = {}
        options.experimental_options["prefs"] = chrome_prefs
        chrome_prefs["profile.default_content_settings"] = {"images": 2}
        chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}
        self.driver = uc.Chrome(options=options, use_subprocess=True, driver_executable_path=path)

    def process_request(self, request, spider):
        try:
            self.driver.get(request.url)
        except:
            pass
        content = self.driver.page_source
        self.driver.quit()

        return HtmlResponse(request.url, encoding='utf-8', body=content, request=request)

    def process_response(self, request, response, spider):
        return response

Spider.py:

import scrapy

class SeleniumSpider(scrapy.Spider):
    name = 'steamdb'
    #allowed_domains = ['steamdb.info']
    start_urls = ['https://steamdb.info/graph/']

    def parse(self, response):
        table = response.xpath('//*[@id="table-apps"]/tbody')
        rows = table.css('tr[class="app"]')
        #b= a.css('tr [class = "app"]::text')
        #table = b.xpath('//*[@id="table-apps"]/tbody/tr')

        for element in rows:
            # Resolve the relative href against the page URL; str.join would
            # insert the base URL between every character of the href.
            link = response.urljoin(element.css('::attr(href)').get())
            name = element.css('a ::text')[0].get()
            game_info = {"Link": link, "Name": name}
            yield scrapy.Request(url=link, callback=self.parse_info, cb_kwargs=dict(game_info=game_info))

    def parse_info(self, response, game_info):
        game_info["sales"] = response.xpath('//*[@id="graphs"]/div[5]/div[2]/ul/li[1]/strong/span/text()').getall()
        yield game_info

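Note: the scraper works without cb_kwargs, and it works if I only scrape the page in start_urls, but it fails as soon as I issue new requests to other URLs or follow links to further pages.
Error: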

2022-07-12 20:53:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://steamdb.info/graph/> (referer: https://steamdb.info/graph/)
2022-07-12 20:53:54 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://localhost:52304/session/99578d3d4f168c77b58a85f67be06927/execute/sync {"script": "return navigator.webdriver", "args": []}
2022-07-12 20:53:54 [urllib3.connectionpool] DEBUG: Resetting dropped connection: localhost
2022-07-12 20:53:56 [urllib3.util.retry] DEBUG: Incremented Retry for (url='/session/99578d3d4f168c77b58a85f67be06927/execute/sync'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
2022-07-12 20:53:56 [urllib3.connectionpool] WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000015E5EB66EC0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/99578d3d4f168c77b58a85f67be06927/execute/sync
2022-07-12 20:53:56 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (2): localhost:52304
2022-07-12 20:53:58 [urllib3.util.retry] DEBUG: Incremented Retry for (url='/session/99578d3d4f168c77b58a85f67be06927/execute/sync'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
2022-07-12 20:53:58 [urllib3.connectionpool] WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000015E5ED6C970>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/99578d3d4f168c77b58a85f67be06927/execute/sync

j2cgzkjk  #1

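"the target machine actively refused it" means the host answered but nothing is listening on the specified port (52304). Can you check whether that port is reachable? Could a local firewall be blocking it?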
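One way to check whether anything is listening on a port is to attempt a plain TCP connection. The sketch below is only an illustration using the standard library; the helper name `port_is_open` is made up for this example, and a throwaway local listener stands in for chromedriver so the snippet runs on its own.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers a refused connection (WinError 10061 on Windows) and timeouts.
        return False


# Demonstration with a temporary local listener instead of chromedriver.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

open_while_listening = port_is_open("127.0.0.1", port)
server.close()
open_after_close = port_is_open("127.0.0.1", port)

print(open_while_listening, open_after_close)
```

The first check happens while the listener is up and the second after it has been closed, which mirrors a driver process that has gone away. For the log in the question, the equivalent call would be `port_is_open("localhost", 52304)`.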
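UPD: It looks like you call self.driver.quit() in every process_request. That shuts down the chromedriver process listening on that port, so every request after the first one is refused. Either re-initialize the driver for each request, or do not call .quit() until the crawl is finished.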
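The second option (keep one driver alive and quit it once at the end) can be sketched roughly as follows. This is not the asker's middleware: `DriverManager` and `FakeDriver` are hypothetical names, the second URL is made up, and the fake driver exists only so the example runs without a browser. In a real Scrapy middleware, `fetch` would be called from `process_request`, and `close` would typically be connected to the `spider_closed` signal in `from_crawler`.

```python
class DriverManager:
    """Owns one browser driver and reuses it for every request."""

    def __init__(self, factory):
        # factory: zero-argument callable that builds a driver, for example
        # a function wrapping uc.Chrome(...) from the question (assumption).
        self._factory = factory
        self._driver = None

    def get(self):
        # Create the driver on first use, or again after close().
        if self._driver is None:
            self._driver = self._factory()
        return self._driver

    def fetch(self, url):
        # Load the page and return its HTML; the driver stays open afterwards.
        driver = self.get()
        driver.get(url)
        return driver.page_source

    def close(self):
        # Call once when the crawl ends, not after each request.
        if self._driver is not None:
            self._driver.quit()
            self._driver = None


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    created = 0

    def __init__(self):
        FakeDriver.created += 1
        self.page_source = ""
        self.closed = False

    def get(self, url):
        if self.closed:
            raise RuntimeError("driver session already closed")
        self.page_source = f"<html>{url}</html>"

    def quit(self):
        self.closed = True


manager = DriverManager(FakeDriver)
first = manager.fetch("https://steamdb.info/graph/")
second = manager.fetch("https://steamdb.info/app/730/")  # hypothetical follow-up URL
manager.close()
```

Both fetches go through the same driver instance, so the follow-up request no longer hits a session that has already been shut down.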
