I keep getting an error when trying to scrape several URLs with Scrapy using a Selenium middleware.
Middleware.py:
import undetected_chromedriver as uc
from scrapy.http import HtmlResponse


class SeleniumMiddleWare(object):
    def __init__(self):
        path = "G:/Downloads/chromedriver.exe"
        options = uc.ChromeOptions()
        options.headless = True
        # Disable image loading to speed up page loads
        chrome_prefs = {}
        options.experimental_options["prefs"] = chrome_prefs
        chrome_prefs["profile.default_content_settings"] = {"images": 2}
        chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}
        self.driver = uc.Chrome(options=options, use_subprocess=True, driver_executable_path=path)

    def process_request(self, request, spider):
        try:
            self.driver.get(request.url)
        except:
            pass
        content = self.driver.page_source
        self.driver.quit()
        return HtmlResponse(request.url, encoding='utf-8', body=content, request=request)

    def process_response(self, request, response, spider):
        return response
Spider.py:
import scrapy


class SeleniumSpider(scrapy.Spider):
    name = 'steamdb'
    #allowed_domains = ['steamdb.info']
    start_urls = ['https://steamdb.info/graph/']

    def parse(self, response):
        table = response.xpath('//*[@id="table-apps"]/tbody')
        rows = table.css('tr[class="app"]')
        #b= a.css('tr [class = "app"]::text')
        #table = b.xpath('//*[@id="table-apps"]/tbody/tr')
        for element in rows:
            # urljoin resolves the relative href against the page URL;
            # "https://steamdb.info".join(href) would insert the base between every character of href
            link = response.urljoin(element.css('::attr(href)').get())
            name = element.css('a ::text')[0].get()
            game_info = {"Link": link, "Name": name}
            yield scrapy.Request(url=link, callback=self.parse_info, cb_kwargs=dict(game_info=game_info))

    def parse_info(self, response, game_info):
        game_info["sales"] = response.xpath('//*[@id="graphs"]/div[5]/div[2]/ul/li[1]/strong/span/text()').getall()
        yield game_info
Note: the scraper works without cb_kwargs, and it works if I only scrape the page in start_urls, but it fails as soon as I issue new requests to other URLs or follow pages.
Error:
2022-07-12 20:53:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://steamdb.info/graph/> (referer: https://steamdb.info/graph/)
2022-07-12 20:53:54 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://localhost:52304/session/99578d3d4f168c77b58a85f67be06927/execute/sync {"script": "return navigator.webdriver", "args": []}
2022-07-12 20:53:54 [urllib3.connectionpool] DEBUG: Resetting dropped connection: localhost
2022-07-12 20:53:56 [urllib3.util.retry] DEBUG: Incremented Retry for (url='/session/99578d3d4f168c77b58a85f67be06927/execute/sync'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
2022-07-12 20:53:56 [urllib3.connectionpool] WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000015E5EB66EC0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/99578d3d4f168c77b58a85f67be06927/execute/sync
2022-07-12 20:53:56 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (2): localhost:52304
2022-07-12 20:53:58 [urllib3.util.retry] DEBUG: Incremented Retry for (url='/session/99578d3d4f168c77b58a85f67be06927/execute/sync'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
2022-07-12 20:53:58 [urllib3.connectionpool] WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000015E5ED6C970>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/99578d3d4f168c77b58a85f67be06927/execute/sync
1 Answer
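"the target machine actively refused it" means the host responded, but the specified port (52304) is closed. Can you check whether it is reachable? A local firewall might be blocking it.
UPD: It looks like you call self.driver.quit() in every process_request. Once quit() runs, the chromedriver process listening on that port is gone, so the next request is refused. Either re-initialize the driver for each request, or don't call .quit() until the crawl is finished.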
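The second option, keeping one driver alive and quitting it only at the end, could look roughly like the sketch below. It is not a drop-in Scrapy middleware: FakeDriver is a hypothetical stand-in for the Selenium driver so the lifecycle can be run without a browser, ReusableDriverMiddleware is an illustrative name, and the second URL is a placeholder.

```python
class FakeDriver:
    """Hypothetical stand-in for a Selenium driver (illustration only)."""

    def __init__(self):
        self.alive = True
        self.page_source = ""

    def get(self, url):
        if not self.alive:
            # A real driver fails at this point with a refused connection,
            # because quit() has already shut down the chromedriver process.
            raise ConnectionRefusedError("driver session already closed")
        self.page_source = f"<html>{url}</html>"

    def quit(self):
        self.alive = False


class ReusableDriverMiddleware:
    """Keeps one driver for the whole crawl instead of quitting per request."""

    def __init__(self, driver_factory):
        self.driver = driver_factory()

    def process_request(self, request_url):
        self.driver.get(request_url)
        # No quit() here: the next request needs the same session.
        return self.driver.page_source

    def spider_closed(self):
        # Called once, after the last request has been handled.
        self.driver.quit()


mw = ReusableDriverMiddleware(FakeDriver)
urls = ["https://steamdb.info/graph/", "https://example.com/next"]  # second URL is a placeholder
pages = [mw.process_request(u) for u in urls]
mw.spider_closed()
```

In a real middleware, process_request would keep the (request, spider) signature and return an HtmlResponse as in the question. One way to get spider_closed called is to define a from_crawler classmethod that connects it with crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed); the handler then receives the spider as an argument.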