I'm trying to get familiar with Scrapy, so I'm working through a Udemy course that is currently covering scrapy_selenium, but the code the instructor uses doesn't work for me and I can't figure out why. The page simply never loads, and when I run the driver without headless mode it just pops up a blank gray screen.
Here is my spider code:
import scrapy
from scrapy_selenium import SeleniumRequest


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_request(self):
        yield SeleniumRequest(
            # url='https://graytecknologies.com',
            url='https://duckduckgo.com',
            wait_time=10,
            screenshot=True,
            callback=self.parse
        )

    ...

    def parse(self, response):
        img = response.meta['screenshot']
        with open('img.png', 'wb') as f:
            f.write(img)
And here are my settings:
# Scrapy settings for silkdeals project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from shutil import which
BOT_NAME = 'silkdeals'
SPIDER_MODULES = ['silkdeals.spiders']
NEWSPIDER_MODULE = 'silkdeals.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'silkdeals (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
# }
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# 'silkdeals.middlewares.SilkdealsSpiderMiddleware': 543,
# }
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# 'silkdeals.middlewares.SilkdealsDownloaderMiddleware': 543,
# }
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
# }
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
# 'silkdeals.pipelines.SilkdealsPipeline': 300,
# }
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
# SELENIUM_DRIVER_EXECUTABLE_PATH = 'C:\Program Files\chromedriver.exe'
# '--headless' if using chrome instead of firefox
SELENIUM_DRIVER_ARGUMENTS = ['-headless']
# SELENIUM_DRIVER_ARGUMENTS = []
Whenever I run the scraper, the output reports that 0 pages were crawled.
Output:
PS C:\Users\carte\OneDrive\Documents\Code\Learn Scrapy\silkdeals> scrapy crawl example
2022-11-07 22:30:51 [scrapy.utils.log] INFO: Scrapy 2.6.3 started (bot: silkdeals)
2022-11-07 22:30:51 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.8.0, Python 3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1n 15 Mar 2022), cryptography 36.0.2, Platform Windows-10-10.0.22000-SP0
2022-11-07 22:30:51 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'silkdeals',
'NEWSPIDER_MODULE': 'silkdeals.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['silkdeals.spiders']}
2022-11-07 22:30:51 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-07 22:30:51 [scrapy.extensions.telnet] INFO: Telnet Password: bd2298d4a98a4500
2022-11-07 22:30:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-11-07 22:30:51 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://localhost:51127/session {"capabilities": {"firstMatch": [{}], "alwaysMatch": {"browserName": "chrome", "pageLoadStrategy": "normal", "goog:chromeOptions": {"extensions": [], "args": ["-headless"]}}}, "desiredCapabilities": {"browserName": "chrome", "pageLoadStrategy": "normal", "goog:chromeOptions": {"extensions": [], "args": ["-headless"]}}}
2022-11-07 22:30:51 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): localhost:51127
DevTools listening on ws://127.0.0.1:51132/devtools/browser/58dd62a5-ba86-4076-9fdb-62e0ddb024a3
2022-11-07 22:30:52 [urllib3.connectionpool] DEBUG: http://localhost:51127 "POST /session HTTP/1.1" 200 789
2022-11-07 22:30:52 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2022-11-07 22:30:52 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy_selenium.SeleniumMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-11-07 22:30:52 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-11-07 22:30:52 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-11-07 22:30:52 [scrapy.core.engine] INFO: Spider opened
2022-11-07 22:30:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-07 22:30:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-11-07 22:30:52 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-07 22:30:52 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://localhost:51127/session/d3a5a343fd3c336f71341866eb2949fd {}
2022-11-07 22:30:52 [urllib3.connectionpool] DEBUG: http://localhost:51127 "DELETE /session/d3a5a343fd3c336f71341866eb2949fd HTTP/1.1" 200 14
2022-11-07 22:30:52 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2022-11-07 22:30:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.001672,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 8, 3, 30, 52, 458675),
'log_count/DEBUG': 8,
'log_count/INFO': 10,
'start_time': datetime.datetime(2022, 11, 8, 3, 30, 52, 457003)}
2022-11-07 22:30:54 [scrapy.core.engine] INFO: Spider closed (finished)
PS C:\Users\carte\OneDrive\Documents\Code\Learn Scrapy\silkdeals>
1 Answer
I forgot to add the "s" when overriding the start_requests() method ..... smh, spelling is my greatest enemy. Scrapy looks the method up by name, so my misspelled start_request() was never called; with start_urls empty, the spider had nothing to schedule and immediately closed as "finished".
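The failure mode can be sketched without Scrapy installed. This is a simplified model of how the framework dispatches the override (the class names here are hypothetical, and real Scrapy yields Request objects, not bare URLs): a method with the wrong name never shadows the base implementation, so the default one runs against an empty start_urls and yields nothing.

```python
class BaseSpider:
    """Stand-in for scrapy.Spider's default start_requests() behavior."""
    start_urls = []

    def start_requests(self):
        # Default: build requests from start_urls (empty here -> yields nothing)
        for url in self.start_urls:
            yield url


class BuggySpider(BaseSpider):
    def start_request(self):  # typo: wrong name, framework never calls this
        yield 'https://duckduckgo.com'


class FixedSpider(BaseSpider):
    def start_requests(self):  # correct name actually overrides the base method
        yield 'https://duckduckgo.com'


# The framework only ever calls start_requests():
print(list(BuggySpider().start_requests()))  # [] -> "0 pages crawled"
print(list(FixedSpider().start_requests()))  # ['https://duckduckgo.com']
```

This is why the log shows the Selenium session opening and closing cleanly with no requests in between: nothing was ever queued.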