Hi, I'm trying to build a scraping bot for a JavaScript-heavy website. I have about 20 URLs from the site and want to scale up to hundreds, and I need the URLs scraped frequently, so I'm trying to use a Lua script to get a "dynamic" wait time. With the default webkit engine, the site's HTML output is just text saying the site doesn't support this browser, which is why I use the chromium engine. Without the Lua script, scraping gives output items only with the chromium engine, but that does work. After I tried using Lua, I got an error with the chromium engine; with webkit it executes without errors but yields no output items. This is my start_requests using Lua:
def start_requests(self):
    lua_script = """
    function main(splash, args)
        assert(splash:go(args.url))
        while not splash:select('div.o-matchRow')
            splash:wait(1)
            print('waiting...')
        end
        return {html=splash:html()}
    end
    """
    for url in self.start_urls:
        yield SplashRequest(
            url=url,
            callback=self.parse,
            endpoint='execute',
            args={'engine': 'chromium', 'lua_source': lua_script}
        )
This is just something simple I wanted to test out. Does anyone know how to handle Lua with the chromium engine, or how I could use webkit when the site claims not to support it? (By the way, sorry for my English, I'm not a native speaker.) These are the errors with the chromium engine:
2023-12-04 21:23:54 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: tipsport_scraper)
2023-12-04 21:23:54 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.4 (tags/v3.11.4:d2340ef, Jun 7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)
], pyOpenSSL 23.3.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.5, Platform Windows-10-10.0.19045-SP0
2023-12-04 21:23:54 [scrapy.addons] INFO: Enabled addons:
[]
2023-12-04 21:23:54 [asyncio] DEBUG: Using selector: SelectSelector
2023-12-04 21:23:54 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-12-04 21:23:54 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2023-12-04 21:23:54 [scrapy.extensions.telnet] INFO: Telnet Password: **************
2023-12-04 21:23:54 [py.warnings] WARNING: C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\extensions\feedexport.py:406: ScrapyDeprecationWarning: The `FEED_URI` and `FEED_FORMAT` settings have been
deprecated in favor of the `FEEDS` setting. Please see the `FEEDS` setting docs for more details
exporter = cls(crawler)
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2023-12-04 21:23:54 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tipsport_scraper',
'CONCURRENT_REQUESTS': 5,
'DOWNLOAD_DELAY': 5,
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'FEED_EXPORT_ENCODING': 'utf-8',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'NEWSPIDER_MODULE': 'tipsport_scraper.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['tipsport_scraper.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled item pipelines:
['tipsport_scraper.pipelines.TipsportScraperPipeline']
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Spider opened
2023-12-04 21:23:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-12-04 21:23:54 [scrapy.extensions.telnet] INFO: Telnet console listening on **********
2023-12-04 21:23:54 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.tipsport.cz/kurzy/fotbal-16?limit=1000 via http://localhost:8050/execute>
Traceback (most recent call last):
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\twisted\internet\defer.py", line 1697, in _inlineCallbacks
result = context.run(gen.send, result)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\core\downloader\middleware.py", line 68, in process_response
method(request=request, response=response, spider=spider)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\middleware.py", line 412, in process_response
response = self._change_response_class(request, response)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\middleware.py", line 433, in _change_response_class
response = response.replace(cls=respcls, request=request)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\http\response\__init__.py", line 125, in replace
return cls(*args, **kwargs)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\response.py", line 120, in __init__
self._load_from_json()
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\response.py", line 174, in _load_from_json
error = self.data['info']['error']
TypeError: string indices must be integers, not 'str'
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Closing spider (finished)
2023-12-04 21:23:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1045,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 255,
'downloader/response_count': 1,
'downloader/response_status_count/400': 1,
'elapsed_time_seconds': 0.233518,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 12, 4, 20, 23, 54, 847285, tzinfo=datetime.timezone.utc),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/execute/request_count': 1,
'splash/execute/response_count/400': 1,
'start_time': datetime.datetime(2023, 12, 4, 20, 23, 54, 613767, tzinfo=datetime.timezone.utc)}
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Spider closed (finished)
I removed the telnet password and an IP address and replaced them with *, in case they are sensitive.
1 Answer
For Chromium: Splash's Chromium engine only supports a subset of the Lua API, so first make sure every call in your script is supported there (splash:select may not be); updating Splash can also help. More importantly, the script itself contains a Lua syntax error: the while loop is missing the "do" keyword (it should read "while not splash:select('div.o-matchRow') do ... end"), and "print" may not be available in Splash's sandboxed Lua. Splash rejects an invalid script with an HTTP 400 response, and that is what surfaces as the TypeError in your log: the stats show a 400 from the /execute endpoint, and scrapy-splash fails while trying to parse the error body of that response.

For WebKit: the site appears to block it based on the User-Agent, so try changing the User-Agent in Scrapy (or in the script) to a more common browser one. Also check that the "div.o-matchRow" element you wait for actually exists on the page. If it does and you still have problems, cap how long the script is allowed to wait so it cannot get stuck.
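Putting those points together, here is a minimal sketch of a corrected script, assuming the webkit engine with a spoofed User-Agent (splash:set_user_agent is part of the Splash Lua API); the loop gains the missing "do" keyword and a bounded retry count, the UA string and the retry/timeout values are illustrative choices, not values from the question:

```python
# Corrected Lua wait loop: note the `do` after the `while` condition,
# and a bounded retry count so the script cannot wait forever.
lua_script = """
function main(splash, args)
    splash:set_user_agent(args.ua)  -- spoof a common browser UA for webkit
    assert(splash:go(args.url))
    local tries = 0
    while not splash:select('div.o-matchRow') and tries < 10 do
        splash:wait(1)              -- give the JS app time to render
        tries = tries + 1
    end
    return {html = splash:html()}
end
"""

# Arguments for the Splash /execute endpoint; 'timeout' caps the total
# render time on the Splash side as a second safety net. These would be
# passed as args=splash_args in the SplashRequest from the question.
splash_args = {
    'lua_source': lua_script,
    'timeout': 30,
    'ua': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
           'AppleWebKit/537.36 (KHTML, like Gecko) '
           'Chrome/120.0 Safari/537.36'),
}
```

If webkit still returns the "browser not supported" page with a spoofed UA, the site is probably detecting the engine another way, and the chromium engine without lua_source (which already worked for you) remains the fallback.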