使用Scrapy shell向网站发送请求时出错

yrdbyhpb  于 2022-12-04  发布在  Shell
关注(0)|答案(2)|浏览(194)

我正在学习 Scrapy framework。我尝试使用Scrapy shell。我尝试从“www.example.com“获取响应https://quotes.toscrape.com/。命令如下-

python -m scrapy shell

在壳里

>> from scrapy import Request
>> req = Request("https://quotes.toscrape.com/")
>> fetch(req)

然后我发现了这样的错误

PS D:\Projects\scrapyLearn\introSpider\introSpider> python -m scrapy shell
2022-11-30 15:04:52 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: introSpider)
2022-11-30 15:04:52 [scrapy.utils.log] INFO: Versions: lxml 4.9.0.0, libxml2 2.9.10, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.0, Twisted 22.10.0, Python 3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Windows-10-10.0.22000-SP0
2022-11-30 15:04:52 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'introSpider',
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'introSpider.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['introSpider.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-11-30 15:04:52 [asyncio] DEBUG: Using selector: SelectSelector
2022-11-30 15:04:52 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-11-30 15:04:52 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop2022-11-30 15:04:52 [scrapy.extensions.telnet] INFO: Telnet Password: 9ec5c326bbb22c54
2022-11-30 15:04:52 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2022-11-30 15:04:52 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-11-30 15:04:52 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-11-30 15:04:52 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-11-30 15:04:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x000002601B1B48D0>
[s]   item       {}
[s]   settings   <scrapy.settings.Settings object at 0x000002601B3EC550>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> from scrapy import Request
>>> req = Request("https://quotes.toscrape.com/")
>>> fetch(req)
2022-11-30 15:05:46 [scrapy.core.engine] INFO: Spider opened
2022-11-30 15:05:47 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
2022-11-30 15:05:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/> (referer: None)
>>> 2022-11-30 15:05:47 [scrapy.core.scraper] ERROR: Spider error processing <GET https://quotes.toscrape.com/> (referer: None)
Traceback (most recent call last):
  File "C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\site-packages\twisted\internet\defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\utils\defer.py", line 285, in f
    return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\utils\defer.py", line 272, in deferred_from_coro
    event_loop = get_asyncio_event_loop_policy().get_event_loop()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\asyncio\events.py", line 677, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'Thread-1 (start)'.
2022-11-30 15:05:47 [py.warnings] WARNING: C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\site-packages\twisted\internet\defer.py:892: RuntimeWarning: coroutine 'SpiderMiddlewareManager.scrape_response.<locals>.process_callback_output' was never awaited
  current.result = callback(  # type: ignore[misc]

而且shell还在运行。我不知道什么是错误。以及如何修复它。
我只是想从“https://quotes.toscrape.com/“网站得到回应。

wgxvkvu9

wgxvkvu91#

如果您使用的是windows。这是由一个错误引起的。
下面是github issue
这与robots.txt文件完全无关。

2g32fytz

2g32fytz2#

我重新创建了相同的步骤,并且没有问题获得页面。我建议您在www.example.com:ROBOTSTXT_OBEY = False中更改此设置settings.py,因为正如您在日志中所看到的,当向不存在的https://quotes.toscrape.com/robots.txt发出第一个请求时,Scrapy会收到404(错误)。
我还建议您直接使用fetch,并将url作为参数,例如:fetch("https://quotes.toscrape.com/")

相关问题