Scrapy: signal only works in main thread

8ehkhllq  posted on 2023-05-17  in Other

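I'm new to Django. I'm trying to run my Scrapy spider from a Django view. The spider works fine when I run it from the command prompt, but it fails when I run it through Django. Error message: signal only works in main thread.
My code in the Django view is below: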

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.crawler import CrawlerProcess
from scrapy import log, signals
from Working.spiders.workSpider import WorkSpider
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings

spider = WorkSpider(domain='scrapinghub.com')
crawler = CrawlerProcess(Settings())
crawler.start()
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()

Please help me solve this. Thanks.

kse8i1jr  #1

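The error basically means that you are not in the main thread, so signal handlers cannot be installed there.
Switching from CrawlerProcess to CrawlerRunner solved it for me (my guess is that with CrawlerRunner you are in the main thread): http://doc.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerRunner
Hope this helps.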
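To see where the message comes from, here is a minimal sketch that uses only the Python standard library. It assumes nothing about Scrapy or Django: it just tries to register a signal handler from a worker thread, which is roughly the situation a crawl ends up in when it is started from a thread that is not the main one. The names `install_handler` and `errors` are made up for this illustration.

```python
import signal
import threading

errors = []  # collects the error text raised inside the worker thread


def install_handler():
    """Try to register a SIGINT handler from whatever thread calls this."""
    try:
        signal.signal(signal.SIGINT, signal.default_int_handler)
    except ValueError as exc:
        # Python only allows signal.signal() in the main thread.
        errors.append(str(exc))


# Run the registration in a worker thread instead of the main thread.
worker = threading.Thread(target=install_handler)
worker.start()
worker.join()

print(errors)  # one message mentioning "main thread"
```

The exact wording of the message differs slightly between Python versions, but it always refers to the main thread. Any code path that avoids registering signal handlers sidesteps the restriction.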

11dmarpk  #2

I found a way to crawl without installing signal handlers:

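from typing import Optional, Type

from scrapy import Spider
from scrapy.crawler import CrawlerProcess


def crawl(spider: Type[Spider], spider_kwargs: Optional[dict] = None):
    spider_kwargs = {} if spider_kwargs is None else spider_kwargs
    crawler = CrawlerProcess()
    crawler.crawl(spider, **spider_kwargs)
    # Schedule the crawl first, then start the reactor exactly once, without signal handlers.
    crawler.start(stop_after_crawl=True, install_signal_handlers=False)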

Usage

from scrapy import Spider

if __name__ == "__main__":

    class BaseSpider(Spider):
        name: str

    crawl(BaseSpider, { "name": "base_spider" })
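If the same `crawl` helper is called both from scripts (main thread) and from web views (often a worker thread), one option is to decide at call time whether signal handlers can be installed. The sketch below only shows the thread check, using the standard library; `in_main_thread` is a hypothetical helper name, and passing its result as `install_signal_handlers=in_main_thread()` is a suggestion, not something taken from the answer above.

```python
import threading


def in_main_thread() -> bool:
    """Return True only when called from the interpreter's main thread."""
    return threading.current_thread() is threading.main_thread()


# Call the check from a worker thread and record what it reports.
results = []
worker = threading.Thread(target=lambda: results.append(in_main_thread()))
worker.start()
worker.join()

print(results)  # the worker thread is not the main thread
```

With this check, signal handlers are skipped only where installing them would fail.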
