Scrapy: "ReactorAlreadyInstalledError" when using TwistedScheduler

zte4gxcn · asked on 2022-11-09

I have the following Python code that starts an APScheduler/TwistedScheduler cron job to launch a spider.
With a single spider this works fine. With two spiders, however, it fails with: twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed.
I did find a related question that suggests CrawlerRunner as the solution. But since I am using a TwistedScheduler object, I don't know how to make that work with multiple cron jobs (multiple add_job() calls).

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler

from myprojectscraper.spiders.my_homepage_spider import MyHomepageSpider
from myprojectscraper.spiders.my_spider import MySpider

process = CrawlerProcess(get_project_settings())

# Start the crawler in a scheduler

scheduler = TwistedScheduler(timezone="Europe/Amsterdam")

# Use a cron job; runs the 'homepage' spider every 4 hours (e.g. 12:10, 16:10, 20:10, etc.)

scheduler.add_job(process.crawl, 'cron', args=[MyHomepageSpider], hour='*/4', minute=10)

# Use a cron job; runs the full spider every week on Monday, Thursday, and Saturday at 04:35

scheduler.add_job(process.crawl, 'cron', args=[MySpider], day_of_week='mon,thu,sat', hour=4, minute=35)
scheduler.start()
process.start(False)  # stop_after_crawl=False keeps the reactor running for the scheduler

svgewumm1#

https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
There's another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won't start or interfere with existing reactors in any way.
If your application is already using Twisted and you want to run Scrapy in the same reactor, it is recommended that you use CrawlerRunner instead of CrawlerProcess.
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
from apscheduler.schedulers.twisted import TwistedScheduler

from myprojectscraper.spiders.my_homepage_spider import MyHomepageSpider
from myprojectscraper.spiders.my_spider import MySpider

configure_logging()

runner = CrawlerRunner(get_project_settings())
scheduler = TwistedScheduler(timezone="Europe/Amsterdam")

# Use a cron job; runs the 'homepage' spider every 4 hours (e.g. 12:10, 16:10, 20:10, etc.)

scheduler.add_job(runner.crawl, 'cron', args=[MyHomepageSpider], hour='*/4', minute=10)

# Use a cron job; runs the full spider every week on Monday, Thursday, and Saturday at 04:35

scheduler.add_job(runner.crawl, 'cron', args=[MySpider], day_of_week='mon,thu,sat', hour=4, minute=35)

# Note: calling runner.join() here would fire its deferred immediately,
# because no crawls are active yet, and stop the reactor before any
# scheduled job runs. Instead, keep the reactor running and let the
# scheduler trigger the crawls.

scheduler.start()
reactor.run()  # blocks here while the scheduler keeps triggering crawls
scheduler.shutdown()
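
If you also want the scheduler to shut down cleanly whenever the reactor stops (for example on Ctrl-C), you can register a shutdown trigger instead of calling scheduler.shutdown() after reactor.run() returns. A minimal sketch, using the scheduler and reactor objects above:

# Run scheduler.shutdown() just before the reactor stops, so an
# interrupt tears both down in the right order.
reactor.addSystemEventTrigger('before', 'shutdown', scheduler.shutdown)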

dxpyg8gm2#

You can try removing the already-installed reactor before starting the process:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler

from myprojectscraper.spiders.my_homepage_spider import MyHomepageSpider
from myprojectscraper.spiders.my_spider import MySpider
import sys  # <-- import sys here

process = CrawlerProcess(get_project_settings())

# Start the crawler in a scheduler

scheduler = TwistedScheduler(timezone="Europe/Amsterdam")

# Use a cron job; runs the 'homepage' spider every 4 hours (e.g. 12:10, 16:10, 20:10, etc.)

scheduler.add_job(process.crawl, 'cron', args=[MyHomepageSpider], hour='*/4', minute=10)

# Use a cron job; runs the full spider every week on Monday, Thursday, and Saturday at 04:35

scheduler.add_job(process.crawl, 'cron', args=[MySpider], day_of_week='mon,thu,sat', hour=4, minute=35)
scheduler.start()

if "twisted.internet.reactor" in sys.modules:
    del sys.modules["twisted.internet.reactor"] """<--- Delete twisted reactor if already installed here """

process.start(False)

This worked for me.
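
For context on why this works: Twisted's installReactor() decides whether a reactor is "already installed" purely by looking at sys.modules. A paraphrased sketch of twisted.internet.main.installReactor() (not a verbatim copy):

def installReactor(reactor):
    # The "already installed" check is just a sys.modules lookup, which is
    # why deleting the entry lets a new reactor be installed afterwards.
    if "twisted.internet.reactor" in sys.modules:
        raise error.ReactorAlreadyInstalledError("reactor already installed")
    twisted.internet.reactor = reactor
    sys.modules["twisted.internet.reactor"] = reactor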


pdsfdshx3#

I am now using a BlockingScheduler combined with a multiprocessing Process that wraps a CrawlerRunner, with logging enabled via configure_logging().

from multiprocessing import Process

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from apscheduler.schedulers.blocking import BlockingScheduler

from myprojectscraper.spiders.my_homepage_spider import MyHomepageSpider
from myprojectscraper.spiders.my_spider import MySpider

from twisted.internet import reactor  # each CrawlerRunnerProcess runs this reactor in its own process

# Create Process around the CrawlerRunner

class CrawlerRunnerProcess(Process):
    def __init__(self, spider):
        Process.__init__(self)
        self.runner = CrawlerRunner(get_project_settings())
        self.spider = spider

    def run(self):
        deferred = self.runner.crawl(self.spider)
        deferred.addBoth(lambda _: reactor.stop())
        reactor.run(installSignalHandlers=False)

# The wrapper to make it run multiple spiders, multiple times

def run_spider(spider):
    crawler = CrawlerRunnerProcess(spider)
    crawler.start()
    crawler.join()

# Enable logging when using CrawlerRunner

configure_logging()

# Start the crawler in a scheduler

scheduler = BlockingScheduler(timezone="Europe/Amsterdam")

# Use a cron job; runs the 'homepage' spider every 4 hours (e.g. 12:10, 16:10, 20:10, etc.)

scheduler.add_job(run_spider, 'cron', args=[MyHomepageSpider], hour='*/4', minute=10)

# Use a cron job; runs the full spider every week on Monday, Thursday, and Saturday at 04:35

scheduler.add_job(run_spider, 'cron', args=[MySpider], day_of_week='mon,thu,sat', hour=4, minute=35)
scheduler.start()

At the very least the script no longer exits immediately (it blocks). I now get the expected output:

2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Adding job tentatively -- it will be properly scheduled when the scheduler starts
2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Adding job tentatively -- it will be properly scheduled when the scheduler starts
2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Added job "run_spider" to job store "default"
2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Added job "run_spider" to job store "default"
2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Scheduler started
2022-03-31 22:50:24 [apscheduler.scheduler] DEBUG: Looking for jobs to run
2022-03-31 22:50:24 [apscheduler.scheduler] DEBUG: Next wakeup is due at 2022-04-01 00:10:00+02:00 (in 4775.280995 seconds)

Because we are using a BlockingScheduler, the scheduler does not exit right away; start() is a blocking call, meaning the scheduler keeps running jobs indefinitely.
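
One refinement worth considering, following the usual APScheduler pattern (a sketch assuming the scheduler object above): wrap the blocking start() call so interrupts shut the scheduler down cleanly:

try:
    scheduler.start()
except (KeyboardInterrupt, SystemExit):
    # Shut the scheduler down cleanly on Ctrl-C or interpreter exit.
    scheduler.shutdown()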


xpcnnkqh4#

For me the solution was to go into Twisted's source code, into the internet folder, and find selectreactor.py.
Then go to def install() at the bottom of the file and add the following directly above installReactor(reactor).
For example:

def install():
    """Configure the twisted mainloop to be run using the select() reactor."""
    reactor = SelectReactor()
    from twisted.internet.main import installReactor
    import sys  # make sure sys is available inside the patched function

    # Tear down any previously installed reactor before installing this one.
    if "twisted.internet.reactor" in sys.modules:
        del sys.modules["twisted.internet.reactor"]

    installReactor(reactor)

__all__ = ["install"]

This tears down any pre-installed reactor and then installs a new one.
This should make the problem go away permanently; I haven't run into any issues with this approach so far.
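
Keep in mind that edits to an installed package are lost on every upgrade of Twisted. A hypothetical equivalent from application code, essentially answer 2# wrapped in a helper (the function name is my own):

import sys

def uninstall_reactor():
    # Drop the reactor's sys.modules entry so the next
    # "from twisted.internet import reactor" installs a fresh one.
    sys.modules.pop("twisted.internet.reactor", None)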
