Scrapy: how to run 2 spiders from the same Python script

Asked by wgeznvg7 on 2022-11-09 in Python

I have two Python crawlers that can run independently:
crawler1.py and crawler2.py
They are part of an analysis I want to run, so I would like to import both into a common script:

from crawler1 import *
from crawler2 import *

Further down in my script I have something like this:

if <condition1>:
    # run crawler1
    runCrawler('crawler1', '/dir1/dir2/')

if <condition2>:
    # run crawler2
    runCrawler('crawler2', '/dir1/dir2/')

where runCrawler is:

from scrapy.crawler import CrawlerProcess as CP

def runCrawler(crawlerName, crawlerFileName):
    print('Running crawler for ' + crawlerName)

    process = CP(
        settings={
            'FEED_URI'   : crawlerFileName,
            'FEED_FORMAT': 'csv'
        }
    )

    # look up the spider class by name and start the Twisted reactor
    process.crawl(globals()[crawlerName])
    process.start()

I get the following error:

Exception has occurred: ReactorAlreadyInstalledError
reactor already installed

The first crawler runs fine; the second one fails. Any ideas?
I am running the code above through the Visual Studio debugger.


mw3dktmi 1#

The best way to do this is as follows. Each CrawlerProcess installs its own Twisted reactor, and only one reactor can ever be installed and started per process, which is why your second runCrawler call raises ReactorAlreadyInstalledError. CrawlerRunner does not touch the reactor itself, so you can schedule both crawls and then start the reactor exactly once. Your code should be:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

# your code

configure_logging()

settings = {
    'FEED_FORMAT': 'csv'
}
runner = CrawlerRunner(settings)

if condition1:
    runner.crawl(spider1, crawlerFileName=crawlerFileName)
if condition2:
    runner.crawl(spider2, crawlerFileName=crawlerFileName)

d = runner.join()                    # Deferred that fires when all crawls finish
d.addBoth(lambda _: reactor.stop())
reactor.run()  # runs both crawlers; the callback above stops the reactor
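
If the two crawls must not run at the same time (for example, because they write to the same place), they can instead be chained so that each one starts only after the previous one finishes. A sketch of the sequential pattern from the Scrapy docs, reusing the runner, spiders and condition flags from above in place of the join()/addBoth lines:

from twisted.internet import defer

@defer.inlineCallbacks
def crawl_sequentially():
    # each yield waits for that crawl to complete before moving on
    if condition1:
        yield runner.crawl(spider1, crawlerFileName=crawlerFileName)
    if condition2:
        yield runner.crawl(spider2, crawlerFileName=crawlerFileName)
    reactor.stop()

crawl_sequentially()
reactor.run()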

Either way, your spider should look like:

import scrapy

class spider1(scrapy.Spider):
    name = "spider1"
    # %(crawlerFileName)s is replaced with the spider attribute of the same
    # name, which Scrapy sets from the crawl() keyword argument; newer Scrapy
    # versions may need a FEED_URI_PARAMS function to expose custom attributes
    custom_settings = {'FEED_URI': '%(crawlerFileName)s'}

    def start_requests(self):
        yield scrapy.Request('https://scrapy.org/')

    def parse(self, response):
        pass
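
With this in place, the crawlerFileName keyword passed to crawl() becomes an attribute on the spider instance, and the %(crawlerFileName)s placeholder in FEED_URI picks it up, so each spider writes to its own file. A hypothetical call (the paths are placeholders):

if condition1:
    runner.crawl(spider1, crawlerFileName='/dir1/dir2/out1.csv')
if condition2:
    runner.crawl(spider2, crawlerFileName='/dir1/dir2/out2.csv')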

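If you would rather keep CrawlerProcess and the runCrawler helper from the question, another common workaround is to launch each crawl in its own child process, so that every run gets a fresh reactor. A minimal sketch, assuming runCrawler and the condition flags are defined as in the question:

from multiprocessing import Process

def runCrawlerIsolated(crawlerName, crawlerFileName):
    # a child process starts with no reactor installed, so the
    # CrawlerProcess inside runCrawler can start one every time
    p = Process(target=runCrawler, args=(crawlerName, crawlerFileName))
    p.start()
    p.join()  # wait for the crawl to finish before continuing

if __name__ == '__main__':  # needed for the spawn start method on Windows
    if condition1:
        runCrawlerIsolated('crawler1', '/dir1/dir2/')
    if condition2:
        runCrawlerIsolated('crawler2', '/dir1/dir2/')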