Unable to run Scrapy from another Python file while passing URLs

yb3bgrhw  posted on 2022-11-09  in  Other

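I am trying to start Scrapy after doing some other work in my main script. Once I have parsed some data I have a list of URLs, and I want to pass them to Scrapy first and then start the crawl.
Main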

urls = {'start_urls': ['google.com']}

spider = LiveSpider(None, **urls)
#spider.run()
process = CrawlerProcess(LiveSpider)
process.crawl(spider)
process.start()

Scrapy implementation

class LiveSpider(scrapy.Spider):

    name = 'live'
    allowed_domains = ['google.com']
    #start_urls = []

    def __init__(self, category=None, *args,**kwargs):
        super(LiveSpider, self).__init__(*args,**kwargs)
        self.start_urls = kwargs.get('start_urls')
        for url in self.start_urls:
            print('urls: ' + url)
        print('Urls done')

    def parse(self, response):
        res = response.text

Log

urls: google.com
Urls done
Traceback (most recent call last):
  File "/home/jocke/PycharmProjects/theca-crawler/read_sitemap.py", line 95, in <module>
    process = CrawlerProcess(LiveSpider)
  File "/home/jocke/PycharmProjects/theca-crawler/venv/lib/python3.10/site-packages/scrapy/crawler.py", line 289, in __init__
    super().__init__(settings)
  File "/home/jocke/PycharmProjects/theca-crawler/venv/lib/python3.10/site-packages/scrapy/crawler.py", line 166, in __init__
    self.spider_loader = self._get_spider_loader(settings)
  File "/home/jocke/PycharmProjects/theca-crawler/venv/lib/python3.10/site-packages/scrapy/crawler.py", line 148, in _get_spider_loader
    cls_path = settings.get('SPIDER_LOADER_CLASS')
AttributeError: type object 'LiveSpider' has no attribute 'get'
nimxete2 #1

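After running a few tests, this is the solution: pass start_urls as a keyword argument to the crawl call, and pass the spider class rather than an instance.
The __init__ function is not needed at all.
Main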


# Define urls list before

process = CrawlerProcess()
process.crawl(LiveSpider, start_urls=product_urls)
process.start(stop_after_crawl=False)

Scrapy spider

class LiveSpider(scrapy.Spider):

    name = 'live'
    allowed_domains = ['google.com']
    start_urls = []

    def parse(self, response):
        res = response.text
        #Do some parsing
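
As to why no __init__ is needed: as far as I can tell, the base scrapy.Spider constructor copies any keyword arguments onto the instance, and process.crawl() forwards its extra keyword arguments to that constructor. Below is a minimal, standard-library-only sketch of that mechanism. SketchSpider and build_spider are hypothetical stand-ins written for illustration, not Scrapy classes, and the URLs are placeholders.

```python
# Stand-in sketch: mimics how keyword arguments end up as spider attributes.
# SketchSpider and build_spider are hypothetical names, not part of Scrapy.


class SketchSpider:
    name = 'live'
    start_urls = []  # class-level default, as in the answer above

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # Every keyword argument becomes an instance attribute,
        # so start_urls=... shadows the empty class-level list.
        self.__dict__.update(kwargs)


def build_spider(spider_cls, *args, **kwargs):
    # Plays the role of process.crawl(LiveSpider, start_urls=...):
    # it receives the class (not an instance) and instantiates it.
    return spider_cls(*args, **kwargs)


# Placeholder URLs (assumption); note that they include the scheme.
product_urls = ['https://example.com/a', 'https://example.com/b']
spider = build_spider(SketchSpider, start_urls=product_urls)
print(spider.start_urls)
```

This probably also explains the traceback in the question: CrawlerProcess(LiveSpider) puts the spider class where CrawlerProcess expects its settings, so the later settings.get('SPIDER_LOADER_CLASS') call is made on the class and fails.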
