I am trying to start Scrapy after doing some other work in my main script. After parsing some data I end up with a list of URLs, and I want to pass them to Scrapy first and then start the crawl process.
Main
urls = {'start_urls': ['google.com']}
spider = LiveSpider(None,**urls)
#spider.run()
process = CrawlerProcess(LiveSpider)
process.crawl(spider)
process.start()
Scrapy implementation
class LiveSpider(scrapy.Spider):
    name = 'live'
    allowed_domains = ['google.com']
    #start_urls = []

    def __init__(self, category=None, *args, **kwargs):
        super(LiveSpider, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get('start_urls')
        for url in self.start_urls:
            print('urls: ' + url)
        print('Urls done')

    def parse(self, response):
        res = response.text
Log
urls: google.com
Urls done
Traceback (most recent call last):
File "/home/jocke/PycharmProjects/theca-crawler/read_sitemap.py", line 95, in <module>
process = CrawlerProcess(LiveSpider)
File "/home/jocke/PycharmProjects/theca-crawler/venv/lib/python3.10/site-packages/scrapy/crawler.py", line 289, in __init__
super().__init__(settings)
File "/home/jocke/PycharmProjects/theca-crawler/venv/lib/python3.10/site-packages/scrapy/crawler.py", line 166, in __init__
self.spider_loader = self._get_spider_loader(settings)
File "/home/jocke/PycharmProjects/theca-crawler/venv/lib/python3.10/site-packages/scrapy/crawler.py", line 148, in _get_spider_loader
cls_path = settings.get('SPIDER_LOADER_CLASS')
AttributeError: type object 'LiveSpider' has no attribute 'get'
1 Answer
nimxete21:
After running a few tests, this is the solution: pass start_urls directly to the crawl call instead of instantiating the spider yourself. CrawlerProcess expects a settings object in its constructor, so passing the LiveSpider class to it is what raises AttributeError: type object 'LiveSpider' has no attribute 'get'.
The __init__ override is not needed at all.
Main
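The answer's main-script code did not survive the page dump, so here is a minimal sketch of what the description implies (passing start_urls as a keyword argument to crawl()); https://google.com is only a placeholder URL:

from scrapy.crawler import CrawlerProcess

# crawl() takes the spider *class*; keyword arguments such as start_urls
# are forwarded to the spider instance, so no custom __init__ is required.
# Note: Scrapy needs full URLs with a scheme, hence the https:// prefix.
process = CrawlerProcess()
process.crawl(LiveSpider, start_urls=['https://google.com'])
process.start()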
Spider
class LiveSpider(scrapy.Spider):
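    # NOTE: the rest of the answer's code was cut off in the page dump; the
    # body below is a minimal reconstruction assuming the spider simply drops
    # the __init__ override and keeps the question's fields unchanged.
    name = 'live'
    allowed_domains = ['google.com']
    # start_urls is supplied via process.crawl(LiveSpider, start_urls=[...]),
    # so there is no need to set it here or in __init__.

    def parse(self, response):
        res = response.text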