Can't export data from Scrapy Spider: start_url is not defined

Asked by iswrvxsc on 2022-11-09 (category: Other)

I'm fairly new to Python and, following DataCamp and YouTube tutorials, I'm trying to run a spider that crawls a site and extracts metadata from its most recent (several thousand) videos.
So far my spider looks like this:

import scrapy

class NaughtySpider(scrapy.Spider):
  name = "naughtyspider"
  allowed_domains = ["example.com"]
  start_url = ("https://www.example.com/video?o=cm")
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = start_url,
                         callback = self.parse_video)
  # First parsing method
  def parse_video(self, response):
    self.log('Finished scraping ' + response.url)
    video_links = response.css('ul#videoCategory').css('li.videoBox').css('div.thumbnail-info-wrapper').css('span.title > a').css('::attr(href)') #Correct path, chooses 32 videos from page ignoring the links coming from ads
    links_to_follow = video_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_metadata)
    #Continue through pagination
    next_page_url = response.css('li.page_next > a.orangeButton::attr(href)').extract_first()
    if next_page_url:
        next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(url=next_page_url, callback=self.parse_video)
  # Second parsing method
  def parse_metadata(self, response):
    # Create a SelectorList of the course titles text
    video_title = response.css('div.title-container > h1.title > span.inlineFree::text')
    # Extract the text and strip it clean
    video_title_ext = video_title.extract_first().strip()
    # Extract views
    video_views = response.css('span.count::text').extract_first()
    # Extract tags
    video_tags = response.css('div.tagsWrapper a::text').extract()
    del video_tags[-1] #Eliminate '+' tag, which is for suggestions
    # Extract Categories
    video_categories = response.css('div.categoriesWrapper a::text').extract()
    del video_categories[-1] #Same as tags
    # Fill in the dictionary
    yield {
        'title': video_title_ext,
        'views': video_views,
        'tags': video_tags,
        'categories': video_categories,
    }

To export the collected data I followed the seemingly simple approach shown in the documentation:

scrapy crawl quotes -o quotes.json

But when I run the equivalent command,

scrapy crawl naughtyspider -o data.csv

I get the following error log:

2019-08-17 22:24:54 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "C:\Users\bla\Anaconda3\lib\site-packages\scrapy\core\engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "C:\Users\bla\naughty\naughty\spiders\NaughtySpider.py", line 11, in start_requests
    yield scrapy.Request(url = start_url,
NameError: name 'start_url' is not defined
2019-08-17 22:24:54 [scrapy.core.engine] INFO: Closing spider (finished)

What's especially frustrating is that the variable is defined just a few lines earlier. I've seen similar cases in other questions, but none of them seem to match the code I'm using exactly.
Thanks in advance, and apologies for any glaring mistakes in the code; the resources out there don't seem beginner-friendly at all (they rarely say which terminal/shell they're using, mostly assume macOS, and so on).

Answer 1 (by 5sxhfpxr)

If you want to reference a class attribute, you need to access it through self:

def start_requests(self):
    yield scrapy.Request(url = self.start_url,
                         callback = self.parse_video)
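
For completeness, here is a minimal sketch of the corrected spider head, keeping the rest of the callbacks as in the question. The conventional Scrapy spelling is a list named start_urls, but either way a class attribute has to be accessed through self (or the class name) inside a method:

import scrapy

class NaughtySpider(scrapy.Spider):
  name = "naughtyspider"
  allowed_domains = ["example.com"]
  # Conventional Scrapy attribute: a *list* of start URLs
  start_urls = ["https://www.example.com/video?o=cm"]

  # Build the initial request(s) from the class attribute
  def start_requests(self):
    for url in self.start_urls:
      yield scrapy.Request(url=url, callback=self.parse_video)

Note that the explicit start_requests override is still needed here (or the first callback has to be renamed to parse), because with start_urls alone Scrapy would send the initial responses to a default parse method, which this spider does not define.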
