Scrapy spider only crawls the last URL, not all of them

Posted by aurhwmvo on 2022-11-23 in Other

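I want to use Scrapy to scrape multiple URLs stored in a CSV file. My code runs (no errors are shown), but it only scrapes the last URL rather than all of them. Please tell me what I am doing wrong. I want to scrape all the URLs and save the scraped text together. I have already tried many of the suggestions I found on Stack Overflow. My code: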

import scrapy
from scrapy import Request
from ..items import personalprojectItem

class ArticleSpider(scrapy.Spider):
    name = 'articles'
    with open('C:\\Users\\Admin\\Documents\\Bhavya\\input_urls.csv') as file:
        for line in file:
            start_urls = line

            def start_requests(self):
                request = Request(url=self.start_urls)
                yield request

        def parse(self, response):
            item = personalprojectItem()
            article = response.css('div p::text').extract()
            item['article'] = article
            yield item

Answer 1 (a14dhokn)

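Here is a minimal example of how to load a list of URLs from a file in a Scrapy project.
In the Scrapy project folder, we have a text file (urls_list.txt) containing the following links, one per line: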

https://www.theguardian.com/technology/2022/nov/18/elon-musk-twitter-engineers-workers-mass-resignation
https://www.theguardian.com/world/2022/nov/18/iranian-protesters-set-fire-to-ayatollah-khomeinis-ancestral-home
https://www.theguardian.com/world/2022/nov/18/canada-safari-park-shooting-animals-two-charged

The spider code looks like this (again, a minimal example):

import scrapy

class GuardianSpider(scrapy.Spider):
    name = 'guardian'
    allowed_domains = ['theguardian.com']
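    # Read one URL per line, dropping the trailing newline and any blank lines.
    with open('urls_list.txt') as f:
        start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):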
        title = response.xpath('//h1/text()').get()
        header = response.xpath('//div[@data-gu-name="standfirst"]//p/text()').get()
        yield {
            'title': title,
            'header': header
        }
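
The same idea can be applied to a CSV file like the one in the question. The following is only a sketch: `load_urls` is a hypothetical helper name, and it assumes the URLs sit in the first column of the file and that anything not starting with http:// or https:// (a header row, for instance) should be skipped.

```python
import csv


def load_urls(path):
    """Return the URLs found in the first column of a CSV file.

    Assumption: one URL per row, in the first column.
    """
    urls = []
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f):
            if not row:
                continue  # blank line in the file
            url = row[0].strip()
            # Keep only cells that look like absolute URLs;
            # this also skips a header row such as "url".
            if url.startswith(('http://', 'https://')):
                urls.append(url)
    return urls
```

In a spider, the result could then be assigned to the class attribute, for example start_urls = load_urls('input_urls.csv'), so that Scrapy schedules one request per URL.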

If we run the spider with scrapy crawl guardian -o guardian_news.json, we get a JSON file like this:

[
{"title": "Elon Musk summons Twitter engineers amid mass resignations and puts up poll on Trump ban", "header": "Reports show nearly 1,200 workers left company after demand for \u2018long hours at high intensity\u2019, while Musk starts poll on whether to reinstate Donald Trump"},
{"title": "Iranian protesters set fire to Ayatollah Khomeini\u2019s ancestral home", "header": "Social media images show what is now a museum commemorating the Islamic Republic founder ablaze as protests continue"},
{"title": "Two Canadian men charged with shooting animals at safari park", "header": "Mathieu Godard and Jeremiah Mathias-Polson accused of breaking into Parc Omega in Quebec and killing three wild boar and an elk"}
]
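
As for why the spider in the question only reaches the last URL: as far as can be told from the code as posted, start_urls = line is a plain assignment inside the loop, so each iteration replaces the previous value and only the final line of the file survives (with its trailing newline). The pure-Python sketch below reproduces that pattern without Scrapy. The class names and the example.com URLs are made up for illustration, and io.StringIO stands in for the CSV file.

```python
import io

# Stand-in for the input file: three made-up URLs, one per line.
fake_file = (
    "https://example.com/a\n"
    "https://example.com/b\n"
    "https://example.com/c\n"
)


class Overwrites:
    # Same pattern as the question: assignment inside the loop.
    for line in io.StringIO(fake_file):
        start_urls = line  # replaced on every iteration


class Collects:
    # Accumulate into a list instead, stripping the trailing newline.
    start_urls = []
    for line in io.StringIO(fake_file):
        start_urls.append(line.strip())


print(repr(Overwrites.start_urls))  # 'https://example.com/c\n'
print(len(Collects.start_urls))     # 3
```

With start_urls holding a list of all the URLs, Scrapy's default start behaviour issues one request per entry, so the custom start_requests method in the question should not be needed.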

You can find the Scrapy documentation here: https://docs.scrapy.org/en/latest/
