How to get a full news article from a website with Scrapy

Posted by guz6ccqo on 2022-11-09 in Other


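I'm still learning web scraping. I'm trying to scrape a website by collecting all the articles from the index page and then extracting their details along with the full text. With the code below I can get everything I need (date, time, category, title) except the full text.
The line 'text': news.css('p.categoryArticle__excerpt::text').get() does not capture all of the text.
Here is the code I have written so far: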

import scrapy

class CoalNewsFromOilPrice(scrapy.Spider):
    name = 'coalnews'
    start_urls = ['https://oilprice.com/Energy/Coal/']

    def parse(self, response):
        for news in response.css('div.categoryArticle__content'):
            yield {
                'datetime': news.css('p.categoryArticle__meta::text').get(),
                'category': news.xpath('//h1[@class="categoryHeading"]/text()').extract()[0].replace('/', '').replace(' ',''),
                'title': news.css('h2.categoryArticle__title::text').get(),
                'text':  news.css('p.categoryArticle__excerpt::text').get(),
            }
        next_page = response.css('a.num').attrib['href']
        if next_page is not None:
            yield response.follow(next_page, callback = self.parse)

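Here is the element I need. When I open the .html URL, it shows the full text, but I still don't know how to get at it. I'd like to extract that .html URL, but I don't know how.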

<div class="categoryArticle__content">

       <a href="https://oilprice.com/Energy/Coal/Russias-Coal-Exports-Are-On-The-Rise-As-EU-Ban-Looms.html">
          <h2 class="categoryArticle__title">Russia’s Coal Exports Are On The Rise As EU Ban Looms</h2>
       </a>
       <p class="categoryArticle__meta">Jul 06, 2022 at 09:41 | Tsvetana Paraskova</p>
       <p class="categoryArticle__excerpt"></p>
        Russian seaborne coal exports are estimated to have increased since Putin’s 
        invasion of Ukraine and the EU announcement it was banning Russian coal imports 
        from August.&nbsp;&nbsp;&nbsp;

                        </div>

What should I do to get the full text of each article?

scrapy

Source: https://stackoverflow.com/questions/72969756/how-to-get-a-full-news-article-from-a-website-with-scrapy

2 Answers


Answer 1 (xmakbtuz)

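The code below extracts the full text and handles pagination through start_urls. I follow each link to its detail page and grab all the required data items there using XPath expressions.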

import scrapy
from scrapy.crawler import CrawlerProcess

class CoalNewsFromOilPrice(scrapy.Spider):
    name = 'coalnews'
    start_urls = ['https://oilprice.com/Energy/Coal/Page-'+str(x)+'.html' for x in range(1,18)]

    def parse(self, response):
        for link in response.xpath('//*[@class="categoryArticle__content"]/a/@href'):
            yield scrapy.Request(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        # Each callback runs on an article detail page, where the full text lives.
        yield {
            'datetime': response.xpath('//*[@class="article_byline"]/text()[2]').get(),
            'category': response.xpath('(//*[@itemprop="name"])[3]/text()').get(),
            'title': response.xpath('//*[@class="singleArticle__content"]/h1/text()').get(),
            # Join the text nodes with spaces so words from adjacent nodes do not run together.
            'text': ' '.join(
                t.strip()
                for t in response.xpath('//*[@id="article-content"]//p//text()').getall()
                if t.strip()
            ),
        }

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(CoalNewsFromOilPrice)
    process.start()
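
The text field above is assembled from the text nodes found under #article-content. As a rough sketch of one way to keep word boundaries and paragraph breaks while doing that, the plain-Python helpers below join the nodes of each paragraph first and collapse whitespace afterwards. The function names normalize_paragraph and build_article_text are hypothetical, and the input is assumed to be one list of strings per paragraph, such as a per-paragraph .getall() call would return.

```python
import re


def normalize_paragraph(fragments):
    """Join the text nodes of one paragraph and collapse whitespace runs.

    Nodes are concatenated without a separator because inline tags such as
    <a> or <b> split a sentence into several nodes that already carry their
    own surrounding spaces.
    """
    joined = "".join(fragments)
    # In Python 3, \s also matches non-breaking spaces (from &nbsp;).
    return re.sub(r"\s+", " ", joined).strip()


def build_article_text(paragraphs):
    """Build the article body from a list of per-paragraph fragment lists."""
    cleaned = (normalize_paragraph(fragments) for fragments in paragraphs)
    # Skip paragraphs that are empty after cleaning.
    return "\n\n".join(text for text in cleaned if text)
```

Inside a spider callback this could be fed with something like [p.xpath('.//text()').getall() for p in response.xpath('//*[@id="article-content"]//p')], assuming the article body is made up of p elements.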


Answer 2 (ssgvzors)

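Success in web scraping comes down to understanding the HTML structure of the target page and writing the correct selectors.
Just as a human would navigate the links and read the content, the code should follow each link on the starting URL and then find the right element to fetch.
It looks like each article page has a div with the id article-content (selector #article-content), which should give you the full text.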
