Scraping articles with Scrapy, but the result is empty

csga3l58 posted on 2022-11-09 in Other

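I am trying to scrape every article on a website to get its full text, date, and title. I use XPath to extract the information I need. I tried to write the XPath expressions very carefully, but when I run the code it returns no results at all.
Error message: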

result = xpathev(query, namespaces=nsp,
  File "src/lxml/etree.pyx", line 1582, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Invalid expression

As far as I understand, this message means there is something wrong with the XPath expression.
Here is the code I wrote:

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess

class barchart(scrapy.Spider):
    name = 'barchart'
    start_urls = ['https://www.barchart.com/news/commodities/energy']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        for link in response.xpath('//*[@class="stories-list"]//*[@class=["story clearfix "]/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@class="field__item"]/time/text()').extract(),
            'title': response.xpath('//*[@class="article-header-wrapper"]//h1//text()').get(),
            'text':''.join([x.get().strip() for x in response.xpath('//*[@class="article-content ng-binding ng-scope"]//p//text()')])
        }

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(barchart)
    process.start()

How should I write the XPath expressions so that this web scraper captures all the information I need?
Thank you very much for your help.

7fyelxc5 #1

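After a few small changes to the initial XPath expression, I was able to get all the links from the first page. The article pages themselves, however, appear to be rendered differently, possibly with Angular, so for those I ended up using the scrapy-selenium extension.
With the following configuration I was able to get results.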

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess

from scrapy_selenium import SeleniumRequest

class barchart(scrapy.Spider):
    name = 'barchart'
    start_urls = ['https://www.barchart.com/news/commodities/energy']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 10,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
        'SELENIUM_DRIVER_NAME' : "chrome",
        'SELENIUM_DRIVER_EXECUTABLE_PATH' : "chromedriver.exe",
        'SELENIUM_DRIVER_ARGUMENTS' : [],
        "DOWNLOADER_MIDDLEWARES" : {
            'scrapy_selenium.SeleniumMiddleware': 800
        }
    }

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

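    def parse(self, response):
        # Each story on the listing page sits in a div whose class contains "story clearfix ".
        sections = response.xpath('//div[contains(@class,"story clearfix ")]')
        for section in sections:
            link = section.xpath('.//a[contains(@class,"story-link")]/@href').get()
            if not link:
                continue  # skip story blocks that have no link
            # Article pages are rendered client-side, so fetch them through Selenium.
            # urljoin leaves absolute URLs unchanged and resolves relative ones.
            yield SeleniumRequest(url=response.urljoin(link), callback=self.parse_item, wait_time=10)

    def parse_item(self, response):
        # get(default='') avoids an AttributeError on .strip() when a node is missing.
        item = {
            'date': response.xpath('//div[@class="article-meta"]/span[contains(@class,"article-published")]/text()').get(default='').strip(),
            'title': response.xpath('//h1[contains(@class,"article-title")]/text()').get(default='').strip(),
            'text': ''.join([x.get().strip() for x in response.xpath('//div[contains(@class,"article-content")]//p/text()')])
        }
        yield item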

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(barchart)
    process.start()
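
To make the "Invalid expression" error easier to see in isolation, here is a minimal sketch that evaluates the XPath from the question and a corrected variant directly with lxml, outside of Scrapy. The `SAMPLE` markup and its `href` values are made up for illustration and only imitate the class names used above; the real page may be structured differently. The likely culprit in the original expression is the `[` placed before the string literal in `@class=["story clearfix "]`, which XPath does not allow.

```python
from urllib.parse import urljoin

from lxml import etree, html

# Hypothetical stand-in for the listing page; the real markup may differ.
SAMPLE = """
<div class="stories-list">
  <div class="story clearfix ">
    <a class="story-link" href="/story/news/1/first">First</a>
  </div>
  <div class="story clearfix ">
    <a class="story-link" href="/story/news/2/second">Second</a>
  </div>
</div>
"""

tree = html.fromstring(SAMPLE)

# Expression from the question: a "[" before a string literal is not valid XPath.
bad = '//*[@class="stories-list"]//*[@class=["story clearfix "]/a/@href'
try:
    tree.xpath(bad)
    bad_is_valid = True
except etree.XPathError:  # base class of XPathEvalError and XPathSyntaxError
    bad_is_valid = False

# Same intent, with balanced predicates and contains() for the class match.
good = '//*[@class="stories-list"]//*[contains(@class, "story clearfix")]/a/@href'
links = tree.xpath(good)

# hrefs may be relative, so resolve them against the listing URL before requesting them.
base_url = 'https://www.barchart.com/news/commodities/energy'
absolute = [urljoin(base_url, link) for link in links]

print(bad_is_valid)
print(absolute)
```

Using `contains(@class, ...)` instead of an exact `@class="..."` comparison also makes the selector less sensitive to trailing spaces or extra classes that a framework may add at render time.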
