I am trying to scrape all the articles on a site to get the full text, the date, and the title. I use XPath to extract the information I need. I tried to write the XPath expressions very carefully, but when I run the code it returns no results at all.
Error message:
result = xpathev(query, namespaces=nsp,
File "src/lxml/etree.pyx", line 1582, in lxml.etree._Element.xpath
File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Invalid expression
As far as I understand, this message means there is something wrong with the XPath.
Here is the code I wrote:
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess


class barchart(scrapy.Spider):
    name = 'barchart'
    start_urls = ['https://www.barchart.com/news/commodities/energy']

    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        for link in response.xpath('//*[@class="stories-list"]//*[@class=["story clearfix "]/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@class="field__item"]/time/text()').extract(),
            'title': response.xpath('//*[@class="article-header-wrapper"]//h1//text()').get(),
            'text': ''.join([x.get().strip() for x in response.xpath('//*[@class="article-content ng-binding ng-scope"]//p//text()')])
        }


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(barchart)
    process.start()
How should I write the XPath expressions so that this scraper captures all the information it needs?
Thanks a lot for your help.
1 Answer
After a few small changes to the initial XPath expression I was able to get all of the links from the first page, but the inner article pages themselves seem to be rendered differently, probably with Angular, so for those I ended up using the scrapy-selenium extension.
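One plausible minimal fix for the listing-page expression is shown below; the stray "[" in the second predicate is what makes lxml raise "Invalid expression". The class values are taken from the question's own code and have not been re-verified against the live page:

    # original (invalid): //*[@class="stories-list"]//*[@class=["story clearfix "]/a/@href
    # corrected, using contains() to be tolerant of the trailing space in the class value
    response.xpath('//*[@class="stories-list"]//*[contains(@class, "story clearfix")]/a/@href')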
With that configuration I was able to get results.
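A minimal sketch of what such a setup could look like, assuming the scrapy-selenium package and a local chromedriver. The driver path, the spider name, the wait time, and the item selectors below are assumptions for illustration rather than values confirmed by the original answer:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    from scrapy_selenium import SeleniumRequest


    class BarchartSeleniumSpider(scrapy.Spider):
        # hypothetical spider name; the original answer did not include its code
        name = 'barchart_selenium'
        start_urls = ['https://www.barchart.com/news/commodities/energy']

        custom_settings = {
            'DOWNLOAD_DELAY': 1,
            # scrapy-selenium settings; the chromedriver path is an assumption
            'SELENIUM_DRIVER_NAME': 'chrome',
            'SELENIUM_DRIVER_EXECUTABLE_PATH': './chromedriver',
            'SELENIUM_DRIVER_ARGUMENTS': ['--headless'],
            'DOWNLOADER_MIDDLEWARES': {'scrapy_selenium.SeleniumMiddleware': 800},
        }

        def start_requests(self):
            for url in self.start_urls:
                # render the listing page through Selenium, since it is populated by JavaScript
                yield SeleniumRequest(url=url, callback=self.parse, wait_time=10)

        def parse(self, response):
            # corrected link XPath from the question, with the stray bracket removed
            for link in response.xpath('//*[@class="stories-list"]//*[contains(@class, "story clearfix")]/a/@href'):
                yield SeleniumRequest(
                    url=response.urljoin(link.get()),
                    callback=self.parse_item,
                    wait_time=10,
                )

        def parse_item(self, response):
            # these selectors mirror the ones in the question and are assumptions
            # about the Angular-rendered article markup
            yield {
                'date': response.xpath('//*[@class="field__item"]/time/text()').get(),
                'title': response.xpath('//*[@class="article-header-wrapper"]//h1//text()').get(),
                'text': ''.join(
                    x.get().strip()
                    for x in response.xpath('//*[contains(@class, "article-content")]//p//text()')
                ),
            }


    if __name__ == '__main__':
        process = CrawlerProcess()
        process.crawl(BarchartSeleniumSpider)
        process.start()

Routing both the listing page and the article pages through SeleniumRequest keeps the spider structure from the question intact while letting the browser execute the Angular rendering before the selectors run.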