I want to scrape the article feed on sitepoint.com. Here is my code:
import scrapy
from urllib.parse import urljoin


class SitepointSpider(scrapy.Spider):
    # TODO: Add url tags (like /javascript) to the spider based on class parameters
    name = "sitepoint"
    allowed_domains = ["sitepoint.com"]
    start_urls = ["http://sitepoint.com/javascript/"]

    def parse(self, response):
        data = []
        for article in response.css("article"):
            title = article.css("a.t12xxw3g::text").get()
            href = article.css("a.t12xxw3g::attr(href)").get()
            img = article.css("img.f13hvvvv::attr(src)").get()
            time = article.css("time::text").get()
            url = urljoin("https://sitepoint.com", href)
            text = scrapy.Request(url, callback=self.parse_article)
            data.append(
                {"title": title, "href": href, "img": img, "time": time, "text": text}
            )
        yield data

    def parse_article(self, response):
        text = response.xpath(
            '//*[@id="main-content"]/article/div/div/div[1]/section/text()'
        ).extract()
        yield text
This is the response I get:
[{'title': 'How to Build an MVP with React and Firebase',
'href': '/react-firebase-build-mvp/',
'img': 'https://uploads.sitepoint.com/wp-content/uploads/2021/09/1632802723react-firebase-mvp-app.jpg',
'time': 'September 28, 2021',
'text': <GET https://sitepoint.com/react-firebase-build-mvp/>}]
It just doesn't scrape the article URLs. I followed everything said in X1 E0 F1 X, but I still can't make it work.
1 Answer
To scrape the article text, you have to visit each detail page; the listing page alone is not enough. That means that in parse you must first yield a Request for each article URL, and only yield the item data from the final callback. As written, text = scrapy.Request(...) merely stores the unscheduled Request object in the dict, which is why the output shows <GET https://sitepoint.com/react-firebase-build-mvp/> instead of the article text.

Also, the XPath

//*[@id="main-content"]/article/div/div/div[1]/section/text()

will not return any text, because the section tag contains many nested HTML elements. One solution is to scrape all of the HTML inside the section tag and clean it up later to extract the article text.