Scrapy,请求返回〈GET url>而不抓取任何内容

mklgxw1f  于 2022-11-09  发布在  其他
关注(0)|答案(1)|浏览(121)

我想刮一下www.example.com的提要sitepoint.com,这是我的代码:

import scrapy
from urllib.parse import urljoin

class SitepointSpider(scrapy.Spider):
    # TODO: Add url tags (like /javascript) to the spider based on class paraneters
    name = "sitepoint"
    allowed_domains = ["sitepoint.com"]
    start_urls = ["http://sitepoint.com/javascript/"]

    def parse(self, response):
        data = []
        for article in response.css("article"):
            title = article.css("a.t12xxw3g::text").get()
            href = article.css("a.t12xxw3g::attr(href)").get()
            img = article.css("img.f13hvvvv::attr(src)").get()
            time = article.css("time::text").get()
            url = urljoin("https://sitepoint.com", href)
            text = scrapy.Request(url, callback=self.parse_article)

            data.append(
                {"title": title, "href": href, "img": img, "time": time, "text": text}
            )
        yield data

    def parse_article(self, response):
        text = response.xpath(
           '//*[@id="main-content"]/article/div/div/div[1]/section/text()'
        ).extract()
        yield text

这是我得到的回应:-

[{'title': 'How to Build an MVP with React and Firebase', 
'href': '/react-firebase-build-mvp/', 
'img': 'https://uploads.sitepoint.com/wp-content/uploads/2021/09/1632802723react-firebase-mvp- 
app.jpg', 
'time': 'September 28, 2021', 
'text': <GET https://sitepoint.com/react-firebase-build-mvp/>}]

它只是不刮URL。我遵循了X1 E0 F1 X中所说的一切,但仍然不能使它工作。

euoag5mw

euoag5mw1#

你必须访问详细页面,从上市刮的文章。
在这种情况下,您必须首先生成URL,然后在最后一个spider中生成数据
此外,//*[@id="main-content"]/article/div/div/div[1]/section/text()不会返回任何文本,因为在section标记下有很多HTML元素
一种解决方案是,您可以刮取section标记中的所有HTML元素,并在以后清理它们以获取文章文本数据
以下是完整的工作代码

import re

import scrapy
from urllib.parse import urljoin

class SitepointSpider(scrapy.Spider):
    # TODO: Add url tags (like /javascript) to the spider based on class paraneters
    name = "sitepoint"
    allowed_domains = ["sitepoint.com"]
    start_urls = ["http://sitepoint.com/javascript/"]

    def clean_text(self, raw_html):
        """
        :param raw_html: this will take raw html code
        :return: text without html tags
        """
        cleaner = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
        return re.sub(cleaner, '', raw_html)

    def parse(self, response):
        for article in response.css("article"):
            title = article.css("a.t12xxw3g::text").get()
            href = article.css("a.t12xxw3g::attr(href)").get()
            img = article.css("img.f13hvvvv::attr(src)").get()
            time = article.css("time::text").get()
            url = urljoin("https://sitepoint.com", href)
            yield scrapy.Request(url, callback=self.parse_article, meta={"title": title,
                                                                         "href": href,
                                                                         "img": img,
                                                                         "time": time})

    def parse_article(self, response):
        title = response.request.meta["title"]
        href = response.request.meta["href"]
        img = response.request.meta["img"]
        time = response.request.meta["time"]
        all_data = {}
        article_html = response.xpath('//*[@id="main-content"]/article/div/div/div[1]/section').get()
        all_data["title"] = title
        all_data["href"] = href
        all_data["img"] = img
        all_data["time"] = time
        all_data["text"] = self.clean_text(article_html)

        yield all_data

相关问题