Scrapy not scraping

Posted by dluptydi on 2021-09-08 in Java


I am trying to scrape news from a website, but the spider I wrote does not scrape anything when it runs. The log shows: INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min).
Here is my code:

import scrapy
from ..items import AoscraperItem

items = AoscraperItem()

class AoSpider(scrapy.Spider):
    name = "ao_spider"

    def start_requests(self):
        yield scrapy.Request(url="https://mothership.sg/", callback=self.parse)

    def parse(self, response,**kwargs):
        article_links = response.xpath("//div[@class='ind-article']/a/@href")
        article_links_ext = article_links.extract()

        for url in article_links_ext:
            yield response.follow(url=url, callback=self.parse_article)

    def parse_article(self, response):
        title = response.xpath("//h1/text()").get()
        # author_date = response.xpath("//div[@class='article-info ao-link-news']/span")
        author = response.xpath("//span[@class='author-name']/text()").get()
        date = response.xpath("//span[@class='publish-date']/text()").get()

        items["title"] = title
        items["author"] = author
        items["date"] = date

        yield items

I don't understand why it doesn't scrape anything from the website.
If anyone could help, I would really appreciate it.

python web-scraping scrapy

Source: https://stackoverflow.com/questions/68312960/scrapy-not-scraping

2 Answers


w6mmgewl1#

The XPath that extracts the links in your parse function is incorrect. It should be article_links = response.xpath("//div[contains(@class,'ind-article')]/a/@href"), or you can use the modified code below.
Code

import scrapy
from ..items import AoscraperItem

items = AoscraperItem()

class AoSpider(scrapy.Spider):
    name = "ao_spider"

    def start_requests(self):
        yield scrapy.Request(url="https://mothership.sg/", callback=self.parse)

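    def parse(self, response, **kwargs):
        # contains() still matches when the div carries classes besides 'ind-article'
        article_links = response.xpath("//div[contains(@class,'ind-article')]/a/@href")
        article_links_ext = article_links.extract()

        for url in article_links_ext:
            # dont_filter=True bypasses Scrapy's duplicate-request filter
            yield response.follow(url=url, callback=self.parse_article, dont_filter=True)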

    def parse_article(self, response):
        title = response.xpath("//h1/text()").get()
        # author_date = response.xpath("//div[@class='article-info ao-link-news']/span")
        author = response.xpath("//span[@class='author-name']/text()").get()
        date = response.xpath("//span[@class='publish-date']/text()").get()

        items["title"] = title
        items["author"] = author
        items["date"] = date

        yield items
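
To illustrate why the exact comparison @class='ind-article' can come back empty while a containment check does not, here is a minimal sketch. The markup is hypothetical (the real structure of the page is not shown in this thread), and it uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors so that it runs without a crawl. The helper has_class is an illustrative name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the first div carries a second class next to "ind-article".
HTML = """
<body>
  <div class="ind-article top-story"><a href="/2021/07/story-one/">One</a></div>
  <div class="ind-article"><a href="/2021/07/story-two/">Two</a></div>
  <div class="sidebar"><a href="/about/">About</a></div>
</body>
""".strip()

root = ET.fromstring(HTML)

# Exact comparison: matches only when the whole class attribute equals "ind-article".
exact = [a.get("href") for a in root.findall(".//div[@class='ind-article']/a")]


def has_class(element, name):
    """Return True if `name` is one of the space-separated classes of `element`."""
    return name in element.get("class", "").split()


# Token-based check, similar in spirit to contains(@class, 'ind-article').
by_token = [
    a.get("href")
    for div in root.iter("div")
    if has_class(div, "ind-article")
    for a in div.findall("a")
]

print(exact)
print(by_token)
```

Note that XPath's contains() is a substring test, so it would also match a class such as ind-article-wide. If that matters, the usual token-safe XPath form is contains(concat(' ', normalize-space(@class), ' '), ' ind-article ').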

2021-09-08

ttcibm8c2#

Hopefully this works for you.

import scrapy
from ..items import AoscraperItem

items = AoscraperItem()

class AoSpider(scrapy.Spider):
    name = "ao_spider"

    def start_requests(self):
        yield scrapy.Request(url="https://mothership.sg/", callback=self.parse)

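    def parse(self, response, **kwargs):
        # Select the article links through the container with id "latest-news"
        article_links = response.xpath('//*[@id="latest-news"]/div/a/@href')
        article_links_ext = article_links.extract()

        for url in article_links_ext:
            # dont_filter=True bypasses Scrapy's duplicate-request filter
            yield response.follow(url=url, callback=self.parse_article, dont_filter=True)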

    def parse_article(self, response):
        title = response.xpath("//h1/text()").get()
        # author_date = response.xpath("//div[@class='article-info ao-link-news']/span")
        author = response.xpath("//span[@class='author-name']/text()").get()
        # (...)[2] takes the second matching text node in document order
        date = response.xpath('(//*[@class="publish-date"]/text())[2]').get()

        items["title"] = title
        items["author"] = author
        items["date"] = date

        yield items
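
A side note on the spiders in this thread: they all fill a single module-level items object and yield it from every callback. The sketch below shows the aliasing this can cause, with plain dicts standing in for AoscraperItem. The function names are hypothetical. Whether this actually bites in a given crawl depends on how quickly each item is consumed, so treat it as a precaution, not as a diagnosis of the original problem.

```python
def parse_shared(pages, item):
    """Mimic the spiders above: every callback writes into one shared object."""
    for title in pages:
        item["title"] = title
        yield item  # the same object is yielded each time


def parse_fresh(pages):
    """Build a new object per page so earlier results are never overwritten."""
    for title in pages:
        yield {"title": title}


pages = ["First", "Second"]

# Collecting the results first exposes the aliasing: both entries are one object.
shared_results = list(parse_shared(pages, {}))
fresh_results = list(parse_fresh(pages))

print(shared_results)
print(fresh_results)
```

In the spider itself, the equivalent change would be to create the item inside parse_article (for example item = AoscraperItem()) instead of at module level.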

2021-09-08
