scrapy - How to also scrape the link from the previous function with Scrapy

Asked by pbwdgjma on 2022-11-09

I have this code to scrape a website. The `parse` function yields the full link to each complete news article, and `parse_item` returns three items: the date, the title, and the full text from that URL.
How can I also scrape and save the link obtained in `parse`, so the code returns four items: date, title, text, and link?
Here is the code:

import scrapy
from scrapy.crawler import CrawlerProcess

class weeklymining(scrapy.Spider):
    name = 'weeklymining'
    start_urls = ['https://www.miningweekly.com/page/coal/page:'+str(x) for x in range(0,351)]

    def parse(self, response):
        for link in response.xpath('//*[@class="en-serif"]/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@class="article_title"]//p/span[1]/text()').extract(),
            'title': response.xpath('//*[@id="article_headline"]/text()').get(),
            'text':''.join([x.get().strip() for x in response.xpath('//*[@id="article_content_container"]//p//text()')])
            }
if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(weeklymining)
    process.start()

Any help would be appreciated.

omqzjyyz (answer 1)

Just add `response.url` to the yielded item.
For example:

import scrapy
from scrapy.crawler import CrawlerProcess

class weeklymining(scrapy.Spider):
    name = 'weeklymining'
    start_urls = ['https://www.miningweekly.com/page/coal/page:'+str(x) for x in range(0,351)]

    def parse(self, response):
        for link in response.xpath('//*[@class="en-serif"]/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@class="article_title"]//p/span[1]/text()').extract(),
            'title': response.xpath('//*[@id="article_headline"]/text()').get(),
            'text':''.join([x.get().strip() for x in response.xpath('//*[@id="article_content_container"]//p//text()')]),
            'link': response.url   # <-- added this
            }
if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(weeklymining)
    process.start()
