I'm trying to extract some fields from the start_url page, and I also want to add a field for the PDF link found on each of the scraped URLs. I tried this with Scrapy but had no luck adding the PDF field. Here is my code:
import scrapy

class MybookSpider(scrapy.Spider):
    name = 'mybooks'
    allowed_domains = ['gln.kemdikbud.go.id']
    start_urls = ['https://gln.kemdikbud.go.id/glnsite/category/modul-gls/page/1/']

    def parse(self, response):
        # gathering all links
        book_urls = response.xpath("//div[@class='td-module-thumb']//a/@href").getall()
        total_url = len(book_urls)

        i = 0
        for a in range(total_url):
            title = response.xpath("//h3[@class='entry-title td-module-title']//a/text()")[i].extract()
            url_source = response.xpath("//div[@class='td-module-thumb']//a/@href")[i].get()
            thumbnail = response.xpath('//*[@class="td-block-span4"]//*[has-class("entry-thumb")]//@src')[i].extract()
            pdf = scrapy.Request(book_urls[i], self.find_details)

            yield {
                'Book Title': title,
                'URL': url_source,
                'Mini IMG': thumbnail,
                'PDF Link': pdf
            }
            i += 1

    def find_details(self, response):
        # find PDF link
        pdf = response.xpath("//div[@class='td-post-content']//a/@href").get()
        return pdf
How do I add the PDF link field correctly so it shows up when I export the items to CSV?
1 Answer
The problem is that your `pdf` variable is a Request object, not a link. Scrapy is asynchronous, so you can't get a return value back from a callback like that. Instead, yield the request for the detail page and pass the fields you have already scraped to the callback with `cb_kwargs`, then yield the finished item from the callback.
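A minimal sketch of that approach, reusing the selectors from your spider; pairing the three list-page selectors with zip is an assumption about how the fields line up on the page:

import scrapy

class MybookSpider(scrapy.Spider):
    name = 'mybooks'
    allowed_domains = ['gln.kemdikbud.go.id']
    start_urls = ['https://gln.kemdikbud.go.id/glnsite/category/modul-gls/page/1/']

    def parse(self, response):
        # collect the list-page fields once, then request each detail page
        book_urls = response.xpath("//div[@class='td-module-thumb']//a/@href").getall()
        titles = response.xpath("//h3[@class='entry-title td-module-title']//a/text()").getall()
        thumbnails = response.xpath('//*[@class="td-block-span4"]//*[has-class("entry-thumb")]//@src').getall()

        for url, title, thumbnail in zip(book_urls, titles, thumbnails):
            # carry the already-scraped fields into the callback via cb_kwargs;
            # the item is yielded only after the PDF link is known
            yield scrapy.Request(
                url,
                callback=self.find_details,
                cb_kwargs={'title': title, 'url_source': url, 'thumbnail': thumbnail},
            )

    def find_details(self, response, title, url_source, thumbnail):
        # find the PDF link on the detail page and emit the complete item
        pdf = response.xpath("//div[@class='td-post-content']//a/@href").get()
        yield {
            'Book Title': title,
            'URL': url_source,
            'Mini IMG': thumbnail,
            'PDF Link': pdf,
        }

Because the item is now yielded from `find_details`, every row already contains the PDF link when you export with `scrapy crawl mybooks -o books.csv`.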