I want to store the image in an Excel sheet (CSV), but what I get is "data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" instead of the image URL.
import pandas as pd
import scrapy


class NewsSpider(scrapy.Spider):
    name = "articles"

    def start_requests(self):
        url = input("Enter the article url: ")
        yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        Feature_Image = response.xpath('//*[@id="article-wrapper"]/article/section[2]/div/div/div/img//@src').get()
        Feature_Image = response.urljoin(Feature_Image)
        yield {
            'Publication Date': Published_Date,
            'Feature_Image': Feature_Image,
            'Article Content': Content
        }
        # =============== Data Store +++++++++++++++++++++
        Data = [[Category, Headlines, Author, Source, Published_Date, Feature_Image, Content, url]]
        try:
            df = pd.DataFrame(Data, columns=['Category', 'Headlines', 'Author', 'Source', 'Published_Date', 'Feature_Image', 'Content', 'URL'])
            print(df)
            with open('C:/Users/Public/pagedata.csv', 'a') as f:
                df.to_csv(f, header=False)
        except:
            df = pd.DataFrame(Data, columns=['Category', 'Headlines', 'Author', 'Source', 'Published_Date', 'Feature_Image', 'Content', 'URL'])
            print(df)
            df.to_csv('C:/Users/Public/pagedata.csv', mode='a')
1 Answer
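1. The image URL is already absolute, so there is no need to pass it through urljoin() to make it absolute again; the call is redundant.
2. Your XPath expression only needs to select a single image, so remove the extra forward slash before @src (img/@src rather than img//@src).
3. You are not getting the right image URL because @src holds the placeholder that appears in your output (the base64 data URI), while the original image URL is stored in the @data-src attribute.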
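The urljoin() point can be checked without Scrapy. This is a minimal sketch, assuming response.urljoin() resolves against the page URL the same way the standard library's urllib.parse.urljoin does; the example.com URLs are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only as the base for joining.
PAGE_URL = "https://example.com/news/some-article"

# The placeholder value from the question, an already-absolute URL,
# and a relative URL for comparison (the last two are invented).
PLACEHOLDER = "data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="
ABSOLUTE = "https://cdn.example.com/images/feature.jpg"
RELATIVE = "/images/feature.jpg"

# A data: URI and an absolute https URL both come back unchanged.
joined_placeholder = urljoin(PAGE_URL, PLACEHOLDER)
joined_absolute = urljoin(PAGE_URL, ABSOLUTE)

# Only a relative URL is actually resolved against the page URL.
joined_relative = urljoin(PAGE_URL, RELATIVE)

print(joined_placeholder == PLACEHOLDER)
print(joined_absolute)
print(joined_relative)
```

Under that assumption, joining does no harm to an absolute URL, but it also cannot turn the base64 placeholder into the real image address; that has to come from selecting a different attribute.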
Try this:
Output: