Scrapy: I want to store the image in an Excel sheet (CSV), but it gives me data:image/

i7uaboj4  posted on 2022-11-09  in Other

I want to store the image in an Excel sheet (CSV), but I get "" instead of the image URL.

class NewsSpider(scrapy.Spider):
    name = "articles"

    def start_requests(self):
        url = input("Enter the article url: ")

        yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):

        Feature_Image =response.xpath('//*[@id="article-wrapper"]/article/section[2]/div/div/div/img//@src').get()
        Feature_Image = response.urljoin(Feature_Image)

        yield{
        'Publication Date': Published_Date,
        'Feature_Image': Feature_Image,
        'Article Content': Content
        }
            # =============== Data Store +++++++++++++++++++++
        Data = [[Category,Headlines,Author,Source,Published_Date,Feature_Image,Content,url]]
        try:
            df = pd.DataFrame (Data, columns = ['Category','Headlines','Author','Source','Published_Date','Feature_Image','Content','URL'])
            print(df)
            with open('C:/Users/Public/pagedata.csv', 'a') as f:
                df.to_csv(f, header=False)
        except:
            df = pd.DataFrame (Data, columns = ['Category','Headlines','Author','Source','Published_Date','Feature_Image','Content','URL'])
            print(df)
            df.to_csv('C:/Users/Public/pagedata.csv', mode='a')
Answer #1 (2j4z5cfb)

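1. The image URL is already absolute, so there is no need to call urljoin() to make it absolute again; that call is the main reason the original image URL is not being captured.
2. Your XPath expression selects only a single image, so remove the extra forward slash before @src (img/@src rather than img//@src).
3. You are not getting the correct image URL because @src holds the value that ends up in your output, whereas the original image URL is stored in the @data-src attribute.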
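
Point 3 can be checked offline. The following is only a minimal sketch: the page's real HTML is not shown in the question, so the SAMPLE fragment is an assumption (a lazy-loaded image whose @src is an inline data:image/ placeholder, as the title suggests), and pick_image_url is a hypothetical helper, not part of Scrapy. It uses the standard library's xml.etree.ElementTree rather than Scrapy selectors so it runs without a crawl.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for a lazy-loaded image. The real page
# structure is an assumption; only the src / data-src split matters here.
SAMPLE = """
<div id="article-wrapper">
  <img src="data:image/gif;base64,R0lGODlhAQABAAAAACw="
       data-src="https://example.com/uploads/feature.jpg" />
</div>
"""


def pick_image_url(img):
    """Prefer data-src; fall back to src unless it is an inline data: URI."""
    lazy = img.get("data-src")
    if lazy:
        return lazy
    src = img.get("src")
    if src and not src.startswith("data:"):
        return src
    return None


root = ET.fromstring(SAMPLE)
img = root.find(".//img")
print(img.get("src"))       # the inline placeholder, not a usable URL
print(pick_image_url(img))  # the URL stored in data-src
```

In a spider, the same preference order could be expressed by trying the @data-src XPath first and falling back to @src only when the first returns nothing.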

Try this:

import scrapy


class NewsSpider(scrapy.Spider):
    name = "articles"

    def start_requests(self):
        # Example: https://skift.com/2022/10/08/american-express-travels-rebound-and-other-top-stories-this-week/
        url = input("Enter the article url: ")
        yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # Read @data-src (the original image URL) instead of @src, and skip urljoin().
        Feature_Image = response.xpath('//*[@id="article-wrapper"]/article/section[2]/div/div/div/img/@data-src').get()

        yield {
            # 'Publication Date': Published_Date,
            'Feature_Image': Feature_Image,
            # 'Article Content': Content
        }

Output:

{'Feature_Image': 'https://skift.com/wp-content/uploads/2022/10/American_Express_office_in_Rome-1-e1665181357253-1024x682.jpg'}
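
The answer does not cover the CSV step from the question. As a rough sketch only, assuming each item is a plain dict with the keys yielded in the question, the row could be appended with the standard csv module, writing the header only when the file is new. FIELDS and append_row are hypothetical names introduced for illustration.

```python
import csv
import os

# Column names taken from the dict yielded in the question's spider.
FIELDS = ["Publication Date", "Feature_Image", "Article Content"]


def append_row(path, row):
    """Append one item to a CSV file, writing the header only for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" prevents blank lines between rows on Windows.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Alternatively, Scrapy's built-in feed export can write the yielded items directly, for example scrapy crawl articles -o pagedata.csv, which avoids hand-written file handling.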
