Scraping the image URL with Scrapy in Python from a <source> element (which has no class) instead of the img element

Posted by plupiseo on 2022-11-09 in Python

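I am trying to use Scrapy in Python to scrape the image URL, but instead of the img element I want to take it from the <source> element, which has no class attribute and no src attribute. Can anyone help me with how to do this? Thanks in advance.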

<source media="(min-width: 1024px)" sizes="1140px" srcset="https://static1.simpleflyingimages.com/wordpress/wp-content/uploads/2022/09/Thomas-Boon-Air-Canada-2.jpg?q=50&amp;fit=contain&amp;w=1140&amp;h=&amp;dpr=1.5">

The code I tried:

from urllib.parse import urljoin
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from datetime import datetime
import pandas as pd

class NewsSpider(scrapy.Spider):
    name = "simpleflying"

    def start_requests(self):
        url = input("Enter the article url: ")

        yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        Feature_Image = [i.strip() for i in response.css('source media="(min-width: 1024px)" ::attr(data-origin-srcset)').getall()][0]
        yield{
            'Feature_Image': Feature_Image,
        }

This is the link to the site: https://simpleflying.com/best-airlines-travel-with-babies-young-children/

Answer #1 (qij5mzcb)

You can try the following example:

import scrapy
class NewsSpider(scrapy.Spider):
    name = "articles"
    def start_requests(self):
        url='https://simpleflying.com/best-airlines-travel-with-babies-young-children/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):

        img_url = response.xpath('//*[@class="heading_image responsive-img img-size-heading-image-full-width expandable "]/figure/picture/img/@data-img-url').get()
        yield {
            'img_url':img_url
        }

Output:

{'img_url': 'https://static1.simpleflyingimages.com/wordpress/wp-content/uploads/2022/09/Thomas-Boon-Air-Canada-2.jpg'}
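If you would rather read the URL from the <source> element as originally asked, the sketch below may help. It assumes the HTML that Scrapy receives contains the <source> tag exactly as quoted in the question, with a `srcset` attribute rather than `data-origin-srcset`; that is worth checking in `scrapy shell` first. The selector in the question fails because a CSS attribute test needs square brackets. The helper names (`first_srcset_url`, `strip_query`, `extract_feature_image`) are made up for illustration and are not part of Scrapy.

```python
from urllib.parse import urlsplit, urlunsplit

# CSS attribute selector for the <source> element quoted in the question.
# The attribute test goes in square brackets, and ::attr() reads `srcset`
# (assumption: the fetched HTML carries `srcset`, as in the quoted markup).
SOURCE_SRCSET_CSS = 'source[media="(min-width: 1024px)"]::attr(srcset)'


def first_srcset_url(srcset):
    """Return the URL of the first candidate in a srcset value, or None."""
    if not srcset:
        return None
    tokens = srcset.split()
    if not tokens:
        return None
    # A candidate looks like "URL [descriptor]"; a URL without a descriptor
    # may be followed directly by the comma that separates candidates.
    return tokens[0].rstrip(",")


def strip_query(url):
    """Drop the query string and fragment, e.g. '?q=50&fit=contain...'."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def extract_feature_image(response):
    """Hypothetical helper: accepts a Scrapy response (anything with .css())."""
    srcset = response.css(SOURCE_SRCSET_CSS).get()  # None if nothing matches
    url = first_srcset_url(srcset)
    return strip_query(url) if url else None
```

Inside the spider callback this could be used as `yield {'Feature_Image': extract_feature_image(response)}`. Using `.get()` instead of `.getall()[0]` avoids an IndexError when the selector matches nothing; the item then simply contains `None`.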
