Why is the Scrapy image pipeline not downloading images?

Posted by 332nm8kg on 2022-11-09 in Other


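I am trying to download all images from a product gallery. I tried the script below, but for some reason the images are not downloaded. I can download the main image, which has an ID. The other gallery images have no ID, and I cannot download them.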

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BasicSpider(CrawlSpider):
    name = 'basic'
    allowed_domains = ['www.leebmann24.de']
    start_urls = ['https://www.leebmann24.de/bmw.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='category-products']/ul/li/h2/a"), callback='parse_item'),
        Rule(LinkExtractor(restrict_xpaths="//li[@class='next']/a"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):

        yield {
            'URL': response.url,
            'Price': response.xpath("normalize-space(//span[@class='price']/text())").get(),
            'image_urls': response.xpath("//div[@class='item']/a/img/@src").getall()
        }
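
The question does not show the project settings, so one thing worth ruling out is that the image pipeline is simply not enabled. A minimal sketch of the relevant `settings.py` entries follows, assuming the built-in `ImagesPipeline` is used (it also needs Pillow installed); the `images` directory name and the priority value `1` are illustrative choices, not taken from the question.

```python
# settings.py (sketch; the values below are assumptions, not from the question)

# Enable Scrapy's built-in image pipeline. The number is its order among
# the enabled item pipelines.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Directory where downloaded images are stored. The pipeline stays
# disabled if this setting is missing.
IMAGES_STORE = "images"
```

With these in place, the pipeline reads URLs from the item's `image_urls` field and records the download results in an `images` field.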

scrapy

Source: https://stackoverflow.com/questions/73626406/why-scrapy-image-pipeline-is-not-downloading-images

2 answers

mzillmmw1#

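@Raisul Islam, the XPath '//*[@id="image-main"]/@src' yields the image URL, and I did not run into any problems. Please check the output below and see whether it is what you expect.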

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BasicSpider(CrawlSpider):
    name = 'basic'
    allowed_domains = ['www.leebmann24.de']
    start_urls = ['https://www.leebmann24.de/bmw.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='category-products']/ul/li/h2/a"), callback='parse_item'),
        Rule(LinkExtractor(restrict_xpaths="//li[@class='next']/a"), callback='parse_item', follow=True),
    )

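    def parse_item(self, response):
        yield {
            'URL': response.url,
            'Price': response.xpath("normalize-space(//span[@class='price']/text())").get(),
            # Main product image only. .get() returns a single string, which
            # matches the output shown here; ImagesPipeline expects a list in
            # image_urls, so use .getall() when the pipeline is enabled.
            'image_urls': response.xpath('//*[@id="image-main"]/@src').get()
        }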

Output:

{'URL': 'https://www.leebmann24.de/aruma-antirutschmatte-3er-f30-f31.html', 'Price': '57,29\xa0€', 'image_urls': 'https://www.leebmann24.de/media/catalog/product/cache/1/image/363x/040ec09b1e35df139433887a97daa66f/a/r/aruma-antirutschmatte-94452302924-1.jpg'}
2022-09-07 02:35:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leebmann24.de/bmw-erste-hilfe-set-klarsichtbeutel-51477158344.html> (referer: https://www.leebmann24.de/bmw.html?p=2)
2022-09-07 02:35:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.leebmann24.de/bmw-erste-hilfe-set-klarsichtbeutel-51477158344.html>
{'URL': 'https://www.leebmann24.de/bmw-erste-hilfe-set-klarsichtbeutel-51477158344.html', 'Price': '15,64\xa0€', 'image_urls': 'https://www.leebmann24.de/media/catalog/product/cache/1/image/363x/040ec09b1e35df139433887a97daa66f/b/m/bmw-erste-hilfe-klarsichtbeutel-51477158433.jpg'}
2022-09-07 02:35:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.leebmann24.de/erste-hilfe-set.html> (failed 1 times): 503 Service Unavailable
2022-09-07 02:35:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leebmann24.de/aruma-antirutschmatte-x5-f15.html> (referer: https://www.leebmann24.de/bmw.html)
2022-09-07 02:35:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.leebmann24.de/aruma-antirutschmatte-x5-f15.html>
{'URL': 'https://www.leebmann24.de/aruma-antirutschmatte-x5-f15.html', 'Price': '71,66\xa0€', 'image_urls': 'https://www.leebmann24.de/media/catalog/product/cache/1/image/363x/040ec09b1e35df139433887a97daa66f/a/r/aruma-antirutschmatte-94452347734-1.jpg'}
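
In the output above, `image_urls` is a single string, whereas the image pipeline iterates over that field and expects a list of absolute URLs. A small sketch of a normalizing step that could run before yielding the item; `to_image_urls` is a hypothetical helper name, not part of Scrapy, and it uses only the standard library.

```python
from urllib.parse import urljoin


def to_image_urls(page_url, raw):
    """Return a list of absolute image URLs.

    `raw` may be None, a single string (as returned by .get()),
    or a list of strings (as returned by .getall()).
    """
    if raw is None:
        return []
    if isinstance(raw, str):
        raw = [raw]
    # Resolve relative paths against the page URL and skip empty values.
    return [urljoin(page_url, src) for src in raw if src]
```

Inside `parse_item`, this would be used as `'image_urls': to_image_urls(response.url, response.xpath(...).get())`, so the field is always a list regardless of which selector method was called.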


f2uvfpb92#

This expression fetches all product images except the main one (which you said you already have):