Scrapy not downloading images

j2cgzkjk, posted on 2022-11-09 in Other

Development environment

  • Windows 11
  • PyCharm Community Edition 2021.3.1
  • Python 3.10

I am following the tutorial "Download Images By Python and Scrapy", but my script does not work correctly.

spider.py

import scrapy

class WikiSpider(scrapy.Spider):
    name = 'wiki'
    start_urls = ['https://en.wikipedia.org/wiki/Real_Madrid_CF']

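    def parse(self, response):
        # Collect the src attribute of every <img> inside an element with class "image"
        urls = response.css('.image img ::attr(src)').getall()
        clean_urls = []

        for url in urls:
            # Turn relative and protocol-relative paths into absolute URLs
            clean_urls.append(response.urljoin(url))
        yield {
            'image_urls': clean_urls
        }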
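
For reference, the loop in parse() exists because the src attributes on a page are often not full URLs. response.urljoin resolves each one against the page URL, much like urllib.parse.urljoin from the standard library. Here is a minimal sketch of that step in isolation. The src values are made up for illustration and absolutize is a hypothetical helper name, not part of Scrapy.

```python
from urllib.parse import urljoin

# Page the spider starts from (taken from start_urls above).
PAGE_URL = "https://en.wikipedia.org/wiki/Real_Madrid_CF"

# Hypothetical src values, made up for illustration only.
SAMPLE_SRCS = [
    "//upload.wikimedia.org/example/thumb.png",   # protocol-relative
    "/static/images/example.png",                 # root-relative
    "https://example.org/already/absolute.png",   # already absolute
]


def absolutize(page_url, srcs):
    """Resolve each src against the page URL, as the for loop in parse() does."""
    return [urljoin(page_url, src) for src in srcs]


clean_urls = absolutize(PAGE_URL, SAMPLE_SRCS)
for url in clean_urls:
    print(url)
```

The protocol-relative entry picks up the https scheme of the page, the root-relative entry picks up scheme and host, and the absolute entry is returned unchanged.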

settings.py

BOT_NAME = 'imagedownload'

SPIDER_MODULES = ['imagedownload.spiders']
NEWSPIDER_MODULE = 'imagedownload.spiders'
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'images_folder'

# Obey robots.txt rules

ROBOTSTXT_OBEY = True

The tutorial does not modify items.py or pipelines.py. When I run my spider, it runs without errors and I can see the parsed image URLs, but the images are not downloaded.

Steps I have taken to solve the problem

1. Set ROBOTSTXT_OBEY = False
2. Added this snippet to my spider.py file

    # Inside the WikiSpider class body; needs `import os` at the top of spider.py
    save_location = os.getcwd()
    custom_settings = {
        "ITEM_PIPELINES": {'scrapy.pipelines.images.ImagesPipeline': 1},
        "IMAGES_STORE": save_location
    }

3. Tried adding this snippet to settings.py

import os
IMAGES_STORE = os.getcwd()

Any help would be greatly appreciated!

What I expect is for the script to download images
dgiusagp  #1

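You are close. I think the problem is that the dictionary you yield has no suitable field for the image results.
I would suggest using a custom Scrapy item with the fields predefined. To keep things simple you can define it in the same file as your spider, and then add all of the ImagesPipeline settings to the custom_settings dictionary in your Spider class.
For example: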

import scrapy

class Item(scrapy.Item):
    images_urls = scrapy.Field()
    images = scrapy.Field()

class WikiSpider(scrapy.Spider):
    custom_settings = {
        "IMAGES_STORE" : "images",  # <- make sure whatever you put here is an existing empty folder at the top level of your project.
        "ITEM_PIPELINES" : {"scrapy.pipelines.images.ImagesPipeline": 1},
        "IMAGES_URLS_FIELD": "images_urls",
        "IMAGES_RESULT_FIELD": "images",
    }

    name = 'wiki'
    start_urls = ['https://en.wikipedia.org/wiki/Real_Madrid_CF']

    def parse(self, response):
        for url in response.css('.image img ::attr(src)').getall():
            item = Item()
            item['images_urls'] = [response.urljoin(url)]
            yield item
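
To check whether anything was actually written, it helps to know where to look. As far as I know, the stock ImagesPipeline saves each image as a JPEG under IMAGES_STORE, in a full/ subfolder, named after the SHA-1 hash of the image URL. The sketch below assumes that default naming and is not taken from the tutorial. expected_image_path and sample_url are hypothetical names, and the URL is made up.

```python
import hashlib
from pathlib import Path


def expected_image_path(images_store, url):
    """Guess where the default ImagesPipeline would save the image for `url`.

    Assumption: the default naming scheme, full/<sha1 of the URL>.jpg.
    """
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return Path(images_store) / "full" / f"{image_guid}.jpg"


# Hypothetical URL, made up for illustration only.
sample_url = "https://upload.wikimedia.org/example/thumb.png"
path = expected_image_path("images", sample_url)
print(path, "exists" if path.exists() else "is missing")
```

After a crawl you could run this with one of the URLs from your log. If the file is missing, look at the pipeline messages in the Scrapy log, since a pipeline that failed to load or a download that was rejected is usually reported there.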
