Scrapy image download

Asked by flmtquvp on 2023-04-12 · 6 answers

My spider runs without showing any errors, but the images are not being stored in the folder. Here are my scrapy files:

Spider.py:

import scrapy
import re
import os
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.pipelines.images import ImagesPipeline
from production.items import ProductionItem, ListResidentialItem

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["someurl.com"]
    start_urls = [
        "someurl.com"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            img_url = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract()[0]
            yield scrapy.Request(urlparse.urljoin(response.url, img_url), callback=self.parseBasicListingInfo, meta={'item': item})

    def parseBasicListingInfo(self, response):
        item = response.request.meta['item']
        item = ListResidentialItem()
        try:
            image_urls = map(unicode.strip, response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract())
            item['image_urls'] = [x for x in image_urls]
        except IndexError:
            item['image_urls'] = ''

        return item

settings.py:

from scrapy.settings.default_settings import ITEM_PIPELINES
from scrapy.pipelines.images import ImagesPipeline

BOT_NAME = 'production'

SPIDER_MODULES = ['production.spiders']
NEWSPIDER_MODULE = 'production.spiders'
DEFAULT_ITEM_CLASS = 'production.items'

ROBOTSTXT_OBEY = True
DEPTH_PRIORITY = 1
IMAGE_STORE = '/images'

CONCURRENT_REQUESTS = 250

DOWNLOAD_DELAY = 2

ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 300,
}

items.py

# -*- coding: utf-8 -*-
import scrapy

class ProductionItem(scrapy.Item):
    img_url = scrapy.Field()

# ScrapingList Residential & Yield Estate for sale
class ListResidentialItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

    pass

My pipelines file is empty, and I don't know what I should add to the pipeline.py file.
Any help is greatly appreciated.


w51jfk4q1#

My final working result:

spider.py

import scrapy
import re
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.pipelines.images import ImagesPipeline
from production.items import ProductionItem
from production.items import ImageItem

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["url"]
    start_urls = [
        "startingurl.com"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            img_url = sel.xpath('//a[@idd="followclaslink"]/@href').extract()[0]
            yield scrapy.Request(urlparse.urljoin(response.url, img_url), callback=self.parseImages, meta={'item': item})

    def parseImages(self, response):
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(image_urls=[img_url])
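One caveat for the loop above: @src is often a relative URL, while the images pipeline needs absolute URLs to download from. A hedged variant of parseImages that joins each src against the page URL (using the same urlparse helper imported above, and skipping empty src attributes):

def parseImages(self, response):
    for elem in response.xpath("//img"):
        img_url = elem.xpath("@src").extract_first()
        if img_url:
            # join relative src values so the pipeline receives absolute URLs
            yield ImageItem(image_urls=[urlparse.urljoin(response.url, img_url)])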

settings.py

BOT_NAME = 'production'

SPIDER_MODULES = ['production.spiders']
NEWSPIDER_MODULE = 'production.spiders'
DEFAULT_ITEM_CLASS = 'production.items'
ROBOTSTXT_OBEY = True
IMAGES_STORE = '/Users/home/images'

DOWNLOAD_DELAY = 2

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
# Disable cookies (enabled by default)

items.py

# -*- coding: utf-8 -*-
import scrapy

class ProductionItem(scrapy.Item):
    img_url = scrapy.Field()
# ScrapingList Residential & Yield Estate for sale
class ListResidentialItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

pipelines.py

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
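Note that with ITEM_PIPELINES pointing at the built-in 'scrapy.pipelines.images.ImagesPipeline', the custom MyImagesPipeline above is never actually enabled; it would also need an image_paths field on the item for item_completed to work. A minimal sketch of both changes, assuming the pipeline lives in production/pipelines.py:

# settings.py: enable the custom pipeline instead of the built-in one
ITEM_PIPELINES = {'production.pipelines.MyImagesPipeline': 1}

# items.py: extra field that item_completed writes to
class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()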

pxq42qpu2#

Since you don't know what to put in the pipeline, I assume you can just use the default pipeline that scrapy provides for handling images, so in the settings.py file declare it like this:

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

Also, your image path is wrong: a leading / means the absolute root of the machine, so either give the absolute path to where you want to save the images, or just use a path relative to where you run the crawler from:

IMAGES_STORE = '/home/user/Documents/scrapy_project/images'

IMAGES_STORE = 'images'

Now, in the spider you are extracting the url but not saving it into the item; store it like this:

item['image_urls'] = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract_first()
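One detail to keep in mind: the default ImagesPipeline iterates over image_urls, so the field should hold a list of absolute URLs rather than a single string. A sketch keeping the same XPath but storing the whole list:

item['image_urls'] = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract()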

If you use the default pipeline, the field must be named image_urls.
Now, in the items.py file you need to add the following 2 fields (both must use these exact names):

image_urls = scrapy.Field()
images = scrapy.Field()

That should do it.


amrnrhlw3#

In my case it was the IMAGES_STORE path that was causing the problem.
I set IMAGES_STORE = 'images' and it worked like a charm!
Here is the full code:
Settings:

ITEM_PIPELINES = {
   'mutualartproject.pipelines.MyImagesPipeline': 1,
}

IMAGES_STORE = 'images'

Pipelines:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # schedule one download request per URL in the item's image_urls list
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # collect the storage paths of the images that downloaded successfully
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item
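With a relative IMAGES_STORE such as 'images', the files are written relative to the directory you run the crawler from, and the built-in pipeline places each file in a full/ subfolder named after the SHA-1 hash of its URL, roughly like this (illustrative filename only):

images/
    full/
        0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg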

azpvetkf4#

Just adding my own mistake here, which had me confused for a couple of hours. Maybe it helps someone else.

From the scrapy docs (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#using-the-images-pipeline):

Then, configure the target storage setting to a valid value that will be used for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.
For some reason I had used a colon ':' instead of an equals sign '='.

# My mistake:
IMAGES_STORE : '/Users/my_user/images'

# Working code
IMAGES_STORE = '/Users/my_user/images'

This does not return an error; it just causes the pipeline not to load at all, which made the problem hard for me to track down.
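A likely reason it fails silently: in Python 3.6+, NAME : value at module level parses as a variable annotation rather than an assignment, so settings.py imports cleanly but the name is never bound. A quick sketch of the difference:

IMAGES_STORE : '/Users/my_user/images'   # annotation only, binds nothing
print('IMAGES_STORE' in dir())           # False: the setting does not exist

IMAGES_STORE = '/Users/my_user/images'   # assignment, the setting is defined
print('IMAGES_STORE' in dir())           # True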


8ljdwjyq5#

SPIDER_MIDDLEWARES and DOWNLOADER_MIDDLEWARES must be enabled in the settings.py file.


bwntbbo36#

I ran into the same problem and nothing helped. It started working after doing the following:

pip install pillow
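This is consistent with Scrapy's ImagesPipeline depending on Pillow for image processing; without it the pipeline will not run. A quick check that Pillow is importable in the same environment that runs the spider (a minimal sketch, not part of the original answer):

import PIL
print(PIL.__version__)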
