Scrapy pipeline does not run

rqenqsqc  posted on 2023-03-30  in Other

I have the following spider:

class WebSpider(scrapy.Spider):
    name = "web"
    allowed_domains = ["www.web.com"]
    start_urls = ["https://www.web.com/page/"]
    custom_settings = {
        "ITEM_PIPELINES": {
            "models.pipelines.ModelsPipeline": 1,
            "models.pipelines.MongoDBPipeline": 2,
        },
        "IMAGES_STORE": get_project_settings().get("FILES_STORE"),
    }

    def parse_models(self, response):
        ...
        yield WebItem(image_urls=[img_url], images=[name], name=name, collection="web")

class WebItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    name = scrapy.Field()
    collection = scrapy.Field()

MongoDBPipeline always runs with either of the following configurations:

"ITEM_PIPELINES": {
    "models.pipelines.ModelsPipeline": 1,
    "models.pipelines.MongoDBPipeline": 2,
}

"ITEM_PIPELINES": {
    "models.pipelines.MongoDBPipeline": 2,
}

But ModelsPipeline never runs with either of the following configurations:

"ITEM_PIPELINES": {
    "models.pipelines.ModelsPipeline": 1,
    "models.pipelines.MongoDBPipeline": 2,
}

"ITEM_PIPELINES": {
    "models.pipelines.ModelsPipeline": 1,
}

ModelsPipeline is in the same file as MongoDBPipeline, and its code is as follows:

class ModelsPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        pdb.set_trace()
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        pdb.set_trace()
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        adapter = ItemAdapter(item)
        adapter['image_paths'] = image_paths
        return item

But neither get_media_requests nor item_completed is ever executed.
The code is the same as in the docs: https://docs.scrapy.org/en/latest/topics/media-pipeline.html
What is wrong, and why does Scrapy not run ModelsPipeline?

Edit

The Scrapy version is 2.8.0.
Thanks.

nbewdwxp  #1

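When using a media pipeline, you need to populate all of the appropriate settings for it to work.
In your example, ModelsPipeline inherits from ImagesPipeline, so you must satisfy all of the ImagesPipeline requirements.
These include: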

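  1. The IMAGES_STORE setting, not FILES_STORE. If IMAGES_STORE is unset or empty, Scrapy leaves the pipeline disabled even when it is listed in ITEM_PIPELINES.
  2. The Scrapy item needs to have the appropriate fields.
    1. You can customize these fields with the settings:

IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'

    2. Or you can use the default fields, or both: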

import scrapy

class MyItem(scrapy.Item):
    # ... other item fields ...
    image_urls = scrapy.Field()
    images = scrapy.Field()
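
To illustrate the custom-field option, here is a rough sketch that pairs the two settings with an item whose field names match them. The names PhotoItem, photo_urls, photos and missing_image_fields are hypothetical. The item is written as a dataclass, which Scrapy also accepts as an item type, only so that the snippet runs without Scrapy installed.

```python
from dataclasses import dataclass, field, fields


@dataclass
class PhotoItem:
    # Hypothetical item; the field names are illustrative only.
    name: str = ""
    photo_urls: list = field(default_factory=list)  # URLs to download
    photos: list = field(default_factory=list)      # filled in by the pipeline


# The values here must match the item's field names exactly.
IMAGE_FIELD_SETTINGS = {
    "IMAGES_URLS_FIELD": "photo_urls",
    "IMAGES_RESULT_FIELD": "photos",
}


def missing_image_fields(item_cls, settings):
    """Return the configured field names that the item class does not declare."""
    declared = {f.name for f in fields(item_cls)}
    return [name for name in settings.values() if name not in declared]


print(missing_image_fields(PhotoItem, IMAGE_FIELD_SETTINGS))
```

An empty list means every configured field name is declared on the item. Note that a pipeline that overrides get_media_requests and reads item['image_urls'] directly, as in the question, does not consult IMAGES_URLS_FIELD in that method.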

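If you choose to use any of the other optional settings, you need to set them correctly, and you must make sure that your IMAGES_STORE path already exists.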
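
As a minimal sketch of the IMAGES_STORE requirement, the helper below builds the spider's custom_settings and fails loudly when the store path is empty. That would be the situation in the question if FILES_STORE is not defined in the project settings, since get_project_settings().get("FILES_STORE") would then return None. The helper name build_custom_settings and the example path are assumptions; the pipeline paths are taken from the question.

```python
def build_custom_settings(images_store):
    """Build spider custom_settings (hypothetical helper, not part of Scrapy).

    An ImagesPipeline subclass stays disabled when IMAGES_STORE is unset or
    empty, so raise here instead of letting it be skipped silently.
    """
    if not images_store:
        raise ValueError("IMAGES_STORE must be a non-empty path or URI")
    return {
        "ITEM_PIPELINES": {
            "models.pipelines.ModelsPipeline": 1,
            "models.pipelines.MongoDBPipeline": 2,
        },
        "IMAGES_STORE": images_store,
    }


# The path is an assumption; use whatever directory suits the project.
custom_settings = build_custom_settings("/tmp/web_images")
```

In the spider, the resulting dictionary would be assigned to the custom_settings class attribute. When the pipeline is active, Scrapy's startup log should list it under the enabled item pipelines.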
