Allow duplicate downloads with Scrapy Image Pipeline?

Posted by lrl1mhuk on 2022-11-09 in Other


Below is a sample version of my code, which uses the Scrapy Image Pipeline to download/scrape images from a site:

import scrapy
from scrapy_splash import SplashRequest
from imageExtract.items import ImageextractItem

class ExtractSpider(scrapy.Spider):
    name = 'extract'
    start_urls = ['url']

    def parse(self, response):
        image = ImageextractItem()
        titles = ['a', 'b', 'c', 'd', 'e', 'f']
        rel = ['url1', 'url2', 'url3', 'url4', 'url5', 'url6']

        image['title'] = titles
        image['image_urls'] = rel
        return image

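This all works fine, but by default the pipeline skips duplicate downloads. Is there a way to override this so that duplicates are downloaded as well? Thanks.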

scrapy

Source: https://stackoverflow.com/questions/45177367/allow-duplicate-downloads-with-scrapy-image-pipeline

4 answers


vq8itlhq1#

Thanks to Tomáš's guidance, I finally found a way to download duplicate images.
In the _process_request method of the MediaPipeline class, I commented out these lines:

# Return cached result if request was already seen
# if fp in info.downloaded:
#     return defer_result(info.downloaded[fp]).addCallbacks(cb, eb)

# Check if request is downloading right now to avoid doing it twice
# if fp in info.downloading:
#     return wad

An uncaught KeyError is then raised, but it did not seem to affect my results, so I stopped digging further.


hwazgwia2#

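I think one possible solution is to create your own image pipeline that inherits from scrapy.pipelines.images.ImagesPipeline and overrides the get_media_requests method (see the example in the documentation).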
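
A custom pipeline only takes effect once it is registered in the project settings. The fragment below is a minimal sketch of that step. The dotted path and class name (imageExtract.pipelines.DuplicateImagesPipeline) are hypothetical and only illustrate where a subclass of ImagesPipeline would be referenced; the IMAGES_STORE value is a placeholder.

```python
# settings.py (fragment)

ITEM_PIPELINES = {
    # Hypothetical dotted path: point this at your own ImagesPipeline subclass
    # instead of the stock scrapy.pipelines.images.ImagesPipeline.
    "imageExtract.pipelines.DuplicateImagesPipeline": 1,
}

# Directory where downloaded images are stored (placeholder value).
IMAGES_STORE = "images"
```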


uwopmtnx3#

To get around the KeyError that Rick mentioned, this is what I did:
In the MediaPipeline class, find the function _cache_result_and_execute_waiters. You will see an if statement like this:

if isinstance(result, Failure):
    # minimize cached information for failure
    result.cleanFailure()
    result.frames = []
    result.stack = None

I added another if statement that checks whether fp is in info.waiting, and moved everything that follows into it:

if fp in info.waiting:
    info.downloading.remove(fp)  
    info.downloaded[fp] = result  # cache result
    for wad in info.waiting.pop(fp):
        defer_result(result).chainDeferred(wad)

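In the debug log, the path names under "images" in your Scrapy item are still incorrect. However, I saved the files under the correct paths by building a list of image names for all of my "image_urls".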


57hvy0tb4#

https://github.com/scrapy/scrapy/blob/c5627af15bcf413c04539aeb47dd07cf8b3e4092/scrapy/pipelines/media.py#L99


# Return cached result if request was already seen
if fp in info.downloaded:
    return defer_result(info.downloaded[fp]).addCallbacks(cb, eb)

# Otherwise, wait for result
wad = Deferred().addCallbacks(cb, eb)
info.waiting[fp].append(wad)

Here fp is the fingerprint of the request, which is implemented as follows:
https://github.com/scrapy/scrapy/blob/c5627af15bcf413c04539aeb47dd07cf8b3e4092/scrapy/utils/request.py#L35

def request_fingerprint(
    request: Request,
    include_headers: Optional[Iterable[Union[bytes, str]]] = None,
    keep_fragments: bool = False,
) -> str:
    """
    Return the request fingerprint as an hexadecimal string.

    The request fingerprint is a hash that uniquely identifies the resource the
    request points to. For example, take the following two urls:

    http://www.example.com/query?id=111&cat=222
    http://www.example.com/query?cat=222&id=111

    Even though those are two different URLs both point to the same resource
    and are equivalent (i.e. they should return the same response).
...
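
To make the docstring's point concrete, here is a toy model of a fingerprint written with the standard library only. It is not Scrapy's implementation: toy_fingerprint is a hypothetical stand-in that merely sorts the query parameters and hashes the resulting URL. It shows why the two example URLs collide and why one extra parameter is enough to produce a different fingerprint.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def toy_fingerprint(url: str) -> str:
    """Hash a URL after sorting its query parameters (toy model only)."""
    parts = urlsplit(url)
    # Sort the query pairs so that parameter order does not affect the hash.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


a = toy_fingerprint("http://www.example.com/query?id=111&cat=222")
b = toy_fingerprint("http://www.example.com/query?cat=222&id=111")
c = toy_fingerprint("http://www.example.com/query?cat=222&id=111&nocache=1")

print(a == b)  # reordered parameters: the fingerprints match
print(a == c)  # one extra parameter: the fingerprints differ
```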

I think adding a random parameter to the image URL is more elegant than commenting out source code.
Like this:

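import time

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class YourImagePipelineClass(ImagesPipeline):
    def get_media_requests(self, item, info):
        # A changing query parameter gives every request a different fingerprint,
        # so the pipeline no longer treats it as already seen.
        url = item.get('img_url') + '?<some_params_key>=%s' % str(time.time())
        yield scrapy.Request(url, meta=item, dont_filter=True)
...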
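
Appending the parameter by string concatenation breaks when the image URL already contains a query string. The helper below is a minimal sketch of a more robust variant using only the standard library. The function name with_cache_buster and the parameter name nocache are assumptions made for illustration, and uuid4 replaces the timestamp so that two calls made in quick succession cannot collide. It also assumes the server ignores unknown query parameters.

```python
import uuid
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cache_buster(url: str, key: str = "nocache") -> str:
    """Return url with an extra, unique query parameter appended."""
    parts = urlsplit(url)
    # Keep any existing parameters and add one random value at the end.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs.append((key, uuid.uuid4().hex))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


first = with_cache_buster("http://www.example.com/img.png?size=large")
second = with_cache_buster("http://www.example.com/img.png?size=large")
print(first)
print(second)
```

Inside get_media_requests, each URL from the item could be passed through such a helper before being wrapped in a scrapy.Request.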

