Scrapy downloads the same file repeatedly

Asked by zd287kbt on 2022-11-09

My script has a problem: the PDFs it downloads end up under the same file name. When I check the scraped output without the file download pipeline the data is unique, but as soon as I enable the pipeline it somehow produces duplicate downloads.
Here is my spider:

import scrapy
from environment.items import fcpItem

class fscSpider(scrapy.Spider):
    name = 'fsc'
    start_urls = ['https://fsc.org/en/members']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url, 
                callback = self.parse
            )

    def parse(self, response):
        content = response.xpath("(//div[@class='content__wrapper field field--name-field-content field--type-entity-reference-revisions field--label-hidden field__items']/div[@class='content__item even field__item'])[position() >1]")
        loader = fcpItem()
        names_add = response.xpath(".//div[@class = 'field__item resource-item']/article//span[@class='media-caption file-caption']/text()").getall()
        url = response.xpath(".//div[@class = 'field__item resource-item']/article/div[@class='actions']/a//@href").getall()

        pdf=[response.urljoin(x) for x in  url if '#' is not x]
        names = [x.split(' ')[0] for x in names_add]
        for nm, pd in zip(names, pdf):
            loader['names'] = nm
            loader['pdfs'] = [pd]
            yield loader

items.py

import scrapy
from scrapy import Field

class fcpItem(scrapy.Item):
    names = Field()
    pdfs = Field()
    results = Field()

pipelines.py

from scrapy.pipelines.files import FilesPipeline

class DownfilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, item=None):
        # Name each downloaded file after the item's 'names' field.
        return item['names'] + '.pdf'

settings.py

from pathlib import Path
import os

BASE_DIR = Path(__file__).resolve().parent.parent
FILES_STORE = os.path.join(BASE_DIR, 'fsc')

ROBOTSTXT_OBEY = False

FILES_URLS_FIELD = 'pdfs'
FILES_RESULT_FIELD = 'results'

ITEM_PIPELINES = {

    'environment.pipelines.pipelines.DownfilesPipeline': 150
}
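
For context, with FILES_URLS_FIELD = 'pdfs' and FILES_RESULT_FIELD = 'results' the stock FilesPipeline reads its download URLs from item['pdfs'] and writes the download results back into item['results']. The following is only a simplified sketch of that mapping, not the real implementation (which lives in scrapy.pipelines.files):

import scrapy
from scrapy.pipelines.files import FilesPipeline

class SketchFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # FILES_URLS_FIELD ('pdfs' here) must hold a list of URLs,
        # even when there is only one PDF per item.
        for file_url in item.get(self.files_urls_field, []):
            yield scrapy.Request(file_url)

    def item_completed(self, results, item, info):
        # FILES_RESULT_FIELD ('results' here) receives one dict per
        # successfully downloaded file (url, path, checksum, status).
        item[self.files_result_field] = [data for ok, data in results if ok]
        return item
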
k2arahey (Answer 1)

I used CSS selectors instead of XPath.
In the Chrome developer tools, the element matched by div.field__item.resource-item is the root of each PDF list item: that root contains both the PDF title (in a span) and the file download URL (in an a tag). Because the title and the link sit at different depths under that root, an XPath expression has to spell out awkward child and sibling relationships, while a CSS selector can go straight from the root to each target element without describing that path. Iterating over each root element also avoids the index-matching problem of keeping a separate URL array and title array in sync.
The other key points are that the URL path needs to be decoded, and that file_urls must be set to a list even when it contains a single item.
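
The decode-and-rebuild step described above can be shown on its own. The href below is a made-up example, not one taken from fsc.org:

import urllib.parse

href = '/file/example%20policy%20document.pdf'   # hypothetical href, for illustration only

decoded = urllib.parse.unquote(href)              # '/file/example policy document.pdf'
url_left = decoded[:decoded.rfind('/')] + '/'     # '/file/'
title = 'example policy document.pdf'

file_url = 'https://fsc.org' + url_left + title.replace(' ', '%20')
print(file_url)   # https://fsc.org/file/example%20policy%20document.pdf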

fsc_spider.py

import scrapy
import urllib.parse
from quotes.items import fcpItem

class fscSpider(scrapy.Spider):
    name = 'fsc'
    start_urls = [
        'https://fsc.org/en/members',
    ]

    def parse(self, response):
        for book in response.css('div.field__item.resource-item'):
            url = urllib.parse.unquote(book.css('div.actions a::attr(href)').get(), encoding='utf-8', errors='replace')
            url_left = url[0:url.rfind('/')]+'/'
            title = book.css('span.media-caption.file-caption::text').get()

            item = fcpItem()
            item['original_file_name'] = title.replace(' ','_')
            item['file_urls'] = ['https://fsc.org'+url_left+title.replace(' ','%20')]
            yield item

items.py

import scrapy

class fcpItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
    original_file_name = scrapy.Field()

pipelines.py

import scrapy
from scrapy.pipelines.files import FilesPipeline

class fscPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, item=None):
        file_name: str = request.url.split("/")[-1].replace('%20','_')
        return file_name
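
As a quick check of what this file_path returns, here is the same transformation applied to a hypothetical request URL (the URL is made up for illustration):

url = 'https://fsc.org/file/example%20policy%20document.pdf'   # hypothetical URL
file_name = url.split('/')[-1].replace('%20', '_')
print(file_name)   # example_policy_document.pdf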

settings.py

BOT_NAME = 'quotes'

FILES_STORE =  'downloads'
SPIDER_MODULES = ['quotes.spiders']
NEWSPIDER_MODULE = 'quotes.spiders'
FEED_EXPORT_ENCODING = 'utf-8'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = { 'quotes.pipelines.fscPipeline': 1}

File structure

Execution

quotes>scrapy crawl fsc

Result

sauutmhj (Answer 2)

The problem is that you are overwriting the same scrapy item on every iteration.
What you need to do is create a new item for each result that the parse method yields. I have tested this and confirmed that it produces the output you are looking for.
In the example below I have added an inline note on the line that needs to change.
For example:

import scrapy
from environment.items import fcpItem

class fscSpider(scrapy.Spider):
    name = 'fsc'
    start_urls = ['https://fsc.org/en/members']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url, 
                callback = self.parse
            )

    def parse(self, response):
        content = response.xpath("(//div[@class='content__wrapper field field--name-field-content field--type-entity-reference-revisions field--label-hidden field__items']/div[@class='content__item even field__item'])[position() >1]")
        names_add = response.xpath(".//div[@class = 'field__item resource-item']/article//span[@class='media-caption file-caption']/text()").getall()
        url = response.xpath(".//div[@class = 'field__item resource-item']/article/div[@class='actions']/a//@href").getall()
        pdf=[response.urljoin(x) for x in  url if '#' is not x]
        names = [x.split(' ')[0] for x in names_add]
        for nm, pd in zip(names, pdf):
            loader = fcpItem()  # Here you create a new item each iteration
            loader['names'] = nm
            loader['pdfs'] = [pd]
            yield loader
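
The effect is easier to see outside Scrapy. When a single mutable object is reused, every yielded reference points at the same data, which is roughly what happened with the one fcpItem instance; a plain-Python sketch:

# Reusing one dict: every entry in collected is the same object,
# so all of them end up with the values from the last iteration.
item = {}
collected = []
for name in ['a', 'b', 'c']:
    item['names'] = name
    collected.append(item)
print([x['names'] for x in collected])   # ['c', 'c', 'c']

# Creating a new dict per iteration keeps each result distinct.
collected = []
for name in ['a', 'b', 'c']:
    item = {'names': name}
    collected.append(item)
print([x['names'] for x in collected])   # ['a', 'b', 'c']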
