Different Scrapy feed export destination per request

svmlkihl · asked 2022-11-09 · Other

I'm trying to save each property's image URLs in its own csv file via feed exports. For this to work, the FEEDS csv path in custom_settings has to change every time a scrapy.Request is yielded in start_requests. Each time a scrapy.Request is yielded, self.feeds_csv_path (initialised in __init__) is assigned a new csv file path corresponding to the property ID, built by def get_feeds_csv_path, and that path is supposed to be picked up by FEEDS, as in the code below. But self.feeds_csv_path inside custom_settings doesn't seem to be able to reach def get_feeds_csv_path. Where is the error here?

from configparser import ConfigParser
import json
import os
import scrapy

class GetpropertyimgurlsSpider(scrapy.Spider):
    name = 'GetPropertyImgUrls'
    custom_settings = {
        "FEEDS": {
            self.feeds_csv_path: {
                "format": "csv",
                "overwrite": True
            }
        }
    }

    def __init__(self, *args, **kwargs):
        self.feeds_csv_path = None
        super().__init__(*args, **kwargs)

    def start_requests(self):
        files = self.get_html_files()  # List of html file full paths
        for file in files[:2]:
            self.feeds_csv_path = self.get_feeds_csv_path(file)
            yield scrapy.Request(file, callback=self.parse)

    def parse(self, response):
        texts = response.xpath("//text()").getall()
        text = texts[1]
        json_text = json.loads(text)
        photos = json_text["@graph"][3]["photo"]
        for photo in photos:
            yield photo["contentUrl"]

    def get_feeds_csv_path(self, html_file_path):
        property_id = html_file_path.split("/")[-2].split("_")[1]
        feeds_csv_path = f"{html_file_path}/images/Property_{property_id}_ImgSrcs.csv"
        return feeds_csv_path

    def get_path(self):
        config = ConfigParser()
        config.read("config.ini")  # Location relative to main.py
        path = config["scrapezoopla"]["path"]
        return path

    #Returns a list of html file dirs
    def get_html_files(self):
        path = self.get_path()
        dir = f"{path}/data/properties/"
        dir_list = os.listdir(dir)
        folders = []
        for ins in dir_list:
            if os.path.isdir(f"{dir}{ins}"):
                folders.append(ins)

        html_files = []
        for folder in folders:
            html_file = f"{dir}{folder}/{folder}.html"
            if os.path.isfile(html_file):
                html_files.append(f"file:///{html_file}")
        return html_files

Answer #1, by gc0ot86w

The first problem I see is that you are using the self keyword in the namespace scope of your spider class. The self keyword is only available inside instance methods, where you pass it in as the first parameter, e.g. def __init__(self, ...).
Even if self were available there, it still wouldn't work, because self.feeds_csv_path would be converted to its string value the moment the custom_settings dictionary is created, so updating the instance variable at runtime has no effect on the custom_settings attribute (see the sketch below).
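A minimal, Scrapy-free sketch of that evaluation order (all names here are made up):

path = "a.csv"

class Demo:
    # The class body runs exactly once, when the class object is created,
    # so whatever "path" holds right now is baked into the dictionary.
    settings = {"FEEDS": {path: {"format": "csv"}}}

path = "b.csv"        # rebinding the name afterwards changes nothing
print(Demo.settings)  # {'FEEDS': {'a.csv': {'format': 'csv'}}}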
Another issue is that Scrapy collects all of the custom settings and stores them internally before the crawl actually begins, so updating the custom_settings dictionary mid-crawl probably has no effect anyway, though I'm not certain about that.
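For reference, this is roughly how Scrapy reads custom_settings: a classmethod copies the dictionary into the settings object once, before the spider is even instantiated (paraphrased from Scrapy's source; the exact code varies between versions):

class Spider:
    @classmethod
    def update_settings(cls, settings):
        # Called on the class, before __init__ runs; custom_settings is
        # read once here and the dictionary is not consulted again.
        settings.setdict(cls.custom_settings or {}, priority="spider")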
That said, your goal is still achievable. One way I can think of is to build the FEEDS dictionary at runtime, before starting the crawl, and to use custom scrapy.Item classes to filter which item belongs to which output.
I have no way to test this, so it may be wrong, but here is an example of what I mean:

from configparser import ConfigParser
import json
import os
import scrapy

def get_path():
    config = ConfigParser()
    config.read("config.ini")  # Location relative to main.py
    path = config["scrapezoopla"]["path"]
    return path

# Returns a list of html file dirs

def get_html_files():
    path = get_path()
    folder = f"{path}/data/properties/"
    dir_list = os.listdir(folder)
    html_files = []
    for ins in dir_list:
        if os.path.isdir(f"{folder}{ins}"):
            if os.path.isfile(f"{folder}{ins}/{ins}.html"):
                html_files.append(f"file:///{folder}{ins}/{ins}.html")
    return html_files

def get_feeds_csv_path(html_file_path):
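    # property_id is the second "_"-separated token of the html file's
    # parent folder name; the csv path is built the same way as in the
    # question's helper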
    property_id = html_file_path.split("/")[-2].split("_")[1]
    feeds_csv_path = f"{html_file_path}/images/Property_{property_id}_ImgSrcs.csv"
    return feeds_csv_path

def create_custom_item():
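    # A fresh scrapy.Item subclass per call: each input file gets its own
    # class, which the matching feed can filter on via "item_classes".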
    class Item(scrapy.Item):
        contentUrl = scrapy.Field()
    return Item

def customize_settings():
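    # Build the complete FEEDS mapping, plus a {file_uri: ItemClass} table,
    # at class-creation time, i.e. before the crawl starts.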
    feeds = {}
    files = get_html_files()
    start_urls = {}
    for path in files:
        custom_class = create_custom_item()
        output_path = get_feeds_csv_path(path)
        start_urls[path] = custom_class
        feeds[output_path] = {
            "format": "csv",
            "item_classes": [custom_class],
        }
    custom_settings = {"FEEDS": feeds}
    return custom_settings, start_urls

class GetpropertyimgurlsSpider(scrapy.Spider):
    name = 'GetPropertyImgUrls'
    custom_settings, start_urls = customize_settings()

    def start_requests(self):
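        # Hand each request its dedicated Item class through cb_kwargs so
        # parse() yields items that are routed to the matching csv feed.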
        for uri, itemclass in self.start_urls.items():
            yield scrapy.Request(uri, callback=self.parse, cb_kwargs={'itemclass': itemclass})

    def parse(self, response, itemclass):
        texts = response.xpath("//text()").getall()
        text = texts[1]
        json_text = json.loads(text)
        photos = json_text["@graph"][3]["photo"]
        for photo in photos:
            item = itemclass()
            item['contentUrl'] = photo["contentUrl"]
            yield item
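
Note that the item_classes feed option relies on feed item filtering, which was added in Scrapy 2.6.0, so this approach needs Scrapy 2.6 or newer. With that in place, running scrapy crawl GetPropertyImgUrls should write each property's image URLs to its own csv file.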
