Scrapy: unhashable list error when scraping information

Posted by zy1mlcev on 2022-11-09 in Other

I am trying to extract information, but I get an unhashable list error. This is the page link: https://rejestradwokatow.pl/adwokat/abaewicz-agnieszka-51004

import scrapy
from scrapy.http import Request
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://rejestradwokatow.pl/adwokat/list/strona/1/sta/2,3,9']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
        }

    def parse(self, response):
        wev={}
        tic = response.xpath("//div[@class='line_list_K']//div//span//text()").getall()
        det = response.xpath("//div[@class='line_list_K']//div//div//text()").getall()
        wev[tuple(tic)]=[i.strip() for i in det]

        yield wev

This gives the following output:

But I want the output to look like this:


ny6fqffe1#

Dictionary keys cannot be mutable; they must be hashable. Try the following:

def parse(self, response):
    wev={}
    tic = response.xpath("//div[@class='line_list_K']//div//span//text()").getall()
    det = response.xpath("//div[@class='line_list_K']//div//div//text()").getall()
    wev[tuple(tic)]=[i.strip() for i in det]
    print(wev)
    yield wev

Or, more simply:

def parse(self, response):
    tic = response.xpath("//div[@class='line_list_K']//div//span//text()").getall()
    det = response.xpath("//div[@class='line_list_K']//div//div//text()").getall()
    yield {tuple(tic): [i.strip() for i in det]}
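To see why the tuple conversion matters, here is a minimal sketch outside Scrapy. The two lists are sample values standing in for what `.getall()` returns; they are assumptions for illustration, not output from the real page.

```python
# Sample data (assumed): .getall() returns plain Python lists like these.
tic = ['Status:', 'Stary nr wpisu:']
det = ['Były adwokat', '1077']

wev = {}

# A list is mutable and therefore unhashable, so it cannot be a dict key.
try:
    wev[tic] = det
except TypeError as exc:
    error_message = str(exc)
    print('TypeError:', error_message)

# A tuple is immutable and hashable, so the same assignment succeeds.
wev[tuple(tic)] = det
print(wev)
```

Note that this only removes the error: the whole tuple of labels becomes a single key, which is probably not the shape wanted for a CSV.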

koaltpgm2#

You have to use zip() to group the values from tic and det:

for name, value in zip(tic, det):
    wev[name.strip()] = value.strip()

This gives wev with:

{
    'Status:': 'Były adwokat', 
    'Data wpisu w aktualnej izbie na listę adwokatów:': '2013-09-01', 
    'Data skreślenia z listy:': '2019-07-23', 
    'Ostatnie miejsce wpisu:': 'Katowice', 
    'Stary nr wpisu:': '1077', 
    'Zastępca:': 'Pieprzyk Mirosław'
}

And this gives a CSV with the correct values:

Status:,Data wpisu w aktualnej izbie na listę adwokatów:,Data skreślenia z listy:,Ostatnie miejsce wpisu:,Stary nr wpisu:,Zastępca:
Były adwokat,2013-09-01,2019-07-23,Katowice,1077,Pieprzyk Mirosław
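One caveat with zip() is that it pairs items purely by position and stops at the shorter list. The sketch below uses made-up sample lists (an assumption, not real page output) in which one row has a label but no text value, so every later value shifts onto the wrong label.

```python
# Sample data (assumed): the 'Fax:' row has a label but no text value,
# so det ends up one element shorter than tic.
tic = ['Status:', 'Fax:', 'Stary nr wpisu:']
det = ['Były adwokat', '1077']

# zip() pairs by position only and silently stops at the shorter list.
paired = dict(zip(tic, det))
print(paired)

# A cheap sanity check before trusting the pairing.
lengths_match = len(tic) == len(det)
print('lengths match:', lengths_match)
```

Here '1077' lands under 'Fax:' and 'Stary nr wpisu:' is dropped entirely, so comparing the two lengths is a quick way to notice the problem.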

EDIT:

Ultimately, you should first get the rows and then search for name and value within each row:

all_rows = response.xpath("//div[@class='line_list_K']/div")

for row in all_rows:
    name  = row.xpath(".//span/text()").get()
    value = row.xpath(".//div/text()").get()
    wev[name.strip()] = value.strip()

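This approach is sometimes safer when a row has no value, or has an unusual value such as the email. The email is added by JavaScript (which Scrapy does not run), but the page keeps it as attributes in the tag <div class="address_e" data-ea="adwokat.adach" data-eb="gmail.com">.
Because only some pages have an Email, the value may not be added to the file, so wev needs a default value at the start: wev = {'Email:': '', ...}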

wev = {'Email:': ''}

for row in all_rows:
    name  = row.xpath(".//span/text()").get()
    value = row.xpath(".//div/text()").get()
    if name and value:
        wev[name.strip()] = value.strip()
    elif name and name.strip() == 'Email:':
        # <div class="address_e" data-ea="adwokat.adach" data-eb="gmail.com"></div>
        div = row.xpath('./div')
        email_a = div.attrib['data-ea']
        email_b = div.attrib['data-eb']
        wev[name.strip()] = f'{email_a}@{email_b}'
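The row-by-row logic can be tried without running a crawl. The sketch below is only an approximation: it uses `xml.etree.ElementTree` from the standard library as a stand-in for Scrapy selectors, and the `SAMPLE` markup and the `extract_fields` helper are hypothetical, modelled on the XPath expressions and the `address_e` tag quoted above rather than copied from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (assumption): one outer div per row,
# a <span> for the label and an inner <div> for the value.
SAMPLE = """
<root>
  <div class="line_list_K">
    <div><span>Status:</span><div>Były adwokat</div></div>
    <div><span>Fax:</span><div></div></div>
    <div><span>Email:</span><div class="address_e" data-ea="adwokat.adach" data-eb="gmail.com"></div></div>
  </div>
</root>
"""

def extract_fields(markup):
    # Defaults guarantee that every key exists even if a row is missing.
    wev = {'Status:': '', 'Fax:': '', 'Email:': ''}
    root = ET.fromstring(markup)
    for row in root.findall(".//div[@class='line_list_K']/div"):
        name = (row.findtext('span') or '').strip()
        cell = row.find('div')
        if not name or cell is None:
            continue  # skip rows without a label or a value cell
        value = (cell.text or '').strip()
        if value:
            wev[name] = value
        elif name == 'Email:':
            # The address is split across two attributes instead of text.
            email_a = cell.get('data-ea')
            email_b = cell.get('data-eb')
            if email_a and email_b:
                wev[name] = f'{email_a}@{email_b}'
    return wev

fields = extract_fields(SAMPLE)
print(fields)
```

Reading the attributes with `.get()` instead of indexing avoids a KeyError on pages where the email tag lacks one of the two attributes.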

Full working code:


# rejestradwokatow

import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):

    name = 'test'

    start_urls = [
        #'https://rejestradwokatow.pl/adwokat/list/strona/1/sta/2,3,9',
        'https://rejestradwokatow.pl/adwokat/abaewicz-agnieszka-51004',
        'https://rejestradwokatow.pl/adwokat/adach-micha-55082',
    ]

    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        # it may need default value when item doesn't exist on page 
        wev = {
            'Status:': '',
            'Data wpisu w aktualnej izbie na listę adwokatów:': '',
            'Stary nr wpisu:': '',
            'Adres do korespondencji:': '',
            'Fax:': '',
            'Email:': '',
        }

        tic = response.xpath("//div[@class='line_list_K']//div//span//text()").getall()
        det = response.xpath("//div[@class='line_list_K']//div//div//text()").getall()

        #print(tic)
        #print(det)
        #print('---')

        all_rows = response.xpath("//div[@class='line_list_K']/div")

        for row in all_rows:
            name  = row.xpath(".//span/text()").get()
            value = row.xpath(".//div/text()").get()
            if name and value:
                wev[name.strip()] = value.strip()
            elif name and name.strip() == 'Email:':
                # <div class="address_e" data-ea="adwokat.adach" data-eb="gmail.com"></div>
                div = row.xpath('./div')
                email_a = div.attrib['data-ea']
                email_b = div.attrib['data-eb']
                wev[name.strip()] = f'{email_a}@{email_b}'

        print(wev)

        yield wev

# --- run without creating project and save results in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(TestSpider)
c.start()
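The effect of the default values on the CSV can be illustrated with the standard library alone. This is a sketch with `csv.DictWriter`, not Scrapy's own exporter; the `FIELDS` list, the `with_defaults` helper and the two sample items are assumptions made for the example.

```python
import csv
import io

# Assumed column list; in the spider these are the keys of the default wev.
FIELDS = ['Status:', 'Fax:', 'Email:']

def with_defaults(item):
    """Return a row that has every expected column, empty when missing."""
    row = dict.fromkeys(FIELDS, '')
    row.update({k: v for k, v in item.items() if k in FIELDS})
    return row

# Made-up sample items: only the second one has an email.
items = [
    {'Status:': 'Były adwokat'},
    {'Status:': 'Były adwokat', 'Email:': 'adwokat.adach@gmail.com'},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator='\n')
writer.writeheader()
for item in items:
    writer.writerow(with_defaults(item))

csv_text = buffer.getvalue()
print(csv_text)
```

Because every row carries the same keys, the header stays the same no matter which page is scraped first.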

qyuhtwio3#

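Check the data type of tic. It is most likely a list, which cannot be used as a dictionary key. You could convert it to a tuple, depending on your requirements.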
