Removing duplicate values with Scrapy

vd8tlhqk asked on 2022-11-09

There are 695 records on the page, but my spider returns 954 records, so there are duplicate values. How can I remove the duplicates so that I get only the 695 records? This is the page link: http://www.palatakd.ru/list/

import scrapy
from scrapy.http import Request

class PushpaSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['http://www.palatakd.ru/list/']
    page_number=1

    def parse(self, response):
        details=response.xpath("//p[@class='detail_block']")
        for detail in details:
            registration=detail.xpath(".//span[contains(.,'Регистрационный номер адвоката в реестре')]//following-sibling::span//text()").get()
            address=detail.xpath(".//span[contains(.,'Адрес')]//following-sibling::span//text()").get()
            phone=detail.xpath(".//span[contains(.,'Телефон')]//following-sibling::span//text()").get()
            fax=detail.xpath(".//span[contains(.,'Факс')]//following-sibling::span//text()").get()
            yield {
                'Телефон': phone,
                'Факс': fax,
                'Регистрационный номер адвоката в реестре': registration,
                'Адрес': address
            }
            next_page = 'http://www.palatakd.ru/list/?PAGEN_1=' + str(PushpaSpider.page_number)

            if PushpaSpider.page_number<=3:
                PushpaSpider.page_number += 1
                yield response.follow(next_page, callback = self.parse)

mwkjh3gx #1

You can enable an item pipeline to filter out the duplicates.
For example:
In the settings.py file, enable (uncomment) ITEM_PIPELINES:

ITEM_PIPELINES = {
   'project.pipelines.ProjectPipeline': 300,
}

In the pipelines.py file, filter out the duplicate items (here 'project' stands for your actual Scrapy project package name):

from scrapy.exceptions import DropItem

class ProjectPipeline:
    # Keeps every item yielded so far, so exact duplicates can be detected.
    itemlist = []

    def process_item(self, item, spider):
        # Drop the item if an identical one has already been processed.
        if item in self.itemlist:
            raise DropItem(f"Duplicate item found: {item!r}")
        self.itemlist.append(item)
        return item

No adjustments to the spider are needed.
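
As a variation (my own addition, not part of the original answer), duplicates can also be filtered on a single key field, which avoids comparing whole items and scales better than membership tests on a growing list. A minimal sketch, assuming the registration number uniquely identifies a record (that uniqueness is an assumption, and RegistrationDedupPipeline is a hypothetical name):

from scrapy.exceptions import DropItem

class RegistrationDedupPipeline:
    # Hypothetical alternative pipeline: deduplicates on one field instead
    # of comparing whole items. Assumes the registration number is unique.

    def open_spider(self, spider):
        # A set gives O(1) membership checks, unlike a list of dicts.
        self.seen = set()

    def process_item(self, item, spider):
        key = item.get('Регистрационный номер адвоката в реестре')
        if key in self.seen:
            raise DropItem(f"Duplicate registration number: {key}")
        self.seen.add(key)
        return item

To use it, register this class in ITEM_PIPELINES instead of (or alongside) ProjectPipeline.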
