scrapy — why can't I scrape all the items on the page? [duplicate]

nzkunb0c  asked on 2023-08-05  in Other
Follow (0) | Answers (1) | Views (142)

This question already has answers here:

How can I scrape a page with dynamic content (created by JavaScript) in Python? (18 answers)
Closed 6 days ago.
I'm trying to scrape the href of every house on this site: https://www.zapimoveis.com.br/venda/imoveis/ma+sao-jose-de-ribamar/. The problem is that the page lists 150 houses, but my code scrapes only 15 houses per page. I don't know whether the problem is in my XPaths or in the code.
This is the code:

import re

from ..items import ZapItem  # ZapItem is defined in the project's items module


def parse(self, response):
    # Collect the href of every result card on the current page
    hrefs = response.css('a.result-card ::attr(href)').getall()

    for url in hrefs:
        yield response.follow(url, callback=self.parse_imovel_info, dont_filter=True)


def parse_imovel_info(self, response):
    zap_item = ZapItem()

    # Raw text of the amenities list; entries still contain newlines and whitespace
    imovel_info = response.css('ul.amenities__list ::text').getall()

    tipo_imovel = response.css('a.breadcrumb__link--router ::text').get()
    endereco_imovel = response.css('span.link ::text').get()
    preco_imovel = response.xpath('//li[@class="price__item--main text-regular text-regular__bolder"]/strong/text()').get()
    condominio = response.xpath('//li[@class="price__item condominium color-dark text-regular"]/span/text()').get()
    iptu = response.xpath('//li[@class="price__item iptu color-dark text-regular"]/span/text()').get()
    area = response.xpath('//ul[@class="feature__container info__base-amenities"]/li').css('span[itemprop="floorSize"]::text').get()
    num_quarto = response.xpath('//ul[@class="feature__container info__base-amenities"]/li').css('span[itemprop="numberOfRooms"]::text').get()
    num_banheiro = response.xpath('//ul[@class="feature__container info__base-amenities"]/li').css('span[itemprop="numberOfBathroomsTotal"]::text').get()
    num_vaga = response.xpath('//ul[@class="feature__container info__base-amenities"]/li[@class="feature__item text-regular js-parking-spaces"]/span/text()').get()
    andar = response.xpath('//ul[@class="feature__container info__base-amenities"]/li').css('span[itemprop="floorLevel"]::text').get()

    url = response.url
    # The listing id is the number after "id-" in the URL
    id = re.search(r'id-(\d+)/', url).group(1)

    # For a given amenity name, keep the entries whose normalized text matches it exactly
    filtering = lambda info: [check if info == check.replace('\n', '').lower().strip() else None
                              for check in imovel_info]

    lista = {
        'academia': list(filter(lambda x: "academia" in x.lower(), imovel_info)),
        'piscina': list(filter(lambda x: x is not None, filtering('piscina'))),
        'spa': list(filter(lambda x: x is not None, filtering('spa'))),
        'sauna': list(filter(lambda x: "sauna" in x.lower(), imovel_info)),
        'varanda_gourmet': list(filter(lambda x: "varanda gourmet" in x.lower(), imovel_info)),
        'espaco_gourmet': list(filter(lambda x: "espaço gourmet" in x.lower(), imovel_info)),
        'quadra_de_esporte': list(filter(lambda x: 'quadra poliesportiva' in x.lower(), imovel_info)),
        'playground': list(filter(lambda x: "playground" in x.lower(), imovel_info)),
        'portaria_24_horas': list(filter(lambda x: "portaria 24h" in x.lower(), imovel_info)),
        'area_servico': list(filter(lambda x: "área de serviço" in x.lower(), imovel_info)),
        'elevador': list(filter(lambda x: "elevador" in x.lower(), imovel_info))
    }

    # Store the first match for each amenity, or None when it is absent
    for info, conteudo in lista.items():
        zap_item[info] = conteudo[0] if conteudo else None

    zap_item['valor'] = preco_imovel
    zap_item['tipo'] = tipo_imovel
    zap_item['endereco'] = endereco_imovel.replace('\n', '').strip()
    zap_item['condominio'] = condominio
    zap_item['iptu'] = iptu
    zap_item['area'] = area
    zap_item['quarto'] = num_quarto
    zap_item['vaga'] = num_vaga
    zap_item['banheiro'] = num_banheiro
    zap_item['andar'] = andar
    zap_item['url'] = response.url
    zap_item['id'] = int(id)

    yield zap_item

Can anyone help me?

mv1qrgav 1#

Based on the code provided, you appear to be using a web-scraping framework (most likely Scrapy) to extract data from the site. You are running into trouble because the site lists 150 properties in total, but your code scrapes only 15 houses per page.
The site is paginated, so the houses are spread across several pages, and your code scrapes only the first page (which has 15 houses); that is the most likely cause of this result. You will have to add pagination to your spider to scrape all 150 houses.
The general solution to this problem is as follows:
Identify the pagination URL pattern: look at the pagination on the page. As you move to the next page, inspect the URL to see whether any part of it changes from page to page (a sketch of this approach follows the example below).
Make your spider pagination-aware: update your spider so that it follows the pagination links and scrapes data from every page. The parse method may need to be updated to accommodate the pagination logic.
Here is an example of how pagination could be handled in the code:

import scrapy

class MySpider(scrapy.Spider):
    name = 'zap_spider'
    start_urls = ['https://www.zapimoveis.com.br/venda/imoveis/ma+sao-jose-de-ribamar/']

    def parse(self, response):
        # Scrape hrefs from the current page
        hrefs = response.css('a.result-card ::attr(href)').getall()
        for url in hrefs:
            yield response.follow(url, callback=self.parse_imovel_info, dont_filter=True)

        # Check if there's a next page and follow it
        next_page_url = response.css('a.pagination__item--next ::attr(href)').get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse, dont_filter=True)

    def parse_imovel_info(self, response):
        # Your parsing logic remains the same
        # ...

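Alternatively, if you can identify a URL pattern (the first step above), you can generate the page URLs up front in start_requests instead of chasing next-page links. A minimal sketch, assuming a hypothetical pagina=N query parameter and 10 result pages; both are assumptions you need to verify against the actual site:

import scrapy

class MySpider(scrapy.Spider):
    name = 'zap_spider'
    base_url = 'https://www.zapimoveis.com.br/venda/imoveis/ma+sao-jose-de-ribamar/'

    def start_requests(self):
        # 'pagina' and the page count of 10 are assumptions; confirm the real
        # parameter name and total number of pages by inspecting the site.
        for page in range(1, 11):
            yield scrapy.Request(f'{self.base_url}?pagina={page}', callback=self.parse)

    def parse(self, response):
        # Same per-page logic as before: follow every listing href
        for url in response.css('a.result-card ::attr(href)').getall():
            yield response.follow(url, callback=self.parse_imovel_info, dont_filter=True)

    def parse_imovel_info(self, response):
        # ...same parsing logic as in the question...
        pass

Following the next-page link is more robust when the total number of pages is unknown or can change; generating the URLs up front is simpler when the pattern is stable.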
