This question already has answers here:
How can I scrape a page with dynamic content (created by JavaScript) in Python? (18 answers)
Closed 6 days ago.
I am trying to scrape the href of every house on this site: https://www.zapimoveis.com.br/venda/imoveis/ma+sao-jose-de-ribamar/. The problem is that the site lists 150 houses, but my code only scrapes 15 per page. I don't know whether the problem is in my XPaths or in the code itself.
Here is the code:
import re  # needed for the id extraction below

def parse(self, response):
    hrefs = response.css('a.result-card ::attr(href)').getall()
    for url in hrefs:
        yield response.follow(url, callback=self.parse_imovel_info,
                              dont_filter=True)

def parse_imovel_info(self, response):
    zap_item = ZapItem()
    imovel_info = response.css('ul.amenities__list ::text').getall()
    tipo_imovel = response.css('a.breadcrumb__link--router ::text').get()
    endereco_imovel = response.css('span.link ::text').get()
    preco_imovel = response.xpath('//li[@class="price__item--main text-regular text-regular__bolder"]/strong/text()').get()
    condominio = response.xpath('//li[@class="price__item condominium color-dark text-regular"]/span/text()').get()
    iptu = response.xpath('//li[@class="price__item iptu color-dark text-regular"]/span/text()').get()
    area = response.xpath('//ul[@class="feature__container info__base-amenities"]/li').css('span[itemprop="floorSize"]::text').get()
    num_quarto = response.xpath('//ul[@class="feature__container info__base-amenities"]/li').css('span[itemprop="numberOfRooms"]::text').get()
    num_banheiro = response.xpath('//ul[@class="feature__container info__base-amenities"]/li').css('span[itemprop="numberOfBathroomsTotal"]::text').get()
    num_vaga = response.xpath('//ul[@class="feature__container info__base-amenities"]/li[@class="feature__item text-regular js-parking-spaces"]/span/text()').get()
    andar = response.xpath('//ul[@class="feature__container info__base-amenities"]/li').css('span[itemprop="floorLevel"]::text').get()
    url = response.url
    id = re.search(r'id-(\d+)/', url).group(1)
    filtering = lambda info: [check if info == check.replace('\n', '').lower().strip() else None for check in imovel_info]
    lista = {
        'academia': list(filter(lambda x: "academia" in x.lower(), imovel_info)),
        'piscina': list(filter(lambda x: x is not None, filtering('piscina'))),
        'spa': list(filter(lambda x: x is not None, filtering('spa'))),
        'sauna': list(filter(lambda x: "sauna" in x.lower(), imovel_info)),
        'varanda_gourmet': list(filter(lambda x: "varanda gourmet" in x.lower(), imovel_info)),
        'espaco_gourmet': list(filter(lambda x: "espaço gourmet" in x.lower(), imovel_info)),
        'quadra_de_esporte': list(filter(lambda x: 'quadra poliesportiva' in x.lower(), imovel_info)),
        'playground': list(filter(lambda x: "playground" in x.lower(), imovel_info)),
        'portaria_24_horas': list(filter(lambda x: "portaria 24h" in x.lower(), imovel_info)),
        'area_servico': list(filter(lambda x: "área de serviço" in x.lower(), imovel_info)),
        'elevador': list(filter(lambda x: "elevador" in x.lower(), imovel_info))
    }
    for info, conteudo in lista.items():
        if len(conteudo) == 0:
            zap_item[info] = None
        else:
            zap_item[info] = conteudo[0]
    # Note: the trailing commas after these assignments have been removed;
    # they were turning every value into a one-element tuple.
    zap_item['valor'] = preco_imovel
    zap_item['tipo'] = tipo_imovel
    zap_item['endereco'] = endereco_imovel.replace('\n', '').strip()
    zap_item['condominio'] = condominio
    zap_item['iptu'] = iptu
    zap_item['area'] = area
    zap_item['quarto'] = num_quarto
    zap_item['vaga'] = num_vaga
    zap_item['banheiro'] = num_banheiro
    zap_item['andar'] = andar
    zap_item['url'] = response.url
    zap_item['id'] = int(id)
    yield zap_item
Can anyone help me?
1 Answer
Based on the code provided, it appears you are using a web-scraping framework (probably Scrapy) to extract data from the site. You are running into trouble because the site lists 150 properties in total, but your code only scrapes 15 houses per page.
The site is paginated, so the houses are spread across several pages, and your code only scrapes the first page (which has 15 houses); that is the most likely cause. You need to add pagination to your spider to scrape all 150.
The general approach to this problem is as follows:
Identify the pagination URL pattern: look at the pagination controls on the page. As you move to the next page, check the URL to see whether some part of it changes with each page.
Make your spider pagination-aware: update your spider so that it follows the pagination links and scrapes data from every page. The parse method may need to be updated to accommodate the pagination logic.
Here is an example of how pagination could be handled in the code:
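A minimal sketch of that pagination pattern is below. Note two assumptions: the `pagina` query parameter is a guess at how zapimoveis addresses its result pages (verify the real pattern in your browser's address bar before relying on it), and `page_urls` is a hypothetical helper introduced here for illustration.

```python
# Sketch: build one URL per results page, then reuse the existing
# parse() callback on each of them. The `pagina` parameter is an
# assumption about the site's URL scheme -- check it in your browser.

BASE_URL = "https://www.zapimoveis.com.br/venda/imoveis/ma+sao-jose-de-ribamar/"

def page_urls(base_url, n_pages):
    """Return the listing-page URLs for pages 1..n_pages."""
    return [f"{base_url}?pagina={page}" for page in range(1, n_pages + 1)]

# In the spider, replace start_urls with a start_requests() method:
#
#     def start_requests(self):
#         # 150 listings at 15 per page -> 10 pages
#         for url in page_urls(BASE_URL, 10):
#             yield scrapy.Request(url, callback=self.parse)
```

With this in place, `parse()` runs once per results page, so each of its 15 `response.follow` calls fires for every page rather than only the first. Also bear in mind the duplicate question linked above: if the listings are injected by JavaScript, the extra pages may need a rendering solution (e.g. scrapy-playwright) rather than plain requests.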