I'm trying to scrape a particular website, but the data is loaded dynamically. I found that the data lives in JSON files, but I can't get a list of all the elements on the site, and I need every page.
- How can I get the list of all the similar JSONs, starting from the listing number?
- How can I read through all the pages with this kind of logic?
I'm not sure what to use. I tried Scrapy, but waiting for the pages to load is too complicated, and I wonder whether BeautifulSoup or something else would give a faster response.
Edit: added the scraping code
- I wrote this code with Scrapy, but I don't know how to get all the JSONs from the pages dynamically.
# https://www.fincaraiz.com.co/_next/data/build/proyecto-de-vivienda/altos-del-eden/el-eden/barranquilla/7109201.json?title=altos-del-eden&location1=el-eden&location2=barranquilla&code=7109201
import json
import logging

import scrapy
from scrapy_playwright.page import PageMethod

# scrapy crawl fincaraiz-home -O output-home.json
class PwspiderSpider(scrapy.Spider):
    name = "fincaraiz-home"
    base_url = "https://www.fincaraiz.com.co"
    build_url = "https://www.fincaraiz.com.co/_next/data/build"

    def start_requests(self):
        yield scrapy.Request(
            "https://www.fincaraiz.com.co/finca-raiz/venta/antioquia",
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    # PageMethod("wait_for_selector", 'div[id="listingContainer"]')
                    PageMethod("wait_for_selector", 'button:has-text("1")')
                ],
            ),
            errback=self.errback,
        )

    async def parse(self, response):
        for anuncio in response.xpath("//div[@id='listingContainer']/div"):
            # if anuncio.xpath("article/a/@href").extract():
            #     yield scrapy.Request(
            #         self.build_url + anuncio.xpath("article/a/@href").extract()[0] + ".json",
            #         callback=self.parse_json,
            #         # meta=dict(
            #         #     callback=self.parse_json,
            #         #     # playwright=True,
            #         #     # playwright_include_page=True,
            #         #     # playwright_page_methods=[
            #         #     #     PageMethod("wait_for_selector", 'button:has-text("1")')
            #         #     # ],
            #         # ),
            #         errback=self.errback,
            #     )
            yield {
                "link": anuncio.xpath("article/a/@href").extract(),
                "tipo_anuncio": anuncio.xpath("article/a/ul/li[1]/div/span/text()").extract(),
                "tipo_vendedor": anuncio.xpath("article/a/ul/li[2]/div/span/text()").extract(),
                "valor": anuncio.xpath("article/a/div/section/div[1]/span[1]/b/text()").extract(),
                "area": anuncio.xpath("article/a/div/section/div[2]/span[1]/text()").extract(),
                "habitaciones": anuncio.xpath("article/a/div/section/div[2]/span[3]/text()").extract(),
                "banos": anuncio.xpath("article/a/div/section/div[2]/span[5]/text()").extract(),
                "parqueadero": anuncio.xpath("article/a/div/section/div[2]/span[7]/text()").extract(),
                "ubicacion": anuncio.xpath("article/a/div/section/div[3]/div/span/text()").extract(),
                "imagen": anuncio.xpath("article/a/figure/img/@src").extract(),
                "tipo_inmueble": anuncio.xpath("article/a/div/footer/div/span/b/text()").extract(),
                "inmobiliaria": anuncio.xpath("article/a/div/footer/div/div/div").extract(),
            }

    # async def parse_json(self, response):
    #     yield json.loads(response.text)

    def errback(self, failure):
        logging.info(
            "Handling failure in errback, request=%r, exception=%r",
            failure.request, failure.value,
        )
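The commented-out request inside `parse` already shows the intended pattern: take each listing's `href` and append it to `build_url` plus `".json"` to reach the `_next/data` endpoint. A minimal sketch of just that URL-building step, assuming (as the commented code does) that the query-string parameters seen in the example URL at the top are optional:

```python
# Sketch of the URL-building step from the commented-out request in parse().
# Assumption: the ?title=...&code=... query string in the example URL is optional.
BUILD_URL = "https://www.fincaraiz.com.co/_next/data/build"

def detail_json_url(href: str) -> str:
    """Map a listing link such as
    '/proyecto-de-vivienda/altos-del-eden/el-eden/barranquilla/7109201'
    to its _next data JSON endpoint."""
    return BUILD_URL + href + ".json"
```

With this helper, the loop in `parse` can yield one `scrapy.Request(detail_json_url(href), callback=self.parse_json)` per listing, which is exactly what the commented-out block attempts.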
1 Answer
Using Playwright on this site is not the right approach; you should use their public search API instead.
Below is an example of how to make a POST request to that API and get all the information from the JSON response very quickly.
This code fetches 20 pages of 25 results each in about 3 seconds, and every item it yields contains all the information you were trying to extract with Playwright, looking something like this.
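The answer's actual code did not survive in this dump, so the following is only a reconstruction sketch of the described approach. The endpoint URL, the payload field names (`page`, `hitsPerPage`, `transactionType`), and the `"hits"` response key are all assumptions for illustration, not the site's documented API; inspect the browser's network tab to find the real request the frontend sends:

```python
# Sketch of the paginated POST approach described in the answer.
# The endpoint, payload fields, and response shape below are ASSUMPTIONS,
# not documented fincaraiz API details.
import json
import urllib.request

SEARCH_API = "https://www.fincaraiz.com.co/api/search"  # hypothetical placeholder

def build_payload(page: int, size: int = 25) -> dict:
    # Field names are guesses; copy the real payload from the network tab.
    return {"page": page, "hitsPerPage": size, "transactionType": "venta"}

def fetch_page(page: int) -> dict:
    # One POST per results page; the JSON body replaces the rendered HTML,
    # so no browser or Playwright wait is needed.
    data = json.dumps(build_payload(page)).encode("utf-8")
    req = urllib.request.Request(
        SEARCH_API, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

def fetch_all(pages: int = 20):
    # Yield every listing across the first `pages` pages of results.
    for page in range(1, pages + 1):
        yield from fetch_page(page).get("hits", [])  # "hits" key is a guess
```

Because each request returns a full page of structured JSON, this is far faster than waiting for the JavaScript-rendered listing grid to load.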