JSON web scraper returns an empty array

yeotifhr, posted 2022-12-24 in Other

I am trying to parse a website using the following code:

import requests
r = requests.get('https://www.finn.no/realestate/homes/search.html?sort=PUBLISHED_DESC')
print(r.json())

However, it just seems to return an empty array.
I also tried putting the response into a dict and then using:

import sys, json

struct = {}
try:
    dataform = str(r).strip("'<>() ").replace('\'', '\"')
    struct = json.loads(dataform)
except:
    print(repr(r))
    print(sys.exc_info())
    
struct

The code returns:
<Response [200]>
(<class 'json.decoder.JSONDecodeError'>, JSONDecodeError('Expecting value: line 1 column 1 (char 0)......
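
A `JSONDecodeError` at character 0 suggests the body is not JSON at all. One way to check this before calling `r.json()` is to look at the `Content-Type` response header. The sketch below is only an illustration: the helper name `is_json_content_type` is hypothetical (it is not part of requests), and the header values in the example are made up rather than captured from the site.

```python
def is_json_content_type(content_type):
    """Return True if a Content-Type header value names a JSON media type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Accept "application/json" and structured types like "application/ld+json".
    return media_type == "application/json" or media_type.endswith("+json")


# Illustrative header values (assumptions, not real responses):
print(is_json_content_type("application/json; charset=utf-8"))  # True
print(is_json_content_type("text/html; charset=utf-8"))         # False
```

With a real response this could be called as `is_json_content_type(r.headers.get("Content-Type"))`. An HTML search page would be expected to fail the check, which is consistent with the decode error shown above.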

Answer #1, by 9lowa7mx

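Right now you are trying to treat an HTML document as JSON, which is clearly not what you want. The page's JSON data is embedded in a <script> element, so you can use BeautifulSoup to locate it and the json module to parse it: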

import json
import requests
from bs4 import BeautifulSoup

r = requests.get(
    "https://www.finn.no/realestate/homes/search.html?sort=PUBLISHED_DESC"
)
soup = BeautifulSoup(r.content, "html.parser")

data = soup.select_one("#__NEXT_DATA__")
data = json.loads(data.text)

# pretty print the data:
print(json.dumps(data, indent=4))

Prints:

{
    "props": {
        "pageProps": {
            "search": {
                "docs": [
                    {
                        "type": "realestate",
                        "ad_id": 276867609,
                        "main_search_key": "SEARCH_ID_REALESTATE_NEWBUILDINGS",
                        "heading": "Unik anledning! \u00d8nsker du \u00e5 bo med \"leilighetsf\u00f8lelse\" rett ved bysentrum, og likevel ha plass til storfamilien?",
                        "location": "Fauchaldsgate 2, Gj\u00f8vik",
                        "image": {
                            "url": "https://images.finncdn.no/dynamic/default/2022/11/vertical-0/22/9/276/867/609_1157963344.jpg",
                            "path": "2022/11/vertical-0/22/9/276/867/609_1157963344.jpg",
                            "height": 1280,
                            "width": 1920,
                            "aspect_ratio": 1.5
                        },

...and so on.
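
Judging by the output above, the listings appear to sit under `data["props"]["pageProps"]["search"]["docs"]`. Here is a minimal sketch of pulling a few fields out of that structure, assuming the layout shown holds for every listing. The function name `iter_listings` and the `sample` dict are illustrative stand-ins, and the site may change its layout at any time.

```python
def iter_listings(data):
    """Yield (ad_id, heading, location) for each listing in the parsed data."""
    # Use .get() at each level so a missing key yields no listings
    # instead of raising KeyError.
    docs = (
        data.get("props", {})
        .get("pageProps", {})
        .get("search", {})
        .get("docs", [])
    )
    for doc in docs:
        # Fields absent from a listing come back as None.
        yield doc.get("ad_id"), doc.get("heading"), doc.get("location")


# Small stand-in for the parsed page data, mirroring the keys shown above.
sample = {
    "props": {"pageProps": {"search": {"docs": [
        {
            "type": "realestate",
            "ad_id": 276867609,
            "location": "Fauchaldsgate 2, Gj\u00f8vik",
        }
    ]}}}
}

for ad_id, heading, location in iter_listings(sample):
    print(ad_id, location)
```

With the real page, `iter_listings(data)` would be called on the dict returned by `json.loads` in the answer's code.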
