python-3.x: Extracting certain keys from script content with BeautifulSoup

Posted by zengzsys on 2023-06-25 in Python

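I used BeautifulSoup to extract the content of one particular script tag. That content contains JSON-like structured data.
I want to extract the three "url" values from the first "content" group and the "defeatedBosses" value from the second "content" group.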

This is the extracted script content (partial):

new WH.Wow.TodayInWow(WH.ge('tiw-standalone'), [{
    "id": "dungeons-and-raids",
    "groups": [{
        "content": {
            "lines": [{
                "icon": "achievement_boss_archaedas",
                "url": "\/affix=9\/tyrannical"
            }, {
                "icon": "spell_shaman_lavasurge",
                "url": "\/affix=3\/volcanic"
            }, {
                "icon": "spell_shadow_bloodboil",
                "url": "\/affix=8\/sanguine"
            }],
            "icons": "large"
        },
        "id": "mythicaffix",
    }, {
        "content": {
            "defeatedBosses": 9,
        },
        "id": "mythic-progression",
        "url": "\/aberrus-the-shadowed-crucible\/overview"
    }, 

    ...

My Python (3.11) script so far:

import re
import json
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req = Request('https://www.wowhead.com/today-in-wow', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()

soup = BeautifulSoup(html_page, "html.parser")

all_scripts = soup.find_all('script')
script_sp = all_scripts[36]

# My attempt

model_data = re.search(r"content = ({.*?});", script_sp, flags=re.S)
model_data = model_data.group(1)

model_data = json.loads(model_data)

print(model_data)

I get an error:

TypeError: expected string or bytes-like object, got 'Tag'

62o28rlo1#

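The error you get is: TypeError: expected string or bytes-like object, got 'Tag'
re.search needs a string, but all_scripts[36] is a BeautifulSoup Tag, so you should use .string on it.
If a tag has only one child, and that child is a NavigableString, the child is made available as .string: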

all_scripts = soup.find_all('script')
script_sp = all_scripts[36].string
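
To make the difference concrete, here is a minimal sketch on a made-up HTML snippet (the `html` string and the variable names are assumptions for illustration, not the real Wowhead page). It shows that passing the Tag to `re.search` raises the `TypeError`, while the tag's `.string` works:

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page (assumption for illustration)
html = '<html><body><script>var content = {"a": 1};</script></body></html>'
soup = BeautifulSoup(html, "html.parser")

script_tag = soup.find("script")   # a bs4 Tag object, not a str
script_text = script_tag.string    # the text inside the tag (a str subclass)

# Passing the Tag itself reproduces the error from the question
try:
    re.search(r"content = ({.*?});", script_tag)
    raised = False
except TypeError:
    raised = True

# Passing the text works and the group can be captured
match = re.search(r"content = ({.*?});", script_text, flags=re.S)
captured = match.group(1)

print(raised)
print(captured)
```

Note that `.string` is `None` when a tag has more than one child, so a check for `None` before calling `re.search` may be worthwhile on real pages.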

In addition, I changed your regular expression to:

model_data = re.search(r"new WH\.Wow\.TodayInWow\(WH\.ge\('tiw-standalone'\), (\[.*?\](?=\, true\);))", script_sp, flags=re.S)

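Printing the captured group then outputs a large amount of JSON data.
I'll leave getting the actual values you need to you, because there is too much JSON to dig through for the right path :)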
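
As a rough sketch of that last step, the regex above can be tried on a shortened stand-in for the script text that mirrors the structure shown in the question. The `script_text` value is an assumption (the real payload is much larger and may be laid out differently), and the group positions 0 and 1 are taken from the question's excerpt:

```python
import json
import re

# Shortened stand-in for the script text (assumption; modeled on the question)
script_text = r'''new WH.Wow.TodayInWow(WH.ge('tiw-standalone'), [{
    "id": "dungeons-and-raids",
    "groups": [{
        "content": {
            "lines": [
                {"icon": "achievement_boss_archaedas", "url": "\/affix=9\/tyrannical"},
                {"icon": "spell_shaman_lavasurge", "url": "\/affix=3\/volcanic"},
                {"icon": "spell_shadow_bloodboil", "url": "\/affix=8\/sanguine"}
            ],
            "icons": "large"
        },
        "id": "mythicaffix"
    }, {
        "content": {"defeatedBosses": 9},
        "id": "mythic-progression",
        "url": "\/aberrus-the-shadowed-crucible\/overview"
    }]
}], true);'''

# Same pattern as above: capture the JSON array that is followed by ", true);"
pattern = r"new WH\.Wow\.TodayInWow\(WH\.ge\('tiw-standalone'\), (\[.*?\](?=\, true\);))"
match = re.search(pattern, script_text, flags=re.S)
data = json.loads(match.group(1))

groups = data[0]["groups"]
# The three "url" values from the first "content" group
urls = [line["url"] for line in groups[0]["content"]["lines"]]
# "defeatedBosses" from the second "content" group
defeated = groups[1]["content"]["defeatedBosses"]

print(urls)
print(defeated)
```

`json.loads` turns the escaped `\/` sequences into plain `/`, so the URLs come out unescaped.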


vjhs03f72#

Here is an example of how you can download the page, parse the required data, and print some sample information (about US dungeons and raids):

import re
import json
from urllib.request import Request, urlopen

req = Request('https://www.wowhead.com/today-in-wow', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read().decode('utf-8')

json_data = re.search(r"TodayInWow\(WH\.ge\('tiw-standalone'\), (.*), true\);", html_page)
json_data = json.loads(json_data.group(1))

# uncomment to print all data:
# print(json.dumps(json_data, indent=4))

for part in json_data:
    if part['id'] == 'dungeons-and-raids' and part['regionId'] == 'US':
        for g in part['groups']:
            print(g['name'], g.get('url', '-'))

Prints:

Mythic+ Affixes /guides/mythic-keystones-and-dungeons
Aberrus, the Shadowed Crucible (Mythic) https://www.wowhead.com/guide/raids/aberrus-the-shadowed-crucible/overview
Conquest Points -
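
The regex above relies on the whole JSON array sitting on one line and being followed by `, true);`. If that layout ever changes, one possible alternative (a different technique from the regex, shown only as a sketch) is to locate the start marker and let the standard library's `json.JSONDecoder.raw_decode` parse exactly one JSON value from that position. The helper name `extract_json_after` and the `page` string are hypothetical:

```python
import json


def extract_json_after(text, marker):
    """Parse the single JSON value that starts right after `marker`.

    Hypothetical helper: raw_decode stops at the end of the first complete
    JSON value, so whatever follows it in `text` is ignored.
    """
    start = text.index(marker) + len(marker)
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


# Made-up page fragment (assumption for illustration)
page = (
    "x = 1; new WH.Wow.TodayInWow(WH.ge('tiw-standalone'), "
    '[{"id": "a", "groups": []}], true); y = [2];'
)
marker = "TodayInWow(WH.ge('tiw-standalone'), "
data = extract_json_after(page, marker)
print(data)
```

`raw_decode` does not skip leading whitespace, so the marker has to end exactly where the JSON value begins (here it includes the trailing space).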

EDIT: To make searching easier, I recommend converting the JSON data from a list to a dictionary:

import re
import json
from urllib.request import Request, urlopen

req = Request(
    "https://www.wowhead.com/today-in-wow", headers={"User-Agent": "Mozilla/5.0"}
)
html_page = urlopen(req).read().decode("utf-8")

json_data = re.search(
    r"TodayInWow\(WH\.ge\('tiw-standalone'\), (.*), true\);", html_page
)
json_data = json.loads(json_data.group(1))

# uncomment to print all data:
# print(json.dumps(json_data, indent=4))

# transform the received data from list to a dictionary (for easier search)
data = {
    (d["id"], d["regionId"]): {dd["id"]: dd for dd in d["groups"]} for d in json_data
}

for line in data[("dungeons-and-raids", "US")]["mythicaffix"]['content']['lines']:
    l = line['name'], line['url']
    if line['name'] == 'Tyrannical':
        print(' --> ', *l)
    else:
        print('     ', *l)

Prints:

-->  Tyrannical /affix=9/tyrannical
      Volcanic /affix=3/volcanic
      Sanguine /affix=8/sanguine
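
The same list-to-dictionary idea can be sketched offline on a small hand-written sample, which also shows how the "defeatedBosses" value from the question could be looked up by group id. The `sample` list is an assumption modeled on the structure above (in particular the "EU" entry and its values are made up):

```python
# Hand-written sample shaped like the parsed JSON (assumption for illustration)
sample = [
    {
        "id": "dungeons-and-raids",
        "regionId": "US",
        "groups": [
            {
                "id": "mythicaffix",
                "content": {
                    "lines": [
                        {"name": "Tyrannical", "url": "/affix=9/tyrannical"},
                        {"name": "Volcanic", "url": "/affix=3/volcanic"},
                        {"name": "Sanguine", "url": "/affix=8/sanguine"},
                    ]
                },
            },
            {"id": "mythic-progression", "content": {"defeatedBosses": 9}},
        ],
    },
    {
        "id": "dungeons-and-raids",
        "regionId": "EU",  # made-up second region
        "groups": [
            {"id": "mythic-progression", "content": {"defeatedBosses": 8}},
        ],
    },
]

# Key each section by (id, regionId), and each group by its own id
index = {
    (d["id"], d["regionId"]): {g["id"]: g for g in d["groups"]} for d in sample
}

us = index[("dungeons-and-raids", "US")]
urls = [line["url"] for line in us["mythicaffix"]["content"]["lines"]]
defeated = us["mythic-progression"]["content"]["defeatedBosses"]

print(urls)
print(defeated)
```

Looking groups up by id avoids depending on their position in the list, which may change between page loads.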
