Isolating data while web scraping using Python

Posted by nle07wnf on 2023-04-08 in Python


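I am trying to store HTML data from a website using Python and set up automated web scraping that fills a JSON file with a specific format. I already have the JSON file template, and I have been able to get the HTML data as text using BeautifulSoup. However, I cannot figure out how to select specific parts of the data without changing the code directly. Is there anything I can do, or do I have to insert all of this data myself? Thanks; the code I am using is below.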

import requests
from bs4 import BeautifulSoup

# TODO: automate page swapping; for now, test against a single page.
# TODO: increment over tr class-2 -> class 895 on the index page:
# page = requests.get('https://www.finalfantasyd20.com/bestiary')
page = requests.get('https://www.finalfantasyd20.com/bestiary/undead/abadon/')
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id='main')
Name = soup.find(id='abadon')
# print(Name.text)
# Type = soup.find()  # not going to work as the page stands, because there is no header
Stats = results.find_all('p')
for stat in Stats:
    print(stat.text)
    Str = stat.find(string='Str')
    print(Str)

I have made many attempts to isolate specific values without entering them myself, but I keep failing.


Source: https://stackoverflow.com/questions/75894644/isolating-data-while-web-scraping-using-python

2 Answers


1wnzp6jl1#

When I tried it, print(Str) did not output anything. Maybe this is what you need:

str_list = []

for stat in Stats:
    print(stat.text)
    Str = stat.find(string='Str')
    # Collect the text of each paragraph for later processing.
    str_list.append(stat.text)
    # print(Str)
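
Once the paragraph texts are collected, the individual numbers still have to be pulled out of each string. One possible way, which is not part of the answer above, is a regular expression. The following is only a sketch: the sample `text` is made up to resemble one collected entry, and the list of ability names in the pattern is an assumption about what the page contains.

```python
import re

# Made-up sample of one collected paragraph text (an assumption,
# not copied from the live page).
text = "Str 26, Dex 18, Con 22"

# Capture each ability name followed by an integer.
# The set of names in the pattern is assumed for illustration.
pairs = re.findall(r"\b(Str|Dex|Con|Int|Wis|Cha)\s+(\d+)", text)

# Turn the (name, value) pairs into a dictionary with integer values.
stats = {name: int(value) for name, value in pairs}
print(stats)
```

Entries whose value is not a plain integer would simply be skipped by this pattern, so it may need adjusting for the real data.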

Answered 2023-04-08

wvt8vs2t2#

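As I understand it, you want to scrape the statistics under the STATISTICS header (h5). As you can see, there is a paragraph under STATISTICS, and its children are your targets: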

<p><strong>Str</strong> 26, <strong>Dex</strong> 18......</p>

We can view it as a tree, where p is the parent node and

<strong>Str</strong>
' 26, '
<strong>Dex</strong>
' 18, '
.
.
.

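are its child nodes.
One solution is:
1/ Find the child with a strong tag and the string 'stat', where stat can be Str, Dex, ... [in your case, stat.find("strong", string='Str')]
2/ Navigate to the next sibling to extract the corresponding value [Str.next_sibling]
See the official BeautifulSoup documentation for more information: https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=next_sibling#next-sibling-and-previous-sibling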
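
The two steps can be tried on the one-line sample from this answer without requesting the site. This is a minimal sketch, assuming the markup really has the `<p><strong>Str</strong> 26, ...` shape; the `html` string is hand-written for illustration and the live page may differ.

```python
from bs4 import BeautifulSoup

# Hand-written sample mirroring the paragraph shown in this answer
# (an assumption, not a copy of the live page).
html = "<p><strong>Str</strong> 26, <strong>Dex</strong> 18</p>"
soup = BeautifulSoup(html, "html.parser")

# Step 1: find the <strong> child whose text is exactly 'Str'.
label = soup.find("strong", string="Str")

# Step 2: the text node right after the tag holds the value.
raw = label.next_sibling

# Strip the surrounding spaces and the trailing comma.
value = raw.strip(" ,")
print(value)
```

Note that `next_sibling` returns the raw text node, including the leading space and the comma, so some cleanup is usually needed before the value goes into a JSON file.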
Here is a patched version of your code:

import requests
from bs4 import BeautifulSoup

# TODO: automate page swapping; for now, test against a single page.
# TODO: increment over tr class-2 -> class 895 on the index page:
# page = requests.get('https://www.finalfantasyd20.com/bestiary')
page = requests.get('https://www.finalfantasyd20.com/bestiary/undead/abadon/')

soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id='main')
Name = soup.find(id='abadon')
# print(Name.text)
# Type = soup.find()  # not going to work as the page stands, because there is no header
stats = results.find_all('p')
for stat in stats:
    # print(stat.text)
    # print(stat)
    # Look for the <strong> tag whose text is exactly 'Str'.
    Str = stat.find("strong", string='Str')
    if Str is not None:
        # The value of Str is the text node that follows the <strong> tag.
        value = Str.next_sibling
        print(value)
You can do the same for the other stats.
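
To avoid repeating the lookup for every stat, the same idea can be applied to every strong tag in the paragraph at once, and the result dumped as JSON. This is only a sketch under stated assumptions: the `html` sample is hand-written, `extract_stats` is a hypothetical helper name, and the flat output layout is illustrative rather than the asker's actual JSON template.

```python
import json
from bs4 import BeautifulSoup

# Hand-written sample (an assumption); the real page may differ.
html = (
    "<p><strong>Str</strong> 26, <strong>Dex</strong> 18, "
    "<strong>Con</strong> 22</p>"
)
soup = BeautifulSoup(html, "html.parser")


def extract_stats(paragraph):
    """Map each <strong> label in the paragraph to the text that follows it."""
    stats = {}
    for label in paragraph.find_all("strong"):
        sibling = label.next_sibling
        # Skip labels that are not followed by a plain text node.
        if isinstance(sibling, str):
            stats[label.get_text(strip=True)] = sibling.strip(" ,")
    return stats


stats = extract_stats(soup.p)
print(json.dumps(stats, indent=2))
```

The values are kept as strings here because some entries on a real page might not be numeric; convert with `int()` where the template expects numbers.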

Answered 2023-04-08
