python-3.x: How do I pull in the temperature figures from convoluted HTML code using BeautifulSoup?

Posted by vwkv1x7d on 2023-03-20 in Python


I can narrow it down and filter on the title, so this is what I tried:

import requests
from bs4 import BeautifulSoup

def get_avg_high_temp(*args):
    # get_url() returns the weatherspark.com URL(s) for the city argument
    url_site = get_url(*args)

    for url in url_site:
        res = requests.get(url)
        soup = BeautifulSoup(res.text, "html.parser")
        print(soup.find(title='Temp.'))

Running the code above for one city gives the following result:

get_avg_high_temp('Calgary Canada')

<tr style="color: #333;" title="Temp.">
<td style="text-overflow: ellipsis; overflow: hidden; white-space: nowrap; max-width: 25vw;">
<span style="color: #333;">Temp.</span></td>

<td style="text-decoration: underline rgba(51,51,51,.35);">-6 °C</td><td>-5 °C</td><td>-1 °C</td>
<td>5 °C</td><td>10 °C</td><td>14 °C</td>
<td style="text-decoration: underline rgba(51,51,51,.35);">17 °C</td><td>16 °C</td><td>11 °C</td>
<td>6 °C</td><td>-2 °C</td>
<td style="text-decoration: underline rgba(51,51,51,.35);">-6 °C</td></tr>

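I tried the code above, but all I get back is the HTML shown. I would like to use the "°" sign to grab the number in front of it, but I don't know how to do that.
The expected result is the Celsius values returned as numbers.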

python-3.x

Source: https://stackoverflow.com/questions/75735898/how-do-i-pull-in-the-temperature-figures-from-convoluted-html-code-using-beautif

2 Answers


0vvn1miw1#

The following should work:

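import requests
from bs4 import BeautifulSoup

def get_avg_high_temp(*args):
    # get_url() is the asker's own helper that returns the URL(s) for the city
    url_site = get_url(*args)
    temps = []
    for url in url_site:
        res = requests.get(url)
        soup = BeautifulSoup(res.text, "html.parser")
        row = soup.find(title='Temp.')
        for cell in row.find_all('td'):
            text = cell.get_text(strip=True)
            # Skip the label cell; allow a leading minus so sub-zero values are kept
            if text.lstrip('-')[:1].isdigit():
                temps.append(int(text.split()[0]))
    print(temps)
    return temps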

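Also, you are pulling the average temperature, not the average high. If you want the average high, as your function name suggests, switch from "Temp." to "High". If you want a single overall average rather than a list of averages, use return sum(temps) / len(temps).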
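
The two suggestions above (choosing the row by its title and averaging the result) can be sketched as a small helper. This is a minimal sketch under assumptions: `row_values` is a hypothetical name, and the sample markup and its numbers are made up to resemble the row in the question; the real page may be structured differently.

```python
import re
from statistics import mean

from bs4 import BeautifulSoup


def row_values(html, row_title):
    """Return the integers found in the <tr> whose title attribute equals row_title."""
    soup = BeautifulSoup(html, "html.parser")
    row = soup.find("tr", title=row_title)
    if row is None:
        return []
    values = []
    for cell in row.find_all("td"):
        parts = cell.get_text(strip=True).split()
        # Keep only cells whose first token is an optionally negative integer
        if parts and re.fullmatch(r"-?\d+", parts[0]):
            values.append(int(parts[0]))
    return values


# Hypothetical sample markup (illustrative values, not taken from the site)
sample = """
<table>
<tr title="High"><td><span>High</span></td><td>-1 °C</td><td>1 °C</td><td>5 °C</td></tr>
<tr title="Temp."><td><span>Temp.</span></td><td>-6 °C</td><td>-5 °C</td><td>-1 °C</td></tr>
</table>
"""

highs = row_values(sample, "High")
print(highs)        # list of monthly values for the chosen row
print(mean(highs))  # single overall average
```

Passing the row title as an argument keeps the parsing in one place, so switching between "Temp." and "High" does not require editing the function body.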

2023-03-20

gab6jxml2#

You can use stripped_strings and split():

[t.split()[0] for t in soup.stripped_strings][1:]

Or, better, use a more specific CSS selector:

[t.text.split()[0] for t in soup.select('[title="Temp."] td:not(:has(span))')]

Both use split() to break each element's string on whitespace into the number and the unit.
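
The question asks about keying on the "°" sign directly. As an alternative to split(), that idea might be sketched with the standard `re` module as below. The name `numbers_before_degree` and the sample string are illustrative only, and the pattern assumes whole-number temperatures.

```python
import re

# An optionally negative integer, optional whitespace, then a degree sign
DEGREE_NUMBER = re.compile(r"(-?\d+)\s*°")


def numbers_before_degree(text):
    """Return every integer that directly precedes a degree sign in text."""
    return [int(value) for value in DEGREE_NUMBER.findall(text)]


# Illustrative input, shaped like the text of the row in the question
row_text = "Temp. -6 °C -5 °C -1 °C 5 °C 10 °C"
print(numbers_before_degree(row_text))  # [-6, -5, -1, 5, 10]
```

With BeautifulSoup, the input could be the text of the matched row, for example `soup.find(title='Temp.').get_text(" ")`.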
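As an aside, when you are working with tables, pandas.read_html() is often an easy way to get the data. From there you can manipulate, filter, transform, and export it: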

import pandas as pd

df = pd.read_html('https://weatherspark.com/y/2349/Average-Weather-in-Calgary-Canada-Year-Round')[2]
df.replace('°F','',regex=True)

Example

from bs4 import BeautifulSoup

html = '''
<tr style="color: #333;" title="Temp.">
<td style="text-overflow: ellipsis; overflow: hidden; white-space: nowrap; max-width: 25vw;">
<span style="color: #333;">Temp.</span></td>

<td style="text-decoration: underline rgba(51,51,51,.35);">-6 °C</td><td>-5 °C</td><td>-1 °C</td>
<td>5 °C</td><td>10 °C</td><td>14 °C</td>
<td style="text-decoration: underline rgba(51,51,51,.35);">17 °C</td><td>16 °C</td><td>11 °C</td>
<td>6 °C</td><td>-2 °C</td>
<td style="text-decoration: underline rgba(51,51,51,.35);">-6 °C</td></tr>
'''
soup = BeautifulSoup(html, "html.parser")

[t.text.split()[0] for t in soup.select('[title="Temp."] td:not(:has(span))')]

Output

['-6', '-5', '-1', '5', '10', '14', '17', '16', '11', '6', '-2', '-6']
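
The result is a list of strings, so a conversion step is still needed to get numbers. A minimal sketch follows; it assumes the twelve cells run from January to December in order, which the snippet itself does not confirm.

```python
temps = ['-6', '-5', '-1', '5', '10', '14', '17', '16', '11', '6', '-2', '-6']

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# int() accepts the leading minus sign, so negative values convert cleanly
values = [int(t) for t in temps]

# Assumption: one value per month, in calendar order
by_month = dict(zip(MONTHS, values))

print(by_month["Jul"])                  # 17
print(max(by_month, key=by_month.get))  # Jul
```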

2023-03-20
