How do I extract table values and load them into a pandas DataFrame?

2ledvvac  posted on 2023-06-20  in  Other

I have the code below. I am trying to extract data from this website into pandas.

from pyquery import PyQuery as pq
import requests
import pandas as pd

url = "https://www.tsa.gov/travel/passenger-volumes"
content = requests.get(url).content
doc = pq(content)
Passengers = doc(".views-align-center").text()

Method 1:

df = pd.DataFrame([x.split(' ') for x in Passengers.split(' ')]) 
print(df)

Method 2:

Passengers = Passengers.replace(' ',';')
Passengers

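For Method 1, is it possible to use pandas DataFrame unstack to get the correct table structure?
Or is Method 2 better? How can I split the string at regular intervals and load it into pandas?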
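On splitting at regular intervals: here is a minimal sketch, assuming the scraped text is a flat run of whitespace-separated cells with exactly six cells per row. The `flat_text` string and the `columns` list are stand-ins made up for illustration (the values are two rows of the TSA table), not the real output of `.text()`.

```python
import pandas as pd

# Hypothetical flat text with one token per table cell (assumed shape,
# not the actual PyQuery output).
flat_text = (
    "6/1/2023 2463873 2228271 1815931 391882 2623947 "
    "5/31/2023 2255052 2023231 1587910 304436 2370152"
)

columns = ["Date", "2023", "2022", "2021", "2020", "2019"]  # assumed column order
tokens = flat_text.split()
width = len(columns)

# Slice the token list into consecutive chunks of `width` items, one per row.
rows = [tokens[i:i + width] for i in range(0, len(tokens), width)]

df = pd.DataFrame(rows, columns=columns)
# Every cell is still a string here; convert the count columns to integers.
df[columns[1:]] = df[columns[1:]].astype(int)
print(df)
```

This only holds up when every row yields the same number of tokens. A single empty cell shifts all the values after it, so parsing the HTML table structure with `pd.read_html` is likely the more robust route.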

ztmd8pv5  #1

You can do this directly in pandas:

import pandas as pd
import requests
from io import StringIO

url = "https://www.tsa.gov/travel/passenger-volumes"
html = requests.get(url).text
# Wrap the HTML in StringIO: recent pandas versions deprecate passing literal HTML
df_list = pd.read_html(StringIO(html))   # gives a list of the DFs extracted
print(df_list[0])

This gives the DataFrame:

          Date       2023     2022     2021    2020     2019
0     6/1/2023  2463873.0  2228271  1815931  391882  2623947
1    5/31/2023  2255052.0  2023231  1587910  304436  2370152
2    5/30/2023  2342489.0  2114935  1682752  267742  2247421
3    5/29/2023  2577437.0  2319237  1900170  353261  2499002
4    5/28/2023  2257766.0  2103022  1650454  352947  2555578
..         ...        ...      ...      ...     ...      ...
359   6/7/2022        NaN  2052377  1560561  338382  2433189
360   6/6/2022        NaN  2279743  1828396  430414  2644981
361   6/5/2022        NaN  2387196  1984658  441255  2669860
362   6/4/2022        NaN  1981408  1681192  353016  2225952
363   6/3/2022        NaN  2332592  1879885  419675  2649808

[364 rows x 6 columns]

The NaN values in the 2023 column force a float dtype, but you can clean up the data as needed. For example:

df = df_list[0]
df['2023'] = df['2023'].fillna(0).astype(int)
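
Filling with 0 makes the missing days look like days with zero passengers. If that distinction matters, one alternative is the nullable `Int64` dtype, which keeps the gaps as `<NA>`. The sketch below uses a small hand-built frame as a stand-in for `df_list[0]` (a few values copied from the output above), and also parses the `Date` strings, assuming they are always month/day/year.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for df_list[0]; values copied from the output above.
df = pd.DataFrame({
    "Date": ["6/1/2023", "5/31/2023", "6/7/2022"],
    "2023": [2463873.0, 2255052.0, np.nan],
    "2022": [2228271, 2023231, 2052377],
})

# Nullable integer dtype: whole-number floats become integers, NaN becomes <NA>.
df["2023"] = df["2023"].astype("Int64")

# Parse month/day/year strings into timestamps so the rows sort chronologically.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = df.sort_values("Date").reset_index(drop=True)
print(df)
```

With `Int64`, aggregations such as `sum()` or `mean()` skip the missing entries by default instead of counting them as zeros.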
