Pandas Dataframe :将read_html表的一部分切片到 Dataframe 中

1aaf6o9v  于 2023-02-02  发布在  其他
关注(0)|答案(1)|浏览(134)

我想导入一部分的html表到一个数据框。
下表为:在下面,我只想导入"Total Electric Industry"

我正在运行下面的代码在谷歌colab:

# Total residential customers
us_res_html = 'https://www.eia.gov/electricity/annual/html/epa_02_01.html'
us_res = pd.read_html(us_res_html)
us_res = us_res[1]
print(type(us_res)) 
print(us_res.columns)
print(us_res.dtypes)
# Consider only the total electric industry table
# The table starts with "Total Electric Industry"
# Next table starts with "Full-Service Providers"
# These texts were repeated in all the columns; Hence, search in the first column
start_idx = us_res[us_res[us_res.columns[0]]=="Total Electric Industry"].index 
end_idx = us_res[us_res[us_res.columns[0]]=="Full-Service Providers"].index

当前输出:

Int64Index([], dtype='int64')
z0qdvdin

z0qdvdin1#

首先删除第一列中的不间断空格\xa0

print (us_res.iloc[:, 0].head().tolist())
['Total\xa0Electric\xa0Industry', '2011', '2012', '2013', '2014']

us_res.iloc[:, 0] = us_res.iloc[:, 0].replace(r'\xa0',' ', regex=True)

然后,可以在第一次匹配之后比较所有值的第一列DataFrame.ilocSeries.cummax,对于反转掩码,使用~并在boolean indexing中过滤,最后通过转换为整数,按iloc的位置移除第一行:

df = (us_res[us_res.iloc[:, 0].eq("Total Electric Industry").cummax() 
             & ~us_res.iloc[:, 0].eq("Full-Service Providers").cummax()]
          .iloc[1:].astype(int))
print (df)
    Year  Residential  Commercial  Industrial  Transportation      Total
1   2011    126143072    17638062      727920              92  144509146
2   2012    126832343    17729029      732385              83  145293840
3   2013    127777153    17679562      831790              75  146288580
4   2014    128680416    17853995      839212              79  147373702
5   2015    129811718    17985690      835536              78  148633022
6   2016    131068760    18148353      838059              86  150055258
7   2017    132579747    18359427      840329              86  151779589
8   2018    133893321    18605393      840321              83  153339118
9   2019    135249616    18694240      954222              83  154898161
10  2020    136682001    18848813      992311              83  156523208
11  2021    138308772    19102304     1022212              82  158433370

print (df.dtypes)
Year              int32
Residential       int32
Commercial        int32
Industrial        int32
Transportation    int32
Total             int32
dtype: object

相关问题