Reading HTML with pandas

pftdvrlh  posted on 2023-04-28  in  Other

This should be easy, but I'm getting errors I can't work out. I have some UK air pollution data that I want to analyse.
https://uk-air.defra.gov.uk/data/DAQI-regional-data?regionIds%5B%5D=999&aggRegionId%5B%5D=999&datePreset=6&startDay=01&startMonth=01&startYear=2022&endDay=01&endMonth=01&endYear=2023&queryId=&action=step2&go=Next+
But my attempt to read it fails with this error:

ParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 2

df = pd.read_html("https://uk-air.defra.gov.uk/data/DAQI-regional-data?regionIds%5B%5D=999&aggRegionId%5B%5D=999&datePreset=6&startDay=01&startMonth=01&startYear=2022&endDay=01&endMonth=01&endYear=2023&queryId=&action=step2&go=Next+")
df

read_html returns the data as a list, but I want to turn that list into a DataFrame.
What is the best way to do this?

igetnqfo #1

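read_html always returns a list of DataFrames, even when the page contains only one table. You need to index into it.
pandas.read_html

  • Read HTML tables into a list of DataFrame objects.*
    Returns: dfs, a list of DataFrames.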
df = pd.read_html("https://uk-air.defra.gov.uk/...")[0] # <-- add [0] at the end

Output:

print(df)
           Date  ...  West Yorkshire Urban Area
0    01/01/2022  ...                          2
1    02/01/2022  ...                          3
2    03/01/2022  ...                          3
..          ...  ...                        ...
362  29/12/2022  ...                          3
363  30/12/2022  ...                          3
364  31/12/2022  ...                          3

[365 rows x 33 columns]
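The same behaviour can be checked without hitting the network. The sketch below is only an illustration: the inline HTML string and its two columns are made-up stand-ins for the real DEFRA page, and it assumes an HTML parser that read_html can use (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page: a single small HTML table.
html = """
<table>
  <tr><th>Date</th><th>Greater London</th></tr>
  <tr><td>01/01/2022</td><td>2</td></tr>
  <tr><td>02/01/2022</td><td>3</td></tr>
</table>
"""

# read_html returns a list of DataFrames, even though there is only one table.
tables = pd.read_html(StringIO(html))

# Indexing the list gives the DataFrame itself.
df = tables[0]

print(type(tables).__name__, len(tables))
print(df)
```

Wrapping the string in StringIO avoids the deprecation warning that recent pandas versions emit when literal HTML is passed directly.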
nmpmafwu #2

pandas' read_html already handles this case:

import pandas as pd

# Specify the URL of the HTML page containing the table
url = "..."

# Use the pandas read_html() method to read the table data into a list of dataframes
tables = pd.read_html(url)

# If there are multiple tables on the page, you can select the one you want by index
table = tables[0]
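When a page holds several tables, picking by position can be fragile. As a rough sketch, read_html also accepts a match argument that keeps only tables containing the given text. The two-table HTML below is invented for illustration, and the example assumes a parser such as lxml is available.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second one mentions "Date".
html = """
<table>
  <tr><th>Region</th><th>Code</th></tr>
  <tr><td>Greater London</td><td>A</td></tr>
</table>
<table>
  <tr><th>Date</th><th>Value</th></tr>
  <tr><td>01/01/2022</td><td>2</td></tr>
</table>
"""

# Without a filter, every table on the page is returned.
all_tables = pd.read_html(StringIO(html))

# With match, only tables whose text matches the string or regex are kept.
date_tables = pd.read_html(StringIO(html), match="Date")
table = date_tables[0]

print(len(all_tables), len(date_tables))
print(table)
```

The result is still a list, so the final indexing step is needed either way.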
h6my8fg2 #3

My code

import pandas as pd
url = "https://uk-air.defra.gov.uk/data/DAQI-regional-data?regionIds%5B%5D=999&aggRegionId%5B%5D=999&datePreset=6&startDay=01&startMonth=01&startYear=2022&endDay=01&endMonth=01&endYear=2023&queryId=&action=step2&go=Next+"
dfs = pd.read_html(url)
type(dfs)  # Output: list
len(dfs)  # Output: 1
df = dfs[0]  # index the list to get the DataFrame itself
type(df)  # Output: pandas.core.frame.DataFrame

df.columns
""" Output:
Index(['Date', 'Central Scotland', 'East Midlands', 'Eastern',
   'Greater London', 'Highland', 'North East', 'North East Scotland',
   'North Wales', 'North West & Merseyside', 'Northern Ireland',
   'Scottish Borders', 'South East', 'South Wales', 'South West',
   'West Midlands', 'Yorkshire & Humberside',
   'Belfast Metropolitan Urban Area', 'Brighton/Worthing/Littlehampton',
   'Bristol Urban Area', 'Cardiff Urban Area', 'Edinburgh Urban Area',
   'Glasgow Urban Area', 'Greater Manchester Urban Area',
   'Leicester Urban Area', 'Liverpool Urban Area', 'Nottingham Urban Area',
   'Portsmouth Urban Area', 'Sheffield Urban Area', 'Swansea Urban Area',
   'Tyneside', 'West Midlands Urban Area', 'West Yorkshire Urban Area'],
  dtype='object')
"""
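For completeness, here is a small sketch of the two usual ways to go from a list of DataFrames to a single one. The list below is built by hand with made-up values in place of a real read_html result: indexing suits the one-table case in this question, while pd.concat is an option when several tables share the same columns.

```python
import pandas as pd

# Hypothetical stand-in for the list that read_html returns.
dfs = [
    pd.DataFrame({"Date": ["01/01/2022"], "Greater London": [2]}),
    pd.DataFrame({"Date": ["02/01/2022"], "Greater London": [3]}),
]

# One table of interest: take it out of the list by position.
single = dfs[0]

# Several tables with identical columns: stack them into one DataFrame.
combined = pd.concat(dfs, ignore_index=True)

print(single.shape)
print(combined)
```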
