I am trying to build a panel dataset of alcohol consumption over time. This is how I am doing it in Python:
import pandas as pd
from datetime import datetime
link = 'https://www.beerinstitute.org/wp-content/uploads/2021/12/2021-September-The-Brewers-Almanac-Beer-Institute-2021.xlsx'
df = pd.read_excel(link, sheet_name='Beer Shipments by State', skiprows=lambda x: x < 2 or x > 54)
df = df.tail(-1)
del df[df.columns[1]]
df.drop(columns=df.columns[1:29], axis=1, inplace=True)
df.rename(columns={'Unnamed: 0': 'State'}, inplace=True)
df = pd.melt(df, id_vars='State')
df.rename(columns={'value': 'shipments'}, inplace=True)
df
It returns this DataFrame:
               State variable      shipments
0            Alabama   1994.1         234030
1             Alaska   1994.1          26799
2            Arizona   1994.1         295000
3           Arkansas   1994.1         127200
4         California   1994.1        1640146
...              ...      ...            ...
16825       Virginia   2021.6         439000
16826     Washington   2021.6         473000
16827  West Virginia   2021.6  119300.985455
16828      Wisconsin   2021.6      469606.08
16829        Wyoming   2021.6          47000
[16830 rows x 3 columns]
But when I rename the date variable and sort by it like this:
import pandas as pd
from datetime import datetime
link = 'https://www.beerinstitute.org/wp-content/uploads/2021/12/2021-September-The-Brewers-Almanac-Beer-Institute-2021.xlsx'
df = pd.read_excel(link, sheet_name='Beer Shipments by State', skiprows=lambda x: x < 2 or x > 54)
df = df.tail(-1)
del df[df.columns[1]]
df.drop(columns=df.columns[1:29], axis=1, inplace=True)
df.rename(columns={'Unnamed: 0': 'State'}, inplace=True)
df = pd.melt(df, id_vars='State')
df.rename(columns={'variable': 'date', 'value': 'shipments'}, inplace=True)
df.sort_values(by=['State', 'date'])
I get:
         State    date shipments
561    Alabama    1995    212299
1173   Alabama    1996    224432
1785   Alabama    1997    207508
2397   Alabama    1998    232107
3009   Alabama    1999    239510
...        ...     ...       ...
16625  Wyoming  2021.2     30000
16676  Wyoming  2021.3     33000
16727  Wyoming  2021.4     37000
16778  Wyoming  2021.5     48000
16829  Wyoming  2021.6     47000
[16830 rows x 3 columns]
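Why are the Alabama values rounded to 1995, while the Wyoming values stay unchanged (2021.6 for June 2021)? I want the final dataset to have the state name, a year/month variable (1994.1 ... 2021.6), and the estimates as they are. What exactly am I doing wrong?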
1 Answer
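No, nothing is being rounded; it comes from the way you process the spreadsheet. When two column names are duplicated (in your example, each year is repeated once per month, and you ignored that), pandas adds a suffix (an incrementing number). The ".1" in a label such as 1994.1 is therefore a duplicate counter, not a month.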
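The renaming is easy to reproduce on a tiny example. The sketch below is only an illustration: it uses `read_csv` on a made-up in-memory table rather than the real workbook, so the exact labels produced by `read_excel` may differ slightly.

```python
import io
import pandas as pd

# Hypothetical miniature of the sheet: the year is repeated once per month.
csv_text = "State,1994,1994,1994,1995,1995\nAlabama,1,2,3,4,5\n"
df = pd.read_csv(io.StringIO(csv_text))

# pandas keeps the first '1994' and renames the repeats with .1, .2, ...
print(list(df.columns))
# ['State', '1994', '1994.1', '1994.2', '1995', '1995.1']
```

Note that the first column for each year keeps its plain label, which is why a bare year such as 1995 shows up next to labels like 2021.6.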
You can try this:
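One possible approach is to read the header rows as ordinary data and build the dates yourself, so pandas never gets the chance to rename duplicates. This is a minimal sketch on made-up data, not the code from the original answer: it assumes one header row with the year repeated per month and a second row with the month number, and the names `raw`, `wide`, `dates` and `panel` are hypothetical. The real sheet may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical layout: row 0 repeats the year, row 1 holds the month number.
raw = io.StringIO(
    "State,1994,1994,1994,1995,1995,1995\n"
    ",1,2,3,1,2,3\n"
    "Alabama,10,11,12,13,14,15\n"
    "Wyoming,1,2,3,4,5,6\n"
)
# header=None keeps both header rows as data, so no column is renamed.
wide = pd.read_csv(raw, header=None, dtype=str)

# Combine the two header rows into one proper date per column.
years = wide.iloc[0, 1:].astype(int)
months = wide.iloc[1, 1:].astype(int)
dates = pd.to_datetime(pd.DataFrame({"year": years, "month": months, "day": 1}))

# Drop the header rows and label the columns with the dates.
data = wide.iloc[2:].copy()
data.columns = ["State"] + list(dates)

# Reshape to long format: one row per state and month.
panel = data.melt(id_vars="State", var_name="date", value_name="shipments")
panel["date"] = pd.to_datetime(panel["date"])
panel["shipments"] = pd.to_numeric(panel["shipments"])
panel = panel.sort_values(["State", "date"]).reset_index(drop=True)
print(panel.head())
```

With real datetime values in `date`, sorting by `['State', 'date']` is chronological within each state. For the actual workbook, the same idea would presumably start from `pd.read_excel(..., header=None)`, with the row positions adjusted to wherever the year and month rows really sit.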
Output: