pandas changes dates when sorting

Posted by x8diyxa7 on 2023-05-05 in Other

I'm trying to build a panel dataset of alcohol consumption over time. This is how we do it in Python:

import pandas as pd
from datetime import datetime

link = 'https://www.beerinstitute.org/wp-content/uploads/2021/12/2021-September-The-Brewers-Almanac-Beer-Institute-2021.xlsx'

df = pd.read_excel(link, sheet_name='Beer Shipments by State', skiprows=lambda x: x < 2 or x > 54)

df = df.tail(-1)

del df[df.columns[1]]

df.drop(columns=df.columns[1:29], inplace=True)

df.rename(columns={'Unnamed: 0': 'State'}, inplace=True)

df = pd.melt(df, id_vars='State')

df.rename(columns={'value': 'shipments'}, inplace=True)

df

It returns this DataFrame:

               State variable      shipments
0            Alabama   1994.1         234030
1             Alaska   1994.1          26799
2            Arizona   1994.1         295000
3           Arkansas   1994.1         127200
4         California   1994.1        1640146
...              ...      ...            ...
16825       Virginia   2021.6         439000
16826     Washington   2021.6         473000
16827  West Virginia   2021.6  119300.985455
16828      Wisconsin   2021.6      469606.08
16829        Wyoming   2021.6          47000

[16830 rows x 3 columns]

But when I sort by the date variable like this:

import pandas as pd
from datetime import datetime

link = 'https://www.beerinstitute.org/wp-content/uploads/2021/12/2021-September-The-Brewers-Almanac-Beer-Institute-2021.xlsx'

df = pd.read_excel(link, sheet_name='Beer Shipments by State', skiprows=lambda x: x < 2 or x > 54)

df = df.tail(-1)

del df[df.columns[1]]

df.drop(columns=df.columns[1:29], inplace=True)

df.rename(columns={'Unnamed: 0': 'State'}, inplace=True)

df = pd.melt(df, id_vars='State')

df.rename(columns={'variable': 'date', 'value': 'shipments'}, inplace=True)

df.sort_values(by=['State', 'date'])

we get:

         State    date shipments
561    Alabama    1995    212299
1173   Alabama    1996    224432
1785   Alabama    1997    207508
2397   Alabama    1998    232107
3009   Alabama    1999    239510
...        ...     ...       ...
16625  Wyoming  2021.2     30000
16676  Wyoming  2021.3     33000
16727  Wyoming  2021.4     37000
16778  Wyoming  2021.5     48000
16829  Wyoming  2021.6     47000

[16830 rows x 3 columns]

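Why are the Alabama values rounded to 1995 while the Wyoming values stay unchanged (2021.6 for June 2021)? I want the final dataset to have the state name, a year/month variable (1994.1 ... 2021.6), and the estimates as they are. What exactly am I doing wrong?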

mbjcgjjk #1

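No, nothing is being rounded. It comes from the way you read the spreadsheet. The sheet has two header rows, one for the year and one for the month, and you ignored the month row, so the same year appears as a column name once per month. When column names are duplicated, pandas appends a suffix (an incrementing number) to make them unique. So 1995 is simply the first 1995 column, '1995.1' is the second, and so on; the suffix is a duplicate counter, not a month number: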

print(df["date"].unique())

array(['1994.1', '1994.2', '1994.3', '1994.4', '1994.5', '1994.6',
       '1994.7', '1994.8', '1994.9', '1994.10', '1994.11', 1995, '1995.1',
       '1995.2', '1995.3', '1995.4', '1995.5', '1995.6', '1995.7',
       '1995.8', '1995.9', '1995.10', '1995.11', 1996, '1996.1', '1996.2',
       '1996.3', '1996.4', '1996.5', '1996.6', '1996.7', '1996.8',
       '1996.9', '1996.10', '1996.11', 1997, '1997.1', '1997.2', '1997.3',
       ...
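
The renaming can be reproduced without the spreadsheet. The snippet below is only a minimal sketch: it uses `pd.read_csv` on a made-up in-memory table (the `csv_text` string and the `toy` name are illustrative assumptions, not part of the original data), and it relies on the fact that pandas de-duplicates repeated header names by appending `.1`, `.2`, and so on. Note that with CSV input every header is a string, whereas the Excel reader above leaves the first occurrence as an integer.

```python
import io

import pandas as pd

# Made-up table whose header repeats the year once per "month" column.
csv_text = "State,1994,1994,1994,1995,1995\nAlabama,1,2,3,4,5\n"

toy = pd.read_csv(io.StringIO(csv_text))

# Repeated names get a numeric suffix: the first stays as-is, later ones get .1, .2, ...
print(list(toy.columns))
```

The suffix only counts repetitions of a name, which is why the first column of each year appears to have "lost" its decimal part.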

You can try this:

df = (pd.read_excel(link, sheet_name="Beer Shipments by State",
                    skiprows= lambda x: x<2 or x>54, header=[0, 1], index_col=[0, 1])
          .iloc[:, 27:].rename_axis(index=["state", "census region"], columns=[None, None])
          .melt(ignore_index=False, var_name=["year", "month"], value_name="shipments")
          .reset_index().astype({"year": int}).sort_values(by=["state", "year"], ignore_index=True)
     )

Output:

print(df)

         state       census region  year month  shipments
0      Alabama  East South Central  1994   Jan  210152.00
1      Alabama  East South Central  1994   Feb  234030.00
2      Alabama  East South Central  1994   Mar  253624.00
...        ...                 ...   ...   ...        ...
16878  Wyoming            Mountain  2021   May   37000.00
16879  Wyoming            Mountain  2021   Jun   48000.00
16880  Wyoming            Mountain  2021   Jul   47000.00

[16881 rows x 5 columns]
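
If a single year/month variable that sorts chronologically is wanted, one possible follow-up is to combine the `year` and `month` columns into a real date. The sketch below runs on a tiny made-up table rather than the spreadsheet: the names `wide` and `long_df` and all values are hypothetical, and parsing `"%b"` assumes English month abbreviations such as `Jan` and `Feb`.

```python
import pandas as pd

# Hypothetical toy table with a two-level (year, month) column header; values are made up.
cols = pd.MultiIndex.from_tuples([(1994, "Jan"), (1994, "Feb"), (1995, "Jan")])
wide = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=pd.Index(["Alabama", "Alaska"], name="state"),
    columns=cols,
)

# Melt both header levels into their own columns, keeping the state index.
long_df = wide.melt(
    ignore_index=False, var_name=["year", "month"], value_name="shipments"
).reset_index()

# Combine year and month into a datetime (assumes English month abbreviations).
long_df["date"] = pd.to_datetime(
    long_df["year"].astype(str) + "-" + long_df["month"], format="%Y-%b"
)

# Sorting on a real date orders months chronologically within each state.
long_df = long_df.sort_values(by=["state", "date"], ignore_index=True)
print(long_df)
```

Sorting on a datetime column avoids both the mixed integer/string labels from the question and the alphabetical ordering that month names would get if sorted as text.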
