按如下方式给予框架df
:
date indicator_name region_code date_freq data_type value
0 2023/3/31 GDP BJ Q actual_value 8.200000
1 2023/6/30 GDP BJ Q actual_value 8.400000
2 2023/3/31 GDP BJ Q arima_mape 0.331838
3 2023/6/30 GDP BJ Q arima_mape 0.269978
4 2023/9/30 GDP BJ Q arima_mape 0.208118
5 2023/12/31 GDP BJ Q arima_mape 0.146257
6 2023/3/31 GDP BJ Q es_mape 0.084397
7 2023/6/30 GDP BJ Q es_mape 0.022536
8 2023/9/30 GDP BJ Q es_mape -0.039324
9 2023/12/31 GDP BJ Q es_mape -0.101184
10 2023/1/31 CPI SH M actual_value 1.900000
11 2023/2/28 CPI SH M actual_value 1.700000
12 2023/3/31 CPI SH M actual_value 1.400000
13 2023/4/30 CPI SH M actual_value 1.100000
14 2023/5/31 CPI SH M actual_value 0.800000
15 2023/6/30 CPI SH M actual_value 0.600000
16 2023/7/31 CPI SH M actual_value 0.400000
17 2023/1/31 CPI SH M arima_mape 1.964911
18 2023/2/28 CPI SH M arima_mape 1.934797
19 2023/3/31 CPI SH M arima_mape 1.569138
20 2023/4/30 CPI SH M arima_mape 1.284494
21 2023/5/31 CPI SH M arima_mape 0.940008
22 2023/6/30 CPI SH M arima_mape 0.698891
23 2023/7/31 CPI SH M arima_mape 0.557101
24 2023/8/31 CPI SH M arima_mape 0.331838
25 2023/9/30 CPI SH M arima_mape 0.269978
26 2023/10/31 CPI SH M arima_mape 0.232883
27 2023/11/30 CPI SH M arima_mape 0.212634
28 2023/12/31 CPI SH M arima_mape 0.219534
29 2023/1/31 CPI SH M es_mape 1.425995
30 2023/2/28 CPI SH M es_mape 1.897687
31 2023/3/31 CPI SH M es_mape 1.697817
32 2023/4/30 CPI SH M es_mape 1.410244
33 2023/5/31 CPI SH M es_mape 1.085284
34 2023/6/30 CPI SH M es_mape 0.810315
35 2023/7/31 CPI SH M es_mape 0.585306
36 2023/8/31 CPI SH M es_mape 0.440354
37 2023/9/30 CPI SH M es_mape 0.452097
38 2023/10/31 CPI SH M es_mape 0.492415
39 2023/11/30 CPI SH M es_mape 0.504193
40 2023/12/31 CPI SH M es_mape 0.487353
假设df
包含两组数据:BJ
的季度GDP
的actual_value
、arima_mape
和es_mape
,SH
的月度CPI's
、actual_value
、arima_mape
和es_mape
。对于每组indicator_name
、region_code
、date_freq
,我需要用实际值的最新日期作为当前时间节点进行子集,如果该组是季度数据(date_freq=='Q'
),则保留最近两个季度的历史值,最近两个季度和下一个季度的arima_mape
和es_mape
。如果是月度数据(date_freq=='M'
),则需要子集arima_mape
和es_mape
最近4个月(2023/4/30 ~ 2023/7/31)、最近4个月(2023/4/30~2023/7/31)和次月(2023/8/31)的 * 历史数据 *。我如何修改下面的函数以获得如下所示的预期输出?谢谢.
date indicator_name region_code date_freq data_type value
0 2023/3/31 GDP BJ Q actual_value 8.200000
1 2023/6/30 GDP BJ Q actual_value 8.400000
2 2023/3/31 GDP BJ Q arima_mape 0.331838
3 2023/6/30 GDP BJ Q arima_mape 0.269978
4 2023/9/30 GDP BJ Q arima_mape 0.208118
5 2023/3/31 GDP BJ Q es_mape 0.084397
6 2023/6/30 GDP BJ Q es_mape 0.022536
7 2023/9/30 GDP BJ Q es_mape -0.039324
8 2023/4/30 CPI SH M actual_value 1.100000
9 2023/5/31 CPI SH M actual_value 0.800000
10 2023/6/30 CPI SH M actual_value 0.600000
11 2023/7/31 CPI SH M actual_value 0.400000
12 2023/4/30 CPI SH M arima_mape 1.284494
13 2023/5/31 CPI SH M arima_mape 0.940008
14 2023/6/30 CPI SH M arima_mape 0.698891
15 2023/7/31 CPI SH M arima_mape 0.557101
16 2023/8/31 CPI SH M arima_mape 0.331838
17 2023/4/30 CPI SH M es_mape 1.410244
18 2023/5/31 CPI SH M es_mape 1.085284
19 2023/6/30 CPI SH M es_mape 0.810315
20 2023/7/31 CPI SH M es_mape 0.585306
21 2023/8/31 CPI SH M es_mape 0.440354
到目前为止,我的试用代码生成错误:
import pandas as pd
def get_rows(group):
group['date'] = pd.to_datetime(group['date'])
group = group.set_index('date').sort_index()
if group['date_freq'].iloc[0] == 'Q':
actual_value_max_date = group[group['data_type'] == 'actual_value'].index.max()
actual_value_dates = pd.date_range(end=actual_value_max_date, periods=2, freq='Q')
mape_dates = pd.date_range(start=actual_value_max_date - pd.offsets.QuarterEnd(1), periods=3, freq='Q')
else: # 'M'
actual_value_max_date = group[group['data_type'] == 'actual_value'].index.max()
actual_value_dates = pd.date_range(end=actual_value_max_date, periods=4, freq='M')
mape_dates = pd.date_range(start=actual_value_max_date - pd.offsets.MonthEnd(3), periods=5, freq='M')
actual_value = group.loc[group.index.intersection(actual_value_dates) & (group['data_type'] == 'actual_value')]
arima_mape = group.loc[group.index.intersection(mape_dates) & (group['data_type'] == 'arima_mape')]
es_mape = group.loc[group.index.intersection(mape_dates) & (group['data_type'] == 'es_mape')]
return pd.concat([actual_value, arima_mape, es_mape])
df.groupby(['indicator_name', 'region_code', 'date_freq']).apply(get_rows).reset_index()
输出:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3553, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-17-6689e28504c3>", line 20, in <cell line: 20>
df.groupby(['indicator_name', 'region_code', 'date_freq']).apply(get_rows).reset_index()
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/frame.py", line 6361, in reset_index
new_obj.insert(
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/frame.py", line 4817, in insert
raise ValueError(f"cannot insert {column}, already exists")
ValueError: cannot insert date_freq, already exists
1条答案
按热度按时间oknrviil1#
出现错误的原因是,在
groupby
和apply
操作之后,分组的列(indicator_name
,region_code
,date_freq
)成为生成的嵌套框的索引。当您调用reset_index()
时,pandas尝试将这些列作为列添加回数据框,但它们已经存在,从而导致错误。这是更正后的CO这导致