numpy groupby多列,基于多个条件子集行

huus2vyu  于 2023-10-19  发布在  其他
关注(0)|答案(1)|浏览(121)

按如下方式给予框架df

date indicator_name region_code date_freq     data_type     value
0    2023/3/31            GDP          BJ         Q  actual_value  8.200000
1    2023/6/30            GDP          BJ         Q  actual_value  8.400000
2    2023/3/31            GDP          BJ         Q    arima_mape  0.331838
3    2023/6/30            GDP          BJ         Q    arima_mape  0.269978
4    2023/9/30            GDP          BJ         Q    arima_mape  0.208118
5   2023/12/31            GDP          BJ         Q    arima_mape  0.146257
6    2023/3/31            GDP          BJ         Q       es_mape  0.084397
7    2023/6/30            GDP          BJ         Q       es_mape  0.022536
8    2023/9/30            GDP          BJ         Q       es_mape -0.039324
9   2023/12/31            GDP          BJ         Q       es_mape -0.101184
10   2023/1/31            CPI          SH         M  actual_value  1.900000
11   2023/2/28            CPI          SH         M  actual_value  1.700000
12   2023/3/31            CPI          SH         M  actual_value  1.400000
13   2023/4/30            CPI          SH         M  actual_value  1.100000
14   2023/5/31            CPI          SH         M  actual_value  0.800000
15   2023/6/30            CPI          SH         M  actual_value  0.600000
16   2023/7/31            CPI          SH         M  actual_value  0.400000
17   2023/1/31            CPI          SH         M    arima_mape  1.964911
18   2023/2/28            CPI          SH         M    arima_mape  1.934797
19   2023/3/31            CPI          SH         M    arima_mape  1.569138
20   2023/4/30            CPI          SH         M    arima_mape  1.284494
21   2023/5/31            CPI          SH         M    arima_mape  0.940008
22   2023/6/30            CPI          SH         M    arima_mape  0.698891
23   2023/7/31            CPI          SH         M    arima_mape  0.557101
24   2023/8/31            CPI          SH         M    arima_mape  0.331838
25   2023/9/30            CPI          SH         M    arima_mape  0.269978
26  2023/10/31            CPI          SH         M    arima_mape  0.232883
27  2023/11/30            CPI          SH         M    arima_mape  0.212634
28  2023/12/31            CPI          SH         M    arima_mape  0.219534
29   2023/1/31            CPI          SH         M       es_mape  1.425995
30   2023/2/28            CPI          SH         M       es_mape  1.897687
31   2023/3/31            CPI          SH         M       es_mape  1.697817
32   2023/4/30            CPI          SH         M       es_mape  1.410244
33   2023/5/31            CPI          SH         M       es_mape  1.085284
34   2023/6/30            CPI          SH         M       es_mape  0.810315
35   2023/7/31            CPI          SH         M       es_mape  0.585306
36   2023/8/31            CPI          SH         M       es_mape  0.440354
37   2023/9/30            CPI          SH         M       es_mape  0.452097
38  2023/10/31            CPI          SH         M       es_mape  0.492415
39  2023/11/30            CPI          SH         M       es_mape  0.504193
40  2023/12/31            CPI          SH         M       es_mape  0.487353

假设df包含两组数据:BJ的季度GDPactual_valuearima_mapees_mapeSH的月度CPI'sactual_valuearima_mapees_mape。对于每组indicator_nameregion_codedate_freq,我需要用实际值的最新日期作为当前时间节点进行子集,如果该组是季度数据(date_freq=='Q'),则保留最近两个季度的历史值,最近两个季度和下一个季度的arima_mapees_mape。如果是月度数据(date_freq=='M'),则需要子集arima_mapees_mape最近4个月(2023/4/30 ~ 2023/7/31)、最近4个月(2023/4/30~2023/7/31)和次月(2023/8/31)的 * 历史数据 *。我如何修改下面的函数以获得如下所示的预期输出?谢谢.

date indicator_name region_code date_freq     data_type     value
0   2023/3/31            GDP          BJ         Q  actual_value  8.200000
1   2023/6/30            GDP          BJ         Q  actual_value  8.400000
2   2023/3/31            GDP          BJ         Q    arima_mape  0.331838
3   2023/6/30            GDP          BJ         Q    arima_mape  0.269978
4   2023/9/30            GDP          BJ         Q    arima_mape  0.208118
5   2023/3/31            GDP          BJ         Q       es_mape  0.084397
6   2023/6/30            GDP          BJ         Q       es_mape  0.022536
7   2023/9/30            GDP          BJ         Q       es_mape -0.039324
8   2023/4/30            CPI          SH         M  actual_value  1.100000
9   2023/5/31            CPI          SH         M  actual_value  0.800000
10  2023/6/30            CPI          SH         M  actual_value  0.600000
11  2023/7/31            CPI          SH         M  actual_value  0.400000
12  2023/4/30            CPI          SH         M    arima_mape  1.284494
13  2023/5/31            CPI          SH         M    arima_mape  0.940008
14  2023/6/30            CPI          SH         M    arima_mape  0.698891
15  2023/7/31            CPI          SH         M    arima_mape  0.557101
16  2023/8/31            CPI          SH         M    arima_mape  0.331838
17  2023/4/30            CPI          SH         M       es_mape  1.410244
18  2023/5/31            CPI          SH         M       es_mape  1.085284
19  2023/6/30            CPI          SH         M       es_mape  0.810315
20  2023/7/31            CPI          SH         M       es_mape  0.585306
21  2023/8/31            CPI          SH         M       es_mape  0.440354

到目前为止,我的试用代码生成错误:

import pandas as pd

def get_rows(group):
    group['date'] = pd.to_datetime(group['date'])
    group = group.set_index('date').sort_index()

    if group['date_freq'].iloc[0] == 'Q':
        actual_value_max_date = group[group['data_type'] == 'actual_value'].index.max()
        actual_value_dates = pd.date_range(end=actual_value_max_date, periods=2, freq='Q')
        mape_dates = pd.date_range(start=actual_value_max_date - pd.offsets.QuarterEnd(1), periods=3, freq='Q')
    else:  # 'M'
        actual_value_max_date = group[group['data_type'] == 'actual_value'].index.max()
        actual_value_dates = pd.date_range(end=actual_value_max_date, periods=4, freq='M')
        mape_dates = pd.date_range(start=actual_value_max_date - pd.offsets.MonthEnd(3), periods=5, freq='M')

    actual_value = group.loc[group.index.intersection(actual_value_dates) & (group['data_type'] == 'actual_value')]
    arima_mape = group.loc[group.index.intersection(mape_dates) & (group['data_type'] == 'arima_mape')]
    es_mape = group.loc[group.index.intersection(mape_dates) & (group['data_type'] == 'es_mape')]

    return pd.concat([actual_value, arima_mape, es_mape])

df.groupby(['indicator_name', 'region_code', 'date_freq']).apply(get_rows).reset_index()

输出:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-17-6689e28504c3>", line 20, in <cell line: 20>
    df.groupby(['indicator_name', 'region_code', 'date_freq']).apply(get_rows).reset_index()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/frame.py", line 6361, in reset_index
    new_obj.insert(
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/frame.py", line 4817, in insert
    raise ValueError(f"cannot insert {column}, already exists")
ValueError: cannot insert date_freq, already exists
oknrviil

oknrviil1#

出现错误的原因是,在groupbyapply操作之后,分组的列(indicator_nameregion_codedate_freq)成为生成的嵌套框的索引。当您调用reset_index()时,pandas尝试将这些列作为列添加回数据框,但它们已经存在,从而导致错误。这是更正后的CO

import pandas as pd

# Sample data
data = {
    'date': ['2023/3/31', '2023/6/30', '2023/3/31', '2023/6/30', '2023/9/30', '2023/12/31', '2023/3/31', '2023/6/30', '2023/9/30', '2023/12/31', '2023/1/31', '2023/2/28', '2023/3/31', '2023/4/30', '2023/5/31', '2023/6/30', '2023/7/31', '2023/1/31', '2023/2/28', '2023/3/31', '2023/4/30', '2023/5/31', '2023/6/30', '2023/7/31', '2023/8/31', '2023/9/30', '2023/10/31', '2023/11/30', '2023/12/31', '2023/1/31', '2023/2/28', '2023/3/31', '2023/4/30', '2023/5/31', '2023/6/30', '2023/7/31', '2023/8/31', '2023/9/30', '2023/10/31', '2023/11/30', '2023/12/31'],
    'indicator_name': ['GDP', 'GDP', 'GDP', 'GDP', 'GDP', 'GDP', 'GDP', 'GDP', 'GDP', 'GDP', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI', 'CPI'],
    'region_code': ['BJ', 'BJ', 'BJ', 'BJ', 'BJ', 'BJ', 'BJ', 'BJ', 'BJ', 'BJ', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH', 'SH'],
    'date_freq': ['Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'Q', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M'],
    'data_type': ['actual_value', 'actual_value', 'arima_mape', 'arima_mape', 'arima_mape', 'arima_mape', 'es_mape', 'es_mape', 'es_mape', 'es_mape', 'actual_value', 'actual_value', 'actual_value', 'actual_value', 'actual_value', 'actual_value', 'actual_value', 'arima_mape', 'arima_mape', 'arima_mape', 'arima_mape', 'arima_mape', 'arima_mape', 'arima_mape', 'arima_mape', 'arima_mape', 'arima_mape', 'arima_mape', 'arima_mape', 'es_mape', 'es_mape', 'es_mape', 'es_mape', 'es_mape', 'es_mape', 'es_mape', 'es_mape', 'es_mape', 'es_mape', 'es_mape', 'es_mape'],
    'value': [8.2, 8.4, 0.331838, 0.269978, 0.208118, 0.146257, 0.084397, 0.022536, -0.039324, -0.101184, 1.9, 1.7, 1.4, 1.1, 0.8, 0.6, 0.4, 1.964911, 1.934797, 1.569138, 1.284494, 0.940008, 0.698891, 0.557101, 0.331838, 0.269978, 0.232883, 0.212634, 0.219534, 1.425995, 1.897687, 1.697817, 1.410244, 1.085284, 0.810315, 0.585306, 0.440354, 0.452097, 0.492415, 0.504193, 0.487353]
}

df = pd.DataFrame(data)

# Convert date column to datetime format directly
df['date'] = pd.to_datetime(df['date'])

def get_rows(group):
    # Set the date column as the index and sort by it
    group = group.set_index('date').sort_index()
    
    # Identify the latest date for the actual values within the group
    max_date_actual = group.loc[group['data_type'] == 'actual_value'].index.max()
    
    # Depending on the frequency ('Q' for quarterly, 'M' for monthly), determine the desired date ranges
    if group['date_freq'].iloc[0] == 'Q':
        actual_value_dates = pd.date_range(end=max_date_actual, periods=2, freq='Q')
        mape_dates = pd.date_range(start=max_date_actual - pd.offsets.QuarterEnd(1), periods=3, freq='Q')
    else:  # 'M'
        actual_value_dates = pd.date_range(end=max_date_actual, periods=4, freq='M')
        mape_dates = pd.date_range(start=max_date_actual - pd.offsets.MonthEnd(3), periods=5, freq='M')

    # Filter the rows of the group based on the specified criteria:
    # 1. Rows with actual values within the desired date range
    rows_actual_value = group.loc[group.index.isin(actual_value_dates) & (group['data_type'] == 'actual_value')]
    # 2. Rows with forecast values (arima_mape and es_mape) within the desired date range
    rows_forecast = group.loc[group.index.isin(mape_dates) & group['data_type'].isin(['arima_mape', 'es_mape'])]

    # Concatenate and return the filtered rows
    return pd.concat([rows_actual_value, rows_forecast])

# Apply the function to groups defined by 'indicator_name', 'region_code', and 'date_freq'
# This ensures that the function is applied separately for each combination of these columns
results = df.groupby(['indicator_name', 'region_code', 'date_freq'], sort=True).apply(get_rows).reset_index(level=[0,1,2], drop=True).reset_index()

# Finally, sort the resultant dataframe by 'region_code' and 'data_type' columns
result_sorted = results.sort_values(by=['region_code', 'data_type']).reset_index(drop=True)

# Display the sorted results
print(result_sorted)

这导致

date indicator_name region_code date_freq     data_type     value
0  2023-03-31            GDP          BJ         Q  actual_value  8.200000
1  2023-06-30            GDP          BJ         Q  actual_value  8.400000
2  2023-03-31            GDP          BJ         Q    arima_mape  0.331838
3  2023-06-30            GDP          BJ         Q    arima_mape  0.269978
4  2023-09-30            GDP          BJ         Q    arima_mape  0.208118
5  2023-03-31            GDP          BJ         Q       es_mape  0.084397
6  2023-06-30            GDP          BJ         Q       es_mape  0.022536
7  2023-09-30            GDP          BJ         Q       es_mape -0.039324
8  2023-04-30            CPI          SH         M  actual_value  1.100000
9  2023-05-31            CPI          SH         M  actual_value  0.800000
10 2023-06-30            CPI          SH         M  actual_value  0.600000
11 2023-07-31            CPI          SH         M  actual_value  0.400000
12 2023-04-30            CPI          SH         M    arima_mape  1.284494
13 2023-05-31            CPI          SH         M    arima_mape  0.940008
14 2023-06-30            CPI          SH         M    arima_mape  0.698891
15 2023-07-31            CPI          SH         M    arima_mape  0.557101
16 2023-08-31            CPI          SH         M    arima_mape  0.331838
17 2023-04-30            CPI          SH         M       es_mape  1.410244
18 2023-05-31            CPI          SH         M       es_mape  1.085284
19 2023-06-30            CPI          SH         M       es_mape  0.810315
20 2023-07-31            CPI          SH         M       es_mape  0.585306
21 2023-08-31            CPI          SH         M       es_mape  0.440354

相关问题