pandas 如何发现价值观从一开始就发生了变化?

iszxjhcz  于 2023-06-20  发布在  其他
关注(0)|答案(2)|浏览(125)

我有一个带有时间索引的 Dataframe 。每年八月开始有一个周期。我想计算每个月的值与前一个8月(其周期的开始)的值之间的差异。目的是了解每个周期中值的变化程度。
这是数据的一个小例子:

import pandas as pd
import numpy as np

df = pd.DataFrame({'date': ['2022-01-01 00:00:00', '2022-02-01 00:00:00', 
                          '2021-03-01 00:00:00', '2021-04-01 00:00:00',
                          '2021-05-01 00:00:00', '2021-06-01 00:00:00',
                          '2021-07-01 00:00:00', '2021-08-01 00:00:00',
                          '2021-09-01 00:00:00', '2021-10-01 00:00:00',
                          '2021-11-01 00:00:00', '2021-12-01 00:00:00',
                          '2022-01-01 00:00:00', '2022-02-01 00:00:00',
                          '2022-03-01 00:00:00', '2022-04-01 00:00:00',
                          '2022-05-01 00:00:00', '2022-06-01 00:00:00',
                          '2022-07-01 00:00:00', '2022-08-01 00:00:00',
                          '2022-09-01 00:00:00', '2022-10-01 00:00:00',
                          '2022-11-01 00:00:00', '2023-12-01 00:00:00',
                          '2023-01-01 00:00:00', '2023-02-01 00:00:00',
                          '2023-03-01 00:00:00', '2023-04-01 00:00:00',
                          '2023-05-01 00:00:00', '2023-06-01 00:00:00',
                          ], 
               'value1': [0.452762281,0.372262281,0.513928948,0.447762281,
                            0.377095615,0.355095615,0.271428948,0.291762281,
                            0.476762281,0.335928948,0.280428948,0.283762281,
                            0.322928948,0.287262281,0.316928948,0.209262281,
                            0.407928948,0.254262281,0.232095615,0.264262281,
                            0.076095615,-0.025237719,-0.042237719,-0.094904385,
                            0.017428948,-0.036071052,-0.094071052,-0.071404385,
                            0.008095615,-0.141571052],
               'value2': [9.6,8,8.4,6.2,6.2,6,3.9,8.5,8.3,5.3,5.6,5.3,
                               6.2,6.3,6.9,4.8,6.7,3.6,3,4.6,2.3,1.3,1,0.3,
                               1.6,0.4,1.5,1.4,2.2,1.2]})
df = df.set_index('date')

我试图得到一个具有相同结构的dataframe,但所有的值都具有其对应的八月的值,然后我只需要从另一个dataframe中减去一个dataframe。但我也找不到办法。
谢谢,请多加关照.

omqzjyyz

omqzjyyz1#

如果我理解正确的话,你可以使用.groupby(),然后计算每组的差异:

df.index = pd.to_datetime(df.index)

tmp = (df.index.month == 8).cumsum()
out = (
    df[tmp != 0]
    .groupby(tmp[tmp > 0], group_keys=False)
    .apply(lambda x: x[["value1", "value2"]] - x[["value1", "value2"]].iloc[0])
).add_prefix('calculated_')

x = df.join(out)
print(x)

图纸:

value1  value2  calculated_value1  calculated_value2
date                                                              
2021-03-01  0.513929     8.4                NaN                NaN
2021-04-01  0.447762     6.2                NaN                NaN
2021-05-01  0.377096     6.2                NaN                NaN
2021-06-01  0.355096     6.0                NaN                NaN
2021-07-01  0.271429     3.9                NaN                NaN
2021-08-01  0.291762     8.5           0.000000                0.0
2021-09-01  0.476762     8.3           0.185000               -0.2
2021-10-01  0.335929     5.3           0.044167               -3.2
2021-11-01  0.280429     5.6          -0.011333               -2.9
2021-12-01  0.283762     5.3          -0.008000               -3.2
2022-01-01  0.452762     9.6           0.031167               -2.3
2022-01-01  0.322929     6.2           0.031167               -2.3
2022-02-01  0.372262     8.0          -0.004500               -2.2
2022-02-01  0.287262     6.3          -0.004500               -2.2
2022-03-01  0.316929     6.9           0.025167               -1.6
2022-04-01  0.209262     4.8          -0.082500               -3.7
2022-05-01  0.407929     6.7           0.116167               -1.8
2022-06-01  0.254262     3.6          -0.037500               -4.9
2022-07-01  0.232096     3.0          -0.059667               -5.5
2022-08-01  0.264262     4.6           0.000000                0.0
2022-09-01  0.076096     2.3          -0.188167               -2.3
2022-10-01 -0.025238     1.3          -0.289500               -3.3
2022-11-01 -0.042238     1.0          -0.306500               -3.6
2023-01-01  0.017429     1.6          -0.246833               -3.0
2023-02-01 -0.036071     0.4          -0.300333               -4.2
2023-03-01 -0.094071     1.5          -0.358333               -3.1
2023-04-01 -0.071404     1.4          -0.335667               -3.2
2023-05-01  0.008096     2.2          -0.256167               -2.4
2023-06-01 -0.141571     1.2          -0.405833               -3.4
2023-12-01 -0.094904     0.3          -0.359167               -4.3
bxgwgixi

bxgwgixi2#

有几种方法可以解决您的问题。解决方案取决于您的数据可能的外观
你的代码中也有一些缺陷。这是我的建议
1.这里不需要set_index('date')
这里是:

df['date'] = pd.to_datetime(df['date'])
df.sort_values(by=['date'], ignore_index=True, inplace=True)

第一行是将date列转换为datetime格式,第二行是按升序对数据进行排序。
1.最常见的方法是使用august only提取子样本,然后将这些值合并回主 Dataframe 并执行计算。这可能会很长,但它会显示任何潜在的错误,如果有在您的数据
完整的解决方案:

import pandas as pd
import numpy as np
from dateutil.relativedelta import *
from pandas.tseries.offsets import *

df = pd.DataFrame({'date': ['2022-01-01 00:00:00', '2022-02-01 00:00:00', 
                          '2021-03-01 00:00:00', '2021-04-01 00:00:00',
                          '2021-05-01 00:00:00', '2021-06-01 00:00:00',
                          '2021-07-01 00:00:00', '2021-08-01 00:00:00',
                          '2021-09-01 00:00:00', '2021-10-01 00:00:00',
                          '2021-11-01 00:00:00', '2021-12-01 00:00:00',
                          '2022-01-01 00:00:00', '2022-02-01 00:00:00',
                          '2022-03-01 00:00:00', '2022-04-01 00:00:00',
                          '2022-05-01 00:00:00', '2022-06-01 00:00:00',
                          '2022-07-01 00:00:00', '2022-08-01 00:00:00',
                          '2022-09-01 00:00:00', '2022-10-01 00:00:00',
                          '2022-11-01 00:00:00', '2023-12-01 00:00:00',
                          '2023-01-01 00:00:00', '2023-02-01 00:00:00',
                          '2023-03-01 00:00:00', '2023-04-01 00:00:00',
                          '2023-05-01 00:00:00', '2023-06-01 00:00:00',
                          ], 
               'value1': [0.452762281,0.372262281,0.513928948,0.447762281,
                            0.377095615,0.355095615,0.271428948,0.291762281,
                            0.476762281,0.335928948,0.280428948,0.283762281,
                            0.322928948,0.287262281,0.316928948,0.209262281,
                            0.407928948,0.254262281,0.232095615,0.264262281,
                            0.076095615,-0.025237719,-0.042237719,-0.094904385,
                            0.017428948,-0.036071052,-0.094071052,-0.071404385,
                            0.008095615,-0.141571052],
               'value2': [9.6,8,8.4,6.2,6.2,6,3.9,8.5,8.3,5.3,5.6,5.3,
                               6.2,6.3,6.9,4.8,6.7,3.6,3,4.6,2.3,1.3,1,0.3,
                               1.6,0.4,1.5,1.4,2.2,1.2]})

# cleaning data
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by=['date'], ignore_index=True, inplace=True)

# creat sub dataframe with august value only
aug = df.loc[df['date'].dt.month==8].copy()
aug.rename(columns={'value1': 'augval1', 'value2': 'augval2', 'date':'date_aug'}, inplace=True)

# create equivalent august date for merge
df['date_aug'] = np.where(df['date'].dt.month < 8, df['date'] + YearBegin(-1) + MonthBegin(7), df['date'] + YearBegin(1) + MonthBegin(7))

# merge aug dataframe to df dataframe
new = pd.merge(df, aug, on='date_aug', how='left')

# perform calculation
new['chg_val1'] = new['value1'].diff()
new['chg_aug'] = new['value1'] - new['augval1']

相关问题