Python: groupby and mean return NaN

mzillmmw  posted on 2023-01-01 in Python

I'm trying to use groupby to group by symbol and return the average volume of the previous high-volume days, using pandas.
I create the data:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "date": ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
    "symbol": ['ABC', 'ABC', 'ABC', 'AAA', 'AAA', 'AAA'],
    "change": [20, 1, 2, 3, 50, 100],
    "volume": [20000000, 100, 3000, 500, 40000000, 60000000],
})

Filter by high volume and change:

high_volume_days = df[(df['volume'] >= 20000000) & (df['change'] >= 20)]

Then I get the volume of the previous high-volume day (this works):

high_volume_days['previous_high_volume_day'] = high_volume_days.groupby('symbol')['volume'].shift(1)

But when I try to compute the average over the prior days for each symbol:

high_volume_days['avg_volume_prior_days'] = df.groupby('symbol')['volume'].mean()

I get NaNs:

         date symbol  change    volume  previous_high_volume_day  avg_volume_prior_days
0  2022-01-01    ABC      20  20000000                       NaN                    NaN
4  2022-01-05    AAA      50  40000000                       NaN                    NaN
5  2022-01-06    AAA     100  60000000                40000000.0                    NaN

What am I missing?
Expected output:

         date symbol  change    volume  previous_high_volume_day  avg_volume_prior_days
0  2022-01-01    ABC      20  20000000                       NaN                    20000000
4  2022-01-05    AAA      50  40000000                       NaN                    40000000
5  2022-01-06    AAA     100  60000000                40000000.0                    50000000

jpfvwuh4  1#

high_volume_days['avg_volume_prior_days'] = high_volume_days.groupby('symbol', sort=False)['volume'].expanding().mean().droplevel(0)

Output:

date        symbol  change  volume  previous_high_volume_day    avg_volume_prior_days
0   2022-01-01  ABC     20      20000000    NaN                     20000000.0
4   2022-01-05  AAA     50      40000000    NaN                     40000000.0
5   2022-01-06  AAA     100     60000000    40000000.0              50000000.0
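
To see why droplevel(0) is needed in the line above, it can help to inspect the intermediate result. The following is a minimal sketch that rebuilds the question's data; the variable names running and flat are only illustrative, and the .copy() call is an addition (not in the original code) to avoid pandas' SettingWithCopyWarning.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
    "symbol": ['ABC', 'ABC', 'ABC', 'AAA', 'AAA', 'AAA'],
    "change": [20, 1, 2, 3, 50, 100],
    "volume": [20000000, 100, 3000, 500, 40000000, 60000000],
})
# .copy() is an assumption added here so later column assignments do not warn
high_volume_days = df[(df['volume'] >= 20000000) & (df['change'] >= 20)].copy()

# The grouped expanding mean is indexed by (symbol, original row label)
running = high_volume_days.groupby('symbol', sort=False)['volume'].expanding().mean()
print(running.index.nlevels)  # two index levels: symbol and original row label

# Dropping the symbol level leaves only the original row labels,
# so the result aligns with high_volume_days on assignment
flat = running.droplevel(0)
high_volume_days['avg_volume_prior_days'] = flat
print(high_volume_days)
```

Without droplevel(0), the two-level index would not match the single-level integer index of high_volume_days.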

bvjxkvbb  2#

Index misalignment: high_volume_days has an integer index, while the result of df.groupby(...) is indexed by symbol.
Use merge instead:

high_volume_days = pd.merge(
    high_volume_days,
    df.groupby("symbol")["volume"].mean().rename("avg_volume_prior_days"),
    left_on="symbol",
    right_index=True,
)
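
The misalignment can be reproduced directly. This is a small sketch under the question's data; reindex is used here only to imitate the label alignment that happens during column assignment, and the names group_mean, aligned and merged are illustrative. Note that the merge attaches each symbol's mean over all rows of df.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
    "symbol": ['ABC', 'ABC', 'ABC', 'AAA', 'AAA', 'AAA'],
    "change": [20, 1, 2, 3, 50, 100],
    "volume": [20000000, 100, 3000, 500, 40000000, 60000000],
})
high_volume_days = df[(df['volume'] >= 20000000) & (df['change'] >= 20)].copy()

group_mean = df.groupby("symbol")["volume"].mean()
print(group_mean.index.tolist())        # symbol labels
print(high_volume_days.index.tolist())  # integer row labels

# Looking up integer labels in a symbol-indexed Series finds nothing,
# which is why the assignment in the question produced only NaN
aligned = group_mean.reindex(high_volume_days.index)
print(aligned.isna().all())

# The merge matches on the symbol values instead of on the index
merged = pd.merge(
    high_volume_days,
    group_mean.rename("avg_volume_prior_days"),
    left_on="symbol",
    right_index=True,
)
print(merged.sort_values("date"))
```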

vom3gejh  3#

df.groupby('symbol')['volume'].mean() returns:

symbol
AAA    33333500.0
ABC     6667700.0
Name: volume, dtype: float64

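This is an aggregation of each group to a single value. Note that the groups (the symbols) are the index of this Series. When you try to assign it back to high_volume_days, the indexes do not align.
You should use a transform instead of an aggregation (.mean() is equivalent to .agg("mean")): .transform("mean").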
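
As a rough illustration of the difference, the sketch below applies .transform("mean") to the question's data. The variable name overall and the column name avg_volume_all_days are hypothetical. The result keeps the row index of df, so it assigns cleanly, but every row receives its symbol's mean over all rows.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
    "symbol": ['ABC', 'ABC', 'ABC', 'AAA', 'AAA', 'AAA'],
    "change": [20, 1, 2, 3, 50, 100],
    "volume": [20000000, 100, 3000, 500, 40000000, 60000000],
})
high_volume_days = df[(df['volume'] >= 20000000) & (df['change'] >= 20)].copy()

# transform broadcasts the group mean back to one value per original row
overall = df.groupby("symbol")["volume"].transform("mean")
print(overall.index.equals(df.index))  # same index as df

# Assignment aligns on the shared integer labels, so no NaN appears
high_volume_days["avg_volume_all_days"] = overall
print(high_volume_days)
```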
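==== Edit ====
What you are looking for is not the mean of all values but the mean "so far". You can usually get this with .expanding().mean(), but since you are assigning back to a column of high_volume_days, you need to either drop the index level that contains the symbol or use a lambda: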

high_volume_days.groupby('symbol')['volume'].expanding().mean().droplevel(0)
# or
high_volume_days.groupby('symbol')['volume'].transform(lambda x: x.expanding().mean())
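
A quick end-to-end check, assuming the question's data, suggests that both variants give the same values and reproduce the expected output. The names via_droplevel and via_transform are illustrative, and .copy() is added here to avoid SettingWithCopyWarning.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
    "symbol": ['ABC', 'ABC', 'ABC', 'AAA', 'AAA', 'AAA'],
    "change": [20, 1, 2, 3, 50, 100],
    "volume": [20000000, 100, 3000, 500, 40000000, 60000000],
})
high_volume_days = df[(df['volume'] >= 20000000) & (df['change'] >= 20)].copy()

# Variant 1: grouped expanding mean, then drop the symbol index level
via_droplevel = (
    high_volume_days.groupby('symbol')['volume'].expanding().mean().droplevel(0)
)

# Variant 2: transform keeps the original index, so no droplevel is needed
via_transform = high_volume_days.groupby('symbol')['volume'].transform(
    lambda x: x.expanding().mean()
)

# Row order may differ between the variants, so compare after sorting by label
print(via_droplevel.sort_index().tolist() == via_transform.tolist())

high_volume_days['avg_volume_prior_days'] = via_transform
print(high_volume_days)
```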
