Pandas groupby is very slow

pftdvrlh  posted 11 months ago  in  Other

Below are my DataFrame and code. As the DataFrame grows, the run time increases significantly. What is going on, and how can I vectorize this?

import pandas as pd
import numpy as np

data = {
    'delta_t': np.random.randint(0, 301, 100),
    'specimen': np.random.choice(['X', 'Y', 'Z'], 100),
    'measuremnt': np.random.rand(100),
    'lag': np.random.rand(100)
}

df = pd.DataFrame(data)

# Defining the q75 function
def q75(x):
    return x.quantile(0.75)

# Applying the given code
df_result = df.groupby(['specimen', 'delta_t']).agg({
    'measuremnt': ['mean', q75, 'max'],
    'lag': 'mean'
}).reset_index()


slmsl1lt  #1

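As mentioned in the comments, do a lazy groupby like this: create the GroupBy object once, then call the built-in aggregation methods on it.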

%%timeit -n 10
groups = df.groupby(['specimen', 'delta_t'])

df_result = pd.DataFrame({
    'measurement_mean': groups['measuremnt'].mean(),
    'measurement_q75': groups['measuremnt'].quantile(.75),
    'measurement_max': groups['measuremnt'].max(),
    'lag': groups['lag'].mean()
}).reset_index()

> 1.95 ms ± 337 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Output:

  specimen  delta_t  measurement_mean  measurement_q75  measurement_max       lag
0        X        9          0.861484         0.861484         0.861484  0.338134
1        X       10          0.675029         0.675029         0.675029  0.573993
2        X       24          0.894738         0.894738         0.894738  0.411725
3        X       41          0.610354         0.610354         0.610354  0.953460
4        X       45          0.271329         0.271329         0.271329  0.931424


Compared with your code:

%%timeit -n 10
# Applying the given code
df_result = df.groupby(['specimen', 'delta_t']).agg({
    'measuremnt': ['mean', q75, 'max'],
    'lag': 'mean'
}).reset_index()

> 43.2 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
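
The gap most likely comes from q75 being a plain Python function that pandas has to call once per group, while mean, max and quantile on the GroupBy object run in compiled code. Here is a minimal sketch for checking that both approaches return the same numbers and for timing them outside IPython. The n_rows value, the seeded generator and the slow/fast helper names are my own assumptions, not from the question:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 10_000  # assumption: larger than the question's 100 rows

df = pd.DataFrame({
    'delta_t': rng.integers(0, 301, n_rows),
    'specimen': rng.choice(['X', 'Y', 'Z'], n_rows),
    'measuremnt': rng.random(n_rows),  # spelling kept from the question
    'lag': rng.random(n_rows),
})

def q75(x):
    return x.quantile(0.75)

def slow():
    # Original approach: a Python callable inside agg
    return df.groupby(['specimen', 'delta_t']).agg({
        'measuremnt': ['mean', q75, 'max'],
        'lag': 'mean',
    }).reset_index()

def fast():
    # Answer's approach: built-in GroupBy methods only
    groups = df.groupby(['specimen', 'delta_t'])
    return pd.DataFrame({
        'measurement_mean': groups['measuremnt'].mean(),
        'measurement_q75': groups['measuremnt'].quantile(.75),
        'measurement_max': groups['measuremnt'].max(),
        'lag': groups['lag'].mean(),
    }).reset_index()

slow_result = slow()
fast_result = fast()

# Skip the two key columns and compare the four aggregated columns,
# which appear in the same order in both results.
same_values = np.allclose(
    slow_result.iloc[:, 2:].to_numpy(),
    fast_result.iloc[:, 2:].to_numpy(),
)
print('same values:', same_values)

# Total seconds for 5 calls each; exact numbers depend on the machine.
print('agg with q75 :', timeit.timeit(slow, number=5))
print('built-ins    :', timeit.timeit(fast, number=5))
```

Raising n_rows adds more distinct (specimen, delta_t) groups only up to the 903 possible combinations, so to see how the cost scales with the number of groups you would also need to widen the range of delta_t.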


If you want MultiIndex columns, pass tuples as the keys:

df_result = pd.DataFrame({
    ('measurement','mean'): groups['measuremnt'].mean(),
    ('measurement','q75'): groups['measuremnt'].quantile(.75),
    ('measurement','max'): groups['measuremnt'].max(),
    ('lag','mean'): groups['lag'].mean()
}).reset_index()
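
To illustrate what the tuple keys produce, here is a small self-contained sketch that rebuilds the frame with a seeded generator (my assumption, for reproducibility) and inspects the resulting column index. Note that the top-level label here is 'measurement', whereas the original agg call would label it with the source column name 'measuremnt':

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'delta_t': rng.integers(0, 301, 100),
    'specimen': rng.choice(['X', 'Y', 'Z'], 100),
    'measuremnt': rng.random(100),  # spelling kept from the question
    'lag': rng.random(100),
})

groups = df.groupby(['specimen', 'delta_t'])

# Tuple keys become the two levels of the column MultiIndex
df_result = pd.DataFrame({
    ('measurement', 'mean'): groups['measuremnt'].mean(),
    ('measurement', 'q75'): groups['measuremnt'].quantile(.75),
    ('measurement', 'max'): groups['measuremnt'].max(),
    ('lag', 'mean'): groups['lag'].mean(),
})

is_multi = isinstance(df_result.columns, pd.MultiIndex)
top_levels = list(df_result.columns.get_level_values(0).unique())

# Selecting a top-level label returns the sub-frame of its statistics
measurement_stats = df_result['measurement']

# reset_index moves the group keys into columns; their second level is ''
flat = df_result.reset_index()
print(flat.columns.tolist())
```

With this layout, df_result['measurement'] gives all three statistics at once, and a single one can be selected with a tuple such as df_result[('measurement', 'q75')].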
