numpy: Is there a more efficient way to apply a function row-wise, then column-wise?

ccgok5k5, posted 12 months ago in Other

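My dataset contains 5 measurements taken every day over a span of more than 700 days. I want to group the values by day of the week, then apply the trim_mean function from scipy.stats to each of the 5 measurements, using 1/stddev as the proportiontocut argument.
My data: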

import pandas as pd
import numpy as np
from scipy.stats import trim_mean

np.random.seed(42)

data = np.random.randint(0, 100, size=(5, 700))
col_names = pd.date_range('11-16-2023', periods=700)
df = pd.DataFrame(data, columns=col_names)

# df
    2023-11-16  2023-11-17 ...  2025-10-15
0   51          92         ...  57
1   88          48         ...  32
2   89          52         ...  96
3   61          99         ...  48
4   0           7          ...  34

Currently I can achieve this with the following (not very elegant) process:

df_T = df.T
df_T['Day of Week'] = pd.to_datetime(df_T.index).isocalendar().day

## Room for improvement here ##
# Apply calculation to each type of measurement
gb = df_T.groupby('Day of Week')
m0 = gb[0].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m1 = gb[1].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m2 = gb[2].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m3 = gb[3].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m4 = gb[4].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))

results_df = pd.DataFrame([m0, m1, m2, m3, m4])
results_df.columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# results_df
    Mon         Tue         Wed         Thu         Fri         Sat         Sun
0   50.936170   51.712766   44.659574   49.117021   48.702128   47.414894   51.223404
1   49.244681   49.000000   49.138298   49.191489   45.872340   49.010638   47.074468
2   49.436170   46.404255   49.021277   46.553191   55.031915   51.265957   50.638298
3   43.744681   47.787234   48.574468   45.882979   47.255319   47.914894   49.606383
4   49.265957   46.255319   50.276596   50.872340   46.723404   45.255319   49.904255


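This is very inefficient, and it does not scale if I have many measurement types. Is there a clever way to apply/map my trim_mean function to achieve the same result?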
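To make the trimming rule concrete, here is a minimal sketch of what trim_mean does and how a proportiontocut of 1/stddev translates into a number of dropped values. The toy array `x` and the random `sample` are illustrative assumptions, not part of the dataset above.

```python
import numpy as np
from scipy.stats import trim_mean

# Toy data (assumption for illustration): seven small values and one outlier.
x = np.array([1, 2, 3, 4, 5, 6, 7, 1000])

plain = x.mean()
# proportiontocut=0.25 on 8 values drops the 2 smallest and the 2 largest.
trimmed = trim_mean(x, proportiontocut=0.25)

# The rule from the question: cut a fraction equal to 1/std from each end.
rng = np.random.default_rng(0)
sample = rng.integers(0, 100, size=100)   # hypothetical weekday group
cut = 1 / np.std(sample)                  # np.std uses ddof=0 on arrays
n_cut = int(cut * sample.size)            # values dropped from each end

print(plain, trimmed, cut, n_cut)
```

Note that the rule only makes sense while 1/std stays below 0.5, that is, while the standard deviation of each group is above 2.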

zte4gxcn (answer #1)

One possible option:

from calendar import day_abbr

results_df = (
   (ser:=df.T.stack()).droplevel(0).groupby(
     [ser.index.get_level_values(0).dayofweek, pd.Grouper(level=0)])
      .apply(lambda g: trim_mean(g, proportiontocut=1/np.std(g)))
      .unstack(0).set_axis(list(day_abbr), axis=1)
)

Output:

print(results_df)

         Mon        Tue        Wed        Thu        Fri        Sat        Sun
0  50.936170  51.712766  44.659574  49.117021  48.702128  47.414894  51.223404
1  49.244681  49.000000  49.138298  49.191489  45.872340  49.010638  47.074468
2  49.436170  46.404255  49.021277  46.553191  55.031915  51.265957  50.638298
3  43.744681  47.787234  48.574468  45.882979  47.255319  47.914894  49.606383
4  49.265957  46.255319  50.276596  50.872340  46.723404  45.255319  49.904255

[5 rows x 7 columns]
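
As a possible alternative, and not part of the original answer, the per-measurement loop can also be avoided by grouping the transposed frame on the weekday and letting agg call the function once per column and group. This is a sketch under the question's setup; the helper name `trimmed` is an assumption introduced here.

```python
import numpy as np
import pandas as pd
from scipy.stats import trim_mean

# Same setup as in the question.
np.random.seed(42)
data = np.random.randint(0, 100, size=(5, 700))
df = pd.DataFrame(data, columns=pd.date_range("2023-11-16", periods=700))

def trimmed(col):
    """Trimmed mean of one measurement within one weekday group."""
    values = col.to_numpy()
    # proportiontocut = 1/std (ddof=0), as in the question.
    return trim_mean(values, proportiontocut=1 / np.std(values))

# 0 = Monday ... 6 = Sunday, one label per date column.
weekday = df.columns.dayofweek

# Rows become dates, so groupby works on the weekday of each row;
# agg applies `trimmed` to every measurement column in every group.
results_df = df.T.groupby(weekday).agg(trimmed).T
results_df.columns = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

print(results_df)
```

This still calls a Python function once per group and column, so it is mainly shorter rather than fundamentally faster; it does, however, work unchanged for any number of measurement rows.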
