按pandas Dataframe 分组并选择序列中的最大值

brqmpdu1  于 2023-05-15  发布在  其他
关注(0)|答案(3)|浏览(211)

我有一个pandas dataframe,它表示几个目标涡轮机的每10度点之间的高程差。我选择了遵循某个条件的高程差,并添加了一列来表示这些高程差是否连续(metDegDiff = 10表示连续点)。
如何通过targTurb在3个或更多连续的10度点中选择elevDif的最大值?

ridgeDF2 = pd.DataFrame(data = {
    'MetID':['A06_40','A06_50','A06_60','A06_70','A06_80','A06_100','A06_110','A06_140','A07_110','A07_130','A07_140','A08_100','A08_110','A08_120','A08_130','A08_220'],
    'targTurb':['A06','A06','A06','A06','A06','A06','A06','A06','A07','A07','A07','A08','A08','A08','A08','A08'],
    'metDeg':[30,50,60,70,80,100,110,140,110,130,140,100,110,120,130,220],
    'elevDif':[1.433234, 1.602997,3.227997,2.002991,2.414001,2.96402,1.513,1.793976,1.612,2.429993,1.639008,1.500977,3.048004,2.174011,1.813995,1.527008],
    'metDegDiff':[20,10,10,10,10,20,10,30,-30,20,10,-40,10,10,10,30]})
[Dbg]>>> ridgeDF2
      MetID targTurb  metDeg   elevDif  metDegDiff
0    A06_40      A06      30  1.433234          20
1    A06_50      A06      50  1.602997          10
2    A06_60      A06      60  3.227997          10
3    A06_70      A06      70  2.002991          10
4    A06_80      A06      80  2.414001          10
5   A06_100      A06     100  2.964020          20
6   A06_110      A06     110  1.513000          10
7   A06_140      A06     140  1.793976          30
8   A07_110      A07     110  1.612000         -30
9   A07_130      A07     130  2.429993          20
10  A07_140      A07     140  1.639008          10
11  A08_100      A08     100  1.500977         -40
12  A08_110      A08     110  3.048004          10
13  A08_120      A08     120  2.174011          10
14  A08_130      A08     130  1.813995          10
15  A08_220      A08     220  1.527008          30

在该示例中,对于A06,存在具有连续10个metDeg值的4行(行1、2、3和4),并且对于A8,存在3行(行12、13和14)。请注意,这2个系列的长度为3或更长。
因此,输出将是这两个所选系列内的最大elevDif。就像这样:

MetID  targTurb  metDeg   elevDif  metDegDiff
A06_60      A06      60  3.227997          10
A08_110     A08     110  3.048004          10
wgx48brx

wgx48brx1#

下面的代码应该可以工作。您可以单独运行每一行,看看发生了什么。

ridgeDF2['t/f'] = ridgeDF2['metDegDiff'] != 10
ridgeDF2['t/f'] = ridgeDF2['t/f'].shift().fillna(0).cumsum()
ridgeDF2['count'] = ridgeDF2.groupby('t/f')['t/f'].transform(len)-1
ridgeDF2['count'] = np.where(ridgeDF2['count'] >= 3,True,False)
ridgeDF2.loc[ridgeDF2['metDegDiff'] != 10,'count'] = False
highest = ridgeDF2.loc[ridgeDF2['count'] == True]
highest = highest.loc[highest.groupby(['targTurb','metDegDiff','t/f'])['elevDif'].idxmax()]
highest.drop(columns = ['t/f','count'])

m = ridgeDF2['metDegDiff'].eq(10).diff().ne(0).cumsum()
ridgeDF2.loc[ridgeDF2.groupby(m).transform('size').ge(3) & ridgeDF2['metDegDiff'].eq(10)].sort_values('elevDif',ascending=False).groupby(m).head(1)
sulc1iza

sulc1iza2#

链式解

ridgeDF2.loc[ridgeDF2[((ridgeDF2.assign(group=(ridgeDF2.metDegDiff!=10).cumsum())).groupby('group')['metDegDiff'].transform(lambda x: (x==10)& (x.count()>=3)))].groupby('targTurb')['elevDif'].idxmax()]

分步解决方案

.cumsum()metDegDiff创建第一个元素不为10的组。

ridgeDF2=ridgeDF2.assign(group=(ridgeDF2.metDegDiff!=10).cumsum())

应用多重过滤器以去除上面生成的组中不等于10的metDegDiff,并保留连续值计数=10等于或大于3的组。我链接groupby().transform()和布尔选择来实现这一点

g=ridgeDF2[ridgeDF2.groupby('group')['metDegDiff'].transform(lambda x: (x==10)& (x.count()>=3))]

从上面剩余的内容中,选择具有最大值的索引

g.loc[g.groupby('targTurb')['elevDif'].idxmax()]

结果

MetID targTurb  metDeg   elevDif  metDegDiff
2    A06_60      A06      60  3.227997          10
12  A08_110      A08     110  3.048004          10

定时

%timeit ridgeDF2.loc[ridgeDF2[((ridgeDF2.assign(group=(ridgeDF2.metDegDiff!=10).cumsum())).groupby('group')['metDegDiff'].transform(lambda x: (x==10)& (x.count()>=3)))].groupby('targTurb')['elevDif'].idxmax()]
9.01 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
m3eecexj

m3eecexj3#

您可以做的是使用shiftcumsum为metDegDiff和相同targTurb中的相同连续值创建一个组列。然后,您可以使用此组列来选择组中具有大于或等于(ge)3个值的位置,这些值是通过组号的map与组号的value_counts获得的,并且metDegDiff中的值等于(eq)10。现在只有感兴趣的组,可以在elevDif上使用sort_values,在group列上使用drop_duplicates,以保持每组的最大值。如果需要的话,可以使用drop列gr和sort_values per targTurb结束。

ridgeDF2['metDegDiff'] = ridgeDF2['metDeg'].diff() #I assume calculated this way

#create a group number with same consecutive values and same targTurb
ridgeDF2['gr'] = (ridgeDF2['metDegDiff'].ne(ridgeDF2['metDegDiff'].shift())
                  |(ridgeDF2['targTurb'].ne(ridgeDF2['targTurb'].shift()))
                 ).cumsum()

#get the result dataframe
res_ = (ridgeDF2.loc[ridgeDF2['metDegDiff'].eq(10) #row with 10 in metDegDiff
                     &ridgeDF2['gr'].map(ridgeDF2['gr'].value_counts()).ge(3)] #and row with group of greater equal 3 values
                .sort_values(by='elevDif') # ascending sort of the elevDif
                .drop_duplicates('gr', keep='last') #keep the last row pergroup having higher number
                .drop('gr', axis=1) #remove the extra group column
                .sort_values('targTurb') #if you need
       )

你就能得到你想要的行

print (res_)
      MetID targTurb  metDeg   elevDif  metDegDiff
2    A06_60      A06      60  3.227997        10.0
12  A08_110      A08     110  3.048004        10.0

相关问题