Pandas: creating a categorical column after a condition

o2rvlv0m  posted on 2022-11-27  in Other

I have a DataFrame that looks like this:

DURATION   CLUSTER   COEFF
3          0         0.34
3          1        -0.005
3          2         1
3          3         0.33 
4          0        -0.02
4          1        -0.28
4          2         0.22
4          3         0.48
5          0         0.65
5          1        -0.26
5          2         0.1
5          3         0.15

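I want to create a categorical RESULT column based on the "COEFF" value within each "DURATION". The row with the largest "COEFF" in its "DURATION" group should be "First", the next largest "Second", and so on.
The desired output is: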

DURATION   CLUSTER   COEFF  RESULT
3          0         0.34   Second
3          1        -0.005  Fourth
3          2         1      First
3          3         0.33   Third
4          0        -0.02   Third
4          1        -0.28   Fourth
4          2         0.22   Second
4          3         0.48   First
5          0         0.65   First
5          1        -0.26   Fourth
5          2         0.1    Third
5          3         0.15   Second

Can you help me with this?

bvhaajcl  #1

Use groupby.rank and map:

labels = ['First', 'Second', 'Third', 'Fourth', 'Fifth']
df['RESULT'] = (df.groupby('DURATION')['COEFF']
                  .rank('dense', ascending=False).sub(1)
                  .map(dict(enumerate(labels)))
               )

Output:

    DURATION  CLUSTER  COEFF  RESULT
0          3        0  0.340  Second
1          3        1 -0.005  Fourth
2          3        2  1.000   First
3          3        3  0.330   Third
4          4        0 -0.020   Third
5          4        1 -0.280  Fourth
6          4        2  0.220  Second
7          4        3  0.480   First
8          5        0  0.650   First
9          5        1 -0.260  Fourth
10         5        2  0.100   Third
11         5        3  0.150  Second
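
To reproduce the steps above without copying anything to the clipboard, the sketch below rebuilds the question's data by hand and applies the same rank-then-map idea. The final conversion to an ordered categorical dtype is an assumption about what "categorical column" means in the question; it is not part of the answer above.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "DURATION": [3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5],
    "CLUSTER": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
    "COEFF": [0.34, -0.005, 1, 0.33, -0.02, -0.28, 0.22, 0.48,
              0.65, -0.26, 0.1, 0.15],
})

labels = ["First", "Second", "Third", "Fourth"]

# Dense rank within each DURATION group; the largest COEFF gets rank 1.
ranks = (df.groupby("DURATION")["COEFF"]
           .rank(method="dense", ascending=False)
           .astype(int))

# Map rank 1 -> "First", 2 -> "Second", ... and store the result as an
# ordered categorical (assumption: this is the dtype the question wants).
df["RESULT"] = pd.Categorical(
    ranks.map(dict(enumerate(labels, start=1))),
    categories=labels,
    ordered=True,
)
```

With an ordered categorical, sorting by RESULT follows First < Second < Third < Fourth instead of alphabetical order. Under dense ranking, tied COEFF values share a label, and a group with more distinct values than there are labels would get NaN for the extra ranks.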

jhdbpxl9  #2

Building on the excellent answer at https://stackoverflow.com/a/74547858/7237062 (I would not have found it that quickly myself), I suggest using this ordinal-number replacement to fully automate the process.

import pandas as pd
# see answer https://stackoverflow.com/a/20007730/7237062, others exist
# code golfed version of an "ordinal" function (int -> ordinal string in english)
ordinal = lambda n: "%d%s" % (n,"tsnrhtdd"[(n//10%10!=1)*(n%10<4)*n%10::4])
# copy pasta of OP input data
df = pd.read_clipboard()  # let pandas read the clipboard
df["RESULT"] = (df.groupby('DURATION')['COEFF']
                  .rank('dense', ascending=False)
                  .sub(1) # mozway's answer so far !
                  .astype(int)
                  + 1 # +1 so ordinals start at 1 (instead of 0)
                  ).apply(ordinal)

Result:

    DURATION  CLUSTER  COEFF RESULT
0          3        0  0.340    2nd
1          3        1 -0.005    4th
2          3        2  1.000    1st
3          3        3  0.330    3rd
4          4        0 -0.020    3rd
5          4        1 -0.280    4th
6          4        2  0.220    2nd
7          4        3  0.480    1st
8          5        0  0.650    1st
9          5        1 -0.260    4th
10         5        2  0.100    3rd
11         5        3  0.150    2nd
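
The golfed lambda above is compact but hard to read. A more explicit version is sketched below. It is not taken from the linked answer, and the function name and structure are only one possible way to write it. It should give the same strings for non-negative integers.

```python
def ordinal(n: int) -> str:
    """Return n followed by its English ordinal suffix, e.g. 1 -> '1st'."""
    # Numbers ending in 11, 12 or 13 take "th" even though they end in 1, 2, 3.
    if n % 100 in (11, 12, 13):
        suffix = "th"
    else:
        # Last digit 1, 2, 3 -> "st", "nd", "rd"; everything else -> "th".
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"
```

It can be passed to `.apply(ordinal)` on the integer ranks in the same way as the lambda.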
