pandas 按条件[Low、Hit、High]分组并创建三个新列

iaqfqrcu  于 2023-02-07  发布在  其他
关注(0)|答案(6)|浏览(123)

我有一个包含机器学习训练结果的大型数据集(约500万行)。现在我想检查结果是否达到"目标范围"。假设此范围包含-0.25+0.25之间的所有值。如果在此范围内,则为Hit,如果低于Low,则为High
我现在会创建Hit,Low,High这三列,然后计算每一行的条件,在这列中放入一个1,另外两个就变成0,然后分组求和,但我怀疑一定有更好更快的方法,比如分组的时候直接计算。

数据

import pandas as pd

df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})

+----+--------+---------+
|    | Type   |   Value |
|----+--------+---------|
|  0 | RF     |    -1.5 | <- Low
|  1 | RF     |    -0.1 | <- Hit
|  2 | RF     |     1.7 | <- High
|  3 | MLP    |     0.2 | <- Hit
|  4 | MLP    |    -0.7 | <- Low
|  5 | MLP    |    -0.6 | <- Low
+----+--------+---------+

预期输出

pd.DataFrame({"Type":["RF", "MLP"], "Low":[1,2], "Hit":[1,1], "High":[1,0]})

+----+--------+-------+-------+--------+
|    | Type   |   Low |   Hit |   High |
|----+--------+-------+-------+--------|
|  0 | RF     |     1 |     1 |      1 |
|  1 | MLP    |     2 |     1 |      0 |
+----+--------+-------+-------+--------+
jaql4c8m

jaql4c8m1#

您可以使用cut定义组,使用pivot_table重塑形状:

(df.assign(group=pd.cut(df['Value'],
                        [float('-inf'), -0.25, 0.25, float('inf')],
                        labels=['Low', 'Hit', 'High']))
   .pivot_table(index='Type', columns='group', values='Value', aggfunc='count')
   .reset_index()
   .rename_axis(None, axis=1)
)

或者crosstab

(pd.crosstab(df['Type'],
             pd.cut(df['Value'],
                    [float('-inf'), -0.25, 0.25, float('inf')],
                    labels=['Low', 'Hit', 'High'])
             )
   .reset_index().rename_axis(None, axis=1)
 )

输出:

Type  Low  Hit  High
0  MLP    2    1     0
1   RF    1    1     1
6mzjoqzu

6mzjoqzu2#

你可以先用np.select再用crosstab

c1 = df.Value<=-0.25
c2 = df.Value>=0.25
out = pd.crosstab(df['Type'], np.select([c1,c2], ['Low','High'], default='Hit'))
out
Out[32]: 
col_0  High  Hit  Low
Type                 
MLP       0    1    2
RF        1    1    1
cwdobuhd

cwdobuhd3#

你可以试试这个:

# Your code
import pandas as pd

df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})

# Set your range 
RANGE_MIN = -0.25
RANGE_MAX = 0.25

# --- define functions to be applied to df ---
# evaluate if value is a low
def eval_low(value):
    if value < RANGE_MIN:
        return 1
    else:
        return 0

# evaluate if value is a high
def eval_high(value):
    if value > RANGE_MAX:
        return 1
    else:
        return 0

# evaluate if value is a hit
def eval_hit(value):
    if value >= RANGE_MIN and value <= RANGE_MAX:
        return 1
    else:
        return 0

# Evaluate the functions in new columns
df['Low'] = df.Value.apply(eval_low)
df['Hit'] = df.Value.apply(eval_hit)
df['High'] = df.Value.apply(eval_high)

# get the summary
df.groupby('Type').sum()
q0qdq0h2

q0qdq0h24#

df.assign(Value=pd.cut(df.Value,[-np.inf,-0.25,0.25,np.inf],labels=['Low','Hit','High']))\
    .groupby('Type').value_counts().reset_index()\
    .pivot_table(index='Type',columns='Value',values=0)
Value  Low  Hit  High
Type                 
MLP      2    1     0
RF       1    1     1
hgc7kmma

hgc7kmma5#

import pandas as pd

df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})

df ['target'] = df.agg({'Value' : lambda x: 'hit' if abs(x)<0.25 else 'low' if x < -0.25 else 'high'})
df['Value'] = 1 

r = df.groupby(['Type', 'target']).count().unstack().fillna(0)
r.columns = list(r.columns.droplevel(0))

print(r)

结果

high  hit  low
Type                
MLP    0.0  1.0  2.0
RF     1.0  1.0  1.0
ymzxtsji

ymzxtsji6#

下面是使用pd.cut().str.get_dummies()的方法

(pd.cut(df.set_index('Type')['Value'],bins = [-np.inf,-.25,.25,np.inf],labels = ['low','hit','high'])
.str.get_dummies()
.groupby(level=0).sum()
.reset_index())

输出:

Type  high  hit  low
0  MLP     0    1    2
1   RF     1    1    1

相关问题