pandas: How do I bind predictions from a model trained on a subset of columns back to the original DataFrame?

Posted by plupiseo on 2022-12-16 in Other

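I am making predictions on a feature-engineered training set that has no identifying key. How can I merge my predictions back into the original df?
Original_DF: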

ID.  ColumnB.   ColumnC.   ColumnD.  Target 
   A        2          3        1          8
   B        2          3        1          9
   C        2          3        1          6

I then train my model on ColumnC and ColumnD, like this:

Subset_to_use = ['ColumnC', 'ColumnD', 'Target']
....
#Creating Train / Test resulting in train and test set, and X and Y:  
X_train, y_train
X_test,   y_test 

# Then doing the modelling, simplified: 
rf = RandomForestRegressor(n_estimators = 100) 
rf.fit(X_train, y_train)

The next question is: how do I bind the predictions back to original_df, given that the ID column is no longer present?
Training df:

ColumnC.   ColumnD.  Target 
   3        1          8
   3        1          9
   3        1          6

My line of thinking:

# Add the predictions to the df
X_train['Prediction_TEST'] = y_train  # to have the original values
X_test['Prediction_TEST'] = rf.predict(X_test)  # to have the predicted values

Then combine the two, for example:

all_data = pd.concat([X_train, X_test])

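However, this only gives me the train and test DataFrames with the new predictions, without the other original columns (for example, the ID column and ColumnB).
What is the best way to solve this? Thanks!
Expected outcome (the predicted values are made up):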

ID.  ColumnB.   ColumnC.   ColumnD.  Target     Predicted
   A        2          3        1          8       8
   B        2          3        1          9       10
   C        2          3        1          6       7
Answer 1 (sgtfey8w):

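As long as the number of predictions matches the number of input rows, the index does not matter. predict returns one value per row, in row order, so the result can be assigned straight back as a new column: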

from io import StringIO
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Reproducible data
csv_data = StringIO("""ID,ColumnB,ColumnC,ColumnD,Target
A,2,3,1,8
B,2,3,1,9
C,2,3,1,6""")

df = pd.read_csv(csv_data, index_col=0)

reg = RandomForestRegressor()
reg.fit(df[["ColumnB", "ColumnC", "ColumnD"]], df["Target"])

# Create a `Predicted` column representing testing on the train set
df["Predicted"] = reg.predict(df[["ColumnB", "ColumnC", "ColumnD"]])

print(df)

The Predicted column now contains the output of the trained random forest regressor. The ID values should not matter here.

ColumnB  ColumnC  ColumnD  Target  Predicted
ID                                              
A         2        3        1       8       7.74
B         2        3        1       9       7.74
C         2        3        1       6       7.74
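To make the alignment explicit rather than positional, the predictions can be wrapped in a Series that reuses the input's index. The following is a minimal sketch on the same toy data; the names features, raw, preds and reversed_df are illustrative and not part of the original answer, and random_state is fixed only for repeatability. Because all three rows have identical features, every prediction is the same here.

```python
from io import StringIO

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Same toy data as in the answer
csv_data = StringIO("""ID,ColumnB,ColumnC,ColumnD,Target
A,2,3,1,8
B,2,3,1,9
C,2,3,1,6""")
df = pd.read_csv(csv_data, index_col=0)

features = ["ColumnC", "ColumnD"]  # illustrative name
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(df[features], df["Target"])

# predict() returns a plain NumPy array: one value per input row, in row order.
raw = reg.predict(df[features])

# Reusing the input index turns positional alignment into label alignment.
preds = pd.Series(raw, index=df.index, name="Predicted")

# Assigning a Series aligns on the index, so the row order of the target
# frame no longer has to match the order used for prediction.
reversed_df = df.iloc[::-1].copy()
reversed_df["Predicted"] = preds

print(reversed_df)
```

Assigning the raw array works only when the row order is unchanged; assigning the Series stays correct even after the frame has been reordered.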

Now consider the case where you have separate train and test splits, and each split should only have access to ColumnC and ColumnD:

from io import StringIO
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Reproducible data
csv_data = StringIO("""ID,ColumnB,ColumnC,ColumnD,Target
A,2,3,1,8
B,2,3,1,9
C,2,3,1,6""")

df = pd.read_csv(csv_data, index_col=0)

X_train, X_test, y_train, y_test = train_test_split(
    df[["ColumnC", "ColumnD"]], df["Target"], random_state=42,
)

reg = RandomForestRegressor()
reg.fit(X_train, y_train)

X_train and X_test are still DataFrame objects, so we can add columns holding the regressor's predictions:

X_train["train_predictions"] = reg.predict(X_train)
X_test["test_predictions"] = reg.predict(X_test)

X_test now looks like this, and X_train should look similar:

ColumnC  ColumnD  test_predictions
ID                                    
A         3        1             7.545
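Note that the row label in the output above is an original ID: train_test_split selects rows from pandas objects without resetting their index. A small check on the same toy data, written as a sketch, can confirm that every ID ends up in exactly one split:

```python
from io import StringIO

import pandas as pd
from sklearn.model_selection import train_test_split

# Same toy data as in the answer
csv_data = StringIO("""ID,ColumnB,ColumnC,ColumnD,Target
A,2,3,1,8
B,2,3,1,9
C,2,3,1,6""")
df = pd.read_csv(csv_data, index_col=0)

X_train, X_test, y_train, y_test = train_test_split(
    df[["ColumnC", "ColumnD"]], df["Target"], random_state=42,
)

# The two splits together cover every original ID...
ids_covered = set(X_train.index) | set(X_test.index)
# ...and no ID appears in both splits.
ids_overlap = set(X_train.index) & set(X_test.index)

# Features and targets of a split share the same labels, in the same order.
same_labels = list(X_train.index) == list(y_train.index)

print(ids_covered == set(df.index), ids_overlap, same_labels)
```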

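The index should be preserved at every step of the transformation, so we can drop the duplicated columns (ColumnC and ColumnD) and join what remains back onto the original df: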

df = df.join([
    X_train.drop(["ColumnC", "ColumnD"], axis=1),
    X_test.drop(["ColumnC", "ColumnD"], axis=1),
])

Which gives us:

ColumnB  ColumnC  ColumnD  Target  train_predictions  test_predictions
ID                                                                        
A         2        3        1       8                NaN             7.695
B         2        3        1       9              7.695               NaN
C         2        3        1       6              7.695               NaN
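If a single Predicted column is preferred, as in the question's expected outcome, one option is to build an index-preserving Series per split and concatenate them before assigning. This is a sketch under the same toy data; the names features, train_pred, test_pred and the extra split column are illustrative additions, not part of the original answer.

```python
from io import StringIO

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Same toy data as in the answer
csv_data = StringIO("""ID,ColumnB,ColumnC,ColumnD,Target
A,2,3,1,8
B,2,3,1,9
C,2,3,1,6""")
df = pd.read_csv(csv_data, index_col=0)

features = ["ColumnC", "ColumnD"]  # illustrative name
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Target"], random_state=42,
)

reg = RandomForestRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predict each split separately, keeping that split's index (the original IDs).
train_pred = pd.Series(reg.predict(X_train), index=X_train.index)
test_pred = pd.Series(reg.predict(X_test), index=X_test.index)

# Stack both Series; the assignment aligns on ID, not on row position.
df["Predicted"] = pd.concat([train_pred, test_pred])

# Optional: record which split each row came from.
df["split"] = "train"
df.loc[X_test.index, "split"] = "test"

print(df)
```

Keeping the split label next to the prediction makes it easy to evaluate only the test rows later, since predictions on training rows are not a fair measure of model quality.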
