How is feature importance calculated in sklearn's RandomForest (Python)?

cnjp1d6j posted on 2022-12-21 in Python

Following this tutorial on feature importance, I fitted a random forest with a single tree and plotted that tree:

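import matplotlib.pyplot as plt
import pandas as pd
from sklearn import tree
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()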
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.loc[:, df.columns != 'target']
y = df.loc[:, 'target'].values

X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=1,
                            max_depth=2,
                            max_features=2,
                            random_state=0)
rf.fit(X_train, Y_train)
rf.feature_importances_
array([0.        , 0.11197953, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.88802047, 0.        , 0.        , 0.        ])
fn=data.feature_names
cn=data.target_names
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
tree.plot_tree(rf.estimators_[0],
               feature_names=fn,
               class_names=cn,
               filled=True)
fig.savefig('rf_individualtree.png')


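Using the feature importance formula from the tutorial, I then computed the feature importance by hand (sklearn's results are 0.11197953 and 0.88802047):
[Images: manual feature importance calculation]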
My result differs from sklearn's. Which part did I get wrong, or does sklearn simply not follow the formula?

Answer #1 (qxsslcnc)

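You have two problems:
1. Rounding errors
2. The math, specifically the calculation of the probability of reaching a node
Once you correct them, you get sklearn's result: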

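print(rf.estimators_[0].tree_.impurity)
array([0.45899182, 0.26172737, 0.10250188, 0.45244126, 0.18549346,
       0.17300567, 0.        ])
# Importance of a split node:
#   P(node)*impurity(node) - P(left)*impurity(left) - P(right)*impurity(right)
# where P(node) = samples reaching the node / 426 training samples.
# Root split (P(root) = 1):
n1 = 0.45899182261015226 - (310/426)*0.26172736732570234 - (116/426)*0.1854934601664685
# Split of the 116-sample node (its other child has impurity 0, so that term vanishes):
n2 = (116/426)*0.1854934601664685 - (115/426)*0.17300567107750475
# Split of the 310-sample node:
n3 = (310/426)*0.26172736732570234 - (203/426)*0.10250188065713806 - (107/426)*0.45244126124552364
# Sum the node importances per feature, then normalize so they add up to 1.
f1 = n1+n2
f2 = n3
print(f1/(f1+f2), f2/(f1+f2))
(0.888020474590027, 0.11197952540997297)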
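
To get a feel for how much rounding alone can move the result, here is a small sketch that redoes the arithmetic above twice: once at full precision and once with every probability and impurity rounded first. The two-decimal rounding, the helper names `decrease` and `importances`, and the sample count of 1 for the last node (inferred as 116 - 115) are assumptions for illustration; the asker's actual working is not preserved.

```python
# Impurities copied from the calculation above, in tree_.impurity order (0 is the root).
impurity = [0.45899182261015226, 0.26172736732570234, 0.10250188065713806,
            0.45244126124552364, 0.1854934601664685, 0.17300567107750475, 0.0]
# Samples reaching each node; the final 1 is inferred (116 - 115) and its
# impurity is 0, so it does not affect the result.
samples = [426, 310, 203, 107, 116, 115, 1]
total = samples[0]


def decrease(parent, left, right, digits=None):
    """Weighted impurity decrease of one split, optionally rounding inputs first."""
    def term(i):
        p = samples[i] / total  # probability of reaching node i
        g = impurity[i]
        if digits is not None:
            p, g = round(p, digits), round(g, digits)
        return p * g
    return term(parent) - term(left) - term(right)


def importances(digits=None):
    """Normalized importances of the two features used by the tree."""
    n1 = decrease(0, 1, 4, digits)  # root split
    n2 = decrease(4, 5, 6, digits)  # split of the 116-sample node
    n3 = decrease(1, 2, 3, digits)  # split of the 310-sample node
    f1, f2 = n1 + n2, n3
    return f1 / (f1 + f2), f2 / (f1 + f2)


exact = importances()
rounded = importances(digits=2)
print("full precision:", exact)    # approximately (0.888, 0.112)
print("rounded inputs:", rounded)  # drifts away from sklearn's values
```

The full-precision run reproduces the numbers in the answer, while the rounded run lands visibly off, which is the kind of gap a hand calculation with truncated values can produce.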

(You can read more about how the importance is calculated in the explanation by the package developers, or by reading the source code.)
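
The same calculation can be written generically against the fitted tree's arrays instead of hard-coded numbers. The following is a minimal sketch, not sklearn's own implementation: `tree_importances` is a hypothetical helper, and it assumes the `tree_` attributes `children_left`, `children_right`, `feature`, `impurity`, `weighted_n_node_samples` and `node_count`, with `-1` marking a leaf's children.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def tree_importances(t, n_features):
    """Normalized weighted impurity decrease per feature for one fitted tree_."""
    w = t.weighted_n_node_samples
    out = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf: no split, no contribution
            continue
        out[t.feature[node]] += (
            w[node] * t.impurity[node]
            - w[left] * t.impurity[left]
            - w[right] * t.impurity[right]
        ) / w[0]  # divide by the root's weighted sample count
    total = out.sum()
    return out / total if total > 0 else out


data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)
rf = RandomForestClassifier(n_estimators=1, max_depth=2,
                            max_features=2, random_state=0)
rf.fit(X_train, y_train)

manual = tree_importances(rf.estimators_[0].tree_, X_train.shape[1])
print(np.allclose(manual, rf.feature_importances_))
```

Using `weighted_n_node_samples` rather than raw sample counts matters here because the forest trains each tree on a bootstrap sample, so the node probabilities come from the weighted counts.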
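Also note that what a RandomForest considers important may be less important to another model (and vice versa). In other words, "importance" here is model-specific, and it may not match what people who are more used to linear interpretability would understand or expect.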
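
One way to see this model dependence is to rank the same features with two different models. The sketch below is only an illustration under assumptions of my choosing: a 100-tree forest, a logistic regression on standardized features, and absolute coefficient size as the linear model's "importance". None of these choices come from the original answer.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Standardize first so coefficient magnitudes are comparable across features.
linear = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

rf_scores = rf.feature_importances_
lin_scores = np.abs(linear.named_steps["logisticregression"].coef_[0])


def top_features(scores, k=5):
    """Names of the k highest-scoring features, best first."""
    order = np.argsort(scores)[::-1][:k]
    return [str(data.feature_names[i]) for i in order]


rf_top = top_features(rf_scores)
lin_top = top_features(lin_scores)
print("random forest:      ", rf_top)
print("logistic regression:", lin_top)
```

Comparing the two printed lists shows how far the rankings agree or disagree on this dataset; neither ranking is the "true" one, since each reflects how its own model uses the features.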
