Python: how is feature importance calculated in sklearn's RandomForest?

Posted by cnjp1d6j on 2022-12-21 in Python


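import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the data set into a DataFrame with named feature columns
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.loc[:, df.columns != 'target']
y = df.loc[:, 'target'].values

X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)

# A single shallow tree keeps the importances easy to check by hand
rf = RandomForestClassifier(n_estimators=1,
                            max_depth=2,
                            max_features=2,
                            random_state=0)
rf.fit(X_train, Y_train)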

rf.feature_importances_
array([0.        , 0.11197953, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.88802047, 0.        , 0.        , 0.        ])

import matplotlib.pyplot as plt
from sklearn import tree

# Draw the only tree in the forest and save it as an image
fn = data.feature_names
cn = data.target_names
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
tree.plot_tree(rf.estimators_[0],
               feature_names=fn,
               class_names=cn,
               filled=True)
fig.savefig('rf_individualtree.png')

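I calculated the feature importance manually from the tree above (sklearn's result is 0.11197953 and 0.88802047):
[images of the manual calculation are not preserved]
Which part did I get wrong so that my result differs from sklearn's answer, or does sklearn simply not follow the formula?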


Source: https://stackoverflow.com/questions/67838367/how-feature-importance-is-calculated-in-sklearns-randomforest

1 Answer


qxsslcnc1#

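You have two problems:
1. Rounding errors
2. The math, specifically the calculation of the probability of reaching each node
Once you correct them, you get sklearn's result: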

print(rf.estimators_[0].tree_.impurity)

array([0.45899182, 0.26172737, 0.10250188, 0.45244126, 0.18549346,
       0.17300567, 0.        ])
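The node weights used in the calculation (310/426, 116/426, and so on) can be read from the same tree_ object instead of from the plot. The following is a minimal sketch that refits the model with the question's parameters and prints, for each node, the feature it splits on, its weighted sample count, and the resulting probability of reaching it. As far as I understand, with bootstrapping enabled (the forest's default) weighted_n_node_samples counts repeated draws, so it can differ from the "samples" value shown by plot_tree; treat that as an assumption worth checking on your own version. The exact numbers may also vary between sklearn versions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same data split and model parameters as in the question
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=1, max_depth=2,
                            max_features=2, random_state=0)
rf.fit(X_train, Y_train)

t = rf.estimators_[0].tree_
total = t.weighted_n_node_samples[0]  # weight of the root node

for node in range(t.node_count):
    is_leaf = t.children_left[node] == -1  # -1 marks a leaf
    label = "leaf" if is_leaf else f"split on feature {t.feature[node]}"
    weight = t.weighted_n_node_samples[node]
    print(f"node {node}: {label}, weight={weight:.0f}, "
          f"p(reach)={weight / total:.6f}, impurity={t.impurity[node]:.6f}")
```

Using full-precision values from these arrays, rather than the rounded numbers printed in the plot, also avoids the rounding problem.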

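# Weighted impurity decrease of each split node (426 is the weight of the root)
n1 = 0.45899182261015226 - (310/426)*0.26172736732570234 - (116/426)*0.1854934601664685  # root
n2 = (116/426)*0.1854934601664685 - (115/426)*0.17300567107750475  # second child of the root; its other child has impurity 0
n3 = (310/426)*0.26172736732570234 - (203/426)*0.10250188065713806 - (107/426)*0.45244126124552364  # first child of the root
# Sum the decreases per feature: n1 and n2 belong to one feature, n3 to the other
f1 = n1+n2
f2 = n3
# Normalize so that the importances sum to 1
print(f1/(f1+f2), f2/(f1+f2))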

(0.888020474590027, 0.11197952540997297)
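The same arithmetic can be written once for any fitted tree. The sketch below is my own helper (manual_importances is a hypothetical name, not part of sklearn): for every split node it adds the weighted impurity decrease to the feature used at that node, divides by the root weight, and normalizes the result. It then compares the outcome with feature_importances_ on the model from the question.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def manual_importances(decision_tree):
    """Normalized weighted impurity decrease per feature for one fitted tree."""
    t = decision_tree.tree_
    w = t.weighted_n_node_samples
    importances = np.zeros(t.n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf: no split, no contribution
            continue
        decrease = (w[node] * t.impurity[node]
                    - w[left] * t.impurity[left]
                    - w[right] * t.impurity[right]) / w[0]
        importances[t.feature[node]] += decrease
    total = importances.sum()
    return importances / total if total > 0 else importances


# Same data split and model parameters as in the question
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=1, max_depth=2,
                            max_features=2, random_state=0)
rf.fit(X_train, Y_train)

manual = manual_importances(rf.estimators_[0])
print(np.allclose(manual, rf.feature_importances_))
```

With a single estimator the forest's importances equal those of its only tree; with more trees you would average the per-tree vectors before comparing.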

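(You can read more about how the importance is calculated here, from the package developers, or here, by reading the source code.)
Also note that what a RandomForest considers important may be less important for another model, and vice versa. In other words, "importance" here is model-specific, and it may not match the intuition or expectations of people who are more used to the interpretability of linear models.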
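To see how model-specific these scores are, you could compare the forest's ranking with that of a different model. The sketch below is only a rough illustration under my own assumptions: it uses a logistic regression on standardized features and treats the absolute coefficient size as that model's "importance", which is one common convention rather than a definition. The two top-5 lists need not agree.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target

# Impurity-based importances from a forest
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
forest_scores = forest.feature_importances_

# Assumption: |coefficient| on standardized inputs as a linear "importance"
linear = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
linear_scores = np.abs(linear[-1].coef_[0])


def top_features(scores, k=5):
    """Names of the k highest-scoring features, best first."""
    order = np.argsort(scores)[::-1][:k]
    return [str(data.feature_names[i]) for i in order]


print("forest:", top_features(forest_scores))
print("linear:", top_features(linear_scores))
```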

Posted 2022-12-21
