The feature_importances_ attribute of the sklearn.tree.DecisionTreeClassifier class returns the importance of each feature; the higher the value, the more important the feature. The scikit-learn official documentation [1] explains it as follows:
The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
That is, while the tree is being built, each feature accumulates the reduction of the criterion (the Gini index [2] or information entropy [3]) that its splits bring about; a feature's importance is the normalized value of this criterion reduction. The default implementation in scikit-learn is what is commonly called the Gini importance.
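To make the two criteria concrete, here is a minimal sketch of how they can be computed for a list of class labels (my own helper functions, not scikit-learn's internal code):

```python
from collections import Counter
import math

def gini(labels):
    # Gini index: 1 - sum over classes of p_k^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # Information entropy: -sum over classes of p_k * log2(p_k)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Labels of the toy dataset used later in this post: three 1s, one 0
print(gini([1, 0, 1, 1]))     # 1 - (0.75^2 + 0.25^2) = 0.375
print(entropy([1, 0, 1, 1]))  # ≈ 0.811
```

The Gini value 0.375 is exactly the root impurity that appears in the worked example below.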
In a Stack Overflow post [4], the answerer Seljuk Gülcan points out that if every feature is used only once, feature_importances_ should be exactly this Gini importance:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

Here N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the node's left child, and N_t_R is the number of samples in its right child. impurity translates literally as "impurity" (the Gini index or information entropy); the implementation here uses the Gini index.
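The per-node formula above can be written as a plain function (the function name and argument names are my own, chosen to mirror the symbols in the formula):

```python
def node_importance(N, N_t, N_t_L, N_t_R,
                    impurity, left_impurity, right_impurity):
    # Weighted impurity decrease contributed by one split node
    return (N_t / N) * (impurity
                        - N_t_R / N_t * right_impurity
                        - N_t_L / N_t * left_impurity)

# Root split of the example tree below: N = 4 samples, Gini 0.375;
# the right child holds 3 samples with Gini ≈ 0.444, the left child is pure.
print(node_importance(4, 4, 1, 3, 0.375, 0.0, 0.444))  # ≈ 0.042
```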
As an example, consider the decision tree built by the following code:
# export_graphviz now lives directly under sklearn.tree
# (the old sklearn.tree.export path is deprecated)
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X = [[1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 0, 1, 1]
clf = DecisionTreeClassifier()
clf.fit(X, y)
feat_importance = clf.tree_.compute_feature_importances(normalize=False)
print("feat importance = " + str(feat_importance))
export_graphviz(clf, out_file='test/tree.dot')
The final output is:
feat importance = [0.25 0.08333333 0.04166667]
The resulting decision tree is illustrated below:
For feature X[2]:
feature_importance = (4 / 4) * (0.375 - (3 / 4 * 0.444)) = 0.042
For feature X[1]:
feature_importance = (3 / 4) * (0.444 - (2 / 3 * 0.5)) = 0.083
For feature X[0]:
feature_importance = (2 / 4) * (0.5) = 0.25
These values match the output above.
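Note that the code above calls compute_feature_importances(normalize=False); the public feature_importances_ attribute returns the normalized values, i.e. each entry divided by the total. A quick sketch of that relationship, using the unnormalized values printed above:

```python
# Unnormalized importances from the run above
unnormalized = [0.25, 0.08333333, 0.04166667]
total = sum(unnormalized)                       # 0.375
normalized = [v / total for v in unnormalized]  # what feature_importances_ returns
print(normalized)  # ≈ [0.667, 0.222, 0.111]
```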
- $Entropy(p)=-\displaystyle\sum_{k=1}^K p_k\log_2 p_k$ ↩︎
- $Gini(p)=\displaystyle\sum_{k=1}^K p_k(1-p_k)=1-\displaystyle\sum_{k=1}^K p_k^2$ ↩︎