The feature_importances_ attribute of the sklearn.tree.DecisionTreeClassifier class returns the importance of each feature; the higher the value, the more important the feature. The scikit-learn official documentation [1] explains it as follows:
The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
That is, while the tree is being built, each feature accumulates the reduction of the criterion (the Gini index [2] or information entropy [3]) that its splits bring about; a feature's importance is the normalized value of this criterion reduction. The default implementation in scikit-learn is what is commonly called the Gini importance.
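To make the two criteria concrete, here is a minimal sketch of how they can be computed for a list of class labels (my own helper functions, not scikit-learn's internal code):

```python
from collections import Counter
import math

def gini(labels):
    # Gini index: 1 - sum over classes of p_k^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # Information entropy: -sum over classes of p_k * log2(p_k)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Labels of the toy dataset used later in this post: three 1s, one 0
print(gini([1, 0, 1, 1]))     # 1 - (0.75^2 + 0.25^2) = 0.375
print(entropy([1, 0, 1, 1]))  # ≈ 0.811
```

The Gini value 0.375 is exactly the root impurity that appears in the worked example below.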
In a Stack Overflow post [4], the answerer Seljuk Gülcan points out that if every feature is used only once, feature_importances_ should be exactly this Gini importance:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

Here N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the node's left child, and N_t_R is the number of samples in its right child. impurity translates literally as "impurity" (the Gini index or information entropy); the implementation here uses the Gini index.
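The per-node formula above can be written as a plain function (the function name and argument names are my own, chosen to mirror the symbols in the formula):

```python
def node_importance(N, N_t, N_t_L, N_t_R,
                    impurity, left_impurity, right_impurity):
    # Weighted impurity decrease contributed by one split node
    return (N_t / N) * (impurity
                        - N_t_R / N_t * right_impurity
                        - N_t_L / N_t * left_impurity)

# Root split of the example tree below: N = 4 samples, Gini 0.375;
# the right child holds 3 samples with Gini ≈ 0.444, the left child is pure.
print(node_importance(4, 4, 1, 3, 0.375, 0.0, 0.444))  # ≈ 0.042
```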
As an example, consider the decision tree built by the following code:
# export_graphviz now lives directly under sklearn.tree
# (the old sklearn.tree.export path is deprecated)
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X = [[1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 0, 1, 1]
clf = DecisionTreeClassifier()
clf.fit(X, y)
feat_importance = clf.tree_.compute_feature_importances(normalize=False)
print("feat importance = " + str(feat_importance))
export_graphviz(clf, out_file='test/tree.dot')
The final output is:
feat importance = [0.25 0.08333333 0.04166667]
The resulting decision tree is illustrated below:
For feature X[2]:
feature_importance = (4 / 4) * (0.375 - (3 / 4 * 0.444)) = 0.042
For feature X[1]:
feature_importance = (3 / 4) * (0.444 - (2 / 3 * 0.5)) = 0.083
For feature X[0]:
feature_importance = (2 / 4) * (0.5) = 0.25
These values match the output above.
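Note that the code above calls compute_feature_importances(normalize=False); the public feature_importances_ attribute returns the normalized values, i.e. each entry divided by the total. A quick sketch of that relationship, using the unnormalized values printed above:

```python
# Unnormalized importances from the run above
unnormalized = [0.25, 0.08333333, 0.04166667]
total = sum(unnormalized)                       # 0.375
normalized = [v / total for v in unnormalized]  # what feature_importances_ returns
print(normalized)  # ≈ [0.667, 0.222, 0.111]
```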
- $Entropy(p)=-\displaystyle\sum_{k=1}^K p_k\log_2 p_k$ ↩︎
- $Gini(p)=\displaystyle\sum_{k=1}^K p_k(1-p_k)=1-\displaystyle\sum_{k=1}^K p_k^2$ ↩︎