References:
6.3. Preprocessing data — scikit-learn 0.24.2 documentation
为什么要用one-hot编码 (Why use one-hot encoding) – 简书 (jianshu.com)
1. Explanation from the official documentation
6.3. Preprocessing data — scikit-learn 0.24.2 documentation
Such integer representation can, however, not be used directly with all scikit-learn estimators, as these expect continuous input, and would interpret the categories as being ordered, which is often not desired (i.e. the set of browsers was ordered arbitrarily).
Another possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K, also known as one-hot or dummy encoding. This type of encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1, and all others 0.
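A minimal sketch of this behavior (assuming scikit-learn is installed; the browser values below are illustrative sample data, not from the original text):

from sklearn.preprocessing import OneHotEncoder

# Three possible browser categories, given as strings
X = [['Firefox'], ['Chrome'], ['Safari'], ['Chrome']]

# sparse=False returns a dense array, which is easier to print
# (parameter name as in the 0.24.x API referenced above)
enc = OneHotEncoder(sparse=False)
X_onehot = enc.fit_transform(X)

print(enc.categories_)   # [array(['Chrome', 'Firefox', 'Safari'], dtype=object)]
print(X_onehot)
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]

Each row contains exactly one 1, marking which of the n_categories values that sample takes.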
2. A more intuitive explanation in terms of distance
Applying one-hot encoding to discrete (categorical) features makes distance computations between features more reasonable.
For example, consider a categorical feature representing job type with three possible values. Without one-hot encoding, the three jobs would be represented as x_1 = (1), x_2 = (2), x_3 = (3).
The distances between jobs are then d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. Does that mean jobs x_1 and x_3 are less similar to each other? Clearly the distances computed from this representation are not reasonable.
With one-hot encoding, we instead get x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1), and the distance between any two jobs becomes sqrt(2). Every pair of jobs is now equidistant, which is more reasonable.
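A small sketch that reproduces this distance comparison with NumPy (the job-type vectors x_1, x_2, x_3 are the ones from the text):

import numpy as np

# Integer encoding: the gap between labels creates artificial distances
x_int = np.array([[1.0], [2.0], [3.0]])
print(np.linalg.norm(x_int[0] - x_int[1]))   # 1.0
print(np.linalg.norm(x_int[0] - x_int[2]))   # 2.0 -- suggests x_1 and x_3 are "less similar"

# One-hot encoding: every pair of distinct categories is equidistant
x_hot = np.eye(3)   # rows are (1, 0, 0), (0, 1, 0), (0, 0, 1)
print(np.linalg.norm(x_hot[0] - x_hot[1]))   # 1.414... = sqrt(2)
print(np.linalg.norm(x_hot[0] - x_hot[2]))   # 1.414... = sqrt(2)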