Why use one-hot encoding after LabelEncoder?




Table of Contents


1. The official documentation's explanation


2. A better explanation in terms of distance


References:


6.3. Preprocessing data — scikit-learn 0.24.2 documentation


Why use one-hot encoding – 简书 (jianshu.com)

1. The official documentation's explanation


6.3. Preprocessing data — scikit-learn 0.24.2 documentation

Such integer representation can, however, not be used directly with all scikit-learn estimators, as these expect continuous input, and would interpret the categories as being ordered, which is often not desired (i.e. the set of browsers was ordered arbitrarily).

Another possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K, also known as one-hot or dummy encoding. This type of encoding can be obtained with the `OneHotEncoder`, which transforms each categorical feature with `n_categories` possible values into `n_categories` binary features, with one of them 1, and all others 0.
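A minimal sketch of the two encoders side by side (the job names here are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

jobs = np.array(["teacher", "doctor", "engineer", "teacher"])

# LabelEncoder maps each category to an integer; the order is
# alphabetical, hence arbitrary -- it carries no real meaning.
le = LabelEncoder()
labels = le.fit_transform(jobs)
print(labels)  # [2 0 1 2]

# OneHotEncoder turns each category into n_categories binary
# features, exactly one of which is 1 per row.
ohe = OneHotEncoder()
onehot = ohe.fit_transform(jobs.reshape(-1, 1)).toarray()
print(onehot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```

Note that `OneHotEncoder` expects a 2-D input (one column per categorical feature), while `LabelEncoder` works on a 1-D array of target labels.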

2. A better explanation in terms of distance

Applying one-hot encoding to a discrete feature makes distance computations between feature values more reasonable.

For example, consider a discrete feature representing job type, with three possible values. Without one-hot encoding, the three values would be represented as x_1 = (1), x_2 = (2), x_3 = (3).

The distances between jobs are then d(x_1, x_2) = 1, d(x_2, x_3) = 1, and d(x_1, x_3) = 2. Does that mean jobs x_1 and x_3 are less similar than the other pairs? Clearly not: the categories are unordered, so distances computed from this representation are not reasonable.

With one-hot encoding we instead get x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1), and the distance between any two jobs is sqrt(2). Every pair of jobs is equidistant, which is far more reasonable for unordered categories.
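The distance argument above can be checked directly with NumPy (Euclidean distance via `np.linalg.norm`):

```python
import numpy as np

# Integer-coded jobs: the distances are artifacts of the arbitrary ordering.
x1, x2, x3 = np.array([1.0]), np.array([2.0]), np.array([3.0])
print(np.linalg.norm(x1 - x3))  # 2.0 -- job 1 looks "twice as far" from job 3

# One-hot coded jobs: the rows of the identity matrix.
e1, e2, e3 = np.eye(3)
for a, b in [(e1, e2), (e2, e3), (e1, e3)]:
    print(np.linalg.norm(a - b))  # sqrt(2) ~ 1.4142 for every pair
```

Because every pair of one-hot vectors differs in exactly two coordinates, all pairwise Euclidean distances collapse to sqrt(2), matching the claim that unordered categories should be equidistant.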



Copyright notice: this is an original article by qq_41973062, released under the CC 4.0 BY-SA license. Please include a link to the original source and this notice when reposting.