References:
6.3. Preprocessing data — scikit-learn 0.24.2 documentation
为什么要用one-hot编码 (Why use one-hot encoding) – 简书 (jianshu.com)
1. Explanation from the official documentation
6.3. Preprocessing data — scikit-learn 0.24.2 documentation
Such integer representation can, however, not be used directly with all scikit-learn estimators, as these expect continuous input, and would interpret the categories as being ordered, which is often not desired (i.e. the set of browsers was ordered arbitrarily).
Another possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K, also known as one-hot or dummy encoding. This type of encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1, and all others 0.
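A minimal sketch of this behavior (assuming scikit-learn is installed; the browser values below are illustrative sample data, not from the original text):

from sklearn.preprocessing import OneHotEncoder

# Three possible browser categories, given as strings
X = [['Firefox'], ['Chrome'], ['Safari'], ['Chrome']]

# sparse=False returns a dense array, which is easier to print
# (parameter name as in the 0.24.x API referenced above)
enc = OneHotEncoder(sparse=False)
X_onehot = enc.fit_transform(X)

print(enc.categories_)   # [array(['Chrome', 'Firefox', 'Safari'], dtype=object)]
print(X_onehot)
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]

Each row contains exactly one 1, marking which of the n_categories values that sample takes.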
2. A more intuitive explanation in terms of distance
Applying one-hot encoding to discrete (categorical) features makes distance computations between features more reasonable.
For example, consider a categorical feature representing job type with three possible values. Without one-hot encoding, the three jobs would be represented as x_1 = (1), x_2 = (2), x_3 = (3).
The distances between jobs are then d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. Does that mean jobs x_1 and x_3 are less similar to each other? Clearly the distances computed from this representation are not reasonable.
With one-hot encoding, we instead get x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1), and the distance between any two jobs becomes sqrt(2). Every pair of jobs is now equidistant, which is more reasonable.
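A small sketch that reproduces this distance comparison with NumPy (the job-type vectors x_1, x_2, x_3 are the ones from the text):

import numpy as np

# Integer encoding: the gap between labels creates artificial distances
x_int = np.array([[1.0], [2.0], [3.0]])
print(np.linalg.norm(x_int[0] - x_int[1]))   # 1.0
print(np.linalg.norm(x_int[0] - x_int[2]))   # 2.0 -- suggests x_1 and x_3 are "less similar"

# One-hot encoding: every pair of distinct categories is equidistant
x_hot = np.eye(3)   # rows are (1, 0, 0), (0, 1, 0), (0, 0, 1)
print(np.linalg.norm(x_hot[0] - x_hot[1]))   # 1.414... = sqrt(2)
print(np.linalg.norm(x_hot[0] - x_hot[2]))   # 1.414... = sqrt(2)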