sklearn.preprocessing()详解: 标准化、正则化、最小最大规范化、特征二值化

本文链接：https://blog.csdn.net/MaeveShi/article/details/107834902

本文详细介绍了scikit-learn库中的数据预处理方法，包括StandardScaler进行数据的标准化与归一化，MinMaxScaler实现最小最大规范化，Normalizer进行正则化/归一化操作，以及Binarizer的特征二值化过程。每个方法的原理、参数和实例均进行了阐述，旨在帮助理解并应用这些预处理技术。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

一. 数据的标准化与归一化(zero-mean normalization): class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)

官方文档-StandardScaler

standard score(z) of a sample x: z = (x - u) / s
u: the mean of training samples (u = 0 if with_mean = False)
s: the standard deviation of the training samples (s = 1 if with_std = False)
Parameters and Attributes:

例子：

from sklearn.preprocessing import StandardScaler
data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler()
print(scaler.fit(data))

output: StandardScaler()

print(scaler.mean_)
print(scaler.var_)

output:
array([0.5, 0.5])
array([0.25, 0.25])

其中scaler.fit(data)，即StandardScaler.fit(data)计算出数据的平均值和标准差，并存储在StandardScaler()中便于之后的使用；
调用attributes中的mean_和var_求数据的平均值和方差.
除了fit()之外，StandardScaler()还有许多不同的methods：

Popular Methods:

fit(): compute the mean and std to be used for later scaling
fit_transform(): fit to data, then transform it. 即计算出mean和std，并对数据作转换，从而变成标准的正态分布
transform(): perform standardization by centering and scaling. 即只进行转换成标准正态分布的操作

注意：

一般来说先使用fit，再使用transform
将训练集和测试集放在一起做标准化；或者在训练集上作标准化后，用相同的标准化器(scaler())去标准化测试集，必须用同一个scaler进行transform

例：

data = [[0, 0], [0, 0], [1, 1], [1, 1]]
# 基于mean和std的标准化
scaler = preprocessing.StandardScaler().fit(train_data)
scaler.transform(train_data)
scaler.transform(test_data)

二. 最小最大规范化(min-max normalization) class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), *, copy=True)

官方文档-MinMaxScaler

min-max normalization: transform features by scaling each feature to a given range.

transform features to the range(0,1)

The transform is given by:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
where min, max = feature_range. (default: max = 1, min = 0)
Parameters and Attributes:
Methods:

fit(): compute the min and max for later scaling
transform(): scale features according to feature_range
fit_transform(): fit to data, then transform it.

例：

基于所给的数据创建最小最大转换器

from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
print(scaler.fit(data))

output:
MinMaxScaler()

根据attribute中data_max_求出转换前数据的最大值

print(scaler.data_max_)

output: [ 1. 18.]

对数据集进行最小最大规范化

print(scaler.transform(data))

output：
[[0. 0. ]
[0.25 0.25]
[0.5 0.5 ]
[1. 1. ]]

三. 正则化/归一化(normalization)

normalization: Scale input vectors individually to unit norm (vector length)
首先求出样本的p范数，然后该样本的所有元素都要除以该范数，这样最终使得每个样本的范数都是1。规范化（Normalization）是将不同变化范围的值映射到相同的固定范围，常见的是[0,1]，也成为归一化。
函数：sklearn.preprocessing.normalize(X, norm=‘l2’, , axis=1, copy=True, return_norm=False)
类：sklearn.preprocessing.Normalizer(norm=‘l2’, , copy=True)

sklearn.preprocessing.normalize(X, norm=‘l2’, , axis=1, copy=True, return_norm=False)

Parameters and Returns:

例：

import sklearn.preprocessing
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = sklearn.preprocessing.normalize(X, norm='l2')
print(X_normalized)

output:
[[ 0.40824829 -0.40824829 0.81649658]
[ 1. 0. 0. ]
[ 0. 0.70710678 -0.70710678]]

sklearn.preprocessing.Normalizer(norm=‘l2’, , copy=True)

Parameters:

例：

from sklearn.preprocessing import Normalizer
normalizer = Normalizer().fit(X)#fit method is useless in this case
print(normalizer)

output: Normalizer()

print(normalizer.transform(X))

output:
array([[ 0.40824829, -0.40824829, 0.81649658],
[ 1. , 0. , 0. ],
[ 0. , 0.70710678, -0.70710678]])

和直接利用函数normalize结果相同。

四. 特征二值化（Binarization）class sklearn.preprocessing.Binarizer(*, threshold=0.0, copy=True)

Binarize data (set feature values to 0 or 1) according to a threshold
Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.

Parameters:

例子：

from sklearn.preprocessing import Binarizer
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
transformer = Binarizer().fit(X) #fit does nothing
print(transformer)