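For data preprocessing, sklearn provides a dedicated module, `sklearn.preprocessing`, which covers the common preprocessing operations; see the official documentation (available in English and Chinese) for details.
a. Why?
b. The function `scale`: standardizes an array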
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
X_scaled = preprocessing.scale(X_train)
X_scaled
# array([[ 0. ..., -1.22..., 1.33...],
# [ 1.22..., 0. ..., -0.26...],
# [-1.22..., 1.22..., -1.06...]])
# The scaled data has zero mean and unit variance
X_scaled.mean(axis=0)
# array([0., 0., 0.])
X_scaled.std(axis=0)
# array([1., 1., 1.])
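To make explicit what `scale` computes, the following minimal sketch reproduces it by hand with NumPy. It assumes the default settings (per-column centering and division by the population standard deviation, i.e. `ddof=0`); the names `X_manual` and `X_sklearn` are illustrative only.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Column-wise standardization: subtract the mean, divide by the std (ddof=0)
X_manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
X_sklearn = preprocessing.scale(X_train)

# Both routes should agree up to floating-point error
print(np.allclose(X_manual, X_sklearn))
```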
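c. The class `StandardScaler`: learns the mean and standard deviation on the training set so that the same transformation can be re-applied to the test set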
scaler = preprocessing.StandardScaler().fit(X_train)
scaler
# StandardScaler(copy=True, with_mean=True, with_std=True)
scaler.mean_
# array([1. ..., 0. ..., 0.33...])
scaler.scale_
# array([0.81..., 0.81..., 1.24...])
scaler.transform(X_train)
# array([[ 0. ..., -1.22..., 1.33...],
# [ 1.22..., 0. ..., -0.26...],
# [-1.22..., 1.22..., -1.06...]])
X_test = [[-1., 1., 0.]]
scaler.transform(X_test)
# array([[-2.44..., 1.22..., -0.26...]])
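The test-set result above can be checked by hand: `transform` reuses the statistics stored in `mean_` and `scale_` during `fit`. The sketch below illustrates that, and also shows `inverse_transform` undoing the scaling. The variable names are illustrative, not part of the original notes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-1., 1., 0.]])

scaler = StandardScaler().fit(X_train)

# transform() applies the statistics learned from the training data only
X_test_manual = (X_test - scaler.mean_) / scaler.scale_
X_test_scaled = scaler.transform(X_test)

# inverse_transform() maps the scaled values back to the original units
X_test_restored = scaler.inverse_transform(X_test_scaled)

print(np.allclose(X_test_manual, X_test_scaled))
print(np.allclose(X_test_restored, X_test))
```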
d. Scaling features to a given range
`MinMaxScaler`: scales each feature to a given range, [0, 1] by default
X_train = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax
# array([[0.5 , 0. , 1. ],
# [1. , 0.5 , 0.33333333],
# [0. , 1. , 0. ]])
# Apply to the test set; values outside the training range fall outside [0, 1]
X_test = np.array([[ -3., -1., 4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax
# array([[-1.5 , 0. , 1.66666667]])
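The target range does not have to be [0, 1]. As a small sketch, the snippet below passes `feature_range=(-1, 1)` and compares the result with the min-max formula written out in NumPy. The names `lo`, `hi`, `X_std` and `X_manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

lo, hi = -1.0, 1.0
scaler = MinMaxScaler(feature_range=(lo, hi))
X_scaled = scaler.fit_transform(X_train)

# The same computation by hand: rescale to [0, 1], then stretch to [lo, hi]
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_std = (X_train - col_min) / (col_max - col_min)
X_manual = X_std * (hi - lo) + lo

print(np.allclose(X_scaled, X_manual))
```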
`MaxAbsScaler`: divides each feature by its maximum absolute value, so the training data lies in [-1, 1]
X_train = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs
# array([[ 0.5, -1. , 1. ],
# [ 1. , 0. , 0. ],
# [ 0. , 1. , -0.5]])
X_test = np.array([[ -3., -1., 4.]])
X_test_maxabs = max_abs_scaler.transform(X_test)
X_test_maxabs
# array([[-1.5, -1. , 2. ]])
max_abs_scaler.scale_
# array([2., 1., 2.])
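e. Scaling sparse matrices
`scale` and `StandardScaler` accept sparse input, provided `with_mean=False` is specified; centering would otherwise destroy the sparsity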
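A minimal sketch of scaling sparse input, assuming SciPy is installed and using a small made-up CSR matrix: with `with_mean=False` the scaler only divides by the per-column standard deviation, so zero entries stay zero and the result remains sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Illustrative sparse data; the middle column is entirely zero
X_sparse = sparse.csr_matrix(np.array([[1., 0., 2.],
                                       [0., 0., 3.],
                                       [4., 0., 0.]]))

# with_mean=False skips centering, which would turn zeros into non-zeros
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X_sparse)

print(sparse.issparse(X_scaled))     # the output is still a sparse matrix
print(X_scaled.nnz == X_sparse.nnz)  # no new non-zero entries were created
```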
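f. Scaling data with outliers
`robust_scale` and `RobustScaler` can be used as alternatives, since they rely on statistics that are robust to outliers (the median and the interquartile range by default)
a. Commonly used: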
b. Mapping to a uniform distribution:
the class `QuantileTransformer` and the function `quantile_transform`
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
X_train_trans = quantile_transformer.fit_transform(X_train)
X_test_trans = quantile_transformer.transform(X_test)
np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])
# array([ 4.3, 5.1, 5.8, 6.5, 7.9])
# After the transform, the percentiles are close to their nominal values (0, 0.25, 0.5, 0.75, 1)
np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])
# array([ 0.00... , 0.24..., 0.49..., 0.73..., 0.99... ])
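The function form, `quantile_transform`, is not demonstrated above. As a hedged sketch on synthetic log-normal data (the data and the choice `n_quantiles=100` are assumptions made for this illustration), it maps a skewed feature onto an approximately uniform one:

```python
import numpy as np
from sklearn.preprocessing import quantile_transform

# Synthetic, strongly right-skewed data (illustrative only)
rng = np.random.RandomState(0)
X = rng.lognormal(size=(200, 1))

# n_quantiles must not exceed the number of samples
X_uniform = quantile_transform(X, n_quantiles=100,
                               output_distribution='uniform',
                               random_state=0, copy=True)

# The transformed values lie in [0, 1] with a median near 0.5
print(X_uniform.min(), np.median(X_uniform), X_uniform.max())
```

Passing `output_distribution='normal'` instead maps the data towards a Gaussian rather than a uniform distribution.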
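c. Mapping to a Gaussian distribution
`PowerTransformer` provides two transforms: the Yeo-Johnson transform and the Box-Cox transform. Box-Cox requires strictly positive input, while Yeo-Johnson also accepts zero and negative values.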
pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
X_lognormal
# array([[1.28..., 1.18..., 0.84...],
# [0.94..., 1.60..., 0.38...],
# [1.35..., 0.21..., 1.09...]])
pt.fit_transform(X_lognormal)
# array([[ 0.49..., 0.17..., -0.15...],
# [-0.05..., 0.58..., -0.57...],
# [ 0.69..., -0.84..., 0.10...]])
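The example above uses Box-Cox. For comparison, here is a small sketch of the Yeo-Johnson method on synthetic data that contains negative values (a shifted exponential sample, chosen only for illustration). It also shows that Box-Cox rejects such input.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Skewed synthetic data with negative values (illustrative only)
rng = np.random.RandomState(0)
X = rng.exponential(size=(100, 2)) - 1.0

# Yeo-Johnson handles negative values; standardize=True is the default,
# so the output has zero mean and unit variance per column
pt_yj = PowerTransformer(method='yeo-johnson')
X_yj = pt_yj.fit_transform(X)
print(X_yj.mean(axis=0), X_yj.std(axis=0))

# Box-Cox is only defined for strictly positive data
try:
    PowerTransformer(method='box-cox').fit(X)
    box_cox_failed = False
except ValueError:
    box_cox_failed = True
print(box_cox_failed)
```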
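- Different source distributions respond differently to the transform: some become approximately Gaussian, while for others the result is less satisfactory ![](https://sklearn.apachecn.org/docs/img/sphx_glr_plot_map_data_to_normal_0011.png)
- Normalization is the process of scaling individual samples to have unit norm
- The function `normalize`: operates on an array, using the `l1` or `l2` norm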
X = [[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized
# array([[ 0.40..., -0.40..., 0.81...],
# [ 1. ..., 0. ..., 0. ...],
# [ 0. ..., 0.70..., -0.70...]])
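As a quick check of what "unit norm" means here, the sketch below normalizes the same array with the `l1` norm via the function and with the `l2` norm via the class `Normalizer`, then verifies that every row (sample) has norm 1. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Function form with the l1 norm: absolute values in each row sum to 1
X_l1 = normalize(X, norm='l1')

# Class form with the l2 norm: each row has Euclidean length 1
X_l2 = Normalizer(norm='l2').fit_transform(X)

print(np.abs(X_l1).sum(axis=1))
print(np.linalg.norm(X_l2, axis=1))
```

Note that normalization works per row, whereas the scalers earlier in these notes work per column.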
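a. Categorical (nominal) features: non-continuous values that can be encoded as integers
b. The class `OrdinalEncoder`:
- Encodes each categorical feature as a new integer feature (a number from 0 to num_category-1)
- These integers should not be fed directly to most estimators, because they would be interpreted as ordered, while the categories are in fact unordered
- One-hot encoding is generally used instead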
enc = preprocessing.OrdinalEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
# OrdinalEncoder(categories='auto', dtype=<... 'numpy.float64'>)
enc.transform([['female', 'from US', 'uses Safari']])
# array([[0., 1., 1.]])
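The integer codes follow the sorted order of the categories seen during `fit`, and the mapping is reversible. A minimal sketch, reusing the same toy data; `codes` and `restored` are names introduced for illustration:

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

enc = OrdinalEncoder()
codes = enc.fit_transform(X)

# Each feature's categories are sorted; the code is the index in that list
print(enc.categories_)
print(codes)

# inverse_transform() recovers the original labels from the codes
restored = enc.inverse_transform(codes)
print(restored)
```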
c. One-hot encoding (also known as dummy encoding), via the class `OneHotEncoder`
enc = preprocessing.OneHotEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
# OneHotEncoder(categorical_features=None, categories=None, drop=None,
# dtype=<... 'numpy.float64'>, handle_unknown='error',
# n_values=None, sparse=True)
enc.transform([['female', 'from US', 'uses Safari'],
['male', 'from Europe', 'uses Safari']]).toarray()
# array([[1., 0., 0., 1., 0., 1.],
# [0., 1., 1., 0., 0., 1.]])
# The categories_ attribute lists the categories found for each feature, which determines the number of encoded columns
enc.categories_
# [array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
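The categories of each feature can also be specified explicitly: supply the set of possible values up front, and the encoding is then based on that set, which fixes the number of columns per feature: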
genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
# Note that some categories of the 2nd and 3rd features
# do not appear in the training data
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
# array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
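When the full set of categories is not known in advance, an alternative is to let the encoder ignore unseen values. The sketch below uses `handle_unknown='ignore'`, under which an unknown category is encoded as all zeros for that feature; the variable name `encoded` is illustrative.

```python
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

# With handle_unknown='ignore', unseen categories do not raise an error
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X)

# 'from Asia' and 'uses Chrome' were never seen during fit,
# so their one-hot blocks are all zeros
encoded = enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
print(encoded)
```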
a. Discretization:
b. K-bins discretization:
X = np.array([[ -3., 5., 15 ],
[ 0., 6., 14 ],
[ 6., 3., 11 ]])
est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)
est.transform(X)
# array([[ 0., 1., 1.],
# [ 1., 1., 1.],
# [ 2., 0., 0.]])
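To see where the bin boundaries come from, the fitted estimator exposes them in `bin_edges_`. The sketch below uses `strategy='uniform'` (equal-width bins) on a made-up single-column array, because the edges are then easy to verify by hand; this differs from the example above, which relies on the default strategy.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# One feature ranging from 0 to 10 (illustrative data)
X = np.array([[0.], [1.], [4.], [9.], [10.]])

# Two equal-width bins over [0, 10]: [0, 5) and [5, 10]
est = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform').fit(X)

print(est.bin_edges_[0])        # edges for the first (only) feature
print(est.transform(X).ravel()) # bin index assigned to each sample
```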
d. Feature binarization: the process of thresholding numerical features to obtain boolean values
X = [[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]]
binarizer = preprocessing.Binarizer().fit(X) # fit does nothing
binarizer
binarizer.transform(X)
# array([[1., 0., 1.],
# [1., 0., 0.],
# [0., 1., 0.]])
# A different threshold can be used
binarizer = preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)
# array([[0., 0., 1.],
# [1., 0., 0.],
# [0., 0., 0.]])
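a. Why?
- Adds nonlinear features
- Increases model complexity
- Common choice: adding polynomial features
b. The class `PolynomialFeatures` generates polynomial and interaction features: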
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X = np.arange(6).reshape(3, 2)
X
# array([[0, 1],
# [2, 3],
# [4, 5]])
poly = PolynomialFeatures(2)
poly.fit_transform(X)
# array([[ 1., 0., 1., 0., 0., 1.],
# [ 1., 2., 3., 4., 6., 9.],
# [ 1., 4., 5., 16., 20., 25.]])
X = np.arange(9).reshape(3, 3)
X
# array([[0, 1, 2],
# [3, 4, 5],
# [6, 7, 8]])
# Specify the degree and keep only interaction terms; pure powers such as the squared terms above are dropped
poly = PolynomialFeatures(degree=3, interaction_only=True)
poly.fit_transform(X)
# array([[ 1., 0., 1., 2., 0., 0., 2., 0.],
# [ 1., 3., 4., 5., 12., 15., 20., 60.],
# [ 1., 6., 7., 8., 42., 48., 56., 336.]])
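The column order of the degree-2 output is 1, a, b, a², ab, b² for input features (a, b). As a small sanity check, the sketch below builds those columns by hand and compares them with `PolynomialFeatures`; it also shows `include_bias=False`, which drops the leading column of ones. The names `manual`, `poly_out` and `no_bias` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)
a, b = X[:, 0], X[:, 1]

# Degree-2 expansion written out explicitly: [1, a, b, a^2, a*b, b^2]
manual = np.column_stack([np.ones(len(X)), a, b, a ** 2, a * b, b ** 2])

poly_out = PolynomialFeatures(degree=2).fit_transform(X)
print(np.allclose(poly_out, manual))

# include_bias=False omits the constant column
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(no_bias.shape)
```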
The class `FunctionTransformer` wraps an arbitrary function as a transformer:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p, validate=True)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)
# array([[0. , 0.69314718],
# [1.09861229, 1.38629436]])
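If the wrapped function has an inverse, it can be supplied through `inverse_func` so that the transformer is reversible. A minimal sketch pairing `np.log1p` with `np.expm1`; the variable names are illustrative:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p and expm1 are exact inverses of each other
transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1,
                                  validate=True)

X = np.array([[0., 1.], [2., 3.]])
X_log = transformer.fit_transform(X)

# inverse_transform() applies inverse_func and recovers the input
X_back = transformer.inverse_transform(X_log)
print(np.allclose(X_back, X))
```

Wrapping the function this way lets it be used as a step in an sklearn pipeline alongside the other transformers in these notes.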