Introduction to machine learning. The Coursera course covers material similar to CS229, but the latter is harder and contains more formula derivations; a good path is to watch the Coursera course first and then supplement it with CS229.
Webpage notes: http://www.holehouse.org/mlclass/
Stanford video: https://see.stanford.edu/course/cs229
Most topics are explained in detail elsewhere; these notes only record points that I find easy to forget or hard to understand.
cross_val_score
>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import cross_val_score
>>> iris = datasets.load_iris()
>>> clf = svm.SVC(kernel='linear', C=1)
# cv=5 means 5-fold cross-validation: the data is split into 5 folds,
# and each fold serves as the test set once
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=5)
>>> scores
array([0.96..., 1. ..., 0.96..., 0.96..., 1. ])
# mean score and approximate 95% confidence interval (mean ± 2 std)
>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)
The scoring parameter specifies which score to compute:
>>> from sklearn import metrics
>>> scores = cross_val_score(
... clf, iris.data, iris.target, cv=5, scoring='f1_macro')
>>> scores
array([0.96..., 1. ..., 0.96..., 0.96..., 1. ])
By default, cross_val_score does not let you control the details of the split (train/test ratio, random seed, number of repeats, etc.). For finer control, build a cross-validation iterator yourself and pass it as the cv argument:
>>> from sklearn.model_selection import ShuffleSplit
>>> n_samples = iris.data.shape[0]
>>> cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
>>> cross_val_score(clf, iris.data, iris.target, cv=cv)
array([0.977..., 0.977..., 1. ..., 0.955..., 1. ])
By default, cross_val_score computes a single type of score. To obtain several metrics at once, use cross_validate:
>>> from sklearn.model_selection import cross_validate
>>> from sklearn.metrics import recall_score
>>> scoring = ['precision_macro', 'recall_macro']
>>> clf = svm.SVC(kernel='linear', C=1, random_state=0)
>>> scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
... cv=5)
# by default it returns the fit/score times plus the test-set metrics
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']
>>> scores['test_recall_macro']
array([0.96..., 1. ..., 0.96..., 0.96..., 1. ])
# with return_train_score=True, the training-set metrics are returned as well
>>> scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
... cv=5, return_train_score=True)
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro',
 'train_precision_macro', 'train_recall_macro']
>>> scores['train_recall_macro']
array([0.97..., 0.97..., 0.99..., 0.98..., 0.98...])
K-fold is the usual way to split the data:
>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> X = ["a", "b", "c", "d"]
>>> kf = KFold(n_splits=2)
>>> for train, test in kf.split(X):
... print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]
K-fold usually needs to be repeated several times; below, 2-fold is repeated twice:
>>> import numpy as np
>>> from sklearn.model_selection import RepeatedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> random_state = 12883823
>>> rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
>>> for train, test in rkf.split(X):
... print("%s %s" % (train, test))
...
[2 3] [0 1]
[0 1] [2 3]
[0 2] [1 3]
[1 3] [0 2]
Leave-one-out: a single sample is held out for testing each time:
>>> from sklearn.model_selection import LeaveOneOut
>>> X = [1, 2, 3, 4]
>>> loo = LeaveOneOut()
>>> for train, test in loo.split(X):
... print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
Leave-P-out: P samples are held out for testing each time:
>>> from sklearn.model_selection import LeavePOut
>>> X = np.ones(4)
>>> lpo = LeavePOut(p=2)
>>> for train, test in lpo.split(X):
... print("%s %s" % (train, test))
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]
As a general rule, most authors and empirical evidence suggest that 5-fold or 10-fold cross-validation should be preferred to leave-one-out cross-validation.
The ShuffleSplit function first shuffles the samples and then splits them into a train/test pair. This is the function referred to above in item 3 ("build your own cross-validation iterator for finer control"):
>>> from sklearn.model_selection import ShuffleSplit
>>> X = np.arange(5)
>>> ss = ShuffleSplit(n_splits=3, test_size=0.25,
... random_state=0)
>>> for train_index, test_index in ss.split(X):
... print("%s %s" % (train_index, test_index))
...
[1 3 4] [2 0]
[1 4 3] [0 2]
[4 0 2] [1 3]
Stratified k-fold: each fold preserves roughly the same class proportions:
>>> from sklearn.model_selection import StratifiedKFold
>>> X = np.ones(10)
>>> y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
>>> skf = StratifiedKFold(n_splits=3)
>>> for train, test in skf.split(X, y):
... print("%s %s" % (train, test))
[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]
Group k-fold: the same group is never represented in both training and test sets; all samples of a given group go either to the training set or to the test set:
>>> from sklearn.model_selection import GroupKFold
>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
>>> groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
>>> gkf = GroupKFold(n_splits=3)
>>> for train, test in gkf.split(X, y, groups=groups):
... print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]
Grouped: leave one group out
>>> from sklearn.model_selection import LeaveOneGroupOut
>>> X = [1, 5, 10, 50, 60, 70, 80]
>>> y = [0, 1, 1, 2, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3, 3]
>>> logo = LeaveOneGroupOut()
>>> for train, test in logo.split(X, y, groups=groups):
... print("%s %s" % (train, test))
[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]
Grouped: leave P groups out:
>>> from sklearn.model_selection import LeavePGroupsOut
>>> X = np.arange(6)
>>> y = [1, 1, 1, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3]
>>> lpgo = LeavePGroupsOut(n_groups=2)
>>> for train, test in lpgo.split(X, y, groups=groups):
... print("%s %s" % (train, test))
[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]
Time series split: the training set grows over time and the test set is always the block that follows it:
>>> from sklearn.model_selection import TimeSeriesSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit(n_splits=3)
>>> print(tscv)
TimeSeriesSplit(max_train_size=None, n_splits=3)
>>> for train, test in tscv.split(X):
... print("%s %s" % (train, test))
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]
References: 手把手教你区分参数和超参数 (and its English original), hyperparameter optimization.
Parameters are updated during training so as to minimize the loss function.
Usually we judge the state of training by monitoring indicators such as the loss value or the accuracy on the test/validation set, and we tune hyperparameters to improve the model.
Any parameter supplied when constructing an estimator can in principle be optimized. Search strategies include:
Grid search: GridSearchCV. The candidate parameters are passed via param_grid, and every possible combination of parameter values is evaluated to find the best one. Below is a text feature-extraction example:
# build the pipeline (imports added for completeness)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(tol=1e-3)),
])
# the parameters and value grid to search over
parameters = {
'vect__max_df': (0.5, 0.75, 1.0),
# 'vect__max_features': (None, 5000, 10000, 50000),
'vect__ngram_range': ((1, 1), (1, 2)), # unigrams or bigrams
# 'tfidf__use_idf': (True, False),
# 'tfidf__norm': ('l1', 'l2'),
'clf__max_iter': (20,),
'clf__alpha': (0.00001, 0.000001),
'clf__penalty': ('l2', 'elasticnet'),
# 'clf__max_iter': (10, 50, 80),
}
if __name__ == "__main__":
    # find the best parameters for both the feature extraction and the classifier
    grid_search = GridSearchCV(pipeline, parameters, cv=5,
                               n_jobs=-1, verbose=1)
Randomized parameter optimization (RandomizedSearchCV):
- implements a randomized search over parameters
- samples from a distribution over possible parameter values
- a budget can be chosen independently of the number of parameters and their possible values
- adding parameters that do not influence performance does not decrease efficiency
# imports and base estimator; the original snippet assumes a tree ensemble,
# and RandomForestClassifier matches the parameter names used below
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

clf = RandomForestClassifier(n_estimators=20)

# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}
# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)
Specifying an objective metric: evaluation can use different scoring parameters.
Specifying multiple metrics: both the grid search and the randomized search above accept several scoring metrics:
from sklearn.datasets import make_hastie_10_2
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_hastie_10_2(n_samples=8000, random_state=42)
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
gs = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  param_grid={'min_samples_split': range(2, 403, 10)},
                  scoring=scoring, cv=5, refit='AUC', return_train_score=True)
gs.fit(X, y)
results = gs.cv_results_
Model selection: development and evaluation; the final evaluation should be done on a hold-out dataset.
Parallelism: setting the parameter n_jobs=-1 makes the computation run in parallel.
Some estimators come with built-in, optimized parameter search, e.g. linear_model.ElasticNetCV and linear_model.LassoCV, as sketched below:
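A minimal sketch of LassoCV (the synthetic data and all parameter values are illustrative assumptions): instead of an external grid search, the estimator fits its own regularization path under cross-validation, and n_jobs=-1 parallelizes the folds as mentioned above:
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# toy regression data, for illustration only
X, y = make_regression(n_samples=200, n_features=20, noise=1.0, random_state=0)
# the candidate alphas are generated internally; n_jobs=-1 runs the CV folds in parallel
reg = LassoCV(cv=5, n_jobs=-1).fit(X, y)
reg.alpha_  # the regularization strength selected by cross-validation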
The metrics module implements functions that assess prediction error for specific purposes. When evaluating a model, a particular scoring strategy can be specified via the scoring parameter:
>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import cross_val_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf = svm.SVC(probability=True, random_state=0)
>>> cross_val_score(clf, X, y, scoring='neg_log_loss')
array([-0.07..., -0.16..., -0.06...])
>>> model = svm.SVC()
>>> cross_val_score(model, X, y, scoring='wrong_choice')
# raises ValueError: 'wrong_choice' is not a valid scoring value
The sklearn.metrics module provides functions for assessing prediction scores or errors:
- names ending in _score return a score, and higher is better
- names ending in _error or _loss return a loss, and lower is better
How can these be used in a scoring strategy? Some of these metric functions take extra parameters, so passing a bare metric name string is not enough: call make_scorer to turn the metric into a scoring object:
>>> from sklearn.metrics import fbeta_score, make_scorer
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)
>>> import numpy as np
>>> def my_custom_loss_func(y_true, y_pred):
... diff = np.abs(y_true - y_pred).max()
... return np.log1p(diff)
>>> score = make_scorer(my_custom_loss_func, greater_is_better=False)
>>> X = [[1], [1]]
>>> y = [0, 1]
>>> from sklearn.dummy import DummyClassifier
>>> clf = DummyClassifier(strategy='most_frequent', random_state=0)
>>> clf = clf.fit(X, y)
>>> my_custom_loss_func(y, clf.predict(X))
0.69...
>>> score(clf, X, y)
-0.69...
Several functions (GridSearchCV, RandomizedSearchCV, cross_validate) have a scoring parameter that accepts multiple metrics, given either as a list of names or as a dict:
>>> from sklearn.metrics import accuracy_score
>>> scoring = ['accuracy', 'precision_macro']
>>> scoring = {'accuracy': make_scorer(accuracy_score),
...            'prec': 'precision_macro'}
>>> clf = svm.SVC()
>>> cv_results = cross_validate(clf, X, y,
...                             scoring=scoring, cv=5)
Binary classification only:
Metric | Formula |
---|---|
precision_recall_curve | \(precision=\frac{TP}{TP+FP}, recall=\frac{TP}{TP+FN}\) |
roc_curve | \(FPR=\frac{FP}{FP+TN}, TPR=\frac{TP}{TP+FN}\) |
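As a concrete illustration, both curve functions take true labels and continuous scores (this is the small example from the sklearn docs):
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
# precision and recall at every score threshold
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_scores)
# false-positive and true-positive rates at every threshold
fpr, tpr, roc_thresholds = roc_curve(y_true, y_scores)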
Also usable for multiclass:
Metric | Formula |
---|---|
cohen_kappa_score | \(\kappa = (p_o - p_e) / (1 - p_e)\) |
confusion_matrix | By definition a confusion matrix C is such that C_{i, j} is equal to the number of observations known to be in group i but predicted to be in group j. Thus in binary classification, the count of true negatives is C_{0,0}, false negatives is C_{1,0}, true positives is C_{1,1} and false positives is C_{0,1}. |
hinge_loss (binary) | \(L_\text{Hinge}(y, w) = \max\left\{1 - wy, 0\right\} = \left|1 - wy\right|_+\) |
hinge_loss (multi) | \(L_\text{Hinge}(y_w, y_t) = \max\left\{1 + y_t - y_w, 0\right\}\) |
matthews_corrcoef (binary) | \(MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}\), the Matthews correlation coefficient |
matthews_corrcoef (multi) | \(MCC = \frac{c \times s - \sum_{k}^{K} p_k \times t_k}{\sqrt{(s^2 - \sum_{k}^{K} p_k^2) \times (s^2 - \sum_{k}^{K} t_k^2)}}\) |
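A small sketch of these multiclass metrics on toy predictions (the confusion_matrix values follow the sklearn doc example):
from sklearn.metrics import cohen_kappa_score, confusion_matrix, matthews_corrcoef

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)
# array([[2, 0, 0],
#        [0, 0, 1],
#        [1, 0, 2]])
cohen_kappa_score(y_true, y_pred)   # agreement corrected for chance
matthews_corrcoef(y_true, y_pred)   # balanced even with unequal class sizes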
Usable for the multilabel case (multiple labels per sample):
Metric | Formula |
---|---|
accuracy_score | \(\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)\) |
classification_report | precision, recall, f1-score, support, accuracy, macro avg, weighted avg |
f1_score | \(F_1 = \frac{2 \times \text{precision} \times \text{recall}}{ \text{precision} + \text{recall}}\) |
fbeta_score | \(F_\beta = (1 + \beta^2) \frac{\text{precision} \times \text{recall}}{\beta^2 \text{precision} + \text{recall}}\) |
hamming_loss | \(L_{Hamming}(y, \hat{y}) = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels} - 1} 1(\hat{y}_j \not= y_j)\) |
jaccard_similarity_score | \(J(y_i, \hat{y}_i) = \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|}\) |
log_loss (binary) | \(L_{\log}(y, p) = -\log \operatorname{Pr}(y|p) = -(y \log (p) + (1 - y) \log (1 - p))\), also known as logistic regression loss or cross-entropy loss, defined on probability estimates |
log_loss (multi) | \(L_{\log}(Y, P) = -\log \operatorname{Pr}(Y|P) = - \frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k}\) |
precision_recall_fscore_support | Compute precision, recall, F-measure and support for each class |
recall_score | \(\frac{TP}{TP+FN}\) |
zero_one_loss | \(L_{0-1}(y_i, \hat{y}_i) = 1(\hat{y}_i \not= y_i)\) |
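In the multilabel case, y is a binary indicator matrix; a minimal sketch of how the metrics above behave on it:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

y_true = np.array([[0, 1], [1, 1]])
y_pred = np.ones((2, 2))
accuracy_score(y_true, y_pred)              # 0.5: exact match on only one of the two samples
hamming_loss(y_true, y_pred)                # 0.25: 1 of the 4 individual labels is wrong
f1_score(y_true, y_pred, average='macro')   # per-label F1, averaged over labels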
Regression metrics:
Metric | Formula |
---|---|
explained_variance_score | \(explained\_{}variance(y, \hat{y}) = 1 - \frac{Var\{ y - \hat{y}\}}{Var\{y\}}\) |
max_error | \(\text{Max Error}(y, \hat{y}) = max(| y_i - \hat{y}_i |)\) |
mean_absolute_error | \(\text{MAE}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} \left| y_i - \hat{y}_i \right|\) |
mean_squared_error | \(\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (y_i - \hat{y}_i)^2.\) |
mean_squared_log_error | \(\text{MSLE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (\log_e (1 + y_i) - \log_e (1 + \hat{y}_i) )^2\) |
median_absolute_error | \(\text{MedAE}(y, \hat{y}) =\text{median}(\mid y_1 - \hat{y}_1 \mid, \ldots, \mid y_n -\hat{y}_n \mid)\) |
r2_score | \(R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\) |
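The regression metrics share the same calling convention; a short sketch (the numbers reproduce the standard sklearn doc example):
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_absolute_error(y_true, y_pred)  # 0.5
mean_squared_error(y_true, y_pred)   # 0.375
r2_score(y_true, y_pred)             # 0.948...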
>>> import numpy as np
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import Ridge
>>> from sklearn.model_selection import validation_curve
>>> np.random.seed(0)
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> indices = np.arange(y.shape[0])
>>> np.random.shuffle(indices)
>>> X, y = X[indices], y[indices]
>>> train_scores, valid_scores = validation_curve(
...     Ridge(), X, y, param_name="alpha",
...     param_range=np.logspace(-7, 3, 3), cv=5)
>>> from sklearn.model_selection import learning_curve
>>> from sklearn.svm import SVC
>>> train_sizes, train_scores, valid_scores = learning_curve(
... SVC(kernel='linear'), X, y, train_sizes=[50, 80, 110], cv=5)
>>> train_sizes
array([ 50, 80, 110])
>>> train_scores
array([[0.98..., 0.98 , 0.98..., 0.98..., 0.98...],
[0.98..., 1. , 0.98..., 0.98..., 0.98...],
[0.98..., 1. , 0.98..., 0.98..., 0.99...]])
>>> valid_scores
array([[1. , 0.93..., 1. , 1. , 0.96...],
[1. , 0.96..., 1. , 1. , 0.96...],
[1. , 0.96..., 1. , 1. , 0.96...]])
A model can be persisted with the pickle module:
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
clf2.predict(X[0:1])
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0
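For large models, the sklearn docs recommend joblib over raw pickle, since it handles objects carrying big numpy arrays more efficiently; a minimal sketch (the file name is an arbitrary choice):
>>> import joblib
>>> joblib.dump(clf, 'model.joblib')
['model.joblib']
>>> clf2 = joblib.load('model.joblib')
>>> clf2.predict(X[0:1])
array([0])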
Leave-one-out: computationally too expensive.
Macro F1: \(macro\text{-}F1 = \frac{2 \times macro\text{-}P \times macro\text{-}R}{macro\text{-}P + macro\text{-}R}\), where macro-P and macro-R are precision and recall averaged over classes.
Decomposition of the expected generalization error: \(E(f;D)=E_D[(f(x;D)-\overline{f}(x))^2] + (\overline{f}(x)-y)^2 + E_D[(y_D-y)^2]=var(x) + bias^2(x) + \epsilon^2\), i.e. the sum of variance, squared bias, and noise.
A better strategy is to infer missing values from the data that is available, i.e. imputation.
- Univariate imputation: impute.SimpleImputer fills the missing values in the i-th feature using only that feature's non-missing values.
- Multivariate imputation: impute.IterativeImputer models each feature with missing values as a function of the other features.
impute.SimpleImputer:
>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
# impute using each feature column's mean
# (SimpleImputer replaces the removed sklearn.preprocessing.Imputer API)
>>> imp = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
SimpleImputer()
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[ 4. 2. ]
[ 6. 3.666...]
[ 7. 6. ]]
# for categorical data, impute with the most frequent value
>>> import pandas as pd
>>> df = pd.DataFrame([["a", "x"],
... [np.nan, "y"],
... ["a", np.nan],
... ["b", "y"]], dtype="category")
...
>>> imp = SimpleImputer(strategy="most_frequent")
>>> print(imp.fit_transform(df))
[['a' 'x']
['a' 'y']
['a' 'y']
['b' 'y']]
IterativeImputer models each feature with missing values as a function of the other features, iterating for max_iter rounds; the result of the final round is returned.
>>> import numpy as np
>>> from sklearn.experimental import enable_iterative_imputer
>>> from sklearn.impute import IterativeImputer
>>> imp = IterativeImputer(max_iter=10, random_state=0)
>>> imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])
IterativeImputer(add_indicator=False, estimator=None,
imputation_order='ascending', initial_strategy='mean',
max_iter=10, max_value=None, min_value=None,
missing_values=nan, n_nearest_features=None,
random_state=0, sample_posterior=False, tol=0.001,
verbose=0)
>>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
>>> # the model learns that the second feature is double the first
>>> print(np.round(imp.transform(X_test)))
[[ 1. 2.]
[ 6. 12.]
[ 3. 6.]]
MissingIndicator transforms the dataset into a boolean matrix that indicates the presence of missing values. NaN is the default placeholder, but any other value can be designated:
>>> from sklearn.impute import MissingIndicator
>>> X = np.array([[-1, -1, 1, 3],
... [4, -1, 0, -1],
... [8, -1, 1, 0]])
# designate -1 as the missing value
# by default only the features containing missing values are returned, so only 3 columns show up here
>>> indicator = MissingIndicator(missing_values=-1)
>>> mask_missing_values_only = indicator.fit_transform(X)
>>> indicator.features_
array([0, 1, 3])
>>> mask_missing_values_only
array([[ True, True, False],
[False, True, True],
[False, True, False]])
# setting features="all" reports the remaining columns as well
# which is usually what we want
>>> indicator = MissingIndicator(missing_values=-1, features="all")
>>> mask_all = indicator.fit_transform(X)
>>> mask_all
array([[ True, True, False, False],
[False, True, False, True],
[False, True, False, False]])
>>> indicator.features_
array([0, 1, 2, 3])
>>> from sklearn.pipeline import FeatureUnion, make_pipeline
>>> from sklearn.tree import DecisionTreeClassifier
# X_train/y_train/X_test are assumed to be defined, e.g. a 4-feature dataset
# with 100 test samples (hence the shapes below)
>>> transformer = FeatureUnion(
...     transformer_list=[
...         ('features', SimpleImputer(strategy='mean')),
...         ('indicators', MissingIndicator())])
>>> transformer = transformer.fit(X_train, y_train)
>>> results = transformer.transform(X_test)
>>> results.shape
(100, 8)
# with a pipeline, put the feature transformation before the model
# so the data is transformed first, then fitted
>>> clf = make_pipeline(transformer, DecisionTreeClassifier())
>>> clf = clf.fit(X_train, y_train)
>>> results = clf.predict(X_test)
>>> results.shape
(100,)
For data preprocessing, sklearn provides a dedicated module, sklearn.preprocessing, which covers the common preprocessing operations; see the documentation for details (English and Chinese versions).
a. Why? Many estimators assume that each feature is roughly standard normal: centered at zero with unit variance.
b. Function scale: standardization of an array
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
X_scaled = preprocessing.scale(X_train)
X_scaled
# array([[ 0. ..., -1.22..., 1.33...],
# [ 1.22..., 0. ..., -0.26...],
# [-1.22..., 1.22..., -1.06...]])
# the scaled data has zero mean and unit variance
X_scaled.mean(axis=0)
# array([0., 0., 0.])
X_scaled.std(axis=0)
# array([1., 1., 1.])
c. Class StandardScaler:
scaler = preprocessing.StandardScaler().fit(X_train)
scaler
# StandardScaler(copy=True, with_mean=True, with_std=True)
scaler.mean_
# array([1. ..., 0. ..., 0.33...])
scaler.scale_
# array([0.81..., 0.81..., 1.24...])
scaler.transform(X_train)
# array([[ 0. ..., -1.22..., 1.33...],
# [ 1.22..., 0. ..., -0.26...],
# [-1.22..., 1.22..., -1.06...]])
X_test = [[-1., 1., 0.]]
scaler.transform(X_test)
# array([[-2.44..., 1.22..., -0.26...]])
d. Scaling features to a given range
MinMaxScaler
X_train = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax
# array([[0.5 , 0. , 1. ],
# [1. , 0.5 , 0.33333333],
# [0. , 1. , 0. ]])
# apply the same scaling to the test set
X_test = np.array([[ -3., -1., 4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax
# array([[-1.5 , 0. , 1.66666667]])
MaxAbsScaler
X_train = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs
# array([[ 0.5, -1. , 1. ],
# [ 1. , 0. , 0. ],
# [ 0. , 1. , -0.5]])
X_test = np.array([[ -3., -1., 4.]])
X_test_maxabs = max_abs_scaler.transform(X_test)
X_test_maxabs
# array([[-1.5, -1. , 2. ]])
max_abs_scaler.scale_
# array([2., 1., 2.])
e. Scaling sparse data: scale and StandardScaler accept sparse input when constructed with with_mean=False
f. Scaling data with outliers: robust_scale and RobustScaler can be used as replacements
a. Commonly available non-linear transformations: mapping to a uniform distribution and mapping to a Gaussian distribution
b. Mapping to a uniform distribution: QuantileTransformer and quantile_transform
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
X_train_trans = quantile_transformer.fit_transform(X_train)
X_test_trans = quantile_transformer.transform(X_test)
np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])
# array([ 4.3, 5.1, 5.8, 6.5, 7.9])
# after the transform, the values are close to the percentile definition
np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])
# array([ 0.00... , 0.24..., 0.49..., 0.73..., 0.99... ])
c. Mapping to a Gaussian distribution
PowerTransformer provides two methods: the Yeo-Johnson transform and the Box-Cox transform
pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
X_lognormal
# array([[1.28..., 1.18..., 0.84...],
# [0.94..., 1.60..., 0.38...],
# [1.35..., 0.21..., 1.09...]])
pt.fit_transform(X_lognormal)
# array([[ 0.49..., 0.17..., -0.15...],
# [-0.05..., 0.58..., -0.57...],
# [ 0.69..., -0.84..., 0.10...]])
- Depending on the original distribution, the transform may map it nicely to a Gaussian or may work less well ![](https://sklearn.apachecn.org/docs/img/sphx_glr_plot_map_data_to_normal_0011.png)
- Normalization is the process of scaling individual samples to have unit norm
- Function `normalize`: works on arrays, with the `l1` or `l2` norm
X = [[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized
# array([[ 0.40..., -0.40..., 0.81...],
# [ 1. ..., 0. ..., 0. ...],
# [ 0. ..., 0.70..., -0.70...]])
a. Nominal features: non-continuous values that can be encoded as integers
b. Class OrdinalEncoder:
- encodes each categorical feature as a new integer feature (an integer from 0 to num_category-1)
- these integers should not be used directly, because they would be treated as ordered (the categories are actually unordered)
- one-hot encoding is generally used instead
enc = preprocessing.OrdinalEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
# OrdinalEncoder(categories='auto', dtype=<... 'numpy.float64'>)
enc.transform([['female', 'from US', 'uses Safari']])
# array([[0., 1., 1.]])
c. One-hot encoding (dummy encoding)
enc = preprocessing.OneHotEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
# OneHotEncoder(categorical_features=None, categories=None, drop=None,
# dtype=<... 'numpy.float64'>, handle_unknown='error',
# n_values=None, sparse=True)
enc.transform([['female', 'from US', 'uses Safari'],
['male', 'from Europe', 'uses Safari']]).toarray()
# array([[1., 0., 0., 1., 0., 1.],
# [0., 1., 1., 0., 0., 1.]])
# the categories_ attribute shows the categories, i.e. how many dimensions the encoding has
enc.categories_
# [array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
The dimensionality of the encoding can be specified explicitly by providing the set of possible values up front; the encoding is then based on that set:
genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
# Note that for there are missing categorical values for the 2nd and 3rd
# feature
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
# array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
a. Discretization: partitioning continuous features into discrete values
b. K-bins discretization:
X = np.array([[ -3., 5., 15 ],
[ 0., 6., 14 ],
[ 6., 3., 11 ]])
est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)
est.transform(X)
# array([[ 0., 1., 1.],
# [ 1., 1., 1.],
# [ 2., 0., 0.]])
d. Feature binarization: thresholding numerical features to obtain boolean values
X = [[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]]
binarizer = preprocessing.Binarizer().fit(X) # fit does nothing
binarizer
binarizer.transform(X)
# array([[1., 0., 1.],
# [1., 0., 0.],
# [0., 1., 0.]])
# a different threshold can be used
binarizer = preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)
# array([[0., 0., 1.],
# [1., 0., 0.],
# [0., 0., 0.]])
a. Why? - to add non-linear features - to increase model complexity - most common: adding polynomial features
b. Class PolynomialFeatures for generating polynomial features:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X = np.arange(6).reshape(3, 2)
X
# array([[0, 1],
# [2, 3],
# [4, 5]])
poly = PolynomialFeatures(2)
poly.fit_transform(X)
# array([[ 1., 0., 1., 0., 0., 1.],
# [ 1., 2., 3., 4., 6., 9.],
# [ 1., 4., 5., 16., 20., 25.]])
X = np.arange(9).reshape(3, 3)
X
# array([[0, 1, 2],
# [3, 4, 5],
# [6, 7, 8]])
# specify the degree and keep only interaction terms; the pure powers (the squared terms above) are dropped
poly = PolynomialFeatures(degree=3, interaction_only=True)
poly.fit_transform(X)
# array([[ 1., 0., 1., 2., 0., 0., 2., 0.],
# [ 1., 3., 4., 5., 12., 15., 20., 60.],
# [ 1., 6., 7., 8., 42., 48., 56., 336.]])
Class FunctionTransformer:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p, validate=True)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)
# array([[0. , 0.69314718],
# [1.09861229, 1.38629436]])
As shown in the figure below, data preprocessing is completed in 6 steps:
import numpy as np
import pandas as pd
Datasets usually come in .csv format. A CSV file stores tabular data as plain text, one record per line. We use Pandas' read_csv method to read a local csv file into a data frame, and then build the matrix of independent variables and the vector of the dependent variable from it.
Inspecting the dataset: the first 3 columns are features, and the last column records whether a purchase was made (the label):
gongjing@hekekedeiMac ..yn/github/100-Days-Of-ML-Code/datasets (git)-[master] % cat Data.csv
Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes%
读取数据集:
dataset = pd.read_csv('Data.csv')  # read the csv file
X = dataset.iloc[:, :-1].values    # .iloc[rows, columns]
Y = dataset.iloc[:, 3].values      # ':' selects all rows/columns; [a] selects row/column a;
                                   # [a, b, c] selects rows/columns a, b and c
Inspect the loaded dataset:
print("X")
print(X)
print("Y")
print(Y)
X
[['France' 44.0 72000.0]
['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 nan]
['France' 35.0 58000.0]
['Spain' nan 52000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]
Y
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
The data we get is rarely complete. Data can be missing for many reasons, and it must be handled so as not to degrade the model's performance. We can replace missing values with the mean or median of the whole column. In current sklearn versions this is done with the SimpleImputer class from sklearn.impute (the older sklearn.preprocessing.Imputer has been removed).
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# impute columns 2 and 3, the two numerical features
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
Categorical data are variables whose values are labels rather than numbers, usually drawn from a fixed set. Values such as "Yes" and "No" cannot be used in the model's mathematical computations, so they must be encoded as numbers. For that we import the LabelEncoder class from sklearn.preprocessing.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
# OneHotEncoder(categorical_features=[0]) is the older API; ColumnTransformer
# achieves the same: one-hot encode column 0 and pass the rest through
ct = ColumnTransformer([("country", OneHotEncoder(), [0])], remainder="passthrough")
X = ct.fit_transform(X)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Inspect the encoded data. The original first column was the country (a categorical feature) with 3 values: France, Spain and Germany. After the transformation there are two extra columns, i.e. the single original feature is now represented by 3 features:
print("Step 4: Encoding categorical data")
print("X")
print(X)
print("Y")
print(Y)
X
[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
7.20000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
4.80000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
5.40000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
6.10000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
6.37777778e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
5.80000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
5.20000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
7.90000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
8.30000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
6.70000000e+04]]
Y
[0 1 0 0 1 1 0 1 0 1]
Split the dataset in two: a training set used to train the model and a test set used to validate it, typically in an 80:20 ratio. We import the train_test_split() method from sklearn.model_selection.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)
Inspect the split data:
print("Step 5: Splitting the datasets into training sets and Test sets")
print("X_train")
print(X_train)
print("X_test")
print(X_test)
print("Y_train")
print(Y_train)
print("Y_test")
print(Y_test)
Step 5: Splitting the datasets into training sets and Test sets
X_train
[[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
6.37777778e+04]
[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
6.70000000e+04]
[ 0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
4.80000000e+04]
[ 0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
5.20000000e+04]
[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
7.90000000e+04]
[ 0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
6.10000000e+04]
[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
7.20000000e+04]
[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
5.80000000e+04]]
X_test
[[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
5.40000000e+04]
[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
8.30000000e+04]]
Y_train
[1 1 1 0 1 0 0 1]
Y_test
[0 0]
Most model algorithms rely on the Euclidean distance between points, but features vary widely in magnitude, units, and range. In distance computations, features with large magnitudes carry more weight than features with small ones. This is solved by feature standardization (Z-score normalization). We import the StandardScaler class from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print("Step 6: Feature Scaling")
print("X_train")
print(X_train)
print("X_test")
print(X_test)
Step 6: Feature Scaling
X_train
[[-1. 2.64575131 -0.77459667 0.26306757 0.12381479]
[ 1. -0.37796447 -0.77459667 -0.25350148 0.46175632]
[-1. -0.37796447 1.29099445 -1.97539832 -1.53093341]
[-1. -0.37796447 1.29099445 0.05261351 -1.11141978]
[ 1. -0.37796447 -0.77459667 1.64058505 1.7202972 ]
[-1. -0.37796447 1.29099445 -0.0813118 -0.16751412]
[ 1. -0.37796447 -0.77459667 0.95182631 0.98614835]
[ 1. -0.37796447 -0.77459667 -0.59788085 -0.48214934]]
X_test
[[ 0. 0. 0. -1. -1.]
[ 0. 0. 0. 1. 1.]]
Difference: each method imposes different requirements on the low-dimensional subspace, i.e. different constraints on \(W\)
Non-linear dimensionality reduction: kernelize a linear method via the kernel trick, as sketched below
Can a suitable distance metric itself be learned? (metric learning)
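A minimal sketch of the kernel trick applied to PCA via KernelPCA (the dataset and kernel parameters are arbitrary illustrative choices):
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# two concentric circles: not linearly separable, so linear PCA cannot unfold them
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
# the RBF kernel implicitly maps to a high-dimensional space before projecting
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X)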
Likewise, this computation would have to be carried out for every branch of the decision tree: complex and hard to solve!!!
Starting from the weight update for a single data point and extending it to the whole sample, we derive the function to be minimized:
Via gradient descent, solve for the model \(g_t\) and the per-tree weight \(\alpha_t\) that minimize this function:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
X, y = load_iris(return_X_y=True)
clf = AdaBoostClassifier(n_estimators=100)
scores = cross_val_score(clf, X, y, cv=5)
scores.mean()
Concrete steps:
# classification
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)
# regression
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]
# note: the squared-error loss was called 'ls' in older sklearn versions
est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=1, random_state=0,
                                loss='squared_error').fit(X_train, y_train)
mean_squared_error(y_test, est.predict(X_test))
blending: aggregate all the known \(g_t\) into G; there are usually 3 flavors (uniform voting, non-uniform/linear, conditional/stacking):
learning: learn the \(g_t\) while combining them; again usually 3 flavors, corresponding to the blending ones:
Aggregated models can be combined yet again to obtain an even more powerful model; a minimal sketch of uniform blending follows:
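As an illustration of the simplest flavor, uniform blending of trained classifiers is majority voting; sklearn's VotingClassifier implements it (the base estimators here are arbitrary illustrative choices):
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# voting='hard': every g_t gets one vote; passing weights= would give
# non-uniform (linear) blending instead
eclf = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier(random_state=0)),
    ('nb', GaussianNB())], voting='hard')
eclf.fit(X, y)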