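Introduction to scikit-learn
1. General steps of machine learning
Link: General steps of machine learning
2. Preprocessing data
Link: Preprocessing data
3. Cross-validation
Link: Cross-validation
4. Hyperparameter optimization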
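Sometimes we want to tune the parameters of the classifier inside a pipeline to get the best accuracy. The parameters of a pipeline can be inspected with get_params().
pipe.get_params()
Output: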
{'memory': None,
'steps': [('standardscaler',
StandardScaler(copy=True, with_mean=True, with_std=True)),
('sgdclassifier',
SGDClassifier(alpha=0.0001, average=False, class_weight=None,
early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=1000,
n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
power_t=0.5, random_state=None, shuffle=True, tol=None,
validation_fraction=0.1, verbose=0, warm_start=False))],
'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True),
'sgdclassifier': SGDClassifier(alpha=0.0001, average=False, class_weight=None,
early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=1000,
n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
power_t=0.5, random_state=None, shuffle=True, tol=None,
validation_fraction=0.1, verbose=0, warm_start=False),
'standardscaler__copy': True,
'standardscaler__with_mean': True,
'standardscaler__with_std': True,
'sgdclassifier__alpha': 0.0001,
'sgdclassifier__average': False,
'sgdclassifier__class_weight': None,
'sgdclassifier__early_stopping': False,
'sgdclassifier__epsilon': 0.1,
'sgdclassifier__eta0': 0.0,
'sgdclassifier__fit_intercept': True,
'sgdclassifier__l1_ratio': 0.15,
'sgdclassifier__learning_rate': 'optimal',
'sgdclassifier__loss': 'hinge',
'sgdclassifier__max_iter': 1000,
'sgdclassifier__n_iter': None,
'sgdclassifier__n_iter_no_change': 5,
'sgdclassifier__n_jobs': None,
'sgdclassifier__penalty': 'l2',
'sgdclassifier__power_t': 0.5,
'sgdclassifier__random_state': None,
'sgdclassifier__shuffle': True,
'sgdclassifier__tol': None,
'sgdclassifier__validation_fraction': 0.1,
'sgdclassifier__verbose': 0,
'sgdclassifier__warm_start': False}
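The keys in this output follow the pattern step name, double underscore, parameter name. The short sketch below rebuilds a similar pipeline (the exact pipeline from the earlier section is assumed, not reproduced) and shows that set_params accepts the same keys, which is the mechanism a grid search relies on to assign candidate values.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000))
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Nested parameters are addressed as '<step name>__<parameter name>'.
alpha_before = pipe.get_params()['sgdclassifier__alpha']

# set_params accepts the same keys and updates the nested estimator in place.
pipe.set_params(sgdclassifier__alpha=0.01)
alpha_after = pipe.get_params()['sgdclassifier__alpha']
print(alpha_before, alpha_after)
```

Any key listed by get_params() can be used as a key of a parameter grid.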
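Hyperparameters can be optimized by exhaustive search. GridSearchCV provides this utility: it evaluates every combination in a parameter grid using cross-validation. In the following example, we want to optimize the C and penalty parameters of the LogisticRegression classifier.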
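from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# The 'saga' solver supports both the 'l1' and 'l2' penalties.
pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='saga', multi_class='auto',
                                        random_state=42, max_iter=5000))
# Keys follow the '<step name>__<parameter name>' convention.
param_grid = {'logisticregression__C': [0.1, 1.0, 10],
              'logisticregression__penalty': ['l2', 'l1']}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)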
Output:
GridSearchCV(cv=3, error_score='raise-deprecating',
estimator=Pipeline(memory=None,
steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=5000, multi_class='auto',
n_jobs=None, penalty='l2', random_state=42, solver='saga',
tol=0.0001, verbose=0, warm_start=False))]),
fit_params=None, iid='warn', n_jobs=-1,
param_grid={'logisticregression__C': [0.1, 1.0, 10], 'logisticregression__penalty': ['l2', 'l1']},
pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
scoring=None, verbose=0)
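To make "exhaustive search" concrete, the sketch below uses ParameterGrid from sklearn.model_selection to list the candidates that a grid like the one above expands to. The variable names n_candidates and n_fits are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'logisticregression__C': [0.1, 1.0, 10],
              'logisticregression__penalty': ['l2', 'l1']}

# Every combination of the listed values becomes one candidate.
candidates = list(ParameterGrid(param_grid))
for params in candidates:
    print(params)

n_candidates = len(candidates)
# With cv=3, each candidate is fitted once per fold
# (plus one final refit of the best candidate when refit=True).
n_fits = n_candidates * 3
print(n_candidates, n_fits)
```

The cost grows multiplicatively with each parameter added to the grid, which is worth keeping in mind before widening the search.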
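When the grid search object is fitted, it finds the best parameter combination on the training set (using cross-validation). The results of the search are available through the cv_results_ attribute, which lets us inspect how each parameter affects model performance.
df_grid=pd.DataFrame(grid.cv_results_)
df_grid
Output: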
mean_fit_time std_fit_time mean_score_time std_score_time param_logisticregression__C param_logisticregression__penalty params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score mean_train_score std_train_score
0 0.014294 0.003761 0.000333 4.704712e-04 0.1 l2 {'logisticregression__C': 0.1, 'logisticregres... 0.895522 0.931818 0.924242 0.917085 0.015669 5 0.924242 0.917293 0.932331 0.924622 0.006145
1 0.039956 0.024609 0.000998 7.867412e-07 0.1 l1 {'logisticregression__C': 0.1, 'logisticregres... 0.873134 0.909091 0.909091 0.896985 0.016992 6 0.893939 0.883459 0.906015 0.894471 0.009216
2 0.039063 0.010821 0.001008 1.345933e-05 1 l2 {'logisticregression__C': 1.0, 'logisticregres... 0.955224 0.984848 0.954545 0.964824 0.014109 4 0.969697 0.962406 0.962406 0.964836 0.003437
3 0.327151 0.021334 0.000000 0.000000e+00 1 l1 {'logisticregression__C': 1.0, 'logisticregres... 0.962687 1.000000 0.946970 0.969849 0.022190 3 0.969697 0.958647 0.981203 0.969849 0.009209
4 0.125413 0.003730 0.000000 0.000000e+00 10 l2 {'logisticregression__C': 10, 'logisticregress... 0.970149 0.992424 0.977273 0.979899 0.009291 1 0.981061 0.977444 0.988722 0.982409 0.004702
5 0.697837 0.079125 0.001668 2.359100e-03 10 l1 {'logisticregression__C': 10, 'logisticregress... 0.985075 0.977273 0.962121 0.974874 0.009533 2 0.992424 0.981203 0.988722 0.987450 0.004669
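The full cv_results_ table is wide, so it often helps to keep only a few columns and sort by rank. The sketch below is self-contained and therefore makes its own assumptions: it uses the breast cancer dataset, the default solver, and a grid over C only, so its numbers will differ from the table above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Assumed setup for illustration, not the data used in the text above.
X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
param_grid = {'logisticregression__C': [0.1, 1.0, 10]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(X, y)

# Keep the tuned parameter and the test-score summary, best candidate first.
columns = ['param_logisticregression__C', 'mean_test_score',
           'std_test_score', 'rank_test_score']
summary = (pd.DataFrame(grid.cv_results_)[columns]
           .sort_values('rank_test_score')
           .reset_index(drop=True))
print(summary)
```

Comparing mean_test_score against std_test_score shows whether the gap between candidates is larger than the fold-to-fold variation.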
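By default, the grid search object also behaves like an estimator. Once it has been fitted, the model is refit with the best parameters found (refit=True), and calling score uses that model.
grid.best_params_
Output: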
{'logisticregression__C': 10, 'logisticregression__penalty': 'l2'}
In addition, the grid search object can be used like any other classifier to make predictions.
accuracy=grid.score(X_test,y_test)
print('Accuracy score of the {} is {:.2f}'.format(grid.__class__.__name__,accuracy))
Output:
Accuracy score of the GridSearchCV is 0.98
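The sketch below illustrates that a fitted grid search delegates predict and score to the refitted best model, exposed as best_estimator_. The dataset, split, and grid are assumptions chosen to keep the example short and self-contained.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Assumed setup for illustration.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
param_grid = {'logisticregression__C': [0.1, 1.0, 10]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(X_train, y_train)

# predict on the grid search is forwarded to the refitted best pipeline.
y_pred = grid.predict(X_test)
same_predictions = bool((y_pred == grid.best_estimator_.predict(X_test)).all())

# With scoring=None, score falls back to the classifier's accuracy.
accuracy = accuracy_score(y_test, y_pred)
print(grid.best_params_, same_predictions, round(accuracy, 2))
```

If only the tuned model is needed afterwards, best_estimator_ can be kept and the grid search object discarded.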
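Note that so far we have run the grid search on a single split of the data. However, as mentioned earlier, we may want to cross-validate in order to estimate performance on different samples of the data and check how much it varies. Because the grid search is an estimator, it can be passed directly to the cross_validate function.
scores=cross_validate(grid,digits.data,digits.target,cv=3,n_jobs=-1,return_train_score=True)
df_scores=pd.DataFrame(scores)
df_scores
Output: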
fit_time score_time test_score train_score
0 39.138062 0.000998 0.928571 0.985774
1 41.135293 0.000997 0.946578 0.997496
2 38.183267 0.000000 0.924497 0.993339
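Wrapping a grid search in cross_validate means each outer fold runs its own search, so the selected parameters can differ between folds. A minimal sketch, assuming the breast cancer dataset and a small grid over C for speed, uses return_estimator=True to recover the fitted search from each fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Assumed setup for illustration.
X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
param_grid = {'logisticregression__C': [0.1, 1.0, 10]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)

# Outer loop: 3 folds; inner loop: the 3-fold grid search run on each
# outer training set. return_estimator keeps the fitted searches.
scores = cross_validate(grid, X, y, cv=3, return_estimator=True)

best_per_fold = [est.best_params_ for est in scores['estimator']]
for fold, (params, test_score) in enumerate(
        zip(best_per_fold, scores['test_score'])):
    print(fold, params, round(test_score, 3))
```

If the folds disagree on the best parameters, that is a sign the choice is sensitive to the particular sample.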
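Exercise
Reuse the previous pipeline on the breast cancer dataset and run a grid search to evaluate the difference between the hinge and log losses. In addition, fine-tune the penalty.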
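pipe = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000))
# scikit-learn 1.1 renamed the 'log' loss to 'log_loss'; 'log' was removed in 1.3.
param_grid = {'sgdclassifier__loss': ['hinge', 'log'],
              'sgdclassifier__penalty': ['l2', 'l1']}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1)
# Cross-validate the whole grid search, scoring the outer folds with balanced accuracy.
scores = cross_validate(grid, breast.data, breast.target, scoring='balanced_accuracy', cv=3, return_train_score=True)
df_scores = pd.DataFrame(scores)
df_scores[['train_score', 'test_score']].boxplot()
# Fit on the training split to see which combination is selected.
grid.fit(X_train, y_train)
print(grid.best_params_)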
Summary
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='saga', multi_class='auto',
                                        random_state=42, max_iter=5000))
param_grid = {'logisticregression__C': [0.1, 1.0, 10],
              'logisticregression__penalty': ['l2', 'l1']}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1)
scores = pd.DataFrame(cross_validate(grid, X, y, cv=3, n_jobs=-1, return_train_score=True))
scores[['train_score', 'test_score']].boxplot()