Cross-Validation and Hyperparameter Tuning: Core Practices in Python Machine Learning - CSDN Blog

Article link: https://blog.csdn.net/luansj/article/details/149193050

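K-Fold Cross-Validation: Principle and Implementation

K-fold cross-validation is the most common cross-validation method. It splits the dataset into K subsets (folds) of roughly equal size. In each of K rounds, one fold serves as the validation set and the remaining K-1 folds form the training set; the K validation scores are then averaged.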

from sklearn.model_selection import KFold
import numpy as np

# Assume X is the feature matrix and y is the label vector
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array([0, 1, 0, 1, 0])

kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(f"Train: {train_index}, Test: {test_index}")
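The loop above only prints the split indices. As a minimal sketch of the "average the K validation results" step, the example below scores a model on each fold and takes the mean. The iris dataset, the LogisticRegression model, and the names X_demo, y_demo, and kf_demo are assumptions chosen for illustration; they are not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumption: iris and LogisticRegression stand in for your own data and model
X_demo, y_demo = load_iris(return_X_y=True)

# shuffle=True matters here because the iris samples are sorted by class
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# One accuracy score per fold, then the average over the K folds
fold_scores = cross_val_score(model, X_demo, y_demo, cv=kf_demo)
mean_score = fold_scores.mean()
print(f"Fold scores: {fold_scores}")
print(f"Mean accuracy: {mean_score:.3f}")
```

cross_val_score refits a fresh copy of the model for every fold, so the reported mean reflects performance on data each fitted copy has not seen.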

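Leave-One-Out and Leave-P-Out Cross-Validation

Leave-one-out cross-validation (LOOCV) is the extreme case of K-fold cross-validation in which K equals the number of samples. Each round holds out a single sample for validation and trains on all the others. Because it requires one model fit per sample, it is computationally expensive and is mainly suited to small datasets.

Leave-P-out cross-validation (LPOCV) holds out P samples for validation in each round and enumerates every possible choice of P samples, giving C(n, P) splits for n samples. The number of splits grows combinatorially, so the method is practical only for small datasets and small P.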

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(f"Train: {train_index}, Test: {test_index}")
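The section describes leave-P-out but only shows leave-one-out in code. The sketch below uses scikit-learn's LeavePOut with P = 2 on the same five-sample toy data (redefined here as X_toy so the block runs on its own) and checks that the number of splits equals C(5, 2).

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

# Same toy data as above: 5 samples, 2 features
X_toy = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

lpo = LeavePOut(p=2)
n_splits = lpo.get_n_splits(X_toy)
print(f"Number of splits: {n_splits}, C(5, 2) = {comb(5, 2)}")  # both are 10

# Every possible pair of samples is used once as the validation set
for train_index, test_index in lpo.split(X_toy):
    print(f"Train: {train_index}, Test: {test_index}")
```

With only 100 samples and P = 2 the count already reaches 4950 splits, which is why this scheme is rarely used beyond small datasets.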

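Hyperparameter Tuning: The Key to Better Model Performance

Hyperparameters are settings chosen before training rather than learned from the data, and they strongly affect model performance. Hyperparameter tuning is the process of adjusting them to find the best-performing model configuration.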
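To make the distinction concrete, here is a small sketch contrasting hyperparameters with learned parameters in scikit-learn. The choice of LogisticRegression, the iris dataset, and the value C=0.5 are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_iris(return_X_y=True)

# C and max_iter are hyperparameters: we choose them before fit() is called
clf = LogisticRegression(C=0.5, max_iter=1000)
hyperparams = clf.get_params()
print(f"C = {hyperparams['C']}, max_iter = {hyperparams['max_iter']}")

# coef_ and intercept_ are model parameters: fit() learns them from the data
clf.fit(X_demo, y_demo)
print(f"Learned coefficient matrix shape: {clf.coef_.shape}")  # (3, 4): 3 classes, 4 features
```

Tuning methods such as grid search vary the values returned by get_params(), while the learned attributes are recomputed by every fit.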

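Grid Search: Exhaustive Hyperparameter Tuning

Grid search is a systematic hyperparameter tuning method: it evaluates every combination of the candidate values listed in a predefined grid and keeps the best one. It is computationally expensive, and it guarantees the best configuration only among the combinations in the grid, not a global optimum over the full parameter space.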

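from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# The 5-sample toy array above is too small for 5-fold stratified CV
# (its classes have only 3 and 2 members), so load a real dataset instead
X, y = load_iris(return_X_y=True)

# Define the parameter grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Initialize the model (fixed seed so results are reproducible)
rf = RandomForestClassifier(random_state=42)

# Initialize the grid search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)

# Run the grid search
grid_search.fit(X, y)

# Print the best parameters and the best mean cross-validated score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_}")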
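To see where the cost of grid search comes from, the sketch below counts the candidate combinations for the grid used above with scikit-learn's ParameterGrid. The grid is repeated under the name param_grid_demo so the block is self-contained; n_folds = 5 mirrors the cv=5 setting.

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
}
n_folds = 5

# 3 * 3 * 3 = 27 candidate combinations
n_candidates = len(ParameterGrid(param_grid_demo))

# Each candidate is fitted once per fold: 27 * 5 = 135 fits
n_fits = n_candidates * n_folds
print(f"Candidates: {n_candidates}, cross-validation fits: {n_fits}")
```

With the default refit=True, GridSearchCV performs one additional fit of the best configuration on the full data. Adding a fourth hyperparameter with three values would triple the count, which is the main practical limit of exhaustive search.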

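Random Search: Efficient Hyperparameter Tuning

Random search is a tuning method based on random sampling. It evaluates a fixed number of parameter combinations drawn at random from the parameter space, which often finds a near-optimal configuration in much less time than an exhaustive search.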

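from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# The 5-sample toy array above is too small for 5-fold stratified CV,
# so load a real dataset instead
X, y = load_iris(return_X_y=True)

# Define the parameter distributions (lists are sampled uniformly)
param_dist = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10, 15]
}

# Initialize the model (fixed seed so results are reproducible)
rf = RandomForestClassifier(random_state=42)

# Initialize the random search: 10 sampled combinations, 5-fold cross-validation
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_dist, n_iter=10, cv=5, random_state=42)

# Run the random search
random_search.fit(X, y)

# Print the best parameters and the best mean cross-validated score
print(f"Best Parameters: {random_search.best_params_}")
print(f"Best Score: {random_search.best_score_}")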
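The example above samples from fixed lists. RandomizedSearchCV also accepts distribution objects that expose an rvs method, so values can be drawn from a whole range instead. The following is a sketch under assumed ranges; the bounds, the iris dataset, and the names param_dist_demo and search are illustrative choices, not recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = load_iris(return_X_y=True)

# randint(low, high) draws integers from low up to high - 1
param_dist_demo = {
    'n_estimators': randint(10, 201),
    'max_depth': randint(2, 31),
    'min_samples_split': randint(2, 16),
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_dist_demo,
    n_iter=10,        # number of sampled combinations
    cv=5,
    random_state=0,   # makes the sampled combinations reproducible
)
search.fit(X_demo, y_demo)

print(f"Best Parameters: {search.best_params_}")
print(f"Best Score: {search.best_score_:.3f}")
```

The budget is set by n_iter alone, so widening a range does not increase the number of fits, unlike adding values to a grid.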

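Bayesian Optimization: Intelligent Hyperparameter Tuning

Bayesian optimization is a tuning method based on a probabilistic model. It builds a probabilistic (surrogate) model of the objective function, uses it to predict how untried parameter combinations will perform, and picks the next combination to evaluate accordingly, which lets it explore the parameter space more efficiently.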

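# Requires the scikit-optimize package: pip install scikit-optimize
from skopt import BayesSearchCV
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# The 5-sample toy array above is too small for 5-fold stratified CV,
# so load a real dataset instead
X, y = load_iris(return_X_y=True)

# Define the search space: each (low, high) integer tuple is an integer range
param_space = {
    'n_estimators': (10, 200),
    'max_depth': (10, 30),
    'min_samples_split': (2, 15)
}

# Initialize the model (fixed seed so results are reproducible)
rf = RandomForestClassifier(random_state=42)

# Initialize the Bayesian search: 10 evaluations, 5-fold cross-validation
bayes_search = BayesSearchCV(estimator=rf, search_spaces=param_space, n_iter=10, cv=5, random_state=42)

# Run the Bayesian search
bayes_search.fit(X, y)

# Print the best parameters and the best mean cross-validated score
print(f"Best Parameters: {bayes_search.best_params_}")
print(f"Best Score: {bayes_search.best_score_}")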
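To illustrate the idea of a surrogate model guiding the search, here is a hand-rolled sketch that tunes a single hyperparameter, log10(C) of an SVC, using a Gaussian process and the expected-improvement criterion. It is not how BayesSearchCV is implemented internally; the search range, the candidate grid, the three starting points, and the helper name objective are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

def objective(log_c):
    """Mean 5-fold CV accuracy of an RBF SVC with C = 10 ** log_c."""
    model = SVC(C=float(10.0 ** log_c))
    return cross_val_score(model, X_demo, y_demo, cv=5).mean()

# Candidate values of log10(C); the range [-3, 3] is an assumed search space
grid = np.linspace(-3.0, 3.0, 61)

# Start with three evaluated points spread across the range
tried_idx = [0, 30, 60]
scores = [objective(grid[i]) for i in tried_idx]

for _ in range(7):
    # Surrogate: a Gaussian process fitted to the (value, score) pairs so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
    gp.fit(grid[tried_idx].reshape(-1, 1), np.array(scores))
    mu, sigma = gp.predict(grid.reshape(-1, 1), return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero

    # Expected improvement over the best score observed so far
    best = max(scores)
    z = (mu - best) / sigma
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    ei[tried_idx] = -np.inf  # never re-evaluate a candidate

    # Evaluate the candidate that promises the largest improvement
    next_idx = int(np.argmax(ei))
    tried_idx.append(next_idx)
    scores.append(objective(grid[next_idx]))

best_pos = int(np.argmax(scores))
best_log_c = float(grid[tried_idx[best_pos]])
best_score = scores[best_pos]
print(f"Best log10(C): {best_log_c:.2f}, CV accuracy: {best_score:.3f}")
```

Each iteration spends one real evaluation where the surrogate predicts either a high score or high uncertainty, which is the trade-off that distinguishes this approach from purely random sampling.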

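Combining Cross-Validation and Hyperparameter Tuning

In practice, cross-validation and hyperparameter tuning are usually used together: cross-validation gives a more reliable estimate of each candidate's performance, and hyperparameter tuning uses those estimates to select the best configuration. Used together, they can noticeably improve how well the final model generalizes.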
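One common way to combine the two is nested cross-validation: an inner loop tunes the hyperparameters and an outer loop estimates how well the whole tuning procedure generalizes. The sketch below assumes the iris dataset, an SVC, and a small grid over C; the names inner_search and outer_scores are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

# Inner loop: 3-fold grid search selects C on each outer training split
inner_search = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=3)

# Outer loop: 5-fold CV scores the tuned model on data the search never saw
outer_scores = cross_val_score(inner_search, X_demo, y_demo, cv=5)
print(f"Outer fold scores: {outer_scores}")
print(f"Mean accuracy: {outer_scores.mean():.3f}")
```

The outer mean is usually a less optimistic estimate than best_score_ from a single search, because best_score_ is computed on the same folds that were used to pick the winner.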

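Case Study: Tuning an SVM with Cross-Validation and Grid Search

Support vector machines (SVMs) are a widely used classification algorithm whose performance is very sensitive to hyperparameters such as the kernel and the regularization parameter. The following case shows how to tune an SVM with cross-validation and grid search.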

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf']
}

# Initialize the model
svc = SVC()

# Initialize the grid search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=svc, param_grid=param_grid, cv=5)

# Run the grid search on the training set only
grid_search.fit(X_train, y_train)

# Print the best parameters and the best mean cross-validated score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_}")

# Evaluate the best model on the held-out test set
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(f"Test Score: {test_score}")
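SVMs are also sensitive to feature scale. A variation on the case above, offered as a sketch rather than as part of the original example, wraps a StandardScaler and the SVC in a Pipeline so that the scaler is refitted on each training fold and never sees the validation fold. The step names 'scaler' and 'svc' and the stratified split are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling happens inside the pipeline, so it is learned per training fold
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid_pipe = {
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf'],
}

search = GridSearchCV(pipe, param_grid_pipe, cv=5)
search.fit(X_tr, y_tr)

test_acc = search.score(X_te, y_te)
print(f"Best Parameters: {search.best_params_}")
print(f"Test Score: {test_acc:.3f}")
```

Scaling the full dataset before the split would leak statistics from the validation folds into training; putting the scaler in the pipeline avoids that.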

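Case Study: Tuning a Random Forest with Random Search

Random forest is an ensemble learning algorithm whose performance is fairly sensitive to hyperparameters such as the number of trees and the maximum depth. The following case shows how to tune a random forest with random search.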

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Load the dataset
wine = load_wine()
X, y = wine.data, wine.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter distributions (lists are sampled uniformly)
param_dist = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize the model (fixed seed so results are reproducible)
rf = RandomForestClassifier(random_state=42)

# Initialize the random search: 10 sampled combinations, 5-fold cross-validation
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_dist, n_iter=10, cv=5, random_state=42)

# Run the random search on the training set only
random_search.fit(X_train, y_train)

# Print the best parameters and the best mean cross-validated score
print(f"Best Parameters: {random_search.best_params_}")
print(f"Best Score: {random_search.best_score_}")

# Evaluate the best model on the held-out test set
best_model = random_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(f"Test Score: {test_score}")
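Beyond the single best configuration, the fitted search object keeps per-candidate results in cv_results_. The sketch below ranks the sampled candidates by mean cross-validated score; the smaller search space and the name search are assumptions chosen to keep the example quick to run.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = load_wine(return_X_y=True)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions={'n_estimators': [10, 50, 100], 'max_depth': [None, 5, 10]},
    n_iter=5,
    cv=5,
    random_state=0,
)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays with one entry per sampled candidate
results = search.cv_results_
order = np.argsort(results['rank_test_score'])

# Show the three best candidates with the spread of their fold scores
for i in order[:3]:
    mean = results['mean_test_score'][i]
    std = results['std_test_score'][i]
    print(f"rank {results['rank_test_score'][i]}: {mean:.3f} +/- {std:.3f} {results['params'][i]}")
```

When the top candidates differ by less than their standard deviation across folds, the simpler or cheaper configuration is often a reasonable choice.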