sklearn Study Series (Linear Models): Linear Regression

Original article · first published 2023-03-31 21:58:40 · last modified 2023-04-01 14:45:57

License: CC 4.0 BY-SA

Tags: #sklearn #learning #machine-learning

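This article introduces linear regression as used in machine learning and shows how to implement a linear regression model with the sklearn library, covering data preprocessing, model training, and evaluation. It works through examples such as the iris dataset and movie box-office prediction, and explains sklearn's model validation functions and how its performance metrics are computed.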


Note: this article is only a simple aid for understanding how to use linear regression in machine learning.

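Linear regression in practice (movie box-office prediction, prediction on a generated regression dataset, and an iris dataset example)

Preface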

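Regression analysis is a statistical method for determining how variables depend on one another, and in machine learning it is a supervised learning method. The core of the linear regression algorithm is the linear regression equation: it makes predictions by fitting a straight-line relationship between the input data and the output data. Linear regression comes in two forms, simple linear regression (one input variable) and multiple linear regression (several input variables). It is widely used for practical prediction problems such as airline ticket prices, stock markets, and cinema box-office takings.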
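
In symbols, the regression equation described above can be written as follows. This is the standard textbook form; the symbols w, b, n, and m are notation introduced here for illustration and do not come from the original text.

```latex
% Simple linear regression: one input variable x
\hat{y} = w x + b

% Multiple linear regression: n input variables x_1, \dots, x_n
\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b

% Ordinary least squares picks the weights and intercept that
% minimise the squared error over the m training samples
\min_{w,\, b} \; \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2
```

Here w (or w_1 to w_n) plays the role of the slope and b the intercept; these correspond to the coef_ and intercept_ attributes of a fitted sklearn LinearRegression.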

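I. Linear models

"Linear model" was originally a statistical term and has been used more and more in machine learning in recent years. It does not refer to one specific model but to a family of models. In machine learning, commonly used linear models include linear regression, ridge regression, lasso regression, logistic regression, and linear SVC. This article focuses on implementing linear regression with sklearn.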
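
As a rough sketch of what "a family of models" means in practice, the models named above share the same fit/score interface in sklearn. The dataset choices and the hyperparameter values (alpha, max_iter) below are illustrative assumptions, not settings from this article.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression, Ridge
from sklearn.svm import LinearSVC

# Regression models: fitted on the diabetes data (continuous target).
X_reg, y_reg = load_diabetes(return_X_y=True)
regressors = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),  # alpha chosen only for illustration
    "Lasso": Lasso(alpha=0.1),  # alpha chosen only for illustration
}
# For regressors, score() returns R^2 (here on the training data).
reg_scores = {name: m.fit(X_reg, y_reg).score(X_reg, y_reg)
              for name, m in regressors.items()}

# Classification models: fitted on the iris data (class labels).
X_clf, y_clf = load_iris(return_X_y=True)
classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(max_iter=10000),
}
# For classifiers, score() returns accuracy (here on the training data).
clf_scores = {name: m.fit(X_clf, y_clf).score(X_clf, y_clf)
              for name, m in classifiers.items()}

for name, value in {**reg_scores, **clf_scores}.items():
    print(f"{name}: {value:.3f}")
```

Note that the two groups report different quantities from score(), so the numbers are only comparable within a group.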

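II. Code walkthrough

* Background *

1. Common sklearn datasets (* marks the most frequently used)

When checking how accurate a model is, you can validate it against a standard dataset. The load_* datasets below ship with sklearn and can be used as soon as they are imported; the fetch_* datasets are downloaded the first time they are requested.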

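Dataset	Description
datasets.load_iris	* Iris dataset
datasets.load_boston	Boston house-price dataset (removed in scikit-learn 1.2)
datasets.load_breast_cancer	Breast Cancer Wisconsin dataset
datasets.load_diabetes	* Diabetes dataset
datasets.load_wine	* Wine dataset
datasets.fetch_california_housing	California housing dataset
datasets.fetch_lfw_people	Labeled Faces in the Wild (LFW) people dataset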
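
A minimal sketch of loading the three starred datasets and checking their sizes. Passing return_X_y=True returns the feature matrix and target directly instead of a Bunch object; the shapes dictionary is a helper introduced here for illustration.

```python
from sklearn.datasets import load_diabetes, load_iris, load_wine

shapes = {}
for loader in (load_iris, load_diabetes, load_wine):
    X, y = loader(return_X_y=True)  # X: feature matrix, y: target vector
    shapes[loader.__name__] = (X.shape, y.shape)
    print(loader.__name__, X.shape, y.shape)

# load_iris     -> (150, 4)  (150,)
# load_diabetes -> (442, 10) (442,)
# load_wine     -> (178, 13) (178,)
```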

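2. Common sklearn metrics

Performance metrics are usually imported with from sklearn import metrics; they include both classification metrics and regression metrics. Their basic uses are listed below.
(1) Classification metrics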

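Function	Purpose
metrics.f1_score()	Computes the F1 score (the harmonic mean of precision and recall)
metrics.precision_score()	Computes precision
metrics.recall_score()	Computes recall
metrics.roc_auc_score()	Computes the area under the receiver operating characteristic curve (ROC AUC) from prediction scores
metrics.precision_recall_fscore_support()	* Computes precision, recall, F-score, and support for each class
metrics.classification_report()	* Builds a text report of precision, recall, F1 score, and support from the true and predicted labels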
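
A small worked example of the classification metrics in the table. The labels are made up for illustration: class 1 has 3 true positives, 0 false positives, and 1 false negative.

```python
from sklearn import metrics

# Made-up labels for illustration only.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

precision = metrics.precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 3
recall = metrics.recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = metrics.f1_score(y_true, y_pred)                # 2PR / (P + R) = 1.5 / 1.75

print("precision:", precision)  # 1.0
print("recall:", recall)        # 0.75
print("f1:", round(f1, 4))      # 0.8571
print(metrics.classification_report(y_true, y_pred))
```

By default these functions treat label 1 as the positive class for binary targets; classification_report lists every class separately.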

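(2) Regression metrics

Function	Purpose
metrics.mean_absolute_error()	Mean absolute error regression loss
metrics.mean_squared_error()	Mean squared error regression loss
metrics.r2_score()	R^2 (coefficient of determination) regression score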
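
The three regression metrics can be checked by hand on a tiny example. The values below are made up for illustration; the comments show the arithmetic.

```python
from sklearn import metrics

# Made-up values for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]
# Errors (true - predicted): 0.5, 0.0, -1.0, -0.5

mae = metrics.mean_absolute_error(y_true, y_pred)  # (0.5 + 0 + 1 + 0.5) / 4
mse = metrics.mean_squared_error(y_true, y_pred)   # (0.25 + 0 + 1 + 0.25) / 4
r2 = metrics.r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot = 1 - 1.5 / 20

print("MAE:", mae)           # 0.5
print("MSE:", mse)           # 0.375
print("R^2:", round(r2, 3))  # 0.925
```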

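3. Common sklearn model validation functions

Model validation utilities are usually imported with from sklearn.model_selection import *, which imports every model validation function.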

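Function	Purpose
model_selection.cross_validate()	Evaluates metrics by cross-validation and also records fit and score times
model_selection.cross_val_score()	Evaluates a score by cross-validation
model_selection.learning_curve()	Learning curve
model_selection.validation_curve()	Validation curve
model_selection.train_test_split()	* Splits data into a training set and a test set

PS: the most frequently used of these is train_test_split(), which splits the data into a training set and a test (validation) set.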
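
A short sketch of the two most common calls from the table, train_test_split() and cross_val_score(). The diabetes dataset and the choice of cv=5 folds are assumptions made for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print("train rows:", X_train.shape[0], "test rows:", X_test.shape[0])

# 5-fold cross-validation: one R^2 score per fold for a regressor.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("fold scores:", scores)
print("mean score:", scores.mean())
```

A single split gives one estimate of performance; cross-validation repeats the split several times and is usually a steadier estimate on small datasets.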

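(i) Importing libraries

To implement linear regression with sklearn, the first step is to import the model. The basic imports used in this article are:

# from sklearn.linear_model import *  # imports every linear model; handy if you cannot remember a model's name
from sklearn.linear_model import LinearRegression  # import only the linear regression model
from sklearn.model_selection import train_test_split  # splits data into training and test sets
from sklearn.datasets import *  # imports every dataset helper; to import only iris, replace * with load_iris
from sklearn import metrics  # performance metric functions
import matplotlib.pyplot as plt  # matplotlib's plotting interface, aliased as plt
import matplotlib  # matplotlib itself, Python's MATLAB-style plotting library
import numpy as np  # NumPy, aliased as np; a basic library for machine learning
import pandas as pd  # pandas, aliased as pd; another basic library for machine learning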

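(ii) Loading and preparing data

The CSV file used here is read as follows (download link: the file is provided as wine.txt; rename it to wine.csv before use)

'''
Read the file. encoding='cp936' (GBK) sets the text encoding and can be omitted if the file is plain ASCII or UTF-8.
index_col chooses which column becomes the index: False means no column is used, while 0 would turn the first column into the index.
'''
Data = pd.read_csv(r'./wine.csv', encoding='cp936', index_col=False)
X = Data['Alcohol'].values.reshape(-1, 1)  # take the Alcohol column and reshape it into a single column (2-D)
y = Data['Malic_acid'].values  # take the Malic_acid column as a 1-D array
print(X)  # inspect X
print('-------separator------\n', y)  # inspect y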

As the output shows, X has been reshaped into a single column, while y is stored as a one-dimensional array.
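
The effect of reshape(-1, 1) can be seen without the CSV file. The sketch below builds a three-row stand-in DataFrame with the same two column names; the numbers are illustrative values and are not read from wine.csv.

```python
import pandas as pd

# Stand-in for wine.csv: illustrative values, not loaded from the real file.
df = pd.DataFrame({
    "Alcohol": [14.23, 13.20, 13.16],
    "Malic_acid": [1.71, 1.78, 2.36],
})

X = df["Alcohol"].values.reshape(-1, 1)  # -1 lets NumPy infer the row count
y = df["Malic_acid"].values

print(X.shape)  # (3, 1): 3 samples, 1 feature
print(y.shape)  # (3,): one target value per sample
```

sklearn estimators expect X to be two-dimensional (samples by features), which is why the single feature column has to be reshaped while y can stay one-dimensional.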

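(iii) Applying the model

Main parameters of LinearRegression
1. fit_intercept: whether to compute the intercept b
2. normalize: whether to rescale the data before fitting (deprecated in scikit-learn 1.0 and removed in 1.2; scale the data in a preprocessing step instead, for example with StandardScaler)
3. copy_X: whether to copy X (if False, X may be overwritten)
4. n_jobs: the number of CPU cores to use; -1 means use all of them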
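
To illustrate fit_intercept, the sketch below fits the same points with and without an intercept. The data is generated from y = 2x + 1 purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Points that lie exactly on y = 2x + 1 (illustrative data).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

with_b = LinearRegression(fit_intercept=True).fit(X, y)
without_b = LinearRegression(fit_intercept=False).fit(X, y)

# With an intercept the line is recovered: slope about 2, intercept about 1.
print(with_b.coef_[0], with_b.intercept_)

# Without an intercept the line is forced through the origin, so the slope
# changes to sum(x*y) / sum(x*x) = 34 / 14 and intercept_ is reported as 0.0.
print(without_b.coef_[0], without_b.intercept_)
```

Setting fit_intercept=False is normally only appropriate when the data is already centred or the relationship is known to pass through the origin.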

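from sklearn.datasets import load_iris  # iris dataset
from sklearn.linear_model import LinearRegression  # linear regression model
from sklearn.model_selection import train_test_split  # splits data into training and test sets
from sklearn import metrics  # performance metrics
import numpy as np  # used to turn continuous predictions into class labels

X, y = load_iris(return_X_y=True)
# test_size=0.3 keeps 70% of the data for training and 30% for testing; random_state seeds the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=1)  # linear regression model
Model.fit(X_train, y_train)  # fit on X_train and y_train
print('coef:\n', Model.coef_)  # feature coefficients, one slope per feature
print('intercept:\n', Model.intercept_)  # intercept
print('predict first three:\n', Model.predict(X_train[:3, :]))  # continuous predictions for the first three training rows
print('R^2 score:\n', Model.score(X_train, y_train))  # for a regressor, score() returns R^2, not accuracy
predict_y = Model.predict(X_test)  # predictions are normally made on X_test
# classification_report needs discrete labels, so round each prediction to the nearest class (0, 1 or 2)
predict_label = np.clip(np.rint(predict_y), 0, 2).astype(int)
# print precision, recall, and F1 score
print('classification report:\n', metrics.classification_report(y_test, predict_label))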
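
Because LinearRegression is a regressor, its score() method reports R^2 rather than classification accuracy. The sketch below checks this on the same iris split; the variable names are chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)  # continuous values, not class labels

score_value = model.score(X_test, y_test)       # what score() returns
r2_value = r2_score(y_test, y_pred)             # R^2 computed explicitly
mse_value = mean_squared_error(y_test, y_pred)  # average squared error

print("score():", score_value)
print("r2_score:", r2_value)  # same number as score()
print("MSE:", mse_value)
```

For a genuine classification task on iris, a classifier such as logistic regression is the more conventional choice; linear regression is used here only to demonstrate the API.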

(iv) Drawing a fitted line with matplotlib

(1) A simple univariate line

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-5, 5, 100)  # 100 evenly spaced values from -5 to 5
y = 0.5 * x + 3  # y is a linear function of x
plt.plot(x, y, c='blue')  # draw a blue line
plt.title("line")  # set the title to "line"
plt.show()  # display the figure

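(2) Determining a line from two points

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression  # linear regression model

X = [[3], [8]]  # X must be a column (one row per sample)
y = [2, 10]  # y does not need to be a column
lr = LinearRegression()  # create the linear regression model
lr.fit(X, y)  # fit X and y
z = np.linspace(0, 10, 20).reshape(-1, 1)  # 20 points from 0 to 10, reshaped into a column
plt.scatter(X, y, s=100, c='r')  # s is the marker size, c the colour
plt.plot(z, lr.predict(z), c='y')  # draw the fitted line
plt.show()  # display the figure
# coef_[0] is the slope and intercept_ is the intercept
print(f'slope: {lr.coef_[0]}, intercept: {lr.intercept_}')  # print the slope and intercept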
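
With only two points the fitted line passes through both of them, so the slope and intercept can be verified by hand. The sketch below repeats the calculation for the points (3, 2) and (8, 10) used above.

```python
from sklearn.linear_model import LinearRegression

x1, y1 = 3, 2
x2, y2 = 8, 10

slope = (y2 - y1) / (x2 - x1)  # (10 - 2) / (8 - 3) = 1.6
intercept = y1 - slope * x1    # 2 - 1.6 * 3 = -2.8

lr = LinearRegression().fit([[x1], [x2]], [y1, y2])

print("by hand:", slope, intercept)
print("sklearn:", lr.coef_[0], lr.intercept_)  # agrees up to rounding error
```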


III. Projects

(i) Movie box-office prediction

This example uses the file cinema.csv (download link: the file is provided as cinema.txt; rename it to cinema.csv before use)

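from sklearn.linear_model import LinearRegression  # linear regression model
import matplotlib.pyplot as plt  # matplotlib's plotting interface
import numpy as np  # NumPy, aliased as np; a basic library for machine learning
import pandas as pd  # pandas, aliased as pd; another basic library for machine learning
import matplotlib  # matplotlib itself
import warnings  # used to silence warnings in the console


def drawPlt():  # custom plotting helper
    plt.figure(figsize=(10, 6))  # create the figure
    plt.title('Box-office income (millions of yuan)')  # main title
    plt.xlabel('Cost')  # x-axis label
    plt.ylabel('Income')  # y-axis label
    plt.axis([0, 25, 0, 60])  # x axis from 0 to 25, y axis from 0 to 60
    plt.grid(True)  # show grid lines

def main(num):
    if num >= 10:  # 10 million or more: report the cost in tens of millions
        num_d = num / 10  # keep the fraction (int() would turn 15 into 1)
        wan = 'ten million'  # unit: tens of millions
    else:
        num_d = num  # below 10 million: no conversion needed
        wan = 'million'  # unit: millions
    matplotlib.use('TkAgg')  # use the TkAgg backend so the plot opens in its own window (comment out to keep the default backend)
    warnings.filterwarnings('ignore')  # hide warnings (they are not errors; this only keeps the console tidy)
    # matplotlib.rcParams['font.family'] = 'SimHei'  # only needed when the labels are written in Chinese
    matplotlib.rcParams['font.size'] = 10  # font size
    matplotlib.rcParams['axes.unicode_minus'] = False  # stops the minus sign on the axes from rendering as a box
    df = pd.read_csv('cinema.csv')  # load the file
    X = df['cost'].values.reshape(-1, 1)  # reshape cost into a single column
    y = df['income'].values  # use income as y (1-D)
    model = LinearRegression()  # create the linear regression model
    model.fit(X, y)  # fit X and y
    pre = model.predict([[num]])  # predict income for the given cost
    coef = model.coef_  # feature coefficient (slope), an array of shape (1,)
    inter = model.intercept_  # intercept, a scalar because y is 1-D
    print('A film with an investment of {:g} {} yuan is expected to earn {:.2f} million yuan at the box office'.format(num_d, wan, pre[0]))  # number, unit, prediction
    print("Regression coefficient:", coef[0])  # coef_ is a 1-D array, so [0] extracts the slope
    print("Regression intercept:", inter)  # intercept_ is already a scalar
    print(f"Best-fit line: y = {inter:.2f} + {coef[0]:.2f}x")  # keep two decimals instead of truncating to integers
    drawPlt()  # call the custom plotting helper
    plt.scatter(df['cost'], df['income'], c='green', marker='^')  # scatter plot of the data as triangles
    s, j = inter, (num*coef[0]+inter)  # s is the intercept; j is num * slope + intercept
    plt.plot([0, num], [s, j])  # fitted line from x=0 to x=num
    plt.show()  # display the figure

if __name__ == '__main__':
    a = float(input("Enter the production cost (millions): "))  # float() rather than eval(), so raw input is never executed
    main(a)  # run with the entered value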
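
The line drawn at the end of main() relies on the fact that a prediction is just slope times cost plus intercept. The sketch below checks that identity. Since cinema.csv is not reproduced in this article, the cost and income arrays are made-up numbers for illustration only and are not the contents of that file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up (cost, income) pairs in millions of yuan; NOT the data in cinema.csv.
cost = np.array([6.0, 9.0, 12.0, 14.0, 16.0]).reshape(-1, 1)
income = np.array([9.0, 12.0, 29.0, 35.0, 59.0])

model = LinearRegression().fit(cost, income)

num = 10.0  # hypothetical production cost to predict for
by_predict = model.predict([[num]])[0]
by_formula = model.coef_[0] * num + model.intercept_

print("predict():", by_predict)
print("slope * cost + intercept:", by_formula)  # same value
```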


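(ii) Prediction on a generated regression dataset

Using make_regression to generate a regression dataset

n_samples: number of samples
n_features: number of features (independent variables)
n_informative: number of features actually used to build the target
n_targets: number of target (dependent) variables
noise: standard deviation of the Gaussian noise added to the target
bias: bias term (intercept) of the underlying linear model
coef: whether to also return the coefficients of the underlying linear model
random_state: random seed; a fixed value produces the same data on every run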
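
A brief sketch of the coef, bias, and noise parameters together: with noise=0.0 the data lies exactly on a line, so a fitted LinearRegression should recover the generating slope and intercept. The value bias=3.0 is chosen here only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# coef=True makes make_regression return the true coefficient as a third value.
X, y, true_coef = make_regression(
    n_samples=50, n_features=1, n_informative=1,
    noise=0.0, bias=3.0, coef=True, random_state=1,
)

model = LinearRegression().fit(X, y)

print("true coefficient:", float(true_coef))
print("fitted coefficient:", model.coef_[0])  # matches the true coefficient
print("fitted intercept:", model.intercept_)  # matches bias (3.0) up to rounding
```

Raising noise above zero scatters the points around the line, and the fitted values then only approximate the true ones.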

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression  # generates a synthetic regression dataset
from matplotlib import pyplot as plt
import numpy as np


X, y = make_regression(n_samples=50, n_features=1, n_informative=1, noise=50, random_state=1)
model = LinearRegression()
model.fit(X, y)  # fit the linear model
z = np.linspace(-3, 3, 200)  # 200 evenly spaced x values for drawing the line
plt.scatter(X, y, c='y', s=80)  # scatter plot of the samples
plt.plot(z, model.predict(z.reshape(-1, 1)), c='k')  # draw the fitted line
plt.show()

The result is a scatter plot of the generated samples with the fitted line drawn through them.

(iii) Iris dataset prediction

from sklearn.datasets import load_iris  # bundled iris dataset, used in main()
from sklearn.model_selection import train_test_split  # splits data into training and test sets
from sklearn.linear_model import LinearRegression  # linear regression model
from matplotlib import pyplot as plt  # plotting library
import pandas as pd  # data analysis library
import warnings  # used to silence warnings in the console

def drawPlt():  # custom plotting function
    # Download address of the raw iris data; drawPlt() reads the data this way instead of using load_iris
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    data = pd.read_csv(url, names=names)  # read the raw iris data and use names as the column names
    pos = pd.DataFrame(data)  # a pandas DataFrame (2-D table); read_csv already returns one
    # take the petal length and width, converting each Series to an ndarray
    x = pos['petal-length'].values  # x values
    y = pos['petal-width'].values  # y values
    x = x.reshape(len(x), 1)  # reshape to len(x) rows and one column
    y = y.reshape(len(y), 1)  # reshape to len(y) rows and one column
    clf = LinearRegression()  # create the linear regression model
    clf.fit(x, y)  # fit
    pre = clf.predict(x)  # predict
    plt.scatter(x, y, s=80, c='y')  # scatter plot, marker size 80, yellow
    plt.plot(x, pre, c='k', linewidth=4)  # draw the regression line in black
    for idx, m in enumerate(x):  # loop over the x values together with their indices
        plt.plot([m, m], [y[idx], pre[idx]], 'g-')  # vertical segment from each point to the line, showing its distance (residual)
    plt.show()  # display the figure

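def main():
    warnings.filterwarnings('ignore')  # hide warnings (they are not errors; this only keeps the console tidy)
    data = load_iris()  # load the bundled iris dataset (drawPlt() downloads the raw file instead)
    X = data.data  # features
    y = data.target  # labels
    # print(data.keys())  # inspect the keys of the iris dataset
    # keep 70% of the data for training and 30% for testing; random_state seeds the shuffle so the split is reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    # inspect how many elements each split contains
    # print('X_train:{}, X_test:{}'.format(X_train.size, X_test.size))
    # print('y_train:{}, y_test:{}'.format(y_train.size, y_test.size))
    lr = LinearRegression()  # create the linear regression model
    lr.fit(X_train, y_train)  # fit
    print('Training score (R^2):', lr.score(X_train, y_train))  # score on the training set
    print('Test score (R^2):', lr.score(X_test, y_test))  # score on the test set
    print('Feature coefficients (slopes):\n', lr.coef_)  # one coefficient per feature (four here)
    print('Intercept:\n', lr.intercept_)
    print('Predictions:', lr.predict(X_test[:3, :]))  # show only the first three predictions
    drawPlt()  # call the custom plotting function

if __name__ == '__main__':
    main()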
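
If the download in drawPlt() is not available, the same petal-length versus petal-width regression can be sketched from the copy of iris bundled with sklearn. This is an alternative to the URL approach above rather than the method used in this article, and it leaves out the plotting.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

iris = load_iris()
print(iris.feature_names)  # column order of iris.data

petal_length = iris.data[:, 2].reshape(-1, 1)  # column 2: petal length (cm)
petal_width = iris.data[:, 3]                  # column 3: petal width (cm)

model = LinearRegression().fit(petal_length, petal_width)
r2 = model.score(petal_length, petal_width)  # R^2 on the same data used for fitting

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("R^2:", r2)
```

The score here is measured on the training data, so it describes how well the line fits these points rather than how well it would generalise.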