[TensorFlow Deep Learning] 8. Multilayer Perceptrons (Hidden Layers, ReLU)

雯雅千鶴子

Last modified 2023-11-28 12:34:28

License: CC 4.0 BY-SA

First published 2023-11-24 14:57:10

Article link: https://blog.csdn.net/qq_43876539/article/details/134598933

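This article explains in detail how to implement a multilayer perceptron with TensorFlow, covering the concepts, nonlinear activation functions, and training on the Fashion-MNIST dataset.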

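This column records what the author has learned while studying deep learning with TensorFlow.

Multilayer perceptrons can solve nonlinear problems that linear models cannot. An MLP extends the linear model with hidden layers and activation functions, and this chapter introduces the multilayer perceptron (MLP) along those lines.

The Jupyter notebook for this section has been uploaded to Gitee for study and discussion: my Gitee repository

A multilayer perceptron (MLP) is a particular kind of deep neural network (DNN): the term usually refers to a network with a purely feedforward structure and no recurrent connections. Deep neural networks are therefore the broader category, and the MLP is one member of it.
An MLP adds one or more fully connected hidden layers between the input layer and the output layer, and transforms the output of each hidden layer with an activation function. Commonly used activation functions include ReLU, sigmoid, and tanh.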

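1 Why do we need multilayer perceptrons?

As the figure below shows, a linear model cannot solve nonlinear problems such as the XOR problem: no single straight line separates the region containing red 2 and red 3 from the region containing green 1 and green 4, so a linear model cannot solve this classification problem.

However, we can use two lines (the blue line and the yellow line) to divide the plane and classify the points. The result is shown in gray below.

In a neural network, this means inserting two classifiers, blue and yellow, between the input and the output; combining their outputs yields the classification result (gray).

The new layer made up of the blue and yellow classifiers, placed between the input layer and the output layer, is called the hidden layer.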
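The two-line idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the weights and thresholds below are hand-picked rather than learned, a hard threshold stands in for the activation function, and the names `W_hidden`, `b_hidden`, and `w_out` are hypothetical.

```python
import numpy as np

# The four XOR inputs and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Hidden layer: two linear classifiers, i.e. two lines in the plane.
# Unit 0 fires when x1 + x2 > 0.5, unit 1 fires when x1 + x2 > 1.5.
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])
H = (X @ W_hidden + b_hidden > 0).astype(float)

# Output layer: positive only when unit 0 fires and unit 1 does not,
# which is exactly the band between the two lines.
w_out = np.array([1.0, -1.0])
xor_pred = (H @ w_out > 0.5).astype(int)

print(xor_pred)  # [0 1 1 0]
```

Neither hidden unit solves XOR alone, but the output layer only has to separate the hidden representations, and those are linearly separable.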

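2 The concept of the multilayer perceptron

2.1 Hidden layer (single layer)

A minibatch of $n$ samples, each with $d$ features: $\mathbf{X} \in \mathbb{R}^{n \times d}$

The hidden layer has $h$ hidden units
Hidden-layer weights: $\mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}$
Hidden-layer bias: $\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}$
Hidden-layer output: $\mathbf{H} \in \mathbb{R}^{n \times h}$

After the hidden layer's affine transformation, a nonlinear activation function $\sigma$ is applied to every hidden unit. The outputs of the activation function (e.g., $\sigma(\cdot)$) are called activations. A nonlinear activation function is what gives the model its nonlinearity.

The output layer produces $q$ output values per sample
Output-layer weights: $\mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}$
Output-layer bias: $\mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}$
Output-layer output: $\mathbf{O} \in \mathbb{R}^{n \times q}$

Computation: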
$\begin{aligned} \mathbf{H} & = \sigma(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}), \\ \mathbf{O} & = \mathbf{H}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}.\\ \end{aligned}$
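The shapes in the two equations above can be checked with a small NumPy sketch. The sizes `n, d, h, q` are made-up values, ReLU is assumed for $\sigma$, and NumPy is used instead of TensorFlow only to keep the example light.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, q = 4, 6, 5, 3  # hypothetical sizes: samples, features, hidden units, outputs

X = rng.normal(size=(n, d))
W1 = rng.normal(scale=0.01, size=(d, h))
b1 = np.zeros((1, h))
W2 = rng.normal(scale=0.01, size=(h, q))
b2 = np.zeros((1, q))

def relu(z):
    return np.maximum(z, 0)

H = relu(X @ W1 + b1)  # hidden-layer output; b1 is broadcast over the n rows
O = H @ W2 + b2        # output layer, no activation applied

print(H.shape, O.shape)  # (4, 5) (4, 3)
```

The bias vectors have a leading dimension of 1, so broadcasting adds the same bias to every sample in the minibatch.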

In the same way, we can stack multiple hidden layers, for example three:

Computation:
$\begin{aligned} \mathbf{H^{(1)}} & = \sigma(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}), \\ \mathbf{H^{(2)}} & = \sigma(\mathbf{H^{(1)}} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}), \\ \mathbf{H^{(3)}} & = \sigma(\mathbf{H^{(2)}} \mathbf{W}^{(3)} + \mathbf{b}^{(3)}), \\ \mathbf{O} & = \mathbf{H^{(3)}}\mathbf{W}^{(4)} + \mathbf{b}^{(4)}.\\ \end{aligned}$
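The same pattern extends to any number of hidden layers with a loop. The following is a minimal sketch, assuming ReLU as $\sigma$; the function name `mlp_forward` and the layer widths are hypothetical.

```python
import numpy as np

def mlp_forward(X, weights, biases):
    """Apply the activation after every layer except the last one."""
    H = X
    for W, b in zip(weights[:-1], biases[:-1]):
        H = np.maximum(H @ W + b, 0)        # hidden layer followed by ReLU
    return H @ weights[-1] + biases[-1]     # output layer, no activation

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 16, 4]  # d, three hidden widths, q (made-up values)
weights = [rng.normal(scale=0.1, size=(m, k))
           for m, k in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((1, k)) for k in sizes[1:]]

O = mlp_forward(rng.normal(size=(5, sizes[0])), weights, biases)
print(O.shape)  # (5, 4)
```

Three hidden layers need four weight matrices, matching $\mathbf{W}^{(1)}$ through $\mathbf{W}^{(4)}$ in the equations.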

2.2 Activation functions

Nonlinear activation functions are what let the model represent nonlinear functions.
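To see why the nonlinearity matters, consider the single-hidden-layer model from section 2.1 with $\sigma$ removed. The merged symbols $\mathbf{W}$ and $\mathbf{b}$ are introduced here only for illustration:
$\begin{aligned} \mathbf{O} & = (\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})\mathbf{W}^{(2)} + \mathbf{b}^{(2)} \\ & = \mathbf{X} \mathbf{W}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(2)} \\ & = \mathbf{X} \mathbf{W} + \mathbf{b}, \end{aligned}$
where $\mathbf{W} = \mathbf{W}^{(1)}\mathbf{W}^{(2)} \in \mathbb{R}^{d \times q}$ and $\mathbf{b} = \mathbf{b}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}$. Without an activation function, the two layers collapse into a single linear model, so the hidden layer adds no expressive power.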

2.2.1 ReLU

The most commonly used activation function is the rectified linear unit (ReLU), because it is simple to compute and performs well.
$\operatorname{ReLU}(x) = \max(x, 0).$

%matplotlib inline
import tensorflow as tf
from d2l import tensorflow as d2l 
x = tf.Variable(tf.range(-8.0, 8.0, 0.1), dtype=tf.float32)
y = tf.nn.relu(x)
d2l.plot(x.numpy(), y.numpy(), 'x', 'relu(x)', figsize=(5, 2.5))

Result:
[Figure: plot of relu(x) against x]

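2.2.2 sigmoid

For an input whose domain is $\mathbb{R}$, the sigmoid function maps it to an output in the interval (0, 1).
For this reason, sigmoid is often called a squashing function:
it squashes any input in the range (-inf, inf) to some value in the interval (0, 1):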

$\operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)}.$

y = tf.nn.sigmoid(x)
d2l.plot(x.numpy(), y.numpy(), 'x', 'sigmoid(x)', figsize=(5, 2.5))

Result:
[Figure: plot of sigmoid(x) against x]

2.2.3 tanh

Like the sigmoid function, the tanh (hyperbolic tangent) function also squashes its input, here into the interval (-1, 1).
The tanh function is defined as:
$\operatorname{tanh}(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}.$

y = tf.nn.tanh(x)
d2l.plot(x.numpy(), y.numpy(), 'x', 'tanh(x)', figsize=(5, 2.5))

Result:
[Figure: plot of tanh(x) against x]
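The similarity between the two functions is exact: rearranging the formula above gives $\operatorname{tanh}(x) = 2\operatorname{sigmoid}(2x) - 1$, so tanh is a rescaled and shifted sigmoid. A small NumPy check of both the formula and this identity follows (a sketch only; the helper names are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_from_formula(x):
    # The definition given in the text.
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

x = np.arange(-8.0, 8.0, 0.1)  # same range as the plots
tanh_vals = tanh_from_formula(x)
rescaled_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0

print(np.allclose(tanh_vals, np.tanh(x)))         # True
print(np.allclose(rescaled_sigmoid, np.tanh(x)))  # True
```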

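3 Implementing an MLP with a single hidden layer (from scratch)

3.1 Loading the dataset

We again use the Fashion-MNIST image classification dataset.
Image format: each image consists of $28 \times 28 = 784$ grayscale pixel values.
Image classes: 10.
Each image can therefore be treated as a sample with 784 input features that belongs to one of 10 classes.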
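Treating an image as 784 features simply means flattening it. A minimal NumPy sketch follows, using an all-zero stand-in array rather than real data and assuming the (batch, 28, 28, 1) layout that the loader in this section produces:

```python
import numpy as np

# Stand-in minibatch with the loader's layout: (batch, height, width, channel).
batch = np.zeros((256, 28, 28, 1), dtype=np.float32)

# Flatten every image into one row of 28 * 28 = 784 features.
flat = batch.reshape(-1, 28 * 28)

print(flat.shape)  # (256, 784)
```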

def load_data_fashion_mnist(batch_size, resize=None):
    """Download the Fashion-MNIST dataset and load it into memory."""
    mnist_train, mnist_test = tf.keras.datasets.fashion_mnist.load_data()
    # Divide all pixel values by 255 so they lie between 0 and 1, add a
    # channel dimension at the end, and cast the labels to int32.
    process = lambda X, y: (tf.expand_dims(X, axis=3) / 255,
                            tf.cast(y, dtype='int32'))
    resize_fn = lambda X, y: (
        tf.image.resize_with_pad(X, resize, resize) if resize else X, y)
    return (
        tf.data.Dataset.from_tensor_slices(process(*mnist_train)).batch(
            batch_size).shuffle(len(mnist_train[0])).map(resize_fn),
        tf.data.Dataset.from_tensor_slices(process(*mnist_test)).batch(
            batch_size).map(resize_fn))

import tensorflow as tf

batch_size = 256
train_iter, test_iter = load_data_fashion_mnist(batch_size)

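3.2 Initializing the model parameters

The input dimension is 784, the output dimension is 10, and the hidden layer has 256 units. Layer widths are usually chosen as powers of 2, because this tends to be computationally efficient given how memory is allocated and addressed in hardware.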

num_inputs, num_outputs, num_hiddens = 784, 10, 256

Define the hidden-layer parameters: the weights have shape (784, 256) and the bias has shape (256,)

W1 = tf.Variable(tf.random.normal(
    shape=(num_inputs, num_hiddens), mean=0, stddev=0.01))
b1 = tf.Variable(tf.zeros(num_hiddens))

Define the output-layer parameters: the weights have shape (256, 10) and the bias has shape (10,)

W2 = tf.Variable(tf.random.normal(
    shape=(num_hiddens, num_outputs), mean=0, stddev=0.01))
b2 = tf.Variable(tf.zeros(num_outputs))

params = [W1, b1, W2, b2]
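As a quick sanity check on these shapes, the number of trainable parameters can be counted in plain Python. The dictionary `shapes` and the helper `count` are hypothetical names used only for this sketch:

```python
num_inputs, num_outputs, num_hiddens = 784, 10, 256

# Shapes of the four parameter tensors defined above.
shapes = {
    "W1": (num_inputs, num_hiddens),
    "b1": (num_hiddens,),
    "W2": (num_hiddens, num_outputs),
    "b2": (num_outputs,),
}

def count(shape):
    """Number of scalar entries in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

counts = {name: count(shape) for name, shape in shapes.items()}
total_params = sum(counts.values())

print(counts["W1"], counts["W2"])  # 200704 2560
print(total_params)                # 203530
```

Almost all of the parameters sit in the first weight matrix, because the input dimension is much larger than the output dimension.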

3.3 Defining the ReLU activation function

def relu(X):
    return tf.math.maximum(X, 0)

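3.4 Defining the network model

When using TensorFlow's built-in cross-entropy loss with from_logits=True, there is no need to add a softmax activation to the output layer by hand. The built-in loss converts the raw outputs (logits) into a probability distribution internally when computing the loss, so the network itself can return logits.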

def net(X):
    X = tf.reshape(X, (-1, num_inputs))  # flatten each image into 784 features
    H = relu(tf.matmul(X, W1) + b1)  # hidden layer
    O = tf.matmul(H, W2) + b2  # output layer (logits)
    return O

3.5 Loss function (cross-entropy)

We use the built-in cross-entropy loss from the high-level API.

def loss(y_hat, y):
    return tf.losses.sparse_categorical_crossentropy(
        y, y_hat, from_logits=True)
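To make the role of from_logits=True concrete, here is a NumPy sketch of cross-entropy computed directly from logits. It is not TensorFlow's actual implementation, which may differ; it uses the common log-sum-exp shift for numerical stability, and the function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Subtracting the row maximum leaves softmax unchanged but avoids overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy_from_logits(logits, labels):
    """Per-sample loss: negative log-probability of the true class."""
    log_probs = log_softmax(logits)
    return -log_probs[np.arange(len(labels)), labels]

logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 0.0, -1000.0]])  # second row would overflow a naive exp
labels = np.array([0, 0])
losses = cross_entropy_from_logits(logits, labels)

print(np.isfinite(losses).all())  # True
```

Like the loss function above, this returns one value per sample rather than a batch mean.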

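3.6 Defining minibatch stochastic gradient descent

def sgd(params, grads, lr, batch_size):
    """Minibatch stochastic gradient descent."""
    # tf.Variable has no .grad attribute, so the gradients are passed in
    # explicitly and each parameter is updated in place with assign_sub.
    for param, grad in zip(params, grads):
        param.assign_sub(lr * grad / batch_size)

class Updater():
    """Update parameters with minibatch stochastic gradient descent."""
    def __init__(self, params, lr):
        self.params = params
        self.lr = lr

    def __call__(self, batch_size, grads):
        sgd(self.params, grads, self.lr, batch_size)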
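The update rule itself is easy to check by hand. Below is a NumPy sketch of a single minibatch SGD step, with made-up parameter and gradient values and a hypothetical function name `sgd_step`:

```python
import numpy as np

def sgd_step(params, grads, lr, batch_size):
    """In-place update: param <- param - lr * grad / batch_size."""
    for param, grad in zip(params, grads):
        param -= lr * grad / batch_size

w = np.array([1.0, -2.0])
grad_sum = np.array([4.0, 8.0])  # gradient summed over a batch of 4 (made-up numbers)

sgd_step([w], [grad_sum], lr=0.1, batch_size=4)
print(w)  # roughly [0.9, -2.2]
```

Dividing by `batch_size` turns the gradient of the summed per-sample loss into the gradient of the mean loss, so the effective step size does not grow with the batch size.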

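3.7 Training

The training procedure for an MLP is exactly the same as for softmax regression. We set the number of epochs (num_epochs) to 10 and the learning rate lr to 0.1.

3.7.1 Training visualization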

%matplotlib inline
from IPython import display
from d2l import tensorflow as d2l

class Animator:
    """Plot data in an animation."""
    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
                 ylim=None, xscale='linear', yscale='linear',
                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
                 figsize=(3.5, 2.5)):
        # Plot multiple lines incrementally
        if legend is None:
            legend = []
        d2l.use_svg_display()
        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
        if nrows * ncols == 1:
            self.axes = [self.axes, ]
        # Use a lambda function to capture the arguments
        self.config_axes = lambda: d2l.set_axes(
            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
        self.X, self.Y, self.fmts = None, None, fmts

    def add(self, x, y):
        # Add multiple data points to the figure
        if not hasattr(y, "__len__"):
            y = [y]
        n = len(y)
        if not hasattr(x, "__len__"):
            x = [x] * n
        if not self.X:
            self.X = [[] for _ in range(n)]
        if not self.Y:
            self.Y = [[] for _ in range(n)]
        for i, (a, b) in enumerate(zip(x, y)):
            if a is not None and b is not None:
                self.X[i].append(a)
                self.Y[i].append(b)
        self.axes[0].cla()
        for x, y, fmt in zip(self.X, self.Y, self.fmts):
            self.axes[0].plot(x, y, fmt)
        self.config_axes()
        display.display(self.fig)
        display.clear_output(wait=True)

class Accumulator:
    """Accumulate sums over n variables."""
    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def reset(self):
        self.data = [0.0] * len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

3.7.2 Evaluation

def accuracy(y_hat, y):
    """Compute the number of correct predictions."""
    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
        y_hat = tf.argmax(y_hat, axis=1)
    cmp = tf.cast(y_hat, y.dtype) == y
    return float(tf.reduce_sum(tf.cast(cmp, y.dtype)))

def evaluate_accuracy(net, data_iter):
    """Compute the accuracy of the model on the given dataset."""
    metric = Accumulator(2)  # number of correct predictions, number of predictions
    for X, y in data_iter:
        metric.add(accuracy(net(X), y), d2l.size(y))
    return metric[0] / metric[1]
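The counting logic in `accuracy` can be illustrated with a small NumPy sketch. The logits and labels below are made-up values:

```python
import numpy as np

logits = np.array([[0.1, 2.0, 0.3],
                   [1.5, 0.2, 0.1],
                   [0.0, 0.1, 3.0],
                   [0.9, 0.8, 0.1]])
labels = np.array([1, 0, 2, 1])

# The predicted class is the index of the largest logit in each row.
predicted = logits.argmax(axis=1)
num_correct = int((predicted == labels).sum())
acc = num_correct / len(labels)

print(num_correct, acc)  # 3 0.75
```

Accumulating the correct count and the sample count separately, then dividing once at the end, gives the right dataset-level accuracy even when the last batch is smaller than the others.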

3.7.3 Defining the training loop

def train_epoch(net, train_iter, loss, updater):
    """Train the model for one epoch."""
    # Sum of training loss, sum of training accuracy, number of samples
    metric = Accumulator(3)
    for X, y in train_iter:
        # Compute gradients and update parameters
        with tf.GradientTape() as tape:
            y_hat = net(X)
            # Keras built-in losses take (labels, predictions)
            if isinstance(loss, tf.keras.losses.Loss):
                l = loss(y, y_hat)
            else:
                l = loss(y_hat, y)
        if isinstance(updater, tf.keras.optimizers.Optimizer):
            params = net.trainable_variables
            grads = tape.gradient(l, params)
            updater.apply_gradients(zip(grads, params))
        else:
            updater(X.shape[0], tape.gradient(l, updater.params))
        # Keras losses return the mean loss over a batch by default
        l_sum = l * float(tf.size(y)) if isinstance(
            loss, tf.keras.losses.Loss) else tf.reduce_sum(l)
        metric.add(l_sum, accuracy(y_hat, y), tf.size(y))
    # Return the training loss and training accuracy
    return metric[0] / metric[2], metric[1] / metric[2]

def train(net, train_iter, test_iter, loss, num_epochs, updater):
    """Train the model."""
    animator = Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0.3, 0.9],
                        legend=['train loss', 'train acc', 'test acc'])
    for epoch in range(num_epochs):
        train_metrics = train_epoch(net, train_iter, loss, updater)
        test_acc = evaluate_accuracy(net, test_iter)
        animator.add(epoch + 1, train_metrics + (test_acc,))
    train_loss, train_acc = train_metrics
    assert train_loss < 0.5, train_loss
    assert train_acc <= 1 and train_acc > 0.7, train_acc
    assert test_acc <= 1 and test_acc > 0.7, test_acc

3.7.4 Results

num_epochs, lr = 10, 0.1
updater = Updater([W1, W2, b1, b2], lr)
train(net, train_iter, test_iter, loss, num_epochs, updater)

Result:
[Figure: train loss, train acc, and test acc plotted against epoch]

def predict(net, test_iter, n=8):
    """Predict labels."""
    for X, y in test_iter:
        break
    trues = d2l.get_fashion_mnist_labels(y)
    preds = d2l.get_fashion_mnist_labels(tf.argmax(net(X), axis=1))
    titles = [true +'\n' + pred for true, pred in zip(trues, preds)]
    d2l.show_images(
        tf.reshape(X[0:n], (n, 28, 28)), 1, n, titles=titles[0:n])

predict(net, test_iter)

Result:
[Figure: test images titled with their true and predicted labels]

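4 Implementing an MLP with a single hidden layer (using the TensorFlow framework)

With the framework, we only need to build the network, configure the hyperparameters, and choose a loss function and a gradient-descent optimizer. Then we load the dataset and train, without worrying about the low-level details.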

import tensorflow as tf
from d2l import tensorflow as d2l

net = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),  # flatten layer
    tf.keras.layers.Dense(256, activation='relu'),  # hidden layer (256 units)
    tf.keras.layers.Dense(10)])  # output layer (10 classes)

batch_size, lr, num_epochs = 256, 0.1, 10  # hyperparameters
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # cross-entropy loss
trainer = tf.keras.optimizers.SGD(learning_rate=lr)  # SGD optimizer

train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
train(net, train_iter, test_iter, loss, num_epochs, trainer)  # train