CMU 11-785 L04 Backpropagation

This post walks through how a network is trained: setting up input-output pairs, representing the output as a one-hot vector, and measuring the gap between the model's predictions and the labels with divergence (loss) functions such as the L2 norm and cross-entropy. It then explains the backpropagation algorithm, which computes gradients layer by layer from the output back to the input, and details how the gradient computation changes for multiplicative combination layers and vector activation functions.


Problem setup

  • Input-output pairs: nothing special to note; the training data are simply pairs of inputs and their desired outputs

  • Representing the output: one-hot vector

    • $y_{i}=\frac{\exp\left(z_{i}\right)}{\sum_{j} \exp\left(z_{j}\right)}$

    • Softmax with two classes reduces to the sigmoid

  • Divergence: must be differentiable

    • For real-valued output vectors, the (scaled) $L_2$ divergence

      • $\operatorname{Div}(Y, d)=\frac{1}{2}\|Y-d\|^{2}=\frac{1}{2} \sum_{i}\left(y_{i}-d_{i}\right)^{2}$
    • For a binary classifier

      • $\operatorname{Div}(Y, d)=-d \log Y-(1-d) \log (1-Y)$

      • Note: for binary targets ($d = 0$ or $1$) the derivative is not zero even when $Y = d$, but training with this divergence still converges very quickly

    • For multi-class classification

      • $\operatorname{Div}(Y, d)=-\sum_{i} d_{i} \log y_{i}=-\log y_{c}$, where $c$ is the index of the correct class

      • If $y_c < 1$, the slope w.r.t. $y_c$ is negative, which indicates that increasing $y_c$ will reduce the divergence (a small NumPy sketch of these divergences follows below)
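
As a concrete reference, here is a minimal NumPy sketch of the three divergences above (the function names are mine, not from the lecture):

```python
import numpy as np

def l2_divergence(y, d):
    # (scaled) L2 divergence: 0.5 * sum_i (y_i - d_i)^2
    return 0.5 * np.sum((y - d) ** 2)

def binary_cross_entropy(y, d, eps=1e-12):
    # -d*log(Y) - (1-d)*log(1-Y); eps guards against log(0)
    return -d * np.log(y + eps) - (1 - d) * np.log(1 - y + eps)

def multiclass_cross_entropy(y, d):
    # -sum_i d_i log y_i = -log y_c for a one-hot target d
    return -np.log(y[np.argmax(d)])

y = np.array([0.2, 0.7, 0.1])   # e.g. a softmax output
d = np.array([0.0, 1.0, 0.0])   # one-hot target, correct class c = 1
print(multiclass_cross_entropy(y, d))   # -log(0.7) ≈ 0.357
```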

Train the network

Distributed Chain rule

$y=f\left(g_{1}(x), g_{2}(x), \ldots, g_{M}(x)\right)$

$\frac{d y}{d x}=\frac{\partial f}{\partial g_{1}(x)} \frac{d g_{1}(x)}{d x}+\frac{\partial f}{\partial g_{2}(x)} \frac{d g_{2}(x)}{d x}+\cdots+\frac{\partial f}{\partial g_{M}(x)} \frac{d g_{M}(x)}{d x}$
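
To make the distributed chain rule concrete, here is a tiny numerical check (the specific $f$, $g_1$, $g_2$ are made up purely for illustration):

```python
import numpy as np

# y = f(g1(x), g2(x)) with f(a, b) = a * b, g1(x) = sin(x), g2(x) = x**2
g1, dg1 = np.sin, np.cos
g2, dg2 = lambda x: x ** 2, lambda x: 2 * x
f = lambda a, b: a * b
df_da = lambda a, b: b      # partial f / partial g1
df_db = lambda a, b: a      # partial f / partial g2

x = 1.3
a, b = g1(x), g2(x)
analytic = df_da(a, b) * dg1(x) + df_db(a, b) * dg2(x)   # distributed chain rule

h = 1e-6                                                  # central finite difference
numeric = (f(g1(x + h), g2(x + h)) - f(g1(x - h), g2(x - h))) / (2 * h)
print(analytic, numeric)   # the two values agree closely
```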

Backpropagation


  • For each layer we compute $\frac{\partial Div}{\partial y_{i}}$, $\frac{\partial Div}{\partial z_{i}}$, and $\frac{\partial Div}{\partial w_{ij}}$

  • For the output layer

    • It is easy to compute $\frac{\partial Div}{\partial y_{i}^{(N)}}$ directly from the divergence
    • So: $\frac{\partial Div}{\partial z_{i}^{(N)}}=f_{N}^{\prime}\left(z_{i}^{(N)}\right) \frac{\partial Div}{\partial y_{i}^{(N)}}$
    • $\frac{\partial Div}{\partial w_{ij}^{(N)}}=\frac{\partial z_{j}^{(N)}}{\partial w_{ij}^{(N)}} \frac{\partial Div}{\partial z_{j}^{(N)}}$, where $\frac{\partial z_{j}^{(N)}}{\partial w_{ij}^{(N)}} = y_{i}^{(N-1)}$
  • Pass on

    • $z_{j}^{(N)}=\sum_{i} w_{ij}^{(N)} y_{i}^{(N-1)}$, so $\frac{\partial z_{j}^{(N)}}{\partial y_{i}^{(N-1)}} = w_{ij}^{(N)}$
    • $\frac{\partial Div}{\partial y_{i}^{(N-1)}}=\sum_{j} w_{ij}^{(N)} \frac{\partial Div}{\partial z_{j}^{(N)}}$
    • $\frac{\partial Div}{\partial z_{i}^{(N-1)}}=f_{N-1}^{\prime}\left(z_{i}^{(N-1)}\right) \frac{\partial Div}{\partial y_{i}^{(N-1)}}$
    • $\frac{\partial Div}{\partial w_{ij}^{(N-1)}}=y_{i}^{(N-2)} \frac{\partial Div}{\partial z_{j}^{(N-1)}}$, and so on down to the first layer (see the NumPy sketch below)
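
These recursions map almost line-for-line to code. A minimal NumPy sketch (my own function and variable names; it assumes element-wise scalar activations and follows the indexing of the notes, where $z_j^{(k)} = \sum_i w_{ij}^{(k)} y_i^{(k-1)}$):

```python
import numpy as np

def backprop(Ws, zs, ys, dDiv_dyN, f_prime):
    """One backward pass through an N-layer fully-connected net.
    Ws[k-1]  : weight matrix of layer k, shape (n_inputs, n_outputs), W[i, j] = w_ij
    zs[k-1]  : pre-activations z^(k) saved during the forward pass
    ys[k]    : activations y^(k) saved during the forward pass (ys[0] is the input)
    dDiv_dyN : dDiv/dy at the output layer, computed from the divergence
    f_prime  : derivative of the scalar activation function
    """
    N = len(Ws)
    dWs = [None] * N
    dy = dDiv_dyN
    for k in range(N, 0, -1):
        dz = f_prime(zs[k - 1]) * dy          # dDiv/dz_i^(k) = f'(z_i^(k)) dDiv/dy_i^(k)
        dWs[k - 1] = np.outer(ys[k - 1], dz)  # dDiv/dw_ij^(k) = y_i^(k-1) dDiv/dz_j^(k)
        dy = Ws[k - 1] @ dz                   # dDiv/dy_i^(k-1) = sum_j w_ij^(k) dDiv/dz_j^(k)
    return dWs
```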


Special cases

Vector activations

  • Vector activations: all outputs are functions of all inputs


  • So the derivatives need to change a little

  • $\frac{\partial Div}{\partial z_{i}^{(k)}}=\sum_{j} \frac{\partial Div}{\partial y_{j}^{(k)}} \frac{\partial y_{j}^{(k)}}{\partial z_{i}^{(k)}}$

  • Note: derivatives of scalar activations are just a special case of vector activations:

  • $\frac{\partial y_{j}^{(k)}}{\partial z_{i}^{(k)}}=0 \text{ for } i \neq j$

  • For example, Softmax:

$y_{i}^{(k)}=\frac{\exp\left(z_{i}^{(k)}\right)}{\sum_{j} \exp\left(z_{j}^{(k)}\right)}$

$\frac{\partial Div}{\partial z_{i}^{(k)}}=\sum_{j} \frac{\partial Div}{\partial y_{j}^{(k)}} \frac{\partial y_{j}^{(k)}}{\partial z_{i}^{(k)}}$

$\frac{\partial y_{j}^{(k)}}{\partial z_{i}^{(k)}} = \begin{cases} y_{i}^{(k)}\left(1-y_{i}^{(k)}\right) & i=j \\ -y_{i}^{(k)} y_{j}^{(k)} & i \neq j \end{cases}$

  • Using the Kronecker delta ($\delta_{ij}=1$ if $i=j$, $0$ if $i \neq j$), this can be written compactly (checked numerically in the sketch below):

$\frac{\partial Div}{\partial z_{i}^{(k)}}=\sum_{j} \frac{\partial Div}{\partial y_{j}^{(k)}} y_{i}^{(k)}\left(\delta_{ij}-y_{j}^{(k)}\right)$
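
A quick numerical check of the softmax Jacobian above (names are mine; combined with the cross-entropy divergence the gradient collapses to the familiar $y - d$):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def softmax_jacobian(y):
    # dy_j/dz_i = y_i * (delta_ij - y_j)
    return np.diag(y) - np.outer(y, y)

z = np.array([1.0, 2.0, 0.5])
d = np.array([0.0, 1.0, 0.0])        # one-hot target
y = softmax(z)

dDiv_dy = -d / y                     # gradient of -sum_i d_i log y_i w.r.t. y
dDiv_dz = softmax_jacobian(y) @ dDiv_dy
print(dDiv_dz)                        # equals y - d
print(y - d)
```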

Multiplicative networks

  • Some types of networks have multiplicative combination (instead of additive combination)
  • Seen in networks such as LSTMs, GRUs, attention models, etc.


  • So the derivatives need to change

$\frac{\partial Div}{\partial o_{i}^{(k)}}=\sum_{j} w_{ij}^{(k+1)} \frac{\partial Div}{\partial z_{j}^{(k+1)}}$

$\frac{\partial Div}{\partial y_{j}^{(k-1)}}=\frac{\partial o_{i}^{(k)}}{\partial y_{j}^{(k-1)}} \frac{\partial Div}{\partial o_{i}^{(k)}}=y_{l}^{(k-1)} \frac{\partial Div}{\partial o_{i}^{(k)}}$, where the unit computes the product $o_{i}^{(k)} = y_{j}^{(k-1)} y_{l}^{(k-1)}$ (see the sketch below)

  • A layer of multiplicative combination is a special case of vector activation
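
A minimal sketch of the backward step through one multiplicative unit, assuming each unit forms the product of two lower-layer outputs (consistent with the derivative above); the names and numbers are arbitrary:

```python
import numpy as np

# forward: each multiplicative unit computes o = y_a * y_b
y_a = np.array([0.3, 0.9])
y_b = np.array([1.5, -0.2])
o = y_a * y_b

# backward: given dDiv/do from the layer above, the product rule gives
# the gradient w.r.t. each factor as "the other factor" times dDiv/do
dDiv_do = np.array([0.1, -0.4])      # arbitrary upstream gradient
dDiv_dya = y_b * dDiv_do
dDiv_dyb = y_a * dDiv_do
print(dDiv_dya, dDiv_dyb)
```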

Non-differentiable activations

  • Activation functions are sometimes not actually differentiable

    • The ReLU (Rectified Linear Unit)
      • And its variants: leaky ReLU, randomized leaky ReLU
    • The “max” function
  • Subgradient

    • $v$ is a subgradient of $f$ at $x_{0}$ if $f(x)-f\left(x_{0}\right) \geq v^{T}\left(x-x_{0}\right)$ for all $x$

    • The subgradient is a direction in which the function is guaranteed to increase

    • If the function is differentiable at $x$, the subgradient is the gradient

    • But a subgradient is not always a gradient: at a non-differentiable point there can be many valid subgradients, and in practice we simply pick one (see the ReLU sketch below)
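
For example, the ReLU is non-differentiable only at $z = 0$, where any value in $[0, 1]$ is a valid subgradient; implementations just pick one. A minimal sketch (the choice of 0 at $z = 0$ is arbitrary):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_subgradient(z):
    # slope is 1 for z > 0 and 0 for z < 0; at z == 0 any value in [0, 1]
    # is a valid subgradient -- here we arbitrarily pick 0
    return (z > 0).astype(float)

z = np.array([-1.0, 0.0, 2.0])
print(relu(z))               # [0. 0. 2.]
print(relu_subgradient(z))   # [0. 0. 1.]
```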

Vector formulation

  • Define the vectors:

(figure: the activations, pre-activations, weights, and biases of each layer collected into vectors $\mathbf{y}_k$, $\mathbf{z}_k$, matrices $W_k$, and bias vectors $\mathbf{b}_k$)

Forward pass

(figure: the forward pass in vector form, $\mathbf{z}_k = W_k \mathbf{y}_{k-1} + \mathbf{b}_k$, $\mathbf{y}_k = f_k(\mathbf{z}_k)$)

Backward pass

  • Chain rule
    • $\mathbf{y}=\boldsymbol{f}(\boldsymbol{g}(\mathbf{x}))$
    • Let $\mathbf{z} = \boldsymbol{g}(\mathbf{x})$, $\mathbf{y} = \boldsymbol{f}(\mathbf{z})$
    • So $J_{\mathbf{y}}(\mathbf{x})=J_{\mathbf{y}}(\mathbf{z}) J_{\mathbf{z}}(\mathbf{x})$
  • For scalar functions:
    • $D = f(W\mathbf{y} + \mathbf{b})$
    • Let $\mathbf{z} = W\mathbf{y} + \mathbf{b}$, $D = f(\mathbf{z})$
    • $\nabla_{\mathbf{y}} D = \nabla_{\mathbf{z}} D \, J_{\mathbf{z}}(\mathbf{y}) = \nabla_{\mathbf{z}} D \, W$
  • So for the backward pass
    • $\nabla_{\mathbf{z}_N} Div = \nabla_{Y} Div \, \nabla_{\mathbf{z}_N} Y$
    • $\nabla_{\mathbf{y}_{N-1}} Div = \nabla_{\mathbf{z}_N} Div \, \nabla_{\mathbf{y}_{N-1}} \mathbf{z}_N$
    • $\nabla_{W_N} Div = \mathbf{y}_{N-1} \nabla_{\mathbf{z}_N} Div$
    • $\nabla_{\mathbf{b}_N} Div = \nabla_{\mathbf{z}_N} Div$
  • For each layer
    • First compute $\nabla_{\mathbf{y}} Div$
    • Then compute $\nabla_{\mathbf{z}} Div$
    • Finally $\nabla_{W} Div$ and $\nabla_{\mathbf{b}} Div$ (a combined forward/backward sketch follows below)
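
Putting the vector formulation together, a minimal NumPy sketch of one forward plus backward pass (my own names; sigmoid activations and the L2 divergence are assumed purely for illustration, and $W_k$ is stored as an (outputs × inputs) matrix so that $\mathbf{z}_k = W_k \mathbf{y}_{k-1} + \mathbf{b}_k$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(Ws, bs, x, d):
    # forward pass: z_k = W_k y_{k-1} + b_k,  y_k = f(z_k)
    ys, zs = [x], []
    for W, b in zip(Ws, bs):
        zs.append(W @ ys[-1] + b)
        ys.append(sigmoid(zs[-1]))

    # backward pass, following the recursions above
    dy = ys[-1] - d                            # grad_Y Div for the L2 divergence
    dWs, dbs = [], []
    for k in range(len(Ws) - 1, -1, -1):
        s = sigmoid(zs[k])
        dz = dy * s * (1 - s)                  # grad_z Div = grad_y Div * f'(z)
        dWs.insert(0, np.outer(dz, ys[k]))     # grad_W Div
        dbs.insert(0, dz)                      # grad_b Div = grad_z Div
        dy = Ws[k].T @ dz                      # grad_{y_{k-1}} Div
    return ys[-1], dWs, dbs

# tiny usage example with random weights
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bs = [np.zeros(4), np.zeros(2)]
out, dWs, dbs = forward_backward(Ws, bs, rng.standard_normal(3), np.array([0.0, 1.0]))
```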

Training

Analogy to forward pass

(figure: the backward pass laid out in the same layer-by-layer form as the forward pass)
