Stability
- Will this necessarily be「Bounded Input Bounded Output」?
- Guaranteed if output and hidden activations are bounded
- But will it saturate?
Analyzing Recursion
- Sufficient to analyze the behavior of the hidden layer since it carries the relevant information
- Assume a linear system:
- $z_{k}=W_{h} h_{k-1}+W_{x} x_{k}, \quad h_{k}=z_{k}$
- Sufficient to analyze the response to a single input at $t = 0$ (the input is zero at all other times)
Simple scalar linear recursion
- $h(t) = w\,h(t-1) + c\,x(t)$
- $h_0(t) = w^t c\, x(0)$
- If $|w| > 1$ it will blow up (see the sketch below)
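A minimal numerical sketch of this scalar recursion (the values of $w$ below are illustrative, not from the notes):

```python
# Scalar linear recursion h(t) = w*h(t-1) + c*x(t), with a single
# input at t = 0 so that h(t) = w^t * c * x(0). Assume c*x(0) = 1.
for w in (0.9, 1.0, 1.1):
    h = 1.0  # c * x(0)
    for t in range(100):
        h = w * h
    print(f"w = {w}: h(100) = {h:.3e}")
# w = 0.9 shrinks toward 0, w = 1.0 holds steady, w = 1.1 blows up
```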
Simple vector linear recursion
- $h(t) = W h(t-1) + C x(t)$
- $h_0(t) = W^t C x(0)$
- For any input, for large $t$ the length of the hidden vector will expand or contract according to the $t$-th power of the largest eigenvalue of the hidden-layer weight matrix
- If $|\lambda_{max}| > 1$ it will blow up, otherwise it will contract and shrink to 0 rapidly
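A quick sketch of the vector case, rescaling a random $W$ to a chosen largest eigenvalue magnitude (sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))

# Rescale W so its largest |eigenvalue| is the chosen target, then iterate
# h(t) = W h(t-1) starting from h(0) = C x(0) (here just a random vector).
for target in (0.9, 1.1):
    Ws = W * (target / np.max(np.abs(np.linalg.eigvals(W))))
    h = rng.standard_normal(8)
    for t in range(100):
        h = Ws @ h
    print(f"|lambda_max| = {target}: ||h(100)|| = {np.linalg.norm(h):.3e}")
# the norm contracts toward 0 for |lambda_max| < 1 and expands for > 1
```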
Non-linearities
- Sigmoid: Saturates in a limited number of steps, regardless of $w$
- To a value dependent only on $w$ (and bias, if any)
- Rate of saturation depends on $w$
- Tanh: Sensitive to $w$, but eventually saturates
- “Prefers” weights close to 1.0
- ReLU: Sensitive to $w$, can blow up
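The same experiment with nonlinear activations, iterating $h(t) = f(w\,h(t-1))$ for sigmoid, tanh and ReLU (again with illustrative $w$ values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

activations = {
    "sigmoid": sigmoid,
    "tanh": np.tanh,
    "relu": lambda z: np.maximum(0.0, z),
}
for name, f in activations.items():
    for w in (0.9, 1.1):
        h = 1.0
        for t in range(50):
            h = f(w * h)
        print(f"{name:7s} w = {w}: h(50) = {h:.4f}")
# sigmoid settles at a fixed point determined by w; tanh decays toward 0
# for w < 1 and saturates at a nonzero value for w > 1; relu behaves like
# the linear case and shrinks or blows up
```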
Lessons
- Recurrent networks retain information from the infinite past in principle
- In practice, they tend to blow up or forget
- If the largest eigenvalue of the recurrent weight matrix is greater than 1, the network response may blow up
- If it’s less than one, the response dies down very quickly
- The “memory” of the network also depends on the parameters (and activation) of the hidden units
- Sigmoid activations saturate and the network becomes unable to retain new information
- ReLU activations blow up or vanish rapidly
- Tanh activations are the most effective at storing memory
- And even then the “memory” is very short
- Still sensitive to the eigenvalues of $W$
Vanishing gradient
- A particular problem with training deep networks is that the gradient of the error with respect to the weights is unstable
- For
- $\operatorname{Div}(X)=D\left(f_{N}\left(W_{N-1} f_{N-1}\left(W_{N-2} f_{N-2}\left(\ldots W_{0} X\right)\right)\right)\right)$
- We get
- $\nabla_{f_{k}} \operatorname{Div}=\nabla D \cdot \nabla f_{N} \cdot W_{N-1} \cdot \nabla f_{N-1} \cdot W_{N-2} \cdots \nabla f_{k+1} W_{k}$
- Where
- $\nabla f_{n}$ is the Jacobian of $f_{n}()$ with respect to its current input
For activation
- For RNN
- $\nabla f_{t}(z)=\left[\begin{array}{cccc}f_{t, 1}^{\prime}\left(z_{1}\right) & 0 & \cdots & 0 \\ 0 & f_{t, 2}^{\prime}\left(z_{2}\right) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & f_{t, N}^{\prime}\left(z_{N}\right)\end{array}\right]$
- For vector activations: A full matrix
- For scalar activations: A matrix where the diagonal entries are the derivatives of the activation of the recurrent hidden layer
- The derivative (or subgradient) of the activation function is always bounded
- Most common activation functions, such as sigmoid, tanh() and ReLU, have derivatives that never exceed 1
- Multiplication by the Jacobian is always a shrinking operation
- After a few layers the derivative of the divergence at any time is totally “forgotten”
For weights
- In a single-layer RNN, the weight matrices are identical
- The conclusion below holds for any deep network, though
- The chain product for $\nabla_{f_k} \operatorname{Div}$ will
- Expand $\nabla D$ along directions in which the singular values of the weight matrices are greater than 1
- Shrink $\nabla D$ in directions where the singular values are less than 1
- Repeated multiplication by the weight matrix will result in exploding or vanishing gradients (see the sketch below)
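A rough sketch of this chain product for a single-layer RNN unrolled over time, using the sigmoid's maximum derivative of 0.25 and a recurrent matrix rescaled to different eigenvalue magnitudes (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 16, 50
W = rng.standard_normal((n, n))
W /= np.max(np.abs(np.linalg.eigvals(W)))  # largest |eigenvalue| = 1

grad0 = rng.standard_normal(n)
for lam in (2.0, 4.0, 8.0):
    grad = grad0.copy()
    for t in range(T):
        # one step of the chain product: grad <- grad * diag(f') * W,
        # with diag(f') at its best sigmoid value of 0.25 * I
        grad = grad @ (0.25 * lam * W)
    print(f"|lambda_max| = {lam}: ||grad|| after {T} steps = {np.linalg.norm(grad):.3e}")
# the effective factor per step is 0.25 * lam: below 1 the gradient
# vanishes, above 1 it explodes
```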
LSTM
Problem
- Recurrent nets are very deep nets
- Stuff gets forgotten in the forward pass too
- Each weights matrix and activation can shrink components of the input
- Need the long-term dependency
- The memory retention of the network depends on the behavior of the weights and jacobian
- Which in turn depends on the parameters $W$ rather than on what it is trying to remember
- We need
- Memory that is not directly dependent on the vagaries of the network parameters, but rather on an input-based determination of whether it must be remembered
- Retain memories until a switch based on the input flags them as ok to forget
- 「Curly brace must remember until curly brace is closed」
Architecture
- The $\sigma$ units are multiplicative gates that decide if something is important or not
Key component
Remembered cell state
- Multiply is a switch
- Should I continue to remember or not? (scale up / down)
- Addition
- Should I augment the memory?
- $C_t$ is the linear history carried by the constant-error carousel
- Carries information through, only affected by a gate
- And by addition of new history, which is also gated…
Gates
- Gates are simple sigmoidal units with outputs in the range (0,1)
- They control how much of the information is to be let through
Forget gate
- The first gate determines whether to carry over the history or to forget it
- More precisely, how much of the history to carry over
- Also called the “forget” gate
- Note, we’re actually distinguishing between the cell memory $C$ and the state $h$ that is carried over time! They’re related, though
- The hidden state is computed from the memory (which is what is stored)
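For reference, the forget gate in the usual (non-peephole) formulation, acting on the concatenation $[h_{t-1}, x_t]$ with its own weights and bias (notation assumed, not spelled out in these notes):

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$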
Input gate
- The second input has two parts
- A perceptron layer that determines if there’s something new and interesting in the input
- 「See a curly brace」
- A gate that decides if it’s worth remembering
- 「Curly brace is in comment section, ignore it」
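In equations, the “detector” is the candidate memory $\tilde{C}_t$ and the gate is $i_t$ (usual formulation, notation assumed):

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$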
Memory cell update
- If something new and worth remembering
- Added to the current memory cell
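That is, the gated old memory plus the gated new candidate (with $\circ$ the elementwise product):

$$C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t$$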
Output and Output gate
- The output of the cell
- Simply compress it with tanh to make it lie between 1 and -1
- Note that this compression no longer affects our ability to carry memory forward
- Controlled by an output gate
- To decide if the memory contents are worth reporting at this time
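In equations (usual formulation, notation assumed):

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \circ \tanh(C_t)$$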
The “Peephole” Connection
- The raw memory is informative by itself and can also be input
- Note, we’re using both $C$ and $h$
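In one common peephole variant, the gates also take the cell state as a direct input, e.g.:

$$f_t = \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right), \quad i_t = \sigma\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right), \quad o_t = \sigma\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right)$$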
Forward
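A minimal numpy sketch of one forward step of the non-peephole cell described above (the weight names, shapes, and toy usage are my own, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, params):
    """One forward step of a basic (non-peephole) LSTM cell."""
    Wf, bf, Wi, bi, Wc, bc, Wo, bo = params
    v = np.concatenate([h_prev, x])   # [h_{t-1}, x_t]
    f = sigmoid(Wf @ v + bf)          # forget gate
    i = sigmoid(Wi @ v + bi)          # input gate
    C_tilde = np.tanh(Wc @ v + bc)    # candidate memory
    C = f * C_prev + i * C_tilde      # memory cell update (the CEC)
    o = sigmoid(Wo @ v + bo)          # output gate
    h = o * np.tanh(C)                # new hidden state
    return h, C

# Toy usage with random parameters: hidden size 4, input size 3.
rng = np.random.default_rng(0)
H, D = 4, 3
params = []
for _ in range(4):
    params += [rng.standard_normal((H, H + D)), np.zeros(H)]
h, C = np.zeros(H), np.zeros(H)
for t in range(5):
    h, C = lstm_step(rng.standard_normal(D), h, C, params)
print(h)
```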
Backward[^1]
$$\begin{aligned} \nabla_{C_{t}} \operatorname{Div} =\; & \nabla_{h_{t}} \operatorname{Div} \circ\left(o_{t} \circ \tanh^{\prime}(.)+\tanh(.) \circ \sigma^{\prime}(.)\, W_{Co}\right)\, + \\ & \nabla_{C_{t+1}} \operatorname{Div} \circ\left(f_{t+1}+C_{t} \circ \sigma^{\prime}(.)\, W_{Cf}+\tilde{C}_{t+1} \circ \sigma^{\prime}(.)\, W_{Ci} \circ \tanh(.) \ldots\right) \end{aligned}$$
$$\begin{aligned} \nabla_{h_{t}} \operatorname{Div} =\; & \nabla_{z_{t}} \operatorname{Div}\, \nabla_{h_{t}} z_{t}+\nabla_{C_{t+1}} \operatorname{Div} \circ\left(C_{t} \circ \sigma^{\prime}(.)\, W_{hf}+\tilde{C}_{t+1} \circ \sigma^{\prime}(.)\, W_{hi}\right)\, + \\ & \nabla_{C_{t+1}} \operatorname{Div} \circ o_{t+1} \circ \tanh^{\prime}(.)\, W_{hi}+\nabla_{h_{t+1}} \operatorname{Div} \circ \tanh(.) \circ \sigma^{\prime}(.)\, W_{ho} \end{aligned}$$
And weights?
Gated Recurrent Units
- Combine forget and input gates
- If new input is to be remembered, then this means old memory is to be forgotten
- No need to compute twice
- Don’t bother to separately maintain compressed and regular memories
- Redundant representation
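For reference, the standard GRU update: a single update gate $z_t$ plays the combined forget/input role, a reset gate $r_t$ controls how much of the old state feeds the candidate, and the state $h_t$ doubles as the memory:

$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right), \quad r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right), \quad \tilde{h}_t = \tanh\left(W \cdot [r_t \circ h_{t-1}, x_t]\right), \quad h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t$$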
Summary
- LSTMs are an alternative formalism where memory is made more directly dependent on the input, rather than network parameters/structure
- Through a “Constant Error Carousel” memory structure with no weights or activations, but instead direct switching and “increment/decrement” from pattern recognizers
- They do not suffer from the vanishing gradient problem, but can still suffer from exploding gradients
[^1]: http://arunmallya.github.io/writeups/nn/lstm/index.html#/