Generating Language
Synthesis
- Input: symbols as one-hot vectors
- Dimensionality of the vector is the size of the “vocabulary”
- Projected down to lower-dimensional “embeddings”
- The hidden units are (one or more layers of) LSTM units
- Output at each time: A probability distribution that ideally assigns peak probability to the next word in the sequence
- Divergence
$$\operatorname{Div}(\mathbf{Y}_{\text{target}}(1 \ldots T), \mathbf{Y}(1 \ldots T)) = \sum_{t} \operatorname{Xent}(\mathbf{Y}_{\text{target}}(t), \mathbf{Y}(t)) = -\sum_{t} \log Y(t, w_{t+1})$$
- Draw a word from the output probability distribution at each time
- Feed the drawn word back as the input at the next time step
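A minimal PyTorch sketch of the divergence above (the per-step cross-entropy summed over time); the function name and shape conventions are assumptions, not part of the original notes:

```python
import torch.nn.functional as F

def sequence_divergence(logits, target_ids):
    """Div = sum_t Xent(Y_target(t), Y(t)) = -sum_t log Y(t, w_{t+1}).

    logits:     (T, vocab_size) unnormalized scores, one row per time step
    target_ids: (T,) index of the next word w_{t+1} at each time step
    """
    # cross_entropy applies log-softmax internally; 'sum' adds over time steps
    return F.cross_entropy(logits, target_ids, reduction="sum")
```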
Beginnings and ends
- A sequence of words by itself does not indicate if it is a complete sentence or not
- To make it explicit, we will add two additional symbols (in addition to the words) to the base vocabulary
  - <sos>: indicates the start of a sentence
  - <eos>: indicates the end of a sentence
- When do we stop?
  - Continue this process until we draw an <eos>
  - Or until we decide to terminate generation based on some other criterion
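A minimal sketch of this generation loop, stopping on <eos> or a length cap; `model.step(word_id, state)` is an assumed interface returning the next-word distribution and the updated recurrent state:

```python
import torch

def generate(model, sos_id, eos_id, max_len=100):
    """Draw words one at a time until <eos> (or a length cap) is reached."""
    word, state, output = sos_id, None, []
    for _ in range(max_len):                      # fallback stopping criterion
        probs, state = model.step(word, state)    # distribution over the vocabulary
        word = torch.multinomial(probs, 1).item() # draw the next word
        if word == eos_id:
            break
        output.append(word)
        # the drawn word is fed back as the input at the next time step
    return output
```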
Delayed sequence to sequence
Pseudocode
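A Python-style sketch of this pseudocode, with `rnn_in(x, h)` and `rnn_out(h)` as assumed per-step encoder and decoder functions (the names are illustrative):

```python
def delayed_seq2seq(rnn_in, rnn_out, inputs, eos_id, max_len=100):
    """First pass the whole input through the network, then generate the output."""
    h = None
    for x in inputs:                 # input sequence ends with an explicit <eos>
        h = rnn_in(x, h)             # hidden state at <eos> summarizes the input
    output = []
    while len(output) < max_len:
        probs, h = rnn_out(h)        # output depends only on the hidden state,
        word = int(probs.argmax())   # not on previously produced words
        output.append(word)
        if word == eos_id:
            break
    return output
```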
- Problem: Each word that is output depends only on current hidden state, and not on previous outputs
- The input sequence feeds into a recurrent structure
- The input sequence is terminated by an explicit <eos> symbol
- The hidden activation at the <eos> “stores” all information about the sentence
- Subsequently a second RNN uses the hidden activation as initial state to produce a sequence of outputs
- The output at each time becomes the input at the next time
- Output production continues until an <eos> is produced
Autoencoder
- The recurrent structure that extracts the hidden representation from the input sequence is the encoder
- The recurrent structure that utilizes this representation to produce the output sequence is the decoder
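A PyTorch sketch of this encoder-decoder arrangement, where the decoder is seeded with the encoder's final state and each produced word is fed back as the next decoder input; layer sizes, the use of <eos> as the first decoder input, and running to a fixed `max_len` are simplifying assumptions:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder-decoder sketch: the encoder's final state seeds the decoder."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, eos_id, max_len=50):
        # Encode: the state at the terminating <eos> summarizes the input
        _, state = self.encoder(self.embed(src_ids))
        word = torch.full((src_ids.size(0), 1), eos_id,
                          dtype=torch.long, device=src_ids.device)
        outputs = []
        for _ in range(max_len):
            dec_out, state = self.decoder(self.embed(word), state)
            logits = self.out(dec_out[:, -1])
            word = logits.argmax(dim=-1, keepdim=True)  # fed back at the next step
            outputs.append(word)
        return torch.cat(outputs, dim=1)
```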
Generating output
- At each time the network produces a probability distribution over words, given the entire input and previous outputs
- At each time a word is drawn from the output distribution
$$P(O_{1}, \ldots, O_{L} \mid W_{1}^{in}, \ldots, W_{N}^{in}) = y_{1}^{O_{1}}\, y_{2}^{O_{2}} \cdots y_{L}^{O_{L}}$$

- Here $y_{t}^{O_{t}}$ is the probability the network assigns to word $O_{t}$ at output time $t$
- The objective of drawing: produce the most likely output (one that ends in an <eos>)

$$\underset{O_{1}, \ldots, O_{L}}{\operatorname{argmax}}\; y_{1}^{O_{1}}\, y_{2}^{O_{2}} \cdots y_{L}^{O_{L}}$$
- How to draw words?
- Greedy answer
- Select the most probable word at each time
- Not good: a poor choice at any time commits us to a poor future
- Randomly draw a word at each time according to the output probability distribution
- Not guaranteed to give you the most likely output
- Beam search
- Search multiple choices and prune
- At each time, retain only the top K scoring forks
- Terminate: when the current most likely path overall ends in an <eos>
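A sketch of beam search under an assumed `step(word_id, state)` interface that returns log-probabilities over the vocabulary and the updated decoder state:

```python
def beam_search(step, init_state, sos_id, eos_id, beam_width=4, max_len=50):
    """Keep only the top-K scoring partial outputs (forks) at each time step."""
    beams = [(0.0, [sos_id], init_state)]            # (log-score, words, state)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, words, state in beams:
            log_probs, new_state = step(words[-1], state)
            for w, lp in enumerate(log_probs):        # expand over the vocabulary
                candidates.append((score + lp, words + [w], new_state))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, words, state in candidates[:beam_width]:
            if words[-1] == eos_id:
                completed.append((score, words))      # this fork has terminated
            else:
                beams.append((score, words, state))
        # stop when the best-scoring path overall has ended in <eos>
        if completed:
            best_done = max(c[0] for c in completed)
            if not beams or best_done >= beams[0][0]:
                break
    pool = completed + [(s, w) for s, w, _ in beams]
    return max(pool, key=lambda c: c[0])[1]
```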
Train
- In practice, if we apply SGD, we may randomly sample words from the output to actually use for the backprop and update
- Randomly select training instance: (input, output)
- Forward pass
- Randomly select a single output $y(t)$ and the corresponding desired output $d(t)$ for backprop
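A sketch of one such update, where the model's per-step scores are assumed to come back as a `(T_out, vocab)` tensor and a single output time step is sampled for the backward pass (the `model(src_ids, tgt_ids)` signature is an assumption):

```python
import random
import torch.nn.functional as F

def train_step(model, optimizer, src_ids, tgt_ids):
    """One SGD update: forward the whole pair, backprop one sampled output."""
    logits = model(src_ids, tgt_ids)          # assumed shape: (T_out, vocab)
    t = random.randrange(len(tgt_ids))        # pick one output time at random
    loss = F.cross_entropy(logits[t:t+1], tgt_ids[t:t+1])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```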
- Trick
- The input sequence is fed in reverse order
- This happens both for training and during actual decode
- Problem
- All the information about the input sequence is embedded into a single vector
- In reality: All hidden values carry information
Attention model
- Compute a weighted combination of all the hidden outputs into a single vector
- Weights vary by output time
- Require a time-varying weight that specifies relationship of output time to input time
- Weights are functions of current output state
$$e_{i}(t) = g(\boldsymbol{h}_{i}, \boldsymbol{s}_{t-1})$$

$$w_{i}(t) = \frac{\exp(e_{i}(t))}{\sum_{j} \exp(e_{j}(t))}$$
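A minimal sketch of these two equations plus the resulting context vector, using a plain dot product for $g$ (one of the options listed next):

```python
import torch

def attention_context(h, s_prev):
    """Soft attention over encoder states.

    h:      (T_in, d) encoder hidden outputs h_i
    s_prev: (d,)      previous decoder state s_{t-1}
    """
    e = h @ s_prev               # e_i(t) = g(h_i, s_{t-1}), here a dot product
    w = torch.softmax(e, dim=0)  # w_i(t): normalized attention weights
    context = w @ h              # weighted combination of all hidden outputs
    return context, w
```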
Attention weight
- Typical options for $g(\cdot)$
- Inner product
  - $g(\boldsymbol{h}_{i}, \boldsymbol{s}_{t-1}) = \boldsymbol{h}_{i}^{T} \boldsymbol{s}_{t-1}$
- Project to the same dimension
  - $g(\boldsymbol{h}_{i}, \boldsymbol{s}_{t-1}) = \boldsymbol{h}_{i}^{T} \boldsymbol{W}_{g} \boldsymbol{s}_{t-1}$
- Non-linear activation
  - $g(\boldsymbol{h}_{i}, \boldsymbol{s}_{t-1}) = \boldsymbol{v}_{g}^{T} \tanh\!\left(\boldsymbol{W}_{g} \begin{bmatrix} \boldsymbol{h}_{i} \\ \boldsymbol{s}_{t-1} \end{bmatrix}\right)$
- MLP
  - $g(\boldsymbol{h}_{i}, \boldsymbol{s}_{t-1}) = \operatorname{MLP}([\boldsymbol{h}_{i}, \boldsymbol{s}_{t-1}])$
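A PyTorch sketch of these scoring options; the dimensions, module names, and shared projection size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AttentionScore(nn.Module):
    """Common choices for g(h_i, s_{t-1})."""

    def __init__(self, h_dim, s_dim, proj_dim=128):
        super().__init__()
        self.W_g = nn.Linear(s_dim, h_dim, bias=False)              # bilinear form
        self.W_c = nn.Linear(h_dim + s_dim, proj_dim, bias=False)   # additive form
        self.v_g = nn.Linear(proj_dim, 1, bias=False)
        self.mlp = nn.Sequential(nn.Linear(h_dim + s_dim, proj_dim),
                                 nn.Tanh(), nn.Linear(proj_dim, 1))

    def forward(self, h, s, kind="dot"):
        # h: (T, h_dim) encoder states, s: (s_dim,) previous decoder state
        if kind == "dot":        # g = h_i^T s_{t-1}   (needs h_dim == s_dim)
            return h @ s
        if kind == "bilinear":   # g = h_i^T W_g s_{t-1}
            return h @ self.W_g(s)
        if kind == "additive":   # g = v_g^T tanh(W_g [h_i; s_{t-1}])
            hs = torch.cat([h, s.expand(h.size(0), -1)], dim=-1)
            return self.v_g(torch.tanh(self.W_c(hs))).squeeze(-1)
        # default: MLP on the concatenated pair
        hs = torch.cat([h, s.expand(h.size(0), -1)], dim=-1)
        return self.mlp(hs).squeeze(-1)
```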
Pseudocode
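A Python-style sketch of attention-based decoding, with `score(h, s)` standing in for the chosen $g(\cdot)$ and `decoder_step(word, context, s)` as an assumed per-step decoder interface; initializing the decoder state to zeros of the encoder's output size is also an assumption:

```python
import torch

def attention_decode(h, decoder_step, score, sos_id, eos_id, max_len=50):
    """Decoding with attention: recompute the context at every output step.

    h: (T_in, d) encoder hidden outputs
    """
    s, word, output = torch.zeros(h.size(1)), sos_id, []
    for _ in range(max_len):
        w = torch.softmax(score(h, s), dim=0)    # attention weights for this step
        context = w @ h                          # time-varying weighted combination
        probs, s = decoder_step(word, context, s)
        word = int(torch.multinomial(probs, 1))
        if word == eos_id:
            break
        output.append(word)
    return output
```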
Train
- Back propagation also updates parameters of the “attention” function
- Trick: occasionally pass the drawn output, instead of the ground truth, as the input
  - Randomly sample from the output; this forces the network to produce the correct word even when the prior word is not correct
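A minimal sketch of this trick (sometimes called scheduled sampling); the sampling probability is an assumed knob:

```python
import random

def next_decoder_input(ground_truth_word, drawn_word, sample_prob=0.1):
    """Occasionally feed the model's own drawn word instead of the ground truth,
    so that training conditions better match decode-time conditions."""
    return drawn_word if random.random() < sample_prob else ground_truth_word
```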
Variants
- Bidirectional processing of input sequence
- Local attention vs global attention
- Multihead attention
- Derive a “value” and multiple “keys” from the encoder
  - $V_{i},\; K_{i}^{l}, \quad i = 1 \ldots T,\; l = 1 \ldots N_{\text{head}}$
- Derive one or more “queries” from the decoder
  - $Q_{j}^{l}, \quad j = 1 \ldots M,\; l = 1 \ldots N_{\text{head}}$
- Each query-key pair gives you one attention distribution
- And one context vector
  - $a_{j,i}^{l} = \operatorname{attention}(Q_{j}^{l}, K_{i}^{l}, i = 1 \ldots T), \quad C_{j}^{l} = \sum_{i} a_{j,i}^{l} V_{i}$
- Concatenate set of context vectors into one extended context vector
  - $C_{j} = \left[C_{j}^{1}\; C_{j}^{2} \ldots C_{j}^{N_{\text{head}}}\right]$
- Each “attender” focuses on a different aspect of the input that is important for the decode
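A sketch of multi-head attention following the notation above (one shared value projection, per-head keys and queries); the dimensions and dot-product scoring are assumptions:

```python
import torch
import torch.nn as nn

class MultiHeadCrossAttention(nn.Module):
    """Per-head keys/queries, shared values, concatenated context vectors."""

    def __init__(self, enc_dim, dec_dim, key_dim=64, n_heads=4):
        super().__init__()
        self.n_heads = n_heads
        self.W_k = nn.ModuleList(nn.Linear(enc_dim, key_dim) for _ in range(n_heads))
        self.W_q = nn.ModuleList(nn.Linear(dec_dim, key_dim) for _ in range(n_heads))
        self.W_v = nn.Linear(enc_dim, enc_dim)   # values derived from the encoder

    def forward(self, enc_h, dec_s):
        # enc_h: (T, enc_dim) encoder outputs, dec_s: (dec_dim,) decoder state
        V = self.W_v(enc_h)                                  # V_i, i = 1..T
        contexts = []
        for l in range(self.n_heads):
            K = self.W_k[l](enc_h)                           # K_i^l, i = 1..T
            Q = self.W_q[l](dec_s)                           # Q_j^l
            a = torch.softmax(K @ Q, dim=0)                  # a_{j,i}^l, one per head
            contexts.append(a @ V)                           # C_j^l = sum_i a_{j,i}^l V_i
        return torch.cat(contexts, dim=-1)                   # C_j = [C_j^1 ... C_j^{N_head}]
```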