import tensorflow as tf
import time
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, Input, Dropout, LayerNormalization
from transformers import DistilBertTokenizerFast #, TFDistilBertModel
from transformers import TFDistilBertForTokenClassification
1 - Positional Encoding
In sequence-to-sequence tasks, the relative order of your data is extremely important to its meaning. When you were training sequential neural networks such as RNNs, you fed your inputs into the network in order. Information about the order of your data was automatically fed into your model. However, when you train a Transformer network using multi-head attention, you feed your data into the model all at once. While this dramatically reduces training time, the model receives no information about the order of your data. This is where positional encoding is useful: you can specifically encode the positions of your inputs and pass them into the network using these sine and cosine formulas:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)$$

- $d$ is the dimension of the word embedding and positional encoding
- $pos$ is the position of the word.
- $i$ refers to each of the different dimensions of the positional encoding.
To develop some intuition about positional encodings, you can think of them broadly as a feature that contains information about the relative positions of words. The sum of the positional encoding and the word embedding is ultimately what is fed into the model. If you just hard-code the positions in, say by adding a matrix of 1's or whole numbers to the word embedding, the semantic meaning is distorted. Conversely, the values of the sine and cosine equations are small enough (between -1 and 1) that when you add the positional encoding to a word embedding, the word embedding is not significantly distorted, and is instead enriched with positional information. Using a combination of these two equations helps your Transformer network attend to the relative positions of your input data. This was a short discussion on positional encodings, but to develop further intuition, check out the Positional Encoding Ungraded Lab.
Note: In the lectures Andrew uses vertical vectors, but in this assignment all vectors are horizontal. All matrix multiplications should be adjusted accordingly.
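To make the "sum of the positional encoding and word embedding" idea concrete, here is a minimal, non-graded sketch of that addition. The array names and shapes are illustrative assumptions, not part of the assignment code.
# Illustrative sketch (not graded): adding positional encodings to word embeddings.
batch_size, seq_len, d_model = 2, 5, 8
# Stand-in for the output of an Embedding layer: (batch, seq_len, d_model)
embeddings = np.random.randn(batch_size, seq_len, d_model)
# Stand-in for precomputed positional encodings, values in [-1, 1]: (1, seq_len, d_model)
pos_encoding = np.random.uniform(-1.0, 1.0, size=(1, seq_len, d_model))
# The model consumes the sum: each token keeps its semantic embedding,
# gently shifted by a position-dependent signal (broadcast over the batch).
encoder_input = embeddings + pos_encoding
print(encoder_input.shape)  # (2, 5, 8)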
1.1 - Sine and Cosine Angles
Notice that even though the sine and cosine positional encoding equations take in different arguments ($2i$ versus $2i+1$, or even versus odd numbers), the inner terms for both equations are the same:

$$\theta(pos, i, d) = \frac{pos}{10000^{\frac{2i}{d}}}$$
Consider the inner term as you calculate the positional encoding for a word in a sequence.
- $PE_{(pos, 0)} = \sin\left(\frac{pos}{10000^{\frac{0}{d}}}\right)$, since solving $2i = 0$ gives $i = 0$
- $PE_{(pos, 1)} = \cos\left(\frac{pos}{10000^{\frac{0}{d}}}\right)$, since solving $2i + 1 = 1$ gives $i = 0$

The angle is the same for both! The angles for $PE_{(pos, 2)}$ and $PE_{(pos, 3)}$ are the same as well, since for both $i = 1$, and therefore the inner term is $\frac{pos}{10000^{\frac{2}{d}}}$. This relationship holds true for all paired sine and cosine curves:
| k | 0 | 1 | 2 | 3 | ... | d - 2 | d - 1 |
|---|---|---|---|---|---|---|---|
| encoding(0) = | [$\sin(\theta(0, 0, d))$ | $\cos(\theta(0, 0, d))$ | $\sin(\theta(0, 1, d))$ | $\cos(\theta(0, 1, d))$ | ... | $\sin(\theta(0, d//2, d))$ | $\cos(\theta(0, d//2, d))$] |
| encoding(1) = | [$\sin(\theta(1, 0, d))$ | $\cos(\theta(1, 0, d))$ | $\sin(\theta(1, 1, d))$ | $\cos(\theta(1, 1, d))$ | ... | $\sin(\theta(1, d//2, d))$ | $\cos(\theta(1, d//2, d))$] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| encoding(pos) = | [$\sin(\theta(pos, 0, d))$ | $\cos(\theta(pos, 0, d))$ | $\sin(\theta(pos, 1, d))$ | $\cos(\theta(pos, 1, d))$ | ... | $\sin(\theta(pos, d//2, d))$ | $\cos(\theta(pos, d//2, d))$] |
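As a quick, non-graded sanity check of this pairing, the snippet below evaluates the inner term directly and confirms that columns 0 and 1 (and, in general, columns $2i$ and $2i+1$) share the same angle. The variable names are illustrative only.
# Non-graded check: paired columns 2i and 2i+1 share the same inner angle.
pos, d = 3, 8                              # an arbitrary position and encoding size
k = np.arange(d)                           # column indices 0, 1, ..., d-1
i = k // 2                                 # each even/odd pair maps to the same i
theta = pos / np.power(10000, 2 * i / d)   # the inner term for every column
print(theta[0] == theta[1], theta[2] == theta[3])  # True True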
Exercise 1 - get_angles
Implement the function get_angles() to calculate the possible angles for the sine and cosine positional encodings.
Hints
- If k = [0, 1, 2, 3, 4, 5], then i must be i = [0, 0, 1, 1, 2, 2]
- i = k//2
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION get_angles
def get_angles(pos, k, d):
"""
Get the angles for the positional encoding
Arguments:
pos -- Column vector containing the positions [[0], [1], ...,[N-1]]
k -- Row vector containing the dimension span [[0, 1, 2, ..., d-1]]
d(integer) -- Encoding size
Returns:
angles -- (pos, d) numpy array
"""
# START CODE HERE
# Get i from dimension span k
i = k//2
# Calculate the angles using pos, i and d
angles = pos/(np.power(10000,2*i/d))
# END CODE HERE
return angles
from public_tests import *
get_angles_test(get_angles)
# Example
position = 4
d_model = 8
pos_m = np.arange(position)[:, np.newaxis]
dims = np.arange(d_model)[np.newaxis, :]
get_angles(pos_m, dims, d_model)
All tests passed
array([[0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00],
[1.e+00, 1.e+00, 1.e-01, 1.e-01, 1.e-02, 1.e-02, 1.e-03, 1.e-03],
[2.e+00, 2.e+00, 2.e-01, 2.e-01, 2.e-02, 2.e-02, 2.e-03, 2.e-03],
[3.e+00, 3.e+00, 3.e-01, 3.e-01, 3.e-02, 3.e-02, 3.e-03, 3.e-03]])
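The printed array shows the pairing directly: columns 0 and 1 match, columns 2 and 3 match, and so on. As a short, non-graded check, you can make this explicit by reusing pos_m, dims, and d_model from the example above:
# Non-graded check: every even-indexed column equals the odd-indexed column after it.
angles = get_angles(pos_m, dims, d_model)
print(np.allclose(angles[:, 0::2], angles[:, 1::2]))  # True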
1.2 - Sine and Cosine Positional Encodings
Now you can use the angles you computed to calculate the sine and cosine positional encodings.
Exercise 2 - positional_encoding
Implement the function positional_encoding() to calculate the sine and cosine positional encodings.
Reminder: Use the sine equation when i is an even number and the cosine equation when i is an odd number.
Additional Hints
- You may find np.newaxis useful depending on the implementation you choose.
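Since the hint mentions np.newaxis, here is a small, non-graded illustration of how it reshapes 1-D ranges into the column and row vectors that get_angles expects, and of even/odd column slicing, which is handy here. The arrays are placeholders, not part of the graded solution.
# Non-graded illustration: np.newaxis builds 2-D row/column vectors,
# and 0::2 / 1::2 slicing selects the even- and odd-indexed columns.
demo = np.arange(8)
print(demo[:, np.newaxis].shape)   # (8, 1) -- a column vector of positions
print(demo[np.newaxis, :].shape)   # (1, 8) -- a row vector of dimension indices
matrix = np.arange(12).reshape(3, 4)
print(matrix[:, 0::2])             # columns 0 and 2 (even indices)
print(matrix[:, 1::2])             # columns 1 and 3 (odd indices)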
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION positional_encoding
def positional_encoding(positions, d):
"""
Precomputes a matrix with all the positional encodings
Arguments:
positions (int) -- Maximum number of positions to be encoded
d (int) -- Encoding size
Returns:
pos_encoding -- (1, position, d_model) A matrix with the positional encodings
"""
# START CODE HERE
# initialize a matrix angle_rads of all the angles
angle_rads = get_angles(np.arange(positions)[:, np.newaxis],
np.arange(d)[np