In NLP, sequencing is the step that turns a corpus of sentences into sequences of numbers a neural network can consume. Each word in the training sentences is assigned an integer token, and every sentence is then represented as the list of its words' tokens.
Example:
sentences = [
'I love geeksforgeeks',
'You love geeksforgeeks',
'What do you think about geeksforgeeks?'
]
The tokenizer assigns indices by descending word frequency, so 'geeksforgeeks', which occurs in every sentence, gets index 1:
Word Index: {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4, 'what': 5, 'do': 6, 'think': 7, 'about': 8}
Sequences: [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]
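For intuition, this ranking can be reproduced without Keras. The following is a minimal, hypothetical sketch (not the actual Keras implementation): it counts word frequencies and gives lower indices to more frequent words, breaking ties by order of first appearance.

import re
from collections import Counter

def build_word_index(sentences):
    # lowercase and strip punctuation, mirroring the tokenizer's defaults
    words = []
    for s in sentences:
        words.extend(re.findall(r"[a-z']+", s.lower()))
    # a stable sort by descending count keeps ties in first-appearance order
    counts = Counter(words)
    ranked = sorted(counts, key=lambda w: counts[w], reverse=True)
    return {word: i + 1 for i, word in enumerate(ranked)}

Applied to the three sentences above, build_word_index reproduces the word index shown.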
If the test set contains words the tokenizer has never seen, or we need to handle unknown words at prediction time, we can add a simple placeholder (out-of-vocabulary) token.
Let the test set be:
test_data = [
'i really love geeksforgeeks',
'Do you like geeksforgeeks'
]
We therefore define an additional placeholder token for words the tokenizer hasn't seen. The placeholder is assigned index 1, so every other word's index shifts up by one:
Word Index = {'placeholder': 1, 'geeksforgeeks': 2, 'love': 3, 'you': 4, 'i': 5, 'what': 6, 'do': 7, 'think': 8, 'about': 9}
Sequences = [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]
Because the words 'really' and 'like' were not encountered during fitting, they are replaced by the placeholder, which has index 1. So the test sequences become:
Test Sequence = [[5, 1, 3, 2], [7, 4, 1, 2]]
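Conceptually, converting a test sentence with a fitted word index is just a dictionary lookup that falls back to the placeholder index for unknown words. The helper below is a hypothetical sketch built on the word index above, not a Keras API:

import re

def to_sequence(sentence, word_index, oov_index=1):
    # any word missing from the index maps to the placeholder
    words = re.findall(r"[a-z']+", sentence.lower())
    return [word_index.get(w, oov_index) for w in words]

For example, to_sequence('i really love geeksforgeeks', word_index) yields [5, 1, 3, 2].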
Code: Implementation with TensorFlow
# import the tokenizer and padding utilities from Keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# the initial corpus of sentences or the training set
sentences = [
'I love geeksforgeeks',
'You love geeksforgeeks',
'What do you think about geeksforgeeks?'
]
# num_words keeps only the 100 most frequent words;
# the tokenizer also lowercases text and strips punctuation
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print("Word Index: ", word_index)
print("Sequences: ", sequences)
# define an out-of-vocabulary placeholder token; it receives index 1
tokenizer = Tokenizer(num_words=100, oov_token="placeholder")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print("\nSequences = ", sequences)
# the test data, containing words the tokenizer hasn't encountered
test_data = [
'i really love geeksforgeeks',
'Do you like geeksforgeeks'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)
Output:
Word Index: {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4, 'what': 5, 'do': 6, 'think': 7, 'about': 8}
Sequences: [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]
Sequences = [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]
Test Sequence = [[5, 1, 3, 2], [7, 4, 1, 2]]
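The pad_sequences import is not used in the snippet above, but it is the usual next step: a network expects inputs of equal length, so shorter sequences are padded with zeros. A small illustration, with maxlen=6 chosen arbitrarily:

# pad every sequence to length 6, appending zeros at the end
padded = pad_sequences(test_seq, maxlen=6, padding='post')
print("\nPadded Test Sequence = \n", padded)
# prints:
# [[5 1 3 2 0 0]
#  [7 4 1 2 0 0]]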