In NLP, sequencing is the step that turns a corpus of sentences into sequences of numbers a neural network can consume. Each word in the training sentences is assigned an integer token, and every sentence is then represented as the list of its words' tokens.
Example:
sentences = [
'I love geeksforgeeks',
'You love geeksforgeeks',
'What do you think about geeksforgeeks?'
]
The tokenizer assigns indices by descending word frequency, so 'geeksforgeeks', which occurs in every sentence, gets index 1:
Word Index: {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4, 'what': 5, 'do': 6, 'think': 7, 'about': 8}
Sequences: [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]
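For intuition, this ranking can be reproduced without Keras. The following is a minimal, hypothetical sketch (not the actual Keras implementation): it counts word frequencies and gives lower indices to more frequent words, breaking ties by order of first appearance.

import re
from collections import Counter

def build_word_index(sentences):
    # lowercase and strip punctuation, mirroring the tokenizer's defaults
    words = []
    for s in sentences:
        words.extend(re.findall(r"[a-z']+", s.lower()))
    # a stable sort by descending count keeps ties in first-appearance order
    counts = Counter(words)
    ranked = sorted(counts, key=lambda w: counts[w], reverse=True)
    return {word: i + 1 for i, word in enumerate(ranked)}

Applied to the three sentences above, build_word_index reproduces the word index shown.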
If the test set contains words the tokenizer has never seen, or we need to handle unknown words at prediction time, we can add a simple placeholder (out-of-vocabulary) token.
Let the test set be:
test_data = [
'i really love geeksforgeeks',
'Do you like geeksforgeeks'
]
We therefore define an additional placeholder token for words the tokenizer hasn't seen. The placeholder is assigned index 1, so every other word's index shifts up by one:
Word Index = {'placeholder': 1, 'geeksforgeeks': 2, 'love': 3, 'you': 4, 'i': 5, 'what': 6, 'do': 7, 'think': 8, 'about': 9}
Sequences = [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]
Because the words 'really' and 'like' were not encountered during fitting, they are replaced by the placeholder, which has index 1. So the test sequences become:
Test Sequence = [[5, 1, 3, 2], [7, 4, 1, 2]]
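Conceptually, converting a test sentence with a fitted word index is just a dictionary lookup that falls back to the placeholder index for unknown words. The helper below is a hypothetical sketch built on the word index above, not a Keras API:

import re

def to_sequence(sentence, word_index, oov_index=1):
    # any word missing from the index maps to the placeholder
    words = re.findall(r"[a-z']+", sentence.lower())
    return [word_index.get(w, oov_index) for w in words]

For example, to_sequence('i really love geeksforgeeks', word_index) yields [5, 1, 3, 2].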
Code: Implementation with TensorFlow
# import the tokenizer and padding utilities from Keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# the initial corpus of sentences or the training set
sentences = [
'I love geeksforgeeks',
'You love geeksforgeeks',
'What do you think about geeksforgeeks?'
]
# num_words keeps only the 100 most frequent words;
# the tokenizer also lowercases text and strips punctuation
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print("Word Index: ", word_index)
print("Sequences: ", sequences)
# define an out-of-vocabulary placeholder token; it receives index 1
tokenizer = Tokenizer(num_words=100, oov_token="placeholder")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print("\nSequences = ", sequences)
# the test data, containing words the tokenizer hasn't encountered
test_data = [
'i really love geeksforgeeks',
'Do you like geeksforgeeks'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)
Output:
Word Index: {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4, 'what': 5, 'do': 6, 'think': 7, 'about': 8}
Sequences: [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]
Sequences = [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]
Test Sequence = [[5, 1, 3, 2], [7, 4, 1, 2]]
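The pad_sequences import is not used in the snippet above, but it is the usual next step: a network expects inputs of equal length, so shorter sequences are padded with zeros. A small illustration, with maxlen=6 chosen arbitrarily:

# pad every sequence to length 6, appending zeros at the end
padded = pad_sequences(test_seq, maxlen=6, padding='post')
print("\nPadded Test Sequence = \n", padded)
# prints:
# [[5 1 3 2 0 0]
#  [7 4 1 2 0 0]]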