Tokenization is a fundamental step in Natural Language Processing (NLP). It involves dividing a textual input into smaller units known as tokens. These tokens can be words, characters, sub-words, or sentences. Tokenization makes text easier for models to interpret and process. Let's understand how tokenization works.
What is Tokenization in NLP?
Natural Language Processing (NLP) is a subfield of Artificial Intelligence, information engineering, and human-computer interaction. It focuses on how to process and analyze large amounts of natural language data efficiently. This is difficult because reading and understanding language is far more complex than it seems at first glance.
- Tokenization is a foundational step in the NLP pipeline that shapes the entire workflow.
- It involves dividing a string or text into a list of smaller units known as tokens.
- A tokenizer segments unstructured data and natural language text into distinct chunks of information, each treated as a separate element.
- Tokens: words or sub-words in the context of natural language processing. Example: a word is a token in a sentence, a character is a token in a word, etc.
- Applications: multiple NLP tasks, including text processing, language modelling, and machine translation.
Types of Tokenization
Tokenization can be classified into several types based on how the text is segmented:
1. Word Tokenization
Word tokenization is the most commonly used method where text is divided into individual words. It works well for languages with clear word boundaries, like English. For example, "Machine learning is fascinating" becomes:
Input before tokenization: ["Machine learning is fascinating"]
Output when tokenized by words: ["Machine", "learning", "is", "fascinating"]
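As a minimal sketch, whitespace splitting approximates word tokenization for English; it ignores punctuation handling, which the dedicated NLTK tokenizers shown later in this article address explicitly:

Python
text = "Machine learning is fascinating"

# Naive word tokenization: split on whitespace
text.split()  # ['Machine', 'learning', 'is', 'fascinating']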
2. Character Tokenization
In character tokenization, the text is split into a sequence of individual characters. This is beneficial for tasks that require detailed analysis, such as spelling correction, and for languages with unclear word boundaries. It is also useful for character-level language modelling.
Example
Input before tokenization: ["You are helpful"]
Output when tokenized by characters: ["Y", "o", "u", " ", "a", "r", "e", " ", "h", "e", "l", "p", "f", "u", "l"]
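In plain Python, converting a string to a list is a minimal sketch of character tokenization:

Python
text = "You are helpful"

# Every character, including spaces, becomes a token
list(text)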
3. Sub-word Tokenization
This strikes a balance between word and character tokenization by breaking down text into units that are larger than a single character but smaller than a full word. This is useful when dealing with morphologically rich languages or rare words.
Example
["Time", "table"]
["Rain", "coat"]
["Grace", "fully"]
["Run", "way"]
Sub-word tokenization helps handle out-of-vocabulary words in NLP tasks and suits languages that form words by combining smaller units.
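As a toy sketch of the idea, here is a greedy longest-match split against a small hand-written vocabulary. The vocabulary below is hypothetical; real sub-word tokenizers such as BPE or WordPiece learn theirs from a corpus:

Python
# Hypothetical sub-word vocabulary; real tokenizers learn this from data
vocab = {"rain", "coat", "grace", "fully", "run", "way", "time", "table"}

def subword_tokenize(word, vocab):
    """Greedily match the longest known sub-word at each position."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j].lower() in vocab:
                tokens.append(word[i:j].lower())
                i = j
                break
        else:  # no known piece: fall back to a single character
            tokens.append(word[i])
            i += 1
    return tokens

subword_tokenize("Gracefully", vocab)  # ['grace', 'fully']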
4. Sentence Tokenization
Sentence tokenization divides a paragraph or a larger body of text into individual sentences, each treated as a token. This is useful for tasks requiring individual sentence analysis or processing.
Input before tokenization: ["Artificial Intelligence is an emerging technology. Machine learning is fascinating. Computer Vision handles images."]
Output when tokenized by sentences: ["Artificial Intelligence is an emerging technology.", "Machine learning is fascinating.", "Computer Vision handles images."]
5. N-gram Tokenization
N-gram tokenization splits text into overlapping sequences of n consecutive tokens. For example, with n = 2 (bigrams), each pair of adjacent words forms one token.
Input before tokenization: ["Machine learning is powerful"]
Output when tokenized by bigrams: [('Machine', 'learning'), ('learning', 'is'), ('is', 'powerful')]
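A minimal sketch using NLTK's ngrams helper (a zip over shifted token lists works equally well):

Python
from nltk.util import ngrams

tokens = "Machine learning is powerful".split()

# Bigrams: overlapping pairs of consecutive tokens (n = 2)
list(ngrams(tokens, 2))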
Need for Tokenization
Tokenization is an essential step in text processing and natural language processing (NLP) for several reasons. Some of these are listed below:
- Effective Text Processing: Reduces the size of raw text, resulting in easy and efficient statistical and computational analysis.
- Feature extraction: Text data can be represented numerically for algorithmic comprehension by using tokens as features in ML models.
- Information Retrieval: Tokenization is essential for indexing and searching in systems that store and retrieve information efficiently based on words or phrases.
- Text Analysis: Used in sentiment analysis and named entity recognition, to determine the function and context of individual words in a sentence.
- Vocabulary Management: Generates a list of distinct tokens, which helps manage a corpus's vocabulary.
- Task-Specific Adaptation: Adapts to the needs of a particular NLP task, which is useful for summarization and machine translation.
Implementation of Tokenization
Sentence Tokenization using sent_tokenize
The code snippet below uses the sent_tokenize function from the NLTK library to segment a given text into a list of sentences.
Python
import nltk
nltk.download('punkt')  # one-time download of the Punkt models
from nltk.tokenize import sent_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article."
sent_tokenize(text)
Output:
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article.']
How sent_tokenize works: The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance comes pre-trained, so it already knows which characters and punctuation mark the beginnings and ends of sentences.
Sentence Tokenization using PunktSentenceTokenizer
We can also use PunktSentenceTokenizer from the NLTK library directly. The Punkt tokenizer is a data-driven sentence tokenizer that ships with NLTK; it is trained on a large corpus of text to identify sentence boundaries.
Python
import nltk.data

# Load the pre-trained English Punkt sentence tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')

text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article."
tokenizer.tokenize(text)
Output:
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article.']
Tokenizing Sentences in Different Languages
Sentences in languages other than English can be tokenized by loading the corresponding language-specific pickle file.
- In the following code snippet, we use the NLTK library to tokenize a Spanish text into sentences using the pre-trained Punkt tokenizer for Spanish.
- The Punkt tokenizer is a data-driven, ML-based tokenizer for identifying sentence boundaries.
Python
import nltk.data

# Load the pre-trained Spanish Punkt sentence tokenizer
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)
Output:
['Hola amigo.',
'Estoy bien.']
Word Tokenization using word_tokenize
The code snippet below uses the word_tokenize function from the NLTK library to tokenize a given text into individual words.
- The word_tokenize function is helpful for breaking down a sentence or text into its constituent words.
- Eases analysis or processing at the word level in natural language processing tasks.
Python
from nltk.tokenize import word_tokenize
text = "Hello everyone. Welcome to GeeksforGeeks."
word_tokenize(text)
Output:
['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']
How word_tokenize works: The word_tokenize() function is a wrapper that calls tokenize() on an instance of the TreebankWordTokenizer class.
Word Tokenization Using TreebankWordTokenizer
The code snippet uses the TreebankWordTokenizer from the Natural Language Toolkit (NLTK) to tokenize a given text into individual words.
Python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
text = "Hello everyone. Welcome to GeeksforGeeks."
tokenizer.tokenize(text)
Output:
['Hello', 'everyone.', 'Welcome', 'to', 'GeeksforGeeks', '.']
These tokenizers work by separating words using punctuation and spaces. As the outputs above show, punctuation is not discarded, which lets the user decide how to handle it during pre-processing.
Word Tokenization using WordPunctTokenizer
The WordPunctTokenizer is one of the NLTK tokenizers that splits words based on punctuation boundaries. Each punctuation mark is treated as a separate token.
Python
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Let's see how it's working.")
Output:
['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']
Word Tokenization using Regular Expression
The code snippet uses the RegexpTokenizer from the Natural Language Toolkit (NLTK) to tokenize a given text based on a regular expression pattern.
Python
from nltk.tokenize import RegexpTokenizer

# \w+ matches runs of word characters, so punctuation is dropped
tokenizer = RegexpTokenizer(r'\w+')
text = "Let's see how it's working."
tokenizer.tokenize(text)
Output:
['Let', 's', 'see', 'how', 'it', 's', 'working']
Using regular expressions allows for more fine-grained control over tokenization, and you can customize the pattern based on your specific requirements.
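For example, one illustrative variation keeps contractions such as "it's" together by matching a word-apostrophe-word segment before plain words (the pattern below is just one possible choice):

Python
from nltk.tokenize import RegexpTokenizer

# Try word+apostrophe+word first, then plain words
tokenizer = RegexpTokenizer(r"\w+'\w+|\w+")
tokenizer.tokenize("Let's see how it's working.")
# ["Let's", 'see', 'how', "it's", 'working']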
More Techniques for Tokenization
We have seen how to perform tokenization using the NLTK library. Tokenization can also be implemented using the following methods and libraries:
- spaCy: spaCy is an NLP library that provides robust tokenization capabilities (see the sketch after this list).
- BERT tokenizer: BERT uses the WordPiece tokenizer, a type of sub-word tokenizer, for tokenizing input text (see the sketch after this list).
- Byte-Pair Encoding: Byte-Pair Encoding (BPE) is a data compression algorithm that has also found applications in natural language processing, specifically for tokenization. It is a sub-word tokenization technique that works by iteratively merging the most frequent pairs of consecutive bytes (or characters) in a given corpus.
- SentencePiece: SentencePiece is another sub-word tokenization algorithm commonly used for natural language processing tasks. It is designed to be language-agnostic and works by iteratively merging frequent sequences of characters or sub-words in a given corpus.
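As a minimal sketch of tokenization with spaCy (assuming spaCy is installed; a blank pipeline is enough for tokenization alone):

Python
import spacy

# A blank English pipeline includes only the tokenizer
nlp = spacy.blank("en")
doc = nlp("Let's see how it's working.")
[token.text for token in doc]

And a sketch of BERT's WordPiece tokenizer via the Hugging Face transformers library (assuming it is installed; bert-base-uncased is a publicly available checkpoint). Words outside the vocabulary are split into sub-word pieces prefixed with "##":

Python
from transformers import AutoTokenizer

# Load the WordPiece tokenizer used by BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("Tokenization is fascinating")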
Limitations of Tokenization
- Tokens alone do not capture the meaning of a sentence, which can lead to ambiguity.
- Languages such as Chinese, Japanese, and Arabic lack distinct spaces between words; the absence of clear boundaries complicates tokenization.
- It is hard to decide how to tokenize text that spans multiple words or symbols, such as email addresses, URLs, and special symbols, as the example below shows.
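For instance, a standard word tokenizer typically breaks a URL into several pieces rather than keeping it as a single token (exact behaviour may vary by NLTK version):

Python
from nltk.tokenize import word_tokenize

# The URL is split on punctuation instead of staying one token
word_tokenize("Visit https://www.geeksforgeeks.org for NLP articles.")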