How WordPiece Tokenization Addresses the Rare Words Problem in NLP
One of the key challenges in Natural Language Processing (NLP) is handling words that a model has never seen before. Traditional methods often fail to address this effectively, making them unsuitable for modern applications. There are several problems with existing methods:
- Word-level tokenization creates massive vocabularies (500,000+ tokens)
- Out-of-vocabulary (OOV) words (rare or new words) break model predictions completely
- Character-level tokenization loses semantic meaning
- Models must learn word formation from scratch, which is inefficient
WordPiece tokenization offers a solution that has become the foundation of transformer models such as BERT, DistilBERT and ELECTRA. It strikes a balance between vocabulary size and semantic preservation.
Vocabulary Explosion and Rare Words
Consider processing the sentence "The bioengineering startup developed unbreakable materials." A traditional word-level tokenizer would need separate entries for "bioengineering", "startup", "unbreakable" and "materials". If any of these words weren't in the training vocabulary, the model would fail.
Key challenges in traditional tokenization:
- Vocabulary size keeps growing as the corpus grows, with no natural upper bound
- Technical terms and proper nouns create endless edge cases
- Memory requirements become prohibitive for large vocabularies
- Model training becomes computationally expensive
WordPiece solves this by breaking words into meaningful subunits. Instead of treating "unbreakable" as a single unknown token, it breaks it down into recognizable pieces: ["un", "##break", "##able"]. The "##" prefix marks a piece that continues the current word rather than starting a new one, preserving word boundaries while enabling flexible decomposition.
This approach ensures that even completely new words can be understood through their constituent parts, improving model robustness and generalization.
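The snippet below is a minimal sketch of this decomposition using the Hugging Face transformers library (the same bert-base-uncased tokenizer is loaded again in the implementation section further down; the exact splits depend on the pretrained vocabulary). It also shows how the "##" markers let the original word be reassembled losslessly.
Python
from transformers import BertTokenizer

# Load a pretrained WordPiece vocabulary (bert-base-uncased is one common choice)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Decompose a rare word into known subword pieces
pieces = tokenizer.tokenize("unbreakable")
print(pieces)  # expected: ['un', '##break', '##able'] with this vocabulary

# The '##' markers tell us which pieces to glue back onto the previous one
word = "".join(p[2:] if p.startswith("##") else p for p in pieces)
print(word)  # 'unbreakable'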
How WordPiece Tokenization Works
The algorithm follows a data-driven approach to build its vocabulary. It starts with individual characters and gradually merges pairs of symbols, at each step choosing the merge that most increases the likelihood of the training corpus, until it reaches a target vocabulary size (a toy sketch of this loop follows the steps below).
Algorithm steps:
- Initialize the vocabulary with all individual characters (word-internal characters carry the "##" prefix)
- Count the frequency of every adjacent symbol pair in the corpus
- Score each pair as its frequency divided by the product of its parts' frequencies and merge the highest-scoring pair into a single token
- Update the corpus with the new merged token
- Repeat until reaching desired vocabulary size (typically 30K-50K tokens)
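The following is a minimal, self-contained sketch of that training loop on a toy in-memory corpus. The function name toy_wordpiece_train, the corpus and the target vocabulary size are illustrative assumptions, not a production implementation; libraries such as Hugging Face tokenizers provide optimized trainers.
Python
from collections import Counter

def toy_wordpiece_train(words, target_vocab_size):
    """Toy WordPiece vocabulary building on a tiny corpus (illustration only)."""
    # Each word starts as characters; non-initial characters carry the '##' prefix
    corpus = {}
    for word, freq in Counter(words).items():
        symbols = (word[0],) + tuple("##" + ch for ch in word[1:])
        corpus[symbols] = freq
    vocab = {s for symbols in corpus for s in symbols}

    while len(vocab) < target_vocab_size:
        pair_freq, symbol_freq = Counter(), Counter()
        for symbols, freq in corpus.items():
            for s in symbols:
                symbol_freq[s] += freq
            for a, b in zip(symbols, symbols[1:]):
                pair_freq[(a, b)] += freq
        if not pair_freq:
            break  # every word is already a single token

        # WordPiece score: pair frequency / (frequency of first * frequency of second)
        a, b = max(pair_freq, key=lambda p: pair_freq[p] / (symbol_freq[p[0]] * symbol_freq[p[1]]))
        new_token = a + b[2:]  # drop the '##' of the second piece when gluing
        vocab.add(new_token)

        # Rewrite every word in the corpus using the merged symbol
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    merged.append(new_token)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return vocab

toy_corpus = ["low", "low", "lower", "lowest", "newer", "newest", "widest"]
print(sorted(toy_wordpiece_train(toy_corpus, target_vocab_size=18)))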
During actual tokenization, WordPiece uses a greedy longest-match strategy. For each word, it finds the longest possible subword that exists in its vocabulary, marks it as a token and repeats for the remaining characters (a minimal sketch of this loop follows the steps below).
Tokenization process:
- Start from the beginning of each word
- Find the longest matching subword in vocabulary
- Add it to the token list with appropriate prefix
- Move to the next unprocessed characters
- Continue until the entire word is processed
This statistical foundation ensures that common patterns naturally emerge as single tokens while rare combinations get broken into more familiar components.
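Here is a minimal sketch of that greedy matching loop over a small hand-written vocabulary. The mini-vocabulary and the helper name greedy_wordpiece are illustrative assumptions; real implementations such as BERT's WordPiece tokenizer work the same way against a vocabulary of roughly 30,000 entries.
Python
def greedy_wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it is found in the vocabulary
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matched: the whole word becomes [UNK]
        tokens.append(match)
        start = end
    return tokens

# Illustrative mini-vocabulary (a real BERT vocabulary has ~30,000 entries)
vocab = {"un", "break", "##break", "##able", "able", "smart", "##phone"}
print(greedy_wordpiece("unbreakable", vocab))  # ['un', '##break', '##able']
print(greedy_wordpiece("smartphone", vocab))   # ['smart', '##phone']
print(greedy_wordpiece("xyz", vocab))          # ['[UNK]']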
WordPiece Implementation
Let's implement basic WordPiece tokenization using the transformers library. This example focuses on core functionality without unnecessary complexity.
- Imports BertTokenizer and loads the pre-trained bert-base-uncased tokenizer, which applies WordPiece tokenization and lowercases input text.
- Defines a function simple_tokenize(text) to show how BERT tokenizes input into subwords and maps them to token IDs.
- Tokenizes the full sentence, prints the resulting WordPiece tokens and converts them into their corresponding vocabulary IDs.
- Splits the input into individual words and shows how each word is broken down into subword tokens by BERT.
- Tests the function on three sample sentences to illustrate handling of rare, compound and complex words, with clear separation between examples.
Python
from transformers import BertTokenizer

# Initialize the tokenizer with pre-trained BERT vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def simple_tokenize(text):
    """
    Basic WordPiece tokenization example
    """
    print(f"Original text: {text}")

    # Convert text to WordPiece tokens
    tokens = tokenizer.tokenize(text)
    print(f"WordPiece tokens: {tokens}")

    # Convert tokens to numerical IDs
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    print(f"Token IDs: {token_ids}")

    # Show how individual words break down
    words = text.split()
    print("\nWord breakdown:")
    for word in words:
        word_tokens = tokenizer.tokenize(word)
        print(f"  '{word}' → {word_tokens}")

# Test with different examples
test_sentences = [
    "Tokenization helps handle rare words.",
    "The unbreakable smartphone survived.",
    "Bioengineering revolutionizes manufacturing."
]

for sentence in test_sentences:
    simple_tokenize(sentence)
    print("-" * 40)
Output:
The console output lists the WordPiece tokens, their numerical IDs and the per-word breakdown for each test sentence.
Vocabulary Comparison Analysis
To understand WordPiece's efficiency, let's compare it with traditional word-level tokenization using a practical example.
- The function compares WordPiece tokenization (using BERT) with basic word-level tokenization on a list of input texts.
- It collects the total and unique tokens from both methods.
- WordPiece tokens are generated using tokenizer.tokenize(), while word-level tokens come from text.lower().split().
- It calculates a compression ratio as the total word count divided by the total WordPiece token count.
- The results show how WordPiece reduces vocabulary size by reusing subword units.
- Sample sentences are used to demonstrate and print the comparison.
Python
def compare_tokenization_methods(texts):
    # Collect all unique tokens for each method
    wordpiece_tokens = set()
    word_level_tokens = set()
    total_wordpiece_count = 0
    total_word_count = 0

    for text in texts:
        # WordPiece tokenization
        wp_tokens = tokenizer.tokenize(text)
        wordpiece_tokens.update(wp_tokens)
        total_wordpiece_count += len(wp_tokens)

        # Word-level tokenization
        words = text.lower().split()
        word_level_tokens.update(words)
        total_word_count += len(words)

    return {
        'unique_wordpiece_tokens': len(wordpiece_tokens),
        'unique_word_tokens': len(word_level_tokens),
        'total_wordpiece_count': total_wordpiece_count,
        'total_word_count': total_word_count,
        'compression_ratio': total_word_count / total_wordpiece_count
    }

# Test with sample texts
sample_texts = [
    "Machine learning algorithms process natural language effectively.",
    "Deep neural networks revolutionize artificial intelligence applications.",
    "Transformer architectures enable unprecedented language understanding.",
    "Biomedical researchers utilize computational linguistics for analysis."
]

results = compare_tokenization_methods(sample_texts)
print("Tokenization Comparison:")
print(f"Unique WordPiece tokens needed: {results['unique_wordpiece_tokens']}")
print(f"Unique word-level tokens needed: {results['unique_word_tokens']}")
print(f"Compression ratio: {results['compression_ratio']:.2f}")
Output:
The compression ratio shows how many whitespace-separated words correspond to each WordPiece token; a value below 1 simply means that rare words are split into several subwords. The vocabulary savings appear in the unique-token counts: while word-level tokenization might need 1000+ unique tokens for a technical corpus, WordPiece can achieve the same coverage with 300-400 reusable subword units.
Practical Limitations
WordPiece tokenization has limitations that we should understand before implementation.
1. Character handling issues:
- Unknown Unicode characters map to [UNK] tokens
- Information loss occurs with unsupported character sets
- Emoji and special symbols may not tokenize intuitively
- To mitigate this, ensure the tokenizer's training data covers the expected character ranges (a quick check is sketched below)
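A minimal sketch of that check, assuming the bert-base-uncased tokenizer loaded earlier (the exact behaviour depends on the vocabulary):
Python
# Symbols missing from the vocabulary typically collapse to [UNK], losing information
tokens = tokenizer.tokenize("great launch 🚀")
print(tokens)  # the emoji is expected to surface as '[UNK]' with bert-base-uncased

# Decoding the IDs cannot recover what the [UNK] originally stood for
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokenizer.decode(ids))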
2. Language-specific challenges:
- Struggles with languages lacking clear word boundaries (Chinese, Japanese)
- Morphologically rich languages may over-segment
- Agglutinative languages, which form words by combining many smaller units, often produce very long token sequences
3. Vocabulary limitations:
- Fixed vocabulary cannot adapt to new domains without retraining
- Specialized terminology often fragments into many short, uninformative subwords
- Domain shift can significantly degrade performance
- Medical/legal texts often require domain-specific vocabularies (a minimal training sketch follows this list)
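When a fixed vocabulary is the bottleneck, one option is to train a fresh WordPiece vocabulary on in-domain text. Below is a minimal sketch using the Hugging Face tokenizers library; the tiny in-memory medical_corpus and the small vocab_size are placeholders for a real domain corpus and a realistic vocabulary size.
Python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Placeholder corpus: in practice this would be a large in-domain text collection
medical_corpus = [
    "the patient presented with acute bronchospasm",
    "electrocardiogram results showed atrial fibrillation",
    "administered intravenous corticosteroids post-operatively",
]

# Build and train a WordPiece model from scratch on the domain text
wp_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
wp_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=1000,  # small value for the toy corpus; 30K-50K is typical in practice
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
wp_tokenizer.train_from_iterator(medical_corpus, trainer)

# Domain terms now split into subwords learned from the domain itself
print(wp_tokenizer.encode("bronchospasm").tokens)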
WordPiece tokenization has enabled the success of modern transformer models by balancing vocabulary coverage with computational efficiency. Its approach ensures that language patterns emerge naturally while maintaining robustness against unknown words.