
How WordPiece Tokenization Addresses the Rare Words Problem in NLP

Last Updated : 23 Jul, 2025

One of the key challenges in Natural Language Processing is handling words that a model has never seen before. Traditional methods often fail to address this effectively, making them unsuitable for modern applications. There are several problems with existing methods:

  • Word-level tokenization creates massive vocabularies (500,000+ tokens)
  • Out-of-vocabulary (OOV) words (rare or new words) break model predictions completely
  • Character-level tokenization loses semantic meaning
  • Models must learn word formation from scratch, which is inefficient

WordPiece tokenization offers a solution that has become the foundation of transformer models such as BERT and its derivatives (DistilBERT, ELECTRA). It strikes a balance between vocabulary size and semantic preservation.

Vocabulary Explosion and Rare Words

Consider processing the sentence "The bioengineering startup developed unbreakable materials." A traditional word-level tokenizer would need separate entries for "bioengineering", "startup", "unbreakable" and "materials". If any of these words weren't in the training vocabulary, the model would fail.

Key challenges in traditional tokenization:

  • Vocabulary grows rapidly and without bound as the corpus grows
  • Technical terms and proper nouns create endless edge cases
  • Memory requirements become prohibitive for large vocabularies
  • Model training becomes computationally expensive

WordPiece solves this by breaking words into meaningful subunits. Instead of treating "unbreakable" as a single unknown token, it breaks it down into recognizable pieces: ["un", "##break", "##able"]. The "##" prefix indicates that a token continues the previous piece, preserving word boundaries while enabling flexible decomposition.

This approach ensures that even completely new words can be understood through their constituent parts, which improves model robustness and generalization.
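
As a quick illustration of the "##" convention, the small sketch below (not part of any library API) reassembles WordPiece tokens into whole words by stripping the continuation marker:

Python
def detokenize(pieces):
    """Rejoin WordPiece tokens into words by stripping the '##' continuation marker."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]   # continue the previous word
        else:
            words.append(piece)      # start a new word
    return " ".join(words)

print(detokenize(["un", "##break", "##able", "materials"]))  # unbreakable materials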

How WordPiece Tokenization Works

The algorithm follows a data-driven approach to build its vocabulary. It starts with individual characters and repeatedly merges the pair of symbols that most increases the likelihood of the training corpus, until it reaches a target vocabulary size.

Algorithm steps:

  • Initialize the vocabulary with all individual characters
  • For every adjacent symbol pair in the corpus, compute a score: the pair's frequency divided by the product of the frequencies of its two parts
  • Merge the highest-scoring pair into a single token (this scoring rule is what separates WordPiece from BPE, which merges the most frequent pair outright)
  • Update the corpus with the new merged token
  • Repeat until reaching the desired vocabulary size (typically 30K-50K tokens)
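
A simplified sketch of this training loop on a toy corpus is shown below. The helper function train_wordpiece and the hand-made word list are illustrative assumptions, not BERT's actual training code.

Python
from collections import Counter, defaultdict

def train_wordpiece(words, vocab_size):
    """Toy WordPiece vocabulary trainer (illustrative sketch only)."""
    word_freq = Counter(words)
    # Split each word into characters; non-initial characters get the '##' prefix
    splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_freq}
    vocab = {p for pieces in splits.values() for p in pieces}

    while len(vocab) < vocab_size:
        pair_freq, piece_freq = defaultdict(int), defaultdict(int)
        for w, pieces in splits.items():
            f = word_freq[w]
            for p in pieces:
                piece_freq[p] += f
            for a, b in zip(pieces, pieces[1:]):
                pair_freq[(a, b)] += f
        if not pair_freq:          # every word is already a single piece
            break
        # WordPiece score: pair frequency / (frequency of first part * frequency of second part)
        a, b = max(pair_freq, key=lambda p: pair_freq[p] / (piece_freq[p[0]] * piece_freq[p[1]]))
        new_piece = a + b[2:]      # drop b's '##' marker when gluing the two pieces together
        vocab.add(new_piece)
        # Re-segment every word with the newly merged piece
        for w, pieces in splits.items():
            merged, i = [], 0
            while i < len(pieces):
                if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == (a, b):
                    merged.append(new_piece)
                    i += 2
                else:
                    merged.append(pieces[i])
                    i += 1
            splits[w] = merged
    return vocab

corpus = ["unbreakable", "breakable", "unable", "break", "able", "able"]
print(sorted(train_wordpiece(corpus, vocab_size=20)))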

During actual tokenization, WordPiece uses a greedy longest-match strategy. For each word, it finds the longest prefix of the remaining characters that exists in its vocabulary, emits it as a token and repeats on what is left.

Tokenization process:

  • Start from the beginning of each word
  • Find the longest matching subword in vocabulary
  • Add it to the token list with appropriate prefix
  • Move to the next unprocessed characters
  • Continue until the entire word is processed

This statistical foundation ensures that common patterns naturally emerge as single tokens while rare combinations get broken into more familiar components.
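
Below is a minimal sketch of the greedy longest-match step, using a plain Python set as a stand-in for a real vocabulary (the function name and toy vocabulary are illustrative assumptions):

Python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily match the longest vocabulary piece at each position (toy sketch)."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Shrink the candidate substring until it appears in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # non-initial pieces carry the '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]         # the whole word maps to [UNK] if any span fails
        tokens.append(match)
        start = end
    return tokens

toy_vocab = {"un", "##break", "##able", "token", "##ization"}
print(wordpiece_tokenize("unbreakable", toy_vocab))   # ['un', '##break', '##able']
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']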

WordPiece Implementation

Let's implement basic WordPiece tokenization using the transformers library. This example focuses on core functionality without unnecessary complexity.

  • Imports BertTokenizer and loads the pre-trained bert-base-uncased tokenizer, which applies WordPiece tokenization and lowercases input text.
  • Defines a function simple_tokenize(text) to show how BERT tokenizes input into subwords and maps them to token IDs.
  • Tokenizes the full sentence, prints the resulting WordPiece tokens and converts them into their corresponding vocabulary IDs.
  • Splits the input into individual words and shows how each word is broken down into subword tokens by BERT.
  • Tests the function on three sample sentences to illustrate handling of rare, compound and complex words, with clear separation between examples.
Python
from transformers import BertTokenizer

# Initialize the tokenizer with pre-trained BERT vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def simple_tokenize(text):
    """
    Basic WordPiece tokenization example
    """
    print(f"Original text: {text}")
    
    # Convert text to WordPiece tokens
    tokens = tokenizer.tokenize(text)
    print(f"WordPiece tokens: {tokens}")
    
    # Convert tokens to numerical IDs
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    print(f"Token IDs: {token_ids}")
    
    # Show how individual words break down
    words = text.split()
    print("\nWord breakdown:")
    for word in words:
        word_tokens = tokenizer.tokenize(word)
        print(f"  '{word}' → {word_tokens}")

# Test with different examples
test_sentences = [
    "Tokenization helps handle rare words.",
    "The unbreakable smartphone survived.",
    "Bioengineering revolutionizes manufacturing."
]

for sentence in test_sentences:
    simple_tokenize(sentence)
    print("-" * 40)

Output:

[Output screenshot: WordPiece tokens, token IDs and the per-word breakdown for each test sentence]

Vocabulary Comparison Analysis

To understand WordPiece's efficiency, let's compare it with traditional word-level tokenization using a practical example.

  • The function compares WordPiece tokenization (using BERT) with basic word-level tokenization on a list of input texts.
  • It collects the total and unique tokens from both methods.
  • WordPiece tokens are generated using tokenizer.tokenize(), while word-level tokens come from text.lower().split().
  • It calculates a compression ratio as the total word count divided by the total WordPiece count.
  • The results show how WordPiece reduces vocabulary size by reusing subword units.
  • Sample sentences are used to demonstrate and print the comparison.
Python
def compare_tokenization_methods(texts):
    """Compare WordPiece and word-level tokenization over a list of texts."""
    # Collect all unique tokens for each method
    wordpiece_tokens = set()
    word_level_tokens = set()
    
    total_wordpiece_count = 0
    total_word_count = 0
    
    for text in texts:
        # WordPiece tokenization
        wp_tokens = tokenizer.tokenize(text)
        wordpiece_tokens.update(wp_tokens)
        total_wordpiece_count += len(wp_tokens)
        
        # Word-level tokenization
        words = text.lower().split()
        word_level_tokens.update(words)
        total_word_count += len(words)
    
    return {
        'unique_wordpiece_tokens': len(wordpiece_tokens),
        'unique_word_tokens': len(word_level_tokens),
        'total_wordpiece_count': total_wordpiece_count,
        'total_word_count': total_word_count,
        'compression_ratio': total_word_count / total_wordpiece_count
    }

# Test with sample texts
sample_texts = [
    "Machine learning algorithms process natural language effectively.",
    "Deep neural networks revolutionize artificial intelligence applications.", 
    "Transformer architectures enable unprecedented language understanding.",
    "Biomedical researchers utilize computational linguistics for analysis."
]

results = compare_tokenization_methods(sample_texts)
print("Tokenization Comparison:")
print(f"Unique WordPiece tokens needed: {results['unique_wordpiece_tokens']}")
print(f"Unique word-level tokens needed: {results['unique_word_tokens']}")
print(f"Compression ratio: {results['compression_ratio']:.2f}")

Output:

[Output screenshot: unique token counts for WordPiece vs. word-level tokenization and the compression ratio]

The compression ratio here is the number of whitespace words divided by the number of WordPiece tokens; it typically falls below 1 because WordPiece emits more (but shorter and heavily reused) tokens. The real saving shows up in vocabulary size: on a larger technical corpus, word-level tokenization can require thousands of unique tokens, whereas WordPiece typically covers the same text with a few hundred reusable subword units.

Practical Limitations

WordPiece tokenization has limitations that we should understand before implementation.

1. Character handling issues:

  • Unknown Unicode characters map to [UNK] tokens
  • Information loss occurs with unsupported character sets
  • Emoji and special symbols may not tokenize intuitively
  • To mitigate this, ensure the tokenizer's training data covers the expected character ranges
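
A quick way to see this fallback, reusing the bert-base-uncased tokenizer from the earlier example (the exact behavior depends on the vocabulary in use):

Python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Characters the vocabulary cannot represent typically fall back to [UNK]
print(tokenizer.tokenize("launch day 🚀"))
# Expected with bert-base-uncased: ['launch', 'day', '[UNK]']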

2. Language-specific challenges:

  • Struggles with languages lacking clear word boundaries (Chinese, Japanese)
  • Morphologically rich languages may over-segment
  • Agglutinative languages (e.g. Turkish, Finnish), which form words by combining many smaller units, often produce long token sequences

3. Vocabulary limitations:

  • Fixed vocabulary cannot adapt to new domains without retraining
  • Specialized terminology often fragments into many subword pieces (or maps to [UNK] when it contains unseen characters)
  • Domain shift can significantly degrade performance
  • Medical/legal texts often require domain-specific vocabularies
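
One way to gauge the impact on a particular domain is simply to count how many pieces its key terms break into. The terms below are arbitrary examples, and the snippet reuses the bert-base-uncased tokenizer loaded earlier:

Python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Domain-specific terms tend to fragment into many pieces under a general-purpose vocabulary
for term in ["electroencephalography", "immunohistochemistry", "habeas corpus"]:
    pieces = tokenizer.tokenize(term)
    print(f"{term}: {len(pieces)} pieces → {pieces}")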

WordPiece tokenization has enabled the success of modern transformer models by balancing vocabulary coverage with computational efficiency. Its approach ensures that language patterns emerge naturally while maintaining robustness against unknown words.
