Discounting Techniques in Language Models
Last Updated: 13 Aug, 2024
Language models are essential tools in natural language processing (NLP), responsible for predicting the next word in a sequence based on the words that precede it. A common challenge in building language models, particularly n-gram models, is the estimation of probabilities for word sequences that were not observed during training. Discounting techniques are critical in addressing this issue by adjusting the probabilities of observed word sequences and redistributing the remaining probability mass to account for unseen sequences.
This article delves into the various discounting techniques employed in language models, their mathematical underpinnings, and their importance in improving model accuracy.
Understanding the Need for Discounting
In n-gram models, the probability of a word w_n given its preceding words (context) is typically estimated based on frequency counts from a corpus. However, many word sequences might not appear in the training data, leading to zero probabilities for these unseen n-grams. Discounting methods adjust the probability of observed n-grams downward, freeing up probability mass that can be reassigned to unseen n-grams, thereby preventing the model from assigning zero probability to any sequence.
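To see the problem concretely, here is a minimal maximum-likelihood bigram estimate on a toy sentence (the corpus and the helper function name are purely illustrative, not part of any library):
Python
from collections import Counter

# Toy example: maximum-likelihood bigram estimates with no smoothing
tokens = "the cat sat on the mat".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def mle_bigram_prob(prev_word, word):
    # c(prev_word, word) / c(prev_word)
    return bigrams[(prev_word, word)] / unigrams[prev_word]

print(mle_bigram_prob("the", "cat"))  # 0.5  (observed bigram)
print(mle_bigram_prob("the", "dog"))  # 0.0  (unseen bigram gets zero probability)
Any sentence containing a single unseen bigram would be assigned zero probability by this estimator, which is exactly what discounting is designed to prevent.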
Common Discounting Techniques
1. Absolute Discounting
Absolute discounting is a straightforward approach where a fixed discount D is subtracted from the count of each observed n-gram. The remaining probability mass is then redistributed to account for unseen n-grams.
- Formula: P(w_n | w_{n-1}, w_{n-2}, \dots, w_{n-k+1}) = \frac{\max(c(w_{n-k+1}, \dots, w_{n-1}, w_n) - D, 0)}{c(w_{n-k+1}, \dots, w_{n-1})} + \lambda(w_{n-1}, \dots, w_{n-k+1}) P(w_n | w_{n-1}, w_{n-2}, \dots, w_{n-k+2}), where D is the fixed discount (typically 0 < D < 1) and λ is a context-dependent normalization factor that ensures the probabilities sum to one (a short worked example follows this list).
- Explanation: This method ensures that even if an n-gram has been observed, it is slightly discounted to make room for unseen n-grams, which are assigned a small probability.
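As a quick worked example with illustrative counts: if a context has been seen 8 times, a particular n-gram ending in w_n has been seen 3 times, and D = 0.75, the first term is \frac{\max(3 - 0.75, 0)}{8} \approx 0.281. The 0.75 of count removed here, together with the discounts taken from the context's other continuations, is the mass that the λ-weighted lower-order model spreads over unseen continuations.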
2. Good-Turing Discounting
Good-Turing discounting is based on the principle of frequency of frequencies. It estimates the probability of an n-gram by considering the likelihood of encountering unseen events, derived from how often n-grams with similar frequencies occur.
- Formula: P_{GT}(r) = \frac{(r+1) \cdot N_{r+1}}{N_r \cdot N} where r is the number of times an n-gram occurs, N_r is the number of distinct n-grams that occur exactly r times in the corpus, and N is the total number of n-gram tokens observed (see the worked example after this list).
- Explanation: Good-Turing discounting effectively smooths the probabilities by lowering the estimate for observed n-grams with low counts and reallocating that probability to unseen or rarely seen n-grams.
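For instance, with illustrative counts N_1 = 3 and N_2 = 2, an n-gram seen once is re-estimated to an adjusted count of r^* = \frac{(1+1) \cdot N_2}{N_1} = \frac{4}{3} \approx 1.33, and the total probability mass reserved for unseen n-grams is N_1 / N, the proportion of the corpus made up of n-grams seen exactly once.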
3. Kneser-Ney Smoothing
Kneser-Ney smoothing is one of the most sophisticated and widely used discounting techniques. It extends absolute discounting by adjusting not only the observed n-gram probabilities but also the back-off distribution.
- Formula: P_{KN}(w_n | w_{n-1}, \dots, w_{n-k+1}) = \frac{\max(c(w_{n-k+1}, \dots, w_{n-1}, w_n) - D, 0)}{c(w_{n-k+1}, \dots, w_{n-1})} + \lambda(w_{n-1}, \dots, w_{n-k+1}) P_{KN}(w_n | w_{n-1}, \dots, w_{n-k+2}), where the recursion bottoms out in a continuation probability P_{cont}(w_n) that is proportional to the number of distinct contexts w_n completes, rather than its raw frequency.
- Explanation: Kneser-Ney is unique in that it takes into account the diversity of contexts in which a word appears. It effectively balances the probability estimates across both frequent and rare n-grams, making it particularly effective for handling rare events.
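A standard illustration of this context-diversity idea: a word like "Francisco" can be frequent in a corpus yet appear almost exclusively after "San". A plain unigram back-off would happily predict it after any context, whereas the Kneser-Ney continuation probability (in the bigram case) P_{cont}(w) = \frac{|\{w' : c(w', w) > 0\}|}{|\{(w', w'') : c(w', w'') > 0\}|} stays small, because "Francisco" completes very few distinct bigram types.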
Implementation of Discounting Techniques
1. Absolute Discounting
This method reduces the count of each observed n-gram by a fixed discount value; the leftover probability mass is what would be redistributed to unseen n-grams through a back-off distribution, which is omitted in the simplified implementation below.
Python
from collections import defaultdict

def train_ngram_model(corpus, n):
    model = defaultdict(lambda: defaultdict(lambda: 0))
    # Count (context, next word) occurrences
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n - 1])   # the (n-1)-word context
            next_word = tokens[i + n - 1]
            model[ngram][next_word] += 1
    return model

def absolute_discounting(model, ngram, next_word, discount=0.75):
    # Discounted probability of a single continuation (back-off term omitted)
    total_count = sum(model[ngram].values())
    if total_count == 0:
        return 0.0
    return max(model[ngram][next_word] - discount, 0) / total_count

def generate_probability_distribution(model, discount=0.75):
    probabilities = defaultdict(lambda: defaultdict(lambda: 0))
    for ngram in model:
        total_count = sum(model[ngram].values())
        for next_word in model[ngram]:
            # Subtract the fixed discount from every observed count; the freed
            # mass is what a full implementation would pass to a back-off model
            probabilities[ngram][next_word] = max(model[ngram][next_word] - discount, 0) / total_count
    return probabilities
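Assuming the bigram model trained in the Example Usage section further below, the absolute_discounting helper can also be queried for a single continuation:
Python
# P(cat | the) with a fixed discount of 0.75 on the toy corpus below
print(absolute_discounting(model, ('the',), 'cat'))  # roughly 0.28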
2. Good-Turing Discounting
This method adjusts the counts based on how often n-grams with the same frequency appear. It's useful for handling rare events.
Python
from collections import defaultdict

def good_turing_discounting(model):
    freq_of_freqs = defaultdict(lambda: 0)
    total_ngrams = 0
    # Count the frequency of frequencies: how many n-grams occur r times
    for ngram in model:
        for count in model[ngram].values():
            freq_of_freqs[count] += 1
            total_ngrams += count
    # Re-estimate each count as (r + 1) * N_{r+1} / N_r and normalize
    probabilities = defaultdict(lambda: defaultdict(lambda: 0))
    for ngram in model:
        for next_word, count in model[ngram].items():
            # If no n-gram occurs (count + 1) times, N_{r+1} is 0 and the estimate
            # collapses to 0; practical variants smooth the N_r values to avoid this
            discounted_count = (count + 1) * (freq_of_freqs[count + 1] / freq_of_freqs[count])
            probabilities[ngram][next_word] = discounted_count / total_ngrams
    return probabilities
3. Kneser-Ney Smoothing
This advanced technique considers the diversity of contexts in which a word appears and is particularly effective for handling infrequent n-grams.
Python
def kneser_ney_smoothing(model, discount=0.75):
    continuation_counts = defaultdict(lambda: 0)
    probabilities = defaultdict(lambda: defaultdict(lambda: 0))
    # Continuation counts: in how many distinct contexts does each word appear?
    for ngram in model:
        for next_word in model[ngram]:
            continuation_counts[next_word] += 1
    total_types = sum(continuation_counts.values())  # number of distinct n-gram types
    # Interpolate the discounted estimate with the continuation probability
    for ngram in model:
        total_count = sum(model[ngram].values())
        for next_word in model[ngram]:
            continuation_prob = continuation_counts[next_word] / total_types
            discounted_count = max(model[ngram][next_word] - discount, 0)
            # Simplified interpolation weight; full Kneser-Ney scales the discount
            # by the number of distinct continuations observed for this context
            probabilities[ngram][next_word] = (
                discounted_count / total_count
                + (discount / total_count) * continuation_prob
            )
    return probabilities
Example Usage
Here's how you can use these functions with a simple corpus:
Python
corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "the cat with the hat",
    "the hat sat on the mat"
]
n = 2  # Bigram model

model = train_ngram_model(corpus, n)

# Absolute Discounting
abs_discount_probs = generate_probability_distribution(model)

# Good-Turing Discounting
gt_discount_probs = good_turing_discounting(model)

# Kneser-Ney Smoothing
kn_smooth_probs = kneser_ney_smoothing(model)

def convert_defaultdict_to_dict(d):
    if isinstance(d, defaultdict):
        d = {k: convert_defaultdict_to_dict(v) for k, v in d.items()}
    return d

def pretty_print_model(model):
    for ngram, next_words in model.items():
        print(f"{ngram}:")
        for word, prob in next_words.items():
            print(f" {word}: {prob:.4f}")
        print()

# Convert defaultdicts to normal dicts
abs_discount_dict = convert_defaultdict_to_dict(abs_discount_probs)
gt_discount_dict = convert_defaultdict_to_dict(gt_discount_probs)
kn_smooth_dict = convert_defaultdict_to_dict(kn_smooth_probs)

# Pretty print the models
print("Absolute Discounting:")
pretty_print_model(abs_discount_dict)

print("Good-Turing Discounting:")
pretty_print_model(gt_discount_dict)

print("Kneser-Ney Smoothing:")
pretty_print_model(kn_smooth_dict)
Output:
Absolute Discounting:
('the',):
cat: 0.2812
mat: 0.2812
hat: 0.1562
('cat',):
sat: 0.4167
with: 0.0833
('sat',):
on: 0.7500
('on',):
the: 0.7500
('with',):
the: 0.2500
('hat',):
sat: 0.2500
Good-Turing Discounting:
('the',):
cat: 0.0000
mat: 0.0000
hat: 0.3158
('cat',):
sat: 0.3158
with: 0.0702
('sat',):
on: 0.0000
('on',):
the: 0.0000
('with',):
the: 0.0702
('hat',):
sat: 0.0702
Kneser-Ney Smoothing:
('the',):
cat: 0.2917
mat: 0.2917
hat: 0.1667
('cat',):
sat: 0.4722
with: 0.1111
('sat',):
on: 0.7778
('on',):
the: 0.8056
('with',):
the: 0.4167
('hat',):
sat: 0.4167
The output shows the probability distribution for different n-grams in the corpus under three discounting techniques:
- Absolute Discounting: Every observed n-gram has its probability slightly reduced. For example, P(cat | the) is 0.2812 and P(hat | the) is 0.1562. The mass removed by the discount is what a full implementation would hand to unseen continuations so that none of them ends up with zero probability.
- Good-Turing Discounting: The most frequent bigrams, such as "the cat" (seen three times), come out as 0.0000 because no bigram in this tiny corpus occurs four times, so N_{r+1} = 0 and the simple estimate collapses; practical implementations smooth the N_r counts to avoid this. Less frequent bigrams such as "the hat" (seen twice) receive 0.3158, showing how the method shifts probability mass toward rarer events.
- Kneser-Ney Smoothing: Produces more balanced estimates that favor words appearing in diverse contexts. For example, P(cat | the) is 0.2917 and P(on | sat) is 0.7778, reflecting both raw frequency and context diversity.
Overall, Kneser-Ney Smoothing tends to produce more robust and context-aware probabilities, while Absolute Discounting and Good-Turing provide simpler but less nuanced adjustments.
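One caveat worth checking: these are simplified teaching implementations, so the probabilities for a given context do not sum to one; the mass removed by the discount is never actually handed to a back-off distribution. A quick sanity check over the absolute-discounting result from the example above makes this visible:
Python
# Per-context probability mass under the simplified absolute discounting
for ngram, next_words in abs_discount_dict.items():
    print(ngram, round(sum(next_words.values()), 4))
# e.g. ('the',) sums to about 0.72 rather than 1.0; the missing mass is what a
# full implementation would assign to unseen continuations via the back-off term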
Application of Discounting Techniques in Modern Language Models
While discounting techniques were originally developed for traditional n-gram models, their principles continue to influence modern deep learning-based models such as transformers. In these models, large training corpora and subword vocabularies greatly reduce the sparsity problem, but smoothing ideas akin to discounting still matter wherever explicit probability estimates are produced, for example in n-gram components of hybrid systems or when calibrating output distributions to avoid overconfident predictions.
Challenges and Future Directions
The main challenge with discounting techniques is finding the right balance between discounting enough to prevent zero probabilities for unseen n-grams and not discounting so much that the model underfits the observed data. Future research may explore hybrid models that combine discounting with other smoothing and interpolation techniques, particularly for domain-specific or low-resource language models.
Conclusion
Discounting techniques are indispensable in the development of robust and accurate language models. By adjusting the probability distribution to account for both observed and unseen word sequences, these techniques help prevent the model from assigning zero probability to any potential n-gram, thereby enhancing its generalizability. As language models evolve, the principles underlying discounting techniques continue to be relevant, ensuring that even the most sophisticated models can effectively handle data sparsity and unseen events.