Discounting Techniques in Language Models
Last Updated: 13 Aug, 2024
Language models are essential tools in natural language processing (NLP), responsible for predicting the next word in a sequence based on the words that precede it. A common challenge in building language models, particularly n-gram models, is the estimation of probabilities for word sequences that were not observed during training. Discounting techniques are critical in addressing this issue by adjusting the probabilities of observed word sequences and redistributing the remaining probability mass to account for unseen sequences.
This article delves into the various discounting techniques employed in language models, their mathematical underpinnings, and their importance in improving model accuracy.
Understanding the Need for Discounting
In n-gram models, the probability of a word w_n given its preceding words (context) is typically estimated based on frequency counts from a corpus. However, many word sequences might not appear in the training data, leading to zero probabilities for these unseen n-grams. Discounting methods adjust the probability of observed n-grams downward, freeing up probability mass that can be reassigned to unseen n-grams, thereby preventing the model from assigning zero probability to any sequence.
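To see the problem concretely, here is a minimal maximum-likelihood bigram estimate on a toy sentence (the corpus and the helper function name are purely illustrative, not part of any library):
Python
from collections import Counter

# Toy example: maximum-likelihood bigram estimates with no smoothing
tokens = "the cat sat on the mat".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def mle_bigram_prob(prev_word, word):
    # c(prev_word, word) / c(prev_word)
    return bigrams[(prev_word, word)] / unigrams[prev_word]

print(mle_bigram_prob("the", "cat"))  # 0.5  (observed bigram)
print(mle_bigram_prob("the", "dog"))  # 0.0  (unseen bigram gets zero probability)
Any sentence containing a single unseen bigram would be assigned zero probability by this estimator, which is exactly what discounting is designed to prevent.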
Common Discounting Techniques
1. Absolute Discounting
Absolute discounting is a straightforward approach where a fixed discount D is subtracted from the count of each observed n-gram. The remaining probability mass is then redistributed to account for unseen n-grams.
- Formula: P(w_n | w_{n-1}, w_{n-2}, \dots, w_{n-k+1}) = \frac{\max(c(w_{n-k+1}, \dots, w_{n-1}, w_n) - D, 0)}{c(w_{n-k+1}, \dots, w_{n-1})} + \lambda(w_{n-1}, \dots, w_{n-k+1}) P(w_n | w_{n-1}, w_{n-2}, \dots, w_{n-k+2}), where D is the fixed discount (typically 0 < D < 1) and λ is a context-dependent normalization factor that ensures the probabilities sum to one (a short worked example follows this list).
- Explanation: This method ensures that even if an n-gram has been observed, it is slightly discounted to make room for unseen n-grams, which are assigned a small probability.
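As a quick worked example with illustrative counts: if a context has been seen 8 times, a particular n-gram ending in w_n has been seen 3 times, and D = 0.75, the first term is \frac{\max(3 - 0.75, 0)}{8} \approx 0.281. The 0.75 of count removed here, together with the discounts taken from the context's other continuations, is the mass that the λ-weighted lower-order model spreads over unseen continuations.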
2. Good-Turing Discounting
Good-Turing discounting is based on the principle of frequency of frequencies. It estimates the probability of an n-gram by considering the likelihood of encountering unseen events, derived from how often n-grams with similar frequencies occur.
- Formula: P_{GT}(r) = \frac{(r+1) \cdot N_{r+1}}{N_r \cdot N} where r is the number of times an n-gram occurs, N_r is the number of distinct n-grams that occur exactly r times in the corpus, and N is the total number of n-gram tokens observed (see the worked example after this list).
- Explanation: Good-Turing discounting effectively smooths the probabilities by lowering the estimate for observed n-grams with low counts and reallocating that probability to unseen or rarely seen n-grams.
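For instance, with illustrative counts N_1 = 3 and N_2 = 2, an n-gram seen once is re-estimated to an adjusted count of r^* = \frac{(1+1) \cdot N_2}{N_1} = \frac{4}{3} \approx 1.33, and the total probability mass reserved for unseen n-grams is N_1 / N, the proportion of the corpus made up of n-grams seen exactly once.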
3. Kneser-Ney Smoothing
Kneser-Ney smoothing is one of the most sophisticated and widely used discounting techniques. It extends absolute discounting by adjusting not only the observed n-gram probabilities but also the back-off distribution.
- Formula: P_{KN}(w_n | w_{n-1}, \dots, w_{n-k+1}) = \frac{\max(c(w_{n-k+1}, \dots, w_{n-1}, w_n) - D, 0)}{c(w_{n-k+1}, \dots, w_{n-1})} + \lambda(w_{n-1}, \dots, w_{n-k+1}) P_{KN}(w_n | w_{n-1}, \dots, w_{n-k+2}), where the recursion bottoms out in a continuation probability P_{cont}(w_n) that is proportional to the number of distinct contexts w_n completes, rather than its raw frequency.
- Explanation: Kneser-Ney is unique in that it takes into account the diversity of contexts in which a word appears. It effectively balances the probability estimates across both frequent and rare n-grams, making it particularly effective for handling rare events.
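A standard illustration of this context-diversity idea: a word like "Francisco" can be frequent in a corpus yet appear almost exclusively after "San". A plain unigram back-off would happily predict it after any context, whereas the Kneser-Ney continuation probability (in the bigram case) P_{cont}(w) = \frac{|\{w' : c(w', w) > 0\}|}{|\{(w', w'') : c(w', w'') > 0\}|} stays small, because "Francisco" completes very few distinct bigram types.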
Implementation of Discounting Techniques
1. Absolute Discounting
This method reduces the count of each observed n-gram by a fixed discount value; the leftover probability mass is what would be redistributed to unseen n-grams through a back-off distribution, which is omitted in the simplified implementation below.
Python
from collections import defaultdict

def train_ngram_model(corpus, n):
    model = defaultdict(lambda: defaultdict(lambda: 0))
    # Count (context, next word) occurrences
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n - 1])   # the (n-1)-word context
            next_word = tokens[i + n - 1]
            model[ngram][next_word] += 1
    return model

def absolute_discounting(model, ngram, next_word, discount=0.75):
    # Discounted probability of a single continuation (back-off term omitted)
    total_count = sum(model[ngram].values())
    if total_count == 0:
        return 0.0
    return max(model[ngram][next_word] - discount, 0) / total_count

def generate_probability_distribution(model, discount=0.75):
    probabilities = defaultdict(lambda: defaultdict(lambda: 0))
    for ngram in model:
        total_count = sum(model[ngram].values())
        for next_word in model[ngram]:
            # Subtract the fixed discount from every observed count; the freed
            # mass is what a full implementation would pass to a back-off model
            probabilities[ngram][next_word] = max(model[ngram][next_word] - discount, 0) / total_count
    return probabilities
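Assuming the bigram model trained in the Example Usage section further below, the absolute_discounting helper can also be queried for a single continuation:
Python
# P(cat | the) with a fixed discount of 0.75 on the toy corpus below
print(absolute_discounting(model, ('the',), 'cat'))  # roughly 0.28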
2. Good-Turing Discounting
This method adjusts the counts based on how often n-grams with the same frequency appear. It's useful for handling rare events.
Python
from collections import defaultdict

def good_turing_discounting(model):
    freq_of_freqs = defaultdict(lambda: 0)
    total_ngrams = 0
    # Count the frequency of frequencies: how many n-grams occur r times
    for ngram in model:
        for count in model[ngram].values():
            freq_of_freqs[count] += 1
            total_ngrams += count
    # Re-estimate each count as (r + 1) * N_{r+1} / N_r and normalize
    probabilities = defaultdict(lambda: defaultdict(lambda: 0))
    for ngram in model:
        for next_word, count in model[ngram].items():
            # If no n-gram occurs (count + 1) times, N_{r+1} is 0 and the estimate
            # collapses to 0; practical variants smooth the N_r values to avoid this
            discounted_count = (count + 1) * (freq_of_freqs[count + 1] / freq_of_freqs[count])
            probabilities[ngram][next_word] = discounted_count / total_ngrams
    return probabilities
3. Kneser-Ney Smoothing
This advanced technique considers the diversity of contexts in which a word appears and is particularly effective for handling infrequent n-grams.
Python
def kneser_ney_smoothing(model, discount=0.75):
    continuation_counts = defaultdict(lambda: 0)
    probabilities = defaultdict(lambda: defaultdict(lambda: 0))
    # Continuation counts: in how many distinct contexts does each word appear?
    for ngram in model:
        for next_word in model[ngram]:
            continuation_counts[next_word] += 1
    total_types = sum(continuation_counts.values())  # number of distinct n-gram types
    # Interpolate the discounted estimate with the continuation probability
    for ngram in model:
        total_count = sum(model[ngram].values())
        for next_word in model[ngram]:
            continuation_prob = continuation_counts[next_word] / total_types
            discounted_count = max(model[ngram][next_word] - discount, 0)
            # Simplified interpolation weight; full Kneser-Ney scales the discount
            # by the number of distinct continuations observed for this context
            probabilities[ngram][next_word] = (
                discounted_count / total_count
                + (discount / total_count) * continuation_prob
            )
    return probabilities
Example Usage
Here's how you can use these functions with a simple corpus:
Python
corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "the cat with the hat",
    "the hat sat on the mat"
]
n = 2  # Bigram model

model = train_ngram_model(corpus, n)

# Absolute Discounting
abs_discount_probs = generate_probability_distribution(model)

# Good-Turing Discounting
gt_discount_probs = good_turing_discounting(model)

# Kneser-Ney Smoothing
kn_smooth_probs = kneser_ney_smoothing(model)

def convert_defaultdict_to_dict(d):
    if isinstance(d, defaultdict):
        d = {k: convert_defaultdict_to_dict(v) for k, v in d.items()}
    return d

def pretty_print_model(model):
    for ngram, next_words in model.items():
        print(f"{ngram}:")
        for word, prob in next_words.items():
            print(f" {word}: {prob:.4f}")
        print()

# Convert defaultdicts to normal dicts
abs_discount_dict = convert_defaultdict_to_dict(abs_discount_probs)
gt_discount_dict = convert_defaultdict_to_dict(gt_discount_probs)
kn_smooth_dict = convert_defaultdict_to_dict(kn_smooth_probs)

# Pretty print the models
print("Absolute Discounting:")
pretty_print_model(abs_discount_dict)

print("Good-Turing Discounting:")
pretty_print_model(gt_discount_dict)

print("Kneser-Ney Smoothing:")
pretty_print_model(kn_smooth_dict)
Output:
Absolute Discounting:
('the',):
cat: 0.2812
mat: 0.2812
hat: 0.1562
('cat',):
sat: 0.4167
with: 0.0833
('sat',):
on: 0.7500
('on',):
the: 0.7500
('with',):
the: 0.2500
('hat',):
sat: 0.2500
Good-Turing Discounting:
('the',):
cat: 0.0000
mat: 0.0000
hat: 0.3158
('cat',):
sat: 0.3158
with: 0.0702
('sat',):
on: 0.0000
('on',):
the: 0.0000
('with',):
the: 0.0702
('hat',):
sat: 0.0702
Kneser-Ney Smoothing:
('the',):
cat: 0.2917
mat: 0.2917
hat: 0.1667
('cat',):
sat: 0.4722
with: 0.1111
('sat',):
on: 0.7778
('on',):
the: 0.8056
('with',):
the: 0.4167
('hat',):
sat: 0.4167
The output shows the probability distribution for different n-grams in the corpus under three discounting techniques:
- Absolute Discounting: Every observed n-gram has its probability slightly reduced. For example, P(cat | the) is 0.2812 and P(hat | the) is 0.1562. The mass removed by the discount is what a full implementation would hand to unseen continuations so that none of them ends up with zero probability.
- Good-Turing Discounting: The most frequent bigrams, such as "the cat" (seen three times), come out as 0.0000 because no bigram in this tiny corpus occurs four times, so N_{r+1} = 0 and the simple estimate collapses; practical implementations smooth the N_r counts to avoid this. Less frequent bigrams such as "the hat" (seen twice) receive 0.3158, showing how the method shifts probability mass toward rarer events.
- Kneser-Ney Smoothing: Produces more balanced estimates that favor words appearing in diverse contexts. For example, P(cat | the) is 0.2917 and P(on | sat) is 0.7778, reflecting both raw frequency and context diversity.
Overall, Kneser-Ney Smoothing tends to produce more robust and context-aware probabilities, while Absolute Discounting and Good-Turing provide simpler but less nuanced adjustments.
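One caveat worth checking: these are simplified teaching implementations, so the probabilities for a given context do not sum to one; the mass removed by the discount is never actually handed to a back-off distribution. A quick sanity check over the absolute-discounting result from the example above makes this visible:
Python
# Per-context probability mass under the simplified absolute discounting
for ngram, next_words in abs_discount_dict.items():
    print(ngram, round(sum(next_words.values()), 4))
# e.g. ('the',) sums to about 0.72 rather than 1.0; the missing mass is what a
# full implementation would assign to unseen continuations via the back-off term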
Application of Discounting Techniques in Modern Language Models
While discounting techniques were originally developed for traditional n-gram models, their principles continue to influence modern deep learning-based models such as transformers. In these models, large training corpora and subword vocabularies greatly reduce the sparsity problem, but smoothing ideas akin to discounting still matter wherever explicit probability estimates are produced, for example in n-gram components of hybrid systems or when calibrating output distributions to avoid overconfident predictions.
Challenges and Future Directions
The main challenge with discounting techniques is finding the right balance between discounting enough to prevent zero probabilities for unseen n-grams and not discounting so much that the model underfits the observed data. Future research may explore hybrid models that combine discounting with other smoothing and interpolation techniques, particularly for domain-specific or low-resource language models.
Conclusion
Discounting techniques are indispensable in the development of robust and accurate language models. By adjusting the probability distribution to account for both observed and unseen word sequences, these techniques help prevent the model from assigning zero probability to any potential n-gram, thereby enhancing its generalizability. As language models evolve, the principles underlying discounting techniques continue to be relevant, ensuring that even the most sophisticated models can effectively handle data sparsity and unseen events.