Balancing augmentation and data quality
While data augmentation can significantly improve LLM performance, we need to strike a balance between the quantity of augmented data and its quality.
You should limit the proportion of augmented data in your training set. A common practice is to start with a 1:1 ratio of original to augmented data and adjust based on model performance.
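One way to enforce such a ratio is to cap how many augmented samples are mixed into the training set. The following is a minimal sketch, not a prescribed implementation: the function name `mix_with_ratio` and its parameters `aug_per_original` and `seed` are assumptions introduced here for illustration, and the samples are assumed to be held in plain Python lists.

```python
import random


def mix_with_ratio(original, augmented, aug_per_original=1.0, seed=0):
    """Combine original samples with a capped random subset of augmented ones.

    aug_per_original=1.0 corresponds to the 1:1 starting ratio;
    0.5 would allow one augmented sample per two originals.
    (Hypothetical helper, named here for illustration only.)
    """
    if aug_per_original < 0:
        raise ValueError("aug_per_original must be non-negative")

    augmented = list(augmented)
    # Upper bound on augmented samples implied by the ratio,
    # never more than what is actually available.
    max_aug = min(len(augmented), int(len(original) * aug_per_original))

    # Local RNG so the subset is reproducible and global state is untouched.
    rng = random.Random(seed)
    sampled = rng.sample(augmented, max_aug)

    # Shuffle so original and augmented samples are interleaved.
    mixed = list(original) + sampled
    rng.shuffle(mixed)
    return mixed


original = ["The cat sat on the mat.", "Dogs bark loudly."]
augmented = [
    "The cat rested on the mat.",
    "A cat sat on the rug.",
    "Dogs bark very loudly.",
]
train_set = mix_with_ratio(original, augmented, aug_per_original=1.0)
print(len(train_set))
```

Adjusting the ratio then amounts to rerunning training with a few values of `aug_per_original` (for example 0.5, 1.0, and 2.0) and keeping whichever setting performs best on held-out data.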
Quality filtering
You can implement quality checks to filter out low-quality augmented samples:
def quality_filter(
    augmented_texts, original_text,
    similarity_threshold=0.8, perplexity_threshold=100
):
    filtered_texts = []
    for aug_text in augmented_texts:
        if (
            semantic_similarity(
                original_text, aug_text, similarity_model...