Semantic preservation in text augmentation
Maintaining semantic integrity is crucial when augmenting data for LLMs. We must ensure that our techniques don’t alter the original meaning of the text.
Use of sentence embeddings
By comparing the sentence embeddings of the original and augmented texts, you can check that each augmented text stays semantically close to the original:
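# Assumption: cosine_similarity is scikit-learn's pairwise helper.
from sklearn.metrics.pairwise import cosine_similarity

def semantic_similarity(original, augmented, model):
    # Embed both texts, then compare the two embedding vectors.
    original_embedding = model.encode(original)
    augmented_embedding = model.encode(augmented)
    # cosine_similarity expects 2D inputs and returns a 1x1 matrix here.
    similarity = cosine_similarity(
        [original_embedding], [augmented_embedding]
    )[0][0]
    return similarity

def filter_by_semantic_similarity(
    original, augmented_list, model, threshold=0.8
):
    return [
        aug for aug in augmented_list
        ...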
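As a minimal, self-contained sketch of threshold-based filtering, the following example runs without any external model. The names ToyEncoder, cosine, and keep_similar are hypothetical and not from the original text. The bag-of-words encoder is only a stand-in for a real sentence-embedding model, and keeping candidates whose score is at least the threshold is an assumption about how the filter might be applied.

```python
import math
from collections import Counter


class ToyEncoder:
    """Hypothetical stand-in for a sentence-embedding model (bag-of-words counts)."""

    def encode(self, text):
        return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text has no direction to compare.
    return dot / (norm_a * norm_b)


def keep_similar(original, candidates, model, threshold=0.8):
    # Assumption: a candidate is kept when its similarity is >= threshold.
    reference = model.encode(original)
    return [
        candidate
        for candidate in candidates
        if cosine(reference, model.encode(candidate)) >= threshold
    ]


model = ToyEncoder()
original = "the cat sat on the mat"
candidates = [
    "the cat sat on the rug",           # one word swapped
    "stock prices fell sharply today",  # unrelated meaning
]
kept = keep_similar(original, candidates, model)
print(kept)
```

With this toy encoder, the first candidate shares most of its words with the original and passes the 0.8 threshold, while the unrelated sentence is dropped. A word-count encoder cannot recognize paraphrases that use different vocabulary, which is the reason to use sentence embeddings in practice. The threshold itself usually needs tuning per model and dataset.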