Pre-training and fine-tuning data splits
In LLMs, data splits refer to the division of datasets into training, validation, and test sets so that the model learns generalizable patterns rather than memorizing its data. Proper splits are essential for evaluating performance fairly, tuning hyperparameters, and preventing data leakage between what the model was trained on and what it is evaluated on. Careful splitting matters especially for LLMs because of their scale, the diversity of tasks they cover, and the need to assess generalization across domains and tasks.
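As a quick sketch of the mechanics (the placeholder corpus and the 80/10/10 ratio are illustrative assumptions, not fixed rules), scikit-learn's `train_test_split` can be applied twice to produce the three sets:

```python
from sklearn.model_selection import train_test_split

# Illustrative placeholder corpus; in practice these would be documents.
documents = [f"document {i}" for i in range(1000)]

# First hold out 20% of the corpus for evaluation...
train_docs, heldout_docs = train_test_split(
    documents, test_size=0.2, random_state=42
)

# ...then split the held-out pool in half, yielding an 80/10/10
# train/validation/test split overall.
val_docs, test_docs = train_test_split(
    heldout_docs, test_size=0.5, random_state=42
)

print(len(train_docs), len(val_docs), len(test_docs))  # 800 100 100
```

Fixing `random_state` makes the split reproducible, which is what allows a validation or test result to be compared across runs.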
Stratified sampling for pre-training data
Stratified sampling first divides the population into subgroups (strata) based on shared characteristics, then samples randomly within each stratum so that every group is proportionally represented in the final sample. This is particularly useful when dealing with imbalanced datasets.
When creating data splits for pre-training, it’s important to ensure that each split represents the diversity of the entire dataset. Stratifying on document-level attributes such as domain or source is one way to achieve this, as in the sketch below.
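Here is a minimal sketch of a stratified split (the domain tags and the 70/20/10 corpus mix are hypothetical, chosen only to illustrate the idea); scikit-learn's `stratify` argument enforces proportional representation:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical corpus in which every document carries a domain tag
# (the strata); the mix below is an illustrative assumption.
documents = (
    [("web page text ...", "web")] * 700
    + [("code snippet ...", "code")] * 200
    + [("paper abstract ...", "academic")] * 100
)
texts, domains = zip(*documents)

# stratify keeps the web/code/academic proportions identical in every
# split, so minority strata are not lost from validation or test.
train_x, heldout_x, train_d, heldout_d = train_test_split(
    texts, domains, test_size=0.2, stratify=domains, random_state=42
)
val_x, test_x, val_d, test_d = train_test_split(
    heldout_x, heldout_d, test_size=0.5, stratify=heldout_d, random_state=42
)

print(Counter(train_d))  # ~70% web, 20% code, 10% academic
print(Counter(test_d))   # same proportions in the test split
```

Stratifying per document like this keeps rare domains represented in every split; for very large pre-training corpora, the same idea can instead be applied per shard or per data source.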