NLU benchmarks
Natural language understanding (NLU) is a crucial capability of LLMs. Let’s explore some of the most recent and widely used benchmarks in this domain.
Massive multitask language understanding
Massive multitask language understanding (MMLU) is a comprehensive benchmark that tests models across 57 subjects, including science, mathematics, engineering, and the humanities. Each subject is tested with multiple-choice questions, typically evaluated in a few-shot setting, so the benchmark assesses both the breadth and the depth of a model’s knowledge.
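To make the task format concrete, here is a minimal sketch of how a single MMLU-style item can be rendered as a multiple-choice prompt. The question below is a made-up placeholder, not an actual benchmark item:

# A hypothetical MMLU-style item: one question, four options, and the
# letter of the correct answer. The content is invented for illustration.
item = {
    "question": "Which planet in the Solar System has the greatest mass?",
    "choices": ["Earth", "Jupiter", "Saturn", "Mars"],
    "answer": "B",
}

# Render the item as a prompt; the model is scored on whether it picks
# the correct letter as the continuation after "Answer:".
prompt = item["question"] + "\n"
for letter, choice in zip("ABCD", item["choices"]):
    prompt += f"{letter}. {choice}\n"
prompt += "Answer:"
print(prompt)

In practice, evaluation harnesses usually compare the model’s likelihood of each answer letter rather than relying on free-form generation.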
Here’s an example of how you might evaluate an LLM on MMLU using the lm-evaluation-harness library:
from lm_eval import evaluator

def evaluate_mmlu(model):
    # Run the harness on the MMLU task group with 5-shot prompting.
    # Note: task names vary across harness versions; "mmlu" is the
    # grouped task in recent releases.
    results = evaluator.simple_evaluate(
        model=model,
        tasks=["mmlu"],
        num_fewshot=5,
        batch_size=1,
    )
    return results
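Assuming the model has been wrapped in one of the harness’s model classes, a usage sketch might look like the following. HFLM and the "gpt2" checkpoint are stand-ins here, and the exact import path and metric keys depend on the version of the harness you have installed:

from lm_eval.models.huggingface import HFLM

# Wrap a Hugging Face checkpoint for the harness; "gpt2" is only a
# placeholder used for illustration.
lm = HFLM(pretrained="gpt2", batch_size=1)

results = evaluate_mmlu(lm)

# Aggregate metrics are stored under results["results"], keyed by task
# name; the metric names (e.g. "acc,none") vary across harness versions.
print(results["results"]["mmlu"])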