A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization
Abstract
Open-text responses provide researchers and educators with rich, nuanced insights that multiple-choice questions cannot capture. When reliably assessed, such responses have the potential to enhance teaching and learning. However, scaling and consistently capturing these nuances remain significant challenges, limiting the widespread use of open-text questions in educational research and assessments. In this paper, we introduce and evaluate GradeOpt, a unified multi-agent automatic short-answer grading (ASAG) framework that leverages large language models (LLMs) as graders for short-answer responses. More importantly, GradeOpt incorporates two additional LLM-based agents—the reflector and the refiner—into the multi-agent system. This enables GradeOpt to automatically optimize the original grading guidelines by performing self-reflection on its errors. To assess GradeOpt’s effectiveness, we conducted experiments on two representative ASAG datasets, which include items designed to capture key aspects of teachers’ pedagogical knowledge and students’ learning progress. Our results demonstrate that GradeOpt consistently outperforms representative baselines in both grading accuracy and alignment with human evaluators across different knowledge domains. Finally, comprehensive ablation studies validate the contributions of GradeOpt’s individual components, confirming their impact on overall performance.
keywords:
Large Language Models, Automatic Grading, Mathematics Education, Teacher Knowledge, Assessments, Large-Scale Testing, Mathematical Knowledge for Teaching, Physical Science, Learning Assessments, Content Knowledge, Pedagogical Content Knowledge
1 Introduction
Accurate and timely evaluation of assignments and examinations is vital to learning because performance measurement is central to the learning process suzen2020textmining . Traditionally, multiple-choice questions (MCQs), which ask students to select the correct answer from distracting options, dominated learning assessment studies. While this approach makes data available promptly (mohler2011learning; burrows2015eras), it falls short in providing insights into learners’ thinking. Open-ended short-answer questions (SAQs) can provide deeper insights into students’ reasoning and knowledge, because they are known to elicit the thinking path that describes how a student arrives at a conclusion (leacock2003c-rater). Unfortunately, grading open-ended textual answers is tedious, as substantial resources and time are needed to train raters to code responses accurately and consistently (roy2015perspective). More importantly, inconsistent or unfair assessments caused by divergent interpretations, biases, or mistakes pose another challenge to SAQ grading in practice (suzen2020textmining). To mitigate these issues and provide timely and consistent evaluation, automatic short-answer grading (ASAG) burrows2015eras systems have become appealing. ASAG, which can be traced back to the 1960s, has bloomed in recent years due to advancements in natural language processing (NLP) leacock2003c-rater ; xie2024grade . Early ASAG systems often used pattern-matching techniques and hand-crafted features leacock2003c-rater . Thus, those systems required intensive human labor to build and were limited to a few specific grading tasks. The rise of deep learning (DL) has reduced the burdensome feature design that early ASAG systems needed. DL provides an end-to-end solution that automatically learns to output grading scores from a large number of graded answer samples hassan2018automatic .
Due to the strong data-fitting capability of DL models, DL-based ASAG systems can be extended to different tasks if a large number of annotated samples are available. However, when the annotated sample size is limited, DL-based ASAG systems often face serious over-fitting issues. Beyond that, as DL models are black boxes whose results lack interpretability, the application of DL-based ASAG systems is still limited condor2024explainable .
The emergence of pre-trained language models (PLMs) and the more advanced Large Language Models (LLMs) has recently revolutionized the design of ASAG systems due to their human-like language ability and human-interpretable intermediate textual results. Therefore, many recent studies have attempted to build ASAG systems with LLMs. Promising results have been demonstrated using fine-tuning latif2023finetuning and prompting techniques such as Chain-of-Thought (CoT) cohn2024science and in-context learning lee2024applyingllm . Yet these techniques remain constrained by LLMs’ inherent limitations, such as sensitivity to prompts and context window restrictions, which make the complex ASAG task challenging for an LLM grader. In practice, accurate, standardized, and unambiguous guidelines are critical to help human graders formulate a precise interpretation of scoring criteria. For LLM-based ASAG systems, those guidelines also serve as the principal instructions. They teach LLMs to perform the grading task following a standard similar to that of human graders. However, directly using guidelines composed by pedagogical experts is sub-optimal for LLMs, since general-purpose LLMs lack domain-specific knowledge and can misinterpret the guidelines (leacock2003c-rater). Meanwhile, LLMs are often sensitive to various facets of prompts (jiang2020knowlmsknow), where minor changes can lead to large differences in performance. Optimizing the guidelines manually for LLMs can also take a lot of trial and error. Thus, recent works propose to conduct guideline modification with LLMs to offload human burden cohn2024science . While the modified guidelines yield performance improvement, the prompt search space in these methods is relatively limited. Because of this, the modified guidelines are not necessarily optimal. Additionally, substantial human effort, such as timely feedback or a large amount of labeling, is required.
Therefore, methods to optimize grading guidelines automatically and effectively are still desired.
In this paper, we propose a unified multi-agent ASAG framework that automatically optimizes grading guidelines. Specifically, it employs an iterative reflection mechanism to generate task prompts (guidelines) that effectively capture learners’ thinking and knowledge from a small dataset of short answers. To achieve this, we innovatively introduce prompt optimization in ASAG, framing grading guideline refinement as an optimization problem aimed at maximizing accuracy. Inspired by APO pryzant2023gradient , we develop novel techniques such as misconfidence-based selection, iterative optimization, and log-probability-based robustness to enhance the framework’s stability in producing accurate and trustworthy score predictions on unseen datasets. To minimize human labeling effort, our mechanism intelligently selects short-answer samples that contribute to optimal guideline refinement. Additionally, the framework supports assessments across varying levels of complexity, offering interpretable evaluations for each learning objective while improving overall scoring accuracy. To validate our approach, we conducted experiments on two real-world grading datasets. The first dataset comprises responses from high school students within a physical sciences curriculum, while the second consists of a national sample of teachers answering questions designed to assess content-specific knowledge required for teaching (copur2022mathematics, ). Experimental results demonstrate that GradeOpt outperforms representative baselines in both accuracy and alignment. Further analysis highlights consistent improvements in test accuracy across iterations, showcasing the framework’s ability to continuously enhance grading guidelines. To the best of our knowledge, we are the first to apply prompt optimization in ASAG by refining grading guidelines akin to generating an optimal task prompt. 
We believe that our multi-agent reflective mechanism can unlock the full potential of LLMs in learning analytics by providing detailed and accurate assessments while significantly reducing educators’ grading workload.
Figure 1: Illustration of the proposed framework.
2 Related Work
2.1 Automatic Short Answer Grading
Automatic Short Answer Grading (ASAG) is often treated as a text classification or regression problem in NLP studies. Here we mainly focus on classification due to its relevance to our setting. Traditional ASAG models mainly rely on text similarity and employ classic ML classifiers. They use lexical features such as bag-of-words (BOW) mohler2011learning and TF-IDF del2023gradeaid , or syntactic features indicating the structure of sentences leacock2003c-rater . However, these methods require significant manual design, which makes them hard to apply to new datasets. To reduce the burden of feature engineering, Deep Neural Networks (DNNs) such as Long Short-Term Memory (LSTM) networks are utilized hassan2018automatic , which produce superior results but suffer from limited generalizability. Pre-trained BERT-based models provide enhanced versatility through transfer learning, including on ASAG datasets camus2020investigating . To further enhance grading accuracy, researchers have attempted to ensemble BERT with statistics-based methods erickson2020automated and data augmentation lun2020multiple . LLMs are increasingly utilized in ASAG and similar assessment tasks yang2024content ; cohn2024science ; chu2025enhancing . However, their prompts are almost always manually crafted and thus unable to properly adapt to new datasets. To solve this issue, several works have shifted attention to assisting educators with guideline creation xie2024grade ; cohn2024science .
2.2 LLM Prompt Optimization and Reflection
Prompts are critical to the success of LLMs zhou2023ape . To tailor LLMs to challenging tasks, manually crafted prompts are adopted to enhance the performance wei2023cot . To automate the generation and optimization of prompts, prompt optimization emerges as a promising method for input prompt refinement.
Using these techniques, LLMs have demonstrated superior performance in many downstream tasks, particularly in instruction following and reasoning pryzant2023gradient ; zhou2023ape ; yang2024large . However, such automatic methods are risky when directly applied to ASAG tasks considering the limitations of LLMs such as hallucination huang2023surveyhallucinationllm and misalignment kenton2021alignment . To enhance both accuracy and trustworthiness, we adopt the idea of the state-of-the-art prompt optimization method APO pryzant2023gradient and implement novel techniques for reliability. Similar to how humans gather knowledge from failures, experience-and-reflect pan2023automatically is an important technique for improving LLMs’ alignment with task specifications. Through reflection, LLMs learn from failure, which enriches their knowledge base and provides a valuable reference in similar scenarios. Self-reflection has demonstrated promising results in improving LLM reasoning shinn2023reflexion ; madaan2023selfrefine . However, LLMs’ reflection ability is relatively limited when it comes to self-correction without human feedback or true labels huang2024largelanguagemodelsselfcorrect . A recent work tyen2024llmscannot divides the task of self-correction into two steps: mistake finding and output correction. The authors empirically show that while LLMs struggle to find errors, their correction ability is robust when given ground-truth labels. This supports our proposed framework, which similarly uses true labels to guide LLM reflection.
3 Problem Statement
We define ASAG as a text classification task, which grades the short answer text $x$ by classifying it into discrete score categories, $\hat{y} = f(x)$ with $\hat{y} \in \{c_1, \dots, c_M\}$, where $f$ is an ASAG system, $\hat{y}$ is the score prediction, $c_j$ is a score category, and $M$ is the size of the score category set. When $f$ is an LLM, the grading guideline text $G$ is concatenated at the front of $x$ as an instructional prompt, and the grading process can be expressed as $\hat{y} = f(G \oplus x)$. In this work, we focus on leveraging the reflection and refining capabilities of LLMs to automatically generate an optimized grading guideline $G^*$ based on a small amount of graded short answer text $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of graded samples. The goal of our framework can be expressed as $G^* = \arg\max_{G \in \mathcal{G}} \sum_{i=1}^{N} \mathbb{1}\left(f(G \oplus x_i) = y_i\right)$, where $\mathcal{G}$ is the space of potential grading guidelines and $\mathbb{1}(\cdot)$ is an indicator function that is 1 if $f(G \oplus x_i) = y_i$ and 0 otherwise. Once the optimization process is finished, our framework concatenates the optimized guideline $G^*$ at the front of each unlabeled short answer text $x$ and generates the grading result, $\hat{y} = f(G^* \oplus x)$.
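As a minimal sketch of this objective, the snippet below scores a few candidate guidelines by the fraction of graded answers that a grader labels correctly and keeps the best one. The `toy_grader` function, the candidate guidelines, and the sample data are hypothetical stand-ins for an LLM call and real annotations; they are assumptions for illustration and not part of GradeOpt.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[str, int]  # (answer text, human score)


def accuracy(grader: Callable[[str, str], int], guideline: str,
             data: Sequence[Sample]) -> float:
    """Fraction of samples where the grader's score equals the human score."""
    hits = sum(1 for x, y in data if grader(guideline, x) == y)
    return hits / len(data)


def best_guideline(grader: Callable[[str, str], int],
                   candidates: Sequence[str],
                   data: Sequence[Sample]) -> str:
    """Return the candidate guideline with the highest accuracy on data."""
    return max(candidates, key=lambda g: accuracy(grader, g, data))


def toy_grader(guideline: str, answer: str) -> int:
    """Hypothetical stand-in for an LLM grader: comma-separated keyword lookup."""
    keywords = [k.strip() for k in guideline.lower().split(",") if k.strip()]
    return int(any(k in answer.lower() for k in keywords))


data = [("opposite charges attract", 1),
        ("they just move", 0),
        ("like charges repel", 1)]
candidates = ["attract", "attract, repel"]
chosen = best_guideline(toy_grader, candidates, data)
```

In practice the candidate set is not enumerated up front; the sketch only illustrates what "maximizing accuracy over guidelines" means on a small labeled set.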
4 Method
In this section, we introduce our unified multi-agent ASAG framework GradeOpt. It can automatically optimize the grading guidelines and achieve better grading alignment with human experts. Next, we first give an overview of GradeOpt. Then, we detail the LLM-based agent design, and implementation details.
4.1 An Overview
As demonstrated in Figure 1, GradeOpt consists of two stages: training and test-time adaptation. The training stage is supported by three LLM-based agents: Grader, Reflector, and Refiner. They synergistically enhance the grading guidelines by optimizing the score classification accuracy using the graded answers to the SAQs (i.e., the training data). In the test-time adaptation stage, the system first performs an out-of-distribution (OOD) test over a small number of unlabeled answers sampled from the test data. To be specific, by checking the log-likelihood score of the predicted grading results, GradeOpt decides whether the optimized guidelines can be applied to the test data directly. If the test fails, the current guideline is not optimal for the test data, and our framework improves it via test-time training. If the test succeeds, GradeOpt performs auto-grading over the whole test data.
4.2 Training Stage
The training stage optimizes the guideline so that the Grader agent achieves the best possible grading performance on the training dataset. GradeOpt leverages a multi-agent framework powered by three agents, which collaboratively predict scores for the training answers, identify errors, and suggest rule modifications to mitigate those errors.
Before diving into the details of this stage, we first give a brief introduction to the three key components of a common grading guideline: Question Stem, Key Concept, and Scoring Rubric. Specifically, the Question Stem contains the complete question contents, the Key Concept describes the tested knowledge concepts, and the Scoring Rubric is the operational guidance instructing human graders how to score responses. As previously mentioned, directly using these expert-written components as the grading guideline for Grader is sub-optimal, since human-oriented scoring rubrics commonly lack detailed explanations of some concepts. As a result, LLM-based grading methods can produce ambiguous judgments. To solve this issue, GradeOpt focuses on optimizing the guideline by appending new Adaptation Rules that provide detailed explanations derived from reflections on failed predictions and identified errors. In Figure 2, we present an example of an optimized grading guideline. Specifically, when given the expert-designed input containing “Task Description”, “Question Stem”, “Key Concept” and “Scoring Rubrics”, GradeOpt automatically generates the additional descriptions in “Adaptation Rules”. These new rules help describe how to assign a grade based on answer patterns and details.
The training procedure is shown in the left sub-figure of Figure 1. During training, the optimization is conducted iteratively. In each round, GradeOpt first draws a batch of samples from the training data and sends them to the Grader agent for grading. GradeOpt compares the grades output by the LLM with the human-annotated scores and identifies the error samples. These samples are then sent to the Reflector agent for error reflection. Based on the reflections generated from those error samples, the Reflector agent proposes a series of suggestions for improving the current guideline. The suggestions are then sent to the Refiner agent, which fuses them with the current guideline and generates a refined guideline for the next iteration of optimization. Next, we introduce the detailed designs of the three agents in GradeOpt. Then, we present the implementation details of the iterative optimization process.
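One round of the grade, reflect, refine cycle described above can be sketched as follows. This is an illustrative outline only: the three agents are passed in as plain callables, and the `stub_*` functions are hypothetical placeholders for LLM-backed agents, not the prompts shown in the paper's figures.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, int]        # (answer text, human score)
Error = Tuple[str, int, int]    # (answer text, human score, predicted score)


def training_round(guideline: str,
                   batch: List[Sample],
                   grade: Callable[[str, str], int],
                   reflect: Callable[[str, List[Error]], str],
                   refine: Callable[[str, str], str]) -> str:
    """One grade -> reflect -> refine step; returns the next guideline."""
    # Grader: score every answer in the batch under the current guideline.
    graded = [(x, y, grade(guideline, x)) for x, y in batch]
    # Keep only answers whose predicted score disagrees with the human score.
    errors = [(x, y, y_hat) for x, y, y_hat in graded if y_hat != y]
    if not errors:
        return guideline  # nothing to learn from in this batch
    # Reflector: turn the errors into improvement suggestions.
    suggestions = reflect(guideline, errors)
    # Refiner: fuse the suggestions into the guideline.
    return refine(guideline, suggestions)


# Hypothetical stubs standing in for LLM-backed agents.
def stub_grade(guideline: str, answer: str) -> int:
    return int("force" in answer or ("repel" in guideline and "repel" in answer))


def stub_reflect(guideline: str, errors: List[Error]) -> str:
    return "Accept answers that mention repel."


def stub_refine(guideline: str, suggestions: str) -> str:
    return guideline + "\nAdaptation rule: " + suggestions


batch = [("electric force acts", 1), ("charges repel", 1), ("no idea", 0)]
g0 = "Score 1 if the answer explains the interaction."
g1 = training_round(g0, batch, stub_grade, stub_reflect, stub_refine)
g2 = training_round(g1, batch, stub_grade, stub_reflect, stub_refine)
```

With these stubs, the first round adds one adaptation rule and the second round leaves the guideline unchanged because no errors remain.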
4.2.1 Agent Configurations
Grader
The Grader focuses on mapping each answer to a score based on the given guideline. In GradeOpt, we leverage the exceptional instruction-following capability of LLMs by using a prompt that instructs them to simulate the grading process of human graders. To fully exploit the potential of LLMs, we incorporate the Chain-of-Thought prompting strategy wei2023cot . This encourages LLMs to provide both a judgment and intermediate reasoning steps in their outputs. With this design, the Grader becomes better aligned with the human-like grading process. Meanwhile, the intermediate reasoning steps help the Reflector discover potential improvements to the given guideline. The prompt for the Grader agent is shown in Figure 3.
Reflector
The role of Reflector is to propose ways to improve the current guideline by reflecting on the error samples returned by Grader. To be specific, we design a two-step instruction prompt for LLMs to achieve this goal. In the first step, the LLM is instructed to analyze the individual and shared failure reasons for a set of error samples. Then, in the second step, we ask the LLM to propose suggestions that can help resolve those issues. In general, this two-step improvement process is analogous to the gradient descent algorithm used for parameter optimization in machine learning (ruder2017overviewgradientdescentoptimization). In our case, the guideline serves as the parameter of Grader, and identifying the error reasons is similar to computing the “gradient”. Finally, proposing improvement suggestions based on the discovered reasons is similar to taking a descent step along the “gradient” and thus optimizing the guideline. The prompt for the Reflector agent is shown in Figure 4.
Refiner
The role of Refiner is to generate a new guideline based on the suggestions from Reflector. Specifically, Refiner is asked to modify the examples and illustrations in the Adaptation Rules. Such edits include adding, removing, or editing content. Note that we keep the other components, i.e., the Question Stem, Key Concept, and Scoring Rubric, unchanged since they are composed by human experts, and any small change may distort the scoring logic away from its original design. The refined guideline can be expressed as the text concatenation of the unchanged components with the updated Adaptation Rules. The prompt for the Refiner agent is given in Figure 5.
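The constraint that only the adaptation rules change can be made explicit in a small data structure. The sketch below is one possible representation; the class name, field names, and the blank-line separator are assumptions chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Guideline:
    """Grading guideline split into expert-written and learned parts."""
    question_stem: str
    key_concept: str
    scoring_rubric: str
    adaptation_rules: str = ""  # the only part the Refiner may rewrite

    def render(self) -> str:
        """Concatenate the non-empty components into one prompt string."""
        parts = [self.question_stem, self.key_concept,
                 self.scoring_rubric, self.adaptation_rules]
        return "\n\n".join(p for p in parts if p)


def apply_refinement(g: Guideline, new_rules: str) -> Guideline:
    """Swap in new adaptation rules; expert-written parts stay untouched."""
    return replace(g, adaptation_rules=new_rules)


base = Guideline("Q: Why do the spheres move apart?",
                 "Concept: like charges repel.",
                 "Rubric: 1 if repulsion is explained, else 0.")
refined = apply_refinement(base, "Rule: accept 'push away' as repulsion.")
```

Making the dataclass frozen means a refinement always produces a new guideline object, which is convenient when several candidate guidelines are kept side by side.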
4.2.2 Iterative Optimization Designs
Nested Iteration
The high complexity of test questions and grading guidelines makes it nontrivial to implement the optimization directly. Beyond that, the limited input context window of LLMs prevents them from accepting all examples in the training data at once. To resolve this, we propose a nested iterative optimization approach, i.e., an inner and an outer loop, in GradeOpt. Specifically, during each outer loop, GradeOpt selects an outer batch of samples from the training data and sends them with the current guideline to Grader for grading. Then, the wrongly graded answers are filtered out for reflection. However, due to the input context window limitation, the errors in the outer batch cannot all be processed by Reflector and Refiner simultaneously. Thus, we introduce the inner loop, which samples an inner batch from the outer-batch errors and updates the guideline iteratively.
To accelerate the optimization process and encourage a wider exploration of the possible combinations of error samples in the outer batch, we integrate the beam search strategy (freitag2017beam) within both the inner and outer loops. The nested iteration is shown in Algorithm 1. To be specific, in each inner loop of an outer iteration, GradeOpt accepts a beam of guidelines from the previous inner iteration instead of a single guideline for refining (line 5). Then, during the inner iteration, each guideline in the beam is sent for reflection and refinement with independently sampled inner batches in parallel (line 9). After all refined guidelines for the current inner loop are generated, each new guideline is tested on a hold-out validation set (line 14). The top-K performing guidelines are then kept as the new beam and passed to the next inner loop. Finally, the beam output by the last inner iteration is sent to the next outer iteration (line 4).
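The beam update inside one inner iteration can be sketched as below. This is a simplified illustration under stated assumptions: `expand` stands for "reflect and refine on an independently sampled inner batch", `score` stands for a validation metric, candidates are ranked by plain top-k on that score, and parent guidelines are kept as candidates. The function and parameter names are hypothetical.

```python
from typing import Callable, List


def beam_step(beam: List[str],
              expand: Callable[[str], List[str]],
              score: Callable[[str], float],
              k: int) -> List[str]:
    """Expand each guideline in the beam and keep the k best candidates."""
    candidates = list(beam)           # assumption: parents stay in the pool
    for g in beam:
        candidates.extend(expand(g))  # each call stands for reflect + refine
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    # sorted() is stable, so ties keep their original relative order.
    return sorted(unique, key=score, reverse=True)[:k]


# Toy usage: "refinement" appends a letter, and the score counts "a"s.
def toy_expand(g: str) -> List[str]:
    return [g + "a", g + "b"]


def toy_score(g: str) -> float:
    return float(g.count("a"))


beam1 = beam_step(["x"], toy_expand, toy_score, k=2)
beam2 = beam_step(beam1, toy_expand, toy_score, k=2)
```

Replacing the plain top-k ranking with a bandit-style selection rule would only change the last line of `beam_step`.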
While this procedure helps increase accuracy and reliability, blindly increasing the number of iterations could lead to over-fitting and higher computational overhead juneja2024taskfacet . This is particularly true for smaller datasets. To help address these challenges, we introduce an early-stopping criterion. Specifically, during the top-K selection performed in each inner loop, we record the performance metric of the best-performing guideline. Then, in the next inner iteration, we check whether this metric has improved. If it stops improving for two consecutive inner iterations, the current guideline is at risk of over-fitting, and the following inner iterations are skipped. Similarly, if the inner iterations of two consecutive outer iterations are both terminated by early stopping, the following outer iterations are also skipped.
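The inner-loop stopping rule amounts to a patience counter on the best validation metric. A minimal sketch follows; the function name is hypothetical, and the input is assumed to be the best metric recorded at each inner iteration, in order.

```python
from typing import Iterable


def iterations_before_stop(best_scores: Iterable[float], patience: int = 2) -> int:
    """Count the iterations executed before early stopping triggers.

    `best_scores` yields the best validation metric of each iteration.
    Stopping triggers after `patience` consecutive non-improving iterations.
    """
    best = float("-inf")
    stale = 0   # consecutive iterations without improvement
    steps = 0
    for s in best_scores:
        steps += 1
        if s > best:
            best, stale = s, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return steps


stopped_at = iterations_before_stop([0.50, 0.60, 0.60, 0.59, 0.70])
ran_all = iterations_before_stop([0.10, 0.20, 0.30])
```

In the first call the metric fails to improve at the third and fourth iterations, so the fifth is never run.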
Batch Sampling Strategies
Using self-reflective approaches of LLMs to refine grading guidelines requires exposure to similar errors in consecutive optimization iterations, because LLMs often cannot generate appropriate modifications in one attempt ma2024llmsgood . This is especially true for complicated cases involving nuanced differences between score categories. However, the randomness of batch sampling in the outer loop fails to guarantee this prerequisite, which limits the performance of GradeOpt. To solve this, we develop a novel sampling strategy, which leverages the misconfidence metric (xu2024misconfidence) to find challenging examples in the training data. To be specific, given an answer as input to Grader and its human grading result, we calculate the misconfidence, which quantifies the discrepancy between the highest log probability that Grader assigns to an incorrect prediction and the log probability it assigns to the correct prediction. Intuitively, a larger misconfidence indicates that Grader gives the wrong judgment with relatively high confidence compared with the correct one, implying that the sample is more challenging. However, calculating misconfidence over all training samples is computationally expensive and cannot be done directly in each iteration. To avoid introducing additional computing cost, we only calculate misconfidence for the samples in the current iteration batch and select the top-ranked samples as seeds to query similar samples from the training data through embedding similarities. In this way, we simplify the selection process and ensure that similar challenging examples appear in consecutive iterations. Finally, to avoid optimizing over the same portion of the training samples all the time, we select only half of each batch based on misconfidence and keep the other half as random samples. The detailed comparisons between the batch sampling strategies are presented in Section 5.7.
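The sampling strategy can be sketched in two small functions. Several details here are assumptions made only for illustration: misconfidence is computed as a difference of log probabilities (the text says only that it quantifies a discrepancy), similarity is cosine similarity over precomputed nonzero embedding vectors, and the seed samples themselves are excluded from the next batch.

```python
import math
import random
from typing import Dict, List, Sequence


def misconfidence(logprobs: Dict[int, float], gold: int) -> float:
    """Highest log-prob among wrong labels minus the log-prob of the gold label.

    The difference form is an assumption made for this sketch.
    """
    wrong = max(lp for label, lp in logprobs.items() if label != gold)
    return wrong - logprobs[gold]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def next_batch(seed_ids: List[int], embeddings: List[Sequence[float]],
               batch_size: int, rng: random.Random) -> List[int]:
    """Half the batch: items most similar to the hard seeds; half: random."""
    pool = [i for i in range(len(embeddings)) if i not in seed_ids]
    half = batch_size // 2
    # Rank pool items by their best similarity to any seed example.
    ranked = sorted(
        pool, reverse=True,
        key=lambda i: max(cosine(embeddings[i], embeddings[s]) for s in seed_ids))
    similar = ranked[:half]
    rest = [i for i in pool if i not in similar]
    return similar + rng.sample(rest, min(batch_size - half, len(rest)))


# Toy usage with three score categories and 2-d embeddings.
lp = {0: math.log(0.7), 1: math.log(0.2), 2: math.log(0.1)}
hard = misconfidence(lp, gold=1)   # grader prefers label 0 over the gold label 1
easy = misconfidence(lp, gold=0)   # grader already prefers the gold label
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
batch = next_batch([0], emb, batch_size=2, rng=random.Random(0))
```

Under the difference form, a positive value means the grader is more confident in some wrong label than in the correct one, which matches the intuition that larger values flag harder samples.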
4.3 Test-time Adaptation Stage
In this stage, GradeOpt performs automatic grading on the large-scale unlabeled responses in the test data. However, due to the diversity of language expressions in open-ended answers and other factors, such as geography and time, that change users’ expression styles, the performance of the auto-grader is not guaranteed to match that observed during training. This phenomenon is well-recognized as the out-of-distribution (OOD) issue in many machine learning problems (hendrycks2016baseline). Prior work (hendrycks2016baseline) has shown that capturing prediction probability statistics about correct or in-sample examples is often sufficient for detecting whether an example is in error or abnormal. Inspired by this, we compose a confidence indicator from the log-likelihood that the LLM assigns to its predictions. Intuitively, the log probability reflects the confidence that Grader gives to its graded results. By comparing the indicator on the test samples with the average LLM confidence score on the training samples, we can gauge how serious the OOD phenomenon is. Specifically, when the two are consistent, the optimized guideline is well-applicable to the test data. Otherwise, the guideline is facing serious OOD influences, which suggests that Grader may struggle to produce reliable and accurate predictions for the test data.
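A rough sketch of such a check is given below. The exact form of the indicator and the decision threshold are not recoverable from the text, so both are assumptions here: the indicator is taken to be the mean log-probability of the predicted grades, and the test passes when the test-time mean is no lower than the training mean minus a hypothetical `margin`.

```python
from statistics import mean
from typing import Sequence


def confidence_indicator(pred_logprobs: Sequence[float]) -> float:
    """Average log-probability the grader assigns to its own predictions."""
    return mean(pred_logprobs)


def looks_in_distribution(test_logprobs: Sequence[float],
                          train_logprobs: Sequence[float],
                          margin: float = 0.0) -> bool:
    """True when test confidence is no worse than train confidence minus margin.

    Both the comparison rule and `margin` are assumptions for illustration.
    """
    return (confidence_indicator(test_logprobs)
            >= confidence_indicator(train_logprobs) - margin)


train_lp = [-0.1, -0.2, -0.3]
passes = looks_in_distribution([-0.15, -0.20], train_lp)
fails = looks_in_distribution([-1.0, -2.0], train_lp)
```

A failed check would be the signal to collect a few annotations and rerun the guideline optimization on them.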
If the test samples are deemed to be OOD, a common solution is to first compose an adaptation dataset from the testing scenario and then perform test-time training on the existing model. To be specific, test-time training leverages annotated samples from the adaptation dataset and fine-tunes the optimized guideline with the same training process introduced in Section 4.2. Unfortunately, in the ASAG scenario, annotation is usually expensive. Besides, it is challenging to ask pedagogical experts to provide a large number of annotated samples in a timely manner whenever the system needs to adapt to changes. To solve this issue, we propose an incremental labeling approach, which checks the marginal performance changes brought by gradually increasing the number of annotated samples. By selecting the size with the highest marginal gains in metrics such as accuracy and Kappa, GradeOpt only asks pedagogical experts for the necessary annotations. This not only reduces the annotation workload but also increases the adaptation efficiency of the framework. Finally, when the guideline passes the OOD test, GradeOpt is used to finish ASAG over all samples in the test data.
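The incremental labeling idea can be illustrated with a small helper that, given a validation metric measured at increasing annotation sizes, returns the size whose step brought the largest gain. The function name and the example numbers are hypothetical; how the metric at each size is obtained (rerunning the optimization with that many labels) is outside the sketch.

```python
from typing import Dict


def pick_annotation_size(metric_by_size: Dict[int, float]) -> int:
    """Return the annotation size whose step brought the largest metric gain.

    `metric_by_size` maps number of labeled samples -> validation metric
    (for example accuracy or Kappa) after adapting with that many labels.
    """
    sizes = sorted(metric_by_size)
    best_size, best_gain = sizes[0], float("-inf")
    for prev, cur in zip(sizes, sizes[1:]):
        gain = metric_by_size[cur] - metric_by_size[prev]
        if gain > best_gain:
            best_size, best_gain = cur, gain
    return best_size


# Hypothetical measurements: gains shrink as more labels are added.
curve = {0: 0.60, 10: 0.70, 20: 0.78, 30: 0.79}
size = pick_annotation_size(curve)
```

With unevenly spaced sizes, dividing each gain by the number of added labels would give a per-label marginal gain instead.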
5 Experiment
In this section, we conduct experiments to validate the effectiveness of GradeOpt. Through the experiments, we aim to answer the following research questions. RQ1: Do the refined guidelines based on prompt optimization match or exceed the performance of human-crafted guidelines? RQ2: Are the optimized guidelines applicable to new datasets of the same or similar questions? RQ3: How does each component contribute to the overall effectiveness of the guideline optimization system?
5.1 Datasets
To address the research questions above, we conduct experiments using two representative datasets for SAQ grading. Unlike existing ASAG studies (dzikovska2013semeval; mohler2011learning), which focus solely on student responses, our work extends ASAG to the grading of pedagogical answers from both students and teachers. The first dataset consists of teachers’ responses to questions designed to assess the knowledge and skills essential for teaching mathematics (copur2022mathematics). Since grading pedagogical answers requires a more nuanced interpretation to capture the underlying thought process, evaluating GradeOpt on this dataset allows us to examine its performance on more complex ASAG tasks. Specifically, it includes six questions addressing different aspects of teacher knowledge, with responses labeled on a three-point scale: Bad (0), Fair (1), and Good (2). The second dataset evaluates GradeOpt on student responses, aligning with prior studies. It comprises 252 high school student responses to 11 assessment items within a physical sciences curriculum. These assessments measure Learning Progress (LP)-aligned scientific text-based explanations, reflecting students’ ability to apply knowledge of electrical interactions in high school Physical Science kaldaras2021developing . Responses in this dataset are graded on a binary scale: Fail (0) or Pass (1). All grading labels in both datasets were assigned by at least two human raters. In cases of disagreement, a third rater provided the final judgment. Detailed statistics for both datasets are presented in Table 1. For our experiments, we split both datasets into training, validation, and test sets using a 7:1:2 ratio.
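For concreteness, a 7:1:2 split of one question's responses could look like the sketch below. Whether the original split was shuffled, seeded, or stratified by score is not stated, so the shuffling and the `seed` parameter are assumptions; integer arithmetic is used so the sizes are deterministic.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_712(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle and split items into train/validation/test at roughly 7:1:2."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 7 // 10
    n_val = n // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# 252 responses per item, as in the student dataset.
responses = list(range(252))
train, val, test = split_712(responses)
```

Any remainder from the integer division ends up in the test portion.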
| Question (teacher dataset) | Total | Responses per score category | Question (student dataset) | Total | Responses per score category |
| | 261 | 36 / 104 / 121 | | 252 | 43 / 209 |
| | 265 | 78 / 47 / 140 | | 252 | 123 / 129 |
| | 236 | 132 / 66 / 38 | | 252 | 113 / 139 |
| | 231 | 180 / 44 / 7 | | 252 | 183 / 69 |
| | 232 | 83 / 112 / 37 | | 252 | 242 / 10 |
| | 229 | 74 / 43 / 112 | | 252 | 244 / 8 |
| | 230 | 64 / 114 / 52 | | 252 | 245 / 7 |
| | 231 | 108 / 24 / 99 | | 252 | 227 / 25 |
| - | - | - | | 252 | 243 / 9 |
| - | - | - | | 252 | 210 / 42 |
| - | - | - | | 252 | 241 / 11 |
5.2 Baselines
We compare our model with several representative ASAG baselines. First, we choose two popular non-LLM methods, i.e., SBERT (reimers2019sentencebert) with Logistic Regression and RoBERTa (liu2019roberta) with fine-tuning. Both have demonstrated strong performance in prior studies condor2021automaticSA ; poulton2021explaining . In addition, we adopt GPT-4o with zero-shot prompting, referred to as GPT-4o, as another baseline. Compared with non-LLM methods, LLMs’ exceptional instruction-following and human-like reasoning capabilities make them powerful when facing complicated grading cases (henkel2024can). To mitigate the manual burden of revising the guidelines in the GPT-4o setting, we also implement APO (pryzant2023gradient), a state-of-the-art method for automatic prompt optimization, and compare GradeOpt with it.
5.3 Implementations
To implement the nested iterative optimization, we set the outer batch size and inner batch size . The outer loop iteration number and the inner loop iteration number . We implement the beam search selection mechanism with Upper Confidence Bound (UCB) (auer2003using, ), where the guideline beam size . The evaluation metric for UCB is Cohen’s Kappa as it empirically works better than other metrics. The agents in our framework are all powered by GPT-4o openai2024gpt4 with zero-shot prompting. The temperature for Grader is set to 0.0 to decrease the randomness of the result. The temperatures for both Reflector and Refiner are set to 0.5, since we want to encourage the LLMs to be more open in exploring the error reasons and propose the improving suggestions. For each question, we run the algorithm 3 times and report the average results.
5.4 Evaluation Metrics
In this work, we use Accuracy (Acc) and Cohen’s Kappa ($\kappa$) as the evaluation metrics to compare the performance of different models. To be specific, accuracy measures the percentage of correct predictions across all cases, while Cohen’s Kappa measures the inter-rater alignment between the model’s predictions and expert annotations, accounting for agreement by chance. For the teacher dataset, which involves multi-class classification, we additionally use Quadratic Weighted Kappa, which is particularly suitable for ordinal data as it assigns different weights to disagreements based on their magnitude.
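The three metrics can be written in a few lines of plain Python, which also makes it clear why a model that always predicts the majority class can reach high accuracy yet a Kappa of zero. This is a self-contained sketch using the standard disagreement-weight formulation of Cohen's Kappa; the function names and toy labels are illustrative only.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true: Sequence[int], y_pred: Sequence[int],
                n_classes: int, quadratic: bool = False) -> float:
    """Cohen's Kappa; with quadratic=True, the quadratic weighted variant.

    Labels are assumed to be integers in range(n_classes), n_classes >= 2.
    """
    n = len(y_true)
    observed = Counter(zip(y_true, y_pred))        # joint label counts
    true_hist, pred_hist = Counter(y_true), Counter(y_pred)
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            if quadratic:
                w = (i - j) ** 2 / (n_classes - 1) ** 2
            else:
                w = 0.0 if i == j else 1.0
            num += w * observed[(i, j)] / n                   # observed disagreement
            den += w * true_hist[i] * pred_hist[j] / (n * n)  # chance disagreement
    if den == 0.0:
        return float("nan")  # undefined when both raters use one identical label
    return 1.0 - num / den


# Three-class toy example with a single near-miss (gold 2, predicted 1).
gold = [0, 1, 2, 2, 1, 0]
pred = [0, 1, 2, 1, 1, 0]
kappa = cohen_kappa(gold, pred, n_classes=3)
qwk = cohen_kappa(gold, pred, n_classes=3, quadratic=True)

# Majority-class predictor: decent accuracy, zero agreement beyond chance.
majority_acc = accuracy([0, 0, 0, 1], [0, 0, 0, 0])
majority_kappa = cohen_kappa([0, 0, 0, 1], [0, 0, 0, 0], n_classes=2)
```

The quadratic variant penalizes the single off-by-one error less than the unweighted variant does, which is why it is the larger of the two values in the toy example.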
5.5 Main Results
In this section, we address RQ1 by comparing baseline models with GradeOpt on both datasets. Table 2 presents the performance of the baseline models and GradeOpt. The results reveal several key observations. While all models achieve relatively high accuracy across questions, the Cohen’s Kappa values for baseline models such as RoBERTa and SBERT on some questions are notably low, often close to zero. This indicates poor alignment between automated and manual grading. A deeper analysis reveals that non-LLM-based models exhibit a uniform majority classification phenomenon, leading to skewed grading patterns. This suggests that LLM-based models provide more reliable grading results than their non-LLM counterparts. Comparing GPT-4o with prompt-optimized methods further highlights the importance of optimization. Optimized prompts consistently enhance grading performance while reducing variance across different questions. This finding confirms that directly applying raw human-provided rubrics is suboptimal, and that prompt optimization is necessary to fully leverage LLMs in automatic grading. Lastly, GradeOpt outperforms the state-of-the-art (SOTA) automatic prompt optimization method, APO, across all questions. This result demonstrates the superior effectiveness of GradeOpt in improving grading performance, reinforcing its advantage over existing methods.
| Question | RoBERTa | SBERT | GPT-4o | APO | GradeOpt |
| Accuracy (Acc) | |||||
| | 0.80 | 0.61 | 0.85 | 0.90 | 0.92 |
| | 0.81 | 0.70 | 0.72 | 0.89 | 0.91 |
| | 0.76 | 0.76 | 0.75 | 0.80 | 0.86 |
| | 0.79 | 0.74 | 0.51 | 0.67 | 0.70 |
| | 0.79 | 0.69 | 0.64 | 0.79 | 0.80 |
| | 0.49 | 0.76 | 0.70 | 0.81 | 0.84 |
| | 0.55 | 0.68 | 0.51 | 0.68 | 0.73 |
| | 0.66 | 0.62 | 0.66 | 0.85 | 0.89 |
| Cohen’s Kappa (κ) | |||||
| | 0.65 | 0.32 | 0.76 | 0.85 | 0.88 |
| | 0.66 | 0.42 | 0.56 | 0.80 | 0.85 |
| | 0.00 | 0.00 | 0.38 | 0.51 | 0.68 |
| | 0.00 | 0.00 | 0.09 | 0.35 | 0.36 |
| | 0.58 | 0.44 | 0.30 | 0.60 | 0.63 |
| | 0.00 | 0.17 | 0.55 | 0.69 | 0.70 |
| | 0.00 | 0.41 | 0.33 | 0.50 | 0.52 |
| | 0.37 | 0.29 | 0.48 | 0.75 | 0.80 |
| Quadratic Weighted Kappa | |||||
| | 0.70 | 0.31 | 0.82 | 0.88 | 0.89 |
| | 0.81 | 0.52 | 0.75 | 0.93 | 0.94 |
| | 0.00 | 0.00 | 0.61 | 0.67 | 0.76 |
| | 0.00 | 0.00 | 0.24 | 0.41 | 0.54 |
| | 0.59 | 0.32 | 0.56 | 0.68 | 0.71 |
| | 0.00 | 0.17 | 0.62 | 0.77 | 0.80 |
| | 0.00 | 0.48 | 0.58 | 0.64 | 0.70 |
| | 0.44 | 0.33 | 0.68 | 0.84 | 0.87 |
| Question | RoBERTa | SBERT | GPT-4o | APO | GradeOpt |
| Accuracy (Acc) | |||||
| | 0.84 | 0.84 | 0.84 | 0.96 | 0.98 |
| | 0.57 | 0.55 | 0.76 | 0.86 | 0.86 |
| | 0.69 | 0.67 | 0.88 | 0.92 | 0.92 |
| | 0.65 | 0.65 | 0.78 | 0.80 | 0.82 |
| | 0.94 | 0.94 | 0.94 | 0.94 | 0.98 |
| | 0.98 | 0.98 | 0.88 | 0.92 | 0.94 |
| | 0.98 | 0.98 | 1.00 | 1.00 | 1.00 |
| | 0.92 | 0.92 | 0.47 | 0.76 | 0.78 |
| | 0.98 | 0.98 | 0.96 | 0.96 | 0.98 |
| | 0.88 | 0.88 | 0.73 | 0.86 | 0.88 |
| | 0.94 | 0.94 | 0.90 | 0.96 | 0.96 |
| Cohen’s Kappa (κ) | |||||
| | 0.00 | 0.00 | 0.55 | 0.83 | 0.92 |
| | 0.05 | 0.00 | 0.52 | 0.72 | 0.72 |
| | 0.08 | 0.00 | 0.74 | 0.81 | 0.81 |
| | 0.00 | 0.00 | 0.59 | 0.62 | 0.65 |
| | 0.00 | 0.00 | 0.64 | 0.64 | 0.64 |
| | 0.00 | 0.00 | 0.22 | 0.31 | 0.38 |
| | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
| | 0.00 | 0.00 | 0.10 | 0.15 | 0.17 |
| | 0.00 | 0.00 | 0.48 | 0.48 | 0.66 |
| | 0.00 | 0.00 | 0.34 | 0.56 | 0.60 |
| | 0.00 | 0.00 | 0.50 | 0.73 | 0.73 |
Table 3 presents the performance of baseline models and GradeOpt on the second dataset. Consistent with the findings in Table 2, LLM-based ASAG methods align far better with expert grading than non-LLM-based models, whose Cohen’s Kappa is at or near zero on every question, reaffirming the advantages of leveraging LLMs for ASAG tasks. Moreover, GradeOpt achieves the highest or tied-highest Cohen’s Kappa on every question in Table 3 and matches or exceeds GPT-4o and APO in accuracy; the few questions on which RoBERTa and SBERT report higher accuracy are ones where their Kappa is zero. This suggests that GradeOpt is a generalizable framework suitable for various ASAG tasks. Comparing model performance across the two datasets shows that grading responses related to teacher knowledge yields lower performance metrics. This is expected, as evaluating pedagogical knowledge is inherently more complex, requiring deeper expertise and more intricate logical reasoning. Beyond performance improvements, LLM-based ASAG methods enhance grading transparency and comprehensibility. Educators can easily interpret the LLM’s scoring rationale, which can be further broken down at the expectation level. Each key concept or criterion specified in the rubric or learning objectives is individually assessed, allowing educators to evaluate how well student responses align with specific learning goals. This level of explainability fosters greater trust in automated grading and facilitates more informed instructional decisions.
5.6 Adaptation Results
| Question | Metric | Raw guideline | Optimized guideline | After test-time training | Change |
| | Acc | 0.67 | 0.7 | 0.78 | +0.08 |
| | Kappa | 0.49 | 0.52 | 0.64 | +0.12 |
| | CI | - | -0.22 | -0.17 | +0.05 |
| | OOD | - | ✓ | - | |
| | Acc | 0.75 | 0.82 | - | - |
| | Kappa | 0.59 | 0.7 | - | - |
| | CI | - | -0.16 | - | - |
| | OOD | - | - | - | |
To address RQ2, we collect an external dataset containing responses to two questions from new teachers across the nation. Specifically, we gather 1,352 responses for one question and 1,364 responses for the other. From these, we randomly select 100 responses per question to serve as the test-time training set, while the remaining responses are used for evaluation. We first apply the optimized guidelines learned in Section 5.5 and measure their grading performance on the national data. If an optimized guideline fails the OOD test on its confidence indicator, we perform test-time training with the training split of the national dataset and then test the adapted guideline on the same evaluation split. The results are reported in Table 4.
From the table, we find that the first question is marked with the OOD flag because of its confidence indicator (-0.22). Comparing its performance in Table 2 and Table 4 confirms that it suffers a large performance drop. Meanwhile, the second question has a confidence indicator of -0.16 and is not flagged, and its performance gap between the two datasets is relatively smaller. These two observations indicate that the proposed confidence indicator is a valid signal for OOD detection. In addition, the optimized guideline consistently outperforms the raw guideline on the national data, which indicates that even when the automatically optimized guideline suffers from the OOD issue, it is still better than the raw guideline. Finally, the performance change before and after test-time training shows that GradeOpt can adapt to new data with only a limited number of annotated examples. After test-time training, performance returns to an acceptable grading range (Kappa of 0.64 in Table 4), which indicates that test-time training is an effective way to quickly transfer a well-performing guideline to a different dataset.
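The adaptation procedure described above can be summarized as a short control-flow sketch. Everything here is illustrative: `confidence_indicator` and `test_time_train` are hypothetical callables standing in for the components defined earlier in the paper, the rule that a lower confidence value signals a shift is inferred from Table 4, and the threshold is left as a parameter because its value is not restated in this section.

```python
import random


def split_for_adaptation(responses, n_adapt=100, seed=0):
    """Randomly hold out `n_adapt` responses for test-time training."""
    shuffled = list(responses)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_adapt], shuffled[n_adapt:]


def adapt_if_ood(guideline, adapt_split, confidence_indicator, test_time_train, threshold):
    """Return (guideline to use, confidence value, OOD flag).

    Assumption: a confidence value below `threshold` signals a distribution
    shift. `confidence_indicator` and `test_time_train` are hypothetical
    callables supplied by the caller.
    """
    confidence = confidence_indicator(guideline, adapt_split)
    is_ood = confidence < threshold
    if is_ood:
        # Only flagged guidelines are re-optimized on the small labeled split.
        guideline = test_time_train(guideline, adapt_split)
    return guideline, confidence, is_ood
```

With 1,352 collected responses, `split_for_adaptation` yields 100 responses for adaptation and 1,252 for evaluation; a guideline that is not flagged is returned unchanged and graded on the evaluation split directly.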
5.7 Ablation Studies
[Figure 6. Panels: (a) Accuracy Comparison; (b) Cohen’s Kappa Comparison]
To answer RQ3, we conduct ablation studies on the nested iteration introduced in Section 4.2. We experiment with a single question whose relatively simple rubric design yields the shortest guideline prompt, leaving room for GradeOpt to add its reflective experience as it learns iteratively. Experimenting with this question can therefore better showcase GradeOpt’s optimization power. With the experimental results shown in the following paragraphs, we demonstrate the effectiveness of each component.
[Figure 7. Panels: (a) Outer Batch Size; (b) Inner Batch Size]
First, we demonstrate the advantage of our misconfidence-based batch sampling strategy by comparing it with random sampling. From Figure 6, we observe that misconfidence-based batch sampling yields more consistent and more accurate results. While random selection needs between 2 and 4 rounds to produce its best guidelines, misconfidence-based selection consistently does so in 3 rounds. Together with the higher predictive accuracy and alignment that misconfidence-based selection brings, this predictable number of training rounds makes the system more reliable in practical educational scenarios. Next, we vary the sizes of the outer batch and the inner batch to explore the influence of these two hyper-parameters on the performance of GradeOpt. From the results in Figure 7, we observe an increasing trend in accuracy and kappa as the outer batch size increases, which suggests that, within the tested range, increasing the number of examples benefits the final performance. Similarly, Figure 7 shows a consistent increase in performance as the number of errors in the inner batch increases. Based on these two findings, we conclude that larger batch sizes are likely to bring performance gains to GradeOpt. Finally, we study how the number of iterations affects accuracy. In our experiment, we explore iteration numbers ranging from 1 to 5. While increasing the number of iterations on a mini-batch, we use the early-stopping signal introduced in Section 4.2.2 to monitor overfitting. As shown in Figure 8, increasing the number of iterations with the help of the early-stopping signal leads to higher test accuracy as well as more stable performance. Although five iterations produce higher accuracy, we use three iterations as our default setting because of limited computational resources.
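To make the roles of the three hyper-parameters concrete, the following is a minimal sketch of a nested optimization loop. It is not the authors’ implementation: the `grade`, `reflect`, `refine`, and `validate` callables are hypothetical, uniform random sampling is swapped in for the misconfidence-based strategy (whose definition lies outside this section), and a held-out score that fails to improve is used as an assumed stand-in for the early-stopping signal of Section 4.2.2.

```python
import random


def optimize_guideline(guideline, train_set, *, grade, reflect, refine, validate,
                       outer_batch_size, inner_batch_size, max_iters, rounds, seed=0):
    """Nested guideline optimization with a validation-based early stop."""
    rng = random.Random(seed)
    best, best_score = guideline, validate(guideline)
    for _ in range(rounds):
        # Outer batch: labeled examples graded with the current best guideline.
        outer_batch = rng.sample(train_set, min(outer_batch_size, len(train_set)))
        for _ in range(max_iters):
            errors = [(x, y) for x, y in outer_batch if grade(best, x) != y]
            if not errors:
                break  # nothing left to learn from this batch
            # Inner batch: the misgraded examples shown to the reflector.
            feedback = reflect(best, errors[:inner_batch_size])
            candidate = refine(best, feedback)
            candidate_score = validate(candidate)
            if candidate_score <= best_score:
                break  # early stop: the edit did not help on held-out data
            best, best_score = candidate, candidate_score
    return best, best_score
```

In this sketch a larger `outer_batch_size` exposes more grading errors per round, a larger `inner_batch_size` gives the reflector more evidence per edit, and `max_iters` bounds how many refinements are attempted on one mini-batch before resampling.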
6 Case Study
In this section, we conduct a case study that walks through the refinement history and shows how GradeOpt optimizes the guidelines to match the task specifications and the training dataset. We showcase example responses of Reflector and Refiner separately.
6.1 Reflector
Reflector reflects and gives feedback when provided with a small selection of error samples. For example, as shown in Table 5, given a selection of errors in the prompt, Reflector follows a two-step reflect-and-suggest approach. It first explains each error, describing why the mistake was made and why the response should be assessed differently. Based on this reflection, Reflector then suggests a rule modification grounded in the evidence from the error sample.
| Example Error | Teacher Response: “They are equal proportions because they are situations during the same time and the same angle of the sun.” - Expert Label: 2; AI Prediction: 1 |
| Reflector’s Error Feedback | Step 1: Explanation of Mistake - The classifier predicted a score of 1 because the response mentions equal proportions and references conditions affecting proportionality (same time and same angle of the sun). However, it lacks specific evidence from the student’s work and a thorough interpretation of the student’s mathematical thinking. The correct label is 2 because the response does provide a clear explanation of the proportional relationship and the conditions under which it holds, which is a nuanced understanding of proportionality. |
| | Step 2: Rule Modification - Rule: If the response explains equal proportions and references conditions affecting proportionality, classify as 2, even if it lacks specific evidence from the student’s work. |
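The two-step reflect-and-suggest format shown in Table 5 could be requested with a prompt assembled along the following lines. This is only a sketch: the `ErrorSample` class, the function name, and the prompt wording are hypothetical and do not reproduce the prompts used by GradeOpt.

```python
from dataclasses import dataclass


@dataclass
class ErrorSample:
    response: str
    expert_label: int
    predicted_label: int


def build_reflector_prompt(guideline, errors):
    """Assemble a reflect-and-suggest prompt from misgraded responses."""
    lines = ["Current grading guideline:", guideline, "", "Misgraded responses:"]
    for number, err in enumerate(errors, start=1):
        lines.append(
            f'{number}. Teacher Response: "{err.response}" '
            f"- Expert Label: {err.expert_label}; AI Prediction: {err.predicted_label}"
        )
    lines += [
        "",
        "Step 1: Explanation of Mistake - for each response, explain why the "
        "prediction was made and why the expert label is correct.",
        "Step 2: Rule Modification - propose a rule that would prevent the mistake.",
    ]
    return "\n".join(lines)
```

Keeping the error samples and the two requested steps in one prompt mirrors the structure of the feedback in Table 5, where the explanation precedes the suggested rule.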
6.2 Refiner
As introduced in Section 4.2.1, Refiner is tasked with modifying the guideline given the feedback from Reflector. Refiner can edit the guideline by adding, changing, or removing rules, among other operations. We first showcase an example of Refiner adding a rule in Table 6, in response to Reflector’s feedback in Section 6.1. We observe that Refiner adopts the rule that Reflector suggested in “Step 2: Rule Modification”. The new rule specifies two components for grading a certain type of answer: first, the answer pattern, by saying “if the teacher’s response explains equal proportions and references conditions affecting proportionality, …, even if it lacks specific evidence from the student’s work”; second, the score assignment, by saying “classify as 2”. It additionally adds details from the example by noting “(e.g., same time, same angle of the sun)” and includes the whole answer to make clear which answers this rule applies to. In short, when adding rules, Refiner covers new types of answers by summarizing their patterns and explaining the relevant words or phrases. Next, we show an example of Refiner editing the guidelines. This happens when Refiner determines that a defective rule can be adjusted to give a better explanation or more complete details. As the example in Table 6 shows, part of the given rule is misleading, so Refiner revises this part. The revised rule first redefines the scenarios in which the rule applies, then provides a detailed explanation by citing the answer and illustrating how the answer falls into the pattern defined in the new rule and thus receives the correct score.
| Action | Refiner’s generated rules |
| Add | “1. Equal Proportions and Conditions: |
| | - Rule: If the teacher’s response explains equal proportions and references conditions affecting proportionality (e.g., same time, same angle of the sun), classify as 2, even if it lacks specific evidence from the student’s work. |
| | - Example: They are equal proportions because they are situations during the same time and the same angle of the sun.” |
| Edit | Before Edit: “- Award 1 point if … explicit evidence. For instance, if the response mentions that the student understands the unit rate, which is related to the concept of proportionality.” (misleading statement) |
| | After Edit: “- Award 1 point if … explicit evidence. For example, if the response states that the student might have a limited understanding of proportionality, it should be awarded 1 point. For instance, in the response “their answer makes sense only if there is a proportional relationship between the height of the object and the length of the shadow,” teacher mentioned “limited/partial understanding of the proportional relationship” but lacked depth.” (ambiguity resolved) |
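The add and edit actions in Table 6, together with the removal action mentioned in the text, can be pictured as operations on a list of rules. The sketch below assumes, purely for illustration, that a guideline is stored as a list of rule strings; the function name and signature are hypothetical.

```python
def apply_refinement(rules, action, rule=None, index=None):
    """Return a new rule list after one add, edit, or remove action."""
    updated = list(rules)  # copy so earlier guideline versions stay intact
    if action == "add":
        updated.append(rule)
    elif action == "edit":
        updated[index] = rule
    elif action == "remove":
        del updated[index]
    else:
        raise ValueError(f"unknown action: {action!r}")
    return updated
```

Returning a copy rather than mutating the input keeps every intermediate guideline available, which is convenient when a later evaluation shows that an earlier version graded better.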
7 Conclusion
This paper explores fully automating guideline optimization, leveraging LLM techniques including reflection and prompt engineering to solve ASAG tasks. We decompose the ASAG procedure into two steps: guideline optimization and grading. Specifically, we focus on automatic guideline optimization to avoid the manual effort of composing a task-optimal guideline. To further avoid labeling a large amount of data, we propose a two-phase “train and test-adapt” procedure that tunes a guideline on a small training set and checks that the optimized guideline is reliable for large-scale grading. The proposed GradeOpt is a multi-agent guideline optimization system that iteratively leads the LLM to reflect on mistakes, learn answer patterns, and make improving modifications. Empirical experiments on two pedagogical datasets have demonstrated the effectiveness of GradeOpt.
8 Acknowledgments
This work was supported in part by the National Science Foundation under Grant No. 1813760, No. 2405483, No. 2200757 and No. 2234015. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We thank Dr. Clare Carlson for help in revising one of the scoring rubrics.
References
- [1] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 2003.
- [2] S. Burrows, I. Gurevych, and B. Stein. The eras and trends of automatic short answer grading. IJAIED, 2015.
- [3] L. Camus and A. Filighera. Investigating transformers for automatic short answer grading. In AIED, 2020.
- [4] Y. Chu et al. Enhancing llm-based short answer grading with retrieval-augmented generation. arXiv preprint arXiv:2504.05276, 2025.
- [5] C. Cohn, N. Hutchins, T. Le, and G. Biswas. A chain-of-thought prompting approach with llms for evaluating students’ formative assessment responses in science. In AAAI, 2024.
- [6] A. Condor, M. Litster, and Z. A. Pardos. Automatic short answer grading with sbert on out-of-sample questions. In EDM, 2021.
- [7] A. Condor and Z. Pardos. Explainable automatic grading with neural additive models. In AIED, 2024.
- [8] Y. Copur-Gencturk and T. Tolar. Mathematics teaching expertise: A study of the dimensionality of content knowledge, pedagogical content knowledge, and content-specific noticing skills. Teaching and Teacher Education, 114:103696, 2022.
- [9] M. O. Dzikovska et al. Semeval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge. In SemEval, 2013.
- [10] J. A. Erickson et al. The automated grading of student open responses in mathematics. In LAK, 2020.
- [11] M. Freitag and Y. Al-Onaizan. Beam search strategies for neural machine translation. arXiv preprint arXiv:1702.01806, 2017.
- [12] D. Gobbo et al. Gradeaid: a framework for automatic short answers grading in educational contexts—design, implementation and evaluation. KAIS, 2023.
- [13] S. Hassan, A. A. Fahmy, and M. El-Ramly. Automatic short answer scoring based on paragraph embeddings. IJACSA, 2018.
- [14] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
- [15] O. Henkel et al. Can large language models make the grade? an empirical study evaluating llms ability to mark short answer questions in k-12 education. In Learning@ Scale, 2024.
- [16] J. Huang et al. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798, 2023.
- [17] L. Huang et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.
- [18] Z. Jiang et al. How can we know what language models know? arXiv preprint arXiv:1911.12543, 2020.
- [19] G. Juneja et al. Task facet learning: A structured approach to prompt optimization. arXiv preprint arXiv:2406.10504, 2024.
- [20] L. Kaldaras, H. Akaeze, and J. Krajcik. Developing and validating next generation science standards-aligned learning progression to track three-dimensional learning of electrical interactions in high school physical science. Journal of Research in Science Teaching, 2021.
- [21] Z. Kenton et al. Alignment of language agents. arXiv preprint arXiv:2103.14659, 2021.
- [22] E. Latif and X. Zhai. Fine-tuning chatgpt for automatic scoring. arXiv preprint arXiv:2310.10072, 2023.
- [23] C. Leacock and M. Chodorow. C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 2003.
- [24] G.-G. Lee et al. Applying large language models and chain-of-thought for automatic scoring. arXiv preprint arXiv:2312.03748, 2024.
- [25] Y. Liu et al. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- [26] J. Lun et al. Multiple data augmentation strategies for improving performance on automatic short answer scoring. In AAAI, 2020.
- [27] R. Ma et al. Are large language models good prompt optimizers? arXiv preprint arXiv:2402.02101, 2024.
- [28] A. Madaan et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
- [29] M. Mohler, R. Bunescu, and R. Mihalcea. Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In ACL, 2011.
- [30] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2024.
- [31] L. Pan et al. Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies. arXiv preprint arXiv:2308.03188, 2023.
- [32] A. Poulton and S. Eliens. Explaining transformer-based models for automatic short answer grading. In ICDTE, 2021.
- [33] R. Pryzant et al. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023.
- [34] N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
- [35] S. Roy, Y. Narahari, and O. D. Deshmukh. A perspective on computer assisted assessment techniques for short free-text answers. In CAA, 2015.
- [36] S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2017.
- [37] N. Shinn et al. Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366, 2023.
- [38] N. Süzen et al. Automatic short answer grading and feedback using text mining methods. Procedia Computer Science, 2020.
- [39] G. Tyen et al. LLMs cannot find reasoning errors, but can correct them given the error location. In ACL, 2024.
- [40] J. Wei et al. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2023.
- [41] W. Xie et al. Grade like a human: Rethinking automated assessment with large language models. arXiv preprint arXiv:2405.19694, 2024.
- [42] S. Xu and C. Zhang. Misconfidence-based demonstration selection for llm in-context learning. arXiv preprint arXiv:2401.06301, 2024.
- [43] C. Yang et al. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2024.
- [44] K. Yang et al. Content knowledge identification with multi-agent large language models (llms). In AIED, 2024.
- [45] Y. Zhou et al. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910, 2023.