A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization

Yucheng Chu (Michigan State University)
Hang Li (Michigan State University)
Kaiqi Yang (Michigan State University)
Harry Shomer (Michigan State University)
Yasemin Copur-Gencturk (University of Southern California)
Leonora Kaldaras (Texas Tech University)
Kevin Haudek (Michigan State University)
Joseph Krajcik (Michigan State University)
Namsoo Shin (Michigan State University)
Hui Liu (Michigan State University)
Jiliang Tang (Michigan State University)

Abstract

Open-text responses provide researchers and educators with rich, nuanced insights that multiple-choice questions cannot capture. When reliably assessed, such responses have the potential to enhance teaching and learning. However, scaling and consistently capturing these nuances remain significant challenges, limiting the widespread use of open-text questions in educational research and assessments. In this paper, we introduce and evaluate GradeOpt, a unified multi-agent automatic short-answer grading (ASAG) framework that leverages large language models (LLMs) as graders for short-answer responses. More importantly, GradeOpt incorporates two additional LLM-based agents—the reflector and the refiner—into the multi-agent system. This enables GradeOpt to automatically optimize the original grading guidelines by performing self-reflection on its errors. To assess GradeOpt’s effectiveness, we conducted experiments on two representative ASAG datasets, which include items designed to capture key aspects of teachers’ pedagogical knowledge and students’ learning progress. Our results demonstrate that GradeOpt consistently outperforms representative baselines in both grading accuracy and alignment with human evaluators across different knowledge domains. Finally, comprehensive ablation studies validate the contributions of GradeOpt’s individual components, confirming their impact on overall performance.

keywords:
Large Language Models, Automatic Grading, Mathematics Education, Teacher Knowledge, Assessments, Large-Scale Testing, Mathematical Knowledge for Teaching, Physical Science, Learning Assessments, Content Knowledge, Pedagogical Content Knowledge
titlenote: Both authors contribute equally.

1 Introduction

Accurate evaluation of assignments and examinations in a timely manner is vital to learning due to the significance of performance measurement in the learning process suzen2020textmining . Traditionally, multiple-choice questions (MCQs), which ask students to select the correct answer from distracting options, dominated learning assessment studies. While this approach makes data available promptly (mohler2011learning; burrows2015eras), it falls short in providing insights into learners’ thinking. Open-ended short-answer questions (SAQs) can provide deeper insights into students’ answering rationale and knowledge concepts. This is because they are known to elicit the thinking path that describes how a student arrives at their conclusion (leacock2003c-rater). Unfortunately, grading open-ended textual answers is tedious, as substantial resources and time are needed to train raters to code responses accurately and consistently (roy2015perspective). More importantly, inconsistent or unfair assessments caused by divergent interpretations, biases, or mistakes create another challenge for SAQ grading in practice (suzen2020textmining). To mitigate these issues and provide timely and consistent evaluation, automatic short-answer grading (ASAG) burrows2015eras systems have become appealing. ASAG, which can be traced back to the 1960s, has bloomed in recent years due to advancements in natural language processing (NLP) leacock2003c-rater ; xie2024grade . Early ASAG systems often used pattern-matching techniques and hand-crafted features leacock2003c-rater . Thus, those systems required intensive human labor to build and were limited to a few specific grading tasks. The rise of deep learning (DL) has lessened the burdensome feature design needed for early ASAG systems. DL provides an end-to-end solution that automatically learns to output grading scores from a large number of graded answer samples hassan2018automatic .
Due to the strong data-fitting capability of DL models, DL-based ASAG systems can be extended to different tasks if a large number of annotated samples are available. However, when the annotated sample size is limited, DL-based ASAG systems often face serious over-fitting issues. Beyond that, as DL models are black boxes whose results lack interpretability, the application of DL-based ASAG systems is still limited condor2024explainable .

The emergence of pre-trained language models (PLMs) and the more advanced Large Language Models (LLMs) has recently revolutionized the design of ASAG systems due to their human-like language ability and human-interpretable intermediate textual results. Therefore, many recent studies have attempted to build ASAG systems with LLMs. Promising results have been demonstrated using fine-tuning latif2023finetuning and prompting techniques such as Chain-of-Thought (CoT) cohn2024science and in-context learning lee2024applyingllm . Yet these recent techniques are still constrained by LLMs’ inherent limitations, such as sensitivity to prompts and context window restrictions, which make the complex ASAG task challenging for the LLM grader. In reality, accurate, standardized, and unambiguous guidelines are critical to help human graders formulate a precise interpretation of scoring criteria. For LLM-based ASAG systems, those guidelines also serve as the principal instructions. They teach LLMs to perform the grading task following a similar standard as human graders. However, using guidelines composed by pedagogical experts directly for LLMs is sub-optimal since general-purpose LLMs lack domain-specific knowledge and can misinterpret the guidelines (leacock2003c-rater). Meanwhile, LLMs are often sensitive to various facets of prompts (jiang2020knowlmsknow), where minor changes could lead to great differences in LLMs’ performance. Optimizing the guidelines manually for LLMs can further take a lot of trial and error. Thus, recent works propose to conduct guideline modification with LLMs to offload human burden cohn2024science . While the modified guidelines yield performance improvement, the prompt search space in these methods is relatively limited. Because of this, the modified guidelines are not necessarily optimal. Additionally, substantial human effort, such as timely feedback or a large amount of labeling, is required.
Therefore, methods to optimize grading guidelines automatically and effectively are still desired.

In this paper, we propose a unified multi-agent ASAG framework that automatically optimizes grading guidelines. Specifically, it employs an iterative reflection mechanism to generate task prompts (guidelines) that effectively capture learners’ thinking and knowledge from a small dataset of short answers. To achieve this, we innovatively introduce prompt optimization in ASAG, framing grading guideline refinement as an optimization problem aimed at maximizing accuracy. Inspired by APO pryzant2023gradient , we develop novel techniques such as misconfidence-based selection, iterative optimization, and log-probability-based robustness to enhance the framework’s stability in producing accurate and trustworthy score predictions on unseen datasets. To minimize human labeling effort, our mechanism intelligently selects short-answer samples that contribute to optimal guideline refinement. Additionally, the framework supports assessments across varying levels of complexity, offering interpretable evaluations for each learning objective while improving overall scoring accuracy. To validate our approach, we conducted experiments on two real-world grading datasets. The first dataset comprises responses from high school students within a physical sciences curriculum, while the second consists of a national sample of teachers answering questions designed to assess content-specific knowledge required for teaching (copur2022mathematics, ). Experimental results demonstrate that GradeOpt outperforms representative baselines in both accuracy and alignment. Further analysis highlights consistent improvements in test accuracy across iterations, showcasing the framework’s ability to continuously enhance grading guidelines. To the best of our knowledge, we are the first to apply prompt optimization in ASAG by refining grading guidelines akin to generating an optimal task prompt. 
We believe that our multi-agent reflective mechanism can unlock the full potential of LLMs in learning analytics by providing detailed and accurate assessments while significantly reducing educators’ grading workload.


Figure 1: An Illustration of the proposed framework.

2 Related Work

2.1 Automatic Short Answer Grading

Automatic Short Answer Grading (ASAG) is often treated as a text classification or regression problem in NLP studies. Here we mainly focus on classification due to its relevance to our setting. Traditional ASAG models mainly rely on text similarity and employ classic ML classifiers. They use lexical features such as bag-of-words (BOW) mohler2011learning and TF-IDF del2023gradeaid , or syntactic features indicating the structure of sentences leacock2003c-rater . However, these methods require significant manual design, which makes them hard to apply to new datasets. To reduce the burden of feature engineering, Deep Neural Networks (DNNs) such as Long Short-Term Memory (LSTM) networks are utilized hassan2018automatic , which produce superior results but suffer from limited generalizability. Pre-trained BERT-based models provide enhanced versatility through transfer learning, including on ASAG datasets camus2020investigating . To further enhance grading accuracy, researchers have attempted to combine BERT with statistics-based methods erickson2020automated and data augmentation lun2020multiple . LLMs are increasingly utilized in ASAG and similar assessment tasks yang2024content ; cohn2024science ; chu2025enhancing . However, their prompts are almost always manually crafted and thus are unable to properly adapt to new datasets. To solve this issue, several works have shifted attention to assisting educators with guideline creation xie2024grade ; cohn2024science .

2.2 LLM Prompt Optimization and Reflection

Prompts are critical to the success of LLMs zhou2023ape . To tailor LLMs to challenging tasks, manually crafted prompts are adopted to enhance the performance wei2023cot . To automate the generation and optimization of prompts, prompt optimization emerges as a promising method for input prompt refinement.

Using these techniques, LLMs have demonstrated superior performance in many downstream tasks, particularly in instruction following and reasoning pryzant2023gradient ; zhou2023ape ; yang2024large . However, such automatic methods are risky when directly applied to ASAG tasks considering the limitations of LLMs such as hallucination huang2023surveyhallucinationllm and misalignment kenton2021alignment . To enhance both accuracy and trustworthiness, we adopt the idea of the state-of-the-art prompt optimization method APO pryzant2023gradient and implement novel techniques for reliability. Similar to how humans gather knowledge from failures, experience-and-reflect pan2023automatically is an important technique for improving LLMs’ alignment with task specifications. By reflection, LLMs learn through failure, which enriches their knowledge base and provides valuable reference in similar scenarios. Self-reflection has demonstrated promising results in improving LLM reasoning shinn2023reflexion ; madaan2023selfrefine . However, LLMs’ reflection ability is relatively limited when it comes to self-correction without human feedback or true labels huang2024largelanguagemodelsselfcorrect . A recent work tyen2024llmscannot divides the task of self-correction into two steps: mistake finding and output correction. They empirically show that while LLMs struggle to find errors, their correction ability is robust when given ground-truth labels. This supports our proposed framework, which similarly uses true labels to guide LLM reflection.

3 Problem Statement

We define ASAG as a text classification task, which grades the short answer text $x_i$ by classifying it into the discrete score categories $\{\hat{y}_i=\mathcal{F}(x_i)=s_j \mid j=1,\dots,C\}$, where $\mathcal{F}$ is an ASAG system, $\hat{y}_i$ is the score prediction, $s_j$ is the score category, and $C$ is the size of the score category set. When $\mathcal{F}$ is an LLM, the grading guideline text $G$ will be concatenated at the front of $x_i$ as an instructional prompt, and the grading process can be expressed as $\hat{y}_i=\mathcal{F}(G,x_i)=s_j$.
In this work, we focus on leveraging the reflection and refining capabilities of LLMs to automatically generate an optimized grading guideline $G^{*}$ based on a small amount of graded short answer text $\mathcal{D}=\{(x_i,y_i) \mid i=1,\dots,N\}$, where $N$ is the number of graded samples. The goal of our framework can be expressed as: $G^{*}=\underset{G}{\text{argmax}}\ \frac{\sum_{i=1}^{N}\mathbbm{1}_{y_i=\hat{y}_i}}{N},\ G\in\mathcal{G}$, where $\mathcal{G}$ is the space of potential grading guidelines and $\mathbbm{1}_{\{\cdot\}}$ is an indicator function that is 1 if $y_i=\hat{y}_i$ and 0 otherwise.
Once the optimization process is finished, our framework will concatenate the optimized guidelines $G^{*}$ at the front of unlabeled short answer text and generate the grading results, $\hat{y}_i=\mathcal{F}(G^{*},x_i)$.
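
The objective above can be illustrated with a minimal Python sketch. It assumes a finite list of candidate guidelines standing in for the guideline space and a hypothetical keyword-based `toy_grader` in place of an LLM call; the names `accuracy`, `best_guideline`, and `toy_grader` are illustrative and not part of the framework.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical signature: a grader maps (guideline, answer) to a score label.
Grader = Callable[[str, str], int]
Sample = Tuple[str, int]  # (answer text x_i, human score y_i)


def accuracy(grader: Grader, guideline: str, data: Sequence[Sample]) -> float:
    """Fraction of samples whose predicted score equals the human score."""
    if not data:
        raise ValueError("accuracy needs at least one graded sample")
    hits = sum(1 for answer, score in data if grader(guideline, answer) == score)
    return hits / len(data)


def best_guideline(grader: Grader, candidates: Sequence[str],
                   data: Sequence[Sample]) -> str:
    """Arg-max of accuracy over a finite candidate list (assumed search space)."""
    return max(candidates, key=lambda g: accuracy(grader, g, data))


def toy_grader(guideline: str, answer: str) -> int:
    """Keyword rule used only for illustration; a real grader would call an LLM."""
    accepted = ["proportional"] + (["ratio"] if guideline == "lenient" else [])
    return int(any(word in answer for word in accepted))


graded = [("the ratio stays the same", 1), ("it is proportional", 1), ("no idea", 0)]
print(best_guideline(toy_grader, ["strict", "lenient"], graded))
```

In practice the guideline space is not enumerable, which is why the search is carried out iteratively rather than by exhaustive comparison.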

4 Method

In this section, we introduce our unified multi-agent ASAG framework GradeOpt. It can automatically optimize the grading guidelines and achieve better grading alignment with human experts. We first give an overview of GradeOpt, then detail the LLM-based agent designs and the implementation.

4.1 An Overview

As demonstrated in Figure 1, GradeOpt consists of two stages: training and test-time adaptation. The training stage is supported by three LLM-based agents: Grader, Reflector, and Refiner. They synergistically enhance the grading guidelines by optimizing the score classification accuracy using the graded answers to the SAQs $\mathcal{D}_{train}$ (i.e., the training data). In the test-time adaptation stage, the system first performs an out-of-distribution (OOD) test over a small number of unlabeled answers sampled from the test data. To be specific, by checking the log-likelihood score of the predicted grading results, GradeOpt decides whether the optimized guidelines $G_{opt}$ can be applied to the test data directly. If the test fails, the current guideline is not optimal for the test data, and our framework will improve $G_{opt}$ via test-time training. If the test succeeds, GradeOpt will perform auto-grading over the whole test data.
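
The text above does not specify how the log-likelihood score is turned into a pass/fail decision, so the following Python sketch is only one plausible reading. It assumes that each sampled answer yields the log-probability of its predicted score token and that the test passes when the mean log-probability reaches a threshold; the function name, the threshold of log(0.8), and the averaging rule are all assumptions.

```python
import math
from statistics import mean
from typing import Sequence


def passes_confidence_check(score_logprobs: Sequence[float],
                            threshold: float = math.log(0.8)) -> bool:
    """Return True when the mean log-probability of the predicted scores
    is at least `threshold` (an assumed decision rule, not the paper's)."""
    if not score_logprobs:
        raise ValueError("need at least one sampled answer")
    return mean(score_logprobs) >= threshold


# Confident predictions on the sampled test answers: apply the guideline directly.
confident = [math.log(0.95), math.log(0.90)]
# Low-confidence predictions: fall back to test-time training.
uncertain = [math.log(0.50), math.log(0.60)]
print(passes_confidence_check(confident), passes_confidence_check(uncertain))
```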

Task Description: You are GradeGPT. You assess teachers’ knowledge of students’ mathematical thinking by grading their responses to a pedagogical content knowledge question.
Question Stem: Based on the student’s work, what is student likely to understand about the relationship between the length of the shadow and the height of the object?
Key Concept: Teachers should infer that the student possibly understands that there is a proportional relationship between the height of an object and the length of its shadow. However, because the concept of halving/doubling is natural to students, it is unclear if the student understands the relationship between object height and shadow length is proportional or if they understand equivalent ratios.
Scoring Rubrics: - Award 0 points if it does not address key concept … - Award 1 point if the response includes an accurate mention or implicit understanding of the key concept … - Award 2 points if the response offers a clear and explicit analysis of the proportional relationship between the objects …
Adaptation Rules: 1. Mention of Key Concept: - Rule: Award 1 point if the teacher response includes any accurate mention or implicit understanding of the Key Concept (proportionality/equivalent ratios), even if it lacks detailed analysis or evidence from the student’s work. - Example 1: If the response mentions that the student understands the relevance of using equivalent ratios, this should earn 1 point even if the analysis is not detailed.
Figure 2: An example of the optimized guidelines, $G^{*}$.

4.2 Training Stage

The training stage is to optimize the guideline for the Grader agent to achieve the optimal grading performance over the training dataset $\mathcal{D}_{train}$. GradeOpt leverages a multi-agent framework powered by three agents which collaboratively predict scores for $\mathcal{D}_{train}$, identify errors, and suggest rule modifications to mitigate errors.

Before diving into the details of this stage, we first give a brief introduction to the three key components of a common grading guideline: Question Stem ($G_{qs}$), Key Concept ($G_{kc}$) and Scoring Rubric ($G_{sr}$). Specifically, $G_{qs}$ contains the complete question contents, $G_{kc}$ describes the tested knowledge concepts, and $G_{sr}$ is the operational guidance instructing human graders how to score responses. As we previously mentioned, directly using $G_0=\{G_{qs}||G_{kc}||G_{sr}\}$ as the grading guideline for Grader is sub-optimal since scoring rubrics written for human graders commonly lack detailed explanations of some concepts. As a result, LLM-based grading methods could provide ambiguous judgments. To solve this issue, GradeOpt focuses on optimizing $G$ by appending new Adaptation Rules ($G_{ar}$) that provide detailed explanations derived from reflections on failed predictions and identified errors. In Figure 2, we present an example of the optimized grading guideline $G_{opt}$.
Specifically, when given the expert-designed input containing “Task Description”, “Question Stem”, “Key Concept” and “Scoring Rubrics”, GradeOpt automatically generates the additional descriptions in “Adaptation Rules”. These new rules help describe how to assign a grade based on answer patterns and details.
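
The guideline structure described above can be sketched as a small Python data type. The class name `Guideline`, its field names, and the `render` and `with_rules` helpers are illustrative assumptions; the only property taken from the text is that the expert-written parts stay fixed while the adaptation rules are appended.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Guideline:
    """Hypothetical container for the guideline components."""
    task_description: str
    question_stem: str                      # G_qs, written by experts
    key_concept: str                        # G_kc, written by experts
    scoring_rubrics: str                    # G_sr, written by experts
    adaptation_rules: Tuple[str, ...] = ()  # G_ar, the only optimized part

    def render(self) -> str:
        """Concatenate the components into one prompt text."""
        parts = [
            f"Task Description: {self.task_description}",
            f"Question Stem: {self.question_stem}",
            f"Key Concept: {self.key_concept}",
            f"Scoring Rubrics: {self.scoring_rubrics}",
        ]
        if self.adaptation_rules:
            rules = "\n".join(f"{i}. {rule}"
                              for i, rule in enumerate(self.adaptation_rules, 1))
            parts.append(f"Adaptation Rules:\n{rules}")
        return "\n".join(parts)

    def with_rules(self, rules: Iterable[str]) -> "Guideline":
        """Return a copy with new adaptation rules; expert fields are untouched."""
        return replace(self, adaptation_rules=tuple(rules))


g0 = Guideline("Grade the response.", "What does the student understand?",
               "Proportional relationship.", "Award 0, 1, or 2 points.")
g1 = g0.with_rules(["Award 1 point for implicit mentions of the key concept."])
print(g1.render())
```

Making the dataclass frozen is a design choice of this sketch: it makes accidental edits to the expert-written fields fail loudly.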

The training procedure is shown on the left sub-figure of Figure 1. During training, the optimization is conducted in an iterative manner. In the $t$-th round, GradeOpt first draws a batch of samples $b$ from $\mathcal{D}_{train}$ and sends them to the grader agent for grading. GradeOpt compares the grades outputted by LLMs with human-annotated scores, then identifies error samples. These samples are then sent to the reflector agent for error reflections. Based on the reflections generated from those error samples, the reflector agent proposes a series of suggestions for improving $G_{t-1}$, represented by $\Delta G_t$. $\Delta G_t$ is then sent to the refiner agent, which fuses $G_{t-1}$ with $\Delta G_t$ and generates $G_t$ for the next iteration of optimization. Next, we will introduce detailed designs of the three agents in GradeOpt. Then, we will present the implementation details of the iterative optimization process.
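
One round of the procedure above can be sketched in Python as follows. The three agents are passed in as plain callables; `toy_grade`, `toy_reflect`, and `toy_refine` are hypothetical keyword-based stand-ins for the LLM agents and exist only to make the sketch runnable.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, int]        # (answer, human score)
Error = Tuple[str, int, int]    # (answer, human score, predicted score)


def training_round(guideline: str, batch: Sequence[Sample],
                   grade: Callable[[str, str], int],
                   reflect: Callable[[str, List[Error]], List[str]],
                   refine: Callable[[str, List[str]], str]) -> str:
    """Grade a batch, collect the errors, reflect on them, refine the guideline."""
    errors: List[Error] = []
    for answer, human_score in batch:
        predicted = grade(guideline, answer)
        if predicted != human_score:
            errors.append((answer, human_score, predicted))
    if not errors:
        return guideline                      # nothing to learn from this batch
    suggestions = reflect(guideline, errors)  # plays the role of Delta G_t
    return refine(guideline, suggestions)     # plays the role of G_t


def toy_grade(guideline: str, answer: str) -> int:
    accepts_ratio = "ratio" in guideline
    return int("proportional" in answer or (accepts_ratio and "ratio" in answer))


def toy_reflect(guideline: str, errors: List[Error]) -> List[str]:
    missed_ratio = any("ratio" in answer and human == 1 for answer, human, _ in errors)
    return ["Also award 1 point when the answer mentions an equal ratio."] if missed_ratio else []


def toy_refine(guideline: str, suggestions: List[str]) -> str:
    return "\n".join([guideline, *suggestions])


batch = [("the ratio is equal", 1), ("it is proportional", 1), ("not sure", 0)]
g0 = "Award 1 point for proportional reasoning."
g1 = training_round(g0, batch, toy_grade, toy_reflect, toy_refine)
print(g1)
```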

4.2.1 Agent Configurations

Grader 

The Grader focuses on mapping $x_i$ to $s_j$ based on the given $G$. In GradeOpt, we leverage the exceptional instruction-following capability of LLMs by using a prompt to instruct LLMs to simulate the grading process of human graders. To fully exploit the potential of LLMs, we incorporate the prompt engineering strategy Chain-of-Thought wei2023cot . This encourages LLMs to provide both judgment and intermediate reasoning steps in their outputs. With such design, the Grader becomes better aligned with the human-like grading process. Meanwhile, the intermediate reasoning steps provide support for the Reflector to discover the potential improvements to the given guideline. The prompt for the Grader agent is shown in Figure 3.

Grader Prompt
Task Description: In this task, you perform the task of assessing teachers’ knowledge of students’ mathematical thinking by grading teacher’s response to a math teaching question.
Question Stem: <question stem>
Key Concept: <key concept>
Scoring Rubrics: <scoring rubrics>
Adaptation Rules: <adaptation rules>
Output format
<score>
Reasoning: <reasoning>
Output Rules
1. Replace <score> with only one integer from 0, 1, or 2.
2. Replace <reasoning> with your reasoning.
Let’s think step by step!
Figure 3: An example of the prompt to Grader.
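
A reply that follows the output format in the Grader prompt can be parsed with a short Python sketch. It assumes the reply starts with a single integer from 0 to 2 followed by a `Reasoning:` section; the names `Grade` and `parse_grader_reply` are hypothetical, and real LLM replies may deviate from the format and need more tolerant handling.

```python
import re
from typing import NamedTuple


class Grade(NamedTuple):
    score: int
    reasoning: str


# Assumed layout: "<score>" (one digit from 0 to 2), then "Reasoning: <reasoning>".
_REPLY = re.compile(r"^\s*([0-2])\s*Reasoning:\s*(.*)$", re.DOTALL)


def parse_grader_reply(reply: str) -> Grade:
    """Split a grader reply into its integer score and reasoning text."""
    match = _REPLY.match(reply)
    if match is None:
        raise ValueError("reply does not follow the expected output format")
    return Grade(int(match.group(1)), match.group(2).strip())


example = parse_grader_reply("1\nReasoning: Mentions equivalent ratios.")
print(example.score, example.reasoning)
```

Rejecting malformed replies instead of guessing a score keeps formatting failures separate from genuine grading errors, which matters when the errors are later used for reflection.
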
Reflector 

The role of Reflector is to propose ways to improve the current guideline $G_{t-1}$ by reflecting over the error samples returned by Grader. To be specific, we design a two-step instruction prompt for LLMs to achieve this goal. In the first step, the LLM is instructed to analyze the individual and shared failure reasons for a set of error samples. Then, in the second step, we ask the LLM to propose suggestions that can help resolve those issues. In general, the two-step improving process is analogous to the gradient descent algorithm used for parameter optimization in machine learning (ruder2017overviewgradientdescentoptimization). In our case, the guideline $G$ serves as the parameter of Grader, and identifying the error reasons is similar to computing the “gradient”. Finally, proposing suggestions based on the discovered reasons is similar to taking a descent step along the “gradient”, thus optimizing $G_{t-1}$. The prompt for the Reflector agent is shown in Figure 4.
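
The analogy can be written side by side. The operator names Reflect and Refine below are illustrative shorthand for the two agents rather than notation from the framework; $\theta$, $\eta$, and $\mathcal{L}$ are the usual parameters, step size, and loss of gradient descent, and $e_t$ denotes the error samples returned by Grader in round $t$.

```latex
\begin{align*}
\text{gradient descent:} \quad
  & \theta_t = \theta_{t-1} - \eta \, \nabla_{\theta} \mathcal{L}(\theta_{t-1}) \\
\text{guideline update:} \quad
  & \Delta G_t = \mathrm{Reflect}(G_{t-1}, e_t), \qquad
    G_t = \mathrm{Refine}(G_{t-1}, \Delta G_t)
\end{align*}
```

Under this reading, the textual suggestions $\Delta G_t$ play the role of the update term $-\eta \nabla_{\theta} \mathcal{L}$, with the difference that they are natural-language edits rather than numeric quantities.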

Refiner 

The role of Refiner is to generate a new guideline $G_t$ based on the suggestions from Reflector. Specifically, Refiner is asked to modify the examples and illustrations in $G_{ar}$. Such modifications include adding, removing, or editing content. Note that we keep the other components, i.e., $G_{qs}$, $G_{kc}$, and $G_{sr}$, unchanged since they are composed by human experts, and any small change may distort the scoring logic away from its original design. The refined guideline can be expressed as $G_t=\{G_{qs}||G_{kc}||G_{sr}||G_{ar}\}$, where $||$ is the text concatenation operator. The prompt for the Refiner agent is given in Figure 5.
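
The three kinds of modification can be made concrete with a small Python sketch that treats the adaptation rules as a list of strings. In the framework the Refiner is an LLM that rewrites the rules as text, so the explicit `(operation, index, text)` edit format and the `apply_edits` function are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple

Edit = Tuple[str, int, str]  # (operation, position, rule text); hypothetical format


def apply_edits(rules: Sequence[str], edits: Sequence[Edit]) -> List[str]:
    """Apply add / remove / edit operations in order and return a new rule list.
    The input list is left untouched."""
    updated = list(rules)
    for operation, index, text in edits:
        if operation == "add":
            updated.insert(index, text)
        elif operation == "remove":
            del updated[index]
        elif operation == "edit":
            updated[index] = text
        else:
            raise ValueError(f"unknown operation: {operation}")
    return updated


rules = ["Award 1 point for implicit mentions.", "Ignore spelling mistakes."]
new_rules = apply_edits(rules, [
    ("edit", 0, "Award 1 point for accurate implicit mentions."),
    ("add", 2, "Award 2 points only for explicit analysis."),
    ("remove", 1, ""),
])
print(new_rules)
```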

4.2.2 Iterative Optimization Designs

Nested Iteration 

The high complexity of test questions and grading guidelines makes it nontrivial to implement the optimization directly. Beyond that, the limited input context window of LLMs prevents them from processing all examples in $\mathcal{D}$ at once. To resolve this, we propose a nested iterative optimization approach in GradeOpt, consisting of an inner and an outer loop. Specifically, during the $t$-th outer loop, GradeOpt selects a batch of samples $b_{out}$ from $\mathcal{D}_{train}$ and sends them with $G_{t-1}$ to Grader for grading. Then, the wrongly graded answers $e_t$ are filtered for reflection. However, due to the input context window size limitation, the errors in $e_t$ cannot all be processed by Reflector and Refiner simultaneously. Thus, we introduce the inner loop, which samples an inner batch $b_{in}$ from $e_t$ and updates $G_{t-1}$ iteratively.
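
The two loops described above can be sketched in Python as follows. The sketch keeps a single guideline and omits beam search and early stopping; `update` stands in for the combined Reflector and Refiner step, and the toy guideline (a set of accepted keywords), `toy_grade`, and `toy_update` are hypothetical stand-ins for the LLM agents.

```python
import random
from typing import Callable, FrozenSet, List, Sequence, Tuple

Sample = Tuple[str, int]
Keywords = FrozenSet[str]  # toy guideline: the set of accepted keywords


def nested_optimize(guideline: Keywords, train: Sequence[Sample],
                    grade: Callable[[Keywords, str], int],
                    update: Callable[[Keywords, List[Sample]], Keywords],
                    outer_steps: int, inner_steps: int,
                    outer_size: int, inner_size: int, seed: int = 0) -> Keywords:
    rng = random.Random(seed)
    for _ in range(outer_steps):
        # Outer loop: grade an outer batch and keep the wrongly graded answers e_t.
        outer_batch = rng.sample(list(train), min(outer_size, len(train)))
        errors = [(x, y) for x, y in outer_batch if grade(guideline, x) != y]
        if not errors:
            continue
        for _ in range(inner_steps):
            # Inner loop: only a small batch of errors fits in the context window.
            inner_batch = rng.sample(errors, min(inner_size, len(errors)))
            guideline = update(guideline, inner_batch)
    return guideline


def toy_grade(guideline: Keywords, answer: str) -> int:
    return int(any(word in answer for word in guideline))


def toy_update(guideline: Keywords, errors: List[Sample]) -> Keywords:
    missed = {answer.split()[0] for answer, score in errors if score == 1}
    return guideline | missed


train = [("ratio stays equal", 1), ("proportional growth", 1),
         ("double the height", 1), ("not sure", 0)]
final = nested_optimize(frozenset({"proportional"}), train, toy_grade, toy_update,
                        outer_steps=2, inner_steps=2, outer_size=4, inner_size=2)
print(sorted(final))
```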

Reflector Prompt You are ReflectorGPT, a helpful AI agent capable of reflecting on [adaptation rules] that is used by a classifier for a grading task. Your task is to reflect and give reasons for why [adaptation rules] have gotten the given examples in [failed examples] wrong.
The prompt contains two components: 1. [question stem], [key concept] and [scoring rubrics] (these three are given by experts and should not be modified); 2. [adaptation rules] (your task to modify).
Important Steps For Devising Rules: Read [failed examples]. For each one of the errors, perform the following steps: - Step 1: Explain why the classifier made the mistakes, and provide detailed, explanative analyses for why this teacher response should not be interpreted in that wrong way. - Step 2: Devise or modify [adaptation rules] for each mistake to help classifier effectively avoid the mistake and classify the teacher response into the correct category (label). Make sure the devised rule is explanative, straightforward, detailed, concise, and in 1 to 3 sentences.
Question Stem: <question stem> Key Concept: <key concept> Scoring Rubrics: <scoring rubrics> Adaptation Rules: <adaptation rules>
But [adaptation rules] gets the following examples wrong: Failed Examples: <errors>
Give reasons for why [adaptation rules] could have gotten the examples wrong. Let’s think step by step!
Figure 4: An example of the prompt to Reflector.

To accelerate the optimization process and encourage a wider exploration of all possible combinations of error samples in $b_{out}$, we integrate the beam search strategy (freitag2017beam) within both inner and outer loops. The algorithm of the nested iteration is shown in Algorithm 1. To be specific, in the $w$-th inner loop of the $t$-th outer iteration, GradeOpt accepts the guideline beam $G_{t,w-1}=\{g_{t,w-1}^{(k)} \mid 1\leq k\leq K\}$ from the $(w-1)$-th inner iteration instead of a single guideline for refining (line 5). Then, during the inner iteration, each $g_{t,w-1}^{(k)}$ will be sent for reflection and refinement with $L$ independently sampled inner batches $b_{in}$ in a parallel manner (line 9).
After all refined guidelines for the $w$-th inner loop are generated, $G_{t,w}=\{g_{t,w}^{(l,k)} \mid 1\leq l\leq L, 1\leq k\leq K\}$, each new guideline $g_{t,w}^{(l,k)}$ will be tested over a hold-out validation set $\mathcal{D}_{val}$ (line 14). Then, the top-$K$ performing guidelines will be kept as $G_{t,w}$ and passed to the $(w+1)$-th inner loop. Finally, the beam output of the last inner iteration, $G_{t,W}$, will be sent to the $(t+1)$-th outer iteration (line 4).

While this procedure helps increase accuracy and reliability, blindly increasing the number of iterations could lead to over-fitting and higher computational overhead (juneja2024taskfacet, ). This is particularly true for smaller datasets. To address these challenges, we introduce an early-stopping criterion. Specifically, when selecting the top-$K$ performing guidelines $G_{t,w}$ in the $w$-th inner loop, we record the performance metric $m_w$ of the best-performing guideline. Then, in the $(w+1)$-th inner iteration, we check whether $m_{w+1}$ has improved over $m_w$. If the metric stops improving for two consecutive inner iterations, the current guidelines are at risk of over-fitting, so the remaining inner iterations are skipped. Similarly, if the inner iterations of both the $t$-th and the $(t-1)$-th outer iterations are terminated by early stopping, the remaining outer iterations are also skipped.
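The inner-loop stopping rule can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `should_stop_early`, the `patience` parameter, and the choice to compare against the best value seen before the last `patience` iterations are assumptions; `patience=2` mirrors the "two consecutive inner iterations" in the text.

```python
from typing import Sequence


def should_stop_early(best_metrics: Sequence[float], patience: int = 2) -> bool:
    """Return True when the best validation metric has not improved for
    `patience` consecutive inner iterations.

    best_metrics[w] plays the role of m_w: the metric of the best guideline
    kept after inner iteration w (hypothetical bookkeeping for this sketch).
    """
    if len(best_metrics) <= patience:
        return False  # not enough history to judge yet
    # Best value reached before the most recent `patience` iterations.
    reference = max(best_metrics[:-patience])
    # Stop only if none of the recent iterations beat that reference.
    return all(m <= reference for m in best_metrics[-patience:])
```

The same check could be reused at the outer level by recording, for each outer iteration, whether its inner loop ended through early stopping.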

Batch Sampling Strategies 

Using the self-reflective abilities of LLMs to refine grading guidelines requires that similar errors be exposed in consecutive optimization iterations, because LLMs often fail to generate appropriate modifications in a single attempt (ma2024llmsgood, ). This is especially true for complicated cases involving nuanced differences between score categories. However, random batch sampling in the outer loop cannot guarantee this prerequisite, which limits the performance of GradeOpt. To solve this, we develop a novel sampling strategy that leverages the misconfidence metric $\psi$ (xu2024misconfidence, ) to find challenging examples in $\mathcal{D}_{train}$. Specifically, given $x_i$ as an input to Grader and $y_i$ as its human grading result, we calculate $\psi_i=\frac{\max_{\hat{y_i}\neq y_i}\log P_{LLM}(\hat{y_i}|G,x_i)}{\log P_{LLM}(y_i|G,x_i)}$, where $\hat{y_i}$ is the prediction of Grader. The misconfidence quantifies the discrepancy between the highest log probability among Grader's incorrect predictions $\hat{y_i}$ and the log probability of the correct label $y_i$. Intuitively, a larger $\psi$ indicates that Grader gives the wrong judgment with relatively high confidence compared with the correct one, implying that the sample is more challenging. However, calculating $\psi$ over all $x_i\in\mathcal{D}_{train}$ is computationally expensive and cannot be done in each iteration. To avoid this additional computing cost, we only calculate $\psi_i$ for the samples in the current outer batch, $x_i\in b_{out}$, and select the top-$C$ samples as seeds to query similar samples from $\mathcal{D}_{train}$ through embedding similarity. In this way, we simplify the selection process and ensure that similar challenging examples appear in consecutive iterations. Finally, to avoid always optimizing over the same portion of $\mathcal{D}_{train}$, we select only half of each batch based on misconfidence and fill the other half with random samples. Detailed comparisons between the batch sampling strategies are presented in Section 5.7.
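The half-targeted, half-random batch composition described above might look like the sketch below. It takes the misconfidence scores as given and only illustrates the seed selection, the similarity query, and the random fill. The names `sample_next_batch`, `psi`, `embeddings`, and `num_seeds`, the use of cosine similarity, and the exclusion of the seeds themselves from the similarity query are all assumptions made for illustration.

```python
import math
import random
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sample_next_batch(
    psi: Dict[int, float],               # misconfidence per sample id in the current b_out
    embeddings: Dict[int, List[float]],  # embedding of every sample id in D_train
    batch_size: int,
    num_seeds: int,                      # C in the text
    rng: random.Random,
) -> List[int]:
    """Half of the batch: nearest neighbours of the top-C seeds; the rest: random."""
    # Seeds are the samples with the largest misconfidence.
    seeds = sorted(psi, key=lambda i: psi[i], reverse=True)[:num_seeds]
    hard_half: List[int] = []
    if seeds:
        candidates = [i for i in embeddings if i not in seeds]
        # Rank each candidate by its best similarity to any seed.
        candidates.sort(
            key=lambda i: max(cosine(embeddings[i], embeddings[s]) for s in seeds),
            reverse=True,
        )
        hard_half = candidates[: batch_size // 2]
    # Fill the remaining slots uniformly at random from everything else.
    remaining = [i for i in embeddings if i not in hard_half]
    n_random = min(batch_size - len(hard_half), len(remaining))
    return hard_half + rng.sample(remaining, n_random)
```

In a full pipeline the embeddings would come from a sentence encoder; any encoder that places similar responses close together would serve for this sketch.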

Data: training split $\mathcal{D}_{train}$, validation split $\mathcal{D}_{val}$, initial guidelines $\mathcal{G}$, outer loop iteration number $T$, inner loop iteration number $W$, parallel inner batch number $L$, guidelines beam size $K$.
Result: optimized guidelines $G_{opt}$.
Initialize $G_{0,W}=\{g_{0,W}^{(k)}\}=\{\mathcal{G}\}$;
for $t\leftarrow 1$ to $T$ do
    $b_{out}\leftarrow$ sample an outer iteration batch from $\mathcal{D}_{train}$;
    Initialize $G_{t,0}=G_{t-1,W}$;
    for $w\leftarrow 1$ to $W$ do
        for $k\leftarrow 1$ to $K$ do
            $\hat{y}_{out}\leftarrow$ generate grading results for $b_{out}$ by Grader with guideline $g_{t,w-1}^{(k)}$;
            $e_{t,k}\leftarrow$ find incorrectly graded samples from $b_{out}$ caused by guideline $g_{t,w-1}^{(k)}$;
            do $l\leftarrow 1$ to $L$ in parallel
                $b_{in}^{(l)}\leftarrow$ randomly sample an inner batch from $e_{t,k}$;
                $g_{t,w}^{(k,l)}\leftarrow$ generate an optimized guideline by inputting $b_{in}^{(l)}$ and $g_{t,w-1}^{(k)}$ to Reflector and Refiner;
            end
        end
        $G_{t,w}=\{g_{t,w}^{(k)}\mid 1\leq k\leq K\}\leftarrow$ select the top-$K$ performing $g_{t,w}^{(k,l)}$ based on grading performance over $\mathcal{D}_{val}$;
    end
end
Algorithm 1: Nested Iterative Prompt Optimization Algorithm
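The control flow of Algorithm 1 can be sketched in Python as follows. This is an illustrative skeleton under stated assumptions, not the authors' code: the three callables `find_errors` (Grader plus comparison with human labels), `refine` (Reflector followed by Refiner), and `score` (metric on the validation split) are hypothetical stand-ins, the parallel branch is run sequentially, and the fallback of keeping the current beam when no errors remain is an added safeguard.

```python
import random
from typing import Callable, List, Sequence, TypeVar

G = TypeVar("G")  # a guideline, e.g. a prompt string
S = TypeVar("S")  # a labelled training sample


def optimize_guidelines(
    train: Sequence[S],
    initial: G,
    find_errors: Callable[[G, Sequence[S]], List[S]],  # hypothetical Grader wrapper
    refine: Callable[[G, Sequence[S]], G],             # hypothetical Reflector + Refiner
    score: Callable[[G], float],                       # hypothetical validation metric
    T: int, W: int, L: int, K: int,
    outer_size: int, inner_size: int,
    rng: random.Random,
) -> List[G]:
    beam: List[G] = [initial]
    for t in range(T):
        # Outer loop: one batch shared by all inner iterations.
        b_out = rng.sample(list(train), min(outer_size, len(train)))
        for w in range(W):
            candidates: List[G] = []
            for g in beam:
                errors = find_errors(g, b_out)
                # L independent inner batches (sequential here, parallel in the paper).
                for l in range(L if errors else 0):
                    b_in = rng.sample(errors, min(inner_size, len(errors)))
                    candidates.append(refine(g, b_in))
            if not candidates:
                break  # no errors left on this outer batch; keep the current beam
            # Beam search: keep the K best refined guidelines.
            beam = sorted(candidates, key=score, reverse=True)[:K]
    return beam
```

With the settings reported in Section 5.3, a call would use `T=5, W=3, K=4, outer_size=64, inner_size=8`; the value of `L` is not stated in this section and would have to be chosen by the user.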

4.3 Test-time Adaptation Stage

In this stage, GradeOpt performs automatic grading on the large-scale unlabeled responses in the test data. However, due to the diversity of language expressions in open-ended answers, and other factors such as geography and time that change users' expression styles, the performance of the auto-grader is not guaranteed to match that observed during training. This phenomenon is well recognized as the out-of-distribution (OOD) issue in many machine learning problems (hendrycks2016baseline, ). Prior work (hendrycks2016baseline, ) has shown that prediction probability statistics of correct or in-sample examples are often sufficient for detecting whether an example is erroneous or abnormal. Inspired by this, we compose a confidence indicator $\zeta=\frac{1}{|\mathcal{D}_{test}|}\sum_{x_i\in\mathcal{D}_{test}}\max_j(\log P_{LLM}(s_j|G,x_i))$, where $\log P_{LLM}(\cdot)$ denotes the log-likelihood given by the LLM. Intuitively, the log probability reflects the confidence that Grader has in its grading results. By comparing $\zeta$ with the average LLM confidence score $\mu$ on samples in $\mathcal{D}_{train}$, we can gauge how serious the OOD phenomenon is. Specifically, $\zeta>\mu$ indicates that $G$ is well applicable to $\mathcal{D}_{test}$, whereas $\zeta<\mu$ suggests that the guideline is seriously affected by OOD, so Grader may struggle to produce reliable and accurate predictions for $\mathcal{D}_{test}$.
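As a small illustration of the OOD check, the sketch below computes $\zeta$ from per-sample log probabilities and compares it with $\mu$. The function names and the input layout (one list of label log probabilities per response) are assumptions; how the log probabilities are obtained from the LLM is outside the scope of this sketch.

```python
import math
from typing import Sequence


def confidence_indicator(log_probs: Sequence[Sequence[float]]) -> float:
    """zeta: mean over samples of the largest label log-probability.

    log_probs[i][j] stands for log P_LLM(s_j | G, x_i) for test sample x_i.
    """
    if not log_probs:
        raise ValueError("need at least one sample")
    return sum(max(sample) for sample in log_probs) / len(log_probs)


def is_out_of_distribution(zeta: float, mu: float) -> bool:
    """Flag OOD when test-time confidence falls below the training average mu."""
    return zeta < mu
```

With the values reported later in Table 4 ($\mu=-0.2$), a question with $\zeta=-0.22$ would be flagged as OOD, while one with $\zeta=-0.16$ would not.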

If the test samples are deemed to be OOD, a common solution is to first compose an adaptation dataset from the testing scenario and then perform test-time training on the existing model with it. Specifically, test-time training leverages annotated samples from $\mathcal{D}_{test}$ and fine-tunes the optimized guideline $G_{opt}$ with the same training process introduced in Section 4.2. Unfortunately, in the ASAG scenario, annotation is usually expensive. Moreover, it is challenging to ask pedagogical experts to provide a large number of annotated samples so that the existing system can adapt to changes in a timely manner. To solve this issue, we propose an incremental labeling approach that checks the marginal performance changes brought by gradually increasing the number of annotated samples. By selecting the size with the highest marginal gains in metrics such as accuracy and Kappa, GradeOpt asks pedagogical experts only for the necessary annotations. This not only reduces the annotation workload but also increases the adaptation efficiency of the framework. Finally, once $G_{opt}$ passes the OOD test, GradeOpt is used to complete ASAG over all samples in $\mathcal{D}_{test}$.
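One way to read the incremental labeling rule is sketched below: adaptation is run with progressively larger annotation sets, and the size whose step produced the largest metric gain is selected. The function name `pick_annotation_size` and the interpretation of "marginal gain" as the raw difference between consecutive runs are assumptions; the text does not specify the exact selection formula.

```python
from typing import Sequence


def pick_annotation_size(sizes: Sequence[int], metrics: Sequence[float]) -> int:
    """Return the annotation-set size whose step brought the largest metric gain.

    sizes[i] is the number of expert-labelled samples used in the i-th run and
    metrics[i] the resulting score (e.g. Cohen's Kappa). sizes[0] is the
    starting point, typically 0 (no adaptation).
    """
    if len(sizes) != len(metrics) or len(sizes) < 2:
        raise ValueError("need at least two (size, metric) pairs of equal length")
    # Marginal gain of each step relative to the previous, smaller size.
    gains = [metrics[i] - metrics[i - 1] for i in range(1, len(sizes))]
    best_step = max(range(len(gains)), key=gains.__getitem__)
    return sizes[best_step + 1]
```

For instance, with made-up scores of 0.52, 0.55, 0.63, and 0.64 at 0, 20, 40, and 60 annotated samples, the largest gain occurs when moving from 20 to 40 samples, so 40 would be selected.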

Refiner Prompt
I’m trying to write a classifier for a grading task. You are RefinerGPT, a helpful AI agent capable of refining [adaptation rules] to be used by the classifier.
The [adaptation rules] must contain patterns learned from failed examples, explaining why the predicted score is wrong comparing to the correct label. The set of rules must strictly abide by [scoring rubrics] and must clearly use patterns/details from examples to clearly illustrate and explain.
Question Stem: <question stem>
Key Concept: <key concept>
Scoring Rubrics: <scoring rubrics>
Adaptation Rules: <adaptation rules>
But [adaptation rules] have gotten several examples wrong, with the reasons of the problems examined as follows:
Failed Examples: <errors>
Error Feedbacks: <error feedbacks>
Based on the above information, I wrote one different improved set of rules in replacement of [adaptation rules] for instructing the classifier to learn patterns from examples for avoiding such errors.
Let’s think step by step!
Figure 5: An example of the prompt to Refiner.
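The way the Reflector output feeds into the Refiner prompt can be sketched as below. The templates are abbreviated stand-ins for the prompts in Figures 4 and 5, not the full text, and `llm` is a hypothetical callable that maps a prompt string to a completion string; no specific client library is implied.

```python
from typing import Callable, Sequence

# Abbreviated stand-ins for the Reflector and Refiner prompts (assumption).
REFLECTOR_TEMPLATE = (
    "Question Stem: {question_stem}\n"
    "Scoring Rubrics: {scoring_rubrics}\n"
    "Adaptation Rules: {adaptation_rules}\n"
    "But [adaptation rules] gets the following examples wrong:\n"
    "Failed Examples: {errors}\n"
    "Give reasons for why [adaptation rules] could have gotten the examples wrong."
)
REFINER_TEMPLATE = (
    "Question Stem: {question_stem}\n"
    "Scoring Rubrics: {scoring_rubrics}\n"
    "Adaptation Rules: {adaptation_rules}\n"
    "Failed Examples: {errors}\n"
    "Error Feedbacks: {error_feedbacks}\n"
    "Write one improved set of rules in replacement of [adaptation rules]."
)


def reflect_and_refine(
    llm: Callable[[str], str],  # hypothetical: prompt in, completion out
    question_stem: str,
    scoring_rubrics: str,
    adaptation_rules: str,
    failed_examples: Sequence[str],
) -> str:
    """Run Reflector, then pass its feedback to Refiner; return the new rules."""
    shared = {
        "question_stem": question_stem,
        "scoring_rubrics": scoring_rubrics,
        "adaptation_rules": adaptation_rules,
        "errors": "\n".join(failed_examples),
    }
    # Step 1: Reflector explains why the current rules failed.
    feedback = llm(REFLECTOR_TEMPLATE.format(**shared))
    # Step 2: Refiner rewrites the rules using that explanation.
    return llm(REFINER_TEMPLATE.format(**shared, error_feedbacks=feedback))
```

Passing the LLM as a plain callable keeps the sketch independent of any particular API and makes it easy to substitute a stub when testing the control flow.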

5 Experiment

In this section, we conduct experiments to validate the effectiveness of GradeOpt. Through the experiments, we aim to answer the following research questions. RQ1: Do the guidelines refined through prompt optimization match or exceed the performance of human-crafted guidelines? RQ2: Are the optimized guidelines applicable to new datasets of the same or similar questions? RQ3: How does each component contribute to the overall effectiveness of the guideline optimization system?

5.1 Datasets

To address the research questions above, we conduct experiments using two representative datasets for SAQ grading. Unlike existing ASAG studies (dzikovska2013semeval, ; mohler2011learning, ), which focus solely on student responses, our work extends ASAG to the grading of pedagogical answers from both students and teachers. The first dataset, $\mathcal{D}_1$, consists of teachers' responses to questions designed to assess the knowledge and skills essential for teaching mathematics (copur2022mathematics, ). Since grading pedagogical answers requires a more nuanced interpretation to capture the underlying thought process, evaluating GradeOpt on this dataset allows us to examine its performance on more complex ASAG tasks. Specifically, $\mathcal{D}_1$ includes six questions addressing different aspects of teacher knowledge, with responses labeled on a three-point scale: Bad (0), Fair (1), and Good (2). The second dataset, $\mathcal{D}_2$, evaluates GradeOpt on student responses, aligning with prior studies. It comprises 252 high school student responses to 11 assessment items within a physical sciences curriculum. These assessments measure Learning Progress (LP)-aligned scientific text-based explanations, reflecting students' ability to apply knowledge of electrical interactions in high school Physical Science (kaldaras2021developing, ). Responses in $\mathcal{D}_2$ are graded on a binary scale: Fail (0) or Pass (1). All grading labels in both $\mathcal{D}_1$ and $\mathcal{D}_2$ were assigned by at least two human raters. In cases of disagreement, a third rater provided the final judgment. Detailed statistics for both datasets are presented in Table 1. For our experiments, we split both datasets into training, validation, and test sets using a 7:1:2 ratio.
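A 7:1:2 split of this kind can be sketched as follows. The function name `split_7_1_2`, the fixed seed, and the use of a plain random shuffle are assumptions; the text does not say whether the split was stratified by label.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_7_1_2(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle and split into train / validation / test with a 7:1:2 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 7 // 10  # integer arithmetic avoids floating-point rounding
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, roughly 20%
    return train, val, test
```

For a question with 252 responses, this sketch yields 176 training, 25 validation, and 51 test samples.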

Table 1: Detailed statistics of the questions in datasets $\mathcal{D}_1$ and $\mathcal{D}_2$. The number of samples in each label category is shown as $C_i$.

| $\mathcal{D}_1$ Question | Total | $C_1$ / $C_2$ / $C_3$ | $\mathcal{D}_2$ Question | Total | $C_1$ / $C_2$ |
|---|---|---|---|---|---|
| $Q_1$ | 261 | 36 / 104 / 121 | $Q_1$ | 252 | 43 / 209 |
| $Q_2$ | 265 | 78 / 47 / 140 | $Q_2$ | 252 | 123 / 129 |
| $Q_3$ | 236 | 132 / 66 / 38 | $Q_3$ | 252 | 113 / 139 |
| $Q_4$ | 231 | 180 / 44 / 7 | $Q_4$ | 252 | 183 / 69 |
| $Q_5$ | 232 | 83 / 112 / 37 | $Q_5$ | 252 | 242 / 10 |
| $Q_6$ | 229 | 74 / 43 / 112 | $Q_6$ | 252 | 244 / 8 |
| $Q_7$ | 230 | 64 / 114 / 52 | $Q_7$ | 252 | 245 / 7 |
| $Q_8$ | 231 | 108 / 24 / 99 | $Q_8$ | 252 | 227 / 25 |
| - | - | - | $Q_9$ | 252 | 243 / 9 |
| - | - | - | $Q_{10}$ | 252 | 210 / 42 |
| - | - | - | $Q_{11}$ | 252 | 241 / 11 |

5.2 Baselines

We compare GradeOpt with several representative ASAG baselines. First, we choose two popular non-LLM methods, i.e., SBERT (reimers2019sentencebert, ) with logistic regression and RoBERTa (liu2019roberta, ) with fine-tuning. Both have demonstrated strong performance in prior studies (condor2021automaticSA, ; poulton2021explaining, ). In addition, we adopt GPT-4o with zero-shot prompting, referred to as GPT-4o, as another baseline. Compared with non-LLM methods, LLMs' exceptional instruction-following and human-like reasoning capabilities make them powerful when facing complicated grading cases (henkel2024can, ). Finally, since manually revising the guidelines in the GPT-4o setting is burdensome, we also implement APO (pryzant2023gradient, ), a state-of-the-art method for automatic prompt optimization, and compare GradeOpt against it.

5.3 Implementations

To implement the nested iterative optimization, we set the outer batch size $|b_{out}|=64$ and the inner batch size $|b_{in}|=8$. The outer loop iteration number is $T=5$ and the inner loop iteration number is $W=3$. We implement the beam search selection mechanism with Upper Confidence Bound (UCB) (auer2003using, ), where the guideline beam size is $K=4$. The evaluation metric for UCB is Cohen's Kappa, as it empirically works better than other metrics. The agents in our framework are all powered by GPT-4o (openai2024gpt4, ) with zero-shot prompting. The temperature for Grader is set to 0.0 to reduce the randomness of the results. The temperatures for both Reflector and Refiner are set to 0.5 to encourage the LLMs to be more open in exploring error reasons and proposing improvements. For each question, we run the algorithm 3 times and report the average results.
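A UCB-based selection of the top-$K$ guidelines could look like the sketch below. It uses the standard UCB1 form (mean reward plus an exploration bonus), which is an assumption: this section does not give the exact variant, the exploration constant `c`, or the number of evaluation rounds. The `evaluate` callable is hypothetical and would return, for example, Cohen's Kappa of a guideline on a randomly drawn validation mini-batch.

```python
import math
from typing import Callable, List, Sequence


def ucb_select_top_k(
    candidates: Sequence[str],
    evaluate: Callable[[str], float],  # hypothetical noisy score of one candidate
    k: int,
    rounds: int,
    c: float = 1.0,                    # exploration constant (assumed value)
) -> List[str]:
    """Spend `rounds` evaluations via UCB1, then return the k best by mean score."""
    counts = [0] * len(candidates)
    sums = [0.0] * len(candidates)
    for t in range(1, rounds + 1):
        untried = [i for i, n in enumerate(counts) if n == 0]
        if untried:
            i = untried[0]  # evaluate every candidate at least once
        else:
            # Pick the candidate with the largest upper confidence bound.
            i = max(
                range(len(candidates)),
                key=lambda j: sums[j] / counts[j]
                + c * math.sqrt(math.log(t) / counts[j]),
            )
        sums[i] += evaluate(candidates[i])
        counts[i] += 1
    means = [
        sums[i] / counts[i] if counts[i] else float("-inf")
        for i in range(len(candidates))
    ]
    order = sorted(range(len(candidates)), key=means.__getitem__, reverse=True)
    return [candidates[i] for i in order[:k]]
```

The appeal of a bandit-style selector is that promising guidelines receive more evaluations than clearly poor ones, which reduces the number of Grader calls needed to rank a beam.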

5.4 Evaluation Metrics

In this work, we use Accuracy (Acc) and Cohen's Kappa ($\kappa_c$) as the evaluation metrics to compare the performance of different models. Specifically, accuracy measures the percentage of correct predictions across all cases, while Cohen's Kappa measures the inter-rater agreement between the model's predictions and expert annotations, accounting for agreement by chance. For the $\mathcal{D}_1$ dataset, which involves multi-class classification, we additionally use Quadratic Weighted Kappa ($\kappa_w$), which is particularly suitable for ordinal data because it assigns different weights to disagreements based on their magnitude.
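For concreteness, the three metrics can be computed as in the following dependency-free sketch, which uses the standard definitions of Cohen's Kappa and its quadratic-weighted variant. The function names and the convention of returning 0.0 in the degenerate case where both raters use a single identical label (where kappa is undefined) are choices made for this illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the human label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    num_classes: int,
    quadratic: bool = False,
) -> float:
    """Cohen's Kappa; with quadratic=True, the Quadratic Weighted Kappa."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    n = len(y_true)
    # Confusion matrix: rows are human labels, columns are model labels.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_marg = [sum(row) for row in observed]
    pred_marg = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]

    def weight(i: int, j: int) -> float:
        # Disagreement weight: 0/1 for plain kappa, squared distance for QWK.
        if quadratic:
            return (i - j) ** 2 / (num_classes - 1) ** 2
        return 0.0 if i == j else 1.0

    pairs = [(i, j) for i in range(num_classes) for j in range(num_classes)]
    disagree_obs = sum(weight(i, j) * observed[i][j] for i, j in pairs) / n
    disagree_exp = sum(weight(i, j) * true_marg[i] * pred_marg[j]
                       for i, j in pairs) / (n * n)
    if disagree_exp == 0:
        return 0.0  # kappa undefined here; 0.0 is an arbitrary choice in this sketch
    return 1.0 - disagree_obs / disagree_exp
```

On labels `[0, 1, 2, 2, 1, 0]` with predictions `[0, 1, 2, 1, 1, 0]`, the weighted variant is higher than the unweighted one because the single error is between adjacent score levels.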

5.5 Main Results

In this section, we address RQ1 by comparing baseline models with GradeOpt on both datasets, $\mathcal{D}_1$ and $\mathcal{D}_2$. Table 2 presents the performance of baseline models and GradeOpt on $\mathcal{D}_1$. The results reveal several key observations. First, while most models achieve relatively high accuracy across questions, the Cohen's Kappa values of the non-LLM baselines, RoBERTa and SBERT, are notably low on some questions, often equal or close to zero. This indicates poor alignment between automated and manual grading. A deeper analysis reveals that the non-LLM models tend to assign every response to the majority class, leading to skewed grading patterns. This suggests that LLM-based models provide more reliable grading results than their non-LLM counterparts. Second, comparing GPT-4o with prompt-optimized methods highlights the importance of optimization. Optimized prompts consistently enhance grading performance while reducing variance across questions. This finding confirms that directly applying raw human-provided rubrics is suboptimal, and that prompt optimization is necessary to fully leverage LLMs in automatic grading. Lastly, GradeOpt outperforms the state-of-the-art (SOTA) automatic prompt optimization method, APO, across all questions, demonstrating its advantage over existing methods in improving grading performance.
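The combination of high accuracy and zero kappa follows directly from the definition of Cohen's Kappa. As a short worked derivation, let $p_o$ be the observed agreement, $p_e$ the agreement expected by chance, $\pi_c$ the fraction of human labels in class $c$, and $q_c$ the fraction of model predictions in class $c$ (these symbols are introduced here only for illustration):

```latex
\kappa_c = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = \sum_{c} \pi_c \, q_c .
% A model that assigns every response to a single class m has q_m = 1 and q_c = 0 otherwise:
p_o = \pi_m, \qquad p_e = \pi_m \cdot 1 = \pi_m
\;\Longrightarrow\;
\kappa_c = \frac{\pi_m - \pi_m}{1 - \pi_m} = 0 \quad (\pi_m < 1).
```

Accuracy in this case equals $\pi_m$, which is high whenever the label distribution is skewed. For example, 242 of the 252 responses to $Q_5$ in $\mathcal{D}_2$ (about 96%) fall in one category, so a constant prediction would score high accuracy on such a question while carrying no agreement beyond chance.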

Table 2: Performance of GradeOpt and baseline models on $\mathcal{D}_1$. The best-performing model for each metric is marked in bold, and the second best is underlined.

| Metric | Question | RoBERTa | SBERT | GPT-4o | APO | GradeOpt |
|---|---|---|---|---|---|---|
| Accuracy (Acc) | $Q_1$ | 0.80 | 0.61 | 0.85 | 0.90 | 0.92 |
| | $Q_2$ | 0.81 | 0.70 | 0.72 | 0.89 | 0.91 |
| | $Q_3$ | 0.76 | 0.76 | 0.75 | 0.80 | 0.86 |
| | $Q_4$ | 0.79 | 0.74 | 0.51 | 0.67 | 0.70 |
| | $Q_5$ | 0.79 | 0.69 | 0.64 | 0.79 | 0.80 |
| | $Q_6$ | 0.49 | 0.76 | 0.70 | 0.81 | 0.84 |
| | $Q_7$ | 0.55 | 0.68 | 0.51 | 0.68 | 0.73 |
| | $Q_8$ | 0.66 | 0.62 | 0.66 | 0.85 | 0.89 |
| Cohen's Kappa ($\kappa_c$) | $Q_1$ | 0.65 | 0.32 | 0.76 | 0.85 | 0.88 |
| | $Q_2$ | 0.66 | 0.42 | 0.56 | 0.80 | 0.85 |
| | $Q_3$ | 0.00 | 0.00 | 0.38 | 0.51 | 0.68 |
| | $Q_4$ | 0.00 | 0.00 | 0.09 | 0.35 | 0.36 |
| | $Q_5$ | 0.58 | 0.44 | 0.30 | 0.60 | 0.63 |
| | $Q_6$ | 0.00 | 0.17 | 0.55 | 0.69 | 0.70 |
| | $Q_7$ | 0.00 | 0.41 | 0.33 | 0.50 | 0.52 |
| | $Q_8$ | 0.37 | 0.29 | 0.48 | 0.75 | 0.80 |
| Quadratic Weighted Kappa ($\kappa_w$) | $Q_1$ | 0.70 | 0.31 | 0.82 | 0.88 | 0.89 |
| | $Q_2$ | 0.81 | 0.52 | 0.75 | 0.93 | 0.94 |
| | $Q_3$ | 0.00 | 0.00 | 0.61 | 0.67 | 0.76 |
| | $Q_4$ | 0.00 | 0.00 | 0.24 | 0.41 | 0.54 |
| | $Q_5$ | 0.59 | 0.32 | 0.56 | 0.68 | 0.71 |
| | $Q_6$ | 0.00 | 0.17 | 0.62 | 0.77 | 0.80 |
| | $Q_7$ | 0.00 | 0.48 | 0.58 | 0.64 | 0.70 |
| | $Q_8$ | 0.44 | 0.33 | 0.68 | 0.84 | 0.87 |
Table 3: Performance of GradeOpt and baseline models on $\mathcal{D}_2$. The best-performing model for each metric is marked in bold, and the second best is underlined.

| Metric | Question | RoBERTa | SBERT | GPT-4o | APO | GradeOpt |
|---|---|---|---|---|---|---|
| Accuracy (Acc) | $Q_1$ | 0.84 | 0.84 | 0.84 | 0.96 | 0.98 |
| | $Q_2$ | 0.57 | 0.55 | 0.76 | 0.86 | 0.86 |
| | $Q_3$ | 0.69 | 0.67 | 0.88 | 0.92 | 0.92 |
| | $Q_4$ | 0.65 | 0.65 | 0.78 | 0.80 | 0.82 |
| | $Q_5$ | 0.94 | 0.94 | 0.94 | 0.94 | 0.98 |
| | $Q_6$ | 0.98 | 0.98 | 0.88 | 0.92 | 0.94 |
| | $Q_7$ | 0.98 | 0.98 | 1.00 | 1.00 | 1.00 |
| | $Q_8$ | 0.92 | 0.92 | 0.47 | 0.76 | 0.78 |
| | $Q_9$ | 0.98 | 0.98 | 0.96 | 0.96 | 0.98 |
| | $Q_{10}$ | 0.88 | 0.88 | 0.73 | 0.86 | 0.88 |
| | $Q_{11}$ | 0.94 | 0.94 | 0.90 | 0.96 | 0.96 |
| Cohen's Kappa ($\kappa_c$) | $Q_1$ | 0.00 | 0.00 | 0.55 | 0.83 | 0.92 |
| | $Q_2$ | 0.05 | 0.00 | 0.52 | 0.72 | 0.72 |
| | $Q_3$ | 0.08 | 0.00 | 0.74 | 0.81 | 0.81 |
| | $Q_4$ | 0.00 | 0.00 | 0.59 | 0.62 | 0.65 |
| | $Q_5$ | 0.00 | 0.00 | 0.64 | 0.64 | 0.64 |
| | $Q_6$ | 0.00 | 0.00 | 0.22 | 0.31 | 0.38 |
| | $Q_7$ | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
| | $Q_8$ | 0.00 | 0.00 | 0.10 | 0.15 | 0.17 |
| | $Q_9$ | 0.00 | 0.00 | 0.48 | 0.48 | 0.66 |
| | $Q_{10}$ | 0.00 | 0.00 | 0.34 | 0.56 | 0.60 |
| | $Q_{11}$ | 0.00 | 0.00 | 0.50 | 0.73 | 0.73 |

Table 3 presents the performance of baseline models and GradeOpt on dataset $\mathcal{D}_2$. Consistent with the findings in Table 2, LLM-based ASAG methods achieve higher Cohen's Kappa than non-LLM-based models on every question, reaffirming the advantages of leveraging LLMs for ASAG tasks. Moreover, GradeOpt attains the best or tied-best Cohen's Kappa on every question in Table 3, and the best or tied-best accuracy on most of them, demonstrating its effectiveness across different datasets. This supports the view that GradeOpt is a generalizable framework suitable for various ASAG tasks. Comparing model performance across the two datasets reveals an expected trend: grading responses related to teacher knowledge yields lower performance metrics. This aligns with our expectations, as evaluating pedagogical knowledge is inherently more complex, requiring deeper expertise and more intricate logical reasoning. Beyond performance improvements, LLM-based ASAG methods enhance grading transparency and comprehensibility. Educators can easily interpret the LLM's scoring rationale, which can be further broken down at the expectation level. Each key concept or criterion specified in the rubric or learning objectives is individually assessed, allowing educators to evaluate how well student responses align with specific learning goals. This level of explainability fosters greater trust in automated grading and facilitates more informed instructional decisions.

5.6 Adaptation Results

Table 4: Results for $Q_{1}$ and $Q_{2}$ on the new dataset $\mathcal{D}_{1}'$ before ($G_{opt}$) and after ($G_{opt}'$) test-time training. The average confidence indicator (CI) on the pilot dataset $\mathcal{D}_{1}$ is $\mu = -0.2$, and questions with CI $\zeta < -0.2$ are marked as OOD.
Question Metric $G_{0}$ $G_{opt}$ $G_{opt}'$ $\Delta$
$Q_{1}$ Acc 0.67 0.7 0.78 +0.08
Kappa 0.49 0.52 0.64 +0.12
CI ($\zeta$) - -0.22 -0.17 +0.05
OOD - $\times$ -
$Q_{2}$ Acc 0.75 0.82 - -
Kappa 0.59 0.7 - -
CI ($\zeta$) - -0.16 - -
OOD - $\times$ - -

To address RQ2, we collect an external dataset, $\mathcal{D}_{1}'$, containing responses to two questions, $Q_{1}$ and $Q_{2}$, from new teachers across the nation. Specifically, we gather 1,352 responses for $Q_{1}$ and 1,364 responses for $Q_{2}$. From these, we randomly select 100 responses per question to serve as the test-time training dataset, while the remaining responses are used for evaluation. We apply the optimized guidelines $G_{opt}$ learned from $\mathcal{D}_{1}$ in Section 5.5 and examine their grading performance on $\mathcal{D}_{1}'$. In addition, if $G_{opt}$ fails the OOD test, i.e., its confidence indicator satisfies $\zeta < \mu$, we perform test-time training on the train split of the national dataset, $\mathcal{D}'_{train}$. The adapted guideline is then evaluated again on the same $\mathcal{D}'_{test}$. The results are reported in Table 4.
From the table, we find that $Q_{1}$ is flagged as OOD because its confidence indicator is $\zeta = -0.22 < -0.2 = \mu$. Comparing its performance in Table 2 and Table 4 confirms that it suffers a large performance drop. Meanwhile, the confidence indicator of $Q_{2}$ is $\zeta = -0.16 > -0.2 = \mu$, and its performance gap between $\mathcal{D}_{1}$ and $\mathcal{D}_{1}'$ is relatively small. These two observations indicate that the proposed confidence indicator is a valid signal for OOD detection. On the other hand, by comparing $G_{opt}$ with the raw guideline $G_{0}$, we find that $G_{opt}$ consistently outperforms $G_{0}$. This indicates that even though the automatically optimized guideline suffers from the OOD issue, it is still better than the raw guideline. Finally, the performance change before and after test-time training ($\Delta$ in Table 4) shows that GradeOpt can adapt to the new data with only a limited number of annotated examples.
In addition, after test-time training the guideline’s performance returns to the acceptable grading range, e.g., Kappa $> 0.6$, which indicates that test-time training is an effective way to quickly transfer a high-performing $G_{opt}$ to different datasets.
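
The adaptation procedure described above can be summarized as a small decision routine. The following is a minimal sketch rather than the authors’ implementation: the function names `split_adaptation_set` and `needs_test_time_training` are hypothetical, and the confidence indicator $\zeta$ is taken as a given number because its computation is defined earlier in the paper.

```python
import random


def split_adaptation_set(responses, n_train=100, seed=0):
    """Randomly hold out n_train responses for test-time training.

    The remaining responses form the evaluation split.
    """
    rng = random.Random(seed)
    indices = list(range(len(responses)))
    rng.shuffle(indices)
    train = [responses[i] for i in indices[:n_train]]
    test = [responses[i] for i in indices[n_train:]]
    return train, test


def needs_test_time_training(zeta, mu):
    """OOD test from the text: flag the question when zeta < mu."""
    return zeta < mu


# Values reported in Table 4; mu is the average CI on the pilot dataset.
MU = -0.2
confidence = {"Q1": -0.22, "Q2": -0.16}

for question, zeta in confidence.items():
    flagged = needs_test_time_training(zeta, MU)
    print(question, "adapt with test-time training" if flagged else "keep G_opt")
```

With the values from Table 4, only $Q_{1}$ falls below the threshold, which matches the fact that only $Q_{1}$ has results reported after test-time training.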

5.7 Ablation Studies

Figure 6: Performance comparison between GradeOpt with misconfidence-based and random-based outer batch selection strategies. Panels (bar charts): (a) Accuracy Comparison; (b) Cohen’s Kappa Comparison.

To answer RQ3, we conduct ablation studies on the nested iteration introduced in Section 4.2. We experiment with $Q_{5}$ from $\mathcal{D}_{1}$ because its relatively simple rubric yields the shortest guideline prompt, leaving room for GradeOpt to add its reflective experience as it iteratively learns from $\mathcal{D}_{2}$. Experimenting with $Q_{5}$ therefore better showcases GradeOpt’s optimization power. The experimental results below demonstrate the effectiveness of each component.

Figure 7: Performance of GradeOpt with different outer and inner iteration batch sizes. Panels (line plots): (a) Outer Batch Size; (b) Inner Batch Size.

First, we demonstrate the advantage of our misconfidence-based batch sampling strategy by comparing it with random sampling. Figure 6 shows that misconfidence-based batch sampling yields more consistent and accurate results. While random selection reaches its optimal guidelines in 2 to 4 rounds, misconfidence-based selection consistently does so in 3 rounds. Together with the higher predictive accuracy and alignment it brings, this makes the system reliable in practical educational scenarios, since the required number of training rounds is predictable.

Then, we vary the outer batch size $|b_{out}| \in \{20, 32, 64\}$ and the inner batch size $|b_{in}| \in \{4, 6, 8\}$ to explore the influence of these two hyper-parameters on the performance of GradeOpt. Figure 7 shows an increasing trend in accuracy and kappa as the outer batch size increases, which suggests that, within the tested range, more examples benefit the final performance. Similarly, Figure 7 shows a consistent increase in performance as the number of errors (the inner batch size) increases. Based on these two findings, we conclude that larger batch sizes are likely to bring performance gains to GradeOpt.

Finally, we study how the number of iterations impacts accuracy. We explore iteration numbers ranging from 1 to 5. While increasing the number of iterations on a minibatch, we use the early-stopping signal introduced in Section 4.2.2 to monitor overfitting. As shown in Figure 8, increasing the number of iterations with the help of the early-stopping signal leads to higher test accuracy as well as more stable performance. Although five iterations produce higher accuracy, we use three iterations as our default setting due to limited computational resources.
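
The two outer batch selection strategies compared in Figure 6 can be illustrated with a short sketch. It rests on an assumption: misconfidence is scored here as the ratio between the highest probability the grader assigns to a wrong label and the probability it assigns to the expert label, in the spirit of misconfidence-based demonstration selection. The exact metric used by GradeOpt is defined in Section 4.2 and may differ, and the names `misconfidence` and `select_outer_batch` are hypothetical.

```python
import random


def misconfidence(probs, gold):
    """Assumed score: max probability of a wrong label / probability of the gold label."""
    wrong = max(p for label, p in probs.items() if label != gold)
    return wrong / max(probs[gold], 1e-12)  # guard against division by zero


def select_outer_batch(pool, batch_size, strategy="misconfidence", seed=0):
    """Pick an outer batch either at random or by descending misconfidence."""
    batch_size = min(batch_size, len(pool))
    if strategy == "random":
        return random.Random(seed).sample(pool, batch_size)
    ranked = sorted(
        pool,
        key=lambda ex: misconfidence(ex["probs"], ex["gold"]),
        reverse=True,
    )
    return ranked[:batch_size]


# Toy pool: per-label probabilities from the grader and the expert (gold) label.
pool = [
    {"id": "a", "probs": {0: 0.1, 1: 0.2, 2: 0.7}, "gold": 2},
    {"id": "b", "probs": {0: 0.6, 1: 0.3, 2: 0.1}, "gold": 2},
    {"id": "c", "probs": {0: 0.2, 1: 0.5, 2: 0.3}, "gold": 2},
    {"id": "d", "probs": {0: 0.05, 1: 0.9, 2: 0.05}, "gold": 1},
]

picked = select_outer_batch(pool, batch_size=2)
print([ex["id"] for ex in picked])
```

Ranking by misconfidence is deterministic for a fixed pool, whereas random sampling depends on the seed, which is one plausible reason the number of rounds needed is more stable under the misconfidence-based strategy.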

Figure 8: Performance of GradeOpt with different numbers of iterations.

6 Case Study

In this section, we conduct a case study that walks through the refinement history and illustrates how GradeOpt optimizes the guidelines to match the task specifications and the training dataset. We showcase example responses of Reflector and Refiner separately.

6.1 Reflector

The Reflector reflects and gives feedback when provided with a small selection of error samples. For example, as shown in Table 5, given a selection of errors in the prompt, Reflector follows a two-step reflect-and-suggest approach. It first provides an explanation for each error, illustrating why the mistake was made and analyzing why the response should be assessed differently. Based on this reflection, Reflector then proposes a rule modification supported by the evidence from the error samples.

Table 5: Case Study for Reflector
Example Error Teacher Response: “They are equal proportions because they are situations during the same time and the same angle of the sun.” - Expert Label: 2; AI Prediction: 1
Reflector’s Error Feedback Step 1: Explanation of Mistake - The classifier predicted a score of 1 because the response mentions equal proportions and references conditions affecting proportionality (same time and same angle of the sun). However, it lacks specific evidence from the student’s work and a thorough interpretation of the student’s mathematical thinking. The correct label is 2 because the response does provide a clear explanation of the proportional relationship and the conditions under which it holds, which is a nuanced understanding of proportionality.
Step 2: Rule Modification - Rule: If the response explains equal proportions and references conditions affecting proportionality, classify as 2, even if it lacks specific evidence from the student’s work.
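
The two-step reflect-and-suggest behavior shown in Table 5 can be elicited by assembling the sampled errors into a single prompt. The sketch below is an illustration only: the prompt wording, the `ErrorSample` type, and `build_reflector_prompt` are hypothetical and are not taken from GradeOpt’s implementation, and the LLM call itself is omitted.

```python
from dataclasses import dataclass


@dataclass
class ErrorSample:
    response: str      # teacher response being graded
    expert_label: int  # score assigned by the human expert
    prediction: int    # score assigned by the LLM grader


def build_reflector_prompt(guideline: str, errors: list[ErrorSample]) -> str:
    """Format the current guideline and the error samples into one prompt."""
    lines = [
        "Current grading guideline:",
        guideline,
        "",
        "The grader made the following mistakes:",
    ]
    for i, err in enumerate(errors, 1):
        lines.append(
            f'{i}. Teacher Response: "{err.response}" - '
            f"Expert Label: {err.expert_label}; AI Prediction: {err.prediction}"
        )
    lines += [
        "",
        "Step 1: Explanation of Mistake - for each error, explain why the "
        "prediction was made and why the expert label is correct.",
        "Step 2: Rule Modification - suggest a rule change supported by the errors.",
    ]
    return "\n".join(lines)


example = ErrorSample(
    response=(
        "They are equal proportions because they are situations during "
        "the same time and the same angle of the sun."
    ),
    expert_label=2,
    prediction=1,
)
print(build_reflector_prompt("<guideline text>", [example]))
```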

6.2 Refiner

As introduced in Section 4.2.1, Refiner is tasked with modifying the guideline given the feedback from Reflector. Refiner can edit the guideline by adding, changing, or removing rules, among other operations. We first showcase an example of Refiner adding a rule in Table 6, in response to Reflector’s feedback in Section 6.1. We observe that Refiner adopts the rule that Reflector suggests in “Step 2: Rule Modification”. The new rule specifies two components for grading a certain type of answer: first, the answer pattern, by saying “if the teacher’s response explains equal proportions and references conditions affecting proportionality, …, even if it lacks specific evidence from the student’s work”; second, the score assignment, by saying “classify as 2”. It additionally adds details from the example by noting “(e.g., same time, same angle of the sun)” and includes the whole answer to make clear which answers this rule applies to. In short, when adding rules, Refiner covers new types of answers by summarizing their patterns and explaining word- or phrase-level details.

Next, we show an example of Refiner editing the guidelines. This happens when Refiner determines that a defective rule can be adjusted to give a better explanation or more complete details. As the example in Table 6 shows, part of the given rule is misleading, so Refiner revises this part. The changed rule first redefines the scenarios in which the rule applies, then provides a detailed explanation by citing the answer and illustrating how the answer falls into the pattern category defined in the new rule to receive the correct score.

Table 6: Case Study for Refiner
Action Refiner’s generated rules
Add “1. Equal Proportions and Conditions:
- Rule: If the teacher’s response explains equal proportions and references conditions affecting proportionality (e.g., same time, same angle of the sun), classify as 2, even if it lacks specific evidence from the student’s work.
- Example: They are equal proportions because they are situations during the same time and the same angle of the sun.”
Edit Before Edit: “- Award 1 point if … explicit evidence. For instance, if the response mentions that the student understands the unit rate, which is related to the concept of proportionality.” (misleading statement)
After Edit: “- Award 1 point if … explicit evidence. For example, if the response states that the student might have a limited understanding of proportionality, it should be awarded 1 point. For instance, in the response “their answer makes sense only if there is a proportional relationship between the height of the object and the length of the shadow,” teacher mentioned “limited/partial understanding of the proportional relationship” but lacked depth.” (ambiguity resolved)
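
The add and edit actions in Table 6 can be pictured as operations on a list of rules. This is a simplified sketch under stated assumptions: in GradeOpt the Refiner is an LLM agent that rewrites guideline text, so the hypothetical `Guideline` class below only models the outcome of its actions, not how they are generated. The elided rule text is kept elided as in Table 6.

```python
from dataclasses import dataclass, field


@dataclass
class Guideline:
    rules: list[str] = field(default_factory=list)

    def add(self, rule: str) -> None:
        """Append a rule covering a new type of answer."""
        self.rules.append(rule)

    def edit(self, index: int, new_rule: str) -> None:
        """Replace a defective rule with a revised version."""
        self.rules[index] = new_rule

    def remove(self, index: int) -> None:
        """Drop a rule that is no longer wanted."""
        del self.rules[index]

    def render(self) -> str:
        """Serialize the rules back into guideline text for the grader prompt."""
        return "\n".join(f"- {rule}" for rule in self.rules)


guideline = Guideline(["Award 1 point if ... explicit evidence."])

# "Add" action: the rule suggested by Reflector in Table 5.
guideline.add(
    "If the teacher's response explains equal proportions and references "
    "conditions affecting proportionality (e.g., same time, same angle of "
    "the sun), classify as 2, even if it lacks specific evidence from the "
    "student's work."
)

# "Edit" action: revise the first rule in place (revised text abbreviated).
guideline.edit(0, "Award 1 point if ... explicit evidence. For example, ...")

print(guideline.render())
```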

7 Conclusion

This paper explores fully automated guideline optimization, leveraging LLM techniques including reflection and prompt engineering to solve ASAG tasks. We decompose the ASAG procedure into two steps: guideline optimization and grading. Specifically, we focus on automatic guideline optimization to avoid the manual effort of composing a task-optimal guideline. To further avoid labeling a large amount of data, we propose a two-phase “train and test-adapt” procedure that tunes a guideline on a small training set and verifies that the optimized output is reliable for large-scale grading. The proposed GradeOpt is a multi-agent guideline optimization system that iteratively leads the LLM to reflect on mistakes, learn answer patterns, and make improving modifications. Empirical experiments on two pedagogical datasets demonstrate the effectiveness of GradeOpt.

8 Acknowledgments

This work was supported in part by the National Science Foundation under Grant No. 1813760, No. 2405483, No. 2200757 and No. 2234015. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We thank Dr. Clare Carlson for help in revising one of the scoring rubrics.
