GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning

Yutong Li (University of Electronic Science and Technology of China, Chengdu, China), Yitian Zhou (University of Electronic Science and Technology of China, Chengdu, China), Xudong Wang (University of Electronic Science and Technology of China, Chengdu, China), Guo Chen (University of Electronic Science and Technology of China, Chengdu, China), and Caiyan Qin (School of Robotics and Advanced Manufacture, Harbin Institute of Technology, Shenzhen, Shenzhen, China)

Abstract.

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi-step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context-aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph-based framework that integrates a graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi-step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with the GNN and leveraging semantic matching for theorem selection, our framework enables context-aware, interpretable, and structured reasoning in a closed-loop manner. Experiments on various question-answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi-step reasoning, validating the effectiveness and generalizability of our approach.


1. Introduction

Large language models (LLMs) such as GPT-4 (openai2023gpt4), Claude (anthropic2023claude), and Gemini (google2023gemini) have demonstrated remarkable capabilities across a wide range of natural language processing tasks, including text generation, question answering, and code synthesis. Beyond surface-level fluency, recent advances show that LLMs can perform multi-step reasoning (wei2022chain), including mathematical problem solving (cobbe2021training; lewkowycz2022solving) and formal theorem proving (yang2023proofnet; han2021proofwriter). These reasoning tasks often involve selecting appropriate theorems based on initial conditions and generating intermediate conclusions through iterative inference, ultimately leading to the final result. These capabilities are largely enabled by prompt-based techniques such as Chain-of-Thought prompting (kojima2022large), self-consistency (wang2022self), and retrieval-augmented generation (lewis2020retrieval).

Despite these promising advances, current LLMs still struggle to manage and evolve structured reasoning states over multiple inference steps. In particular, they lack an explicit mechanism to represent intermediate conclusions and control the application of relevant theorems in a coherent reasoning trajectory. For example, (han2021proofwriter) and (yang2023proofnet) leverage symbolic structures to guide reasoning, but rely on pre-defined proof sketches or static logic graphs, without dynamically modeling intermediate conclusions. Similarly, graph-based retrieval methods (tian2023graph) retrieve theorems based on fixed queries, without dynamically modeling the evolving context or maintaining an explicit reasoning state.

These limitations indicate that current LLM-based methods often treat theorem invocation or knowledge retrieval as isolated operations rather than as components of a coherent and evolving reasoning trajectory. Consequently, LLMs often fail on tasks that require iterative theorem matching and structured multi-hop reasoning over dynamically evolving problem states, limiting their effectiveness in mathematical proof construction and formal deductive tasks.

To overcome these challenges, we propose GraphMind, a novel reasoning framework that treats the entire deduction process as a dynamically evolving graph. We design the reasoning graph inspired by recent efforts to integrate symbolic structures into neural reasoning systems (yang2023proofnet; tian2023graph; abdaljalil2025theorem). In this graph, nodes represent reasoning units including initial conditions, intermediate conclusions, and invoked theorems, while edges encode logical dependencies among them. We employ a GNN module to encode the current reasoning state and fuse contextual signals, enabling informed theorem selection and iterative conclusion generation.

Our method reasons in a closed-loop pipeline: starting from a graph whose nodes are the initial conditions, the GNN encodes and fuses the current reasoning state into a latent representation. This representation is then used to retrieve the most relevant theorem from a predefined theorem library via a semantic matching module. The selected theorem, combined with the current state representation, forms a new prompt that is passed to the LLM to generate an intermediate conclusion. This conclusion, along with the invoked theorem, is added to the graph as new nodes, and corresponding edges are created to reflect the logical flow. The graph is thus updated, and the process repeats. This dynamic graph-based approach enables context-aware theorem invocation, structured reasoning tracking, and interpretable proof trajectories.

Prior works have explored incorporating external knowledge retrieval (lewis2020retrieval), graph-based representations (tian2023graph), and theorem-guided reasoning frameworks (abdeljalil2023theorem) to address the limitations of LLMs in structured and iterative reasoning. However, these approaches often rely on static knowledge retrieval, linear prompting, or isolated reasoning steps without dynamically updating the reasoning state, which limits their ability to capture the evolving context and control the theorem selection process effectively. In contrast, our GraphMind method introduces a unified framework where symbolic structure and neural generation are deeply integrated. By combining GNN-based graph encoding, semantic theorem selection, and LLM-guided generation in a closed-loop pipeline, our framework explicitly connects each reasoning step to both its logical predecessors and its future applications, supporting a more accurate, controllable, and generalizable inference process.

Overall, our main contributions are presented as follows:

• This study proposes an explicit reasoning framework that models the entire theorem-driven reasoning process as a dynamically evolving graph. By explicitly tracking intermediate conclusions and applying theorems within a heterogeneous graph, our framework enables interpretable and structured multi-step reasoning.
• To the best of our knowledge, this work is the first to integrate GNN with LLMs for dynamic reasoning state fusion within a closed-loop inference system. By leveraging graph-based theorem selection and prompt-driven conclusion generation, our system achieves context-aware reasoning and improved stepwise consistency.
• This study demonstrates the effectiveness and generalizability of our approach through comprehensive experiments on question-answering (QA) datasets from various domains, showing that our method significantly outperforms existing baselines in terms of accuracy.

2. Related Work

2.1. Theorem-Guided Reasoning in LLMs

Several recent works have explored incorporating formal or semi-formal theorems into the reasoning processes of LLMs. ProofWriter (han2021proofwriter_acl) constructs synthetic deductive and abductive reasoning tasks using predefined logical rules, enabling LLMs to perform multi-step inference in natural language. ProofNet (yang2023proofnet_acl) builds upon this by introducing a formal theorem corpus and a retrieval-based mechanism for selecting relevant premises and generating proofs. Theorem-of-Thought (abdeljalil2023theorem_aaai) further introduces a multi-agent architecture, assigning distinct reasoning roles such as theorem selection and proof synthesis to separate components.

While these approaches demonstrate the potential of theorem-based reasoning, they often treat theorem retrieval and reasoning context as static, limiting their ability to adapt across inference steps. In parallel, recent studies have investigated ways to structure the reasoning context to improve multi-step inference. Some methods rely on symbolic memory or structured annotations to guide model decisions, while others explore integration of external knowledge bases or intermediate state tracking. Compared to these approaches, our method introduces a dynamically evolving graph to represent the current reasoning state, where intermediate conclusions are explicitly encoded and updated at each step. This design enables theorem retrieval to be conditioned on the structured context, allowing for more context-aware and iterative reasoning.

2.2. Graph-Structured Representations for Multi-Step Inference in LLMs

Multi-step reasoning with LLMs presents significant challenges due to the need for maintaining and integrating intermediate inference states over multiple reasoning paths. To address these challenges, graph-based reasoning methods have attracted increasing attention, as they enable explicit modeling of reasoning states and dependencies in structured graph formats.

Graph-of-Thought (ma2024think_acl) proposes a framework that represents multiple reasoning paths within a graph, allowing exploration and synthesis of diverse inference trajectories. Similarly, Graph Neural Prompting (tian2023graph) leverages GNNs to incorporate relational dependencies among entities into prompt construction, thereby enhancing the conditioning of LLM outputs. Additionally, Graph-Program (yao2023graphprogram_icml) treats reasoning as program synthesis over evolving symbolic graphs, coordinating LLM calls based on graph transformations.

These approaches demonstrate the effectiveness of graph representations in structuring and guiding multi-step reasoning. However, most existing methods maintain static graph structures throughout inference, which may limit their capacity to capture the evolving reasoning context dynamically. Building upon these advances, our method introduces a dynamic reasoning graph that is incrementally updated during inference. By jointly applying graph neural networks and theorem selection at each step, our framework forms a tightly coupled closed-loop system that continuously evolves the reasoning state, explicitly encodes intermediate conclusions, and supports context-aware theorem application.

3. Method

3.1. Overview

Figure 1. Overview of the proposed GraphMind framework, consisting of four core modules: graph encoding, theorem matching, conclusion generation, and graph expansion. The pipeline encodes the evolving reasoning state into a graph structure, selects context-aware theorems, generates intermediate conclusions via LLMs, and updates the graph with new nodes and edges.

We propose a closed-loop framework for structured reasoning that integrates GNN and LLMs within a dynamically evolving reasoning graph. Existing LLM-based reasoning systems often struggle to maintain consistency across multiple inference steps, lacking an explicit mechanism to track intermediate conclusions or leverage prior deductions in a structured manner. Meanwhile, symbolic systems with theorem retrieval modules often fail to adapt flexibly to complex natural language contexts and evolving reasoning trajectories.

To address these limitations, we formulate reasoning as a graph-based iterative process. In our framework, nodes represent reasoning elements such as conditions, theorems, and intermediate conclusions, while edges capture their logical dependencies. At each step, a relational GNN encodes the current reasoning graph to produce a global state representation, which is then used to match contextually relevant theorems from a structured library. The selected theorem, combined with the current reasoning context, guides a pretrained LLM to generate a new intermediate conclusion. This conclusion is then added back into the graph, updating the reasoning state and enabling further inference.

By combining symbolic graph reasoning with large-scale language modeling, our method addresses two key challenges: (1) how to maintain a coherent and interpretable reasoning state across multiple steps, and (2) how to dynamically retrieve and apply structured knowledge in a context-aware manner. This closed-loop architecture enables step-by-step inference grounded in structured logic while maintaining global coherence and traceability throughout the reasoning process. The overview of the entire pipeline is illustrated in Figure 1.

3.2. Reasoning State Graph Construction

Structured and interpretable multi-step inference requires an explicit mechanism to track reasoning progress and support theorem-driven deduction. To address this, the reasoning process is formulated as an iterative expansion over a heterogeneous graph, a type of graph that contains multiple types of nodes and edges to represent semantically distinct entities and relations. This structure is well-suited for modeling logical inference, as it allows the representation of diverse reasoning elements, such as conditions, theorems, and conclusions, and their directed dependencies within a unified framework. By explicitly encoding this evolving reasoning state, the heterogeneous graph enables interpretable tracking of inference steps and provides a natural substrate for theorem application and conclusion generation.

At each reasoning step $t$ , the current state is represented as a directed heterogeneous graph:

(1)

\mathcal{G}^{(t)}=(\mathcal{V}^{(t)},\mathcal{E}^{(t)},\mathcal{R}),

where the node set $\mathcal{V}^{(t)}$ consists of three distinct categories: initial conditions, intermediate conclusions, and applied theorems. This design reflects the semantic heterogeneity inherent in formal reasoning tasks. Edges $\mathcal{E}^{(t)}$ are directed and typed, encoding logical relations such as UseCond (conditions used by a theorem) and Infers (theorem infers a conclusion). The relation set $\mathcal{R}$ specifies the allowed edge types. By separating node and edge semantics, the heterogeneous graph structure enables finer-grained modeling of reasoning dynamics.

To enable structural reasoning over this graph, each node $v_{i}\in\mathcal{V}^{(t)}$ is associated with a representation vector $x_{i}^{(k,t)}\in\mathbb{R}^{d}$ , where $k$ is the GNN layer index and $d$ is the embedding dimension. This embedding formulation allows information to propagate across the graph and models both local interactions and global structure. A readout function aggregates the final-layer node embeddings to produce a compact summary of the current reasoning state:

(2)

r^{(t)}=\texttt{Readout}\left(\left\{x_{i}^{(K,t)}\right\}\right),

where $K$ is the number of layers in the graph neural network. The resulting representation $r^{(t)}$ serves as a semantic fingerprint of the reasoning context at step $t$ .

A fixed theorem library $\mathcal{T}=\{T_{j}\}$ is maintained to support theorem-based inference. Each theorem $T_{j}$ is encoded into a dense vector $\vec{t}_{j}$ using a shared encoder, producing embeddings that remain constant during inference. The similarity between the current reasoning state and the candidate theorems is computed via a metric function such as cosine similarity:

(3)

T^{*}=\arg\max_{T_{j}\in\mathcal{T}}\texttt{sim}(r^{(t)},\vec{t}_{j}).

This formulation enables efficient and scalable theorem selection by retrieving the most relevant candidate through similarity computation.

Based on the selected theorem $T^{*}$ and the current graph representation $r^{(t)}$ , a structured prompt is constructed and fed into the LLM to generate an intermediate conclusion $z^{(t)}$ . The generation process thus integrates symbolic knowledge (from $T^{*}$ ) and contextual understanding (from $r^{(t)}$ ).

Accordingly, the reasoning graph is expanded to reflect this new step. The selected theorem $T^{*}$ and generated conclusion $z^{(t)}$ are added to the graph as new nodes, and edges are created to encode their logical connections to prior content:

(4)

\mathcal{V}^{(t+1)}=\mathcal{V}^{(t)}\cup\{T^{*},z^{(t)}\},

(5)

\mathcal{E}^{(t+1)}=\mathcal{E}^{(t)}\cup\left\{(c_{i},T^{*},\texttt{UseCond})\right\}\cup\left\{(T^{*},z^{(t)},\texttt{Infers})\right\},

where $c_{i}$ refers to the subset of supporting conditions retrieved from the current graph. This iterative expansion process enables the system to preserve the reasoning trajectory and supports traceable, multi-step inference grounded in structured logic.
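As an illustration of Eqs. (1), (4), and (5), the following minimal Python sketch stores the reasoning graph as typed nodes and directed, typed edges and applies one expansion step. The class name `ReasoningGraph`, the node-kind strings, and the node identifiers are hypothetical choices made for this example; only the relation names UseCond and Infers come from the text.

```python
from dataclasses import dataclass, field

# Relation set R from the text; the node-kind labels are illustrative names
# for the three node categories of V^(t).
RELATIONS = ("UseCond", "Infers")
NODE_KINDS = ("condition", "theorem", "conclusion")


@dataclass
class ReasoningGraph:
    """Directed, typed graph G^(t) = (V^(t), E^(t), R)."""
    nodes: dict = field(default_factory=dict)  # node id -> (kind, text)
    edges: set = field(default_factory=set)    # (source id, target id, relation)

    def add_node(self, node_id, kind, text):
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = (kind, text)

    def add_edge(self, src, dst, relation):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must already be in the graph")
        self.edges.add((src, dst, relation))

    def expand(self, theorem_id, theorem_text, conclusion_id, conclusion_text, support_ids):
        """One step of Eqs. (4)-(5): add T* and z^(t), then the typed edges."""
        self.add_node(theorem_id, "theorem", theorem_text)
        self.add_node(conclusion_id, "conclusion", conclusion_text)
        for c in support_ids:
            self.add_edge(c, theorem_id, "UseCond")
        self.add_edge(theorem_id, conclusion_id, "Infers")


# Toy example: two conditions, one theorem application, one conclusion.
g = ReasoningGraph()
g.add_node("c1", "condition", "The train travels at 60 km per hour")
g.add_node("c2", "condition", "The trip lasts 3 hours")
g.expand("T1", "distance = speed * time", "z0", "The train travels 180 km", ["c1", "c2"])
```

In this sketch every theorem application receives its own node identifier, so the same library theorem can appear more than once along a reasoning trajectory; whether the original system shares or duplicates theorem nodes is not stated in the text.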

3.3. GraphMind: Multi-Step Reasoning Framework over Evolving Graphs

We adopt a closed-loop design for structured reasoning, which iteratively updates a heterogeneous graph to reflect the evolving reasoning process. At each step $t$ , the system operates on the current reasoning graph $\mathcal{G}^{(t)}$ and executes four core modules: graph encoding, theorem matching, conclusion generation, and graph expansion.

3.3.1. Graph Encoding

To capture the evolving state of the reasoning process, we encode the heterogeneous graph $\mathcal{G}^{(t)}$ using a relational GNN. This GNN accounts for node types (e.g., conditions, theorems, conclusions) and relation types (e.g., UseCond, Infers) to enable structured message passing. Each node $v_{i}\in\mathcal{V}^{(t)}$ is initialized with an embedding $x_{i}^{(0,t)}$ and updated across $K$ layers:

(6)

x_{i}^{(k+1,t)}=\mathrm{GNNLayer}^{(k)}\left(x_{i}^{(k,t)},\{x_{j}^{(k,t)}:(v_{j},v_{i},r)\in\mathcal{E}^{(t)}\}\right),

where $r\in\mathcal{R}$ denotes the relation type controlling message passing. A readout function aggregates all updated node features to produce the global reasoning state:

(7)

r^{(t)}=\mathrm{Readout}\left(\{x_{i}^{(K,t)}:v_{i}\in\mathcal{V}^{(t)}\}\right).

By structuring the evolving reasoning trace as a graph and embedding it with a GNN, the model maintains a holistic understanding of the logical context and dependencies across steps.
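Eq. (6) leaves the exact form of the GNN layer open. One plausible instantiation, assumed here purely for illustration, is an R-GCN-style update in which each relation type has its own weight matrix and incoming messages are averaged per relation, followed by a mean readout for Eq. (7). The function names, the ReLU nonlinearity, and the mean readout are assumptions of this sketch, not details given in the paper.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def rgcn_layer(x, edges, W_self, W_rel):
    """One relation-aware update in the spirit of Eq. (6).

    x:      node id -> feature vector x_i^(k,t)
    edges:  iterable of (source, target, relation) triples
    W_self: weight matrix applied to the node's own features
    W_rel:  relation name -> weight matrix for messages of that type
    """
    out = {}
    for i, xi in x.items():
        h = matvec(W_self, xi)
        for rel, W in W_rel.items():
            # Incoming neighbours of node i under relation `rel`.
            nbrs = [x[s] for (s, d, r) in edges if d == i and r == rel]
            if not nbrs:
                continue
            mean = [sum(col) / len(nbrs) for col in zip(*nbrs)]
            h = [a + b for a, b in zip(h, matvec(W, mean))]
        out[i] = relu(h)
    return out


def mean_readout(x):
    """Eq. (7) with Readout chosen as the mean over all nodes (an assumption)."""
    vecs = list(x.values())
    return [sum(col) / len(vecs) for col in zip(*vecs)]


# Toy graph c1 -UseCond-> T1 -Infers-> z0 with identity weights.
I2 = [[1.0, 0.0], [0.0, 1.0]]
x0 = {"c1": [1.0, 0.0], "T1": [0.0, 1.0], "z0": [0.0, 0.0]}
edges = [("c1", "T1", "UseCond"), ("T1", "z0", "Infers")]
x1 = rgcn_layer(x0, edges, I2, {"UseCond": I2, "Infers": I2})
r = mean_readout(x1)
```

With identity weights the theorem node absorbs the condition's features and the conclusion node absorbs the theorem's, which makes the direction of message passing along UseCond and Infers edges easy to inspect.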

3.3.2. Theorem Matching

Given the encoded reasoning state $r^{(t)}$ , the system performs semantic retrieval over a structured theorem library $\mathcal{T}=\{T_{j}\}$ . Each theorem is pre-encoded into a dense vector $\vec{t}_{j}$ , and a similarity-based scorer ranks candidates:

(8)

s_{j}^{(t)}=\mathrm{sim}(r^{(t)},\vec{t}_{j}),\ T^{*}=\arg\max_{T_{j}\in\mathcal{T}}s_{j}^{(t)}.

This graph-conditioned retrieval ensures that selected theorems are context-aware and relevant to the current logical state, rather than relying on static or rule-based selection.
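The retrieval step in Eq. (8) can be sketched in a few lines of Python, assuming cosine similarity as the metric and a theorem library stored as a dictionary from theorem ID to a pre-computed vector. The theorem IDs and the three-dimensional vectors below are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def select_theorem(state, library):
    """Score every theorem against r^(t) and return the arg max, as in Eq. (8)."""
    scores = {tid: cosine(state, vec) for tid, vec in library.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical library with three theorems and a state vector r^(t).
library = {
    "T_rate": [1.0, 0.0, 0.0],
    "T_percent": [0.0, 1.0, 0.0],
    "T_sum": [0.6, 0.8, 0.0],
}
state = [0.9, 0.1, 0.0]
best, scores = select_theorem(state, library)
```

Because the theorem vectors are fixed during inference, they could be normalized once in advance so that each retrieval reduces to dot products.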

3.3.3. Conclusion Generation

By decoupling structural reasoning from natural language generation, we leverage LLMs for open-ended synthesis while retaining symbolic control from the graph.

The selected theorem $T^{*}$ , combined with the current reasoning context, is formulated into a structured prompt and input to a pretrained LLM. The LLM then generates the next logical conclusion:

(9)

z^{(t)}=\mathrm{LLM}\left(\mathrm{Prompt}(r^{(t)},T^{*})\right).

A typical prompt template is: "Given conditions: [Cond1], [Cond2], ..., and Theorem: [T*], derive the next conclusion."
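A minimal sketch of how this template might be filled is shown below. It assumes that the conditions are passed as the text of the supporting nodes rather than as the vector $r^{(t)}$; the helper name `build_prompt` is hypothetical.

```python
def build_prompt(conditions, theorem):
    """Fill the template quoted above with condition texts and the selected theorem."""
    cond_text = ", ".join(conditions)
    return (
        f"Given conditions: {cond_text}, and Theorem: {theorem}, "
        "derive the next conclusion."
    )


prompt = build_prompt(["x = 3", "y = 2x"], "substitution")
```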

3.3.4. Graph Expansion

The newly generated conclusion $z^{(t)}$ and applied theorem $T^{*}$ are integrated back into the reasoning graph, forming a traceable and interpretable reasoning path. Logical dependencies are encoded as typed edges:

(10)

\mathcal{V}^{(t+1)}=\mathcal{V}^{(t)}\cup\{T^{*},z^{(t)}\},

(11)

\mathcal{E}^{(t+1)}=\mathcal{E}^{(t)}\cup\left\{(c_{i},T^{*},\texttt{UseCond})\right\}\cup\left\{(T^{*},z^{(t)},\texttt{Infers})\right\},

where $c_{i}$ denotes the supporting conditions retrieved from the graph.

This step closes the reasoning loop, structurally encoding the LLM-generated conclusion for further steps, enabling explainable, traceable reasoning with cumulative memory.

In summary, we establish a structured reasoning loop by encoding the evolving graph state, identifying context-relevant theorems, generating conclusions via LLMs, and incorporating new reasoning elements into the graph for subsequent steps. We summarize the entire framework in Algorithm 1.

Algorithm 1 GraphMind Framework

Input: Initial graph-based reasoning state $\mathcal{G}^{(0)}$ , theorem library $\mathcal{T}$
Output: Final graph $\mathcal{G}^{(T)}$ with accumulated conclusions

Loop until reasoning terminates:
  1. Graph Encoding: Use relational GNN to obtain current reasoning state embedding $r^{(t)}$ from graph $\mathcal{G}^{(t)}$
  2. Theorem Matching: Compute similarity scores $s_{j}=\mathrm{sim}(r^{(t)},\vec{t}_{j})$ and select top theorem $T^{*}$
  3. Conclusion Generation: Use LLM to generate new conclusion $z^{(t)}$ based on $T^{*}$ and its premises
  4. Graph Expansion: Add $T^{*}$ , $z^{(t)}$ , and corresponding edges into graph $\mathcal{G}^{(t)}$
return Final graph $\mathcal{G}^{(T)}$
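The control flow of Algorithm 1 can be sketched as a short Python loop in which each of the four modules is passed in as a callable. The function name `run_graphmind`, the `is_done` stopping predicate, and the default step limit are assumptions of this sketch; the text only states that reasoning runs until termination, bounded by a maximum number of inference steps.

```python
def run_graphmind(graph, encode, match, generate, expand, is_done, max_steps=8):
    """Closed loop of Algorithm 1; all module callables are supplied by the caller."""
    for _ in range(max_steps):
        state = encode(graph)                        # 1. graph encoding -> r^(t)
        theorem = match(state)                       # 2. theorem matching -> T*
        conclusion = generate(graph, theorem)        # 3. conclusion generation -> z^(t)
        graph = expand(graph, theorem, conclusion)   # 4. graph expansion -> G^(t+1)
        if is_done(graph, conclusion):
            break
    return graph


# Toy stand-ins so the loop can be run without a GNN or an LLM:
# the "graph" is a list of labels and the "state" is simply its length.
final = run_graphmind(
    ["c1"],
    encode=lambda g: len(g),
    match=lambda s: f"T{s}",
    generate=lambda g, th: f"z-from-{th}",
    expand=lambda g, th, z: g + [th, z],
    is_done=lambda g, z: len(g) >= 5,
)
```

Passing the modules as callables keeps the loop independent of any particular encoder, retriever, or LLM client, which also makes the ablation of a single module straightforward.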

3.4. Training Objectives and Optimization

To enable accurate theorem retrieval conditioned on the evolving reasoning state, we train the framework to align the graph-based state representation $r^{(t)}$ with the corresponding ground-truth theorem embedding $\vec{t}^{+}$ at each step. A similarity-based scorer for $\vec{t}^{+}$ is defined as:

(12)

s^{{(t)}+}=\mathrm{sim}(r^{(t)},\vec{t}^{+}),

where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity. This alignment is optimized using the InfoNCE loss (oord2018representation), which promotes semantic similarity between the current reasoning state and the correct theorem, while discriminating against negative examples. The loss function is defined as:

(13)

\mathcal{L}_{\mathrm{match}}^{(t)}=-\log\left(\frac{\exp(s^{{(t)}+}/\tau)}{\exp(s^{{(t)}+}/\tau)+\sum_{j\in\mathcal{N}}\exp(s_{j}^{(t)}/\tau)}\right),

where $\tau$ is a temperature scaling factor.

To construct negative samples for contrastive training, a fixed-size subset $\mathcal{N}$ is randomly drawn from the remaining theorems in the library $\mathcal{T}\setminus\{T^{+}\}$ . In-batch negative sampling is adopted for computational efficiency and variance reduction.

During training, the model iteratively updates its parameters to minimize the contrastive loss at each reasoning step. This encourages the graph encoder and theorem matcher to develop context-aware representations that facilitate accurate and discriminative theorem selection throughout the reasoning process.
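For concreteness, the per-step loss in Eq. (13) can be computed as in the sketch below, which takes the positive score $s^{(t)+}$ and the negative scores $s_{j}^{(t)}$ as plain numbers. The default temperature of 0.1 is an assumed value chosen for the example, since the text does not report $\tau$; the log-sum-exp shift is a standard numerical-stability device and does not change the result.

```python
import math


def info_nce(pos_score, neg_scores, tau=0.1):
    """InfoNCE loss of Eq. (13) for one reasoning step.

    pos_score:  similarity between r^(t) and the ground-truth theorem
    neg_scores: similarities between r^(t) and the sampled negatives
    tau:        temperature (0.1 is an illustrative default)
    """
    logits = [pos_score / tau] + [s / tau for s in neg_scores]
    m = max(logits)  # shift by the max logit for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


loss_confident = info_nce(0.9, [0.1, 0.2])
loss_uncertain = info_nce(0.3, [0.1, 0.2])
loss_uniform = info_nce(0.5, [0.5, 0.5, 0.5])
```

When the positive and all negatives receive the same score, the loss equals $\log(1+|\mathcal{N}|)$, which gives a useful reference point when monitoring training.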

4. Experiments

In this section, we present the experimental results to evaluate the effectiveness and generalizability of our proposed GraphMind method. Our experiments focus on two main aspects:

• We validate the overall performance and superiority of GraphMind through comparisons with strong baselines across multiple datasets from different domains.
• We conduct ablation studies on the GNN-based theorem selection module to assess its contribution to reasoning performance.

4.1. Experimental Setup

4.1.1. Datasets and Evaluation

We conduct experiments on three question-answering (QA) datasets from distinct domains: GSM8K (cobbe2021training) (mathematics), FinQA (chen2021finqa) (finance), and LegalBench (guha2023legalbench) (law).

• GSM8K is a collection of 8,500 high-quality grade-school math problems. Each question typically requires 2 to 8 reasoning steps to solve, where intermediate steps are often guided by basic mathematical theorems. This dataset evaluates models’ capability in basic mathematical reasoning and logical deduction.
• FinQA is a large-scale dataset derived from financial reports, where question answering may require complex numerical reasoning and difficult financial concept understanding. It assesses financial reasoning ability under semi-structured inputs (e.g., tables combined with textual context), measuring the models’ applicability to real-world financial scenarios.
• LegalBench is a benchmark covering a broad set of legal reasoning tasks. It includes diverse textual formats, task structures, legal domains, and levels of reasoning complexity. This dataset examines models’ performance in legal-domain text understanding and logical reasoning, with a particular focus on formal reasoning tasks such as deductive and inductive inference.

We employ overall accuracy as the primary evaluation metric to assess the performance of the baselines and our methods. Overall accuracy measures whether the model successfully derives a final conclusion that matches the target answer, thus reflecting its end-to-end reasoning correctness over the entire test set. The results are averaged over three testing runs to ensure stability and reliability.
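The metric can be written down as a short sketch, assuming exact string matching between the predicted and target answers after whitespace stripping; the matching rule and the function names are assumptions made for this example.

```python
def accuracy(predictions, targets):
    """Percentage of final answers that exactly match the target answer."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    hits = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return 100.0 * hits / len(targets)


def mean_accuracy(runs, targets):
    """Average the overall accuracy over several independent test runs."""
    return sum(accuracy(run, targets) for run in runs) / len(runs)


# Toy example with three runs over four questions.
targets = ["18", "7", "42", "3"]
runs = [
    ["18", "7", "42", "3"],   # 4 of 4 correct
    ["18", "7", "41", "3"],   # 3 of 4 correct
    ["18", "8", "41", "3"],   # 2 of 4 correct
]
avg = mean_accuracy(runs, targets)
```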

4.1.2. Data Preprocessing

To support training and evaluation on multi-step theorem reasoning tasks, we systematically reconstruct and preprocess the original datasets. Specifically, each dataset is split into 80% training and 20% testing. The training set is enhanced with explicit supervision signals, while the test set is designed to assess model generalization in simulated real-world settings without access to theorem annotations.

In the training set, each sample consists of the following fields: (1) a natural language question that requires multi-step deduction; (2) a list of relevant premises containing factual or numerical background; (3) a target conclusion as the correct answer; and (4) a complete sequence of structured inference steps. Each reasoning step includes a natural language description of the step, the applied theorem ID, the IDs of used premises or intermediate conclusions, and the corresponding result.

To support consistent theorem usage across problems, we construct a global theorem set shared across all training samples. This set is obtained by automatically extracting candidate theorems from the original data using a language model, followed by semantic clustering to merge similar entries. The resulting set includes approximately 80 distinct theorems, each described by a natural language statement and a unique ID.

The test set is sampled from the test split in the original dataset and preserves its natural structure. Each test sample includes the question, the corresponding ground-truth answer, and a set of premises. Unlike the training set, test samples do not include any theorem annotations or inference steps. This design requires the model to autonomously select appropriate theorems, construct intermediate conclusions, and generate the final answer without explicit guidance, thus simulating real-world problem-solving scenarios and testing the model's ability to generalize the learned reasoning strategies.

To ensure data quality and theorem coverage, we apply a three-step pipeline: (1) Theorem Extraction and Compression: We use an LLM (e.g., GPT) to identify potential theorems per sample and cluster semantically similar ones, producing a compact and generalizable theorem library. (2) Reasoning Chain Generation: For each sample, we use LLMs to generate stepwise inference traces based on the extracted theorems, ensuring each step specifies the used theorem and premises. (3) Sample Filtering and Balancing: We select a subset of samples that evenly covers all theorems and constrain the number of samples per theorem (e.g., 200-600) to mitigate long-tail distribution issues in training.

4.1.3. Baselines

• Chain-of-Thought (CoT) (kojima2022large): This framework encourages LLMs to generate a sequence of intermediate reasoning steps in natural language before producing the final answer. By decomposing multi-step problems into smaller sub-problems, CoT significantly improves performance on arithmetic, commonsense, and symbolic reasoning tasks, especially for large LLMs.
• Tree-of-Thought (ToT) (yao2023tree): This framework generalizes CoT prompting and enables the LLM to explore multiple reasoning paths in parallel with a tree-like structure. This allows the model to use tree search, backtracking, and self-evaluation to select promising branches.
• Graph-of-Thought (GoT) (ma2024think_acl): This framework models intermediate thought units as graph nodes, with edges representing logical dependencies between them. This non-sequential structure allows combining diverse thought units, leveraging feedback loops, and distilling insights from the graph representation.
• Active-Prompt (diao2024active): This framework builds upon CoT prompting by incorporating uncertainty-guided active learning. For each training example, the model generates multiple candidate reasoning chains and selects the most uncertain samples based on answer disagreement or other uncertainty measures for manual CoT annotation. These high-quality exemplars are then used for few-shot prompting, leading to significant performance gains on complex reasoning tasks.
• LLM-ARC (kalyanpur2024llm): This framework combines an LLM with an automated reasoning critic (ARC) for neuro-symbolic reasoning. The LLM generates logical programs and corresponding semantic tests, while the ARC executes them and provides feedback to iteratively refine the reasoning process.

For comparison experiments, we use GPT-3.5-turbo as the backbone model. Meanwhile, ablation experiments on the GNN-based theorem selection module are conducted using both GPT-4o-mini (hurst2024gpt) and GPT-3.5-turbo to validate generalizability.

4.1.4. Hyperparameters and Implementation Details

We adopt the following key hyperparameters and experimental settings during training and inference:

• Training epochs (train_epoch): The number of full passes over the training set during model optimization.
• Batch size (batch_size): The number of samples used in each parameter update, balancing training efficiency and GPU memory usage.
• Learning rate (learning_rate): The step size for gradient updates of the GNN parameters. We use a relatively large learning rate to accelerate convergence.
• Maximum inference steps (max_inference_steps): The upper bound on the number of reasoning steps allowed during multi-step deduction, preventing infinite loops.
• Temperature (temperature): A scaling factor applied to the similarity distribution in the theorem selection stage, enhancing the discriminative ability of the model in multi-class retrieval.
• Sample count range per label (target_samples_per_label_min/target_samples_per_label_max): Controls the number of samples drawn for each theorem label to mitigate class imbalance during training.

During inference, the text-embedding-ada-002 model from OpenAI is employed as the text encoder, which outputs 1536-dimensional embeddings. These embeddings are used for fact representation and similarity computation with candidate theorems. In addition, we perform data encoding and preprocessing before training, and apply a label balancing strategy (balance_labels=True) to improve the model's robustness on underrepresented theorem classes.

4.2. Performance Evaluation

Table 1 presents a detailed comparison of our method against several state-of-the-art reasoning baselines, including Chain-of-Thought (CoT), Graph-of-Thought (GoT), Tree-of-Thought (ToT), Active-Prompt, and LLM-ARC, across three representative datasets from various domains: GSM8K, FinQA, and LegalBench. Our approach consistently outperforms all baselines across all benchmarks, with particularly notable improvements on FinQA and LegalBench.

As shown in Table 1, the performance gains from GoT and ToT over the standard CoT baseline indicate the importance of dynamically constructing reasoning structures during the inference process. Specifically, GoT and ToT outperform CoT by approximately 1.7-2.0 percentage points on GSM8K and by roughly 0.8-2.0 points on FinQA and LegalBench. These improvements suggest that organizing intermediate reasoning steps into a structured graph or tree, rather than the basic chain used in CoT prompting, facilitates better information decomposition and local consistency. In particular, the ability to dynamically evolve the graph to expand and reuse reasoning sub-nodes leads to better performance in multi-step problems, where different parts of the problem may share intermediate conclusions.

Although our method achieves the best overall performance, its absolute accuracy on FinQA and LegalBench is lower than on GSM8K. We attribute this discrepancy to the semantic variability of natural language in these datasets. In FinQA and LegalBench, correctness is often determined not just by factual consistency but also by language understanding. LLMs may generate semantically correct but lexically divergent answers, such as paraphrased justifications and varied reasoning styles, which are penalized by strict matching metrics. In contrast, GSM8K focuses on numerical answers, allowing for more objective evaluation criteria and closer alignment with ground-truth labels.
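The difference between strict matching and numeric answer checking can be made concrete with a small sketch. The paper does not specify its scoring functions, so both `exact_match` and `numeric_match` below are hypothetical stand-ins, and the example answers are invented for illustration.

```python
import re

def exact_match(prediction, reference):
    """Strict string match after lowercasing and whitespace normalization."""
    def normalize(text):
        return " ".join(text.lower().split())
    return normalize(prediction) == normalize(reference)

def numeric_match(prediction, reference, tol=1e-6):
    """Compare the last number found in the prediction with a numeric reference."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", prediction.replace(",", ""))
    return bool(numbers) and abs(float(numbers[-1]) - float(reference)) <= tol

# A paraphrase with the same meaning fails strict matching.
paraphrase_ok = exact_match("The clause is not enforceable.",
                            "The clause is unenforceable.")

# A numeric answer embedded in free text is still recognized.
numeric_ok = numeric_match("So the total cost is 1,250 dollars", "1250")
```

Here the paraphrased answer is scored as wrong even though it is semantically equivalent, whereas the numeric answer is scored as correct regardless of the surrounding wording.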

These findings highlight both the strength of our graph-based reasoning architecture and the inherent evaluation challenges in natural language reasoning tasks. Overall, the results demonstrate that our method provides a more robust and generalizable framework for reasoning across diverse domains.

Methods	GSM8K	FinQA	LegalBench
CoT	74.20	55.62	56.34
GoT	75.86	56.97	58.10
ToT	76.21	57.61	57.14
Active-Prompt	78.20	56.12	58.63
LLM-ARC	77.91	57.52	58.88
Ours (GraphMind)	80.52	62.37	63.18

Table 1. Comparison of reasoning accuracy (%) across three benchmark datasets. Our method consistently outperforms all baselines, demonstrating the effectiveness of our GNN-integrated dynamic graph-based reasoning framework.

4.3. Ablation Studies

To evaluate the contribution of the GNN-based reasoning module, we design an ablation variant by replacing the graph embedding with a simple average embedding strategy. Specifically, instead of aggregating premise node representations through learned GNN parameters, the ablated model (w/o GNN) generates the reasoning state vector by directly averaging the text embeddings of retrieved premises. For theorem selection, the GNN model leverages graph-structured similarity learning, whereas the ablation variant adopts nearest neighbor retrieval based on cosine similarity. While the method w/o GNN eliminates the overhead of additional GNN parameters and offers faster computation and simpler implementation, it lacks the ability to model complex inter-premise relationships. This contrast allows us to isolate and quantify the benefit of structural reasoning learned through the GNN. Ablation results are presented in Table 2.
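The ablated variant (w/o GNN) can be sketched in a few lines: the premise embeddings are averaged into a single state vector, and the theorem with the highest cosine similarity is selected. The function name `select_theorem_without_gnn` and the 3-dimensional toy vectors are hypothetical; in the paper the embeddings are 1536-dimensional text embeddings.

```python
import numpy as np

def select_theorem_without_gnn(premise_embeddings, theorem_embeddings):
    """Mean-pool the premises, then pick the most cosine-similar theorem."""
    # Reasoning state: plain average of the premise embeddings, no learned weights.
    state = premise_embeddings.mean(axis=0)
    state = state / np.linalg.norm(state)
    theorems = theorem_embeddings / np.linalg.norm(
        theorem_embeddings, axis=1, keepdims=True
    )
    similarities = theorems @ state  # cosine similarity per theorem
    return int(similarities.argmax()), similarities

# Toy embeddings: two premises and three candidate theorems.
premises = np.array([[1.0, 0.0, 0.0],
                     [0.8, 0.2, 0.0]])
theorems = np.array([[0.0, 1.0, 0.0],
                     [1.0, 0.1, 0.0],
                     [0.0, 0.0, 1.0]])

best_index, similarities = select_theorem_without_gnn(premises, theorems)
```

Because averaging treats every premise identically and ignores edges between them, this baseline cannot represent which premises depend on which, which is the structural information the GNN is meant to supply.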

As shown in Table 2, the GNN-based reasoning module plays a critical role in our proposed graph reasoning framework. On GPT-3.5-turbo, the GNN-based method achieves 80.52% on GSM8K, 62.37% on FinQA, and 63.18% on LegalBench, improving over the non-GNN version by 0.90, 3.96, and 2.35 percentage points, respectively. On GPT-4o-mini, the GNN module yields larger gains: from 85.01% to 92.16% on GSM8K (+7.15 points), from 68.27% to 73.11% on FinQA (+4.84 points), and from 61.88% to 69.87% on LegalBench (+7.99 points). These results highlight the ability of the GNN to capture rich inter-premise dependencies and long-range logical relations, which simple embedding averaging fails to model effectively.
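The ablation gains, together with their average per backbone, can be recomputed from the accuracies reported in Table 2. The script below only restates those reported numbers; the variable names are illustrative.

```python
datasets = ("GSM8K", "FinQA", "LegalBench")

# Accuracies (%) copied from Table 2, in the order given by `datasets`.
table2 = {
    "GPT-3.5-turbo": {"w/o GNN": (79.62, 58.41, 60.83),
                      "GNN": (80.52, 62.37, 63.18)},
    "GPT-4o-mini": {"w/o GNN": (85.01, 68.27, 61.88),
                    "GNN": (92.16, 73.11, 69.87)},
}

# Gain of the GNN-based method over the ablated variant, in percentage points.
deltas = {
    backbone: {
        dataset: round(with_gnn - without_gnn, 2)
        for dataset, without_gnn, with_gnn
        in zip(datasets, rows["w/o GNN"], rows["GNN"])
    }
    for backbone, rows in table2.items()
}

# Average gain across the three datasets for each backbone.
mean_gain = {
    backbone: round(sum(per_dataset.values()) / len(per_dataset), 2)
    for backbone, per_dataset in deltas.items()
}
```

The per-backbone averages make the comparison between the two LLMs explicit: the mean gain on GPT-4o-mini is larger than on GPT-3.5-turbo.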

Moreover, the consistent performance improvements across two different backbone LLMs indicate that our reasoning framework is not only effective but also generalizable, supporting its robustness.

Backbone	Methods	GSM8K	FinQA	LegalBench
GPT-3.5-turbo	w/o GNN	79.62	58.41	60.83
GPT-3.5-turbo	GNN-based method	80.52	62.37	63.18
GPT-4o-mini	w/o GNN	85.01	68.27	61.88
GPT-4o-mini	GNN-based method	92.16	73.11	69.87

Table 2. Ablation studies comparing the GNN-based reasoning module with a simplified variant using average embedding, across three datasets and two backbone LLMs.

5. Conclusion

In this paper, we propose GraphMind, a framework that integrates symbolic knowledge and neural reasoning by constructing task-specific heterogeneous graphs and leveraging a GNN to enhance logical reasoning in LLMs. Our method outperforms several state-of-the-art prompting and reasoning baselines on diverse datasets, demonstrating the benefits of structured knowledge modeling. Ablation studies confirm the effectiveness of the GNN-based state representation and theorem selection mechanisms, highlighting the role of graph structure in capturing inter-premise relationships and improving reasoning accuracy. Overall, GraphMind shows strong generalizability and potential for extension to broader reasoning tasks and domains. Future work could include learning-based graph construction and enhanced interpretability to further improve adaptability and transparency.