RUST-BENCH: Benchmarking LLM Reasoning on Unstructured Text within Structured Tables

Nikhil Abhyankar¹, Purvi Chaurasia², Sanchit Kabra¹, Ananya Srivastava²,
Vivek Gupta³, Chandan K. Reddy¹
¹Virginia Tech, ²IGDTUW New Delhi, ³Arizona State University

Abstract

Existing tabular reasoning benchmarks mostly test models on small, uniform tables, underrepresenting the complexity of real-world data and giving an incomplete view of Large Language Models’ (LLMs) reasoning abilities. Real tables are long, heterogeneous, and domain-specific, mixing structured fields with free text and requiring multi-hop reasoning across thousands of tokens. To address this gap, we introduce RUST-BENCH, a benchmark of 7,966 questions from 2,031 real-world tables spanning two domains: (i) RB-Science (NSF grant records) and (ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates LLMs jointly across scale, heterogeneity, domain specificity, and reasoning complexity. Experiments with open-source and proprietary models show that LLMs struggle with heterogeneous schemas and complex multi-hop inference, revealing persistent weaknesses in current architectures and prompting strategies. RUST-BENCH establishes a challenging new testbed for advancing tabular reasoning research.¹

¹Correspondence: [email protected], [email protected]
https://github.com/tabular-reasoning/RUST-BENCH

1 Introduction

Semi-structured tables containing free-form text embedded within structured fields are common across various domains gupta2020infotabs. Effective data analysis in science, finance, and sports requires reasoning over large, domain-specific tables that combine symbolic structure with textual context. However, existing benchmarks predominantly evaluate short, homogeneous Wikipedia-derived tables pasupat2015compositional; chen2019tabfact, which limits both model generalizability and robustness. Although Large Language Models (LLMs) have made tabular reasoning more accessible by allowing users to query tables directly in natural language cheng2022binding, systematic evaluation of their reasoning abilities over complex tables remains underexplored chen2023large.

Figure 1: Illustration of a multi-step reasoning process for a complex question grounded in a sports table from RUST-BENCH. The example shows that real-world tabular reasoning often demands multiple complementary reasoning skills (temporal, arithmetic, and contextual) and the coordinated use of heterogeneous evidence across long, domain-specific tables.

Table 1: Comparison of RUST-BENCH with other Table QA datasets. RUST-BENCH contains a variety of complex question types over large, domain-specific tables containing semi-structured information. *Only the contents of the table are considered.

Dataset	Source	Complex Reasoning	Unanswerable Questions	Domain Specific	Semi-Structured	Large Tables	Avg. # Rows	Context Length
WikiTQ pasupat2015compositional	Wikipedia	✗	✗	✗	✗	✗	6.3	1133.51
TabFact chen2019tabfact	Wikipedia	✗	✗	✗	✗	✗	6.2	586.51
Hybrid-QA chen2020hybridqa	Wikipedia	✓	✗	✗	✓	✗	15.7	372.14*
OTT-QA chen2020open	Wikipedia	✓	✗	✗	✓	✗	15.7	372.14*
CRT-QA zhang2023crt	Wikipedia	✓	✓	✗	✗	✗	12.6	257.12
TAT-QA zhu2021tat	Financial Reports	✓	✗	✓	✓	✗	9.4	378.31
FINQA chen2021finqa	FinTabNet zheng2021global	✓	✗	✓	✓	✗	6.4	687.51
SciTab lu2023scitab	SciGen moosavi2021scigen	✓	✗	✓	✗	✗	7.5	254.53
RUST-BENCH	NSF NSF2024, Sportsett thomson2020sportsett	✓	✓	✓	✓	✓	45.1	23040.68

Real-world tabular reasoning introduces four major challenges for LLMs: scale, multi-hop reasoning, heterogeneity, and domain specificity. First, tables can be long, often spanning hundreds of rows and columns, and such long contexts are known to degrade LLM reasoning performance liu2023lost. Similarly, model performance deteriorates as table size grows, even when the entire table fits within the context window, since only a small fraction of rows are typically relevant to a given query abhyankar2024h. Second, many queries require multi-hop reasoning—locating relevant rows, integrating dispersed evidence, and composing it into an answer. Third, heterogeneity arises when tables mix structured fields with free-form text, requiring models to reason over diverse data modalities chen2020hybridqa; zhu2021tat. Finally, domain specificity introduces specialized terminology and domain-specific reasoning patterns, as seen in finance chen2021finqa and science lu2023scitab, which require specialized domain knowledge for effective inference. While existing benchmarks assess specific aspects of table reasoning, they often evaluate these challenges in isolation. The absence of benchmarks that jointly incorporate scale, heterogeneity, and domain specificity constitutes a fundamental limitation, constraining systematic progress toward generalizable tabular reasoning models. We therefore pose the question: Can LLMs effectively reason over unstructured text embedded in long, domain-specific tables?

To answer this, we introduce RUST-BENCH, a new benchmark explicitly designed to stress-test models across four orthogonal axes of real-world tabular reasoning: domain specificity, table length, semi-structured information, and multi-hop reasoning, offering a comprehensive and realistic evaluation framework. RUST-BENCH comprises 2,031 tables primarily sourced from two domains: (a) science and (b) sports, accompanied by 7,966 carefully curated question–answer pairs. We construct the dataset using an LLM-driven hybrid symbolic–semantic generation pipeline that systematically constructs high-quality, multi-hop queries grounded in real-world semi-structured tables while reducing manual annotation costs. As illustrated in Figure 1, each question is designed to evaluate a wide spectrum of reasoning skills (including temporal, numerical, aggregation, verification, commonsense, counterfactual, and ambiguity resolution), with most requiring multi-hop reasoning that integrates information across multiple cells through both parallel and sequential inference. As shown in Table 1, existing benchmarks primarily rely on Wikipedia, which generally involves short contexts and relatively simple reasoning. These datasets often lack domain-specific information, unanswerable queries, and large semi-structured tables, thereby limiting their capacity to reflect real-world complexity. In contrast, RUST-BENCH introduces domain-grounded tables, expands the range of reasoning types, and substantially scales up table size (averaging 45.1 rows and roughly 23,000 tokens per table). This design offers a more realistic and challenging evaluation setting for LLMs. We evaluate RUST-BENCH using state-of-the-art proprietary and open-source LLMs, employing diverse prompting strategies and reasoning methods. Our findings expose systematic weaknesses in handling scale, heterogeneity, and reasoning composition, confirming the value of RUST-BENCH as a challenging and diagnostic benchmark for advancing research on LLM-based table reasoning. Our main contributions are:

• We introduce RUST-BENCH, a large-scale benchmark that jointly evaluates LLMs across four orthogonal dimensions (i.e., scale, heterogeneity, domain specificity, and complex reasoning) previously treated in isolation by existing datasets.

• We develop a hybrid dataset generation pipeline that leverages the complementary strengths of symbolic and semantic reasoning to construct diverse, multi-hop, domain-grounded QA pairs efficiently.

• Comprehensive evaluations of state-of-the-art open-source and proprietary models reveal that current LLMs struggle with large, heterogeneous tables and multi-step reasoning, exposing persistent gaps in table reasoning architectures and prompting strategies.

2 RUST-BENCH Dataset

2.1 Task Formulation

In table-based reasoning, each problem instance is represented as a triplet (T, Q, A), where T denotes the tabular data, Q represents the associated query, and A signifies the anticipated response. Specifically, in the context of table-centric question-answering systems, both Q and A are in natural language. The primary objective is to derive a prediction a from Q and T, which can be formally expressed as $a = \pi_{\theta}(T, Q)$, where $\pi_{\theta}$ denotes the predictive model.
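The formulation above can be read as a small interface. The following is a minimal sketch under our own assumptions: the `Instance` and `Predictor` names, the list-of-dicts table representation, and the toy `lookup_model` are all hypothetical and stand in for an actual LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a table is a list of rows, each mapping column name -> cell text.
Table = List[Dict[str, str]]

@dataclass
class Instance:
    table: Table    # T
    question: str   # Q
    answer: str     # A (gold answer)

# pi_theta: any callable mapping (T, Q) to a predicted answer string.
Predictor = Callable[[Table, str], str]

def predict(model: Predictor, inst: Instance) -> str:
    """Compute a = pi_theta(T, Q) for one instance."""
    return model(inst.table, inst.question)

def lookup_model(table: Table, question: str) -> str:
    # Toy stand-in for an LLM: return the first-row cell of a column
    # whose name is mentioned in the question.
    for col in table[0]:
        if col.lower() in question.lower():
            return table[0][col]
    return "unanswerable"

example = Instance(
    table=[{"team": "Knicks", "city": "New York"}],
    question="Which city is listed?",
    answer="New York",
)
prediction = predict(lookup_model, example)
```

Any real model can be dropped in for `lookup_model` as long as it accepts a table and a question and returns a string.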

2.2 RUST-BENCH Creation

Table Collection.

We curate domain-grounded tables from two high-quality sources: the NSF Grants Database NSF2024 for science and the SportSett:Basketball dataset thomson2020sportsett, an enhanced version of RotoWire wiseman2017challenges, for sports. The raw data is cleaned and organized into domain-specific JSON tables, sampled by attributes (such as year and region) and by uniform random selection (Figure 2(a)). We focus on constructing large tables with more than 30 rows, consistent with the definition in chen2023large. To ensure diversity and cross-domain comparability, we apply structured sampling to balance table sizes: 50% with 30–40 rows, 40% with 40–60, and 10% with 60–100. This stratification balances coverage and scale across the domains, yielding a representative mixture of table sizes and schema complexities.
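The size stratification described above can be sketched as follows. This is an illustration only: the bucket shares come from the text, but the function names, the inclusive handling of the shared boundaries (40 and 60), and the uniform draw within each bucket are our assumptions, not the authors' procedure.

```python
import random

# (min_rows, max_rows, share) as stated in the text.
BUCKETS = [(30, 40, 0.5), (40, 60, 0.4), (60, 100, 0.1)]

def bucket_quotas(n_tables, buckets=BUCKETS):
    """Split n_tables across buckets; the last bucket absorbs rounding."""
    quotas = [int(n_tables * share) for _, _, share in buckets[:-1]]
    quotas.append(n_tables - sum(quotas))
    return quotas

def sample_row_counts(n_tables, seed=0):
    """Draw a target row count for each table according to the bucket shares."""
    rng = random.Random(seed)
    counts = []
    for (lo, hi, _), quota in zip(BUCKETS, bucket_quotas(n_tables)):
        # Assumption: bounds are inclusive and sizes are uniform within a bucket.
        counts.extend(rng.randint(lo, hi) for _ in range(quota))
    rng.shuffle(counts)
    return counts
```

For instance, `bucket_quotas(100)` assigns 50, 40, and 10 tables to the three buckets.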

QA Generation.

Creating high-quality QA pairs for long, domain-specific tables is particularly challenging as manual annotation is slow, costly, and prone to errors when tables span thousands of tokens. Inspired by recent LLM-based data generation methods park2023generative; zhang2023crt; li2024planning, we adopt in-context learning and role-playing paradigms to enable scalable and diverse dataset construction at a lower annotation cost. However, only using LLMs’ textual (semantic) reasoning is inadequate as it captures natural-language inference but fails on structural and quantitative reasoning. Conversely, symbolic reasoning methods yield precise numerical manipulation and structural consistency but lack flexibility with unstructured text liu2023rethinking. We therefore leverage their complementary strengths to design a hybrid symbolic–semantic pipeline (Figure 2(b)) comprising (a) a symbolic approach, which uses SQL-like logical forms to create schema-intensive, reasoning-heavy queries, and (b) a semantic approach, which generates natural, inference-oriented questions from unstructured text.

(a) Symbolic Approach.

The symbolic approach exploits LLMs’ code-generation abilities to synthesize SQL queries over both structured and unstructured table components, to create questions involving numerical reasoning, aggregation, and logic. We construct a library of 75 SQL templates with placeholders (e.g., SELECT [columns] FROM [table] WHERE [condition]) covering diverse query patterns such as selection, aggregation, and conditional operations (Appendix A.1). During generation, a template is sampled and instantiated with table-specific values, providing a structural scaffold for producing valid SQL queries (Figure 2(b)). For example, a template may yield SELECT MAX(attendance) FROM RB_Sports WHERE city==‘New York’, which is then paraphrased into a natural language question ‘What is the highest attendance recorded in NYC?’ by prompting an LLM. To ensure fluency and avoid explicit SQL exposure, entity names are masked or rephrased (e.g., New York → NYC) during paraphrasing. This dual process enables coverage of multiple reasoning types, integrating structured computation with textual variation.
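To make the template step concrete, here is a minimal sketch of filling a bracket-style template and executing the result. It is not the authors' pipeline: in the paper an LLM fills the placeholders, whereas this sketch fills them deterministically, and the template string, column names, and toy rows are hypothetical.

```python
import sqlite3

# Hypothetical template in the bracket-placeholder style quoted in the text.
TEMPLATE = "SELECT MAX([column]) FROM [table] WHERE [condition]"

def instantiate(template, slots):
    """Fill each [placeholder] with a table-specific value."""
    query = template
    for name, value in slots.items():
        query = query.replace(f"[{name}]", value)
    return query

def run_query(rows, query):
    """Load toy rows into an in-memory SQLite table and execute the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE RB_Sports (city TEXT, attendance INTEGER, summary TEXT)")
    conn.executemany("INSERT INTO RB_Sports VALUES (?, ?, ?)", rows)
    result = conn.execute(query).fetchone()[0]
    conn.close()
    return result

# Toy data, invented for illustration only.
rows = [
    ("New York", 19812, "The Knicks held on late."),
    ("New York", 18006, "A slow start cost the home side."),
    ("Boston", 18624, "The Celtics pulled away in the third."),
]
sql = instantiate(
    TEMPLATE,
    {"column": "attendance", "table": "RB_Sports", "condition": "city = 'New York'"},
)
answer = run_query(rows, sql)
```

The executed result (`answer`) would serve as the gold answer, and the SQL string would then be passed to an LLM for paraphrasing into a natural language question.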

(b) Semantic Approach.

The semantic component uses LLMs’ semantic reasoning to derive insights from unstructured text segments and generate diverse, inference-driven questions that go beyond surface-level lookups. However, LLMs struggle with long or complex inputs liu2023lost, often producing (1) overly simplistic questions and (2) repetitive patterns, especially on large tables. To mitigate these issues, we restrict inputs using either a Single Row-Based method for focused intra-row reasoning or a Multi-Row-Based method for multi-hop reasoning across a small subset of semantically related rows. This setup reduces contextual load and encourages inference beyond simple lookups while keeping questions easily verifiable by human annotators. To further enhance diversity, we maintain a pool of in-context exemplars spanning multiple reasoning types and randomly sample from them during generation. Combined with temperature variation, this encourages broader coverage and deeper reasoning. Details of the single-row and multi-row generation processes are in Appendix A.2.

2.3 RUST-BENCH Validation

Although LLMs can generate QA pairs at scale, their outputs often suffer from misalignment, limited diversity, and uneven reasoning depth zhang2023crt. To ensure high-quality supervision for RUST-BENCH, we adopt a rigorous human-in-the-loop verification pipeline that filters out poor generations. We first discard malformed or duplicated QA pairs and those with empty or ill-formed answers. Eight Computer Science graduate students then act as annotators and review each remaining pair using a custom web interface that displays the full semi-structured table alongside its question and answer (see Appendix A.3). Annotators rate clarity, answer correctness, and reasoning complexity, and flag uncertain or incorrect cases for secondary review. They are also instructed to ensure that the final answers are concise, self-contained, and free of redundant text to facilitate consistent automatic and human evaluation. Three expert reviewers then re-examine all pairs and consolidate the verified dataset. Low-quality or unverifiable examples are removed, while minor errors are corrected. As summarized in Table 2, this process yields a curated set of high-quality QA pairs supporting multi-hop reasoning over long, heterogeneous tables.

Table 2: Breakdown of QA pairs before and after human verification.

Dataset	Category	Original # QA	Final # QA	% Discarded
RB-Sports	Single Row	2886	2712	6.0%
	Multi Row	1222	838	31.4%
	Symbolic	1431	1338	6.5%
RB-Science	Single Row	915	805	12.0%
	Multi Row	1516	1101	27.3%
	Symbolic	1267	1172	7.5%
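
The discard percentages in Table 2 follow from the two counts in each row. As a worked check on the RB-Sports Single Row figures (the symbol names $N_{\text{orig}}$ and $N_{\text{final}}$ are ours, not the paper's):

```latex
\[
\text{Discarded}\,(\%)
  = \frac{N_{\text{orig}} - N_{\text{final}}}{N_{\text{orig}}} \times 100
  = \frac{2886 - 2712}{2886} \times 100
  \approx 6.0\%
\]
```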

2.4 RUST-BENCH Statistics

Table 3 summarizes the RUST-BENCH dataset, comprising 2,031 tables spanning RB-Sports (1,326) and RB-Science (705). Although both domains contain tables of similar length, RB-Science shows greater structural complexity, with more columns and higher token counts per table. We include 5,674 questions in RB-Sports and 2,292 in RB-Science, averaging 4.28 questions per table in RB-Sports and 3.25 in RB-Science, plus a subset of unanswerable queries. For unstructured passages, RB-Science has higher average token counts (477.62 vs. 400.58; medians 469 vs. 368) and a larger token standard deviation (149.87 vs. 114.21), while RB-Sports has slightly more sentences per passage on average. To assess annotation quality, we conducted a human-rated complexity study following nan2022fetaqa. Three experts rated 100 random examples on a 1–5 scale, with scores $\geq$ 4 indicating high-quality QA pairs. The study achieved 91.7% inter-annotator agreement, confirming the dataset’s reliability.

Table 3: Summary statistics of RUST-BENCH across RB-Sports and RB-Science.

Tables
	RB-Sports	RB-Science
# Tables	1326	705
Avg. Rows / Table	44.95	45.13
Avg. Columns / Table	12.0	28.0
Avg. Tokens / Table	18304.47	31948.79
Questions
# Questions	5674	2292
Avg. Question Length (words)	26.92	27.48
# Questions / Table	4.28	3.25
# Unanswerable Questions	132	372
Unstructured Text
Avg Tokens / Passage	400.58	477.62
Std Tokens	114.21	149.87
Median Tokens	368.00	469.00
Avg Sentences / Passage	16.22	14.34
Std Sentences	4.32	4.84
Median Sentences	15.00	14.00
Inter-Annotator Agreement	91.7%

3 Experiments

Table 4: Comparison of LLM backbones under various prompting strategies on RB-Science and RB-Sports, measured by (a) Exact Match (EM), (b) BLEU, and (c) LLM-as-a-judge (LLM-score). Higher values indicate better performance.

Model	Strategy	RB-Science EM (%)	RB-Science BLEU	RB-Science LLM-score (%)	RB-Sports EM (%)	RB-Sports BLEU	RB-Sports LLM-score (%)
Large Language Models
GPT-4o-mini	Zero-Shot	36.6	0.293	40.4	39.8	0.285	43.1
	Few-Shot	37.9	0.296	36.7	31.3	0.301	33.9
	CoT	44.4	0.378	48.8	42.1	0.365	45.2
	PoT	32.8	0.312	34.5	30.6	0.285	33.6
Llama-3.3-70B	Zero-Shot	38.8	0.301	47.1	39.2	0.311	44.3
	Few-Shot	41.7	0.347	46.4	46.7	0.350	48.9
	CoT	44.2	0.401	45.3	42.2	0.392	43.9
	PoT	27.7	0.299	30.6	31.1	0.289	33.0
Gemini-2.0-Flash	Zero-Shot	40.7	0.370	47.3	38.6	0.345	45.4
	Few-Shot	45.9	0.373	48.8	41.4	0.340	43.3
	CoT	47.3	0.454	50.8	44.1	0.419	48.7
	PoT	18.2	0.225	23.6	26.3	0.239	29.1
Mistral-Small-3.2	Zero-Shot	48.3	0.410	50.5	45.7	0.404	48.0
	Few-Shot	50.3	0.373	51.6	43.9	0.365	45.2
	CoT	52.6	0.454	53.1	51.5	0.446	51.7
	PoT	29.8	0.278	29.9	20.5	0.241	26.4
Large Reasoning Models
Qwen3-14B		42.6	0.441	44.4	41.2	0.433	43.1
Qwen-QwQ		48.1	0.526	54.1	46.1	0.479	55.7
Qwen-Distill-32B		43.1	0.407	49.9	39.2	0.426	44.6
Llama-Distill-70B		44.6	0.483	52.4	40.5	0.455	50.9

LLM Backbones.

We benchmark a diverse set of state-of-the-art large language models, spanning both open-source and proprietary families, as well as reasoning-optimized variants for complex problem-solving. Specifically, we evaluate Llama-3.3-70B-Instruct dubey2024llama, GPT-4o-mini openai2023gpt, Gemini-2.0-Flash team2023gemini, and Mistral-Small-3.2-24B-Instruct-2506 jiang2024mixtral. Beyond these general-purpose models, we also assess Qwen3-14B, Qwen-32B-QwQ yang2025qwen3, Qwen-Distill-32B, and Llama-Distill-70B guo2025deepseek, which are specialized for reasoning tasks. All models are evaluated using default hyperparameters and a fixed decoding temperature ( $\tau=0.1$ ) for consistency across runs. Following wang2023chain, each table is linearized into a pipe-separated format and concatenated with its query across models.
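The linearization step can be sketched as below. The exact serialization used in the experiments is not given in this section, so the header-first layout, the `" | "` separator spacing, and the prompt wording are assumptions for illustration.

```python
def linearize_table(columns, rows):
    """Serialize a table as pipe-separated lines: header first, then one line per row."""
    lines = [" | ".join(columns)]
    for row in rows:
        # Collapse internal newlines so each table row stays on one line.
        cells = [str(row.get(col, "")).replace("\n", " ") for col in columns]
        lines.append(" | ".join(cells))
    return "\n".join(lines)

def build_prompt(columns, rows, question):
    """Concatenate the linearized table with the query (prompt wording is hypothetical)."""
    table_text = linearize_table(columns, rows)
    return f"Table:\n{table_text}\n\nQuestion: {question}\nAnswer:"
```

Free-text cells are kept inline, which is why long passages quickly dominate the token count of the serialized table.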

Baselines.

We evaluate two baseline categories: (i) prompting strategies and (ii) table reasoning methods developed specifically for tabular data. For prompting, we adopt four standard paradigms: (i) Zero-shot prompting, where the model directly answers the table–question pair; (ii) Few-shot prompting chen2023large, with four in-context examples; (iii) Chain-of-Thought (CoT) wei2022chain, encouraging intermediate reasoning steps; and (iv) Program-of-Thought (PoT) chen2023program, which incorporates executable programs as intermediate reasoning. For table reasoning methods, we use GPT-4o-mini and Llama-3.3-70B as LLM backbones to evaluate six state-of-the-art approaches: BlendSQL glenn2024blendsql, a hybrid framework embedding SQL-style reasoning within natural prompts; Chain-of-Table wang2023chain, which performs stepwise table updates for interpretable reasoning; ProTrix wu2024protrix, integrating SQL planning with compositional reasoning; TabSQLify nahid2024tabsqlify, which uses SQL to partition large tables into sub-tables for scalable inference; TableMaster cao2025tablemaster, combining textual and symbolic reasoning via adaptive table verbalization; and NormTab nahid2024normtab, normalizing table structures and values to improve symbolic interpretability. Additional implementation details are provided in Appendix B.3.

Evaluation Metrics.

For fairness and consistency, all models are evaluated under identical input and output constraints, focusing on accuracy and generation quality. Each model is instructed to produce concise, self-contained natural language answers; for SQL-based methods, query output is post-processed and verbalized in natural language for comparability. Following pasupat2015compositional; zhang2023crt, we report Exact Match (EM) as the primary metric. We further relax the evaluation with BLEU papineni2002bleu to capture $n$-gram overlap and an LLM-as-a-Judge (LLM-Score) evaluation using GPT-4o-mini to assess semantic equivalence. This combination provides complementary signals for lexical accuracy, surface fluency, and semantic faithfulness. For more details, see Appendix B.2.
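As a rough illustration of the lexical metrics, the sketch below implements Exact Match and clipped $n$-gram precision, the quantity that BLEU aggregates over several values of $n$ (full BLEU additionally applies a brevity penalty and a geometric mean, omitted here). The normalization rules are our assumption; the paper's exact procedure is in its Appendix B.2.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction, gold):
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)

def ngram_precision(prediction, gold, n=1):
    """Clipped n-gram precision of the prediction against a single reference."""
    pred = normalize(prediction).split()
    ref = normalize(gold).split()
    pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(pred_ngrams.values())
    if total == 0:
        return 0.0
    # Each predicted n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in pred_ngrams.items())
    return overlap / total

def em_score(predictions, golds):
    """Percentage of predictions that exactly match their gold answer."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)
```

A verbose but correct answer such as "The answer is New York" fails Exact Match against "New York" while still earning partial $n$-gram credit, which is the gap the relaxed metrics are meant to cover.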

3.1 Main Results

Table 4 reports the performance of different LLM backbones on RUST-BENCH using Exact Match (EM), BLEU, and LLM-score. Overall, Qwen-QwQ achieves the highest BLEU and LLM-score, with the latter reaching 54.1 and 55.7 on RB-Science and RB-Sports, respectively, while Mistral-Small-3.2 with CoT attains the highest EM (52.6% and 51.5%). CoT generally outperforms Zero-Shot and Few-Shot prompting on EM, the one exception being Llama-3.3-70B on RB-Sports, where Few-Shot is stronger (46.7% vs. 42.2%); this highlights the importance of explicit reasoning in this setting. In contrast, PoT yields the lowest EM for every model, likely due to the semi-structured nature of the data. Table 5 compares table reasoning baselines on RUST-BENCH, implemented with GPT-4o-mini and Llama-3.3-70B as backbones. Among these, TableMaster achieves the best overall results, reaching 42.3% EM with GPT-4o-mini and 43.1% with Llama-3.3-70B. In contrast, SQL-based methods such as TabSQLify and BlendSQL perform worst, with EM scores of 15.3% and 13.6%, respectively, on GPT-4o-mini. These findings suggest that purely symbolic reasoning pipelines are insufficient for the flexible, context-driven inference required by RUST-BENCH, consistent with the weak PoT results in Table 4.

3.2 Impact of Table Size

To investigate how table size affects reasoning accuracy in our setting, we analyze model performance across naturally occurring tables grouped by total token count, spanning from 10K to 85K tokens. As illustrated in Figure 3, GPT-4o-mini, Gemini-2.0-Flash, and Llama-3.3-70B exhibit a consistent, monotonic decline in Exact Match accuracy as table size increases, with degradation becoming particularly pronounced beyond the 35K–50K token range. Notably, this performance drop occurs well within the nominal context windows of modern LLMs (typically 128K+ tokens), suggesting that the bottleneck arises from reasoning and attention limitations rather than raw context length. This degradation can be attributed to LLMs’ difficulty in locating, retrieving, and integrating dispersed evidence across long sequences liu2023lost, compounded by increased multi-hop reasoning complexity. Unlike existing benchmarks that predominantly feature concise tables under 5,000 tokens pasupat2015compositional; chen2019tabfact, RUST-BENCH includes substantially longer and more heterogeneous tables where critical information is often scattered across extensive contexts. These findings highlight the need for improved query-specific data extraction mechanisms to effectively handle large-scale tabular reasoning tasks.
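The size analysis above amounts to bucketing per-question results by table token count and computing EM per bucket. A minimal sketch follows; the bin edges are hypothetical (the text only gives the overall 10K to 85K range), and the token counts are assumed to be precomputed by whatever tokenizer the evaluation uses.

```python
from collections import defaultdict

# Hypothetical bin edges in tokens; upper edges are exclusive.
BIN_EDGES = [10_000, 20_000, 35_000, 50_000, 85_000]

def bin_label(n_tokens, edges=BIN_EDGES):
    """Return the 'lo-hi' label of the bin containing n_tokens, or None if out of range."""
    for lo, hi in zip(edges, edges[1:]):
        if lo <= n_tokens < hi:
            return f"{lo // 1000}K-{hi // 1000}K"
    return None

def em_by_table_size(records):
    """records: iterable of (table_token_count, is_correct) pairs; returns EM (%) per bin."""
    hits, totals = defaultdict(int), defaultdict(int)
    for n_tokens, correct in records:
        label = bin_label(n_tokens)
        if label is None:
            continue  # skip tables outside the analyzed range
        totals[label] += 1
        hits[label] += int(correct)
    return {label: 100.0 * hits[label] / totals[label] for label in totals}
```

Plotting the returned dictionary in bin order would give a curve of the kind the paragraph describes.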

Table 5: Comparison of table reasoning baselines on RUST-BENCH with GPT-4o-mini and Llama-3.3-70B backbones, measured by (a) Exact Match (EM), (b) BLEU, and (c) LLM-as-a-judge (LLM-score). Higher values indicate better performance.

Method	GPT-4o-mini EM (%)	GPT-4o-mini BLEU	GPT-4o-mini LLM-score (%)	Llama-3.3-70B EM (%)	Llama-3.3-70B BLEU	Llama-3.3-70B LLM-score (%)
TabSQLify	15.3	0.206	22.3	14.4	0.120	18.6
BlendSQL	13.6	0.186	20.2	11.7	0.145	13.6
ProTrix	32.6	0.319	33.9	28.3	0.265	31.5
Chain-of-Table	30.1	0.247	35.1	33.2	0.358	36.9
NormTab	33.9	0.338	36.8	30.9	0.279	34.9
TableMaster	42.3	0.431	44.2	43.1	0.386	45.4

3.3 Impact of Real-World Table Complexity

To assess how the combination of real-world structural complexity and multi-hop reasoning affects model performance, we compare two proprietary LLMs, GPT-4o-mini and Gemini-2.0-Flash, across WikiTQ (a general-knowledge benchmark) and RUST-BENCH. We evaluate both models under zero-shot and Chain-of-Thought (CoT) prompting settings. As shown in Figure 4, both models demonstrate strong performance on WikiTQ, with GPT-4o-mini achieving 59.4% accuracy in zero-shot and 64.5% with CoT, while Gemini-2.0-Flash reaches 69.7% and 80.4%, respectively. In contrast, performance on RUST-BENCH drops sharply to roughly 20–30% across all prompting strategies for both models. This substantial gap reveals the compounding challenges introduced by domain-specific reasoning, heterogeneous table schemas, long contexts, and multi-hop inference. Unlike WikiTQ’s short, homogeneous tables dominated by direct lookup queries, RUST-BENCH captures the full spectrum of real-world tabular reasoning, where multiple factors interact to create harder reasoning problems. Such a dramatic decline underscores the limits of current LLMs in generalizing beyond simplified benchmarks and highlights the pressing need for more robust and compositional reasoning mechanisms.

3.4 Impact of Heterogeneous Data

While multi-hop evaluation has been extensively studied as a driver of task difficulty, the influence of data heterogeneity and structure on reasoning performance remains less explored. To investigate how the underlying data representation influences reasoning performance, we conduct controlled experiments on a subset of 100 randomly sampled RB-Sports tables, converting each semi-structured table into two additional variants, structured and unstructured, while keeping the underlying content identical. In the structured setting, information is normalized into explicit columns, minimizing free-form text; in the unstructured setting, each table row is verbalized into natural-language sentences and appended to the textual field, simulating highly heterogeneous inputs. We first evaluate symbolic reasoning methods, specifically Program-of-Thought (PoT) prompting, on the structured and semi-structured variants. As shown in Figure 5, PoT consistently achieves higher accuracy on the structured version across all models except Llama-3.3-70B, which performs comparably on both. This pattern indicates that symbolic reasoning benefits from explicit schema structure and reduced textual noise, confirming its reliance on syntactic regularity. Next, we assess text-based reasoning methods using Chain-of-Thought (CoT) prompting on the unstructured and semi-structured variants.

Figure 6 shows that CoT yields higher accuracy on the unstructured representation, indicating that natural-language continuity facilitates stepwise reasoning when explicit structure is absent. Overall, these results show that semi-structured data presents the greatest reasoning challenge, as it combines the ambiguity of free-text with the rigidity of tabular schema, while purely structured or unstructured formats better align with the respective strengths of symbolic and semantic reasoning. We further perform an in-depth error analysis to characterize common failure modes and provide qualitative examples of reasoning diversity in RUST-BENCH in Appendices C, D, and E.
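The unstructured variant used in this experiment can be illustrated with a simple template-based verbalizer. This is only a sketch: the section does not say how rows were verbalized, so the sentence template, the `summary` field name, and the function names are hypothetical.

```python
def verbalize_row(row, text_field="summary"):
    """Turn structured cells into one sentence and append it to the free-text field."""
    facts = [f"{col} is {val}" for col, val in row.items() if col != text_field]
    existing = str(row.get(text_field, "")).strip()
    if not facts:
        return existing  # nothing structured to verbalize
    sentence = "In this row, " + ", ".join(facts) + "."
    return f"{existing} {sentence}".strip()

def to_unstructured(rows, text_field="summary"):
    """Unstructured variant: each row collapses to a single text passage."""
    return [{text_field: verbalize_row(row, text_field)} for row in rows]
```

The content of each row is preserved, but the column boundaries that a symbolic method such as PoT relies on are no longer explicit.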

4 Related Work

General Table Reasoning.

Table reasoning tasks typically involve well-structured, short tables, often derived from Wikipedia-based sources. Datasets such as WikiTQ pasupat2015compositional, SQA iyyer2017search, WikiSQL zhong2017seq2sql, and Spider yu2018spider focus on question answering or text-to-SQL tasks that test reasoning over such tables. While WikiTQ and SQA include complex questions, WikiSQL pairs natural language questions with SQL queries, and Spider offers a large-scale, cross-domain collection with diverse databases and complex SQL. Beyond question answering, fact-verification datasets like TabFact chen2019tabfact and Infotabs gupta2020infotabs evaluate claim verification over Wikipedia data, while FetaQA nan2022fetaqa targets free-form question answering requiring reasoning over entity relations. However, these datasets primarily rely on short, factual tables with limited query diversity and shallow reasoning depth.

Semi-Structured and Complex Reasoning.

Datasets such as FEVEROUS aly2021fact, Hybrid-QA chen2020hybridqa, and OTT-QA chen2020open extend table reasoning to open-domain contexts combining text and tables, yet still exhibit limited diversity in reasoning types and structural variation. In contrast, reasoning-focused datasets like TempTabQA gupta2023temptabqa and TABMWP lu2022dynamic emphasize specific reasoning skills (temporal and numerical reasoning, respectively) but lack semi-structured contexts. CRT-QA zhang2023crt covers a broader range of reasoning types but remains constrained by structured-only, open-domain data. Our dataset bridges these gaps by combining domain-specific, semi-structured tables with diverse, multi-hop reasoning tasks that span both structured and unstructured modalities.

Domain-Specific Datasets.

Datasets tailored to specific domains typically require specialized background knowledge and retrieval mechanisms to answer domain-grounded questions. In the finance domain, FinQA chen2021finqa, TAT-QA zhu2021tat, and MultiHiertt zhao2022multihiertt emphasize numerical and logical reasoning, often integrating heterogeneous data sources. SemTabFacts wang2021semeval and SciTAB lu2023scitab focus on claim verification using tables from scientific articles, while SciTabQA lu2023scitab extends this to question answering over mixed textual and tabular evidence. Despite their domain focus, these datasets generally contain small, homogeneous tables with limited semi-structured context, thereby constraining the study of complex, multi-hop reasoning. As illustrated in Figure 7, RUST-BENCH differs by unifying large-scale, heterogeneous, and domain-specific tables—capturing the full spectrum of real-world reasoning challenges.

5 Conclusion

We presented RUST-BENCH, the first benchmark that jointly evaluates LLMs on tabular reasoning across four fundamental challenges of real-world data: scale, heterogeneity, domain specificity, and multi-hop inference. Our experiments demonstrate that even the strongest proprietary and open-source models systematically fail under these conditions: accuracy drops sharply with increasing table length, and multi-hop reasoning over semi-structured, domain-specific tables frequently breaks down. RUST-BENCH provides a robust evaluation framework and a foundation for advancing research in symbolic and structured reasoning, which is an essential step toward reliable real-world deployment. Future work on RUST-BENCH will emphasize broader coverage by adding diverse domains (e.g., healthcare, finance, climate), multilingual settings, and more complex table structures (hierarchical, nested, evolving) to better test cross-domain generalization. We will also introduce real-world noise (e.g., missing cells, typos, schema drift, and conflicting units) to assess robustness, calibration, and recovery under imperfect data. Finally, we will pair LLMs with tools for retrieval, schema induction, and execution, aiming for verifiable, scalable reasoning over semi-structured data.

Limitations

While RUST-BENCH marks a step forward in evaluating LLMs on realistic tabular reasoning, it could further incorporate multi-table and relational reasoning, introduce training splits to support fine-tuning and adaptation, and explore richer evaluation protocols that better capture semantic correctness in complex answers. These developments can help create robust and generalizable approaches to tabular reasoning in real-world applications.

Ethics Statement

We, the authors, affirm that our work adheres to the highest ethical standards in research and publication. We have carefully considered and addressed various ethical issues to ensure the responsible and fair use of computational linguistics methodologies. To facilitate reproducibility, we provide detailed information, including code, datasets (all publicly available and in compliance with their respective ethical standards), and other relevant resources. Our claims align with the experimental results, though some stochasticity is expected with black-box large language models, which we minimize by maintaining a fixed temperature. We provide comprehensive details on annotations, dataset splits, models used, and prompting methods, ensuring our work can be reliably reproduced.

Acknowledgments

This research was partially supported by the U.S. National Science Foundation (NSF) under Grant No. 2416728. We also extend our gratitude to the Complex Data Reasoning and Analysis Lab (CoRAL) at Arizona State University for providing essential computational resources, mentorship, and a collaborative research environment that greatly contributed to the progress of this work. We sincerely thank Beenaa Salian and Preethi Suresh for their assistance in data annotation, verification, and code implementation, which played a key role in ensuring the accuracy and reliability of our results. Finally, we appreciate the thoughtful and constructive feedback provided by the reviewers, which helped strengthen the quality and presentation of this research.

Appendix A More Details on Dataset Generation

A.1 Symbolic Approach

To enable QA pair generation using the symbolic approach, we curate a diverse collection of approximately 75 SQL query templates. These templates are designed to cover a broad spectrum of SQL constructs, including basic SELECT statements, conditional logic (AND, OR), aggregation (MAX, SUM, etc.), sorting (ORDER BY), grouping (GROUP BY), and joins. As shown in Figure 8, each template includes placeholder tokens for table names, columns, and filter conditions, allowing for broad applicability across different schemas. To instantiate these templates, we adopt a prompt-based generation approach leveraging large language models (LLMs). Specifically, we sample a template at random and prompt the LLM with task instructions and in-context exemplars to replace the template placeholders using schema-specific information derived from a target semi-structured table. This results in a fully instantiated SQL query tailored to the table (Figure 9). The generated SQL is then executed on the underlying table to obtain the corresponding answer. In a subsequent step, we prompt the LLM with the SQL query and its result to generate a natural language question that semantically aligns with the query logic without exposing the underlying SQL clauses. The final output is a question-answer pair, where the answer is grounded in the execution result of the SQL, and the question is a fluent natural language rendering of the query semantics. This pipeline supports scalable QA dataset generation grounded in executable symbolic programs, enabling evaluation of models on structured reasoning tasks.
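The template-instantiate-execute loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the three templates, the `awards` table, and the `instantiate` helper (a plain string fill standing in for the LLM prompting step) are all hypothetical.

```python
import random
import sqlite3

# Hypothetical templates; the ~75 templates used in the paper are not reproduced here.
TEMPLATES = [
    "SELECT {col} FROM {table} WHERE {filter_col} = {value}",
    "SELECT MAX({col}) FROM {table}",
    "SELECT {group_col}, SUM({col}) FROM {table} "
    "GROUP BY {group_col} ORDER BY {group_col}",
]


def instantiate(template, slots):
    """Stand-in for the LLM step: fill placeholders with schema-specific values."""
    return template.format(**slots)  # unused slot keys are ignored by str.format


def build_demo_table():
    """Create a tiny in-memory table (hypothetical data, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE awards (year INTEGER, institution TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO awards VALUES (?, ?, ?)",
        [(2016, "A", 100), (2016, "B", 250), (2018, "A", 400)],
    )
    return conn


def execute(conn, sql):
    """Run the instantiated query; its result becomes the gold answer."""
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = build_demo_table()
    slots = {"table": "awards", "col": "amount", "filter_col": "year",
             "value": 2016, "group_col": "year"}
    template = random.choice(TEMPLATES)  # sample a template at random
    sql = instantiate(template, slots)
    print(sql, execute(conn, sql))
```

In the described pipeline, the SQL and its result would then be passed back to the LLM to produce the natural language question; that step is omitted here.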

A.2 Semantic Approach

As outlined in Section 2, we employ two prompting strategies, Single Row-Based and Multi-Row-Based, to improve the quality, diversity, and verifiability of LLM-generated questions over large tabular data. Figure 10 illustrates both approaches. In the Single Row-Based method, we randomly sample one row from the table and use it as the entire input context. This localization helps the LLM focus on intra-row reasoning, such as retrieving or interpreting structured and unstructured cell content. It also simplifies verification, as each question-answer (QA) pair depends on a well-defined and constrained context. In contrast, the Multi-Row-Based method is designed to enable multi-row reasoning by selecting a subset of rows that are semantically connected via a shared entity in a specific column. By narrowing the input to only a few rows, these strategies, as shown in Figure 10 (bottom), help overcome LLM limitations with long inputs by explicitly controlling context size and composition. They allow generating QA pairs that are diverse in type, grounded in the table content, and more easily verifiable.
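The two context-selection strategies can be sketched as below. This is an illustrative sketch only: rows are assumed to be dictionaries, and the `max_rows` cap and the requirement of at least two rows per group are assumptions introduced here, not details stated in the paper.

```python
import random
from collections import defaultdict


def sample_single_row(rows, rng):
    """Single Row-Based: one randomly chosen row is the entire context."""
    return [rng.choice(rows)]


def sample_multi_row(rows, column, rng, max_rows=5):
    """Multi-Row-Based: rows that share one entity value in `column`.

    `max_rows` is a hypothetical cap used to keep the context small.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    # Only entities that appear in at least two rows support multi-row reasoning.
    candidates = [group for group in groups.values() if len(group) >= 2]
    if not candidates:
        return []
    group = rng.choice(candidates)
    return group if len(group) <= max_rows else rng.sample(group, max_rows)


if __name__ == "__main__":
    table = [
        {"team": "Celtics", "date": "2019-03-18", "summary": "..."},
        {"team": "Celtics", "date": "2019-03-20", "summary": "..."},
        {"team": "Nuggets", "date": "2019-03-18", "summary": "..."},
    ]
    rng = random.Random(0)
    print(sample_single_row(table, rng))
    print(sample_multi_row(table, "team", rng))
```

The selected rows would then be serialized into the prompt in place of the full table.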

A.3 More Details on Data Validation

Figure 11 illustrates the custom verification interface used during the human-in-the-loop annotation process. Each screen presents a question, its predicted answer, and a detailed explanation generated by the model, alongside an interactive table view displaying the relevant semi-structured data. Annotators could validate the question-answer pair using tools such as column-specific filters, row-level sorting, and a search bar to locate supporting evidence quickly. The interface also includes input fields for correcting errors and a checkbox for discarding invalid questions. This setup ensured that annotators had full contextual access while verifying QA pairs, improving both accuracy and efficiency. After one round of annotations, the samples were further verified by expert verifiers to ensure high-quality question-answer pairs. The entire process was conducted by annotators and reviewed by graduate students in Computer Science.

Appendix B Implementation Details

In this section, we describe the prompting strategies, evaluation metrics, and LLM-based table reasoning baselines used in our study, along with their implementation details.

B.1 Prompting Techniques

We implement four prompting techniques for LLM-based tabular reasoning. Figures 12, 13, 14, and 15 show the direct (zero-shot), few-shot, chain-of-thought (CoT), and program-of-thought (PoT) prompts, respectively.

B.2 Evaluation Metrics

Exact Match (EM).

Following WikiTQ pasupat2015compositional, we use exact match (EM) as the metric for evaluating model performance. EM assigns a score of 1 if the predicted answer is exactly the same as the gold answer, and 0 otherwise. The final EM accuracy is the sum of the individual exact match scores divided by the total number of samples in the set. However, even when punctuation and case differences are ignored, EM penalizes semantically correct generations that do not exactly match the ground truth. Evaluation becomes increasingly unreliable for longer answers that contain short phrases or multiple entities. We thus explore more relaxed metrics that do not penalize semantically correct generations.
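A minimal sketch of EM scoring as described above is given here. The exact normalization rules are an assumption (lowercasing, stripping ASCII punctuation, collapsing whitespace); the official WikiTQ evaluator applies its own normalization, which this sketch does not reproduce.

```python
import re
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace.

    Note: dropping punctuation also removes decimal points ("3.5" -> "35"),
    so numeric answers may need separate handling.
    """
    text = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction, gold):
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))


def em_accuracy(predictions, golds):
    """Sum of per-sample EM scores divided by the number of samples."""
    if not golds or len(predictions) != len(golds):
        raise ValueError("predictions and golds must be non-empty and aligned")
    return sum(exact_match(p, g) for p, g in zip(predictions, golds)) / len(golds)
```

For example, `exact_match("The Boston Celtics", "Boston Celtics")` scores 0 even though the two answers name the same team, which is the limitation noted above.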

BLEU Score.

BLEU score papineni2002bleu is a metric used in machine translation to compare the quality of machine-translated text with a set of reference translations. It measures the n-gram overlap between the reference text and the prediction, assigning a score between 0 and 1 depending on the amount of overlap. Although BLEU handles longer phrases better than EM, it measures only word overlap and therefore misses semantic equivalence between the prediction and the reference.
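To make the n-gram overlap concrete, the sketch below computes a simplified single-reference, sentence-level BLEU with clipped n-gram precision and a brevity penalty. It is an illustration, not the evaluation code used in the paper: whitespace tokenization, the absence of smoothing, and `max_n=2` (standard BLEU uses n-grams up to 4) are assumptions made here to suit short answers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(prediction, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped precisions times brevity penalty."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts = ngrams(pred, n)
        ref_counts = ngrams(ref, n)
        total = sum(pred_counts.values())
        # Clip each predicted n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: predictions shorter than the reference are penalized.
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A paraphrase with no shared words, such as "Boston won" against "the Celtics were victorious", scores 0.0 under this measure, which illustrates why overlap alone misses semantic equivalence.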

LLM-score.

To measure generation quality while accounting for the semantic similarity between the ground truth and the predictions, we use an LLM as a judge to evaluate and score the generated outputs. As illustrated in Figure 16, the LLM is tasked with assigning a score on a scale of 0-5 based on the correctness of the prediction. A score of 4 represents less than 5% error between the ground truth and the prediction, and the final accuracy is the number of samples with a score of 4 or more divided by the total number of samples. This yields a metric that credits semantically correct answers even when their surface form differs from the reference.
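The aggregation step can be written in a few lines. This sketch assumes the judge's integer scores have already been parsed from its responses; the function name and the input validation are illustrative choices, not part of the paper.

```python
def llm_score_accuracy(scores, threshold=4):
    """Fraction of samples whose judge score (0-5) is at least `threshold`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    for score in scores:
        if not 0 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
    return sum(1 for score in scores if score >= threshold) / len(scores)
```

For instance, judge scores of 5, 4, 3, and 0 on four samples give an accuracy of 0.5, since only the first two meet the threshold of 4.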

B.3 Baselines

BlendSQL

glenn2024blendsql is a unified dialect that integrates SQL logic with large language model (LLM) reasoning across semi-structured data. It serves as a superset of SQLite, enabling complex hybrid question answering tasks involving multi-hop reasoning. The implementation utilizes the open-source repository blendsql (https://github.com/parkervg/blendsql), with dataset-specific in-context examples and default parameters.

Chain-of-Table

wang2023chain is a prompting framework that extends Chain-of-Thought by incorporating tabular data explicitly in the reasoning chain. It guides LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. The implementation follows the official GitHub repository chain-of-table (https://github.com/google-research/chain-of-table), with the in-context examples tailored to our dataset.

ProTrix

wu2024protrix introduces a Plan-then-Reason framework that plans the reasoning path using the query and context, then assigns each step to either textual or program-based reasoning to arrive at the final answer. We modify the in-context examples in their official repository (https://github.com/WilliamZR/ProTrix) to suit RUST-BENCH and use their default hyperparameters.

TabSQLify

nahid2024tabsqlify is a semantic parsing-based method that translates natural language questions into executable SQL queries over structured tables. It leverages text-to-SQL generation to decompose tables into smaller, relevant sub-tables containing only essential information for answering questions or verifying statements. We utilize tabsqlify (https://github.com/mahadi-nahid/TabSQLify) with updated in-context examples for inference.

TableMaster

cao2025tablemaster is a unified framework that combines multiple techniques for table reasoning. The method first retrieves relevant table content and enriches it with semantic verbalizations, then employs adaptive reasoning to choose between textual and symbolic reasoning for each query. We adopt the official repository TableMaster (https://github.com/zzlang-c/TableMaster), retaining their default hyperparameters for fair comparison.

NormTab

nahid2024normtab focuses on improving symbolic interpretability by normalizing table structures and values prior to reasoning. It standardizes heterogeneous column names and formats, reducing schema variance and enabling more consistent SQL-based reasoning across diverse tables. We utilize the public normtab repository (https://github.com/mahadi-nahid/NormTab), following default parameters and adapting the prompts to our dataset.

Appendix C Reasoning Diversity in RUST-BENCH

Distribution of Question Types.

To understand the reasoning diversity in RUST-BENCH, we adopt and extend the taxonomy proposed in CRT-QA zhang2023crt, which builds on the BIG-bench framework srivastava2022beyond. As shown in Table 6, our annotation covers a broad spectrum of reasoning types—from high-frequency operations such as filtering and temporal reasoning to more complex forms including multi-hop, implicit, and counterfactual reasoning. This diversity underscores the layered cognitive demands required for real-world table understanding. Filtering and temporal reasoning are the most common types, reflecting the frequent need to locate relevant records and interpret time-dependent relationships. However, a significant proportion of questions also require multi-hop reasoning (26.18%), numerical computation (26.83%), and logical composition (27.85%), highlighting the dataset’s emphasis on compositional and quantitative reasoning. Although rarer, counterfactual, commonsense, and causal reasoning further test model generalization beyond surface-level retrieval.

Table 6: Distribution of reasoning types. Categories are non-exclusive; percentages may not sum to 100%.

Reasoning Type	Percentage (%)
Filtering / Selection	75.89
Temporal Reasoning	39.33
Logical Reasoning	27.85
Numerical	26.83
Multi-hop Reasoning	26.18
Aggregation	23.97
Comparison	17.43
Implicit Reasoning	11.36
Unanswerable	6.83
Sorting / Ranking	5.47
Causal Reasoning	5.20
Commonsense Reasoning	0.44
Spatial Reasoning	0.24
Counterfactual / Negative	0.19
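Because a question can carry several reasoning labels, each percentage in Table 6 is computed against the total number of questions rather than the total number of labels. A minimal sketch of such a multi-label tally follows; the function name and the toy annotations are hypothetical and are not the benchmark's data.

```python
from collections import Counter


def reasoning_type_percentages(annotations):
    """Share of questions carrying each label; one set of labels per question."""
    if not annotations:
        raise ValueError("annotations must be non-empty")
    counts = Counter(label for labels in annotations for label in set(labels))
    total = len(annotations)
    return {label: 100.0 * count / total for label, count in counts.items()}


if __name__ == "__main__":
    toy = [
        {"Filtering", "Temporal"},
        {"Filtering"},
        {"Numerical", "Filtering"},
        {"Temporal"},
    ]
    shares = reasoning_type_percentages(toy)
    print(shares, sum(shares.values()))  # the shares can total more than 100
```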

Unanswerable Questions.

In practical table reasoning, not all queries are grounded in the available data. Distinguishing answerable from unanswerable questions is therefore crucial for reliable model deployment in domains such as finance and science. To evaluate this capability, RUST-BENCH incorporates explicitly unanswerable questions following zhang2023crt—queries that cannot be resolved using the table content alone. Examples include those that require external knowledge or contain logical contradictions. A model is considered correct only if it abstains by responding with phrases such as “cannot answer” or “not enough information.” We manually verify outputs to measure accuracy. As shown in Figure 17, models struggle considerably with this task: even under Chain-of-Thought prompting, Gemini-2.0-Flash achieves only 52.27% accuracy in RB-Sports and 26.97% in RB-Science, indicating the persistent challenge of reliable unanswerable detection in table QA.
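The abstention criterion above can be approximated with a simple phrase check, for example as a first pass before manual review. This sketch is an assumption-laden illustration: the paper verifies outputs manually, and the phrase list below contains only the two example phrases quoted in the text.

```python
# The two phrases quoted in the text; a real check would likely need more variants.
ABSTENTION_PHRASES = ("cannot answer", "not enough information")


def is_abstention(response):
    """True if the response contains one of the abstention phrases."""
    text = response.lower()
    return any(phrase in text for phrase in ABSTENTION_PHRASES)


def unanswerable_accuracy(responses):
    """Fraction of responses to unanswerable questions that abstain."""
    if not responses:
        raise ValueError("responses must be non-empty")
    return sum(is_abstention(r) for r in responses) / len(responses)
```

A phrase check of this kind will miss paraphrased abstentions and may flag answers that merely quote these phrases, which is one reason manual verification remains necessary.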

Appendix D Qualitative Analysis

Semi-structured tables in RUST-BENCH pose a unique challenge for LLMs, as they require reasoning that spans both structured schema elements (e.g., categorical or numeric fields) and unstructured text (e.g., summaries or descriptions). Such inputs expose the limitations of models that excel in either symbolic precision or semantic understanding, but not both. As illustrated in Figure 18, answering ‘How many projects focus on children and how many children did the earliest project address?’ requires scanning abstracts for child-related projects, counting across rows, and applying temporal reasoning to identify the earliest award. Crucially, the abstract of the 2016 brain connectivity project mentions developmental trajectories without specifying participant numbers, so the correct response must acknowledge the absence of detail. Similarly, answering the question ‘In a March game at TD Garden, which player from the losing team had the highest points and what was the point difference between him and the leading scorer of the winning team?’ (Figure 19) requires filtering structured fields to locate the relevant March 2019 Celtics–Nuggets game, extracting top scorers from the unstructured summary, aligning them with their teams, and performing arithmetic to compute the score difference. This case exemplifies hybrid reasoning across structured and unstructured inputs, combined with entity disambiguation and grounded numeric comparison. These cases underscore how RUST-BENCH questions move beyond single-field lookup, requiring schema filtering, semantic interpretation, aggregation, and handling of missing details, all capabilities that remain fragile in current LLMs.

Appendix E Error Analysis

To analyze the sources of performance degradation, we manually examined 100 randomly sampled erroneous predictions from Gemini-2.0-Flash (CoT). Errors were grouped into four major categories reflecting distinct failure modes: (i) Interpretation Errors: counting or lookup mistakes caused by complex table structures and increased token load from unstructured fields; (ii) Logical Inconsistency Errors: contradictory or incomplete reasoning chains, particularly in multi-hop settings; (iii) Misalignment Errors: outputs that deviate from the expected answer schema or provide only partial results; and (iv) Extraction Errors: incorrect or missed retrievals from structured or unstructured regions of the table. The breakdown in Table 7 shows that no single type dominates; instead, errors stem from the interaction between structural complexity, multi-step reasoning, and representational inconsistencies introduced by semi-structured inputs.

Table 7: Breakdown of 100 randomly sampled erroneous predictions from Gemini-2.0-Flash (CoT).

Error Type	Percentage
Interpretation Error	22%
Logical Inconsistencies	31%
Misalignment Error	27%
Extraction Error	20%

Extraction Error.

These involve failures to retrieve key information from structured fields or unstructured text. The model may skip valid rows or miss implicit cues, such as differences in project counts across years (Figure 20) or mentions of child-related studies buried in abstracts (Figure 21).

Logical Inconsistency Error.

These occur when the model generates an apparently coherent reasoning chain but produces a final answer inconsistent with its intermediate analysis. For example, as shown in Figure 22, the model may identify both Standard Grant and Continuing Grant as valid answers but report only one, revealing a disconnect between reasoning and final output generation.

Interpretation Error.

Here, the model misreads the scope of the question or the table structure, overlooking relevant rows or applying filters incorrectly. As illustrated in Figure 23, it may compute time gaps based on a single record while ignoring other valid entries, leading to incomplete evidence gathering and erroneous conclusions.

Misalignment Error.

In some cases, the model’s reasoning is correct, but the output format deviates from the expected answer schema, for instance, returning a sum instead of individual attendance values (Figure 24). Collectively, these patterns show that while LLMs can perform multi-step reasoning, they often lose alignment between reasoning, evidence retrieval, and output generation, particularly when operating on semi-structured data that demands both symbolic precision and semantic understanding.