RUST-BENCH: Benchmarking LLM Reasoning on Unstructured Text within Structured Tables

Nikhil Abhyankar1, Purvi Chaurasia2, Sanchit Kabra1, Ananya Srivastava2,
Vivek Gupta3, Chandan K. Reddy1
1Virginia Tech, 2IGDTUW New Delhi, 3Arizona State University
Abstract

Existing tabular reasoning benchmarks mostly test models on small, uniform tables, underrepresenting the complexity of real-world data and giving an incomplete view of Large Language Models’ (LLMs) reasoning abilities. Real tables are long, heterogeneous, and domain-specific, mixing structured fields with free text and requiring multi-hop reasoning across thousands of tokens. To address this gap, we introduce RUST-BENCH, a benchmark of 7,966 questions from 2,031 real-world tables spanning two domains: (i) RB-Science (NSF grant records) and (ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates LLMs jointly across scale, heterogeneity, domain specificity, and reasoning complexity. Experiments with open-source and proprietary models show that LLMs struggle with heterogeneous schemas and complex multi-hop inference, revealing persistent weaknesses in current architectures and prompting strategies. RUST-BENCH establishes a challenging new testbed for advancing tabular reasoning research.
Correspondence: [email protected], [email protected]
https://github.com/tabular-reasoning/RUST-BENCH

1 Introduction

Semi-structured tables containing free-form text embedded within structured fields are common across various domains gupta2020infotabs. Effective data analysis in science, finance, and sports requires reasoning over large, domain-specific tables that combine symbolic structure with textual context. However, existing benchmarks predominantly evaluate short, homogeneous Wikipedia-derived tables pasupat2015compositional; chen2019tabfact, which limits both model generalizability and robustness. Although Large Language Models (LLMs) have made tabular reasoning more accessible by allowing users to query tables directly in natural language cheng2022binding, systematic evaluation of their reasoning abilities over complex tables remains underexplored chen2023large.

Figure 1: Illustration of a multi-step reasoning process for a complex question grounded in a sports table from RUST-BENCH. The example shows that real-world tabular reasoning often demands multiple complementary reasoning skills (temporal, arithmetic, and contextual) and the coordinated use of heterogeneous evidence across long, domain-specific tables.
Table 1: Comparison of RUST-BENCH with other Table QA datasets. RUST-BENCH contains a variety of complex question types over large, domain-specific tables containing semi-structured information. *Only the contents of the table are considered.
Dataset | Source | Complex Reasoning | Unanswerable Questions | Domain Specific | Semi-Structured | Large Tables | Avg. # Rows | Context Length
WikiTQ pasupat2015compositional | Wikipedia | | | | | | 6.3 | 1133.51
TabFact chen2019tabfact | Wikipedia | | | | | | 6.2 | 586.51
Hybrid-QA chen2020hybridqa | Wikipedia | | | | | | 15.7 | 372.14
OTT-QA chen2020open | Wikipedia | | | | | | 15.7 | 372.14
CRT-QA zhang2023crt | Wikipedia | | | | | | 12.6 | 257.12
TAT-QA zhu2021tat | Financial Reports | | | | | | 9.4 | 378.31
FINQA chen2021finqa | FinTabNet zheng2021global | | | | | | 6.4 | 687.51
SciTab lu2023scitab | SciGen moosavi2021scigen | | | | | | 7.5 | 254.53
RUST-BENCH | NSF NSF2024, Sportsett thomson2020sportsett | | | | | | 45.1 | 23040.68

Real-world tabular reasoning introduces four major challenges for LLMs: scale, multi-hop reasoning, heterogeneity, and domain specificity. First, tables can be long, often spanning hundreds of rows and columns, and such long contexts are known to degrade LLM reasoning performance liu2023lost. Similarly, model performance deteriorates as table size grows, even when the entire table fits within the context window, since only a small fraction of rows are typically relevant to a given query abhyankar2024h. Second, many queries require multi-hop reasoning—locating relevant rows, integrating dispersed evidence, and composing it into an answer. Third, heterogeneity arises when tables mix structured fields with free-form text, requiring models to reason over diverse data modalities chen2020hybridqa; zhu2021tat. Finally, domain specificity introduces specialized terminology and domain-specific reasoning patterns, as seen in finance chen2021finqa and science lu2023scitab, which require specialized domain knowledge for effective inference. While existing benchmarks assess specific aspects of table reasoning, they often evaluate these challenges in isolation. The absence of benchmarks that jointly incorporate scale, heterogeneity, and domain specificity constitutes a fundamental limitation, constraining systematic progress toward generalizable tabular reasoning models. We therefore pose the question: Can LLMs effectively reason over unstructured text embedded in long, domain-specific tables?

To answer this, we introduce RUST-BENCH, a new benchmark explicitly designed to stress-test models across four orthogonal axes of real-world tabular reasoning: domain specificity, table length, semi-structured information, and multi-hop reasoning, offering a comprehensive and realistic evaluation framework. RUST-BENCH comprises 2,031 tables sourced from two domains, (a) science and (b) sports, accompanied by 7,966 carefully curated question–answer pairs. We construct the dataset using an LLM-driven hybrid symbolic–semantic generation pipeline that systematically constructs high-quality, multi-hop queries grounded in real-world semi-structured tables while reducing manual annotation costs. As illustrated in Figure 1, each question is designed to evaluate a wide spectrum of reasoning skills (including temporal, numerical, aggregation, verification, commonsense, counterfactual, and ambiguity resolution), with most requiring multi-hop reasoning that integrates information across multiple cells through both parallel and sequential inference. As shown in Table 1, existing benchmarks primarily rely on Wikipedia, which generally involves short contexts and relatively simple reasoning. These datasets often lack domain-specific information, unanswerable queries, and large semi-structured tables, thereby limiting their capacity to reflect real-world complexity. In contrast, RUST-BENCH introduces domain-grounded tables, expands the range of reasoning types, and substantially scales up table size (averaging 45.1 rows and roughly 23,000 tokens per table). This design offers a more realistic and challenging evaluation setting for LLMs. We evaluate RUST-BENCH using state-of-the-art proprietary and open-source LLMs, employing diverse prompting strategies and reasoning methods. Our findings expose systematic weaknesses in handling scale, heterogeneity, and reasoning composition, confirming the value of RUST-BENCH as a challenging and diagnostic benchmark for advancing research on LLM-based table reasoning. Our main contributions are:

• We introduce RUST-BENCH, a large-scale benchmark that jointly evaluates LLMs across four orthogonal dimensions (i.e., scale, heterogeneity, domain specificity, and complex reasoning) previously treated in isolation by existing datasets.

• We develop a hybrid dataset generation pipeline that leverages the complementary strengths of symbolic and semantic reasoning to construct diverse, multi-hop, domain-grounded QA pairs efficiently.

• Comprehensive evaluations of state-of-the-art open-source and proprietary models reveal that current LLMs struggle with large, heterogeneous tables and multi-step reasoning, exposing persistent gaps in table reasoning architectures and prompting strategies.

2 RUST-BENCH Dataset

Figure 2: Overview of RUST-BENCH’s dataset generation and verification pipeline. (a) Table Generation: Raw data are extracted from public web sources and reorganized into tables containing at least 30 rows each. (b) Dataset Generation: Question–Answer pairs are created through two complementary methods: (i) a symbolic approach, which uses SQL-like logical forms to construct schema-intensive, reasoning-heavy queries, and (ii) a semantic approach, which employs LLMs to generate natural, inference-oriented questions from unstructured text. (c) Dataset Verification: All generated pairs undergo human verification to ensure factual correctness and annotation quality.

2.1 Task Formulation

In table-based reasoning, each problem instance is represented as a triplet $(T, Q, A)$, where $T$ denotes the tabular data, $Q$ represents the associated query, and $A$ signifies the anticipated response. Specifically, in the context of table-centric question-answering systems, both $Q$ and $A$ are in natural language. The primary objective is to derive a prediction $a$ utilizing $Q$ and $T$, which can be formally expressed as $a = \pi_{\theta}(T, Q)$, where $\pi_{\theta}$ symbolizes the predictive model.

2.2 RUST-BENCH Creation

Table Collection.

We curate domain-grounded tables from two high-quality sources: the NSF Grants Database NSF2024 for science and the SportSett:Basketball dataset thomson2020sportsett, an enhanced version of RotoWire wiseman2017challenges, for sports. The raw data is cleaned and organized into domain-specific JSON tables, sampled by attributes (such as year and region) and by uniform random selection (Figure 2(a)). We focus on constructing large tables with more than 30 rows, consistent with the definition in chen2023large. To ensure diversity and cross-domain comparability, we apply structured sampling to balance table sizes: 50% with 30–40 rows, 40% with 40–60, and 10% with 60–100. This stratification balances coverage and scale across the domains, yielding a representative mixture of table sizes and schema complexities.
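The size stratification described above can be sketched as follows. This is a minimal illustration, not the authors' sampling code: the function name, the table representation, and the half-open bin boundaries are assumptions, since the text does not say which bin a table with exactly 40 or 60 rows belongs to.

```python
import random

# Target share of tables per row-count bin, as stated in the text.
# Bins are treated as half-open [low, high), which is an assumption.
SIZE_BINS = [((30, 40), 0.5), ((40, 60), 0.4), ((60, 101), 0.1)]


def stratified_sample(tables, n_total, seed=0):
    """Pick about n_total tables so that row counts follow the target shares."""
    rng = random.Random(seed)
    chosen = []
    for (low, high), share in SIZE_BINS:
        pool = [t for t in tables if low <= len(t["rows"]) < high]
        k = min(len(pool), round(n_total * share))
        chosen.extend(rng.sample(pool, k))
    return chosen


# Toy pool: 200 placeholder tables whose row counts cycle through 30..99.
toy_tables = [{"id": i, "rows": [None] * (30 + i % 70)} for i in range(200)]
sample = stratified_sample(toy_tables, n_total=20)
```

With 20 requested tables, the toy run draws 10, 8, and 2 tables from the three bins.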

QA Generation.

Creating high-quality QA pairs for long, domain-specific tables is particularly challenging as manual annotation is slow, costly, and prone to errors when tables span thousands of tokens. Inspired by recent LLM-based data generation methods park2023generative; zhang2023crt; li2024planning, we adopt in-context learning and role-playing paradigms to enable scalable and diverse dataset construction at a lower annotation cost. However, only using LLMs’ textual (semantic) reasoning is inadequate as it captures natural-language inference but fails on structural and quantitative reasoning. Conversely, symbolic reasoning methods yield precise numerical manipulation and structural consistency but lack flexibility with unstructured text liu2023rethinking. We therefore leverage their complementary strengths to design a hybrid symbolic–semantic pipeline (Figure 2(b)) comprising (a) a symbolic approach, which uses SQL-like logical forms to create schema-intensive, reasoning-heavy queries, and (b) a semantic approach, which generates natural, inference-oriented questions from unstructured text.

(a) Symbolic Approach.

The symbolic approach exploits LLMs’ code-generation abilities to synthesize SQL queries over both structured and unstructured table components, creating questions that involve numerical reasoning, aggregation, and logic. We construct a library of 75 SQL templates with placeholders (e.g., SELECT [columns] FROM [table] WHERE [condition]) covering diverse query patterns such as selection, aggregation, and conditional operations (Appendix A.1). During generation, a template is sampled and instantiated with table-specific values, providing a structural scaffold for producing valid SQL queries (Figure 2(b)). For example, a template may yield SELECT MAX(attendance) FROM RB_Sports WHERE city = 'New York', which is then paraphrased into the natural language question ‘What is the highest attendance recorded in NYC?’ by prompting an LLM. To ensure fluency and avoid explicit SQL exposure, entity names are masked or rephrased (e.g., New York → NYC) during paraphrasing. This dual process enables coverage of multiple reasoning types, integrating structured computation with textual variation.
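As a rough illustration of template instantiation, the sketch below fills a placeholder template and executes it with Python's built-in sqlite3 module to obtain a gold answer. The template string, the helper name, and the toy attendance values are all hypothetical; the 75 templates themselves are in Appendix A.1 and are not reproduced here.

```python
import sqlite3

# Hypothetical template in the placeholder style quoted in the text.
TEMPLATE = "SELECT {agg}({column}) FROM {table} WHERE {filter_col} = ?"


def instantiate(template, agg, column, table, filter_col):
    """Fill the structural slots; the filter value stays a bound parameter."""
    return template.format(agg=agg, column=column, table=table, filter_col=filter_col)


# Toy stand-in for one RB-Sports table (values are invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RB_Sports (city TEXT, attendance INTEGER)")
conn.executemany(
    "INSERT INTO RB_Sports VALUES (?, ?)",
    [("New York", 19812), ("New York", 18006), ("Boston", 18624)],
)

sql = instantiate(TEMPLATE, "MAX", "attendance", "RB_Sports", "city")
answer = conn.execute(sql, ("New York",)).fetchone()[0]
```

Executing the instantiated query gives the answer that the paraphrased question should map to. The paraphrasing step itself is an LLM call and is not shown.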

(b) Semantic Approach.

The semantic component uses LLMs’ semantic reasoning to derive insights from unstructured text segments and generate diverse, inference-driven questions that go beyond surface-level lookups. However, LLMs struggle with long or complex inputs liu2023lost, often producing (1) overly simplistic questions and (2) repetitive patterns, especially on large tables. To mitigate these issues, we restrict the input using one of two methods: a Single-Row-Based method for focused intra-row reasoning, or a Multi-Row-Based method for multi-hop reasoning across a small subset of semantically related rows. This setup reduces contextual load and encourages inference beyond simple lookups while keeping questions easily verifiable by human annotators. To further enhance diversity, we maintain a pool of in-context exemplars spanning multiple reasoning types and randomly sample from them during generation. Combined with temperature variation, this encourages broader coverage and deeper reasoning. Details of the single-row and multi-row generation processes are in Appendix A.2.
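The exemplar sampling and temperature variation described above might look roughly like this. Everything here is a hypothetical sketch: the exemplar strings are placeholders, the temperature values are assumed, and the prompt wording is not the one used by the authors (see Appendix A.2 for the actual procedure).

```python
import random

# Hypothetical exemplar pool keyed by reasoning type. The strings are
# placeholders; the paper's actual exemplars are not reproduced here.
EXEMPLARS = {
    "temporal": "<temporal exemplar>",
    "numerical": "<numerical exemplar>",
    "counterfactual": "<counterfactual exemplar>",
    "commonsense": "<commonsense exemplar>",
}
TEMPERATURES = (0.3, 0.7, 1.0)  # assumed values, for illustration only


def build_generation_request(rows, k=2, seed=None):
    """Assemble a prompt from k random exemplars and a small row subset."""
    rng = random.Random(seed)
    kinds = rng.sample(sorted(EXEMPLARS), k)
    shots = "\n\n".join(EXEMPLARS[kind] for kind in kinds)
    context = "\n".join(rows)
    prompt = f"{shots}\n\nRows:\n{context}\n\nWrite one question and its answer."
    return {"prompt": prompt, "temperature": rng.choice(TEMPERATURES), "kinds": kinds}


request = build_generation_request(["row 12 text", "row 30 text"], k=2, seed=7)
```

Passing a single row corresponds to the single-row setting, and passing a few related rows to the multi-row setting.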

2.3 RUST-BENCH Validation

Although LLMs can generate QA pairs at scale, their outputs often suffer from misalignment, limited diversity, and uneven reasoning depth zhang2023crt. To ensure high-quality supervision for RUST-BENCH, we adopt a rigorous human-in-the-loop verification pipeline that filters out poor generations. We first discard malformed or duplicated QA pairs and those with empty or ill-formed answers. Eight Computer Science graduate students then act as annotators, reviewing each remaining pair in a custom web interface that displays the full semi-structured table alongside its question and answer (see Appendix A.3). Annotators rate clarity, answer correctness, and reasoning complexity, and flag uncertain or incorrect cases for secondary review. They are also instructed to ensure that the final answers are concise, self-contained, and free of redundant text to facilitate consistent automatic and human evaluation. Three expert reviewers then re-examine all pairs and consolidate the verified dataset. Low-quality or unverifiable examples are removed, while minor errors are corrected. As summarized in Table 2, this process yields a curated set of high-quality QA pairs supporting multi-hop reasoning over long, heterogeneous tables.

Table 2: Breakdown of QA pairs before and after human verification.
Dataset | Category | Original # QA | Final # QA | % Discarded
RB-Sports | Single Row | 2886 | 2712 | 6.0%
RB-Sports | Multi Row | 1222 | 838 | 31.4%
RB-Sports | Symbolic | 1431 | 1338 | 6.5%
RB-Science | Single Row | 915 | 805 | 12.0%
RB-Science | Multi Row | 1516 | 1101 | 27.3%
RB-Science | Symbolic | 1267 | 1172 | 7.5%
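The discard percentages in Table 2 follow from the original and final counts as (original − final) / original. The short check below recomputes them; the dictionary layout and names are only for illustration. The recomputed rates agree with the reported column to within 0.1 percentage points, and the final counts sum to the 7,966 questions quoted for the benchmark.

```python
# (original, final, reported % discarded) copied from Table 2.
TABLE_2 = {
    ("RB-Sports", "Single Row"): (2886, 2712, 6.0),
    ("RB-Sports", "Multi Row"): (1222, 838, 31.4),
    ("RB-Sports", "Symbolic"): (1431, 1338, 6.5),
    ("RB-Science", "Single Row"): (915, 805, 12.0),
    ("RB-Science", "Multi Row"): (1516, 1101, 27.3),
    ("RB-Science", "Symbolic"): (1267, 1172, 7.5),
}


def discard_rate(original, final):
    """Percentage of generated QA pairs removed during verification."""
    return 100.0 * (original - final) / original


recomputed = {key: discard_rate(orig, final) for key, (orig, final, _) in TABLE_2.items()}
total_final = sum(final for _, final, _ in TABLE_2.values())
```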

2.4 RUST-BENCH Statistics

Table 3 summarizes the RUST-BENCH dataset, comprising 2,031 tables spanning RB-Sports (1,326) and RB-Science (705). Although both domains contain tables of similar length, RB-Science shows greater structural complexity, with more columns and higher token counts per table. We include 5,674 questions in RB-Sports and 2,292 in RB-Science, averaging 4.28 questions per table in RB-Sports and 3.25 in RB-Science, plus a subset of unanswerable queries. For unstructured passages, RB-Science has higher average token counts (477.62 vs. 400.58; medians 469 vs. 368) and a larger token standard deviation (149.87 vs. 114.21), while RB-Sports has slightly more sentences per passage on average. To assess annotation quality, we conducted a human-rated complexity study following nan2022fetaqa. Three experts rated 100 random examples on a 1–5 scale, with scores $\geq 4$ indicating high-quality QA pairs. The study achieved 91.7% inter-annotator agreement, confirming the dataset’s reliability.

Table 3: Summary statistics of RUST-BENCH across RB-Sports and RB-Science.
Statistic | RB-Sports | RB-Science
Tables
# Tables | 1326 | 705
Avg. Rows / Table | 44.95 | 45.13
Avg. Columns / Table | 12.0 | 28.0
Avg. Tokens / Table | 18304.47 | 31948.79
Questions
# Questions | 5674 | 2292
Avg. Question Length (words) | 26.92 | 27.48
# Questions / Table | 4.28 | 3.25
# Unanswerable Questions | 132 | 372
Unstructured Text
Avg. Tokens / Passage | 400.58 | 477.62
Std. Tokens | 114.21 | 149.87
Median Tokens | 368.00 | 469.00
Avg. Sentences / Passage | 16.22 | 14.34
Std. Sentences | 4.32 | 4.84
Median Sentences | 15.00 | 14.00
Inter-Annotator Agreement | 91.7%
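The per-domain figures in Table 3 can be pooled into benchmark-wide numbers with a table-count-weighted average. The snippet below is a small arithmetic check with illustrative variable names; it reproduces the 23040.68 average context length listed for RUST-BENCH in Table 1 and the questions-per-table values quoted in the text.

```python
# Per-domain values copied from Table 3.
N_TABLES = {"RB-Sports": 1326, "RB-Science": 705}
AVG_TOKENS = {"RB-Sports": 18304.47, "RB-Science": 31948.79}
N_QUESTIONS = {"RB-Sports": 5674, "RB-Science": 2292}

total_tables = sum(N_TABLES.values())
total_questions = sum(N_QUESTIONS.values())

# Weight each domain's average by its number of tables.
pooled_avg_tokens = sum(N_TABLES[d] * AVG_TOKENS[d] for d in N_TABLES) / total_tables

questions_per_table = {d: N_QUESTIONS[d] / N_TABLES[d] for d in N_TABLES}
```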

3 Experiments

Table 4: Comparison of LLM backbones using various prompting strategies on RB-Science and RB-Sports, measured by (a) Exact Match (EM), (b) BLEU, and (c) LLM-as-a-judge (LLM-score). Higher values indicate better performance.
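Model | Strategy | RB-Science EM (%) | RB-Science BLEU | RB-Science LLM-score (%) | RB-Sports EM (%) | RB-Sports BLEU | RB-Sports LLM-score (%)
Large Language Models
GPT-4o-mini | Zero-Shot | 36.6 | 0.293 | 40.4 | 39.8 | 0.285 | 43.1
GPT-4o-mini | Few-Shot | 37.9 | 0.296 | 36.7 | 31.3 | 0.301 | 33.9
GPT-4o-mini | CoT | 44.4 | 0.378 | 48.8 | 42.1 | 0.365 | 45.2
GPT-4o-mini | PoT | 32.8 | 0.312 | 34.5 | 30.6 | 0.285 | 33.6
Llama-3.3-70B | Zero-Shot | 38.8 | 0.301 | 47.1 | 39.2 | 0.311 | 44.3
Llama-3.3-70B | Few-Shot | 41.7 | 0.347 | 46.4 | 46.7 | 0.350 | 48.9
Llama-3.3-70B | CoT | 44.2 | 0.401 | 45.3 | 42.2 | 0.392 | 43.9
Llama-3.3-70B | PoT | 27.7 | 0.299 | 30.6 | 31.1 | 0.289 | 33.0
Gemini-2.0-Flash | Zero-Shot | 40.7 | 0.370 | 47.3 | 38.6 | 0.345 | 45.4
Gemini-2.0-Flash | Few-Shot | 45.9 | 0.373 | 48.8 | 41.4 | 0.340 | 43.3
Gemini-2.0-Flash | CoT | 47.3 | 0.454 | 50.8 | 44.1 | 0.419 | 48.7
Gemini-2.0-Flash | PoT | 18.2 | 0.225 | 23.6 | 26.3 | 0.239 | 29.1
Mistral-Small-3.2 | Zero-Shot | 48.3 | 0.410 | 50.5 | 45.7 | 0.404 | 48.0
Mistral-Small-3.2 | Few-Shot | 50.3 | 0.373 | 51.6 | 43.9 | 0.365 | 45.2
Mistral-Small-3.2 | CoT | 52.6 | 0.454 | 53.1 | 51.5 | 0.446 | 51.7
Mistral-Small-3.2 | PoT | 29.8 | 0.278 | 29.9 | 20.5 | 0.241 | 26.4
Large Reasoning Models
Qwen3-14B | | 42.6 | 0.441 | 44.4 | 41.2 | 0.433 | 43.1
Qwen-QwQ | | 48.1 | 0.526 | 54.1 | 46.1 | 0.479 | 55.7
Qwen-Distill-32B | | 43.1 | 0.407 | 49.9 | 39.2 | 0.426 | 44.6
Llama-Distill-70B | | 44.6 | 0.483 | 52.4 | 40.5 | 0.455 | 50.9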

LLM Backbones.

We benchmark a diverse set of state-of-the-art large language models, spanning both open-source and proprietary families, as well as reasoning-optimized variants for complex problem-solving. Specifically, we evaluate Llama-3.3-70B-Instruct dubey2024llama, GPT-4o-mini openai2023gpt, Gemini-2.0-Flash team2023gemini, and Mistral-Small-3.2-24B-Instruct-2506 jiang2024mixtral. Beyond these general-purpose models, we also assess Qwen3-14B, Qwen-32B-QwQ yang2025qwen3, Qwen-Distill-32B, and Llama-Distill-70B guo2025deepseek, which are specialized for reasoning tasks. All models are evaluated using default hyperparameters and a fixed decoding temperature ($\tau = 0.1$) for consistency across runs. Following wang2023chain, each table is linearized into a pipe-separated format and concatenated with its query across models.
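A minimal sketch of pipe-separated linearization is shown below. The exact layout used in the experiments is not specified in this section, so the header line, the " | " separator, the handling of pipes inside cells, and the "Question:" suffix are all assumptions made for illustration.

```python
def linearize(header, rows):
    """Render a table as one pipe-separated line per row, header first."""
    def clean(cell):
        # Keep each row on one line and avoid stray separators inside cells.
        return str(cell).replace("\n", " ").replace("|", "/")

    lines = [" | ".join(clean(name) for name in header)]
    lines.extend(" | ".join(clean(cell) for cell in row) for row in rows)
    return "\n".join(lines)


def build_input(header, rows, question):
    """Concatenate the linearized table with its query."""
    return f"{linearize(header, rows)}\n\nQuestion: {question}"


header = ["team", "points", "summary"]
rows = [["Knicks", 102, "Won at home | OT"]]
model_input = build_input(header, rows, "How many points did the Knicks score?")
```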

Baselines.

We evaluate two baseline categories: (i) prompting strategies and (ii) table reasoning methods developed specifically for tabular data. For prompting, we adopt four standard paradigms: (i) Zero-shot prompting, where the model directly answers the table–question pair; (ii) Few-shot prompting chen2023large, with four in-context examples; (iii) Chain-of-Thought (CoT) wei2022chain, encouraging intermediate reasoning steps; and (iv) Program-of-Thought (PoT) chen2023program, which incorporates executable programs as intermediate reasoning. For table reasoning methods, we use GPT-4o-mini and Llama-3.3-70B as LLM backbones to evaluate six state-of-the-art approaches: BlendSQL glenn2024blendsql, a hybrid framework embedding SQL-style reasoning within natural prompts; Chain-of-Table wang2023chain, which performs stepwise table updates for interpretable reasoning; ProTrix wu2024protrix, integrating SQL planning with compositional reasoning; TabSQLify nahid2024tabsqlify, which uses SQL to partition large tables into sub-tables for scalable inference; TableMaster cao2025tablemaster, combining textual and symbolic reasoning via adaptive table verbalization; and NormTab nahid2024normtab, normalizing table structures and values to improve symbolic interpretability. Additional implementation details are provided in Appendix B.3.

Evaluation Metrics.

For fairness and consistency, all models are evaluated under identical input and output constraints, focusing on accuracy and generation quality. Each model is instructed to produce concise, self-contained natural language answers; for SQL-based methods, query output is post-processed and verbalized in natural language for comparability. Following pasupat2015compositional; zhang2023crt, we report Exact Match (EM) as the primary metric. We further relax the evaluation with BLEU papineni2002bleu to capture $n$-gram overlap and an LLM-as-a-Judge (LLM-Score) evaluation using GPT-4o-mini to assess semantic equivalence. This combination provides complementary signals for lexical accuracy, surface fluency, and semantic faithfulness. For more details, see Appendix B.2.
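To make the EM metric concrete, here is a hedged sketch of an exact-match scorer. The normalization (lowercasing, stripping punctuation and articles) is a common convention and an assumption here; the normalization actually used is described in Appendix B.2 and may differ.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)


def em_score(predictions, golds):
    """Exact Match as a percentage over paired predictions and gold answers."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)


score = em_score(["The Lakers.", "NYC"], ["lakers", "New York"])
```

In the toy call, "NYC" versus "New York" counts as a miss even though the answers are equivalent, which is the kind of case the BLEU and LLM-score metrics are meant to soften.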

3.1 Main Results

Table 4 reports the performance of different LLM backbones on RUST-BENCH using Exact Match (EM), BLEU, and LLM-score. Overall, Qwen-QwQ achieves the highest BLEU and LLM-score on both domains, with LLM-scores of 54.1 and 55.7 for RB-Science and RB-Sports, respectively, while Mistral-Small-3.2 with CoT attains the highest EM (52.6% and 51.5%). Furthermore, CoT outperforms Zero-Shot and Few-Shot on EM in nearly all settings, the exception being Llama-3.3-70B on RB-Sports, where Few-Shot scores higher, highlighting the importance of explicit reasoning in this setting. In contrast, PoT exhibits the weakest performance across all models, likely due to the semi-structured nature of the data. In Table 5, we present a comparison of table reasoning baselines on RUST-BENCH, implemented using GPT-4o-mini and Llama-3.3-70B as the backbones. Among these, TableMaster achieves the best overall results, reaching 42.3% EM with GPT-4o-mini and 43.1% with Llama-3.3-70B. In contrast, symbolic or SQL-based methods such as TabSQLify and BlendSQL perform worse, achieving EM scores of 15.3% and 13.6%, respectively, with GPT-4o-mini. These findings suggest that purely symbolic reasoning pipelines are insufficient for the flexible, context-driven inference required by RUST-BENCH, which is consistent with the weak PoT results in Table 4.

3.2 Impact of Table Size

To investigate how table size affects reasoning accuracy in our setting, we analyze model performance across naturally occurring tables grouped by total token count, spanning from 10K to 85K tokens. As illustrated in Figure 3, GPT-4o-mini, Gemini-2.0-Flash, and Llama-3.3-70B exhibit a consistent, monotonic decline in Exact Match accuracy as table size increases, with degradation becoming particularly pronounced beyond the 35K–50K token range. Notably, this performance drop occurs well within the nominal context windows of modern LLMs (typically 128K+ tokens), suggesting that the bottleneck arises from reasoning and attention limitations rather than raw context length. This degradation can be attributed to LLMs’ difficulty in locating, retrieving, and integrating dispersed evidence across long sequences liu2023lost, and to increased multi-hop reasoning complexity. Unlike existing benchmarks that predominantly feature concise tables under 5,000 tokens (pasupat2015compositional; chen2019tabfact), RUST-BENCH includes substantially longer and more heterogeneous tables where critical information is often scattered across extensive contexts. These findings highlight the need for improved query-specific data extraction mechanisms to effectively handle large-scale tabular reasoning tasks.
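The grouping behind this analysis can be sketched as a per-bin accuracy computation. The bin edges below are hypothetical (the text gives only the overall 10K to 85K range and mentions a 35K–50K region), and the toy records are invented to show the mechanics, not to reproduce Figure 3.

```python
import bisect
from collections import defaultdict

# Hypothetical bin edges in tokens; only the 10K-85K range comes from the text.
BIN_EDGES = [10_000, 20_000, 35_000, 50_000, 85_000]


def bin_label(n_tokens):
    """Map a token count to a label such as '35K-50K' (out-of-range values are clamped)."""
    idx = bisect.bisect_right(BIN_EDGES, n_tokens) - 1
    idx = max(0, min(idx, len(BIN_EDGES) - 2))
    return f"{BIN_EDGES[idx] // 1000}K-{BIN_EDGES[idx + 1] // 1000}K"


def accuracy_by_bin(records):
    """records: iterable of (table_token_count, is_correct) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for n_tokens, is_correct in records:
        label = bin_label(n_tokens)
        totals[label] += 1
        hits[label] += int(is_correct)
    return {label: 100.0 * hits[label] / totals[label] for label in totals}


toy = [(12_000, True), (15_000, True), (18_000, False),
       (40_000, True), (45_000, False), (70_000, False)]
acc = accuracy_by_bin(toy)
```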

Figure 3: Accuracy comparison of LLMs across varying token count bins. The x-axis represents token length ranges, while the y-axis shows accuracy in percentage.
Table 5: Comparison of baselines on RUST-BENCH using GPT-4o-mini and Llama-3.3-70B using: (a) Exact Match (EM), (b) BLEU, and (c) LLM-as-a-judge (LLM-score), with higher values indicating better performance.
Method | GPT-4o-mini EM (%) | GPT-4o-mini BLEU | GPT-4o-mini LLM-score (%) | Llama-3.3-70B EM (%) | Llama-3.3-70B BLEU | Llama-3.3-70B LLM-score (%)
TabSQLify | 15.3 | 0.206 | 22.3 | 14.4 | 0.120 | 18.6
BlendSQL | 13.6 | 0.186 | 20.2 | 11.7 | 0.145 | 13.6
ProTrix | 32.6 | 0.319 | 33.9 | 28.3 | 0.265 | 31.5
Chain-of-Table | 30.1 | 0.247 | 35.1 | 33.2 | 0.358 | 36.9
NormTab | 33.9 | 0.338 | 36.8 | 30.9 | 0.279 | 34.9
TableMaster | 42.3 | 0.431 | 44.2 | 43.1 | 0.386 | 45.4

3.3 Impact of Real-World Table Complexity

To assess how the combination of real-world structural complexity and multi-hop reasoning affects model performance, we compare two proprietary LLMs, GPT-4o-mini and Gemini-2.0-Flash, on WikiTQ (a general-knowledge benchmark) and RUST-BENCH. We evaluate both models under zero-shot and Chain-of-Thought (CoT) prompting settings. As shown in Figure 4, both models demonstrate strong performance on WikiTQ, with GPT-4o-mini achieving 59.4% accuracy in zero-shot and 64.5% with CoT, while Gemini-2.0-Flash reaches 69.7% and 80.4%, respectively. In contrast, performance on RUST-BENCH drops sharply to roughly 20–30% across all prompting strategies for both models. This substantial gap reveals the compounding challenges introduced by domain-specific reasoning, heterogeneous table schemas, long contexts, and multi-hop inference. Unlike WikiTQ’s short, homogeneous tables dominated by direct lookup queries, RUST-BENCH captures the full spectrum of real-world tabular reasoning, where multiple factors interact to create harder reasoning problems. Such a dramatic decline underscores the limits of current LLMs in generalizing beyond simplified benchmarks and highlights the pressing need for more robust and compositional reasoning mechanisms.

Figure 4: Performance comparison of LLM backbones on RUST-BENCH and WikiTQ using EM accuracy. Unlike WikiTQ, RUST-BENCH tests LLMs with more challenging questions and tables, resulting in a reduced LLM performance.

3.4 Impact of Heterogeneous Data

While multi-hop evaluation has been extensively studied as a driver of task difficulty, the influence of data heterogeneity and structure on reasoning performance remains less explored. To investigate how the underlying data representation influences reasoning performance, we conduct controlled experiments on 100 randomly sampled RB-Sports tables. Each semi-structured table is converted into two variants, structured and unstructured, while the underlying content is kept identical. In the structured setting, information is normalized into explicit columns, minimizing free-form text; in the unstructured setting, each table row is verbalized into natural-language sentences and appended to the textual field, simulating highly heterogeneous inputs. We first evaluate symbolic reasoning methods, specifically Program-of-Thought (PoT) prompting, on the structured and semi-structured variants. As shown in Figure 5, PoT consistently achieves higher accuracy on the structured version across all models except Llama-3.3-70B, which performs comparably on both. This pattern indicates that symbolic reasoning benefits from explicit schema structure and reduced textual noise, confirming its reliance on syntactic regularity. Next, we assess text-based reasoning methods using Chain-of-Thought (CoT) prompting on the unstructured and semi-structured variants.

Figure 5: Performance comparison on structured and semi-structured variants for different LLM backbones using Program-of-Thought (PoT) prompting.
Figure 6: Performance comparison on unstructured and semi-structured variants for different LLM backbones using Chain-of-Thought (CoT) prompting.

Figure 6 shows that CoT yields higher accuracy on the unstructured representation, indicating that natural-language continuity facilitates stepwise reasoning when explicit structure is absent. Overall, these results show that semi-structured data presents the greatest reasoning challenge, as it combines the ambiguity of free-text with the rigidity of tabular schema, while purely structured or unstructured formats better align with the respective strengths of symbolic and semantic reasoning. We further perform an in-depth error analysis to characterize common failure modes and provide qualitative examples of reasoning diversity in RUST-BENCH in Appendices C, D, and E.
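The unstructured variant used in this comparison, where each row's structured cells are verbalized and appended to its text field, could be produced along the following lines. The function, the sentence pattern, and the column names are hypothetical; the conversion actually used for the experiments is not specified in this section.

```python
def verbalize_row(row, text_field):
    """Fold every structured cell of a row into its free-text field."""
    facts = [f"{column} is {value}" for column, value in row.items() if column != text_field]
    sentence = "The " + ", the ".join(facts) + "."
    return f"{row[text_field]} {sentence}".strip()


# Toy row with invented values and column names.
row = {"team": "Knicks", "points": 102, "summary": "The Knicks won at home."}
text = verbalize_row(row, text_field="summary")
```

The output keeps the same facts as the original row but removes the column structure that symbolic methods such as PoT rely on.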

4 Related Work

General Table Reasoning.

Table reasoning tasks typically involve well-structured, short tables, often derived from Wikipedia-based sources. Datasets such as WikiTQ pasupat2015compositional, SQA iyyer2017search, WikiSQL zhong2017seq2sql, and Spider yu2018spider focus on question answering or text-to-SQL tasks that test reasoning over such tables. While WikiTQ and SQA include complex questions, WikiSQL pairs natural language questions with SQL queries, and Spider offers a large-scale, cross-domain collection with diverse databases and complex SQL. Beyond question answering, fact-verification datasets like TabFact chen2019tabfact and Infotabs gupta2020infotabs evaluate claim verification over Wikipedia data, while FetaQA nan2022fetaqa targets free-form question answering requiring reasoning over entity relations. However, these datasets primarily rely on short, factual tables with limited query diversity and shallow reasoning depth.

Semistructured and Complex Reasoning.

Datasets such as FEVEROUS aly2021fact, Hybrid-QA chen2020hybridqa, and OTT-QA chen2020open extend table reasoning to open-domain contexts combining text and tables, yet still exhibit limited diversity in reasoning types and structural variation. In contrast, reasoning-focused datasets like TempTabQA gupta2023temptabqa and TABMWP lu2022dynamic emphasize specific reasoning skills like temporal and numerical reasoning, respectively, but lack semi-structured contexts. CRT-QA zhang2023crt covers a broader range of reasoning types but remains constrained by structured-only, open domain data. Our dataset bridges these gaps by combining domain-specific, semi-structured tables with diverse, multi-hop reasoning tasks that span both structured and unstructured modalities.

Domain-Specific Datasets.

Datasets tailored to specific domains typically require specialized background knowledge and retrieval mechanisms to answer domain-grounded questions. In the finance domain, FinQA chen2021finqa, TAT-QA zhu2021tat, and MultiHiertt zhao2022multihiertt emphasize numerical and logical reasoning, often integrating heterogeneous data sources. SemTabFacts wang2021semeval and SciTAB lu2023scitab focus on claim verification using tables from scientific articles, while SciTabQA lu2023scitab extends this to question answering over mixed textual and tabular evidence. Despite their domain focus, these datasets generally contain small, homogeneous tables with limited semi-structured context, thereby constraining the study of complex, multi-hop reasoning. As illustrated in Figure 7, RUST-BENCH differs by unifying large-scale, heterogeneous, and domain-specific tables—capturing the full spectrum of real-world reasoning challenges.

Figure 7: Overview of table reasoning datasets categorized by key challenges: (a) domain-specific, (b) long, and (c) semi-structured tables, and (d) complex queries. RUST-BENCH integrates datasets that span multiple dimensions of real-world complexity. In contrast, existing benchmarks (e.g., WikiTQ, TabFact) satisfy only a subset or none of these criteria, limiting their applicability to practical, heterogeneous information systems.

5 Conclusion

We presented RUST-BENCH, the first benchmark that jointly evaluates LLMs on tabular reasoning across four fundamental challenges of real-world data: scale, heterogeneity, domain specificity, and multi-hop inference. Our experiments demonstrate that even the strongest proprietary and open-source models systematically fail under these conditions, as accuracy drops sharply with increasing table length, and multi-hop reasoning over semi-structured, domain-specific tables frequently breaks down. RUST-BENCH provides a robust evaluation framework and a foundation for advancing research in symbolic and structured reasoning, which is an essential step toward reliable real-world deployment. Future work on RUST-BENCH will emphasize broader coverage by adding diverse domains (e.g., healthcare, finance, climate), multilingual settings, and more complex table structures (hierarchical, nested, evolving) to better test cross-domain generalization. We will also introduce real-world noise (e.g., missing cells, typos, schema drift, and conflicting units) to assess robustness, calibration, and recovery under imperfect data. Finally, we will pair LLMs with tools for retrieval, schema induction, and execution, aiming for verifiable, scalable reasoning over semi-structured data.

Limitations

While RUST-BENCH marks a step forward in evaluating LLMs on realistic tabular reasoning, it could further incorporate multi-table and relational reasoning, introduce training splits to support fine-tuning and adaptation, and explore richer evaluation protocols that better capture semantic correctness in complex answers. These developments can help create robust and generalizable approaches to tabular reasoning in real-world applications.

Ethics Statement

We, the authors, affirm that our work adheres to the highest ethical standards in research and publication. We have carefully considered and addressed various ethical issues to ensure the responsible and fair use of computational linguistics methodologies. To facilitate reproducibility, we provide detailed information, including code, datasets (all publicly available and in compliance with their respective ethical standards), and other relevant resources. Our claims align with the experimental results, though some stochasticity is expected with black-box large language models, which we minimize by maintaining a fixed temperature. We provide comprehensive details on annotations, dataset splits, models used, and prompting methods, ensuring our work can be reliably reproduced.

Acknowledgments

This research was partially supported by the U.S. National Science Foundation (NSF) under Grant No. 2416728. We also extend our gratitude to the Complex Data Reasoning and Analysis Lab (CoRAL) at Arizona State University for providing essential computational resources, mentorship, and a collaborative research environment that greatly contributed to the progress of this work. We sincerely thank Beenaa Salian and Preethi Suresh for their assistance in data annotation, verification, and code implementation, which played a key role in ensuring the accuracy and reliability of our results. Finally, we appreciate the thoughtful and constructive feedback provided by the reviewers, which helped strengthen the quality and presentation of this research.

Appendix A More Details on Dataset Generation

A.1 Symbolic Approach

To enable QA pair generation using the symbolic approach, we curate a diverse collection of approximately 75 SQL query templates. These templates are designed to cover a broad spectrum of SQL constructs, including basic SELECT statements, conditional logic (AND, OR), aggregation (MAX, SUM, etc.), sorting (ORDER BY), grouping (GROUP BY), and joins. As shown in Figure 8, each template includes placeholder tokens for table names, columns, and filter conditions, allowing for broad applicability across different schemas. To instantiate these templates, we adopt a prompt-based generation approach leveraging large language models (LLMs). Specifically, we sample a template at random and prompt the LLM with task instructions and in-context exemplars to replace the template placeholders using schema-specific information derived from a target semi-structured table. This results in a fully instantiated SQL query tailored to the table (Figure 9). The generated SQL is then executed on the underlying table to obtain the corresponding answer. In a subsequent step, we prompt the LLM with the SQL query and its result to generate a natural language question that semantically aligns with the query logic but obscures the clauses. The final output is a question-answer pair, where the answer is grounded in the execution result of the SQL, and the question is a fluent natural language version reflecting the underlying semantics. This pipeline supports scalable QA dataset generation grounded in executable symbolic programs, enabling evaluation of models on structured reasoning tasks.
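The template-then-execute step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the template string, the `awards` table, its columns, and the slot values are all hypothetical, and the slot values that an LLM would choose in the paper are hard-coded here.

```python
import sqlite3

# Hypothetical template with placeholder tokens; the actual templates
# (Figure 8) may look different.
TEMPLATE = "SELECT {agg}({value_col}) FROM {table} WHERE {filter_col} = ?"


def instantiate(template: str, slots: dict) -> str:
    """Fill the placeholder tokens of a template.

    In the paper an LLM picks schema-specific slot values; in this sketch
    they are supplied by hand.
    """
    return template.format(**slots)


def build_demo_table() -> sqlite3.Connection:
    """Create a tiny in-memory table with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE awards (title TEXT, award_type TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO awards VALUES (?, ?, ?)",
        [
            ("Project A", "Standard Grant", 100),
            ("Project B", "Continuing Grant", 250),
            ("Project C", "Standard Grant", 300),
        ],
    )
    return conn


def execute(conn: sqlite3.Connection, sql: str, params: tuple):
    """Run the instantiated query and return the single scalar answer."""
    return conn.execute(sql, params).fetchone()[0]


slots = {"agg": "MAX", "value_col": "amount", "table": "awards", "filter_col": "award_type"}
sql = instantiate(TEMPLATE, slots)
conn = build_demo_table()
answer = execute(conn, sql, ("Standard Grant",))
print(sql)
print(answer)
```

Identifiers such as table and column names cannot be bound as SQLite parameters, so the sketch fills them by string formatting and binds only the filter value. The resulting `(sql, answer)` pair is what a second LLM prompt would verbalize into a natural language question.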

Figure 8: Example of SQL templates used for QA generation.

A.2 Semantic Approach

As outlined in Section 2, we employ two prompting strategies, Single-Row-Based and Multi-Row-Based, to improve the quality, diversity, and verifiability of LLM-generated questions over large tabular data. Figure 10 illustrates both approaches. In the Single-Row-Based method, we randomly sample one row from the table and use it as the entire input context. This localization helps the LLM focus on intra-row reasoning, such as retrieving or interpreting structured and unstructured cell content. It also simplifies verification, as each question-answer (QA) pair depends on a well-defined and constrained context. In contrast, the Multi-Row-Based method is designed to enable multi-row reasoning by selecting a subset of rows that are semantically connected via a shared entity in a specific column. By narrowing the input to only a few rows, these strategies, as shown in Figure 10 (bottom), help overcome LLM limitations with long inputs by explicitly controlling context size and composition. They allow generating QA pairs that are diverse in type, grounded in the table content, and more easily verifiable.
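The two context-selection strategies can be sketched as below. This is an illustrative sketch only: the function names, the `max_rows` cap, and the demo rows are assumptions, and the paper does not specify how many rows the multi-row method selects.

```python
import random
from collections import defaultdict


def sample_single_row(rows, rng):
    """Single-row context: one randomly chosen row is the entire input."""
    return [rng.choice(rows)]


def sample_multi_row(rows, entity_col, rng, max_rows=3):
    """Multi-row context: a few rows that share an entity in `entity_col`.

    `max_rows` is a hypothetical cap that keeps the prompt short.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[entity_col]].append(row)
    # Only entities that appear in at least two rows allow multi-row questions.
    candidates = [group for group in groups.values() if len(group) >= 2]
    if not candidates:
        return []
    group = rng.choice(candidates)
    return rng.sample(group, min(max_rows, len(group)))


# Made-up rows standing in for a semi-structured table.
rows = [
    {"game_id": 1, "team": "X", "summary": "X won at home."},
    {"game_id": 2, "team": "X", "summary": "X lost on the road."},
    {"game_id": 3, "team": "X", "summary": "X won in overtime."},
    {"game_id": 4, "team": "Y", "summary": "Y won by ten."},
    {"game_id": 5, "team": "Y", "summary": "Y lost narrowly."},
    {"game_id": 6, "team": "Z", "summary": "Z played once."},
]

rng = random.Random(0)
single_context = sample_single_row(rows, rng)
multi_context = sample_multi_row(rows, "team", rng)
print(single_context)
print(multi_context)
```

Either context would then be placed in the generation prompt in place of the full table, which is what keeps the input short and the resulting QA pair easy to verify.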

(a) Using code generation capabilities of LLMs to generate SQL queries.
(b) Converting the SQL queries to natural language question-answer pairs.
Figure 9: QA pair generation using symbolic approach. We leverage LLMs’ code generation capabilities to generate SQL queries, which are then converted to natural language questions and answers by executing the SQL queries on the table data.
Figure 10: QA pair generation using semantic approach: (a) Single-Row Approach (top); (b) Multi-Row Approach (bottom), which forms questions on a subset of the table.

A.3 More Details on Data Validation

Figure 11: Annotation Platform - User Interface.

Figure 11 illustrates the custom verification interface used during the human-in-the-loop annotation process. Each screen presents a question, its predicted answer, and a detailed explanation generated by the model, alongside an interactive table view displaying the relevant semi-structured data. Annotators could validate the question-answer pair using tools such as column-specific filters, row-level sorting, and a search bar to locate supporting evidence quickly. The interface also includes input fields for correcting errors and a checkbox for discarding invalid questions. This setup ensured that annotators had full contextual access while verifying QA pairs, improving both accuracy and efficiency. After one round of annotations, the samples were further verified by expert verifiers to ensure high-quality question-answer pairs. The entire process was conducted by annotators and reviewed by graduate students in Computer Science.

Appendix B Implementation Details

In this section, we describe the prompting strategies, evaluation metrics, and LLM-based table reasoning baselines used in our study, along with their implementation details.

B.1 Prompting Techniques

We implement four reasoning techniques to use LLMs to perform tabular reasoning. Figures 12, 13, 14, and 15 show the direct prompting (zero-shot), few-shot, chain-of-thought (CoT), and program-of-thought (PoT) prompts, respectively.

Figure 12: Prompt for Direct prompting.
Figure 13: Prompt for Few Shot reasoning.
Figure 14: Chain-of-Thought reasoning prompt.
Figure 15: Program-of-Thought reasoning prompt.

B.2 Evaluation Metrics

Exact Match (EM).

Following WikiTQ pasupat2015compositional, we use exact match (EM) as the metric for evaluating model performance. EM assigns a score of 1 if the predicted answer is exactly the same as the gold answer, and 0 otherwise. The final EM accuracy is the sum of the individual exact match scores divided by the total number of samples in the set. However, even after normalization that ignores punctuation and case, EM penalizes semantically correct generations that do not exactly match the ground truth. It becomes increasingly difficult to evaluate longer answers that contain short phrases or multiple entities. We thus explore more relaxed metrics that do not penalize semantically correct generations.
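A minimal sketch of EM accuracy follows. The normalization (lowercasing, stripping ASCII punctuation, collapsing whitespace) is an assumption about what "ignoring punctuation and case" means here; the exact normalization used in the paper is not specified.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> int:
    """Return 1 if the normalized prediction equals the normalized gold answer, else 0."""
    return int(normalize(pred) == normalize(gold))


def em_accuracy(preds, golds) -> float:
    """Sum of per-sample EM scores divided by the number of samples."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    return sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(preds)


print(exact_match("Standard Grant.", "standard grant"))
print(exact_match("The award is a Standard Grant", "Standard Grant"))
```

The second call scores 0 even though the prediction contains the gold answer, which is the kind of penalty on semantically correct output that motivates the more relaxed metrics.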

BLEU Score.

BLEU score papineni2002bleu is a metric used in machine translation to compare the quality of machine-translated text with a set of reference translations. It measures the n-gram overlap between the reference text and the prediction, assigning a score between 0 and 1 depending on the amount of overlap. Although it handles longer phrases better than EM, BLEU measures only word overlap and therefore misses the semantic relevance between the prediction and the reference.
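The n-gram overlap at the core of BLEU can be illustrated with clipped n-gram precision. This sketch is deliberately simplified and is not full BLEU: it uses a single reference, whitespace tokenization, and one n-gram order, and it omits the brevity penalty and the geometric mean over orders.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(pred: str, ref: str, n: int = 1) -> float:
    """Fraction of predicted n-grams that also occur in the reference.

    Each predicted n-gram is credited at most as often as it appears in the
    reference (the "clipping" used by BLEU).
    """
    pred_counts = Counter(ngrams(pred.lower().split(), n))
    ref_counts = Counter(ngrams(ref.lower().split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return overlap / total


reference = "the grant was awarded"
print(clipped_precision("the grant was not awarded", reference))
print(clipped_precision("funding approved", reference))
```

The negated prediction shares four of its five unigrams with the reference and scores high, while a paraphrase with no shared words scores zero, which illustrates why overlap alone does not capture semantic correctness.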

LLM-score.

To measure generation quality while accounting for the semantic similarity between the predictions and the ground truth, we use an LLM as a judge to evaluate and score the generated outputs. As illustrated in Figure 16, the LLM is tasked with assigning a score on a scale of 0-5 based on the correctness of the prediction. With a score of 4 representing less than 5% error between the ground truth and the prediction, the final accuracy is the number of samples with a score of 4 or more divided by the total number of samples. This yields a metric that evaluates answers semantically rather than by surface overlap.
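The aggregation from judge scores to accuracy can be sketched as follows. The judge call itself is omitted; the function name and the input list of integer scores are assumptions made for illustration.

```python
def llm_score_accuracy(scores, threshold=4):
    """Fraction of samples whose judge score (0-5) is at least `threshold`."""
    if not scores:
        return 0.0
    for score in scores:
        if not 0 <= score <= 5:
            raise ValueError(f"judge score out of range: {score}")
    return sum(1 for score in scores if score >= threshold) / len(scores)


# Hypothetical judge scores for four predictions.
print(llm_score_accuracy([5, 4, 3, 0]))
```

With the threshold at 4, a prediction scored 3 counts as incorrect regardless of how close it came, so the metric stays binary per sample like EM while tolerating surface variation.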

Figure 16: Prompt for using LLM-as-a-judge to output LLM-score.

B.3 Baselines

BlendSQL

glenn2024blendsql is a unified dialect that integrates SQL logic with large language model (LLM) reasoning across semi-structured data. It serves as a superset of SQLite, enabling complex hybrid question answering tasks involving multi-hop reasoning. The implementation utilizes the open-source repository blendsql (https://github.com/parkervg/blendsql), with dataset-specific in-context examples and default parameters.

Chain-of-Table

wang2023chain is a prompting framework that extends Chain-of-Thought by incorporating tabular data explicitly in the reasoning chain. It guides LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. The implementation follows the official GitHub repository chain-of-table (https://github.com/google-research/chain-of-table) with the in-context examples tailored to our dataset.

ProTrix

wu2024protrix introduces a Plan-then-Reason framework that plans the reasoning path using the query and context, then assigns each step to either textual or program-based reasoning to arrive at the final answer. We modify the in-context examples in their official repository (https://github.com/WilliamZR/ProTrix) to suit RUST-BENCH and use their default hyperparameters.

TabSQLify

nahid2024tabsqlify is a semantic parsing-based method that translates natural language questions into executable SQL queries over structured tables. It leverages text-to-SQL generation to decompose tables into smaller, relevant sub-tables containing only essential information for answering questions or verifying statements. We utilize tabsqlify (https://github.com/mahadi-nahid/TabSQLify) with updated in-context examples for inference.

TableMaster

cao2025tablemaster is a unified framework that combines multiple techniques for table reasoning. The method first retrieves relevant table content and enriches it with semantic verbalizations, then employs adaptive reasoning to flexibly choose between textual and symbolic reasoning depending on each query. We adopt the official repository TableMaster (https://github.com/zzlang-c/TableMaster), retaining their default hyperparameters for fair comparison.

NormTab

nahid2024normtab focuses on improving symbolic interpretability by normalizing table structures and values prior to reasoning. It standardizes heterogeneous column names and formats, reducing schema variance and enabling more consistent SQL-based reasoning across diverse tables. We utilize the public normtab (https://github.com/mahadi-nahid/NormTab) repository, following default parameters and adapting the prompts to our dataset.

Appendix C Reasoning Diversity in RUST-BENCH

Distribution of Question Types.

To understand the reasoning diversity in RUST-BENCH, we adopt and extend the taxonomy proposed in CRT-QA zhang2023crt, which builds on the BIG-bench framework srivastava2022beyond. As shown in Table 6, our annotation covers a broad spectrum of reasoning types—from high-frequency operations such as filtering and temporal reasoning to more complex forms including multi-hop, implicit, and counterfactual reasoning. This diversity underscores the layered cognitive demands required for real-world table understanding. Filtering and temporal reasoning are the most common types, reflecting the frequent need to locate relevant records and interpret time-dependent relationships. However, a significant proportion of questions also require multi-hop reasoning (26.18%), numerical computation (26.83%), and logical composition (27.85%), highlighting the dataset’s emphasis on compositional and quantitative reasoning. Although rarer, counterfactual, commonsense, and causal reasoning further test model generalization beyond surface-level retrieval.

Table 6: Distribution of reasoning types. Categories are non-exclusive; percentages may not sum to 100%.
| Reasoning Type | Percentage (%) |
| --- | --- |
| Filtering / Selection | 75.89 |
| Temporal Reasoning | 39.33 |
| Logical Reasoning | 27.85 |
| Numerical | 26.83 |
| Multi-hop Reasoning | 26.18 |
| Aggregation | 23.97 |
| Comparison | 17.43 |
| Implicit Reasoning | 11.36 |
| Unanswerable | 6.83 |
| Sorting / Ranking | 5.47 |
| Causal Reasoning | 5.20 |
| Commonsense Reasoning | 0.44 |
| Spatial Reasoning | 0.24 |
| Counterfactual / Negative | 0.19 |
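Because a question can carry several reasoning labels, the per-type percentages are computed over questions rather than over labels. The sketch below shows one way such a tally could work; the function name and the toy annotations are hypothetical and are not drawn from the dataset.

```python
from collections import Counter


def type_percentages(labels_per_question):
    """Percentage of questions carrying each label (labels are non-exclusive)."""
    total = len(labels_per_question)
    if total == 0:
        return {}
    counts = Counter(
        label for labels in labels_per_question for label in set(labels)
    )
    return {label: 100 * count / total for label, count in counts.items()}


# Toy annotations: each inner set holds the labels of one question.
toy = [
    {"Filtering", "Temporal"},
    {"Filtering"},
    {"Filtering", "Numerical"},
    {"Aggregation"},
]
percentages = type_percentages(toy)
print(percentages)
print(sum(percentages.values()))
```

In this toy case the percentages sum to 150, since two of the four questions carry two labels each.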

Unanswerable Questions.

In practical table reasoning, not all queries are grounded in the available data. Distinguishing answerable from unanswerable questions is therefore crucial for reliable model deployment in domains such as finance and science. To evaluate this capability, RUST-BENCH incorporates explicitly unanswerable questions following zhang2023crt—queries that cannot be resolved using the table content alone. Examples include those that require external knowledge or contain logical contradictions. A model is considered correct only if it abstains by responding with phrases such as “cannot answer” or “not enough information.” We manually verify outputs to measure accuracy. As shown in Figure 17, models struggle considerably with this task: even under Chain-of-Thought prompting, Gemini-2.0-Flash achieves only 52.27% accuracy in RB-Sports and 26.97% in RB-Science, indicating the persistent challenge of reliable unanswerable detection in table QA.
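An abstention check of the kind described above could be approximated with phrase matching. This is only a hypothetical pre-filter: the paper verifies outputs manually, the phrase list contains just the two examples quoted in the text, and the function names are assumptions.

```python
# The two abstention phrases quoted in the text; a real list would likely be longer.
ABSTAIN_PHRASES = ("cannot answer", "not enough information")


def is_abstention(response: str) -> bool:
    """Return True if the response contains one of the abstention phrases."""
    text = " ".join(response.lower().split())
    return any(phrase in text for phrase in ABSTAIN_PHRASES)


def unanswerable_accuracy(responses) -> float:
    """Fraction of responses to unanswerable questions that abstain."""
    if not responses:
        return 0.0
    return sum(is_abstention(r) for r in responses) / len(responses)


responses = [
    "I cannot answer this from the table.",
    "The answer is 42.",
]
print(unanswerable_accuracy(responses))
```

A heuristic like this misses paraphrased abstentions and can be fooled by responses that hedge and then answer anyway, which is one reason manual verification is the safer choice.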

Figure 17: Accuracy of GPT-4o-mini and Gemini-2.0-Flash models on RB-Science and RB-Sports datasets, evaluated on questions that include unanswerable/ambiguous cases.

Appendix D Qualitative Analysis

Semi-structured tables in RUST-BENCH pose a unique challenge for LLMs, as they require reasoning that spans both structured schema elements (e.g., categorical or numeric fields) and unstructured text (e.g., summaries or descriptions). Such inputs expose the limitations of models that excel in either symbolic precision or semantic understanding, but not both. As illustrated in Figure 18, answering ‘How many projects focus on children and how many children did the earliest project address?’ requires scanning abstracts for child-related projects, counting across rows, and applying temporal reasoning to identify the earliest award. Crucially, the abstract of the 2016 brain connectivity project mentions developmental trajectories without specifying participant numbers, so the correct response must acknowledge the absence of detail. Similarly, answering the question ‘In a March game at TD Garden, which player from the losing team had the highest points and what was the point difference between him and the leading scorer of the winning team?’ (Figure 19) requires filtering structured fields to locate the relevant March 2019 Celtics–Nuggets game, extracting top scorers from the unstructured summary, aligning them with their teams, and performing arithmetic to compute the score difference. This case exemplifies hybrid reasoning across structured and unstructured inputs, combined with entity disambiguation and grounded numeric comparison. These cases underscore how RUST-BENCH questions move beyond single-field lookup, requiring schema filtering, semantic interpretation, aggregation, and the handling of underspecified information, capabilities that remain fragile in current LLMs.

Appendix E Error Analysis

To analyze the sources of performance degradation, we manually examined 100 randomly sampled erroneous predictions from Gemini-2.0-Flash (CoT). Errors were grouped into four major categories reflecting distinct failure modes: (i) Interpretation Errors: counting or lookup mistakes caused by complex table structures and increased token load from unstructured fields; (ii) Logical Inconsistency Errors: contradictory or incomplete reasoning chains, particularly in multi-hop settings; (iii) Misalignment Errors: outputs that deviate from the expected answer schema or provide only partial results; and (iv) Extraction Errors: incorrect or missed retrievals from structured or unstructured regions of the table. The breakdown in Table 7 shows that no single type dominates; instead, errors stem from the interaction between structural complexity, multi-step reasoning, and representational inconsistencies introduced by semi-structured inputs.

Table 7: Breakdown of 100 randomly sampled erroneous predictions from Gemini-2.0-Flash (CoT).
| Error Type | Percentage |
| --- | --- |
| Interpretation Error | 22% |
| Logical Inconsistency Error | 31% |
| Misalignment Error | 27% |
| Extraction Error | 20% |

Extraction Error.

These involve failures to retrieve key information from structured fields or unstructured text. The model may skip valid rows or miss implicit cues, such as differences in project counts across years (Figure 20) or mentions of child-related studies buried in abstracts (Figure 21).

Logical Inconsistency Error.

These occur when the model generates an apparently coherent reasoning chain but produces a final answer inconsistent with its intermediate analysis. For example, as shown in Figure 22, the model may identify both Standard Grant and Continuing Grant as valid answers but report only one, revealing a collapse between reasoning and final output generation.

Interpretation Error.

Here, the model misreads the scope of the question or the table structure, overlooking relevant rows or applying filters incorrectly. As illustrated in Figure 23, it may compute time gaps based on a single record while ignoring other valid entries, leading to incomplete evidence gathering and erroneous conclusions.

Misalignment Error.

In some cases, the model’s reasoning is correct, but the output format deviates from the expected answer schema, for instance, returning a sum instead of individual attendance values (Figure 24). Collectively, these patterns show that while LLMs can perform multi-step reasoning, they often lose alignment between reasoning, evidence retrieval, and output generation, particularly when operating on semi-structured data that demands both symbolic precision and semantic understanding.

Figure 18: Example from the RB-Science subset. The question requires understanding data from unstructured fields, aggregation across rows, temporal reasoning to identify the earliest project, and recognition of underspecified information, highlighting challenges beyond surface retrieval.
Figure 19: Example from the RB-Sports subset. Answering the question requires filtering by structured fields (month, stadium), extracting top scorers from unstructured summaries, and performing arithmetic comparison, illustrating hybrid multi-hop reasoning across modalities.
Figure 20: Extraction Error. The LLM fails to extract the relevant information from the structured table. Instead of identifying the number of projects sanctioned in October 2022 and comparing it with October 2023, it wrongly concludes that no “previous October” exists.
Figure 21: Extraction Error. The LLM fails to extract relevant information from the unstructured portion of the table. While only one project explicitly mentions the term children in its title, two additional projects are related but require a deeper comprehension of the unstructured content to be correctly identified and extracted.
Figure 22: Logical Inconsistency Error. Owing to the large number of rows, the LLM engages in extensive reasoning and correctly identifies both Standard Grant and Continuing Grant. However, the final answer only lists Standard Grant, revealing a collapse between reasoning and output under heavy analysis.
Figure 23: Interpretation Error, where the LLM misinterprets both the question and the table. While tasked with finding the week gap between the earliest and latest amendment dates for five-investigator projects, it only considers a single row and ignores other valid rows. This leads to an incorrect calculation, showing how errors in interpreting table structure and question scope can cascade into a wrong final answer.
Figure 24: Misalignment Error. The reasoning correctly identifies the relevant rows and extracts the attendance figures. However, instead of listing these individual values as expected, the LLM sums them up. This misalignment between the required output format and the final answer leads to an incorrect response despite accurate intermediate reasoning.