Have your Queries Already Seen the Data? Data-Privilege in Tabular Benchmarks

Investigation into what information is leaked into queries in popular tabular benchmarks.

Note: This blog post is a deep-dive into a critical issue surfaced in our recent paper Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis accepted to the AI for Tabular Data Workshop at EurIPS 2025.

TL;DR: Many popular benchmarks for evaluating natural language interfaces to tabular data contain “data-privileged” queries—questions that reference specific database structures, internal codes, or data containers that real users wouldn’t know about in open-domain settings. Our analysis of 15 datasets reveals that up to 70% of queries in complex analysis benchmarks and 26-27% in widely-used text-to-SQL datasets contain such privileged information, fundamentally undermining evaluations by providing unrealistic shortcuts. To properly test open-domain capabilities, we need to either carefully adapt existing datasets to remove privileged information while maintaining realistic data scope specification, or shift to query-first construction methodologies that mirror how real users formulate information needs.

When Your Test Data Knows Too Much

Natural language interfaces to databases have come a long way. Systems that once let users query a single, known database with questions like “How many employees work in the sales department?” are evolving into something far more ambitious: open-domain systems that identify relevant tabular data from vast, unknown corpora before answering queries. This shift is happening across the board—from text-to-SQL generation1 2 to question answering3 4 5 to full-scale data analysis6 7.

But there’s a problem hiding in plain sight. As researchers start evaluating open-domain systems, they often simply adapt existing datasets from closed-domain settings (where users know exactly which database they’re querying) or devise new task-specific datasets. While users formulate queries over an unknown corpus of tabular data, going from query to data, these datasets are constructed the other way around: starting from one or more specific tables and deriving queries from them. If not explicitly addressed, such dataset construction methods inadvertently introduce an unrealistic advantage into the data: many queries in these benchmarks contain privileged information that real users would never have access to.

Consider this query from the HiTab dataset: “What was the number of the ashrs of polish refugees?”8 The term “ashrs” isn’t a word you’d find in a dictionary; it’s a column header copied directly from a specific table. Or this one from DA-Eval: “Check if the RHO_OLD column follows a normal distribution.”9 A real user asking questions about data they haven’t seen wouldn’t know that a column called “RHO_OLD” exists, let alone reference it by its exact database identifier.

These aren’t isolated examples. In our analysis of 15 popular datasets spanning question answering, text-to-SQL, and data analysis tasks, we found that some benchmarks have up to 70% of their queries containing such privileged information. This matters because when we evaluate systems on these queries, we’re not testing their ability to work in realistic open-domain settings; instead, we’re testing them on a fundamentally easier task where users magically know the underlying data structure, directly tying a query to specific data.

What Is Data-Privilege, and Why Should We Care?

At its core, a data-privileged query is one that betrays knowledge the user shouldn’t have. In a true open-domain setting, users approach the system with information needs grounded in their understanding of the world and the system they are interacting with 10, not in knowledge of how some specific dataset happens to be structured. When a user asks about “quarterly revenue trends for technology companies,” they’re expressing a natural information need. When they ask about “the avg_revenue_q1 column for rows where industry_code=‘TECH’,” they’re revealing that they’ve already seen the data.

This distinction matters profoundly for evaluation. Data-privileged queries provide an unrealistic signal that makes the task appear simpler than it actually is. They create a shortcut: instead of having to understand the user’s conceptual query and map it to whatever data structures might be relevant in a massive corpus, the system can often pattern-match directly to the referenced structural elements. This fundamentally undermines what we claim to be testing in open-domain scenarios.

We identify three distinct manifestations of data-privilege in queries:

1. Structural References - When Queries Speak Database

Structural references occur when queries use terminology that sounds more like database schema than natural language—phrases that feel “copied” rather than “composed.” The most obvious cases involve programming conventions: queries asking about “SalePrice” (camelCase) or “gdpPercap_1982” (underscores and suffixes) are clearly referencing specific field names. But the pattern can be subtler: asking for “Aaron Doran’s potential score” rather than his “skill level” or “rating” suggests familiarity with how a sports database labels its columns11.

Database-specific concepts provide another giveaway. Queries that ask for “the record with index 5” or “the primary key” use language from data management, not everyday information seeking. Such concepts point to the user interacting with the database directly rather than leaving that task to the system. Consider this query from TableBench: “What is the correlation between a country’s ‘carbon dioxide emissions per year (tons per person)’ and its ‘average emission per km² of land’?”12. Those precise metric definitions in quotes strongly suggest knowledge of exactly how these measurements are labeled in a specific dataset.
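
To make the pattern concrete, here is a rough heuristic sketch that flags identifier-style tokens and database vocabulary in a query. The patterns, vocabulary list, and function name are illustrative assumptions; our analysis relies on LLM-based classifiers (more on those below), not on regexes like these.

```python
import re

# Rough heuristic for flagging structural references: identifier-style tokens
# and database vocabulary. Illustrative only; the analysis in the paper uses
# LLM-based classifiers validated against expert annotations, not regexes.
IDENTIFIER_PATTERNS = [
    re.compile(r"\b\w+_\w+\b"),                       # snake_case / suffixes: "RHO_OLD", "gdpPercap_1982"
    re.compile(r"\b[a-z]+[A-Z]\w*\b"),                # camelCase: "avgRevenue"
    re.compile(r"\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b"),   # PascalCase: "SalePrice" (over-triggers on e.g. "McDonald")
]

# Data-management vocabulary that rarely appears in everyday information seeking.
SCHEMA_VOCAB = ("primary key", "foreign key", "column", "row index", "record with index")


def has_structural_reference(query: str) -> bool:
    """Return True if the query contains identifier-style tokens or schema vocabulary."""
    if any(pattern.search(query) for pattern in IDENTIFIER_PATTERNS):
        return True
    lowered = query.lower()
    return any(term in lowered for term in SCHEMA_VOCAB)


print(has_structural_reference("Check if the RHO_OLD column follows a normal distribution."))    # True
print(has_structural_reference("What were quarterly revenue trends for technology companies?"))  # False
```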

2. Value References - Knowing the Database Contents

Value references reveal knowledge of what specific data values exist in underlying tables. The clearest cases involve internal identifiers, i.e., codes or keys that exist purely within a dataset’s organizing logic. Consider this from MMQA: “Which clubs located in ‘AKW’ have members holding either ‘President’ or ‘Vice-President’ positions?”13 Unless “AKW” is publicly known, this three-letter code suggests the user has looked at the table. Similarly, queries about “authors who publish books in both ‘MM’ and ‘LT’ series”14 or “clients whose complaint type is ‘TT’”11 use cryptic codes that point to internal categorization schemes.

Not all value references indicate leakage of privileged information. Publicly knowable named entities like people, places, organizations, or dates don’t necessarily indicate data-privilege. Asking about “Janja Garnbret” or “the 2024 Olympics” uses world knowledge, not dataset-specific knowledge. The distinction lies in whether the specificity comes from general knowledge or from having seen the particular data.
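
A naive sketch of this check might flag short, quoted, all-caps tokens as candidate internal codes; the regex and function name below are assumptions for illustration. Telling such codes apart from publicly known entities requires world knowledge, which is exactly why we rely on LLM-based classifiers rather than rules like this.

```python
import re

# Naive sketch for candidate value references: short, cryptic, quoted codes
# such as 'AKW', 'MM', or 'TT'. Distinguishing internal codes from publicly
# known entities ("Janja Garnbret", "the 2024 Olympics") needs world knowledge
# that a regex cannot provide.
QUOTED_CODE = re.compile(r"""['"‘’“”]([A-Z]{2,4}\d{0,3})['"‘’“”]""")


def candidate_value_references(query: str) -> list[str]:
    """Return quoted all-caps tokens that look like internal codes."""
    return QUOTED_CODE.findall(query)


print(candidate_value_references("Which clubs located in 'AKW' have members holding 'President' positions?"))
# ['AKW'] -- 'President' is not flagged because it is not a short all-caps code
```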

3. Container References - Breaking the Fourth Wall

Container references explicitly acknowledge working with a data artifact, breaking the illusion of asking about the world. Phrases like “in the dataset,” “according to the table,” or “using the provided spreadsheet” directly reference the data container. Consider the query “Load the data into the SQLite database” from DA-Code15: this isn’t maintaining any fiction of open-domain interaction. Even subtle conceptual references like “using the provided dataset, find the top five most frequent qualifications” assume a bounded artifact that has been “provided” rather than discovered.
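
Container references are the easiest type to approximate with simple rules. A minimal sketch, with an assumed (and deliberately non-exhaustive) phrase list that is not the taxonomy from our annotation guidelines:

```python
# Simple sketch for container references: phrases that explicitly acknowledge
# a data artifact. The phrase list is an illustrative assumption.
CONTAINER_PHRASES = (
    "in the dataset", "in the table", "according to the table",
    "the provided spreadsheet", "the provided dataset", "the given table",
    "load the data", "in the database",
)


def has_container_reference(query: str) -> bool:
    """Return True if the query explicitly references a data container."""
    lowered = query.lower()
    return any(phrase in lowered for phrase in CONTAINER_PHRASES)


print(has_container_reference("Load the data into the SQLite database"))      # True
print(has_container_reference("What soccer clubs are there in Manchester?"))  # False
```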

To get a grasp on the prevalence of data-privileged queries, we systematically analyzed 15 datasets commonly used to evaluate natural language interfaces to tabular data, spanning single-table question answering, multi-table reasoning, text-to-SQL, and data analysis tasks (see Table 1). Using LLM-based classifiers validated against expert annotations, we labeled queries for structural references, value references, and container references. Figure 1 presents the distribution of data-privileged queries across these datasets.
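
For readers who want a feel for the setup, here is a minimal sketch of such a classifier call using the OpenAI Python client. The rubric wording and model name are placeholders, not the prompts or model used in the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder rubric: the actual prompts, label definitions, and model used in
# the paper differ. This only sketches the overall shape of the classification.
RUBRIC = (
    "Label the query with all that apply: STRUCTURAL (references schema elements "
    "such as column names or database concepts), VALUE (references internal codes "
    "or cell values a user could not know without seeing the data), CONTAINER "
    "(references the data artifact itself, e.g. 'in the dataset'), or NONE."
)


def classify_query(query: str, model: str = "gpt-4o-mini") -> str:
    """Ask the model which data-privilege labels apply to a single query."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Query: {query}"},
        ],
    )
    return response.choices[0].message.content


print(classify_query("Check if the RHO_OLD column follows a normal distribution."))
```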

| Dataset | Task | Open-Domain | #Queries |
| --- | --- | --- | --- |
| WikiTableQuestions16 | Single Table QA | | 14,151 |
| TabMWP17 | Single Table QA | | 38,901 |
| CRT-QA18 | Single Table QA | | 728 |
| HiTab8 | Single Table QA | | 10,672 |
| OpenWikiTables19 | Single Table QA | | 67,023 |
| OTT-QA5 | Single Table QA | | 4,372 |
| FeTaQA4 | Single Table QA | | 10,330 |
| TableBench12 | Single Table QA | | 886 |
| QTSumm20 | Multi Table QA | | 10,440 |
| MMQA13 | Multi Table QA | | 3,313 |
| Spider14 | Text-to-SQL | | 11,840 |
| BIRD11 | Text-to-SQL | | 10,962 |
| DA-Code15 | Tabular Analysis | | 500 |
| KramaBench21 | Tabular Analysis | | ~104 |
| DA-Eval9 | Tabular Analysis | | 257 |

Table 1: Overview of analyzed datasets and their characteristics.

The most striking pattern is the correlation between task complexity and data-privilege rates. Complex tabular analysis benchmarks show dramatically higher data-privilege: DA-Eval 9 at 70% (dominated by structural references at 63%) and DA-Code at 59% (primarily container references)15. In contrast, some simpler question-answering datasets achieve much better data-independence, with FeTaQA 4 at just 0.4%, OTT-QA 5 at 5.4%, and HiTab8 at 9.6%. This makes intuitive sense: specifying complex analytical operations without referencing specific data structures is genuinely harder. We cannot properly assess systems’ abilities to translate complex insight needs into appropriate analyses if our test queries already encode privileged knowledge about data organization.

Figure 1: Distribution of data-privileged queries across 15 tabular benchmarks, broken down by reference type.

A critical finding for the community concerns text-to-SQL datasets. Spider14 and BIRD 11, while originally designed for closed-domain scenarios, are increasingly being repurposed to evaluate open-domain text-to-SQL systems1 2. Yet our analysis reveals that 27% and 26% of their queries respectively contain data-privileged information14 11. While these rates are more moderate than those of the complex analysis benchmarks, they still represent hundreds of queries that provide unrealistic shortcuts in open-domain settings. Researchers adapting these datasets should be aware that a substantial fraction of queries assume knowledge users wouldn’t have when querying unknown databases. Even among datasets explicitly intended for open-domain evaluation, design choices matter: OpenWikiTables, despite its open-domain goals, shows 32% data-privilege19, likely because its language-model-based decontextualization methodology insufficiently accounts for the leakage of privileged information.

Beyond Data-Privilege - Aligning Queries with Users’ Mental Model

Data-privileged queries are just one manifestation of the broader challenge of aligning queries with users’ mental models when interacting with open-domain systems10. While queries should be data-independent, they only make sense if they specify a data scope, that is, if they provide information on what sort of data the query is targeting22. In a closed-domain setting, the data context itself provides contextual scaffolding that grounds the data boundaries of a query.

Consider this query from Spider: “What clubs are there?”14 No structural references, no value references, no container references—yet it’s nonsensical in an open-domain setting. What clubs? Where? A real user would specify the data scope, asking for instance “What soccer clubs are there in Manchester?”. The query only works because Spider provides a closed database in which “clubs” has an unambiguous meaning.

Closed-domain settings provide enormous implicit information. The database context itself bounds the scope and resolves ambiguities. Open-domain systems eliminate this scaffolding. Users must either explicitly specify data scope or rely on systems to make reasonable inferences, a fundamentally different kind of completeness than closed-domain queries require. For a detailed framework around this division of labor between users and systems in grounding queries, see our full paper.

Looking Forward: Toward Realistic Open-Domain Evaluation

The prevalence of data-privilege across popular benchmarks demands that we adapt existing datasets and fundamentally rethink how we construct new ones if we want to evaluate open-domain systems.

If we want to use existing datasets, we should adapt them thoroughly to the open-domain setting. Problematic queries should be adapted by removing privileged information while ensuring they retain sufficient data scope specification to enable realistic retrieval. This requires careful manual review to maintain the query’s intent while aligning it with users’ natural mental models. For datasets where such adaptation isn’t feasible, researchers should at minimum document data-privilege rates and consider whether the benchmark appropriately tests their evaluation objectives.
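
As a sketch of what such documentation could look like, the snippet below aggregates per-query labels into data-privilege rates and writes out a data-independent subset. The file name and column names are assumptions for illustration, not a standard format.

```python
import pandas as pd

# Hypothetical per-query annotation file with one boolean column per reference
# type. The file name and column names are assumptions for illustration.
queries = pd.read_csv("benchmark_queries_annotated.csv")

label_columns = ["structural_ref", "value_ref", "container_ref"]
queries["data_privileged"] = queries[label_columns].any(axis=1)

# Document the overall data-privilege rate and the breakdown by reference type.
print(f"Data-privileged queries: {queries['data_privileged'].mean():.1%}")
print(queries[label_columns].mean().rename("rate"))

# Keep only data-independent queries as an open-domain evaluation split.
open_domain_split = queries[~queries["data_privileged"]]
open_domain_split.to_csv("benchmark_queries_open_domain.csv", index=False)
```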

An interesting direction for constructing new datasets is to shift the paradigm from data-first to query-first methodologies. Current practice—starting with tables and deriving queries from them—has a high likelihood of leaking structural knowledge into the queries themselves. Instead, we should investigate collection approaches that mirror realistic user workflows: gathering information needs from domain experts or users who haven’t seen the underlying data, then identifying relevant tables to satisfy those needs. This inverts the construction process to match how open-domain systems actually operate. Such methodologies are more resource-intensive, but essential for benchmarks that are aligned with realistic use and truly test open-domain capabilities.

Most critically, the research community must develop awareness of what our benchmarks actually measure. When adapting closed-domain datasets for open-domain evaluation, we cannot simply assume the queries transfer appropriately. When claiming to evaluate table retrieval or open-domain analysis capabilities, we must verify that our test queries don’t provide unrealistic shortcuts that bypass the very capabilities we’re trying to assess. The queries we use fundamentally shape what capabilities systems develop. Aligning systems with many of our current benchmarks may be teaching them to exploit privileged information rather than to genuinely understand user information needs and discover relevant data.


If you found this blog post interesting, have a look at our full paper: Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis

Cite as:

@inproceedings{gommAreWeAsking2025,
  title = {Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis},
  shorttitle = {Are We Asking the Right Questions?},
  booktitle = {AI for Tabular Data Workshop at EurIPS 2025},
  author = {Gomm, Daniel and Wolff, Cornelius and Hulsebos, Madelon},
  year = 2025,
  url = {https://arxiv.org/abs/2511.04584}
}


References

  1. X. Zhang, D. Wang, L. Dou, Q. Zhu, and W. Che, “MURRE: Multi-Hop Table Retrieval with Removal for Open-Domain Text-to-SQL,” in Proceedings of the 31st International Conference on Computational Linguistics, O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, and S. Schockaert, Eds., Abu Dhabi, UAE: Association for Computational Linguistics, Jan. 2025, pp. 5789–5806.

  2. M. Kothyari, D. Dhingra, S. Sarawagi, and S. Chakrabarti, “CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds., Singapore: Association for Computational Linguistics, Dec. 2023, pp. 14054–14066. doi: 10.18653/v1/2023.emnlp-main.868

  3. J. Herzig, T. Müller, S. Krichene, and J. Eisenschlos, “Open Domain Question Answering over Tables via Dense Retrieval,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, K. Toutanova et al., Eds., Online: Association for Computational Linguistics, Jun. 2021, pp. 512–519. doi: 10.18653/v1/2021.naacl-main.43

  4. L. Nan et al., “FeTaQA: Free-form Table Question Answering,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 35–49, 2022, doi: 10.1162/tacl_a_00446

  5. W. Chen, M.-W. Chang, E. Schlinger, W. Y. Wang, and W. W. Cohen, “Open Question Answering over Tables and Text,” presented at the International Conference on Learning Representations, Oct. 2020.

  6. J. Wang and G. Li, “AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries,” presented at the Conference on Innovative Data Systems Research, 2025.

  7. K. Kong et al., “OpenTab: Advancing Large Language Models as Open-domain Table Reasoners,” presented at The Twelfth International Conference on Learning Representations, Oct. 2023.

  8. Z. Cheng et al., “HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds., Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 1094–1110. doi: 10.18653/v1/2022.acl-long.78

  9. X. Hu et al., “InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks,” in Proceedings of the 41st International Conference on Machine Learning, PMLR, Jul. 2024, pp. 19544–19572.

  10. D. A. Norman, “Some Observations on Mental Models,” in Mental Models, Psychology Press, 1983.

  11. J. Li et al., “Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs,” in Advances in Neural Information Processing Systems, 2024.

  12. X. Wu et al., “TableBench: A Comprehensive and Complex Benchmark for Table Question Answering,” Mar. 18, 2025, arXiv: arXiv:2408.09174. doi: 10.48550/arXiv.2408.09174

  13. J. Wu, L. Yang, D. Li, Y. Ji, M. Okumura, and Y. Zhang, “MMQA: Evaluating LLMs with Multi-Table Multi-Hop Complex Questions,” presented at The Thirteenth International Conference on Learning Representations, Oct. 2024.

  14. T. Yu et al., “Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task,” Feb. 02, 2019, arXiv: arXiv:1809.08887. doi: 10.48550/arXiv.1809.08887

  15. Y. Huang et al., “DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Oct. 2024. doi: 10.18653/v1/2024.emnlp-main.748

  16. P. Pasupat and P. Liang, “Compositional Semantic Parsing on Semi-Structured Tables,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong and M. Strube, Eds., Beijing, China: Association for Computational Linguistics, Jul. 2015, pp. 1470–1480. doi: 10.3115/v1/P15-1142

  17. P. Lu et al., “Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning,” presented at The Eleventh International Conference on Learning Representations, Sep. 2022.

  18. Z. Zhang, X. Li, Y. Gao, and J.-G. Lou, “CRT-QA: A Dataset of Complex Reasoning Question Answering over Tabular Data,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds., Singapore: Association for Computational Linguistics, Dec. 2023, pp. 2131–2153. doi: 10.18653/v1/2023.emnlp-main.132

  19. S. Kweon, Y. Kwon, S. Cho, Y. Jo, and E. Choi, “Open-WikiTable : Dataset for Open Domain Question Answering with Complex Reasoning over Table,” in Findings of the Association for Computational Linguistics: ACL 2023, A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds., Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 8285–8297. doi: 10.18653/v1/2023.findings-acl.526

  20. Y. Zhao, Y. Li, C. Li, and R. Zhang, “MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds., Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 6588–6600. doi: 10.18653/v1/2022.acl-long.454

  21. E. Lai et al., “KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes,” Jun. 06, 2025, arXiv: arXiv:2506.06541. doi: 10.48550/arXiv.2506.06541

  22. D. Gomm and M. Hulsebos, “Metadata Matters in Dense Table Retrieval,” in ELLIS workshop on Representation Learning and Generative Models for Structured Data, Feb. 2025.