publications
publications by category in reverse chronological order.
2025
- SQaLe: A large text-to-SQL corpus grounded in real schemas. Cornelius Wolff, Daniel Gomm, and Madelon Hulsebos. In AI for Tabular Data Workshop at EurIPS 2025, 2025.
Advances in large language models have accelerated progress in text-to-SQL: methods that convert natural language questions into valid SQL queries. A key bottleneck for developing generalizable text-to-SQL models is the lack of large-scale datasets with sufficient schema and query complexity, domain coverage, and task diversity. We introduce SQaLe, a large-scale semi-synthetic text-to-SQL dataset built on 135,875 relational database schemas expanded from a collection of real-world schemas, SchemaPile. We establish a principled generation pipeline which combines schema sampling, question synthesis, and SQL construction, and produce 517,676 high-quality (question, schema, query) triples. The SQaLe dataset captures realistic schema size variability, diverse query patterns, and natural language ambiguity while maintaining execution validity. We provide an analysis of its contents and characteristics, and find that SQaLe introduces the most realistic large-scale text-to-SQL dataset to date in comparison with existing benchmarks and datasets. We discuss how SQaLe enables our vision for data scaling and model generalization in text-to-SQL research. The dataset is accessible at: https://huggingface.co/datasets/trl-lab/SQaLe-text-to-SQL-dataset.
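The abstract's execution-validity requirement can be illustrated with a minimal sketch. This is not the paper's actual pipeline; the helper name `is_executable` and the toy triples are hypothetical, and it assumes schemas are expressed as SQLite DDL:

```python
import sqlite3

def is_executable(schema_ddl: str, query: str) -> bool:
    """Build the schema in an in-memory SQLite database and try to
    execute the query; any error (syntax, missing table or column)
    marks the (question, schema, query) triple as invalid."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute(query)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

# Toy triples; the second query references a column that does not
# exist, so it fails the execution check and would be filtered out.
triples = [
    ("Which users signed up?",
     "CREATE TABLE users (id INT, name TEXT);",
     "SELECT name FROM users;"),
    ("List user emails",
     "CREATE TABLE users (id INT, name TEXT);",
     "SELECT email FROM users;"),
]
valid = [t for t in triples if is_executable(t[1], t[2])]
```

A filter like this only guarantees that a query runs, not that it answers the question; the paper's pipeline additionally controls question synthesis and SQL construction.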
@inproceedings{wolffSqale2025,
  title = {SQaLe: A large text-to-SQL corpus grounded in real schemas},
  booktitle = {AI for Tabular Data Workshop at EurIPS 2025},
  author = {Wolff, Cornelius and Gomm, Daniel and Hulsebos, Madelon},
  year = {2025},
  urldate = {2025-11-24},
  langid = {english}
}
- Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis. Daniel Gomm, Cornelius Wolff, and Madelon Hulsebos. In AI for Tabular Data Workshop at EurIPS 2025, 2025.
Natural language interfaces to tabular data must handle inherent query ambiguity. Instead of treating ambiguity as a deficiency, we reframe it as a feature of *cooperative interaction*, where the responsibility of query specification is shared between the user and the system. We develop a principled framework distinguishing cooperative queries, i.e., queries that yield a resolvable interpretation, from uncooperative queries that cannot be resolved. Applying the framework to evaluations for tabular question answering and analysis, we analyze the queries in 15 popular datasets, and observe an uncontrolled mixing of query types that conflates the evaluation of a system’s execution accuracy with its interpretation capabilities. Our framework and analysis of queries shift the perspective from fixing query ambiguity to embracing *cooperative grounding*. This reflection deepens our understanding of how we can elevate natural language interfaces for tabular data through more informed design and evaluation, for which we outline implications and directions for future research.
@inproceedings{gommAreWeAsking2025,
  title = {Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis},
  booktitle = {AI for Tabular Data Workshop at EurIPS 2025},
  author = {Gomm, Daniel and Wolff, Cornelius and Hulsebos, Madelon},
  year = {2025},
  urldate = {2025-10-30},
  langid = {english}
}
- Unlocking the Full Potential of Data Science Requires Tabular Foundation Models, Agents, and Humans. Tianji Cong, Julian Martin Eisenschlos, Daniel Gomm, and 21 more authors. 2025.
Despite its vast potential, data science remains constrained by manual workflows and fragmented tools. Meanwhile, foundation models have transformed natural language and computer vision — and are beginning to bring similar breakthroughs to structured data, particularly the ubiquitous tabular data central to data science. At the same time, there are strong claims that fully autonomous agentic data science systems will emerge. We argue that, rather than replacing data scientists, the future of data science lies in a new paradigm that amplifies their impact: collaborative systems that tightly integrate agents and tabular foundation models (TFMs) with human experts. In this paper, we discuss the potential and challenges of navigating the interplay between these three and present a research agenda to guide this disruption toward a more accessible, robust, and human-centered data science.
@article{congUnlockingFullPotential2025,
  title = {Unlocking the Full Potential of Data Science Requires Tabular Foundation Models, Agents, and Humans},
  author = {Cong, Tianji and Eisenschlos, Julian Martin and Gomm, Daniel and Grinsztajn, Leo and Mueller, Andreas C. and Sanghi, Anupam and Bodensohn, Jan-Micha and Borisov, Vadim and Cochez, Michael and Eggensperger, Katharina and Geerts, Floris and Kim, Myung Jun and Kipf, Andreas and Li, Xue and Ovcharenko, Olga and Papotti, Paolo and Purucker, Lennart and Schelter, Sebastian and Trummer, Immanuel and Varoquaux, Ga{\"e}l and Vogel, Liane and Binnig, Carsten and Hulsebos, Madelon and Hutter, Frank},
  year = {2025},
}
- CoDy: Counterfactual Explainers for Dynamic Graphs. Zhan Qu, Daniel Gomm, and Michael Faerber. In Forty-second International Conference on Machine Learning (ICML), 2025.
Temporal Graph Neural Networks (TGNNs) are widely used to model dynamic systems where relationships and features evolve over time. Although TGNNs demonstrate strong predictive capabilities in these domains, their complex architectures pose significant challenges for explainability. Counterfactual explanation methods provide a promising solution by illustrating how modifications to input graphs can influence model predictions. To address this challenge, we present CoDy—Counterfactual Explainer for Dynamic Graphs—a model-agnostic, instance-level explanation approach that identifies counterfactual subgraphs to interpret TGNN predictions. CoDy employs a search algorithm that combines Monte Carlo Tree Search with heuristic selection policies, efficiently exploring a vast search space of potential explanatory subgraphs by leveraging spatial, temporal, and local event impact information. Extensive experiments against state-of-the-art factual and counterfactual baselines demonstrate CoDy’s effectiveness, with improvements of 16% in AUFSC+ over the strongest baseline. Our code is available at: https://github.com/daniel-gomm/CoDy
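As an illustration of the counterfactual objective only (CoDy's actual search uses Monte Carlo Tree Search with spatial, temporal, and local-impact selection policies, which this sketch does not implement), a hypothetical greedy baseline removes high-impact events until the model's prediction flips:

```python
def greedy_counterfactual(events, predict, target):
    """Cumulatively remove past events, highest heuristic impact
    first, until the model's prediction for `target` flips.
    Returns the removed events (a counterfactual set), or None
    if no flip is found."""
    original = predict(events, target)
    kept = list(events)
    removed = []
    for e in sorted(events, key=lambda ev: -ev["impact"]):
        kept.remove(e)
        removed.append(e)
        if predict(kept, target) != original:
            return removed
    return None

# Toy stand-in for a TGNN: predicts positive iff the total impact
# of the remaining events exceeds a threshold.
events = [{"id": i, "impact": w} for i, w in enumerate([0.5, 0.3, 0.2])]
predict = lambda evts, target: sum(e["impact"] for e in evts) > 0.6
explanation = greedy_counterfactual(events, predict, target=None)
```

Greedy removal can overshoot (the returned set need not be minimal); the tree search in the paper explores the space of candidate subgraphs far more systematically.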
@inproceedings{qu2025cody,
  title = {CoDy: Counterfactual Explainers for Dynamic Graphs},
  author = {Qu, Zhan and Gomm, Daniel and Faerber, Michael},
  booktitle = {Forty-second International Conference on Machine Learning (ICML)},
  year = {2025},
  url = {https://openreview.net/forum?id=FE9QN8d536},
}
- Metadata Matters in Dense Table Retrieval. Daniel Gomm and Madelon Hulsebos. In ELLIS workshop on Representation Learning and Generative Models for Structured Data, 2025.
Recent advances in Large Language Models have enabled powerful systems that perform tasks by reasoning over tabular data. While these systems typically assume relevant data is provided with a query, real-world use cases are mostly open-domain, meaning they receive a query without context regarding the underlying tables. Retrieving relevant tables is typically done over dense embeddings of serialized tables. Yet, there is a limited understanding of the effectiveness of different inputs and serialization methods for using such off-the-shelf text-embedding models for table retrieval. In this work, we show that different serialization strategies result in significant variations in retrieval performance. Additionally, we surface shortcomings in commonly used benchmarks applied in open-domain settings, motivating further study and refinement.
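To make the notion of a serialization strategy concrete, here is a hypothetical sketch of two variants (function names and text formats are illustrative, not the paper's): one that embeds only table metadata, and one that also linearizes a few sample rows before the string is passed to an off-the-shelf text-embedding model:

```python
def serialize_schema_only(name, columns):
    """Metadata-only serialization: table name and column headers."""
    return f"Table: {name}. Columns: {', '.join(columns)}."

def serialize_with_rows(name, columns, rows, max_rows=2):
    """Row linearization: additionally flatten sample rows into
    'column is value' text, one segment per row."""
    header = serialize_schema_only(name, columns)
    body = " | ".join(
        "; ".join(f"{c} is {v}" for c, v in zip(columns, row))
        for row in rows[:max_rows]
    )
    return f"{header} Rows: {body}"

cols = ["city", "population"]
rows = [["Delft", 108000], ["Utrecht", 368000]]
short = serialize_schema_only("dutch_cities", cols)
long = serialize_with_rows("dutch_cities", cols, rows)
```

The two strategies produce quite different embedding inputs for the same table, which is exactly the kind of variation the paper reports as driving significant differences in retrieval performance.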
@inproceedings{gomm2025metadata,
  title = {Metadata Matters in Dense Table Retrieval},
  author = {Gomm, Daniel and Hulsebos, Madelon},
  booktitle = {ELLIS workshop on Representation Learning and Generative Models for Structured Data},
  year = {2025},
  url = {https://openreview.net/forum?id=rELWIvq2Qy},
}