Abstract
Natural Language to SQL (NL2SQL) has become a cornerstone task for enabling natural language interfaces to relational databases. With the emergence of large language models, NL2SQL systems have achieved remarkable performance gains. However, despite the focus on architectural innovations and benchmark achievements, we argue that NL2SQL is fundamentally a data-centric task — where the quality, structure, and utilization of data play a more critical role than often acknowledged. In this survey, we re-examine the NL2SQL landscape through the lens of how data are used throughout the system pipeline. Specifically, we offer a brief overview of the task challenges and evolutionary process of NL2SQL. Next, we categorize the major data types and analyze how these data sources are leveraged throughout the NL2SQL lifecycle. We then introduce the datasets and metrics used to evaluate NL2SQL systems. Finally, we highlight the remaining challenges and outline promising directions for future research. We hope that this survey can serve as a quick reference to existing work and motivate future research from a data perspective.
1 Introduction
In the era of big data, a substantial portion of information is stored in relational databases, which form the backbone of data management systems across many organizations. As data volumes continue to grow, the ability to efficiently query and utilize this data has become a critical factor in maintaining a competitive edge across various industries. However, querying relational databases requires proficiency in SQL, a specialized skill that poses a barrier for non-technical users seeking to access and interact with the data.
Fig. 1
Example of NL2SQL system workflow. Given the database schema and the user’s utterance, the system outputs a corresponding SQL query to retrieve the result from the database
Natural language to SQL (i.e., NL2SQL, also known as Text-to-SQL), which converts a natural language (NL) query into a corresponding SQL query, can significantly lower the barrier for lay users (and even expert users) in accessing massive datasets and deriving insights [1,2,3,4,5,6,7,8,9,10,11,12,13]. To illustrate, Fig. 1 presents the general workflow: an end-user wants to query the flight numbers of a specific aircraft from a database containing two tables, named Flight and Aircraft. The user simply inputs the NL query “Show all flight numbers with aircraft Airbus A340-330”, and the NL2SQL model automatically generates the corresponding SQL query for the user:
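To make the workflow concrete, the following is a minimal, self-contained sketch of this example; the column names (flno, aid, name) and the toy rows are illustrative assumptions, not the actual database behind Fig. 1.

```python
# Minimal sketch of the Fig. 1 workflow; schema columns and rows are assumed.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Aircraft (aid INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE Flight (flno INTEGER PRIMARY KEY, "
            "aid INTEGER REFERENCES Aircraft(aid))")
cur.executemany("INSERT INTO Aircraft VALUES (?, ?)",
                [(1, "Airbus A340-330"), (2, "Boeing 747-400")])
cur.executemany("INSERT INTO Flight VALUES (?, ?)", [(7, 1), (13, 1), (21, 2)])

# A SQL query an NL2SQL model might generate for the NL question
# "Show all flight numbers with aircraft Airbus A340-330"
predicted_sql = """
SELECT F.flno
FROM Flight AS F
JOIN Aircraft AS A ON F.aid = A.aid
WHERE A.name = 'Airbus A340-330'
"""
print([row[0] for row in cur.execute(predicted_sql)])  # prints [7, 13]
```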
After that, the SQL query can be executed against the database to get the results, namely flight numbers 7 and 13 in this example.
Researchers have made substantial progress in this area. Initially, template-based and rule-based methods [14,15,16] were employed. These approaches involved creating SQL templates for various scenarios. While template-based methods showed promise, they required significant manual effort. With the rapid advancement of deep learning, sequence-to-sequence (Seq2Seq) methods [17,18,19,20,21,22] have emerged as the mainstream approach. Seq2Seq models provide an end-to-end solution that directly maps NL input to SQL output using an encoder-decoder architecture. Among Seq2Seq methods, pre-trained language models (PLMs), which serve as predecessors to large language models (LLMs), show promise in NL2SQL tasks. As model sizes and training data continue to grow, PLMs naturally evolve into LLMs, exhibiting even greater power. The remarkable capabilities of LLMs have prompted research into their application for NL2SQL tasks [5, 6, 8, 13, 23,24,25,26,27].
However, amid this prevailing model-centric momentum, a critical factor has received comparatively less attention: data. Unlike many natural language processing (NLP) tasks, where the input and output spaces are relatively fixed, NL2SQL operates at the intersection of unstructured NL with structured database schemas. This intersection makes the quality, diversity, and contextualization of data not just important but central to the success of such systems. In fact, the performance and generalizability of NL2SQL models are often constrained more by data limitations than by model architecture itself.
Fig. 2
Evolution of NL2SQL
This survey aims to shift the perspective from model to data, rethinking NL2SQL through a data-centric lens. We argue that current bottlenecks in generalization, robustness, and real-world applicability often stem less from modeling limitations than from issues in how data are constructed, represented, and used. To this end, we systematically review recent NL2SQL techniques and present a comprehensive survey of how data is involved in the following aspects:
- 1.
Overview. We provide a brief introduction to the technical challenges and the evolutionary trajectory of NL2SQL, along with a review of existing solutions, which we categorize into the four major groups presented in Fig. 2.
- 2.
Data with NL2SQL. We begin by categorizing the primary data types involved, then analyze how these data sources are leveraged in existing NL2SQL solutions.
- 3.
Benchmarks. We describe the commonly used datasets and benchmarks for evaluating NL2SQL systems. We discuss their characteristics, complexity, and the challenges they pose for development and evaluation, along with the evaluation metrics.
- 4.
Analysis. We conduct a comparative analysis of experimental results from existing studies to further investigate the practical value and applicability of methods and models.
- 5.
Summarization. We summarize the limitations of existing NL2SQL solutions, with a focus on data utilization: what we have and what we expect. We outline future directions and opportunities for improvement.
To summarize our technical contributions: we present the first NL2SQL survey in the community from a data perspective, establishing a taxonomy of data types and usage patterns and highlighting their influence on model design. We then analyze the task challenges and provide a detailed and comprehensive summary of existing NL2SQL solutions and benchmarks. Finally, we highlight the remaining challenges in robustness and real-world deployment and offer actionable directions for future research.
2 Overview
In this section, we first formalize the definition of the NL2SQL task (Sect. 2.1). We then discuss the key challenges in this task (Sect. 2.2). Finally, we describe the evolution of NL2SQL systems and review existing systems accordingly (Sect. 2.3).
2.1 Task formulation
NL2SQL Translation. Formally, given a user question Q expressed in NL and a database D defined by a schema S and populated with instances I, the NL2SQL task aims to translate Q into a corresponding SQL query L using a translation model M:
$$\begin{aligned} L = M(Q, D) \end{aligned}$$
(1)
The resulting query L should be executable on D to produce a result R that answers the original question Q.
2.2 Task challenges
We examine the primary technical challenges involved in developing robust NL2SQL solutions for real-world applications:
C1: Ambiguous and Under-specified NL Query. NL queries often exhibit uncertainty due to ambiguity and under-specification [28]. In the context of the NL2SQL task, these problems can be categorized as follows:
Lexical Ambiguity: Arises when a word has multiple meanings. For example, “apple” could refer to a fruit or a technology company.
Syntactic Ambiguity: Occurs when a sentence permits multiple grammatical interpretations. For instance, in “I shot an elephant in my pajamas,” it is unclear whether the speaker was wearing pajamas or the elephant was.
Under-specification: Happens when expressions lack sufficient detail to determine a specific meaning, especially in everyday conversation. For example, if someone says “Let’s meet at the station,” it’s unclear which station they mean without additional context.
C2: Database Complexity. The NL2SQL task demands a deep understanding of the database schema, including table names, columns, relationships, and data values. The complexity of modern schemas and the scale of data further compound the challenge.
Table Joins: Queries often require retrieving information spread across multiple tables. This necessitates understanding how tables are related—typically through foreign keys—and determining the correct join conditions to produce meaningful results.
Ambiguity in Attributes and Values: Ambiguous values and attributes in a database can make it difficult for systems to accurately determine the intended context.
Large Data Volumes: Efficiently handling vast data volumes in large databases is critical, as processing all data as input is impractical [29]. Additionally, dirty data, such as missing values, duplicates, or inconsistencies, can lead to erroneous query results (e.g., affecting WHERE clauses) if not properly managed.
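One common mitigation, sketched below under the assumption of a SQLite connection, is to expose only a handful of sampled distinct values per column to the model instead of full tables; the function name and limits are illustrative.

```python
# Illustrative sketch: sample a few distinct, non-null values per column so the
# model sees representative content without ingesting the entire table.
import sqlite3

def sample_column_values(conn: sqlite3.Connection, table: str, column: str, k: int = 3):
    rows = conn.execute(
        f'SELECT DISTINCT "{column}" FROM "{table}" '
        f'WHERE "{column}" IS NOT NULL LIMIT ?', (k,)
    ).fetchall()
    return [value for (value,) in rows]
```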
C3: NL2SQL Translation. The NL2SQL task differs fundamentally from compiling a high-level programming language to a low-level machine language, as there is usually a one-to-many mapping between the input (an NL query over a database) and the output SQL. Specifically, the NL2SQL task faces several unique challenges:
Unstructured NL vs. Structured SQL: NL is inherently unstructured and often ambiguous, while SQL is a formal language with strict syntactic rules. Bridging this gap requires accurate interpretation of user intent and precise generation of executable queries.
One-to-Many Mapping: A single NL query can correspond to multiple SQL queries that fulfill the query intent, leading to ambiguity in determining the appropriate SQL translation (see the sketch after this list).
NL2SQL Trustworthiness and Explainability: Beyond accuracy, the system must produce queries that users can trust and understand. Explainability is crucial, as it allows users to verify how the NL input was interpreted and how the SQL query was constructed [9]. Transparent and interpretable translations help build confidence in the system’s outputs and facilitate error detection and correction.
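To illustrate the one-to-many mapping, the two queries below, reusing the illustrative Fig. 1 schema with assumed column names, answer the same NL question with different but semantically equivalent SQL.

```python
# Two semantically equivalent SQL translations of the same NL question,
# "Show all flight numbers with aircraft Airbus A340-330" (column names assumed).
candidate_join = """
SELECT F.flno
FROM Flight AS F JOIN Aircraft AS A ON F.aid = A.aid
WHERE A.name = 'Airbus A340-330'
"""
candidate_subquery = """
SELECT flno
FROM Flight
WHERE aid IN (SELECT aid FROM Aircraft WHERE name = 'Airbus A340-330')
"""
# Both return the same result set, so neither is uniquely "correct", which
# complicates both training supervision and evaluation.
```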
2.3 Evolutionary process
Over the years, the NL2SQL research field has made substantial progress in both the database and the NLP communities. It has transitioned from early rule-based solutions to deep learning techniques and, more recently, to the adoption of pretrained and large language models. Figure 2 provides a visual summary of this evolution:
- 1.
Rule-based Methods: Early solutions relied heavily on rule-based methods [14,15,16, 30, 31], where manually crafted rules and heuristics were used to map NL questions to SQL queries. Rule-based systems excel at generating syntactically correct queries by relying on predefined templates and strict grammatical rules, ensuring adherence to SQL syntax and structure. Their deterministic nature guarantees consistent outputs for specific input patterns, minimizing variability in query generation. However, these methods struggle to interpret linguistically complex or ambiguous natural language questions, such as those involving nested clauses, coreference, or ellipsis, and often fail to map these elements to database schema components, such as tables or columns. Their reliance on manual rules limits generalization to unseen question patterns or schema structures, requiring frequent updates for new domains. Additionally, encoding complex schemas with multi-table relationships into usable rules is labor-intensive and error-prone.
- 2.
Machine Learning Methods: With the rise of deep neural networks, sequence-to-sequence (Seq2Seq) models and encoder-decoder architectures were adopted to generate SQL queries from natural language input [17, 98]. SQLNet [18], for example, employs a sketch-based approach and frames the NL2SQL task as a slot-filling problem. TypeSQL [32] builds on SQLNet by incorporating type information extracted from either knowledge graphs or table content, enabling a better understanding of entities and numerical values. Bogin et al. [33] propose using graph neural networks (GNNs) to encode the database schema, thereby enriching the representation. IRNet [20] introduces an intermediate representation that captures higher-level abstractions than raw SQL and leverages custom type vectors to enhance the modeling of both natural language queries and database schemas. Despite their adaptability, these approaches are all based on the Seq2Seq generation paradigm, which struggles to effectively handle complex queries.
- 3.
PLM-based Methods: PLMs have emerged as a powerful solution for NL2SQL, leveraging the vast amounts of linguistic knowledge and semantic understanding captured during the pre-training process. The early adoption of PLMs in NL2SQL primarily focused on fine-tuning off-the-shelf PLMs, such as BERT [34] and RoBERTa, on standard NL2SQL datasets. These PLMs, pre-trained on large text corpora, captured rich semantic representations and language understanding capabilities. By fine-tuning them on NL2SQL tasks, researchers aimed to leverage the semantic and linguistic understanding of PLMs to generate accurate SQL queries [19, 21, 35, 36]. Another line of research focuses on incorporating schema information into PLMs to improve their understanding of database structures and enable them to generate more executable SQL queries. Schema-aware PLMs are designed to capture the relationships and constraints present in the database structure [19]. While PLMs generate more accurate SQL than earlier deep learning-based methods, they still struggle with complex operations such as outer joins and aggregations, often producing syntactically flawed queries. Moreover, cross-domain performance drops significantly when the schema or vocabulary differs from the training data, necessitating resource-intensive fine-tuning.
- 4.
LLM-based Methods: LLMs, such as the GPT series [37, 38], have gained significant attention in recent years due to their ability to generate coherent and fluent text. Researchers have begun exploring the potential of LLMs for NL2SQL by leveraging their extensive knowledge reserves and superior generation capabilities [6, 11,12,13, 25, 26, 39, 40]. These approaches often involve prompt engineering to guide proprietary LLMs in SQL generation [26, 39] or fine-tuning open-source LLMs on NL2SQL datasets [13]. The integration of LLMs in NL2SQL is still an emerging research area with significant potential for further exploration and improvement. Researchers are investigating ways to better leverage LLMs’ knowledge and reasoning capabilities, incorporate domain-specific knowledge [8], and develop more efficient fine-tuning strategies [40]. As the field continues to evolve, we anticipate the development of more advanced and superior LLM-based implementations that will elevate the performance and generalization of NL2SQL to new heights.
3 Data with NL2SQL
Data plays a central role in the NL2SQL task, as the quality and structure of the data directly impact the system’s performance and accuracy [41,42,43]. In this section, we first introduce the five key data types involved and then examine how each type contributes to different stages of the NL2SQL pipeline.
Fig. 3
Data types involved in NL2SQL lifecycle
3.1 Data types involved in NL2SQL
Figure 3 illustrates the five key data types we identify as essential to the NL2SQL lifecycle. Different data sources can serve as both the foundation and catalyst for the NL2SQL process by offering diverse, high-quality inputs that enhance data pre-processing, enrich model training, and ultimately lead to more accurate SQL generation:
- 1.
External Knowledge refers to supplementary information beyond the database itself, such as domain-specific ontologies, knowledge graphs, or common-sense reasoning resources. We further categorize this type of data into two main types:
External Knowledge (domain) refers to structured or curated resources specific to a particular field or industry, such as medical ontologies, legal taxonomies, or scientific knowledge bases, which provide contextual or semantic enrichment relevant to the database content.
External Knowledge (LLM) denotes information derived from large language models trained on vast corpora of general or specialized text, enabling inference, common-sense reasoning, or bridging knowledge gaps not explicitly covered by the database or domain-specific resources.
Incorporating external knowledge sources enables more context-aware and semantically precise SQL generation, especially in complex or open-domain settings.
- 2.
Text Corpora refers to a collection of texts, which may be either raw or annotated, and serves as a fundamental resource in the NL2SQL pipeline. These corpora are commonly used to train or fine-tune foundational translation models, enhancing their ability to comprehend and generate NL queries [13, 40]. Similarly, we also divide this type of data into two main types:
Text Corpora (dataset) refers to annotated datasets curated by humans, typically consisting of NL questions paired with corresponding SQL queries (a simplified record is sketched after this list).
Text Corpora (raw) refers to unannotated text sources, such as user query logs and documentation. Despite lacking explicit labels, these corpora can be leveraged during pretraining to enhance a model’s linguistic understanding and foundational capabilities.
By leveraging diverse, high-quality text corpora, NL2SQL systems can achieve more robust generalization and improved translation accuracy across domains.
- 3.
Database Schema describes the structural blueprint of a database, including table names, column names, data types, and relationships such as foreign keys. It serves as a critical reference for interpreting user queries and mapping natural language elements to their corresponding SQL components. By understanding the schema, NL2SQL systems can generate syntactically valid queries that reflect the data structure and maintain referential integrity.
- 4.
Database Instances consist of the actual records stored within a database—i.e., the rows of data populating each table. These instances provide semantic grounding for model training and inference by revealing how schema elements are populated in practice. Access to instance-level data allows NL2SQL systems to learn contextual patterns, resolve ambiguity in user input, and validate query accuracy through example-based reasoning.
- 5.
Execution Feedback encompasses the runtime outputs of SQL queries, including both the returned results and any error messages (or the absence of errors). This feedback loop plays a vital role in iterative model improvement by enabling techniques such as reinforcement learning, error correction, and post-hoc verification. By analyzing execution outcomes, NL2SQL systems can refine their predictions, correct invalid queries, and improve alignment with user intent over time.
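To make the Text Corpora (dataset) and Database Schema types concrete, below is a simplified sketch of an annotated training record and a serialized schema; the field and column names follow a Spider-style layout but are illustrative assumptions rather than an exact dataset format.

```python
# Simplified, Spider-style annotated record (Text Corpora, dataset); field names
# are illustrative and omit tokenized forms and the parsed SQL structure.
example = {
    "db_id": "flights",  # identifier of the target database (assumed name)
    "question": "Show all flight numbers with aircraft Airbus A340-330",
    "query": "SELECT F.flno FROM Flight AS F "
             "JOIN Aircraft AS A ON F.aid = A.aid "
             "WHERE A.name = 'Airbus A340-330'",
}

# Compact serialization of the Database Schema that systems commonly feed to
# the model; column names are assumptions matching the Fig. 1 example.
schema = {
    "Flight":   ["flno", "aid"],
    "Aircraft": ["aid", "name"],
}
```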
3.2 NL2SQL lifecycle with data
The NL2SQL lifecycle involves a series of stages that transform NL queries into executable SQL statements, including NL query understanding, schema linking, SQL query generation, and query post-refinement. Below, we discuss the lifecycle in detail, and outline how existing solutions leverage different data sources at each stage in Table 1.
NL Query Understanding. Given an NL query, the first stage is to grasp the user’s intent and identify key elements that help with the semantics understanding:
User Intent Understanding. Some existing works attempt to leverage external knowledge data to support intent recognition, as such knowledge can help compensate for missing or implicit semantics in natural language input. Depending on how external knowledge is used, it can be incorporated through query rewriting or, more commonly, via LLMs [8, 24, 29].
Entity Recognition. Entities mentioned in an NL query, such as names, carry rich semantics. In general, external knowledge data can also be used to identify them [23].
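As a rough illustration of this step, the sketch below surfaces literal values (quoted strings and numbers) mentioned in an NL query so that later stages can match them against database content; real systems typically rely on NER models or LLM prompting, and the regular expressions here are simplifications.

```python
# Toy sketch: extract quoted strings and numbers as candidate entities/values.
import re

def extract_literals(question: str):
    quoted = re.findall(r'"([^"]+)"|\'([^\']+)\'', question)
    numbers = re.findall(r'\b\d+(?:\.\d+)?\b', question)
    return [a or b for a, b in quoted] + numbers

# extract_literals('Show flights departing after 2019 from "Los Angeles"')
# -> ['Los Angeles', '2019']
```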
Schema Linking. Next, the second stage involves examining the structural context—specifically, the database schema and instance data—to identify the relevant tables, columns, and cell values needed to construct the SQL query. Depending on the techniques employed, such as classifiers [7, 8, 10], GNNs [33], or prompting methods [11, 39], this process may vary significantly in complexity, accuracy, and execution efficiency. In addition to structural elements, metadata, such as column descriptions, data types, and foreign key constraints, can provide valuable semantic signals to improve schema linking performance, particularly for complex or ambiguous queries.
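As a rough point of reference, the toy heuristic below ranks columns by lexical similarity to the question; it merely stands in for the classifier-, GNN-, or prompt-based linkers cited above and is an illustrative simplification.

```python
# Toy lexical schema-linking sketch: score each (table, column) against the
# question words by string similarity and keep the top candidates.
from difflib import SequenceMatcher

def link_schema(question: str, schema: dict, top_k: int = 3):
    """schema: mapping from table name to a list of column names."""
    words = question.lower().split()
    scored = []
    for table, columns in schema.items():
        for column in columns:
            best = max(SequenceMatcher(None, w, column.lower()).ratio() for w in words)
            scored.append((best, table, column))
    return sorted(scored, reverse=True)[:top_k]

# link_schema("Show all flight numbers with aircraft Airbus A340-330",
#             {"Flight": ["flno", "aid"], "Aircraft": ["aid", "name"]})
```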
SQL Query Generation. This step typically treats the model as a black box and relies primarily on an underlying translation model to perform the task. It generally uses the database schema and instance data as input to provide the necessary context, and may also include a few-shot learning approach in LLM-based solutions [12, 23, 24, 27, 39, 40] for improved performance. In addition, to enhance the performance of a translation model, a common and effective approach is to train or fine-tune the underlying model using large-scale text corpora before performing inference. Such data can be sourced from websites as raw, unstructured text [13], or manually annotated by humans to provide higher-quality supervision [1, 7, 19, 40, 44].
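The snippet below sketches the prompt assembly that few-shot, LLM-based solutions typically perform; the serialization format and the downstream `llm` call are assumptions rather than any specific system's prompt.

```python
# Toy few-shot prompt construction for an LLM-based NL2SQL translator.
def build_prompt(schema_text: str, demonstrations: list, question: str) -> str:
    parts = ["Translate the question into a SQL query for the given schema.", ""]
    for demo_question, demo_sql in demonstrations:  # few-shot examples
        parts += [f"Question: {demo_question}", f"SQL: {demo_sql}", ""]
    parts += [f"Schema: {schema_text}", f"Question: {question}", "SQL:"]
    return "\n".join(parts)

# predicted_sql = llm(build_prompt(schema_text, demonstrations, question))  # hypothetical LLM call
```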
Query Post-Refinement. The final stage is to refine the initially generated SQL query to improve its correctness and alignment with user intent. A straightforward approach involves executing the query to verify its executability and evaluating the returned results to determine whether they meet the expected outcome — an approach commonly referred to as execution feedback [46]. Additionally, external knowledge from LLMs [23, 24, 27, 39] or training a task-specific model [9, 10], can be leveraged to assist in assessing the semantic correctness of the query.
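A minimal sketch of such an execution-feedback loop is shown below, assuming a SQLite connection and a hypothetical `repair_model` callable that revises the SQL given the error message.

```python
# Toy execution-feedback loop: run the candidate query; on failure, ask a
# (hypothetical) repair model to revise it using the raised error message.
import sqlite3

def refine_with_execution_feedback(sql: str, conn: sqlite3.Connection,
                                   repair_model, max_attempts: int = 3):
    for _ in range(max_attempts):
        try:
            return sql, conn.execute(sql).fetchall()  # executable: return query + results
        except sqlite3.Error as error:
            sql = repair_model(sql, str(error))       # revise SQL with error feedback
    return sql, None                                  # unresolved after max_attempts
```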
4 NL2SQL benchmarks
With advancements in NL2SQL, a variety of datasets have emerged to address the evolving challenges, as shown in Table 2. These range from domain-specific datasets with simple queries to cross-domain datasets and further to large-scale, real-world datasets, reflecting both the progress in the field and the emergence of new challenges for NL2SQL solutions.
Single-Table, Domain-specific Datasets. Early NL2SQL datasets focused on specific domains with relatively simple SQL queries, such as ATIS [31] for flight information and Geo [14] for U.S. geographical facts. Recently, larger single-domain datasets [17, 47] have been introduced, featuring more complex databases and SQL queries tailored to specific scenarios. This shift reflects an increased emphasis on assessing the performance and practical utility of NL2SQL systems in specific domains.
Multi-Table, Cross-Domain Datasets. After the development of early single-domain datasets, the NL2SQL field shifted toward cross-domain datasets to better evaluate systems’ ability to generalize across diverse SQL queries and database schemas. For instance, WikiSQL [17] was the first cross-domain dataset, drawing tables from Wikipedia spanning a wide range of topics. Spider [32], along with its various extensions [49, 50]