It's rare to talk about heuristic systems when there's so much hype around probabilistic ones. This post is boring, but stay with me.
My work at Atlan has touched SQL parsing since the beginning. I made early contributions to the query engine with policy-based authorization powering Insights, and we generate SQL lineage by parsing queries to power column-level lineage. Along the way, my colleagues and I have evaluated a lot of SQL parsers, both open source and commercial: SQLGlot, sqlparser-rs, sqloxide, Apache Calcite, Gudusoft GSP, JSqlParser, and others, each with different tradeoffs.
This post is my attempt to distill what I've learned. I'm not an expert, just someone curious enough to ask: why do so many SQL parsers exist? And what are they actually doing under the hood?
The idea for this writeup came during a drive back from SFO after dropping a friend off at the airport. That turned into about 10 hours of conversation with Claude to pull it all together.
Table of Contents
- What is a SQL Parser?
- The Full Pipeline
- Lexical Analysis (The Lexer)
- Syntactic Analysis (The Parser)
- The Abstract Syntax Tree
- Syntax vs Semantics
- Column-Level Lineage
- SQL Dialects
- Parsers vs Query Engines
- Comparing SQL Parsers
- Further Reading
What is a SQL Parser?
A SQL parser reads SQL text and converts it into a structured representation, usually a tree, that computers can work with. It's the "understanding" step.
INPUT: "CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'"
        ↓
  [SQL PARSER]
        ↓
OUTPUT: A tree structure representing the query's meaning
Think of it like how your brain parses a sentence to extract meaning. The parser does the same for SQL.
The Full Pipeline
Every SQL parser follows the same fundamental pipeline. This isn't a design choice. It's a consequence of how language processing works.
Lexer: Breaks the SQL string into tokens. Keywords, identifiers, operators, literals. Like recognizing words in a sentence.
Parser: Takes tokens and builds a tree based on grammar rules. Like understanding βsubject-verb-objectβ structure.
AST: The Abstract Syntax Tree. A clean, navigable representation of the queryβs structure.
Semantic Analysis: Adds meaning. Does this table exist? What type is this column? This is where you need schema information.
Lexical Analysis (The Lexer)
The lexer converts a stream of characters into meaningful chunks called tokens.
Input:
CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'
Output:
Token(CREATE)
Token(TABLE)
Token(IDENTIFIER, "active_users")
Token(AS)
Token(SELECT)
Token(IDENTIFIER, "id")
Token(COMMA)
Token(IDENTIFIER, "name")
Token(COMMA)
Token(IDENTIFIER, "email")
Token(FROM)
Token(IDENTIFIER, "users")
Token(WHERE)
Token(IDENTIFIER, "status")
Token(EQUALS)
Token(STRING, "active")
The lexer handles dialect-specific decisions early:
| Decision | Standard SQL | MySQL | SQL Server | PostgreSQL |
|---|---|---|---|---|
| Identifier quote | "name" | `name` | [name] | "name" |
| String quote | 'text' | 'text' or "text" | 'text' | 'text' |
| Line comment | -- | -- or # | -- | -- |
| Case sensitivity | Insensitive | Insensitive | Insensitive | Insensitive (keywords) |
Lexers are simple. They look at one character (or a few) at a time, don't need to understand nesting or structure, and can be implemented with state machines or regex. This is why SQLGlot's optional Rust tokenizer gives a ~30% speedup: tokenization is pure CPU-bound character scanning.
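They really are simple enough to sketch by hand. Here is a minimal regex-driven tokenizer that produces the token stream above; the `Token` and `tokenize` names are illustrative, not from any of the libraries discussed here:

```python
import re
from typing import NamedTuple

# A toy lexer for the example query; real lexers handle many more token
# kinds (numbers, operators, comments, per-dialect quoting rules).
KEYWORDS = {"CREATE", "TABLE", "AS", "SELECT", "FROM", "WHERE"}

class Token(NamedTuple):
    kind: str
    value: str

TOKEN_RE = re.compile(
    r"""(?P<STRING>'[^']*')
      | (?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)
      | (?P<COMMA>,)
      | (?P<EQUALS>=)
      | (?P<WS>\s+)""",
    re.VERBOSE,
)

def tokenize(sql: str) -> list[Token]:
    tokens = []
    pos = 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            raise ValueError(f"unexpected character {sql[pos]!r} at {pos}")
        pos = m.end()
        kind, text = m.lastgroup, m.group()
        if kind == "WS":
            continue                                   # whitespace just separates tokens
        if kind == "STRING":
            tokens.append(Token("STRING", text[1:-1]))  # strip the quotes
        elif kind == "IDENT" and text.upper() in KEYWORDS:
            tokens.append(Token(text.upper(), text))    # keywords are case-insensitive
        elif kind == "IDENT":
            tokens.append(Token("IDENTIFIER", text))
        else:
            tokens.append(Token(kind, text))
    return tokens
```

Note how the dialect decisions from the table above would live here: swapping the quoting rules in `TOKEN_RE` is how a lexer becomes MySQL-flavored or SQL Server-flavored.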
Syntactic Analysis (The Parser)
The parser reads tokens and builds a tree structure based on grammar rules.
Input tokens:
[CREATE, TABLE, active_users, AS, SELECT, id, COMMA, name, COMMA, email, FROM, users, WHERE, status, =, 'active']
Grammar rule:
ctas_stmt := CREATE TABLE table_name AS select_stmt
select_stmt := SELECT column_list FROM table_name [WHERE condition]
Parser thinks:
- "I see CREATE TABLE… this must be a ctas_stmt"
- "Next I need a table_name… I see 'active_users'"
- "Next I need AS… got it"
- "Now I need a select_stmt…"
- "I see SELECT… parsing the inner query"
- "Column list: id, name, email"
- "FROM users"
- "WHERE status = 'active'"
The parser is essentially a state machine following grammar rules, consuming tokens and building tree nodes.
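That walkthrough maps almost line-for-line onto code. Below is a hand-written recursive-descent sketch of the select_stmt rule above, operating on (kind, value) token pairs; all names are illustrative, not from any library discussed here:

```python
# Recursive-descent parser for:
#   select_stmt := SELECT column_list FROM table_name [WHERE condition]
# Tokens are (kind, value) pairs, e.g. ("IDENTIFIER", "users").

def parse_select(tokens):
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def expect(kind):
        # Consume one token, failing loudly if it isn't what the grammar requires
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, got {peek()}")
        tok = tokens[pos]
        pos += 1
        return tok

    expect("SELECT")
    columns = [expect("IDENTIFIER")[1]]
    while peek() == "COMMA":            # column_list: IDENT (, IDENT)*
        expect("COMMA")
        columns.append(expect("IDENTIFIER")[1])
    expect("FROM")
    table = expect("IDENTIFIER")[1]
    node = {"type": "SelectStatement", "columns": columns, "from": table}
    if peek() == "WHERE":               # optional WHERE col = 'literal'
        expect("WHERE")
        left = expect("IDENTIFIER")[1]
        expect("EQUALS")
        right = expect("STRING")[1]
        node["where"] = {"op": "Equals", "left": left, "right": right}
    return node
```

Each grammar rule becomes a stretch of code, and `expect` is the "consuming tokens" half of the state machine.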
Why Parsing is Harder Than Lexing
Lexing is pattern matching. Parsing is about structure.
SELECT * FROM (SELECT * FROM (SELECT * FROM t))
A lexer sees: (, (, (, ), ), )
A parser must match each ( with its corresponding ).
This is why regex can't parse SQL: regular expressions can't count balanced parentheses. That's a mathematically proven limitation (regular languages vs. context-free languages).
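The parser's extra power amounts to keeping a counter (equivalently, a stack) while scanning, which a single regular expression cannot do:

```python
def parens_balanced(sql: str) -> bool:
    # A parser tracks nesting depth as it goes; this counter is the
    # minimal version of the stack a recursive-descent parser carries.
    depth = 0
    for ch in sql:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a closing paren with no opener
    return depth == 0
```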
The Abstract Syntax Tree
The AST represents the queryβs structure in a clean, navigable tree.
CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'
CreateTableAsSelect
├── table_name: "active_users"
└── query:
    └── SelectStatement
        ├── columns:
        │   ├── Column { name: "id" }
        │   ├── Column { name: "name" }
        │   └── Column { name: "email" }
        ├── from:
        │   └── Table { name: "users" }
        └── where:
            └── BinaryOp
                ├── left: Column { name: "status" }
                ├── op: Equals
                └── right: Literal { value: "active" }
The AST is the central data structure. Everything downstream (analysis, transformation, code generation) operates on it.
What You Can Do With an AST
| Operation | Description |
|---|---|
| Traversal | Walk the tree, collect information |
| Transform | Rewrite nodes (add filters, change structure) |
| Generate | Convert back to SQL string (round-trip) |
| Transpile | Generate SQL for a different dialect |
| Lineage | Trace where data comes from |
Syntax vs Semantics
This is where things get interesting.
Syntactic analysis (parsing): Is this valid SQL grammar?
Semantic analysis: Does this SQL make sense given the database schema?
CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'
| Syntactic (Parser answers) | Semantic (Analyzer answers) |
|---|---|
| ✓ Valid CTAS structure | Does `users` table exist? |
| ✓ Valid SELECT clause | Does `users` have `id`, `name`, `email` columns? |
| ✓ Valid WHERE condition | Is `status` a valid column? |
| ✓ Correct keyword order | Is comparing `status` to a string valid? |
Key insight: A parser with no schema information can only do syntactic analysis. Semantic analysis requires external knowledge about the database.
What Syntactic Analysis Catches
SELECT FROM users -- Missing column list
FROM users SELECT * -- Wrong keyword order
SELECT (a + b -- Unbalanced parentheses
What Syntactic Analysis Cannot Catch
SELECT * FROM nonexistent -- Parser doesn't know if table exists
SELECT foo FROM users -- Parser doesn't know columns
WHERE name > 5 -- Parser doesn't know types
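A minimal sketch of what the semantic layer adds, assuming we hold the schema as a plain dict (the names and structure here are illustrative, not any real catalog API):

```python
# The parser produced (table, columns); the semantic check supplies the
# external knowledge: which tables and columns actually exist.
SCHEMA = {"users": {"id", "name", "email", "status"}}

def check_semantics(table: str, columns: list[str], schema=SCHEMA) -> list[str]:
    errors = []
    if table not in schema:
        errors.append(f"unknown table: {table}")
        return errors  # no point checking columns of a missing table
    for col in columns:
        if col not in schema[table]:
            errors.append(f"unknown column: {table}.{col}")
    return errors
```

Type checking (the `WHERE name > 5` case) is the same idea with column types in the schema instead of just names.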
Column-Level Lineage
Lineage traces where data comes from and where it goes.
CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'
Table-level lineage (easy, no schema needed):
READ: [users]
WRITE: [active_users]
Column-level lineage (needs schema):
active_users.id ← users.id (direct)
active_users.name ← users.name (direct)
active_users.email ← users.email (direct)
Data Flow vs Control Flow
This is where lineage gets nuanced.
CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'
Question: Does status contribute to the lineage of active_users?
| Perspective | status in lineage? | Reasoning |
|---|---|---|
| Data Flow | ✗ No | status doesn't appear in output columns |
| Control Flow | ✓ Yes | status affects which rows are included |
Most "lineage" discussions mean data flow. But impact analysis needs both: changing the status column could break this query even though it's not in the output.
SQL Dialects
SQL is a "standard" that nobody fully implements.
The Dialect Landscape
Every database vendor implements a subset of the SQL standard, adds proprietary extensions, and has different syntax for the same operations.
                          ┌─────────────────┐
                          │  SQL Standard   │
                          │  (SQL-92, 99,   │
                          │   2003, 2016)   │
                          └────────┬────────┘
                                   │
     ┌──────────┬─────────┬────────┼───────────┬──────────┐
     ▼          ▼         ▼        ▼           ▼          ▼
┌──────────┬──────────┬───────┬──────────┬───────────┬──────────┐
│PostgreSQL│  MySQL   │Oracle │SQL Server│ Snowflake │ BigQuery │
│ +ARRAY   │ +LIMIT   │+ROWNUM│ +TOP     │ +FLATTEN  │ +STRUCT  │
│ +JSONB   │ +BACKTICK│+DUAL  │ +[]      │ +VARIANT  │ +UNNEST  │
│ +::cast  │ +AUTO_INC│+PLSQL │ +T-SQL   │ +$$       │ +SAFE_   │
└──────────┴──────────┴───────┴──────────┴───────────┴──────────┘
Each dialect: ~80% common SQL + ~20% proprietary extensions
This fragmentation is why so many SQL parsers exist. A parser built for PostgreSQL won't understand MySQL's backtick identifiers. A parser built for standard SQL won't handle Snowflake's FLATTEN function.
Identifier Quoting
| Dialect | Quote Style | Example |
|---|---|---|
| Standard SQL | "double quotes" | SELECT "Column" FROM "Table" |
| MySQL | `backticks` | SELECT `Column` FROM `Table` |
| SQL Server | [brackets] | SELECT [Column] FROM [Table] |
| BigQuery | `backticks` | SELECT `Column` FROM `Table` |
Row Limiting
-- MySQL, PostgreSQL, SQLite
SELECT * FROM users LIMIT 10 OFFSET 5
-- SQL Server
SELECT TOP 10 * FROM users
-- Or (SQL Server 2012+)
SELECT * FROM users ORDER BY id OFFSET 5 ROWS FETCH NEXT 10 ROWS ONLY
-- Oracle (traditional)
SELECT * FROM users WHERE ROWNUM <= 10
-- Oracle 12c+
SELECT * FROM users FETCH FIRST 10 ROWS ONLY
Type Casting
-- Standard SQL
CAST(x AS INTEGER)
-- PostgreSQL
x::INTEGER
-- MySQL
CAST(x AS SIGNED) -- No INTEGER, use SIGNED/UNSIGNED
-- BigQuery
CAST(x AS INT64)
SAFE_CAST(x AS INT64) -- Returns NULL instead of error
Function Name Differences
The same operation, different names across dialects:
| Operation | PostgreSQL | MySQL | SQL Server | Snowflake |
|---|---|---|---|---|
| Current timestamp | NOW() | NOW() | GETDATE() | CURRENT_TIMESTAMP() |
| String length | LENGTH() | LENGTH() | LEN() | LENGTH() |
| If null | COALESCE() | IFNULL() | ISNULL() | NVL() |
| Date add | + INTERVAL '1 day' | DATE_ADD() | DATEADD() | DATEADD() |
How Parsers Handle Dialects
Dialect flags (sqlparser-rs): Single parser with ~50 boolean flags like supports_filter_during_aggregation, supports_group_by_expr. Simple but can't handle major syntax differences.
Parameterized grammar (SQLGlot): Base parser with overridable methods per dialect. Each dialect class inherits and overrides specific parsing methods. Flexible but has complex inheritance.
Separate grammars (Gudusoft): One complete grammar file per database. Complete accuracy but high maintenance burden. When a database releases a new version, you update that grammar file.
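The dialect-flags idea can be sketched in a few lines (a toy illustration, not the actual sqlparser-rs API): one parser consults a capabilities object at each decision point.

```python
from dataclasses import dataclass

# Hypothetical capability flags; real parsers have dozens of these.
@dataclass(frozen=True)
class Dialect:
    identifier_quotes: tuple[str, str] = ('"', '"')  # standard SQL default
    supports_limit: bool = True
    supports_top: bool = False

MYSQL = Dialect(identifier_quotes=("`", "`"))
SQL_SERVER = Dialect(identifier_quotes=("[", "]"), supports_limit=False, supports_top=True)

def quote_identifier(name: str, dialect: Dialect) -> str:
    # A generator (or lexer, in reverse) consults the flags instead of
    # hard-coding one dialect's syntax.
    open_q, close_q = dialect.identifier_quotes
    return f"{open_q}{name}{close_q}"
```

The tradeoff the post mentions shows up immediately: a boolean can select between quote styles, but it cannot express something structurally different like Snowflake's LATERAL FLATTEN, which is why SQLGlot and Gudusoft reach for heavier mechanisms.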
Parsers vs Query Engines
This distinction trips people up. A parser and a query engine are not the same thing.
┌─────────────────────────────┐   ┌─────────────────────────────────┐
│          PARSER             │   │          QUERY ENGINE           │
│                             │   │                                 │
│ • Lexical analysis          │   │ • Query planning                │
│ • Syntactic analysis        │   │ • Query optimization            │
│ • AST construction          │   │ • Physical execution            │
│ • (Optional) Semantic       │   │ • Data access                   │
│   analysis                  │   │ • Join algorithms               │
│                             │   │ • Aggregation                   │
│ INPUT: SQL string           │   │ • Sorting                       │
│ OUTPUT: AST or errors       │   │ • Result materialization        │
│                             │   │                                 │
│ NO data access              │   │ READS/WRITES data               │
│ NO execution                │   │ EXECUTES query                  │
└─────────────────────────────┘   └─────────────────────────────────┘
Examples:                          Examples:
 • SQLGlot                          • PostgreSQL
 • sqlparser-rs                     • DuckDB
 • JSqlParser                       • Apache Spark
                                    • Presto/Trino
The Full Query Processing Pipeline
When you run a query in a database, it goes through many stages. Parsing is just the first.
┌─────────────────┐
│   SQL String    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     PARSER      │ ◄── SQLGlot, sqlparser-rs stop here
└────────┬────────┘
         │ AST
         ▼
┌─────────────────┐
│    ANALYZER     │ ◄── Semantic analysis (name resolution, types)
└────────┬────────┘
         │ Analyzed AST
         ▼
┌─────────────────┐
│     PLANNER     │ ◄── Logical plan (relational algebra)
└────────┬────────┘
         │ Logical Plan
         ▼
┌─────────────────┐
│    OPTIMIZER    │ ◄── Rule-based and cost-based optimization
└────────┬────────┘
         │ Optimized Plan
         ▼
┌─────────────────┐
│    EXECUTOR     │ ◄── Physical operators, data access
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     RESULTS     │
└─────────────────┘
What Parsers Do
| Capability | Parser Does |
|---|---|
| Syntax validation | ✓ Detect SELECT FROM (missing columns) |
| AST construction | ✓ Build tree structure |
| Transpilation | ✓ Convert MySQL → PostgreSQL |
| Lineage extraction | ✓ Find table/column dependencies |
| Query formatting | ✓ Pretty-print SQL |
What Parsers Don't Do
| Capability | Parser | Engine |
|---|---|---|
| Execute query | ✗ | ✓ |
| Return results | ✗ | ✓ |
| Optimize execution | ✗ | ✓ |
| Choose join order | ✗ | ✓ |
| Manage transactions | ✗ | ✓ |
Why This Matters
If you're building a data catalog and need lineage, you need a parser. You don't need a query engine.
If you're building an IDE with autocomplete, you need a parser. You don't need to execute anything.
If you're building a transpiler to migrate queries from Snowflake to Databricks, you need a parser with good dialect support. You don't need to run those queries.
The tools are different because the problems are different.
At a Glance
| Parser | Language | License | Dialects | One-liner |
|---|---|---|---|---|
| SQLGlot | Python | MIT | 31 | Most feature-complete open-source option |
| sqlparser-rs | Rust | Apache 2.0 | ~15 | Fast, minimal, foundation for Rust engines |
| Apache Calcite | Java | Apache 2.0 | ~10 | Full query planning framework |
| Gudusoft GSP | Java/C# | Commercial | 25+ | Enterprise, stored procedure support |
| JSqlParser | Java | Apache/LGPL | ~6 | Mature, simple Java parser |
Layer Support
| Parser | Lexer | Parser | AST | Semantic | Lineage |
|---|---|---|---|---|---|
| SQLGlot | ✓ | ✓ | ✓ | ✓ | ✓ |
| sqlparser-rs | ✓ | ✓ | ✓ | ✗ | ✗ |
| Calcite | ✓ | ✓ | ✓ | ✓ | ⚠️ |
| Gudusoft GSP | ✓ | ✓ | ✓ | ✓ | ✓ |
| JSqlParser | ✓ | ✓ | ✓ | ✗ | ✗ |
Features
| Parser | Transpile | Format | Lineage | Schema | Round-trip |
|---|---|---|---|---|---|
| SQLGlot | ✓ | ✓ | ✓ | ✓ | ✓ |
| sqlparser-rs | ✗ | ⚠️ | ✗ | ✗ | ⚠️ |
| Calcite | ⚠️ | ✗ | ⚠️ | ✓ | ✓ |
| Gudusoft GSP | ⚠️ | ✓ | ✓ | ✓ | ✓ |
| JSqlParser | ✗ | ⚠️ | ✗ | ✗ | ✓ |
Quick Decision Matrix
| Your Situation | Recommended |
|---|---|
| Python, need transpilation | SQLGlot |
| Python, need lineage | SQLGlot |
| Rust query engine | sqlparser-rs |
| Browser-based SQL tool | sqlparser-rs (WASM) |
| Java, basic parsing | JSqlParser |
| Java, query planning | Apache Calcite |
| Enterprise, stored procs | Gudusoft GSP |
Further Reading
Still interested? There's more to read, and hopefully I've piqued your interest. For any comments or feedback, please reach out and I'll address them.
Parsing Algorithms
- Recursive Descent – Top-down, hand-written parsers. Most SQL parsers use this.
- Pratt Parsing – Elegant handling of operator precedence.
- LR Parsing – Bottom-up, table-driven. Used by parser generators like Bison.
- Parser Combinators – Functional composition of parsers.
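Pratt parsing in particular is compact enough to show whole. A toy expression parser driven by binding powers (all names illustrative):

```python
import re

# Higher binding power = binds tighter; this replaces layered grammar rules
# for precedence in a recursive-descent parser.
BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

def tokenize(src):
    return re.findall(r"\d+|[+\-*/()]", src)

def parse_expr(tokens, min_bp=0):
    tok = tokens.pop(0)
    if tok == "(":
        left = parse_expr(tokens, 0)  # reset precedence inside parens
        tokens.pop(0)                 # consume ")"
    else:
        left = int(tok)
    # Keep absorbing operators that bind tighter than our caller's context
    while tokens and tokens[0] in BINDING_POWER and BINDING_POWER[tokens[0]] > min_bp:
        op = tokens.pop(0)
        right = parse_expr(tokens, BINDING_POWER[op])
        left = (op, left, right)      # build the AST node
    return left
```

The same loop, with SQL's operator table (AND, OR, comparison, arithmetic), is how expression parsing works inside most of the parsers listed below.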
Parser Libraries
- SQLGlot – Python, 31 dialects, transpilation, lineage
- sqlparser-rs – Rust, fast, WASM support
- Apache Calcite – Java, full query planning
- JSqlParser – Java, mature, simple
- Gudusoft GSP – Commercial, enterprise features