Ask an LLM "How much does an americano cost at the cafe?" and it confidently answers $4.50 — a complete hallucination. RAG retrieves evidence from documents, but when your domain spans 13 tables with 386 rows of structured data, direct database queries are far more accurate. That's why we adopted Text2SQL: an LLM router classifies query intent, Text2SQL generates SQL against PostgreSQL, and a second LLM call produces the natural language response. Across 40 real-world test questions, accuracy jumped from 10.8% to 70.1%.
10.8%→70.1%
Fact Accuracy Gain
100%
SQL Syntax Validity
90%
Router Accuracy
1,926ms
Avg Response Time
The Limits of Synonym Dictionaries
In our previous hybrid search system, we used synonym dictionaries to map user queries to database schemas — "price" → price, "menu" → menu_items. This worked fine with 3–4 tables. But when we scaled to 13 tables, the approach collapsed.
Why Synonym Dictionaries Break Down
Testing the same 40 questions with the synonym dictionary approach yielded just 41.2% accuracy. Simple lookups worked, but anything involving JOINs or aggregations failed outright. As the table count grows, synonym management becomes unrealistic — so we switched to Text2SQL, where the LLM reads the schema directly and writes SQL.
LLM Router + Text2SQL Architecture
The pipeline has four stages. It calls the LLM twice, but SQL execution and RAG retrieval take just milliseconds, keeping total response time under 2 seconds.
Processing Flow
Natural language query → Intent classification (DB query / general chat / RAG)
Table schema + question → SQL query generation (LLM)
Run query on PostgreSQL → Return result rows
SQL results + original question → Natural language response (LLM)
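The four stages above can be sketched as a single pipeline function. This is a minimal illustration, not our production code; `call_llm` and `run_sql` are hypothetical stand-ins for the real LLM client and PostgreSQL driver.

```python
# Minimal sketch of the four-stage pipeline.
# `call_llm(prompt) -> str` and `run_sql(sql) -> rows` are hypothetical
# stand-ins for the real LLM client and database driver.

def route(question: str, call_llm) -> str:
    """Stage 1: classify intent as 'db', 'chat', or 'rag'."""
    prompt = (
        "Classify the user question as one of: db, chat, rag.\n"
        f"Question: {question}\nAnswer with a single word."
    )
    return call_llm(prompt).strip().lower()

def answer(question: str, schema_ddl: str, call_llm, run_sql) -> str:
    intent = route(question, call_llm)
    if intent != "db":
        return call_llm(question)  # chat / RAG paths elided here
    # Stage 2: table schema + question -> SQL (1st LLM call)
    sql = call_llm(f"{schema_ddl}\n\nWrite one SQL query for: {question}")
    # Stage 3: run the query on PostgreSQL
    rows = run_sql(sql)
    # Stage 4: SQL results + original question -> answer (2nd LLM call)
    return call_llm(f"Question: {question}\nSQL result: {rows}\nAnswer:")
```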
The LLM router is the critical piece. It decides whether a question needs a database query, a general conversation response, or RAG retrieval. If the router misclassifies, no amount of SQL quality can save the answer. Router accuracy sets the ceiling for the entire pipeline.
Schema prompt design: We inject the DDL (CREATE TABLE statements) for all 13 tables directly into the LLM prompt — table names, column names, data types, and FK relationships. This adds about 2,000 tokens. The LLM reads this schema to determine which tables to JOIN and which columns to filter. The synonym dictionary required humans to define mappings; Text2SQL lets the LLM reason over the schema.
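Assembling the schema prompt is mostly string concatenation. A sketch of the idea follows, with illustrative two-table DDL rather than the real 13-table hospital schema:

```python
# Sketch of building the schema prompt from DDL. The tables here are
# illustrative, not the actual 13-table hospital schema.

TABLE_DDL = {
    "departments": """CREATE TABLE departments (
    department_id SERIAL PRIMARY KEY,
    name          TEXT NOT NULL
);""",
    "doctors": """CREATE TABLE doctors (
    doctor_id     SERIAL PRIMARY KEY,
    name          TEXT NOT NULL,
    department_id INT REFERENCES departments(department_id)
);""",
}

def build_schema_prompt(ddl: dict) -> str:
    parts = ["You can query these PostgreSQL tables:"]
    parts += ddl.values()
    parts.append(
        "Use only the tables and columns above. "
        "Join via the foreign keys shown in REFERENCES clauses."
    )
    return "\n\n".join(parts)
```

The LLM infers JOIN paths from the `REFERENCES` clauses, so keeping foreign-key constraints in the DDL matters more than prose descriptions of the relationships.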
40-Question Real-World Test
The test domain is a general hospital with patients, doctors, departments, medical records, prescriptions, billing, rooms, and appointments — 13 tables, 386 rows of data. Questions were designed across 4 difficulty tiers.
| Tier | Difficulty | Example Question | Questions | Accuracy |
|---|---|---|---|---|
| Tier 1 | Simple Lookup | List all doctors in internal medicine | 12 | 75% |
| Tier 2 | JOIN | Show Kim's prescription history | 10 | 88% |
| Tier 3 | Aggregation | Average treatment cost by department? | 10 | 64% |
| Tier 4 | Complex | Most prescribed drug in March? | 8 | 52% |
| Overall Average | — | — | 40 | 70.1% |
The surprising finding: Tier 2 (JOIN) scored higher than Tier 1 (simple lookup) at 88%. JOIN queries like "Kim's prescriptions" have explicit table relationships, so the LLM identifies foreign keys accurately. Simple lookups, however, sometimes require inferring status values like status = 'admitted' from context — which is harder than following a clear FK path.
SQL Generation Quality Metrics
100%
SQL Generation Rate
All 40 questions produced SQL queries
100%
Syntax Validity
Every generated query was syntactically valid
95%
Execution Success
38/40 queries ran without errors
70.1%
Answer Accuracy
Results matched the correct answer
The 100% SQL generation and syntax validity rates come from injecting actual DDL into the prompt. The LLM sees the real table structure, so it rarely references non-existent columns. The accuracy gap (100% → 70.1%) represents cases where the SQL was syntactically correct but returned semantically wrong results.
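One cheap way to catch invalid SQL before execution is to ask the database to plan the query without running it. The sketch below uses stdlib `sqlite3` so it is self-contained; against PostgreSQL the equivalent would be running `EXPLAIN <query>` via a driver like psycopg.

```python
import sqlite3

# Sketch: validate generated SQL by asking the database to plan it
# without executing it. sqlite3 is used here only to keep the demo
# self-contained; on PostgreSQL you would run `EXPLAIN <query>`.

def is_valid_sql(conn: sqlite3.Connection, sql: str) -> bool:
    try:
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doctors (id INTEGER, name TEXT, dept TEXT)")
print(is_valid_sql(conn, "SELECT name FROM doctors WHERE dept = 'IM'"))  # valid
print(is_valid_sql(conn, "SELECT nme FROM doctors"))  # invalid: bad column
```

Note this only catches syntax and reference errors; it cannot detect the semantically-wrong-but-valid queries that account for the 100% → 70.1% gap.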
Three-Way Accuracy Comparison
We ran the same 40 questions through three approaches. The contrast with LLM-only responses is stark.
- **LLM Only** — no database access at all; every number, name, and figure is hallucinated.
- **Synonym Dictionary** — keyword mapping handles simple lookups but collapses on JOINs and aggregations.
- **Text2SQL** — the LLM reads the schema and writes SQL directly, handling complex multi-table queries.
With LLM only, every patient name, treatment cost, and prescription is fabricated. The 10.8% accuracy comes from yes/no questions that happened to be correct by chance. Synonym dictionaries work for simple lookups but hit a wall as table count increases and mapping maintenance becomes unsustainable.
Why 70.1% matters: In our hybrid search benchmark, RAG achieved 92.5% fact accuracy — but only for information that exists in documents. Text2SQL can reach all data in every table via SQL. Aggregations (SUM, AVG, MAX) and conditional filtering are impossible with RAG but natural with SQL. In production, the best approach combines RAG and Text2SQL.
Latency Analysis
Total response time averages 1,926ms, well under our 2,500ms target. The bottleneck is clear — it's the LLM calls.
| Stage | Duration | % of Total | Notes |
|---|---|---|---|
| LLM Routing + SQL Gen | ~900ms | 46.7% | 1st LLM call |
| SQL Execution (PostgreSQL) | ~7ms | 0.4% | Indexed, 386 rows |
| RAG Retrieval (parallel) | ~18ms | 0.9% | Runs alongside DB query |
| Answer Generation (LLM) | ~1,000ms | 51.9% | 2nd LLM call |
| Total | ~1,926ms | 100% | Under 2,500ms target |
SQL execution (7ms) and RAG retrieval (18ms) combined take just 25ms. 98.7% of total response time is LLM inference. The only meaningful optimization path is faster LLM inference — GPU upgrades or AWQ quantization to increase tok/s will reduce response time proportionally.
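The "RAG retrieval (parallel)" row works because both retrieval paths are awaited concurrently while the pipeline waits on nothing else. A sketch of that concurrency pattern, with sleeps standing in for the measured latencies:

```python
import asyncio

# Sketch of running SQL execution and RAG retrieval concurrently.
# The sleeps simulate the measured latencies from the table above;
# both stubs are illustrative, not the real query/retrieval code.

async def run_sql(sql: str):
    await asyncio.sleep(0.007)   # ~7ms on indexed PostgreSQL tables
    return [("Internal Medicine", 12)]

async def rag_search(question: str):
    await asyncio.sleep(0.018)   # ~18ms vector retrieval
    return ["doc snippet"]

async def retrieve(sql: str, question: str):
    # Concurrent: finishes in ~18ms instead of ~25ms sequentially
    return await asyncio.gather(run_sql(sql), rag_search(question))

rows, docs = asyncio.run(retrieve("SELECT ...", "average cost by department?"))
```

At these magnitudes the saving is negligible next to ~1,900ms of LLM inference, which is exactly the point: retrieval is not worth optimizing further.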
LLM Router Performance
90%
Routing Accuracy
36/40 correctly classified
4
Misclassifications
DB queries sent to general chat
0
Reverse Errors
No general queries sent to DB
All 4 routing errors were in the same direction: DB queries misclassified as general conversation. Zero reverse errors means no unnecessary SQL execution ever occurred. The router learned a conservative strategy — when unsure, default to general chat rather than risk a pointless database query.
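That conservative bias can also be enforced in code, not just learned: any label the router returns outside the known set falls back to general chat, never to a database query. A minimal sketch, where `classify` is a hypothetical stand-in for the first LLM call:

```python
# Sketch of the conservative routing policy: unknown or malformed
# router output defaults to 'chat', never to a database query.
# `classify(question) -> str` is a hypothetical LLM-call stand-in.

VALID_ROUTES = {"db", "chat", "rag"}

def route_conservatively(question: str, classify) -> str:
    label = classify(question).strip().lower()
    # When unsure, a pointless chat reply is cheaper than a pointless query
    return label if label in VALID_ROUTES else "chat"
```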
Limitations & Next Steps
While 70.1% is a major improvement over the synonym dictionary's 41.2%, it's not yet production-ready on its own. Analyzing the roughly 30% of failed answers reveals clear patterns.
- **Relative dates** — Symptom: fails to convert 'last 3 months' into a PostgreSQL date expression (e.g. `CURRENT_DATE - INTERVAL '3 months'`). Fix: inject the current date into the prompt and add few-shot examples of date functions.
- **Zero-row results** — Symptom: SQL is correct but returns 0 rows; the LLM hallucinates instead of saying 'no data found'. Fix: explicitly inject 'no data found' context when SQL returns 0 rows.
- **Currency formatting** — Symptom: unit conversion errors when formatting currency values in natural language. Fix: pass raw numbers from SQL and handle formatting in a post-processing layer.
- **Status inference** — Symptom: 'currently admitted' has no corresponding status value in the schema, so inference fails. Fix: add enum values and business rules as comments in the schema prompt.
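The date and enum fixes amount to appending a few annotated lines to the schema prompt. A sketch, with an illustrative column name and enum values (not necessarily the real schema's):

```python
from datetime import date

# Sketch of the prompt-engineering fixes: inject today's date and
# annotate enum columns with allowed values and business rules.
# The column name and enum values below are illustrative assumptions.

ENUM_NOTES = {
    "patients.status": "one of 'admitted', 'discharged', 'outpatient'; "
                       "'currently admitted' means status = 'admitted'",
}

def augment_schema_prompt(schema_ddl: str) -> str:
    lines = [f"-- Today's date: {date.today().isoformat()}", schema_ddl]
    for col, note in ENUM_NOTES.items():
        lines.append(f"-- {col}: {note}")
    lines.append("-- For relative dates like 'last 3 months', use "
                 "CURRENT_DATE - INTERVAL '3 months'.")
    return "\n".join(lines)
```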
Every fix listed above is achievable through prompt engineering alone — no model changes or architecture rewrites needed. Adding date function examples, enum value lists, and zero-row handling rules to the schema prompt should push accuracy above 80%.
Hallucination Defense in Practice
Text2SQL's biggest advantage: if the data doesn't exist, SQL returns empty results. An LLM alone would fabricate patient names and treatment costs. Text2SQL returns 0 rows. The challenge is how the LLM interprets those empty results.
Cases where the LLM hallucinates on zero-row results are addressed by adding explicit guardrails to the prompt.
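One such guardrail is to make the zero-row case explicit in the answer prompt instead of handing the LLM an empty list it might paper over. A minimal sketch:

```python
# Sketch of a zero-row guardrail: when the query returns no rows, the
# answer prompt states that explicitly and forbids invented values.

def build_answer_prompt(question: str, rows: list) -> str:
    if not rows:
        return (f"Question: {question}\n"
                "The database query returned NO matching rows. "
                "Reply that no data was found. Do not invent values.")
    return (f"Question: {question}\n"
            f"SQL result: {rows}\n"
            "Answer using only this data.")
```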
Conclusion
The verdict on switching from synonym dictionaries to Text2SQL is clear. Once you exceed 5 tables, synonym maintenance becomes unsustainable. Having the LLM read the schema directly and write SQL wins on both accuracy and scalability.
Final Comparison
| Metric | LLM Only | Synonym Dict | Text2SQL |
|---|---|---|---|
| Fact Accuracy | 10.8% | 41.2% | 70.1% |
| Scalability (new tables) | No change | Manual synonym entry | Just add DDL to prompt |
| Aggregation Queries | Impossible | Very limited | SUM/AVG/MAX supported |
| Maintenance Cost | None | High | Low |
| Response Time | ~1,000ms | ~1,200ms | ~1,926ms |
Response time is longest at 1,926ms due to two LLM calls, but this scales directly with GPU performance. Given the accuracy improvement (10.8% → 70.1%) and dramatic reduction in maintenance cost, it's a worthwhile trade-off.
The AI chatbot's data retrieval pipeline now has three complete layers: RAG for unstructured document search, hybrid search for combined DB+RAG retrieval, and Text2SQL for direct structured data queries. The LLM router selects the optimal path based on query intent, maximizing each approach's strengths.
What's next: Unifying RAG + Text2SQL + hybrid search under a single router that automatically selects the optimal retrieval path for each query type. Target: router accuracy from 90% → 95%+, powered by few-shot examples and a feedback loop.