Ask an LLM "How much does an americano cost at the cafe?" and it confidently answers $4.50 — a complete hallucination. RAG retrieves evidence from documents, but when your domain spans 13 tables with 386 rows of structured data, direct database queries are far more accurate. That's why we adopted Text2SQL: an LLM router classifies query intent, Text2SQL generates SQL against PostgreSQL, and a second LLM call produces the natural language response. Across 40 real-world test questions, accuracy jumped from 10.8% to 70.1%.
10.8%→70.1%
Fact Accuracy Gain
100%
SQL Syntax Validity
90%
Router Accuracy
1,926ms
Avg Response Time
The Limits of Synonym Dictionaries
In our previous hybrid search system, we used synonym dictionaries to map user queries to database schemas — "price" → price, "menu" → menu_items. This worked fine with 3–4 tables. But when we scaled to 13 tables, the approach collapsed.
Why Synonym Dictionaries Break Down
Testing the same 40 questions with the synonym dictionary approach yielded just 41.2% accuracy. Simple lookups worked, but anything involving JOINs or aggregations failed outright. As the table count grows, synonym management becomes unrealistic — so we switched to Text2SQL, where the LLM reads the schema directly and writes SQL.
LLM Router + Text2SQL Architecture
The pipeline has four stages. It calls the LLM twice, but SQL execution and RAG retrieval take just milliseconds, keeping total response time under 2 seconds.
Processing Flow
Natural language query → Intent classification (DB query / general chat / RAG)
Table schema + question → SQL query generation (LLM)
Run query on PostgreSQL → Return result rows
SQL results + original question → Natural language response (LLM)
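The four stages above can be sketched as a single pipeline function. This is a minimal illustration, not our production code; `call_llm` and `run_sql` are hypothetical stand-ins for the real LLM client and PostgreSQL driver.

```python
# Minimal sketch of the four-stage pipeline.
# `call_llm(prompt) -> str` and `run_sql(sql) -> rows` are hypothetical
# stand-ins for the real LLM client and database driver.

def route(question: str, call_llm) -> str:
    """Stage 1: classify intent as 'db', 'chat', or 'rag'."""
    prompt = (
        "Classify the user question as one of: db, chat, rag.\n"
        f"Question: {question}\nAnswer with a single word."
    )
    return call_llm(prompt).strip().lower()

def answer(question: str, schema_ddl: str, call_llm, run_sql) -> str:
    intent = route(question, call_llm)
    if intent != "db":
        return call_llm(question)  # chat / RAG paths elided here
    # Stage 2: table schema + question -> SQL (1st LLM call)
    sql = call_llm(f"{schema_ddl}\n\nWrite one SQL query for: {question}")
    # Stage 3: run the query on PostgreSQL
    rows = run_sql(sql)
    # Stage 4: SQL results + original question -> answer (2nd LLM call)
    return call_llm(f"Question: {question}\nSQL result: {rows}\nAnswer:")
```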
The LLM router is the critical piece. It decides whether a question needs a database query, a general conversation response, or RAG retrieval. If the router misclassifies, no amount of SQL quality can save the answer. Router accuracy sets the ceiling for the entire pipeline.
Schema prompt design: We inject the DDL (CREATE TABLE statements) for all 13 tables directly into the LLM prompt — table names, column names, data types, and FK relationships. This adds about 2,000 tokens. The LLM reads this schema to determine which tables to JOIN and which columns to filter. The synonym dictionary required humans to define mappings; Text2SQL lets the LLM reason over the schema.
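Assembling the schema prompt is mostly string concatenation. A sketch of the idea follows, with illustrative two-table DDL rather than the real 13-table hospital schema:

```python
# Sketch of building the schema prompt from DDL. The tables here are
# illustrative, not the actual 13-table hospital schema.

TABLE_DDL = {
    "departments": """CREATE TABLE departments (
    department_id SERIAL PRIMARY KEY,
    name          TEXT NOT NULL
);""",
    "doctors": """CREATE TABLE doctors (
    doctor_id     SERIAL PRIMARY KEY,
    name          TEXT NOT NULL,
    department_id INT REFERENCES departments(department_id)
);""",
}

def build_schema_prompt(ddl: dict) -> str:
    parts = ["You can query these PostgreSQL tables:"]
    parts += ddl.values()
    parts.append(
        "Use only the tables and columns above. "
        "Join via the foreign keys shown in REFERENCES clauses."
    )
    return "\n\n".join(parts)
```

The LLM infers JOIN paths from the `REFERENCES` clauses, so keeping foreign-key constraints in the DDL matters more than prose descriptions of the relationships.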
40-Question Real-World Test
The test domain is a general hospital with patients, doctors, departments, medical records, prescriptions, billing, rooms, and appointments — 13 tables, 386 rows of data. Questions were designed across 4 difficulty tiers.
| Tier | Difficulty | Example Question | Questions | Accuracy |
|---|---|---|---|---|
| Tier 1 | Simple Lookup | List all doctors in internal medicine | 12 | 75% |
| Tier 2 | JOIN | Show Kim's prescription history | 10 | 88% |
| Tier 3 | Aggregation | Average treatment cost by department? | 10 | 64% |
| Tier 4 | Complex | Most prescribed drug in March? | 8 | 52% |
| Overall Average | — | — | 40 | 70.1% |
The surprising finding: Tier 2 (JOIN) scored higher than Tier 1 (simple lookup) at 88%. JOIN queries like "Kim's prescriptions" have explicit table relationships, so the LLM identifies foreign keys accurately. Simple lookups, however, sometimes require inferring status values like status = 'admitted' from context — which is harder than following a clear FK path.
SQL Generation Quality Metrics
100%
SQL Generation Rate
All 40 questions produced SQL queries
100%
Syntax Validity
Every generated query was syntactically valid
95%
Execution Success
38/40 queries ran without errors
70.1%
Answer Accuracy
Results matched the correct answer
The 100% SQL generation and syntax validity rates come from injecting actual DDL into the prompt. The LLM sees the real table structure, so it rarely references non-existent columns. The accuracy gap (100% → 70.1%) represents cases where the SQL was syntactically correct but returned semantically wrong results.
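One cheap way to catch invalid SQL before execution is to ask the database to plan the query without running it. The sketch below uses stdlib `sqlite3` so it is self-contained; against PostgreSQL the equivalent would be running `EXPLAIN <query>` via a driver like psycopg.

```python
import sqlite3

# Sketch: validate generated SQL by asking the database to plan it
# without executing it. sqlite3 is used here only to keep the demo
# self-contained; on PostgreSQL you would run `EXPLAIN <query>`.

def is_valid_sql(conn: sqlite3.Connection, sql: str) -> bool:
    try:
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doctors (id INTEGER, name TEXT, dept TEXT)")
print(is_valid_sql(conn, "SELECT name FROM doctors WHERE dept = 'IM'"))  # valid
print(is_valid_sql(conn, "SELECT nme FROM doctors"))  # invalid: bad column
```

Note this only catches syntax and reference errors; it cannot detect the semantically-wrong-but-valid queries that account for the 100% → 70.1% gap.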
Three-Way Accuracy Comparison
We ran the same 40 questions through three approaches. The contrast with LLM-only responses is stark.
- **LLM Only** — no database access at all; every number, name, and figure is hallucinated.
- **Synonym Dictionary** — keyword mapping handles simple lookups but collapses on JOINs and aggregations.
- **Text2SQL** — the LLM reads the schema and writes SQL directly, handling complex multi-table queries.
With LLM only, every patient name, treatment cost, and prescription is fabricated. The 10.8% accuracy comes from yes/no questions that happened to be correct by chance. Synonym dictionaries work for simple lookups but hit a wall as table count increases and mapping maintenance becomes unsustainable.
Why 70.1% matters: In our hybrid search benchmark, RAG achieved 92.5% fact accuracy — but only for information that exists in documents. Text2SQL can reach all data in every table via SQL. Aggregations (SUM, AVG, MAX) and conditional filtering are impossible with RAG but natural with SQL. In production, the best approach combines RAG and Text2SQL.
Latency Analysis
Total response time averages 1,926ms, well under our 2,500ms target. The bottleneck is clear — it's the LLM calls.
| Stage | Duration | % of Total | Notes |
|---|---|---|---|
| LLM Routing + SQL Gen | ~900ms | 46.7% | 1st LLM call |
| SQL Execution (PostgreSQL) | ~7ms | 0.4% | Indexed, 386 rows |
| RAG Retrieval (parallel) | ~18ms | 0.9% | Runs alongside DB query |
| Answer Generation (LLM) | ~1,000ms | 51.9% | 2nd LLM call |
| Total | ~1,926ms | 100% | Under 2,500ms target |
SQL execution (7ms) and RAG retrieval (18ms) combined take just 25ms. 98.7% of total response time is LLM inference. The only meaningful optimization path is faster LLM inference — GPU upgrades or AWQ quantization to increase tok/s will reduce response time proportionally.
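The "RAG retrieval (parallel)" row works because both retrieval paths are awaited concurrently while the pipeline waits on nothing else. A sketch of that concurrency pattern, with sleeps standing in for the measured latencies:

```python
import asyncio

# Sketch of running SQL execution and RAG retrieval concurrently.
# The sleeps simulate the measured latencies from the table above;
# both stubs are illustrative, not the real query/retrieval code.

async def run_sql(sql: str):
    await asyncio.sleep(0.007)   # ~7ms on indexed PostgreSQL tables
    return [("Internal Medicine", 12)]

async def rag_search(question: str):
    await asyncio.sleep(0.018)   # ~18ms vector retrieval
    return ["doc snippet"]

async def retrieve(sql: str, question: str):
    # Concurrent: finishes in ~18ms instead of ~25ms sequentially
    return await asyncio.gather(run_sql(sql), rag_search(question))

rows, docs = asyncio.run(retrieve("SELECT ...", "average cost by department?"))
```

At these magnitudes the saving is negligible next to ~1,900ms of LLM inference, which is exactly the point: retrieval is not worth optimizing further.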
LLM Router Performance
90%
Routing Accuracy
36/40 correctly classified
4
Misclassifications
DB queries sent to general chat
0
Reverse Errors
No general queries sent to DB
All 4 routing errors were in the same direction: DB queries misclassified as general conversation. Zero reverse errors means no unnecessary SQL execution ever occurred. The router learned a conservative strategy — when unsure, default to general chat rather than risk a pointless database query.
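That conservative bias can also be enforced in code, not just learned: any label the router returns outside the known set falls back to general chat, never to a database query. A minimal sketch, where `classify` is a hypothetical stand-in for the first LLM call:

```python
# Sketch of the conservative routing policy: unknown or malformed
# router output defaults to 'chat', never to a database query.
# `classify(question) -> str` is a hypothetical LLM-call stand-in.

VALID_ROUTES = {"db", "chat", "rag"}

def route_conservatively(question: str, classify) -> str:
    label = classify(question).strip().lower()
    # When unsure, a pointless chat reply is cheaper than a pointless query
    return label if label in VALID_ROUTES else "chat"
```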
Limitations & Next Steps
While 70.1% is a major improvement over the synonym dictionary's 41.2%, it's not yet production-ready on its own. Analyzing the roughly 30% of failed answers reveals clear patterns.
- **Relative dates** — Symptom: fails to convert 'last 3 months' into a PostgreSQL date expression (e.g. `CURRENT_DATE - INTERVAL '3 months'`). Fix: inject the current date into the prompt and add few-shot examples of date functions.
- **Zero-row results** — Symptom: SQL is correct but returns 0 rows; the LLM hallucinates instead of saying 'no data found'. Fix: explicitly inject 'no data found' context when SQL returns 0 rows.
- **Currency formatting** — Symptom: unit conversion errors when formatting currency values in natural language. Fix: pass raw numbers from SQL and handle formatting in a post-processing layer.
- **Status inference** — Symptom: 'currently admitted' has no corresponding status value in the schema, so inference fails. Fix: add enum values and business rules as comments in the schema prompt.
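The date and enum fixes amount to appending a few annotated lines to the schema prompt. A sketch, with an illustrative column name and enum values (not necessarily the real schema's):

```python
from datetime import date

# Sketch of the prompt-engineering fixes: inject today's date and
# annotate enum columns with allowed values and business rules.
# The column name and enum values below are illustrative assumptions.

ENUM_NOTES = {
    "patients.status": "one of 'admitted', 'discharged', 'outpatient'; "
                       "'currently admitted' means status = 'admitted'",
}

def augment_schema_prompt(schema_ddl: str) -> str:
    lines = [f"-- Today's date: {date.today().isoformat()}", schema_ddl]
    for col, note in ENUM_NOTES.items():
        lines.append(f"-- {col}: {note}")
    lines.append("-- For relative dates like 'last 3 months', use "
                 "CURRENT_DATE - INTERVAL '3 months'.")
    return "\n".join(lines)
```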
Every fix listed above is achievable through prompt engineering alone — no model changes or architecture rewrites needed. Adding date function examples, enum value lists, and zero-row handling rules to the schema prompt should push accuracy above 80%.
Hallucination Defense in Practice
Text2SQL's biggest advantage: if the data doesn't exist, SQL returns empty results. An LLM alone would fabricate patient names and treatment costs. Text2SQL returns 0 rows. The challenge is how the LLM interprets those empty results.
Cases where the LLM hallucinates on zero-row results are addressed by adding explicit guardrails to the prompt.
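One such guardrail is to make the zero-row case explicit in the answer prompt instead of handing the LLM an empty list it might paper over. A minimal sketch:

```python
# Sketch of a zero-row guardrail: when the query returns no rows, the
# answer prompt states that explicitly and forbids invented values.

def build_answer_prompt(question: str, rows: list) -> str:
    if not rows:
        return (f"Question: {question}\n"
                "The database query returned NO matching rows. "
                "Reply that no data was found. Do not invent values.")
    return (f"Question: {question}\n"
            f"SQL result: {rows}\n"
            "Answer using only this data.")
```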
Conclusion
The verdict on switching from synonym dictionaries to Text2SQL is clear. Once you exceed 5 tables, synonym maintenance becomes unsustainable. Having the LLM read the schema directly and write SQL wins on both accuracy and scalability.
Final Comparison
| Metric | LLM Only | Synonym Dict | Text2SQL |
|---|---|---|---|
| Fact Accuracy | 10.8% | 41.2% | 70.1% |
| Scalability (new tables) | No change | Manual synonym entry | Just add DDL to prompt |
| Aggregation Queries | Impossible | Very limited | SUM/AVG/MAX supported |
| Maintenance Cost | None | High | Low |
| Response Time | ~1,000ms | ~1,200ms | ~1,926ms |
Response time is longest at 1,926ms due to two LLM calls, but this scales directly with GPU performance. Given the accuracy improvement (10.8% → 70.1%) and dramatic reduction in maintenance cost, it's a worthwhile trade-off.
The AI chatbot's data retrieval pipeline now has three complete layers: RAG for unstructured document search, hybrid search for combined DB+RAG retrieval, and Text2SQL for direct structured data queries. The LLM router selects the optimal path based on query intent, maximizing each approach's strengths.
What's next: Unifying RAG + Text2SQL + hybrid search under a single router that automatically selects the optimal retrieval path for each query type. Target: router accuracy from 90% → 95%+, powered by few-shot examples and a feedback loop.