DB + RAG Hybrid Search — How We Improved LLM Fact Accuracy from 17.5% to 92.5%
RAG alone isn't enough. When users ask "How much is an Americano?", vector search retrieves entire menu documents while a simple SQL query returns the exact price instantly. By running PostgreSQL and Qdrant searches in parallel and assembling results through a 4-case routing logic, we boosted fact accuracy from 17.5% to 92.5% — with only 18ms of additional latency.
17.5%→92.5%
Fact Accuracy Gain
18.1ms
Search Overhead
95%
Schema Match Rate
4-Case
Routing Logic
Why RAG Alone Falls Short on Numerical Queries
Retrieval-Augmented Generation is excellent at pulling contextual knowledge from unstructured documents. It dramatically reduces hallucination compared to a bare LLM — in our testing, RAG pushed fact accuracy from 17.5% (LLM-only) to 92.5%. But RAG has a structural weakness: it treats every question as a document retrieval problem.
When a user asks "What's the price of an Americano?", RAG retrieves the full menu document chunk, feeds it to the LLM, and hopes the model extracts the right number. It usually works, but not always — especially when chunk boundaries split relevant data, or when the answer requires precise numerical matching across multiple fields.
A database query, on the other hand, resolves WHERE name = 'Americano' to an exact $4.50, with zero ambiguity. The insight is straightforward: structured data belongs in a database, unstructured context belongs in RAG, and you should search both simultaneously.
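To make the contrast concrete, here is the structured path in miniature. This is an illustrative sketch, not the production code: it uses an in-memory SQLite table as a stand-in for PostgreSQL, and the schema (a menu table with name and price columns) is an assumption.

```python
import sqlite3

# Stand-in for the real PostgreSQL menu table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE menu (name TEXT, price REAL)")
conn.execute("INSERT INTO menu VALUES ('Americano', 4.50), ('Latte', 5.50)")

# The structured path answers a price question with an exact lookup:
# no chunking, no extraction, no ambiguity.
price = conn.execute(
    "SELECT price FROM menu WHERE name = ?", ("Americano",)
).fetchone()[0]
print(price)
```

The vector path would instead retrieve the whole menu chunk and leave the number extraction to the LLM, which is exactly the failure mode described above.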
Accuracy Breakdown: Three Modes Compared
We tested 20 real customer questions across five categories — pricing, hours, location, policies, and events — against three retrieval modes:
| Category | Questions | LLM Only | RAG Only | DB + RAG |
|---|---|---|---|---|
| Pricing | 5 | 40% | 100% | 80% |
| Hours | 2 | 0% | 100% | 100% |
| Location | 2 | 0% | 100% | 100% |
| Policies / Facilities | 9 | 11% | 89% | 89% |
| Events / Promotions | 2 | 25% | 75% | 75% |
| Overall (20 questions) | 20 | 17.5% | 92.5% | 87.5% |
An interesting observation: the DB+RAG hybrid measured 87.5%, lower than RAG-only at 92.5%. This isn't because its answers were worse. The actual DB+RAG responses were more accurate, but the LLM's internal reasoning (thinking tokens) sometimes reformatted numbers (e.g., "$4,500" became "$4500"), causing the string-match evaluation to fail. Disabling thinking mode or adding format normalization in post-processing resolves the discrepancy.
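One way to implement that normalization fix is to strip thousands separators from both the model output and the reference answer before comparing. A minimal sketch:

```python
import re

def normalize_numbers(text: str) -> str:
    """Strip thousands separators so '$4,500' and '$4500' compare equal."""
    return re.sub(r"(?<=\d),(?=\d{3})", "", text)

# Both renderings of the same price now match under string comparison.
print(normalize_numbers("$4,500") == normalize_numbers("$4500"))
```

Applying this to both sides of the string-match evaluation removes the formatting penalty without touching the model's answers.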
The Parallel Search Architecture
The core idea is simple: when a user query arrives, fire off both a database search and a vector search simultaneously. Don't wait for one to finish before starting the other. Merge the results, classify them into one of four cases, and construct a context-enriched prompt for the LLM.
Pipeline Flow
Step 1 — Query Intake: The user's natural language question enters the system.
Step 2a — DB Path (parallel): A schema matching engine maps the query to a database table using a synonym dictionary. It generates a templated SQL query, executes it against PostgreSQL, and wraps the result in a [FACT] tag.
Step 2b — RAG Path (parallel): The query is embedded using BGE-M3, searched against Qdrant for the top 3 most relevant document chunks, and wrapped in a [CONTEXT] tag.
Step 3 — Assembly: Both results are merged and classified into one of four cases. The assembled context is injected into the LLM system prompt for answer generation.
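Steps 2a and 2b can be sketched as a single asyncio fan-out. The two inner coroutines below are stand-ins (the real implementations call PostgreSQL and the Qdrant client, respectively), and the sleeps stand in for I/O; only the concurrency shape is the point.

```python
import asyncio

async def db_search(query: str):
    # Stand-in for schema matching + templated SQL against PostgreSQL.
    await asyncio.sleep(0.010)
    return "[FACT] Americano: $4.50"

async def rag_search(query: str):
    # Stand-in for BGE-M3 embedding + Qdrant top-3 retrieval.
    await asyncio.sleep(0.015)
    return "[CONTEXT] Bring your own tumbler for a $0.30 discount."

async def hybrid_search(query: str):
    # Fire both paths at once; total wait is roughly the slower path,
    # not the sum of both.
    return await asyncio.gather(db_search(query), rag_search(query))

fact, context = asyncio.run(hybrid_search("How much is an Americano?"))
print(fact)
print(context)
```

Because the two paths never wait on each other, the search stage costs only as much as its slowest branch, which is why the overhead measures in milliseconds.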
Latency Impact
| Stage | Avg. Time | % of Total |
|---|---|---|
| Search (DB + RAG parallel) | 18.1ms | 0.9% |
| LLM Generation | 2,051ms | 99.1% |
| Total Response | 2,069ms | 100% |
The bottleneck is entirely in LLM generation (99.1%). The parallel search adds just 18.1ms — effectively free. This is the key advantage of running both searches concurrently rather than sequentially: you get the benefits of both data sources with negligible performance cost.
The 4-Case Routing Logic
Not every query produces results from both the database and the vector store. The 4-case routing logic handles all combinations gracefully, ensuring the LLM receives appropriately framed context:
Case 1: Both DB and RAG Return Results (75% of queries)
This is the ideal scenario. The LLM receives precise numerical data from the DB tagged as [FACT], supplemented by contextual information from RAG tagged as [CONTEXT]. Example: "An Americano is $4.50. We also offer a $0.30 discount when you bring your own tumbler."
Case 2: DB Only (20% of queries)
The LLM delivers a concise, fact-only response. Example: "Americano $4.50, Latte $5.50." No filler, no guessing.
Case 3: RAG Only (0% in our test set)
The LLM uses contextual information but adds a disclaimer about potential inaccuracies. This case didn't appear in our 20-question test but is critical for production safety — it prevents the model from presenting unverified document snippets as established facts.
Case 4: Neither (5% of queries)
The LLM honestly responds with "I don't have that information — let me check and get back to you." In our test, one query triggered this case: "What time do you close today?" — the keywords "today" and "what time" weren't registered in the synonym dictionary, so the schema matcher couldn't generate a DB query.
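The four cases above reduce to a small assembly function. This is an illustrative sketch, not the production code; the tag strings mirror the [FACT]/[CONTEXT] convention described earlier, and the fallback instructions are assumptions about how the system prompt is phrased.

```python
def assemble_context(fact, context):
    """Classify search results into the 4 cases and build LLM context."""
    if fact and context:
        # Case 1: precise DB fact plus supporting RAG context.
        return f"{fact}\n{context}"
    if fact:
        # Case 2: fact-only; answer concisely, no filler.
        return fact
    if context:
        # Case 3: context-only; answer with a disclaimer about accuracy.
        return f"{context}\n[NOTE] Unverified context; hedge the answer."
    # Case 4: neither source returned data; admit ignorance.
    return "[NONE] No data found; say you don't have that information."

print(assemble_context("[FACT] Americano: $4.50", None))
```

The assembled string is then injected into the system prompt, so the model's framing (confident, hedged, or an honest "I don't know") follows directly from which sources returned data.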
Schema Matching: The Bridge Between Natural Language and SQL
The schema matching engine is what makes the DB path work without sending every query to an LLM for SQL generation. It uses a synonym dictionary to map natural language keywords to database tables and columns.
For example, "How much is the Americano?" triggers the synonym entry for "price" → maps to the menu table → generates SELECT price FROM menu WHERE name = 'Americano'. This approach achieved a 95% success rate — 19 out of 20 test queries were correctly routed.
The one failure was the "What time do you close today?" query, whose keywords weren't in the dictionary. This limitation is inherent to dictionary-based matching: it can only handle patterns you've anticipated. The next evolution, covered in our follow-up article on Text2SQL, has the LLM generate SQL directly, which handles novel query patterns but introduces its own accuracy trade-offs.
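A minimal sketch of dictionary-based matching makes both the 95% success rate and the one failure easy to see. The table and column names here are assumptions, not the production schema, and a real dictionary would be far larger:

```python
# Keyword -> (table, column). Illustrative entries only.
SYNONYMS = {
    "price": ("menu", "price"),
    "cost": ("menu", "price"),
    "how much": ("menu", "price"),
    "hours": ("store_info", "hours"),
    "open": ("store_info", "hours"),
}

def match_schema(query: str):
    """Map a natural-language query to a templated SQL skeleton."""
    q = query.lower()
    for keyword, (table, column) in SYNONYMS.items():
        if keyword in q:
            # Parameters (e.g., the item name) are bound in a later step.
            return f"SELECT {column} FROM {table}"
    return None  # No keyword matched: falls through to Case 4.

print(match_schema("How much is the Americano?"))
print(match_schema("What time do you close today?"))
```

The second query returns None: "close" and "today" appear in no entry, which is exactly the failure mode observed in the test set.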
AI-Powered Data Structuring: From Documents to Database
A hybrid search system requires structured data in the database. Manually entering menu items, prices, hours, and policies is tedious. We tested whether an LLM could automatically extract structured data from unstructured markdown documents.
Menu Extraction Results
From a markdown menu document, the LLM extracted all 19 menu items. Price accuracy was 100%, and category accuracy was 100%. Name recall was 94.7% (18 of 19); the single miss was a typo where a Korean character was misrecognized (a common issue with visually similar Korean syllables). Total extraction time was 5.92 seconds.
Business Information Extraction
For business metadata (phone number, WiFi availability, hours, address), the LLM matched 5 out of 7 fields correctly (71.4%). The two mismatches were an address normalization issue ("Seoul" vs "Seoul Metropolitan City") and a type comparison problem — both easily fixable with post-processing.
The practical takeaway: AI auto-structuring is reliable enough for initial data entry assistance. Prices and categories come through perfectly. Names and addresses need a human review pass, but the manual effort drops by roughly 80%.
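A rough sketch of the extraction-plus-review flow, assuming the model is prompted to emit JSON. The prompt wording and the validation rules are illustrative, and the response below is canned so the post-processing can be shown end-to-end; in production it comes from the model.

```python
import json

# Assumed prompt shape: ask the model for a strict JSON array.
EXTRACTION_PROMPT = (
    "Extract every menu item from the document below as a JSON array: "
    '[{"name": ..., "category": ..., "price": ...}]'
)

# Canned stand-in for the LLM's response.
llm_response = (
    '[{"name": "Americano", "category": "coffee", "price": 4.5},'
    ' {"name": "Latte", "category": "coffee", "price": -1}]'
)

items = json.loads(llm_response)
# Simple sanity check before DB insertion; anything that fails
# goes to the human review pass instead.
valid = [
    it for it in items
    if isinstance(it.get("price"), (int, float)) and it["price"] > 0
]
print(f"{len(valid)}/{len(items)} items passed validation")
```

Automated checks like this catch obviously broken rows, while the subtler errors (misread names, address variants) are the ones left for human review.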
Conclusion: Architecture Lessons
Building a production-grade AI chatbot that handles factual queries requires more than just RAG. Here are the four key takeaways from this implementation:
- DB handles numbers, RAG handles context. Pricing, inventory, and reservations belong in a database. Policies, descriptions, and FAQs belong in a vector store. Separating concerns gives you both precision and richness.
- Parallel search is effectively free. Running DB and RAG searches concurrently adds only 18ms to a 2-second response cycle — 0.9% overhead. There's no reason to choose one over the other when you can have both.
- 4-case routing builds a safety net. By classifying results based on which sources returned data, you ensure the LLM never fabricates answers when it has real data available, and honestly admits ignorance when it doesn't.
- Dictionary-based matching has a ceiling. At 95% accuracy, synonym dictionaries work well for predictable query patterns. But complex or novel queries require the LLM to generate SQL directly — the Text2SQL approach we explore in the next article in this series.
The entire pipeline — parallel search, 4-case routing, result assembly, and LLM generation — completes in under 2.1 seconds with a fact accuracy of 87.5–92.5%. For customer-facing chatbots where wrong prices or hours mean real business consequences, this architecture provides a solid foundation that a bare LLM (17.5% accuracy) simply cannot match.