DB + RAG Hybrid Search — How We Improved LLM Fact Accuracy from 17.5% to 92.5%
RAG alone isn't enough. When users ask "How much is an Americano?", vector search retrieves entire menu documents while a simple SQL query returns the exact price instantly. By running PostgreSQL and Qdrant searches in parallel and assembling results through a 4-case routing logic, we boosted fact accuracy from 17.5% to 92.5% — with only 18ms of additional latency.
17.5%→92.5%
Fact Accuracy Gain
18.1ms
Search Overhead
95%
Schema Match Rate
4-Case
Routing Logic
Why RAG Alone Falls Short on Numerical Queries
Retrieval-Augmented Generation is excellent at pulling contextual knowledge from unstructured documents. It dramatically reduces hallucination compared to a bare LLM — in our testing, RAG pushed fact accuracy from 17.5% (LLM-only) to 92.5%. But RAG has a structural weakness: it treats every question as a document retrieval problem.
When a user asks "What's the price of an Americano?", RAG retrieves the full menu document chunk, feeds it to the LLM, and hopes the model extracts the right number. It usually works, but not always — especially when chunk boundaries split relevant data, or when the answer requires precise numerical matching across multiple fields.
A database query, on the other hand, resolves WHERE name = 'Americano' to an exact $4.50, with zero ambiguity. The insight is straightforward: structured data belongs in a database, unstructured context belongs in RAG, and you should search both simultaneously.
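To make the contrast concrete, here is the structured path in miniature. This is an illustrative sketch, not the production code: it uses an in-memory SQLite table as a stand-in for PostgreSQL, and the schema (a menu table with name and price columns) is an assumption.

```python
import sqlite3

# Stand-in for the real PostgreSQL menu table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE menu (name TEXT, price REAL)")
conn.execute("INSERT INTO menu VALUES ('Americano', 4.50), ('Latte', 5.50)")

# The structured path answers a price question with an exact lookup:
# no chunking, no extraction, no ambiguity.
price = conn.execute(
    "SELECT price FROM menu WHERE name = ?", ("Americano",)
).fetchone()[0]
print(price)
```

The vector path would instead retrieve the whole menu chunk and leave the number extraction to the LLM, which is exactly the failure mode described above.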
Accuracy Breakdown: Three Modes Compared
We tested 20 real customer questions across five categories — pricing, hours, location, policies, and events — against three retrieval modes:
| Category | Questions | LLM Only | RAG Only | DB + RAG |
|---|---|---|---|---|
| Pricing | 5 | 40% | 100% | 80% |
| Hours | 2 | 0% | 100% | 100% |
| Location | 2 | 0% | 100% | 100% |
| Policies / Facilities | 9 | 11% | 89% | 89% |
| Events / Promotions | 2 | 25% | 75% | 75% |
| Overall (20 questions) | 20 | 17.5% | 92.5% | 87.5% |
An interesting observation: the DB+RAG hybrid measured 87.5%, lower than RAG-only at 92.5%. This isn't because its answers were worse. The actual DB+RAG responses were more accurate, but the LLM's internal reasoning (thinking tokens) sometimes reformatted numbers (e.g., "$4,500" became "$4500"), causing the string-match evaluation to fail. Disabling thinking mode or adding format normalization in post-processing resolves the discrepancy.
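One way to implement that normalization fix is to strip thousands separators from both the model output and the reference answer before comparing. A minimal sketch:

```python
import re

def normalize_numbers(text: str) -> str:
    """Strip thousands separators so '$4,500' and '$4500' compare equal."""
    return re.sub(r"(?<=\d),(?=\d{3})", "", text)

# Both renderings of the same price now match under string comparison.
print(normalize_numbers("$4,500") == normalize_numbers("$4500"))
```

Applying this to both sides of the string-match evaluation removes the formatting penalty without touching the model's answers.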
The Parallel Search Architecture
The core idea is simple: when a user query arrives, fire off both a database search and a vector search simultaneously. Don't wait for one to finish before starting the other. Merge the results, classify them into one of four cases, and construct a context-enriched prompt for the LLM.
Pipeline Flow
Step 1 — Query Intake: The user's natural language question enters the system.
Step 2a — DB Path (parallel): A schema matching engine maps the query to a database table using a synonym dictionary. It generates a templated SQL query, executes it against PostgreSQL, and wraps the result in a [FACT] tag.
Step 2b — RAG Path (parallel): The query is embedded using BGE-M3, searched against Qdrant for the top 3 most relevant document chunks, and wrapped in a [CONTEXT] tag.
Step 3 — Assembly: Both results are merged and classified into one of four cases. The assembled context is injected into the LLM system prompt for answer generation.
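Steps 2a and 2b can be sketched as a single asyncio fan-out. The two inner coroutines below are stand-ins (the real implementations call PostgreSQL and the Qdrant client, respectively), and the sleeps stand in for I/O; only the concurrency shape is the point.

```python
import asyncio

async def db_search(query: str):
    # Stand-in for schema matching + templated SQL against PostgreSQL.
    await asyncio.sleep(0.010)
    return "[FACT] Americano: $4.50"

async def rag_search(query: str):
    # Stand-in for BGE-M3 embedding + Qdrant top-3 retrieval.
    await asyncio.sleep(0.015)
    return "[CONTEXT] Bring your own tumbler for a $0.30 discount."

async def hybrid_search(query: str):
    # Fire both paths at once; total wait is roughly the slower path,
    # not the sum of both.
    return await asyncio.gather(db_search(query), rag_search(query))

fact, context = asyncio.run(hybrid_search("How much is an Americano?"))
print(fact)
print(context)
```

Because the two paths never wait on each other, the search stage costs only as much as its slowest branch, which is why the overhead measures in milliseconds.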
Latency Impact
| Stage | Avg. Time | % of Total |
|---|---|---|
| Search (DB + RAG parallel) | 18.1ms | 0.9% |
| LLM Generation | 2,051ms | 99.1% |
| Total Response | 2,069ms | 100% |
The bottleneck is entirely in LLM generation (99.1%). The parallel search adds just 18.1ms — effectively free. This is the key advantage of running both searches concurrently rather than sequentially: you get the benefits of both data sources with negligible performance cost.
The 4-Case Routing Logic
Not every query produces results from both the database and the vector store. The 4-case routing logic handles all combinations gracefully, ensuring the LLM receives appropriately framed context:
Case 1: Both DB and RAG Return Results (75% of queries)
This is the ideal scenario. The LLM receives precise numerical data from the DB tagged as [FACT], supplemented by contextual information from RAG tagged as [CONTEXT]. Example: "An Americano is $4.50. We also offer a $0.30 discount when you bring your own tumbler."
Case 2: DB Only (20% of queries)
The LLM delivers a concise, fact-only response. Example: "Americano $4.50, Latte $5.50." No filler, no guessing.
Case 3: RAG Only (0% in our test set)
The LLM uses contextual information but adds a disclaimer about potential inaccuracies. This case didn't appear in our 20-question test but is critical for production safety — it prevents the model from presenting unverified document snippets as established facts.
Case 4: Neither (5% of queries)
The LLM honestly responds with "I don't have that information — let me check and get back to you." In our test, one query triggered this case: "What time do you close today?" — the keywords "today" and "what time" weren't registered in the synonym dictionary, so the schema matcher couldn't generate a DB query.
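The four cases above reduce to a small assembly function. This is an illustrative sketch, not the production code; the tag strings mirror the [FACT]/[CONTEXT] convention described earlier, and the fallback instructions are assumptions about how the system prompt is phrased.

```python
def assemble_context(fact, context):
    """Classify search results into the 4 cases and build LLM context."""
    if fact and context:
        # Case 1: precise DB fact plus supporting RAG context.
        return f"{fact}\n{context}"
    if fact:
        # Case 2: fact-only; answer concisely, no filler.
        return fact
    if context:
        # Case 3: context-only; answer with a disclaimer about accuracy.
        return f"{context}\n[NOTE] Unverified context; hedge the answer."
    # Case 4: neither source returned data; admit ignorance.
    return "[NONE] No data found; say you don't have that information."

print(assemble_context("[FACT] Americano: $4.50", None))
```

The assembled string is then injected into the system prompt, so the model's framing (confident, hedged, or an honest "I don't know") follows directly from which sources returned data.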
Schema Matching: The Bridge Between Natural Language and SQL
The schema matching engine is what makes the DB path work without sending every query to an LLM for SQL generation. It uses a synonym dictionary to map natural language keywords to database tables and columns.
For example, "How much is the Americano?" triggers the synonym entry for "price" → maps to the menu table → generates SELECT price FROM menu WHERE name = 'Americano'. This approach achieved a 95% success rate — 19 out of 20 test queries were correctly routed.
The one failure was the "What time do you close today?" query, whose keywords weren't in the dictionary. This limitation is inherent to dictionary-based matching: it can only handle patterns you've anticipated. The next evolution, covered in our follow-up article on Text2SQL, has the LLM generate SQL directly, which handles novel query patterns but introduces its own accuracy trade-offs.
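A minimal sketch of dictionary-based matching makes both the 95% success rate and the one failure easy to see. The table and column names here are assumptions, not the production schema, and a real dictionary would be far larger:

```python
# Keyword -> (table, column). Illustrative entries only.
SYNONYMS = {
    "price": ("menu", "price"),
    "cost": ("menu", "price"),
    "how much": ("menu", "price"),
    "hours": ("store_info", "hours"),
    "open": ("store_info", "hours"),
}

def match_schema(query: str):
    """Map a natural-language query to a templated SQL skeleton."""
    q = query.lower()
    for keyword, (table, column) in SYNONYMS.items():
        if keyword in q:
            # Parameters (e.g., the item name) are bound in a later step.
            return f"SELECT {column} FROM {table}"
    return None  # No keyword matched: falls through to Case 4.

print(match_schema("How much is the Americano?"))
print(match_schema("What time do you close today?"))
```

The second query returns None: "close" and "today" appear in no entry, which is exactly the failure mode observed in the test set.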
AI-Powered Data Structuring: From Documents to Database
A hybrid search system requires structured data in the database. Manually entering menu items, prices, hours, and policies is tedious. We tested whether an LLM could automatically extract structured data from unstructured markdown documents.
Menu Extraction Results
From a markdown menu document, the LLM extracted all 19 menu items. Price accuracy was 100%, and category accuracy was 100%. Name recall was 94.7% (18 of 19); the single miss was a typo where a Korean character was misrecognized (a common issue with visually similar Korean syllables). Total extraction time was 5.92 seconds.
Business Information Extraction
For business metadata (phone number, WiFi availability, hours, address), the LLM matched 5 out of 7 fields correctly (71.4%). The two mismatches were an address normalization issue ("Seoul" vs "Seoul Metropolitan City") and a type comparison problem — both easily fixable with post-processing.
The practical takeaway: AI auto-structuring is reliable enough for initial data entry assistance. Prices and categories come through perfectly. Names and addresses need a human review pass, but the manual effort drops by roughly 80%.
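A rough sketch of the extraction-plus-review flow, assuming the model is prompted to emit JSON. The prompt wording and the validation rules are illustrative, and the response below is canned so the post-processing can be shown end-to-end; in production it comes from the model.

```python
import json

# Assumed prompt shape: ask the model for a strict JSON array.
EXTRACTION_PROMPT = (
    "Extract every menu item from the document below as a JSON array: "
    '[{"name": ..., "category": ..., "price": ...}]'
)

# Canned stand-in for the LLM's response.
llm_response = (
    '[{"name": "Americano", "category": "coffee", "price": 4.5},'
    ' {"name": "Latte", "category": "coffee", "price": -1}]'
)

items = json.loads(llm_response)
# Simple sanity check before DB insertion; anything that fails
# goes to the human review pass instead.
valid = [
    it for it in items
    if isinstance(it.get("price"), (int, float)) and it["price"] > 0
]
print(f"{len(valid)}/{len(items)} items passed validation")
```

Automated checks like this catch obviously broken rows, while the subtler errors (misread names, address variants) are the ones left for human review.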
Conclusion: Architecture Lessons
Building a production-grade AI chatbot that handles factual queries requires more than just RAG. Here are the four key takeaways from this implementation:
- DB handles numbers, RAG handles context. Pricing, inventory, and reservations belong in a database. Policies, descriptions, and FAQs belong in a vector store. Separating concerns gives you both precision and richness.
- Parallel search is effectively free. Running DB and RAG searches concurrently adds only 18ms to a 2-second response cycle — 0.9% overhead. There's no reason to choose one over the other when you can have both.
- 4-case routing builds a safety net. By classifying results based on which sources returned data, you ensure the LLM never fabricates answers when it has real data available, and honestly admits ignorance when it doesn't.
- Dictionary-based matching has a ceiling. At 95% accuracy, synonym dictionaries work well for predictable query patterns. But complex or novel queries require the LLM to generate SQL directly — the Text2SQL approach we explore in the next article in this series.
The entire pipeline — parallel search, 4-case routing, result assembly, and LLM generation — completes in under 2.1 seconds with a fact accuracy of 87.5–92.5%. For customer-facing chatbots where wrong prices or hours mean real business consequences, this architecture provides a solid foundation that a bare LLM (17.5% accuracy) simply cannot match.