
Building a Local RAG Pipeline — From Embedding Model Selection to Hallucination Elimination

Retrieval-Augmented Generation (RAG) promises to ground LLM responses in factual data, but the gap between a demo and a production-ready pipeline is enormous. We built an end-to-end RAG system on our local GPU infrastructure, benchmarked every component individually, and measured exactly how much accuracy we gained — and at what cost. Here is the full breakdown.

Why RAG Matters for Local LLM Deployments

Large language models generate fluent text, but they confidently fabricate details — prices, dates, specifications — when the answer is not baked into their weights. In enterprise settings where a wrong number in an invoice or a fabricated product spec can cause real damage, RAG provides a structural fix: retrieve the relevant documents first, then let the model answer only from what it can cite.

Cloud-hosted RAG services exist, but they introduce data-residency risks and per-query costs that compound fast at scale. Running the entire pipeline locally — embedding, vector search, and generation — keeps sensitive documents on-premise while giving you full control over latency and throughput.

Pipeline Architecture: Four Stages

Our pipeline follows a straightforward four-stage design: Chunk → Embed → Search → Generate. Each stage was tested independently before integration so we could isolate bottlenecks.

Stage 1 — Chunking. Source documents (product manuals, pricing sheets, internal FAQs) are split into 512-token chunks with 64-token overlap. We tested chunk sizes from 256 to 1024 tokens; 512 hit the sweet spot between recall and embedding throughput.
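A minimal sketch of this sliding-window chunking, using dummy integer tokens for illustration (a real pipeline would tokenize with the embedding model's own tokenizer):

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks with overlap.

    The stride is chunk_size - overlap, so consecutive chunks share
    `overlap` tokens of context across each boundary.
    """
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end of the document
    return chunks

# A 1,200-"token" document yields three overlapping 512-token chunks.
chunks = chunk_tokens(list(range(1200)), chunk_size=512, overlap=64)
```

The overlap matters: a fact that straddles a chunk boundary would otherwise be split across two chunks, and neither half might embed close enough to the query to be retrieved.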

Stage 2 — Embedding. Each chunk is encoded into a dense vector using a multilingual embedding model. We compared two leading candidates: BGE-M3 (BAAI, 568M parameters) and mE5-Large (Microsoft, 560M parameters). Both run comfortably on a single RTX Pro 6000 with 48 GB VRAM.

Stage 3 — Vector Search. Vectors are stored in Qdrant, an open-source vector database optimized for filtered search. At query time, the user question is embedded with the same model and the top-K nearest neighbors are retrieved.
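Conceptually, the search stage reduces to cosine-similarity top-K over the stored vectors. A brute-force NumPy sketch of the idea (Qdrant itself uses an HNSW index rather than a full scan; the toy 4-dim vectors below stand in for 1024-dim BGE-M3 embeddings):

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=3):
    """Return indices of the k most similar rows by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity per document
    return np.argsort(-sims)[:k]      # indices of the k highest scores

# Toy corpus: docs 0 and 1 point roughly the same direction as the query.
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
nearest = top_k_cosine(query, docs, k=2)
```

The crucial constraint is that the query must be embedded with the same model as the chunks; mixing embedding models makes the similarity scores meaningless.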

Stage 4 — Generation. The retrieved chunks are injected into the LLM prompt as context. The model (Qwen3-14B-AWQ in our case) generates an answer grounded in the retrieved evidence.
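A sketch of the context-injection step; the helper name and prompt template are illustrative, not our exact production template:

```python
def build_rag_prompt(question, chunks):
    """Assemble the generation prompt: retrieved chunks as numbered
    context blocks, followed by the question and a grounding instruction.
    """
    context = "\n\n".join(
        f"[Document {i + 1}]\n{chunk}" for i, chunk in enumerate(chunks)
    )
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "How much does the premium package cost?",
    ["Premium package: 4,500 won per unit (2024 pricing sheet)."],
)
```

The explicit "only the context" instruction is what turns retrieval into grounding: without it, the model happily blends retrieved facts with memorized guesses.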

Embedding Model Comparison: BGE-M3 vs mE5-Large

We created a benchmark set of 30 Korean-language queries spanning product pricing, delivery policies, and technical specifications. Each query had a known ground-truth chunk. The key metric was recall: did the correct chunk appear in the top-K results?

Model      Dimensions   Encoding Speed   Top-1 Recall   Top-3 Recall
BGE-M3     1024         38.2 ms/query    75.0%          100%
mE5-Large  1024         37.1 ms/query    66.7%          86.7%

BGE-M3 achieved 100% Top-3 recall on our Korean benchmark — every single ground-truth chunk appeared within the top three results. mE5-Large missed four queries entirely at Top-3, dropping to 86.7%. The encoding speed difference was negligible (1.1 ms), so BGE-M3 was the clear winner for Korean-language retrieval.
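Scoring recall@K from ranked results takes only a few lines. The `results` dict and chunk ids below are toy data, not our actual benchmark:

```python
def recall_at_k(results, ground_truth, k):
    """Fraction of queries whose ground-truth chunk id appears in the
    top-k retrieved ids. `results` maps query -> ranked chunk ids."""
    hits = sum(
        1 for query, truth in ground_truth.items()
        if truth in results[query][:k]
    )
    return hits / len(ground_truth)

# Three toy queries: ground truth retrieved at rank 1, rank 2, and never.
results = {"q1": [7, 2, 9], "q2": [3, 5, 1], "q3": [8, 6, 4]}
truth = {"q1": 7, "q2": 5, "q3": 0}
r1 = recall_at_k(results, truth, k=1)  # 1/3: only q1 hits at rank 1
r3 = recall_at_k(results, truth, k=3)  # 2/3: q2's chunk appears at rank 2
```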

Vector Search Performance in Qdrant

With BGE-M3 selected, we loaded 2,847 chunks into Qdrant and ran latency benchmarks. Average search time was 1.9 ms per query for Top-3 retrieval — fast enough that the search stage contributes almost nothing to end-to-end latency.

Accuracy breakdown by retrieval depth:

  • Top-1: 75.0% — three out of four queries return the ideal chunk as the first result
  • Top-3: 100% — expanding to three candidates catches every remaining case

This confirms that Top-3 retrieval is the minimum viable setting for production. Top-1 alone would miss 25% of correct answers, while Top-3 closes the gap completely without adding meaningful latency.

RAG Overhead: Only 2.9% Slower Than Baseline

The critical question for production is how much RAG slows down inference. We compared two scenarios: the LLM generating answers without context (baseline) versus the full RAG pipeline (embed query → search → generate with context).

Mode                 Avg Response Time   Overhead
Baseline (no RAG)    1,916 ms
Full RAG Pipeline    1,971.7 ms          +2.9%

The measured RAG overhead adds roughly 55.7 ms to the total pipeline. Query embedding (38.2 ms) and vector search (1.9 ms) account for 40.1 ms of that; the remainder is most likely spent prefilling the longer, context-augmented prompt. Overall, that is a 2.9% increase over baseline generation time. In practice, users cannot perceive this difference. The cost of grounding answers in real data is essentially free from a latency perspective.

Hallucination Correction: Real Before-and-After Examples

The whole point of RAG is eliminating made-up facts. We tested this with adversarial queries designed to trigger common LLM hallucination patterns — asking about specific prices, delivery conditions, and product details that the model could not have memorized from pretraining.

Example 1: Product Pricing

Query: "How much does the premium package cost?"

Without RAG: The LLM confidently answered "approximately 1,500 won" — a number it apparently fabricated from statistical patterns in its training data.

With RAG: The pipeline retrieved the actual pricing document and returned "4,500 won per unit based on the 2024 pricing sheet" — the correct figure, with a citable source.

Example 2: Delivery Policy

Query: "What is the delivery timeframe for bulk orders?"

Without RAG: "Standard delivery takes 3-5 business days." This sounds plausible but was completely wrong — the actual policy for bulk orders has different terms.

With RAG: The system retrieved the logistics policy document and provided the correct bulk-order timeline with specific conditions, directly quoting the source material.

Across our test set, RAG eliminated 100% of factual hallucinations in cases where the answer existed in the indexed documents. The model still occasionally paraphrased awkwardly, but it no longer invented facts.

Lessons Learned and Production Recommendations

After building and benchmarking this pipeline, several practical insights emerged:

  • Always use Top-3 retrieval minimum. Top-1 misses too many edge cases. Top-3 achieved perfect recall in our tests with negligible latency cost.
  • BGE-M3 is the best multilingual embedding model for Korean. The 13.3 percentage-point recall advantage over mE5-Large at Top-3 is decisive. If your documents include Korean text, BGE-M3 should be your default.
  • RAG overhead is not a valid concern. At 2.9% additional latency, there is no performance reason to skip RAG. The accuracy gains are massive; the cost is invisible.
  • Chunk size matters more than model size. We saw bigger recall improvements from tuning chunk size (512 tokens) and overlap (64 tokens) than from switching between similarly-sized embedding models.
  • Monitor retrieval quality, not just generation quality. When RAG answers are wrong, the root cause is almost always a retrieval failure (wrong chunks returned), not a generation failure. Build your logging around the search stage.
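To make that last point concrete, here is a minimal sketch of search-stage logging. The `client_search` callable and the hit-dict shape are hypothetical stand-ins for whatever vector-DB client you use:

```python
import json
import time

def logged_search(client_search, query_text, query_vec, k=3):
    """Wrap the vector-search call with structured logging so that
    retrieval failures can be diagnosed after the fact."""
    start = time.perf_counter()
    hits = client_search(query_vec, k)
    elapsed_ms = (time.perf_counter() - start) * 1000
    record = {
        "query": query_text,
        "k": k,
        "latency_ms": round(elapsed_ms, 2),
        "hit_ids": [h["id"] for h in hits],
        "hit_scores": [round(h["score"], 4) for h in hits],
    }
    print(json.dumps(record, ensure_ascii=False))  # one JSON line per query
    return hits

# Demo with a stub search function standing in for the real client.
def _stub_search(vec, k):
    return [{"id": i, "score": 1.0 - 0.1 * i} for i in range(k)]

hits = logged_search(_stub_search, "premium package price?", [0.0] * 4, k=3)
```

Logging the query text, hit ids, and scores together is what lets you answer the key debugging question later: did the generator get bad chunks, or did it mangle good ones?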

Full Pipeline Specifications

Component                      Specification
Embedding Model                BGE-M3 (568M params, 1024-dim)
Vector Database                Qdrant (self-hosted)
LLM                            Qwen3-14B-AWQ (4-bit quantized)
Chunk Size                     512 tokens, 64-token overlap
Retrieval Depth                Top-3
Search Latency                 1.9 ms average
Total RAG Overhead             +55.7 ms (+2.9%)
Hallucination Rate (with RAG)  0% (on indexed documents)
GPU                            RTX Pro 6000 (48 GB VRAM)

RAG is not a silver bullet — it cannot answer questions about topics not in your document index, and it adds complexity to your deployment. But for the specific problem of factual hallucination in enterprise contexts, the numbers speak clearly: 100% Top-3 recall, 0% hallucination on indexed content, and only 2.9% latency overhead. That is a trade-off worth making every time.