March 18, 2025 · 8 min read

Building a Production RAG System That Doesn't Hallucinate

Most RAG tutorials show you how to get an answer. I'll show you how to get the right answer — reliably, at scale, with measurable hallucination rates under 2%.

RAG · LangChain · OpenAI · Vector DB

Every client who comes to me with a RAG project has the same problem: their prototype works great in demos but breaks in production. Answers are wrong, confidence is unwarranted, and users stop trusting the system within days.

After building RAG pipelines for 10+ production deployments, I've found the gaps are almost never about the LLM. They're about retrieval quality, context assembly, and a missing evaluation loop. Here's how I build systems that stay honest.

The Core Problem: Retrieval ≠ Relevance

The most common mistake is treating vector similarity as a proxy for relevance. It isn't. A chunk can be semantically close to a query while containing completely different information. If your retrieval step returns 5 chunks and 3 are noise, you've handed the LLM enough rope to hallucinate confidently.

My baseline retrieval stack for production:

from langchain.retrievers import EnsembleRetriever

retriever = EnsembleRetriever(
    retrievers=[
        vector_store.as_retriever(search_kwargs={"k": 8}),  # dense retrieval
        bm25_retriever,                                     # keyword fallback
    ],
    weights=[0.6, 0.4],  # favor dense, but keep the lexical signal
)

The BM25 fallback is non-negotiable for queries with proper nouns, product names, or specific numeric values — domains where dense embeddings routinely miss exact matches.
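Under the hood, an ensemble like this merges the two ranked lists with weighted reciprocal rank fusion: each document scores `weight / (k + rank)` per list, and the scores are summed. A minimal pure-Python sketch of the fusion step (the document IDs and the constant `k=60` are illustrative, not taken from any specific library's internals):

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """Fuse ranked result lists via weighted reciprocal rank fusion.

    Each document earns weight / (k + rank) per list it appears in;
    scores are summed, and the fused ranking is by total score.
    """
    scores = {}
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d1", "d2", "d3"]   # ranking from the dense retriever
sparse = ["d3", "d4", "d1"]  # ranking from BM25
fused = weighted_rrf([dense, sparse], weights=[0.6, 0.4])
```

Note how a document that both retrievers surface ("d1", "d3") outranks one that only a single retriever likes, which is exactly the behavior you want when one signal is noisy.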

Context Compression Before Injection

Raw chunks are expensive and noisy. Before injecting context into the prompt I run a contextual compression step that strips irrelevant sentences from each retrieved chunk:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever,
)

This typically reduces token usage by 40–60% while improving answer quality — the LLM has less noise to reason through.

Grounding the Prompt

The system prompt architecture matters more than most people realize. I use a three-part structure:

  1. Identity + scope — what the assistant is, what domain it covers
  2. Grounding instruction — explicit instruction to answer only from provided context, and to say "I don't know" when the context doesn't support an answer
  3. Citation format — require the model to cite the source chunk ID in every factual claim

That last point is critical. Citations force the model to trace its answers back to source material. Any answer without a traceable source gets flagged by our evaluation layer.
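A minimal sketch of that three-part structure as a prompt builder. The product name, exact wording, and `[chunk:<id>]` citation format here are illustrative assumptions, not a fixed standard:

```python
# Three-part system prompt: identity/scope, grounding rule, citation format.
SYSTEM_PROMPT = (
    # 1. Identity + scope (hypothetical domain)
    "You are a support assistant for the Acme product documentation.\n\n"
    # 2. Grounding instruction
    "Answer ONLY from the context below. If the context does not "
    'support an answer, reply exactly: "I don\'t know."\n\n'
    # 3. Citation format
    "Cite the source chunk for every factual claim as [chunk:<id>].\n\n"
    "Context:\n{context}"
)

def build_prompt(chunks):
    """Render retrieved chunks with their IDs so citations are traceable."""
    context = "\n\n".join(f"[chunk:{c['id']}] {c['text']}" for c in chunks)
    return SYSTEM_PROMPT.format(context=context)
```

Because each chunk is injected with its ID inline, a downstream check can verify that every `[chunk:...]` tag in the answer matches a chunk that was actually provided.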

The Evaluation Loop You Can't Skip

None of the above matters without measurement. I instrument every production RAG system with three metrics tracked per query:

  • Context recall — did the retriever surface the chunks needed to answer correctly?
  • Answer faithfulness — does the answer contradict anything in the context?
  • Answer relevance — does the answer address what the user actually asked?

I use RAGAS for automated scoring and pipe the results into a Grafana dashboard. When faithfulness drops below 0.85 on a query cluster, that cluster gets flagged for human review and the retriever config gets tuned.
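The flagging logic itself is simple. A sketch of the per-cluster check, assuming per-query RAGAS faithfulness scores have already been computed and tagged with a cluster label (the field names and 0.85 threshold are from this setup, not a library convention):

```python
FAITHFULNESS_THRESHOLD = 0.85

def flag_clusters(scored_queries):
    """Return cluster IDs whose mean faithfulness falls below threshold.

    scored_queries: list of {"cluster": str, "faithfulness": float}
    dicts, e.g. produced by a RAGAS evaluation run.
    """
    totals, counts = {}, {}
    for q in scored_queries:
        c = q["cluster"]
        totals[c] = totals.get(c, 0.0) + q["faithfulness"]
        counts[c] = counts.get(c, 0) + 1
    return sorted(
        c for c in totals
        if totals[c] / counts[c] < FAITHFULNESS_THRESHOLD
    )

flagged = flag_clusters([
    {"cluster": "billing", "faithfulness": 0.92},
    {"cluster": "billing", "faithfulness": 0.88},
    {"cluster": "refunds", "faithfulness": 0.70},
    {"cluster": "refunds", "faithfulness": 0.95},
])
```

Averaging per cluster rather than per query keeps one bad outlier from paging a human, while still catching a cluster that is systematically drifting.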

Results

The last production system I shipped with this approach: 94% answer faithfulness on day one, 97% after two weeks of evaluation-driven tuning. Hallucination complaints from end users dropped to near zero within a month.

RAG is not a solved problem — but it's a solvable one if you treat evaluation as a first-class concern from day one, not an afterthought.

If you're building a RAG system and hitting a wall, reach out. I'm happy to do a 30-minute architecture review.


Building something with AI?

I help teams ship production-grade AI systems. Let's talk.

Get in touch