Every client that comes to me with a RAG project has the same problem: their prototype works great in demos but breaks in production. Answers are wrong, confidence is unwarranted, and users stop trusting the system within days.
After building RAG pipelines for 10+ production deployments, I've found the gaps are almost never about the LLM. They're about retrieval quality, context assembly, and a missing evaluation loop. Here's how I build systems that stay honest.
The Core Problem: Retrieval ≠ Relevance
The most common mistake is treating vector similarity as a proxy for relevance. It isn't. A chunk can be semantically close to a query while containing completely different information. If your retrieval step returns 5 chunks and 3 are noise, you've handed the LLM enough rope to hallucinate confidently.
My baseline retrieval stack for production:
```python
# Hybrid retrieval: dense vector search plus BM25 keyword search.
from langchain.retrievers import EnsembleRetriever

ensemble_retriever = EnsembleRetriever(
    retrievers=[
        vector_store.as_retriever(search_kwargs={"k": 8}),
        bm25_retriever,  # keyword fallback
    ],
    weights=[0.6, 0.4],
)
```
The BM25 fallback is non-negotiable for queries with proper nouns, product names, or specific numeric values — domains where dense embeddings routinely miss exact matches.
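Under the hood, LangChain's EnsembleRetriever combines the two result lists with weighted Reciprocal Rank Fusion. A minimal sketch of that fusion step, using toy document IDs and the conventional smoothing constant k=60 (both my own illustration, not taken from the library source):

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """Fuse ranked result lists with weighted Reciprocal Rank Fusion.

    Each document's score is the weighted sum of 1 / (k + rank) across
    every list it appears in, so documents ranked well by BOTH retrievers
    float to the top.
    """
    scores = {}
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # ids ranked by the vector store
sparse = ["d1", "d9", "d3"]  # ids ranked by BM25
fused = weighted_rrf([dense, sparse], weights=[0.6, 0.4])
```

Note that `d1` wins here even though neither retriever ranked it first: appearing high in both lists beats topping only one, which is exactly the behavior you want from a hybrid stack.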
Context Compression Before Injection
Raw chunks are expensive and noisy. Before injecting context into the prompt I run a contextual compression step that strips irrelevant sentences from each retrieved chunk:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Wrap the hybrid retriever so each chunk is trimmed to the sentences
# that actually bear on the query before it reaches the prompt.
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever,
)
```
This typically reduces token usage by 40–60% while improving answer quality — the LLM has less noise to reason through.
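LLMChainExtractor delegates the trimming to an LLM call, so it isn't directly runnable here. To make the idea concrete, here is a toy lexical-overlap filter that mimics what the compressor does, keeping only sentences that share content words with the query. This is an illustration of the concept, not the library's implementation:

```python
def compress_chunk(chunk: str, query: str, min_overlap: int = 1) -> str:
    """Keep only sentences sharing at least min_overlap words with the query."""
    query_terms = set(query.lower().split())
    kept = [
        sentence
        for sentence in chunk.split(". ")
        if len(query_terms & set(sentence.lower().split())) >= min_overlap
    ]
    return ". ".join(kept)

chunk = (
    "The refund policy allows returns within 30 days. "
    "Our office is in Berlin. Refunds require a receipt."
)
compressed = compress_chunk(chunk, "refund policy returns")
# Only the first sentence survives; the off-topic ones are dropped.
```

The real compressor makes the same keep/drop decision per sentence, just with an LLM judging relevance instead of word overlap.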
Grounding the Prompt
The system prompt architecture matters more than most people realize. I use a three-part structure:
- Identity + scope — what the assistant is, what domain it covers
- Grounding instruction — explicit instruction to answer only from provided context, and to say "I don't know" when the context doesn't support an answer
- Citation format — require the model to cite the source chunk ID in every factual claim
That last point is critical. Citations force the model to trace its answers back to source material. Any answer without a traceable source gets flagged by the evaluation layer.
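The three-part structure can be sketched as a single template. The product name and the `[chunk-NN]` citation convention below are placeholders I've chosen for illustration; adapt both to your domain:

```python
# Three-part grounded system prompt: identity/scope, grounding rule,
# citation format. "ACME" and "[chunk-NN]" are placeholder conventions.
SYSTEM_PROMPT = """You are a documentation assistant for the ACME product suite.
You answer questions about ACME features, configuration, and billing only.

Answer ONLY from the context provided below. If the context does not
support an answer, say "I don't know" instead of guessing.

After every factual claim, cite the ID of the supporting chunk in
square brackets, e.g. [chunk-07].

Context:
{context}

Question:
{question}"""
```

Because every claim carries a chunk ID, a downstream check can verify that each cited chunk actually exists in the retrieved set, which is what makes the "flag unsourced answers" rule enforceable.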
The Evaluation Loop You Can't Skip
None of the above matters without measurement. I instrument every production RAG system with three metrics tracked per query:
- Context recall — did the retriever surface the chunks needed to answer correctly?
- Answer faithfulness — does the answer contradict anything in the context?
- Answer relevance — does the answer address what the user actually asked?
I use RAGAS for automated scoring and pipe the results into a Grafana dashboard. When faithfulness drops below 0.85 on a query cluster, that cluster gets flagged for human review and the retriever config gets tuned.
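The flagging logic itself is simple once RAGAS has produced per-query faithfulness scores. A minimal sketch, where the cluster labels and the shape of the input are my own illustration:

```python
from collections import defaultdict

def flag_clusters(scored_queries, threshold=0.85):
    """Group per-query faithfulness scores by cluster label and return
    the clusters whose mean score falls below the review threshold."""
    by_cluster = defaultdict(list)
    for cluster, score in scored_queries:
        by_cluster[cluster].append(score)
    return {
        cluster: sum(scores) / len(scores)
        for cluster, scores in by_cluster.items()
        if sum(scores) / len(scores) < threshold
    }

flagged = flag_clusters([
    ("billing", 0.91), ("billing", 0.88),
    ("pricing", 0.72), ("pricing", 0.80),
])
# "pricing" averages 0.76, below 0.85, so it goes to human review.
```

In production this runs as a scheduled job over the evaluation logs; anything in `flagged` becomes a review ticket with its worst-scoring queries attached.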
Results
The last production system I shipped with this approach: 94% answer faithfulness on day one, 97% after two weeks of evaluation-driven tuning. Hallucination complaints from end users dropped to near zero within a month.
RAG is not a solved problem — but it's a solvable one if you treat evaluation as a first-class concern from day one, not an afterthought.
If you're building a RAG system and hitting a wall, reach out. I'm happy to do a 30-minute architecture review.