May 15, 2026

Three RAG Patterns That Actually Make It to Production

After deploying three enterprise RAG applications in 18 months, here's what the architecture looks like once the demo phase is over, and the mistakes I'd avoid if I were starting again.

Most RAG tutorials end at the demo. You have a vector store, an embedding model, a retrieval step, and a language model. You type a question, you get an answer. It looks impressive.

Then someone asks: “What happens when the documents change?” or “How do we know the answers are correct?” or “Can we audit why it said that?” And the demo architecture doesn’t have answers.

After shipping three production RAG applications inside a large enterprise, here are the patterns that held up.

Pattern 1: Retrieval as a service, not a pipeline

The most common RAG architecture mistake is building retrieval directly into the inference flow. You index documents, embed the query at inference time, retrieve chunks, and pass them to the model, all in one request path.

This creates three production problems:

  • Retrieval latency is in your user’s path
  • Document updates require you to re-embed and re-index in real time
  • You can’t tune retrieval independently of generation

What works better: treat retrieval as a separate service and keep indexing asynchronous. Documents are indexed in a background pipeline triggered by document events (new upload, version change, deletion). The retrieval service is a separate API endpoint that returns ranked chunks. The request handler calls retrieval and then generation as two independent steps.
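As a rough illustration of the event-triggered indexing described above, here is a minimal sketch. The event shapes, the `ChunkIndex` class, and the fixed-size `chunkText` helper are all hypothetical stand-ins (an in-memory map instead of a real vector store, no embedding step); the point is only that indexing reacts to document events and never runs inside a user request.

```typescript
// Hypothetical event shapes: the three triggers named above, with an assumed schema.
type DocumentEvent =
  | { kind: "uploaded"; documentId: string; text: string }
  | { kind: "versionChanged"; documentId: string; text: string }
  | { kind: "deleted"; documentId: string };

interface IndexedChunk {
  documentId: string;
  chunkIndex: number;
  text: string;
}

// In-memory stand-in for a vector index, keyed by document ID.
class ChunkIndex {
  private chunksByDocument = new Map<string, IndexedChunk[]>();

  // Replaces every chunk of a document in one step, so stale chunks never linger.
  replaceDocument(documentId: string, chunks: IndexedChunk[]): void {
    this.chunksByDocument.set(documentId, chunks);
  }

  removeDocument(documentId: string): void {
    this.chunksByDocument.delete(documentId);
  }

  size(): number {
    let total = 0;
    this.chunksByDocument.forEach((chunks) => {
      total += chunks.length;
    });
    return total;
  }
}

// Naive fixed-size chunking, purely for illustration.
function chunkText(documentId: string, text: string, chunkSize: number = 200): IndexedChunk[] {
  const chunks: IndexedChunk[] = [];
  for (let start = 0; start < text.length; start += chunkSize) {
    chunks.push({ documentId, chunkIndex: chunks.length, text: text.slice(start, start + chunkSize) });
  }
  return chunks;
}

// Background handler: runs off the request path, once per document event.
function handleDocumentEvent(index: ChunkIndex, event: DocumentEvent): void {
  switch (event.kind) {
    case "uploaded":
    case "versionChanged":
      index.replaceDocument(event.documentId, chunkText(event.documentId, event.text));
      break;
    case "deleted":
      index.removeDocument(event.documentId);
      break;
  }
}
```

Replacing a document's chunks wholesale on a version change is one simple way to avoid serving a mix of old and new chunks; a real pipeline would also embed each chunk before writing it.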

This lets you:

  • Cache retrieval results for hot queries
  • Update the index without touching the inference path
  • Run retrieval A/B tests without changing the model
  • Observe and debug retrieval quality independently of generation quality
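To make the first and last benefits concrete, here is a minimal sketch of a request handler that treats retrieval and generation as two separate calls, with a small TTL cache wrapped around retrieval. The `Retrieve` and `Generate` function types, the cache key format, and the fixed `topK` of 5 are assumptions for illustration, not the contracts of any particular service.

```typescript
interface RankedChunk {
  documentId: string;
  chunkIndex: number;
  score: number;
  text: string;
}

// Hypothetical contracts for the two services.
type Retrieve = (query: string, topK: number) => Promise<RankedChunk[]>;
type Generate = (query: string, context: RankedChunk[]) => Promise<string>;

// Wraps any retriever with a TTL cache so hot queries skip the retrieval call.
// The clock is injectable so expiry can be tested without waiting.
function withCache(retrieve: Retrieve, ttlMs: number, now: () => number = Date.now): Retrieve {
  const cache = new Map<string, { expiresAt: number; chunks: RankedChunk[] }>();
  return async (query, topK) => {
    // Assumed normalization: trim and lowercase, so trivially different queries share an entry.
    const key = `${topK}:${query.trim().toLowerCase()}`;
    const hit = cache.get(key);
    if (hit !== undefined && hit.expiresAt > now()) {
      return hit.chunks;
    }
    const chunks = await retrieve(query, topK);
    cache.set(key, { expiresAt: now() + ttlMs, chunks });
    return chunks;
  };
}

// Two independent steps: either one can be swapped, cached, or measured alone.
async function answer(query: string, retrieve: Retrieve, generate: Generate): Promise<string> {
  const chunks = await retrieve(query, 5);
  return generate(query, chunks);
}
```

Because `answer` only depends on the two function types, a retrieval A/B test amounts to passing a different `Retrieve` implementation while the generation side stays untouched.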

Pattern 2: Explicit provenance in every response

Enterprise users need to know where an answer came from. A “sources” list at the bottom is not enough; what’s needed is structured, auditable metadata attached to every response.

We built a ResponseProvenance object that travels with every RAG response:

interface ResponseProvenance {
  retrievedChunks: Array<{
    documentId: string;
    documentTitle: string;
    chunkIndex: number;
    score: number;
    usedInGeneration: boolean;
  }>;
  modelId: string;
  timestamp: string;
  queryHash: string; // for deduplication and caching
}

This object gets logged, stored alongside the response, and in some UIs surfaces as an expandable “sources” panel. More importantly, it’s what lets you go back and ask “why did the model say X on this date?” Enterprise compliance and audit teams will ask.
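For a sense of how such an object might be assembled at response time, here is a minimal sketch. The interface is repeated so the block stands alone; `ScoredChunk`, `chunkKey`, `hashQuery`, and `buildProvenance` are hypothetical names. Two assumptions are worth calling out: `usedInGeneration` is taken to mean "this chunk was actually placed in the prompt", and the query hash is a non-cryptographic FNV-1a stand-in over a trimmed, lowercased query, where a production system might prefer something like SHA-256.

```typescript
// Same shape as the ResponseProvenance interface shown above.
interface ResponseProvenance {
  retrievedChunks: Array<{
    documentId: string;
    documentTitle: string;
    chunkIndex: number;
    score: number;
    usedInGeneration: boolean;
  }>;
  modelId: string;
  timestamp: string;
  queryHash: string;
}

// Hypothetical shape of what the retrieval service returns.
interface ScoredChunk {
  documentId: string;
  documentTitle: string;
  chunkIndex: number;
  score: number;
}

// 32-bit FNV-1a over UTF-16 code units: an illustrative stand-in, not a cryptographic hash.
function hashQuery(query: string): string {
  const normalized = query.trim().toLowerCase();
  let hash = 0x811c9dc5;
  for (let i = 0; i < normalized.length; i++) {
    hash ^= normalized.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Hypothetical key format for marking which chunks reached the prompt.
function chunkKey(documentId: string, chunkIndex: number): string {
  return `${documentId}#${chunkIndex}`;
}

function buildProvenance(
  query: string,
  retrieved: ScoredChunk[],
  promptChunkKeys: Set<string>,
  modelId: string,
  now: Date = new Date()
): ResponseProvenance {
  return {
    retrievedChunks: retrieved.map((chunk) => ({
      documentId: chunk.documentId,
      documentTitle: chunk.documentTitle,
      chunkIndex: chunk.chunkIndex,
      score: chunk.score,
      // Assumption: "used" means the chunk survived truncation and was sent to the model.
      usedInGeneration: promptChunkKeys.has(chunkKey(chunk.documentId, chunk.chunkIndex)),
    })),
    modelId,
    timestamp: now.toISOString(),
    queryHash: hashQuery(query),
  };
}
```

Recording every retrieved chunk, not only the ones that reached the prompt, is what makes it possible to tell a retrieval miss apart from a truncation or generation problem later.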

Without this, you’re flying blind. With it, you have the foundation for an evaluation pipeline.

Pattern 3: Evaluation is a first-class system, not an afterthought

The question no one asks until it’s urgent: “How do we know the answers are getting better or worse over time?”

RAG quality degrades in subtle ways. Documents change. The query distribution shifts. The embedding model gets updated. A new category of question starts failing. Without systematic evaluation, you find out about these regressions when a user complains.

The evaluation setup we built has three layers:

Automated regression suite: a curated set of ~200 question-answer pairs, run on every deployment. If the pass rate drops more than 3%, the deployment is flagged. This catches obvious regressions.
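The deployment gate itself can be very small. The sketch below assumes the suite has already been graded into pass counts (how each answer is graded is out of scope here), and it reads "drops more than 3%" as a drop of more than 3 percentage points against a stored baseline. That interpretation, and the names `SuiteResult` and `evaluateGate`, are assumptions.

```typescript
interface SuiteResult {
  passed: number;
  total: number;
}

interface GateDecision {
  baselinePercent: number;
  candidatePercent: number;
  dropPoints: number;
  flagged: boolean;
}

function passRatePercent(result: SuiteResult): number {
  if (result.total === 0) {
    throw new Error("regression suite is empty");
  }
  return (result.passed * 100) / result.total;
}

// Flags the deployment when the pass rate falls by MORE than maxDropPoints
// percentage points relative to the baseline run (assumed reading of "3%").
function evaluateGate(baseline: SuiteResult, candidate: SuiteResult, maxDropPoints: number = 3): GateDecision {
  const baselinePercent = passRatePercent(baseline);
  const candidatePercent = passRatePercent(candidate);
  const dropPoints = baselinePercent - candidatePercent;
  return { baselinePercent, candidatePercent, dropPoints, flagged: dropPoints > maxDropPoints };
}
```

With a 200-question suite, a drop from 190 passes to 183 is 3.5 points and would be flagged, while a drop to 184 is exactly 3 points and would not.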

Retrieval quality metrics: precision and recall against a golden set of queries with known relevant documents. Retrieval failures are often more impactful than generation failures, and they’re easier to measure.
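Precision and recall against a golden set reduce to a few lines once each golden query lists its known relevant documents. This sketch scores at the document level over the top k results and deduplicates documents that appear through several chunks; scoring per document rather than per chunk, and the function names, are assumptions.

```typescript
interface RetrievalMetrics {
  precision: number;
  recall: number;
}

// Document-level precision and recall over the top k retrieved results.
function precisionRecallAtK(
  retrievedDocumentIds: string[],
  relevantDocumentIds: string[],
  k: number
): RetrievalMetrics {
  const relevant = new Set(relevantDocumentIds);
  // Several chunks can come from the same document; count each document once.
  const uniqueTopK = Array.from(new Set(retrievedDocumentIds.slice(0, k)));
  const hits = uniqueTopK.filter((id) => relevant.has(id)).length;
  return {
    precision: uniqueTopK.length === 0 ? 0 : hits / uniqueTopK.length,
    recall: relevant.size === 0 ? 0 : hits / relevant.size,
  };
}

// Averages per-query metrics across the golden set.
function meanMetrics(perQuery: RetrievalMetrics[]): RetrievalMetrics {
  if (perQuery.length === 0) {
    throw new Error("golden set is empty");
  }
  const sum = perQuery.reduce(
    (acc, m) => ({ precision: acc.precision + m.precision, recall: acc.recall + m.recall }),
    { precision: 0, recall: 0 }
  );
  return { precision: sum.precision / perQuery.length, recall: sum.recall / perQuery.length };
}
```

Tracking these two numbers per deployment, separately from answer quality, is what lets a regression be attributed to retrieval rather than to the model.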

Human-in-the-loop sampling: 1% of production responses get routed to a review queue. Domain experts score them. This catches the failures that automated metrics miss: the ones where the answer is technically correct but wrong in context.
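The sampling step can be sketched as a thin wrapper around the response path. Everything here is hypothetical: the `ReviewSampler` class, the `ReviewItem` fields, and the in-memory array standing in for a real review queue. The random source is injectable so the behavior can be tested deterministically.

```typescript
// Hypothetical review queue entry; it points back at the stored response and its provenance.
interface ReviewItem {
  responseId: string;
  queryHash: string;
  enqueuedAt: string;
}

class ReviewSampler {
  // In-memory stand-in for a real review queue.
  readonly queue: ReviewItem[] = [];
  private readonly sampleRate: number;
  private readonly random: () => number;

  constructor(sampleRate: number = 0.01, random: () => number = Math.random) {
    if (sampleRate < 0 || sampleRate > 1) {
      throw new RangeError("sampleRate must be between 0 and 1");
    }
    this.sampleRate = sampleRate;
    this.random = random;
  }

  // Routes roughly sampleRate of responses to the queue; returns true when sampled.
  maybeEnqueue(responseId: string, queryHash: string, now: Date = new Date()): boolean {
    if (this.random() >= this.sampleRate) {
      return false;
    }
    this.queue.push({ responseId, queryHash, enqueuedAt: now.toISOString() });
    return true;
  }
}
```

Uniform random sampling is the simplest choice; if rare question categories matter, stratifying the sample by category would be a natural refinement.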

None of this is glamorous. But the teams that skip it spend six months firefighting model quality issues they can’t diagnose.

The pattern I’d skip next time

Multi-hop retrieval for the first version. The idea is compelling: retrieve, reason about what additional context you need, retrieve again. In practice, the latency is high, the failure modes multiply, and most enterprise queries don’t need it. Get single-hop working reliably first, with good evaluation. Then add complexity only where the data shows you need it.