
RAG in production: the real engineering challenges behind retrieval-augmented generation

RAG demos are impressive. RAG in production is a different problem entirely — one involving chunking strategies, hybrid search, latency budgets, access control, and continuous evaluation.

3/3/2026 · 7 min read · AI

Executive summary

Retrieval-Augmented Generation (RAG) has become the dominant architectural pattern for building enterprise knowledge applications with LLMs. The core concept is elegant: instead of relying solely on what a language model learned during training, augment each generation request with dynamically retrieved, up-to-date context from your organization's documents and databases.

The demo looks compelling. A user asks a question; the system retrieves relevant document chunks; the LLM synthesizes an accurate, grounded answer. In practice, moving from this demo to a production system that is reliable, fast, accurate, and compliant requires solving problems that no RAG tutorial covers.

This post covers those problems — and the engineering decisions that determine whether your RAG system becomes a production asset or a maintenance burden.

The retrieval pipeline: where most RAG systems break

A RAG pipeline consists of three primary stages: ingestion (processing your documents into retrievable chunks), retrieval (finding the right chunks for a given query), and generation (synthesizing an answer from retrieved context). Most engineering effort concentrates on generation, but most production failures originate in ingestion and retrieval.

Chunking: the underestimated foundation

Chunking transforms raw documents into retrievable units. The naive approach — splitting documents into fixed-size text blocks of 512 or 1,024 tokens — is sufficient for demos and systematically incorrect for production.

Why fixed-size chunking fails in production:

  • A product manual's sections on "Installation Requirements" and "Safety Warnings" may be semantically distinct but physically adjacent. Fixed-size chunking splits them arbitrarily, polluting both chunks with irrelevant context.
  • Legal documents have hierarchical structure: contract → clause → sub-clause → definition. A chunk that includes the sub-clause without the parent clause has lost the context that makes it interpretable.
  • Code documentation often contains examples that only make sense alongside the function signature that precedes them by 600 tokens.

Production chunking strategies:

  • Semantic chunking: Embed sentences and split when embedding cosine distance between adjacent sentences exceeds a threshold — chunks follow topical shifts rather than token counts
  • Hierarchical chunking: Preserve document structure by indexing both full sections (for broad retrieval) and sub-sections (for precise retrieval), then using the broader chunk for LLM context
  • Document-aware chunking: Apply different chunking strategies per document type — PDFs use heading detection, code uses function/class boundaries, emails treat each thread as a unit
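
The semantic chunking strategy above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `embed` function here is a stand-in bag-of-words vectorizer — in a real pipeline you would replace it with calls to your actual embedding model, batch the calls, and tune the distance threshold on your own corpus.

```python
import math

def embed(sentence: str) -> dict[str, int]:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    # Replace with your embedding API in production.
    vec: dict[str, int] = {}
    for word in sentence.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine_distance(a: dict[str, int], b: dict[str, int]) -> float:
    # 1 - cosine similarity; 1.0 means no overlap at all.
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def semantic_chunks(sentences: list[str], threshold: float = 0.8) -> list[list[str]]:
    # Start a new chunk whenever the embedding distance between
    # adjacent sentences exceeds the threshold, so chunk boundaries
    # follow topical shifts rather than token counts.
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sent in sentences[1:]:
        cur = embed(sent)
        if cosine_distance(prev, cur) > threshold:
            chunks.append([sent])
        else:
            chunks[-1].append(sent)
        prev = cur
    return chunks
```

With real embeddings the same loop applies; the only tuning decision is the threshold, which is corpus-dependent and worth validating against a sample of manually chunked documents.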

The hybrid search imperative

Pure vector search (semantic similarity) misses exact matches. Pure keyword search (BM25/TF-IDF) misses semantic relationships. Neither alone is sufficient for enterprise knowledge bases that contain both technical specifications (where exact term matches matter) and conceptual documentation (where semantic similarity matters).

Production hybrid search architecture:

| Component | Role | Typical implementation |
| --- | --- | --- |
| Dense retrieval | Semantic similarity | OpenAI embeddings + pgvector, Pinecone, or Qdrant |
| Sparse retrieval | Keyword matching | Elasticsearch BM25 or OpenSearch |
| Reranker | Second-pass relevance scoring | Cross-encoder model (e.g., Cohere Rerank) |
| Fusion | Combining dense + sparse results | Reciprocal Rank Fusion (RRF) |

The reranker is the component most teams skip in the MVP and most regret skipping in production. Without it, the combined dense + sparse results include many borderline-relevant chunks that dilute the LLM's attention and degrade answer quality.
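
Reciprocal Rank Fusion itself is small enough to show in full. Each input list is assumed ranked best-first; a document's fused score is the sum of 1/(k + rank) over every list it appears in, so documents ranked well by both retrievers float to the top. The constant k = 60 is the conventional default from the original RRF paper.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # Score each document by summing 1/(k + rank) across every
    # ranked list (dense, sparse, ...) in which it appears.
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, not raw scores, it sidesteps the problem that dense similarity scores and BM25 scores live on incomparable scales.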

Latency: the silent production killer

A typical RAG request chain:

  1. Embed the user query: ~50ms
  2. Vector search over document index: ~100-300ms
  3. Keyword search: ~50-200ms
  4. Reranking: ~200-500ms
  5. LLM generation with retrieved context: ~1,000-4,000ms

Total: 1.4-5 seconds for a single RAG request — before accounting for network latency, cold starts, or load spikes.
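
One structural observation about the chain above: steps 2 and 3 are independent, so they should run concurrently rather than sequentially — the wall-clock cost becomes max(vector, keyword) instead of their sum. A minimal asyncio sketch (with `dense_search` and `sparse_search` as hypothetical stand-ins for your real search clients, using sleeps to simulate latency):

```python
import asyncio

async def dense_search(query: str) -> list[str]:
    await asyncio.sleep(0.2)  # simulated vector search latency
    return ["chunk-a", "chunk-b"]

async def sparse_search(query: str) -> list[str]:
    await asyncio.sleep(0.1)  # simulated BM25 latency
    return ["chunk-b", "chunk-c"]

async def retrieve(query: str) -> tuple[list[str], list[str]]:
    # Dense and sparse retrieval share no data dependency, so run
    # them concurrently: ~200ms here instead of ~300ms sequential.
    dense, sparse = await asyncio.gather(
        dense_search(query), sparse_search(query)
    )
    return dense, sparse
```

The same overlap applies to query embedding and keyword search, which also share no dependency.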

For applications where users expect sub-second responses, this pipeline requires significant engineering:

  • Query result caching: Cache the full RAG response for identical or near-identical queries. A customer support application handles the same 100 questions 80% of the time — caching these responses eliminates most retrieval overhead.
  • Async retrieval prefetching: In conversational applications, begin retrieval for the expected follow-up question while the current LLM response is streaming.
  • Index warm-up and co-location: Keep vector indexes in memory on the same network segment as the API servers that query them. Cross-region retrieval adds 50-200ms and makes latency percentiles unpredictable.
  • Model selection for context size: Using a frontier model with a 128K context window to process 4 document chunks that fit in 2K tokens is economically and latency-wise irrational. Size your model to your actual retrieved context.

Document access control: the compliance time bomb

Enterprise knowledge bases contain documents with different access levels: public product documentation, internal engineering specs, HR policies restricted to HR staff, financial data restricted to finance, GDPR-subject personal data restricted by role and jurisdiction.

Most RAG implementations retrieve from the full document store, regardless of who is asking. This means an employee asking the AI assistant a question about "project timeline" might receive context retrieved from a confidential strategic plan they are not authorized to access.

Access control architecture for production RAG:

  • Index segmentation: Maintain separate vector indexes per access tier. Public documents in one index, confidential documents in another. Route retrieval to only the indexes the requesting user is authorized to query.
  • Document-level ACL in metadata: Store access control lists as document metadata in the vector store. Filter search results by ACL before returning to the LLM context. Both pgvector and Qdrant support metadata filtering at query time.
  • Audit logging of retrievals: Every retrieved document chunk must be logged with the user identity, timestamp, and query context. For regulated industries, this audit trail is a compliance requirement, not an optional observability feature.
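
The document-level ACL pattern reduces to a group-intersection check on chunk metadata. A minimal sketch, assuming each chunk dict carries an `allowed_groups` list copied from its source document (hypothetical field names — adapt to your store's metadata schema). Where possible, push this filter into the vector store query itself rather than applying it afterward, since post-retrieval filtering can leave too few chunks in the top-k:

```python
def filter_by_acl(chunks: list[dict], user_groups: list[str]) -> list[dict]:
    # Each chunk carries the ACL of its source document as metadata.
    # Drop anything the requesting user's groups cannot read BEFORE
    # it reaches the LLM context.
    allowed = []
    for chunk in chunks:
        if set(chunk["allowed_groups"]) & set(user_groups):
            allowed.append(chunk)
    return allowed
```

This is also the natural point to emit the audit log entry: the chunks that survive the filter are exactly the retrievals to record against the user identity and query.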

Continuous evaluation: the operational requirement teams ignore

RAG systems degrade silently. When your document corpus changes (new products, updated policies, deprecated features), retrieval quality for existing queries may decline without any observable error.

Production evaluation framework:

  • Golden dataset: Maintain a set of 100-500 question-answer pairs with known correct answers, drawn from your actual user queries. Run this evaluation weekly against the live system and alert on quality regression.
  • Automated hallucination detection: After generation, verify that factual claims in the LLM response are supported by retrieved context. Flag and log responses where the LLM introduces information absent from the retrieved chunks.
  • Retrieval quality metrics: Track Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) for your retrieval component separately from end-to-end answer accuracy. This isolates whether quality issues originate in retrieval or generation.
  • User feedback integration: Thumbs up/down signals from users, routed back to your evaluation pipeline, provide the ground truth that automated metrics cannot.
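
Of the retrieval metrics above, MRR is the simplest to wire into a weekly evaluation job: for each golden-dataset query, take the reciprocal of the rank of the first relevant chunk retrieved (0 if none appears), then average across queries.

```python
def mean_reciprocal_rank(
    results_per_query: list[list[str]],
    relevant_per_query: list[set[str]],
) -> float:
    # For each query: 1/rank of the first relevant document retrieved,
    # or 0 if no relevant document appears. Average over all queries.
    total = 0.0
    for results, relevant in zip(results_per_query, relevant_per_query):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results_per_query)
```

Tracked over time against the golden dataset, a drop in MRR with stable end-to-end accuracy points at the retriever; stable MRR with falling accuracy points at the generation step.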

Decision prompts for engineering leaders

  • Does your RAG system enforce document-level access control, or does every query run against the full document corpus regardless of the user's permissions?
  • What is your P95 latency for a complete RAG request under your expected production load? Have you measured it?
  • Do you have a continuous evaluation pipeline that would detect a 20% drop in answer accuracy within 24 hours?
  • When was the last time your document ingestion pipeline was tested against documents that violated your expected format assumptions?

Building a knowledge application with RAG that needs to be both accurate and production-reliable? Talk to Imperialis about RAG architecture design, evaluation frameworks, and enterprise-grade deployment for knowledge systems.
