Retrieval-Augmented Generation is one of the most practically useful patterns in applied AI. It lets you build question-answering systems grounded in your own content — documentation, policies, product catalogues, support history — without retraining a model. It also fails in specific, predictable ways that demos do not reveal. Understanding those failure modes before you deploy is the difference between a system that builds user trust and one that quietly erodes it.
The RAG Pipeline in Production
A production RAG system has more components than a prototype:
- Ingestion pipeline: Document loading → cleaning → chunking → embedding → vector store upsert. This runs continuously as your knowledge base changes.
- Query pipeline: User query → query transformation (optional) → embedding → vector retrieval → reranking (optional) → context assembly → LLM generation → response + source attribution.
- Evaluation layer: Retrieval metrics, generation quality metrics, latency tracking, and hallucination detection running continuously against a golden test set.
- Feedback loop: User feedback (thumbs up/down, corrections, escalations) feeding back into evaluation and triggering retrieval or generation improvement cycles.
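The query pipeline above can be sketched end-to-end in a few lines. Everything here is a stand-in for illustration: the `embed`, `retrieve`, and generation steps are toy implementations, where a real system would call an embedding model, a vector store, and an LLM respectively.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str

def embed(text: str) -> list[float]:
    # Stand-in embedding: a real system calls an embedding model here.
    vec = [0.0] * 8
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch) / 1000.0
    return vec

def retrieve(query_vec: list[float], store: list[Chunk], k: int = 3) -> list[Chunk]:
    # Stand-in similarity: a real system queries a vector store (cosine similarity).
    def score(chunk: Chunk) -> float:
        cv = embed(chunk.text)
        return -sum((a - b) ** 2 for a, b in zip(query_vec, cv))
    return sorted(store, key=score, reverse=True)[:k]

def answer(query: str, store: list[Chunk]) -> dict:
    chunks = retrieve(embed(query), store)
    context = "\n---\n".join(c.text for c in chunks)
    # Stand-in generation: a real system sends `context` + `query` to an LLM.
    response = f"[answer grounded in {len(chunks)} chunks]"
    return {"answer": response, "sources": [c.source for c in chunks]}
```

Note that source attribution comes from retrieval metadata (`c.source`), not from the model's output — a point that matters for the failure modes discussed later.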
Chunking Strategy: The Decision That Affects Everything
How you split documents into chunks is the single most impactful decision in RAG architecture. Poor chunking causes poor retrieval, which causes poor answers, which damages user trust. The key trade-off:
- Too large: Chunks contain multiple topics. The embedding represents an average of all of them, which reduces retrieval precision. The retrieved chunk also contains irrelevant context that confuses the model.
- Too small: Individual sentences lose context. A chunk that says "This applies from January 2025" is meaningless without knowing what "this" refers to.
Practical guidance by content type:
- Technical documentation: Chunk by section heading (H2/H3). Each section is a coherent unit of meaning. Include the parent section heading in every chunk's text for context.
- Legal/policy documents: Chunk by clause, not paragraph. Include the clause number and document title in every chunk.
- FAQ content: Keep question and answer together as a single chunk. Splitting them destroys retrieval quality.
- Conversational transcripts: Chunk by speaker turn or dialogue exchange — not by fixed token count.
- Product catalogues: One chunk per product, including all attributes. Sparse structured data often benefits from hybrid search (vector + BM25 keyword) rather than pure semantic retrieval.
Use overlapping chunks (10-20% overlap) for long-form narrative documents to avoid splitting key sentences across chunk boundaries. Implement a late-chunking or contextual chunking approach if your content is highly interconnected.
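A fixed-size overlapping splitter for long-form narrative text can be sketched as follows. The chunk size and overlap ratio are illustrative defaults, not recommendations — tune them per content type as described above.

```python
def chunk_with_overlap(words: list[str], chunk_size: int = 200,
                       overlap_ratio: float = 0.15) -> list[list[str]]:
    """Split a word/token list into overlapping chunks.

    overlap_ratio of 0.10-0.20 keeps key sentences from being
    split across chunk boundaries in narrative documents.
    """
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks
```

For structured content (sections, clauses, FAQs), split on the structural boundary first and only apply this kind of fixed-size splitting within units that exceed your chunk size.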
Retrieval Metrics You Should Be Tracking
Retrieval and generation failures are different problems requiring different fixes. Track them separately:
- Retrieval recall @ k: Of the questions where the answer exists in your knowledge base, what percentage have the correct source document in the top-k retrieved chunks? Low recall means your embedding or chunking needs improvement.
- Context precision: Of the chunks included in the context window, what percentage are actually relevant to the question? Low precision means you are feeding the model noise, which dilutes the signal and increases hallucination risk.
- Mean reciprocal rank (MRR): the average of 1/rank of the first relevant chunk across your test queries. If the most relevant chunk consistently ranks 3rd or 4th, a reranker can improve answer quality significantly.
- No-retrieval rate: What percentage of queries return zero relevant results? These queries should be handled gracefully (clarification request or escalation) rather than generating an answer with no grounding.
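Recall@k and MRR fall out directly from a golden set of (query, correct source) pairs. A minimal sketch — the dictionary shapes here are assumptions for illustration, not a prescribed format:

```python
def recall_at_k(golden: dict, retrieved: dict, k: int = 5) -> float:
    """golden: {query: correct_source}; retrieved: {query: ranked list of sources}."""
    hits = sum(1 for q, src in golden.items() if src in retrieved[q][:k])
    return hits / len(golden)

def mrr(golden: dict, retrieved: dict) -> float:
    """Mean of 1/rank of the correct source; 0 for queries where it never appears."""
    total = 0.0
    for q, src in golden.items():
        ranking = retrieved[q]
        if src in ranking:
            total += 1.0 / (ranking.index(src) + 1)
    return total / len(golden)
```

Running both over the same golden set after every ingestion or embedding change gives you a regression signal before users see the difference.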
The Failure Modes Users Actually See
When RAG fails, it typically fails in one of these ways:
- Confident wrong answers from irrelevant retrieval: The most damaging failure. The model receives loosely related chunks and generates a fluent, confident-sounding answer that is factually wrong for the user's question. Mitigation: reranking to filter low-relevance results; confidence thresholding; answer grounding checks before delivery.
- Partial answers: The correct answer is spread across multiple chunks, and retrieval only surfaces one of them. The model answers part of the question accurately and omits the rest. Mitigation: increase k in retrieval; use document-level retrieval as a supplement to chunk-level.
- Outdated answers: The knowledge base has not been updated, and the model retrieves a chunk that was accurate when ingested but is now out of date. Mitigation: include document metadata (last updated date) in chunks and surface it to users; implement chunk expiry or staleness warnings.
- Answer refusal on in-scope questions: The retrieval system fails to surface relevant chunks for a valid question, so the model says "I don't have information on that." That is the right behaviour given an empty context, but the user experiences it as a failure on a question your knowledge base can answer. Mitigation: query expansion, hybrid search, and query rewriting to bridge vocabulary gaps between user queries and document terminology.
- Source hallucination: The model cites a source that does not exist or attributes a claim to the wrong document. Mitigation: build citation from retrieval metadata before generation, not from the model's output; verify cited sources exist before displaying them.
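The source-hallucination mitigation above is mechanical: derive citations from what was actually retrieved and validate them against a known source registry. A sketch, assuming each retrieved chunk carries a `source` metadata field (an assumption about your chunk schema):

```python
def build_citations(retrieved_chunks: list[dict], known_sources: set[str]) -> list[str]:
    """Build citations from retrieval metadata, never from model output.

    Drops any source not present in the known registry, so a citation
    can only ever point at a document that actually exists.
    """
    citations = []
    for chunk in retrieved_chunks:
        src = chunk.get("source")
        if src in known_sources and src not in citations:
            citations.append(src)
    return citations
```

The key property is that the model's generated text never enters the citation path, so it cannot invent a source.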
Query Transformation to Bridge Vocabulary Gaps
User queries and document language often differ significantly. A user asks "how do I cancel?" but your policy document says "subscription termination procedure." Pure semantic search can miss this. Mitigations:
- Hypothetical Document Embeddings (HyDE): generate a hypothetical document that would answer the query, embed it, and use that embedding for retrieval rather than the raw query embedding
- Query expansion: generate 3-5 alternative phrasings of the user's question and run retrieval for each, then merge results
- Hybrid search: combine vector similarity with BM25 keyword search, weighting each based on the query type (keyword-heavy queries benefit more from BM25; abstract questions benefit more from semantic similarity)
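Query expansion and hybrid search both produce multiple ranked lists that must be merged into one. Reciprocal rank fusion is a common, embedding-free way to do this; a sketch (the constant 60 is the conventional RRF smoothing value):

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists into one ranking.

    Each document scores 1/(k + rank) per list it appears in, so items
    ranked well by multiple retrievers rise to the top regardless of
    the retrievers' incompatible raw scores.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities are not on comparable scales.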
Latency Budget for Customer-Facing RAG
User experience in conversational AI is sensitive to latency. Typical budget breakdown:
- Query embedding: 50-150ms (can be parallelised with retrieval setup)
- Vector retrieval (top-20): 20-80ms for well-optimised vector stores
- Reranking: 100-300ms with cross-encoder rerankers
- LLM generation (first token): 500-2000ms depending on model and context length
Use streaming for generation output so users see the first tokens quickly. Cache embeddings for repeated queries. Consider separate fast-path and slow-path retrieval — quick vector search first, with optional reranking only for ambiguous or high-stakes queries.
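The fast-path/slow-path idea can be reduced to a simple gate: pay the 100–300ms reranking cost only when the vector search result is ambiguous. A sketch — the `Hit` shape, the score margin, and the reranker interface are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hit:
    doc_id: str
    score: float  # vector similarity, assumed sorted descending

def retrieve_with_optional_rerank(hits: list["Hit"],
                                  rerank_fn: Callable[[list["Hit"]], list["Hit"]],
                                  margin: float = 0.1,
                                  k: int = 5) -> list["Hit"]:
    """Fast path: return vector-search results as-is when the winner is clear.

    Slow path: invoke the expensive reranker (e.g. a cross-encoder) only
    when the top two scores are within `margin` of each other.
    """
    if len(hits) >= 2 and hits[0].score - hits[1].score < margin:
        hits = rerank_fn(hits)  # slow path: ~100-300ms with a cross-encoder
    return hits[:k]
```

High-stakes query categories can force the slow path unconditionally; the margin gate is for the bulk of routine traffic.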
RAG systems built correctly are transformative for knowledge-intensive customer interactions. Built without attention to retrieval quality and failure handling, they erode trust faster than no AI at all. We build production AI systems as part of our AI and ML service. If you are moving a RAG prototype towards production, get in touch for a technical review of your current architecture.
