RAG is easy to demo and hard to ship. After building FabrizioAI and teaching 1,300+ developers, I've collected the failure modes nobody warns you about.
Everyone's first RAG demo looks the same: a PDF, a vector store, a retriever, a generative model, and a chat interface that answers questions with appropriate hedging. It works beautifully. Then you try to deploy it.
THE RETRIEVAL PROBLEM
Most tutorials chunk documents at fixed token lengths with 20% overlap and call it done. This works for clean, well-structured documents. It fails for everything else - mixed-format PDFs, scanned documents, code-heavy content, or any domain where semantic meaning crosses artificial chunk boundaries.
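To make the fixed-length approach concrete, here is a minimal sketch of it. It assumes "tokens" can be approximated by whitespace-separated words (a real pipeline would use the embedding model's tokenizer), and the function name chunkFixed and the sample sentence are made up for illustration.

```javascript
// Fixed-size chunking with overlap.
// Assumption: tokens are approximated by whitespace-separated words.
function chunkFixed(text, chunkSize = 200, overlapRatio = 0.2) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const overlap = Math.floor(chunkSize * overlapRatio);
  // Guard against a zero or negative step when the overlap is too large.
  const step = Math.max(1, chunkSize - overlap);
  const chunks = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

// Hypothetical sample: one instruction that does not fit in a single window.
const sample =
  "Do not restart the primary node unless the replica has fully caught up with the write-ahead log";
const chunks = chunkFixed(sample, 10, 0.2);
console.log(chunks);
```

With a window of 10 words and 20% overlap, the first chunk ends at "unless the replica has" and the second starts at "replica has fully caught up". Neither chunk carries the full instruction, which is the boundary problem in miniature.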
The chunk is not the document. The chunk is your approximation of what the model needs. Get that approximation wrong and no amount of model quality will save you.
What I've found actually works is a hierarchy of chunking strategies: document-level summaries at the top, section-level chunks in the middle, and sentence-level precision at the bottom. Retrieval should operate across all three levels simultaneously, with a re-ranking pass to surface the most relevant combination.
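One way to picture retrieval across the three levels is the sketch below. It is not the FabrizioAI implementation: the index entries, the function names, and the scoring are all hypothetical. In particular, overlapScore is a toy term-overlap measure standing in for embedding similarity, and the "re-ranking pass" is only a global sort where a real system would likely call a dedicated re-ranker.

```javascript
// Toy relevance score: fraction of query terms that appear in the text.
// Assumption: this stands in for a real embedding similarity search.
function overlapScore(query, text) {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  const words = new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
  if (terms.length === 0) return 0;
  return terms.filter((t) => words.has(t)).length / terms.length;
}

// Hypothetical index: every entry records the level it was chunked at.
const index = [
  { level: "summary", docId: "runbook", text: "Runbook for database failover and replica recovery." },
  { level: "section", docId: "runbook", text: "Failover procedure: promote the replica, then redirect writes to it." },
  { level: "sentence", docId: "runbook", text: "Promote the replica only after replication lag is zero." },
  { level: "summary", docId: "handbook", text: "Employee handbook covering holidays and expenses." },
  { level: "sentence", docId: "handbook", text: "Expenses are reimbursed monthly." },
];

// Take the top candidates from each level, then re-rank the pooled set.
function retrieveHierarchical(query, entries, kPerLevel = 2, finalK = 3) {
  const levels = ["summary", "section", "sentence"];
  const pooled = levels.flatMap((level) =>
    entries
      .filter((e) => e.level === level)
      .map((e) => ({ ...e, score: overlapScore(query, e.text) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, kPerLevel)
  );
  // Stand-in re-ranking pass: drop zero-score candidates and sort globally.
  return pooled
    .filter((c) => c.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, finalK);
}

const hits = retrieveHierarchical("when to promote the replica", index);
console.log(hits.map((h) => `${h.level}: ${h.text}`));
```

The point of pooling before re-ranking is that the final context can mix levels: here the section chunk ranks first, the precise sentence second, and the document summary third, all from the same document.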
CONTEXT POISONING
When your retrieval returns bad chunks, the model doesn't reject them - it uses them. This is context poisoning, and it produces answers that are confidently wrong in ways that are very difficult to detect without ground-truth evaluation.
// Example: retrieval drift over time
const results = await retriever.invoke(query);
// Without relevance scoring, stale/irrelevant chunks
// will silently poison your context window
const scored = results.filter(r => r.score > THRESHOLD);
WHAT PRODUCTION ACTUALLY REQUIRES
Production RAG needs three things that tutorials skip entirely:
- Retrieval evaluation infrastructure - a way to measure whether what you retrieved was actually relevant, before it reaches the model
- Answer faithfulness monitoring - automated checks that the generated answer is grounded in the retrieved context, not hallucinated
- Latency budgets per query type - not all queries need the same pipeline depth; simple factual lookups shouldn't use the same retrieval chain as complex multi-hop reasoning
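For the first item, the simplest starting point is probably precision and recall over the top k results, measured against a small hand-labeled set. The sketch below assumes you have such labels; the chunk ids, the query, and the function names are hypothetical.

```javascript
// Precision@k: share of the top-k retrieved chunk ids that are labeled relevant.
// If fewer than k chunks were retrieved, this divides by the number retrieved.
function precisionAtK(retrievedIds, relevantIds, k) {
  const topK = retrievedIds.slice(0, k);
  if (topK.length === 0) return 0;
  const relevant = new Set(relevantIds);
  return topK.filter((id) => relevant.has(id)).length / topK.length;
}

// Recall@k: share of the labeled relevant chunk ids that appear in the top k.
function recallAtK(retrievedIds, relevantIds, k) {
  if (relevantIds.length === 0) return 0;
  const topK = new Set(retrievedIds.slice(0, k));
  return relevantIds.filter((id) => topK.has(id)).length / relevantIds.length;
}

// Hypothetical labeled case: ids and query are invented for illustration.
const evalCase = {
  query: "how do I rotate API keys?",
  retrieved: ["c7", "c2", "c9", "c4"], // retriever output, best first
  relevant: ["c2", "c4", "c5"], // hand-labeled ground truth
};

const p3 = precisionAtK(evalCase.retrieved, evalCase.relevant, 3);
const r3 = recallAtK(evalCase.retrieved, evalCase.relevant, 3);
console.log({ p3, r3 });
```

In this example only one of the top three chunks is relevant and only one of the three relevant chunks was found, so both numbers come out at one third. Tracked per query type over time, numbers like these flag retrieval problems before they show up as wrong answers.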
Building the FabrizioAI chatbot taught me that the ETL pipeline - how you ingest, chunk, enrich, and index the source content - is more important than the retrieval strategy itself. Garbage in, garbage retrieved.
FURTHER READING
For the theory: the original RAG paper by Lewis et al. (2020), and the RAGAS evaluation framework. For the practical: your own production logs, which will tell you more in a week than any paper will in a year.