[{"data":1,"prerenderedAt":462},["ShallowReactive",2],{"article-alternates":3,"article-\u002Fen\u002Fai\u002Frag-production-retrieval-quality-over-cost":13},{"i18nKey":4,"paths":5},"ai-003-2026-06",{"de":6,"en":7,"es":8,"fr":9,"it":10,"ru":11,"tr":12},"\u002Fde\u002Fai\u002Frag-production-retrieval-qualitaet-vor-kosten","\u002Fen\u002Fai\u002Frag-production-retrieval-quality-over-cost","\u002Fes\u002Fai\u002Frag-production-calidad-recuperacion-primero","\u002Ffr\u002Fai\u002Fproduction-rag-retrieval-quality-before-cost","\u002Fit\u002Fai\u002Frag-production-retrieval-quality-first","\u002Fru\u002Fai\u002Frag-production-retrieval-quality-first","\u002Ftr\u002Fai\u002Fproductionda-rag-retrieval-kalitesi-costtan-once-gelir",{"_path":7,"_dir":14,"_draft":15,"_partial":15,"_locale":16,"title":17,"description":18,"publishedAt":19,"modifiedAt":19,"category":14,"i18nKey":4,"tags":20,"readingTime":26,"author":27,"body":28,"_type":456,"_id":457,"_source":458,"_file":459,"_stem":460,"_extension":461},"ai",false,"","RAG in Production: Retrieval Quality Comes Before Cost","Without proper embedding models, chunking strategy, and eval setup, your RAG system becomes a hallucination machine. Lessons from production experience.","2026-06-01",[21,22,23,24,25],"rag","embedding","retrieval","llm-eval","production-ai",8,"Roibase",{"type":29,"children":30,"toc":440},"root",[31,39,46,51,56,61,68,73,79,84,89,94,220,226,231,237,242,247,282,287,293,298,304,309,314,319,325,330,335,341,346,400,405,410,416,421,426,431,435],{"type":32,"tag":33,"props":34,"children":35},"element","p",{},[36],{"type":37,"value":38},"text","RAG systems face two fates after hitting production: either they shut down within three weeks due to hallucinations, or retrieval quality reaches 90+ F1 and they become business-critical pipelines. The difference lies hidden in embedding selection, chunking strategy, and eval setup. 
Cost optimization is secondary: until you reliably retrieve the right document, cheaper models just produce expensive errors.",{"type":32,"tag":40,"props":41,"children":43},"h2",{"id":42},"embedding-model-alignment-matters-more-than-dimension",[44],{"type":37,"value":45},"Embedding Model: Alignment Matters More Than Dimension",{"type":32,"tag":33,"props":47,"children":48},{},[49],{"type":37,"value":50},"The reflexive first assumption in embedding selection is \"a larger model always embeds better.\" But text-embedding-3-large (3072 dim) isn't universally superior to text-embedding-3-small (1536 dim). MTEB benchmarks measure against general corpora; if your domain is finance, medicine, or e-commerce, those scores can mislead.",{"type":32,"tag":33,"props":52,"children":53},{},[54],{"type":37,"value":55},"In production, we observed that a 768-dimensional domain-specific model (sentence-transformers\u002Fall-mpnet-base-v2 fine-tuned on domain data) delivered 12% better recall@10 than a 3072-dimensional general model. The reason is straightforward: the general model's embedding space doesn't capture domain jargon. The semantic similarity between \"Conversion rate optimization\" and \"CRO\" is 0.68 in the general model, but 0.91 in the domain-tuned one.",{"type":32,"tag":33,"props":57,"children":58},{},[59],{"type":37,"value":60},"The dimension tradeoff is clear: the 3072-dim index takes 4.2GB, the 768-dim index 1.1GB. Query latency is 47ms and 18ms respectively (FAISS HNSW, m=16). If the smaller model's retrieval recall loss is under 5%, it wins on both cost and speed. 
Making this decision without measurement is engineering by speculation.",{"type":32,"tag":62,"props":63,"children":65},"h3",{"id":64},"fine-tuning-decision",[66],{"type":37,"value":67},"Fine-Tuning Decision",{"type":32,"tag":33,"props":69,"children":70},{},[71],{"type":37,"value":72},"Embedding fine-tuning becomes mandatory in two cases: (1) domain vocabulary is highly specific (medical terms, crypto token names), (2) query-document pair distribution is asymmetric (short questions, long documents). The OpenAI Embeddings API doesn't offer fine-tuning; use sentence-transformers or Cohere embed-v3. Start with 500-1000 labeled pairs; beyond that, additional pairs yield only marginal gains.",{"type":32,"tag":40,"props":74,"children":76},{"id":75},"chunking-semantics-over-size",[77],{"type":37,"value":78},"Chunking: Semantics Over Size",{"type":32,"tag":33,"props":80,"children":81},{},[82],{"type":37,"value":83},"There's no rule that \"chunk size of 512 tokens is good.\" We tested three strategies: (1) fixed 512 tokens, (2) markdown header-based (cut at H2\u002FH3 boundaries), (3) semantic chunking (LLM reads paragraph context, splits at semantic transitions). Result: markdown-based chunking delivered 18% better NDCG@5 than fixed chunking but took 2.3x longer to build indexes.",{"type":32,"tag":33,"props":85,"children":86},{},[87],{"type":37,"value":88},"Fixed chunking's problem is that it cuts mid-sentence. \"If you integrate server-side tracking with first-party data architecture...\" gets cut at token 510, and the next chunk starts with \"...integrate, attribution accuracy improves\", so the context is lost. The retriever finds this chunk for \"attribution\" queries, but the LLM can't generate a response because the context is missing. 
That's where hallucination begins.",{"type":32,"tag":33,"props":90,"children":91},{},[92],{"type":37,"value":93},"Semantic chunking (not LangChain's RecursiveCharacterTextSplitter, but asking gpt-4o-mini \"does this paragraph transition to a new idea?\") works better than fixed chunking but costs more: chunking a 10K-page knowledge base cost $47 ($0.15\u002F1M input tokens). The tradeoff matters: index building is a one-time cost, retrieval quality is continuous value. We chose semantic, but if your document set updates frequently (weekly), you might revert to fixed chunking.",{"type":32,"tag":95,"props":96,"children":97},"table",{},[98,132],{"type":32,"tag":99,"props":100,"children":101},"thead",{},[102],{"type":32,"tag":103,"props":104,"children":105},"tr",{},[106,112,117,122,127],{"type":32,"tag":107,"props":108,"children":109},"th",{},[110],{"type":37,"value":111},"Strategy",{"type":32,"tag":107,"props":113,"children":114},{},[115],{"type":37,"value":116},"Avg Chunk Size",{"type":32,"tag":107,"props":118,"children":119},{},[120],{"type":37,"value":121},"NDCG@5",{"type":32,"tag":107,"props":123,"children":124},{},[125],{"type":37,"value":126},"Build Time (10K docs)",{"type":32,"tag":107,"props":128,"children":129},{},[130],{"type":37,"value":131},"Cost",{"type":32,"tag":133,"props":134,"children":135},"tbody",{},[136,165,192],{"type":32,"tag":103,"props":137,"children":138},{},[139,145,150,155,160],{"type":32,"tag":140,"props":141,"children":142},"td",{},[143],{"type":37,"value":144},"Fixed 512",{"type":32,"tag":140,"props":146,"children":147},{},[148],{"type":37,"value":149},"489 tokens",{"type":32,"tag":140,"props":151,"children":152},{},[153],{"type":37,"value":154},"0.71",{"type":32,"tag":140,"props":156,"children":157},{},[158],{"type":37,"value":159},"4 
min",{"type":32,"tag":140,"props":161,"children":162},{},[163],{"type":37,"value":164},"$0",{"type":32,"tag":103,"props":166,"children":167},{},[168,173,178,183,188],{"type":32,"tag":140,"props":169,"children":170},{},[171],{"type":37,"value":172},"Markdown-based",{"type":32,"tag":140,"props":174,"children":175},{},[176],{"type":37,"value":177},"680 tokens",{"type":32,"tag":140,"props":179,"children":180},{},[181],{"type":37,"value":182},"0.84",{"type":32,"tag":140,"props":184,"children":185},{},[186],{"type":37,"value":187},"9 min",{"type":32,"tag":140,"props":189,"children":190},{},[191],{"type":37,"value":164},{"type":32,"tag":103,"props":193,"children":194},{},[195,200,205,210,215],{"type":32,"tag":140,"props":196,"children":197},{},[198],{"type":37,"value":199},"Semantic (LLM)",{"type":32,"tag":140,"props":201,"children":202},{},[203],{"type":37,"value":204},"520 tokens",{"type":32,"tag":140,"props":206,"children":207},{},[208],{"type":37,"value":209},"0.81",{"type":32,"tag":140,"props":211,"children":212},{},[213],{"type":37,"value":214},"22 min",{"type":32,"tag":140,"props":216,"children":217},{},[218],{"type":37,"value":219},"$47",{"type":32,"tag":40,"props":221,"children":223},{"id":222},"overlap-strategy",[224],{"type":37,"value":225},"Overlap Strategy",{"type":32,"tag":33,"props":227,"children":228},{},[229],{"type":37,"value":230},"Adding overlap between chunks improves retrieval recall—but inflates index size 1.4-1.8x. With 50-token overlap, we saw 6% recall gain (recall@10: 0.78 → 0.83). You can activate overlap selectively for long documents (>2000 tokens) and disable for short content—conditional overlap logic.",{"type":32,"tag":40,"props":232,"children":234},{"id":233},"eval-setup-offline-metric-online-ab",[235],{"type":37,"value":236},"Eval Setup: Offline Metric → Online A\u002FB",{"type":32,"tag":33,"props":238,"children":239},{},[240],{"type":37,"value":241},"Building an eval pipeline before going to production is mandatory. 
\"The LLM output looks good\" isn't enough: retrieval precision\u002Frecall and LLM factuality must be measured separately.",{"type":32,"tag":33,"props":243,"children":244},{},[245],{"type":37,"value":246},"We measure two layers:",{"type":32,"tag":248,"props":249,"children":250},"ol",{},[251,272],{"type":32,"tag":252,"props":253,"children":254},"li",{},[255,261,263,270],{"type":32,"tag":256,"props":257,"children":258},"strong",{},[259],{"type":37,"value":260},"Retrieval layer:",{"type":37,"value":262}," Precision@k, Recall@k, NDCG@k, MRR. Ground truth: manually labeled query-document pairs (320 in our case). The Ragas library's ",{"type":32,"tag":264,"props":265,"children":267},"code",{"className":266},[],[268],{"type":37,"value":269},"context_precision",{"type":37,"value":271}," metric has a non-LLM variant, which suits fast iteration.",{"type":32,"tag":252,"props":273,"children":274},{},[275,280],{"type":32,"tag":256,"props":276,"children":277},{},[278],{"type":37,"value":279},"Generation layer:",{"type":37,"value":281}," Factual consistency (entailment between document and output), hallucination rate (how often the LLM goes beyond the document), citation accuracy (how correctly the LLM references its sources). We use the LLM-as-judge pattern, asking gpt-4o \"is this answer grounded in the document?\", with a 0.89 agreement rate against human eval.",{"type":32,"tag":33,"props":283,"children":284},{},[285],{"type":37,"value":286},"Offline eval runs daily in CI\u002FCD. Testing a new chunking strategy, embedding model, or reranker? These metrics must be green before commit. The online A\u002FB test is separate: we route 10% of traffic to the new RAG version and monitor user feedback plus session metrics (task completion, query reformulation rate). 
Even if offline NDCG improves by 0.02, online task completion might not change; in that case, we skip deployment.",{"type":32,"tag":62,"props":288,"children":290},{"id":289},"llm-as-judge-reliability",[291],{"type":37,"value":292},"LLM-as-Judge Reliability",{"type":32,"tag":33,"props":294,"children":295},{},[296],{"type":37,"value":297},"Don't blindly trust LLM-as-judge. The GPT-4o judge flagged grounded answers as hallucinations 6% of the time (false positives) and missed real hallucinations 4% of the time (false negatives). For critical use cases, human-in-the-loop eval is essential: we randomly sample 5% of judged outputs and have humans verify them. The calibration score is computed against this subset. If calibration drops below 0.85, we revise the judge prompt.",{"type":32,"tag":40,"props":299,"children":301},{"id":300},"reranker-the-power-of-a-second-pass",[302],{"type":37,"value":303},"Reranker: The Power of a Second Pass",{"type":32,"tag":33,"props":305,"children":306},{},[307],{"type":37,"value":308},"The initial retrieval fetches 20-50 chunks (recall-focused); the reranker narrows them to 3-5 (precision-focused). Cohere rerank-v3 delivered a 14% precision gain (P@5: 0.68 → 0.78). Cost: $2 per 1M reranked tokens (10x more than embedding), but feeding the LLM 5 chunks instead of 50 reduces both token use and hallucination risk.",{"type":32,"tag":33,"props":310,"children":311},{},[312],{"type":37,"value":313},"The reranker's tradeoff is latency: embedding search takes 18ms; adding rerank brings it to 95ms. An async pipeline tolerates this: retrieval and rerank run in the background while the user waits, and the LLM starts streaming within 400-500ms in total. Running synchronously degrades user experience.",{"type":32,"tag":33,"props":315,"children":316},{},[317],{"type":37,"value":318},"RAG without reranking assumes \"top-k embedding results are correct.\" This holds only if lexical overlap between query and chunk is high. 
On semantic queries (e.g., \"How do I link first-party data architecture with server-side measurement?\"), embedding search retrieves 4 irrelevant chunks in the top 10. The reranker's cross-attention cleans up this noise. Production RAG without reranking is risky: citation accuracy drops 18%.",{"type":32,"tag":40,"props":320,"children":322},{"id":321},"hybrid-search-bm25-embedding",[323],{"type":37,"value":324},"Hybrid Search: BM25 + Embedding",{"type":32,"tag":33,"props":326,"children":327},{},[328],{"type":37,"value":329},"Embedding-only retrieval weakens in two scenarios: (1) exact-match searches (brand names, product codes), (2) rare terms (underrepresented in embedding space). BM25 (keyword-based) fills this gap. In Weaviate or Qdrant, hybrid search combines the two, e.g. 0.7 embedding weight + 0.3 BM25 weight. Recall@10: embedding-only 0.76, hybrid 0.83.",{"type":32,"tag":33,"props":331,"children":332},{},[333],{"type":37,"value":334},"BM25 indexes are 5-8x smaller than embedding indexes (inverted index structure). There is no latency penalty, since the two searches run in parallel. The only cost in a hybrid setup is query planning: finding which weight ratio suits which query type, tested via A\u002FB. In our case, general queries use 0.8 embedding weight; those mentioning brands\u002Fproducts use 0.5.",{"type":32,"tag":40,"props":336,"children":338},{"id":337},"monitoring-in-production",[339],{"type":37,"value":340},"Monitoring in Production",{"type":32,"tag":33,"props":342,"children":343},{},[344],{"type":37,"value":345},"60% of the work in a RAG deployment is monitoring: preventing silent system degradation. 
Metrics we track:",{"type":32,"tag":347,"props":348,"children":349},"ul",{},[350,360,370,380,390],{"type":32,"tag":252,"props":351,"children":352},{},[353,358],{"type":32,"tag":256,"props":354,"children":355},{},[356],{"type":37,"value":357},"Retrieval coverage:",{"type":37,"value":359}," Query-to-document match rate (target >95%)",{"type":32,"tag":252,"props":361,"children":362},{},[363,368],{"type":32,"tag":256,"props":364,"children":365},{},[366],{"type":37,"value":367},"Avg context relevance:",{"type":37,"value":369}," What percentage of chunks fed to the LLM are truly relevant (target >0.8)",{"type":32,"tag":252,"props":371,"children":372},{},[373,378],{"type":32,"tag":256,"props":374,"children":375},{},[376],{"type":37,"value":377},"Hallucination rate:",{"type":37,"value":379}," How often LLM output ventures beyond documents (target \u003C5%)",{"type":32,"tag":252,"props":381,"children":382},{},[383,388],{"type":32,"tag":256,"props":384,"children":385},{},[386],{"type":37,"value":387},"Latency p95:",{"type":37,"value":389}," 95th percentile query completion time (target \u003C800ms)",{"type":32,"tag":252,"props":391,"children":392},{},[393,398],{"type":32,"tag":256,"props":394,"children":395},{},[396],{"type":37,"value":397},"Cost per query:",{"type":37,"value":399}," Embedding + rerank + LLM (target \u003C$0.02)",{"type":32,"tag":33,"props":401,"children":402},{},[403],{"type":37,"value":404},"These metrics push to Datadog; threshold breaches trigger Slack alerts. If retrieval coverage drops below 92% for two days, there's a gap in the knowledge base—content team gets notified. Rising hallucination rate means LLM prompt or chunk size needs revision. Latency spikes warrant vector database sharding review.",{"type":32,"tag":33,"props":406,"children":407},{},[408],{"type":37,"value":409},"Connecting RAG metrics to business outcomes is critical—does better retrieval quality also lift user satisfaction survey scores, or just inflate technical metrics? 
Correlation analysis shows the link.",{"type":32,"tag":40,"props":411,"children":413},{"id":412},"cost-vs-quality-balance",[414],{"type":37,"value":415},"Cost vs. Quality Balance",{"type":32,"tag":33,"props":417,"children":418},{},[419],{"type":37,"value":420},"Monthly cost for production RAG: 1M queries, avg 3 chunks per retrieval, gpt-4o-mini generation ≈ $420 (embedding $80, rerank $40, LLM $300). Dropping the reranker brings this to $380, but the hallucination rate jumps from 5% to 11%, resulting in more support tickets and an indirect cost of $600+.",{"type":32,"tag":33,"props":422,"children":423},{},[424],{"type":37,"value":425},"The right way to cut cost: (1) caching layer (same query within 24 hours comes from cache, 23% of queries repeat), (2) smaller embedding model (domain-tuned 768 dim), (3) selective reranking (skip reranking for non-critical queries). These drop it to $280 with \u003C2% quality loss.",{"type":32,"tag":33,"props":427,"children":428},{},[429],{"type":37,"value":430},"The wrong approach: replacing embedding with keyword search, LLM with rule-based templates. This produces a system you can't call \"AI\": retrieval precision drops to 40%. Cost optimization must not sabotage retrieval quality.",{"type":32,"tag":432,"props":433,"children":434},"hr",{},[],{"type":32,"tag":33,"props":436,"children":437},{},[438],{"type":37,"value":439},"Shipping RAG to production is more than model selection: it requires eval discipline, monitoring rigor, and iterative refinement. You can trim embedding dimensions and cut latency, but if recall suffers, the LLM hallucinates and users lose trust. First, push retrieval quality to 0.85+ F1, then optimize cost. 
Otherwise, you've built a cheap hallucination machine.",{"title":16,"searchDepth":441,"depth":441,"links":442},3,[443,447,448,449,452,453,454,455],{"id":42,"depth":444,"text":45,"children":445},2,[446],{"id":64,"depth":441,"text":67},{"id":75,"depth":444,"text":78},{"id":222,"depth":444,"text":225},{"id":233,"depth":444,"text":236,"children":450},[451],{"id":289,"depth":441,"text":292},{"id":300,"depth":444,"text":303},{"id":321,"depth":444,"text":324},{"id":337,"depth":444,"text":340},{"id":412,"depth":444,"text":415},"markdown","content:en:ai:rag-production-retrieval-quality-over-cost.md","content","en\u002Fai\u002Frag-production-retrieval-quality-over-cost.md","en\u002Fai\u002Frag-production-retrieval-quality-over-cost","md",1782079495820]