[{"data":1,"prerenderedAt":1129},["ShallowReactive",2],{"article-alternates":3,"article-\u002Fru\u002Fai\u002Fproduction-rag-retrieval-quality-priority":13},{"i18nKey":4,"paths":5},"ai-003-2026-06",{"de":6,"en":7,"es":8,"fr":9,"it":10,"ru":11,"tr":12},"\u002Fde\u002Fai\u002Frag-production-retrieval-qualitaet-vor-kosten","\u002Fen\u002Fai\u002Frag-production-retrieval-quality-over-cost","\u002Fes\u002Fai\u002Frag-production-calidad-recuperacion-primero","\u002Ffr\u002Fai\u002Fproduction-rag-retrieval-quality-before-cost","\u002Fit\u002Fai\u002Frag-production-retrieval-quality-first","\u002Fru\u002Fai\u002Frag-production-retrieval-quality-first","\u002Ftr\u002Fai\u002Fproductionda-rag-retrieval-kalitesi-costtan-once-gelir",{"_path":14,"_dir":15,"_draft":16,"_partial":16,"_locale":17,"title":18,"description":19,"publishedAt":20,"modifiedAt":20,"category":15,"i18nKey":4,"tags":21,"readingTime":27,"author":28,"body":29,"_type":1123,"_id":1124,"_source":1125,"_file":1126,"_stem":1127,"_extension":1128},"\u002Fru\u002Fai\u002Fproduction-rag-retrieval-quality-priority","ai",false,"","Production RAG: Retrieval Quality Comes Before Cost","How embedding models, chunking strategies, and evaluation setup determine retrieval quality in production RAG systems. Quality first, cost optimization second.","2026-06-20",[22,23,24,25,26],"rag","retrieval","embedding-models","chunking-strategy","llm-eval",8,"Roibase",{"type":30,"children":31,"toc":1111},"root",[32,40,47,52,57,62,69,156,162,167,172,208,213,218,694,699,705,710,718,737,745,763,768,773,781,786,792,797,809,814,832,837,843,859,983,988,994,999,1032,1037,1042,1048,1053,1086,1091,1096,1100,1105],{"type":33,"tag":34,"props":35,"children":36},"element","p",{},[37],{"type":38,"value":39},"text","When setting up RAG (Retrieval-Augmented Generation) in production, most teams start with cost optimization. 
A cheap embedding model is selected first, chunk size is locked at 512 tokens, and then comes the question: \"Why is it hallucinating?\" You need to reverse this approach: retrieval quality is the system's backbone, and cost is a variable to optimize in the second iteration. In 2026, RAG is no longer a proof-of-concept: production systems process millions of queries daily and users say \"show me the source.\" Bad retrieval fails before the LLM prompt even matters.",{"type":33,"tag":41,"props":42,"children":44},"h2",{"id":43},"embedding-model-the-dimensionality-quality-tradeoff-isnt-parametric",[45],{"type":38,"value":46},"Embedding Model: The Dimensionality-Quality Tradeoff Isn't Parametric",{"type":33,"tag":34,"props":48,"children":49},{},[50],{"type":38,"value":51},"Reducing embedding dimensions cuts retrieval latency but sacrifices search precision. text-embedding-ada-002 is fixed at 1536 dimensions; text-embedding-3-small defaults to 1536 and can be shortened (for example to 512) through the API's dimensions parameter. Pick a smaller dimension and vectors from different semantic domains start to overlap: the distance between \"user authentication\" and \"user onboarding\" compresses.",{"type":33,"tag":34,"props":53,"children":54},{},[55],{"type":38,"value":56},"In production, we first built a test pipeline: 200 real user queries paired with ground-truth documents. We measured each model by retrieval@5 and retrieval@10. Between ada-002 (1536 dim) and embedding-3-small (1536 dim), there was an 18% latency difference but zero quality difference. Cutting embedding-3-small to 768 dimensions improved latency by 32%, but retrieval@5 dropped from 91% to 84%. A 7-point loss means 7 more out of every 100 queries get the wrong context in production. The cost\u002Flatency gain doesn't justify this loss.",{"type":33,"tag":34,"props":58,"children":59},{},[60],{"type":38,"value":61},"Alternative: domain-specific fine-tuning. You can fine-tune Voyage AI or Cohere embed models on your own corpus. After 50k labeled examples + 2 weeks of iteration, retrieval@10 jumped from 91% to 96%. 
Fine-tuning costs around $4k but query cost stays the same — as volume grows, marginal gains multiply. Instead of cost-optimizing with a generic model, gain quality with a fine-tuned one, then reduce cost with cache + batch mechanisms.",{"type":33,"tag":63,"props":64,"children":66},"h3",{"id":65},"maturity-index-where-are-you-in-embedding-selection",[67],{"type":38,"value":68},"Maturity Index: Where Are You in Embedding Selection?",{"type":33,"tag":70,"props":71,"children":72},"table",{},[73,97],{"type":33,"tag":74,"props":75,"children":76},"thead",{},[77],{"type":33,"tag":78,"props":79,"children":80},"tr",{},[81,87,92],{"type":33,"tag":82,"props":83,"children":84},"th",{},[85],{"type":38,"value":86},"Stage",{"type":33,"tag":82,"props":88,"children":89},{},[90],{"type":38,"value":91},"Model Strategy",{"type":33,"tag":82,"props":93,"children":94},{},[95],{"type":38,"value":96},"Metric Target",{"type":33,"tag":98,"props":99,"children":100},"tbody",{},[101,120,138],{"type":33,"tag":78,"props":102,"children":103},{},[104,110,115],{"type":33,"tag":105,"props":106,"children":107},"td",{},[108],{"type":38,"value":109},"MVP (0–10k queries\u002Fday)",{"type":33,"tag":105,"props":111,"children":112},{},[113],{"type":38,"value":114},"OpenAI ada-002 default",{"type":33,"tag":105,"props":116,"children":117},{},[118],{"type":38,"value":119},"Retrieval@5 > 80%",{"type":33,"tag":78,"props":121,"children":122},{},[123,128,133],{"type":33,"tag":105,"props":124,"children":125},{},[126],{"type":38,"value":127},"Scale (10k–100k\u002Fday)",{"type":33,"tag":105,"props":129,"children":130},{},[131],{"type":38,"value":132},"embedding-3-small 1536 dim",{"type":33,"tag":105,"props":134,"children":135},{},[136],{"type":38,"value":137},"Retrieval@5 > 85%, p95 latency \u003C 200ms",{"type":33,"tag":78,"props":139,"children":140},{},[141,146,151],{"type":33,"tag":105,"props":142,"children":143},{},[144],{"type":38,"value":145},"Optimized 
(100k+\u002Fday)",{"type":33,"tag":105,"props":147,"children":148},{},[149],{"type":38,"value":150},"Fine-tuned Voyage\u002FCohere",{"type":33,"tag":105,"props":152,"children":153},{},[154],{"type":38,"value":155},"Retrieval@10 > 93%, batch processing",{"type":33,"tag":41,"props":157,"children":159},{"id":158},"chunking-strategy-semantic-boundaries-not-fixed-tokens",[160],{"type":38,"value":161},"Chunking Strategy: Semantic Boundaries, Not Fixed Tokens",{"type":33,"tag":34,"props":163,"children":164},{},[165],{"type":38,"value":166},"Everyone treats 512-token chunks as standard, but that's the historical LLM context window limit, not the optimal point for retrieval quality. Chunks too small lose context; too large introduce noise into embeddings. Most teams chunk by markdown headers or paragraphs, but the real question: does your chunking preserve the document's semantic structure?",{"type":33,"tag":34,"props":168,"children":169},{},[170],{"type":38,"value":171},"We tested this strategy in our system:",{"type":33,"tag":173,"props":174,"children":175},"ol",{},[176,188,198],{"type":33,"tag":177,"props":178,"children":179},"li",{},[180,186],{"type":33,"tag":181,"props":182,"children":183},"strong",{},[184],{"type":38,"value":185},"Fixed 512 tokens",{"type":38,"value":187}," — baseline. Retrieval@5: 82%.",{"type":33,"tag":177,"props":189,"children":190},{},[191,196],{"type":33,"tag":181,"props":192,"children":193},{},[194],{"type":38,"value":195},"Markdown heading split",{"type":38,"value":197}," — chunk at H2\u002FH3 boundaries. Retrieval@5: 87% (+5 points). No latency change.",{"type":33,"tag":177,"props":199,"children":200},{},[201,206],{"type":33,"tag":181,"props":202,"children":203},{},[204],{"type":38,"value":205},"Semantic chunking",{"type":38,"value":207}," (sentence-transformers similarity instead of LangChain's RecursiveCharacterTextSplitter) — new chunk when similarity drops. Retrieval@5: 91% (+9 points). 
Latency rose 15%, but \"relevant info not found\" errors fell 22%.",{"type":33,"tag":34,"props":209,"children":210},{},[211],{"type":38,"value":212},"With semantic chunking, we learned that the overlap ratio is critical. A 10% overlap (the last 50 tokens repeat in the next chunk) lifted retrieval@10 from 91% to 94%. That is because information cut at a boundary (e.g., \"this metric grew 18% in Q4\" split across chunks) stays intact in at least one chunk thanks to the overlap.",{"type":33,"tag":34,"props":214,"children":215},{},[216],{"type":38,"value":217},"Code example (Python), simplified to sentence packing with overlap; the similarity-based boundary detection is left out:",{"type":33,"tag":219,"props":220,"children":224},"pre",{"code":221,"language":222,"meta":17,"className":223,"style":17},"from langchain.text_splitter import RecursiveCharacterTextSplitter\nfrom sentence_transformers import SentenceTransformer\n\nmodel = SentenceTransformer('all-MiniLM-L6-v2')\n\ndef semantic_chunk(text, max_chunk_size=600, overlap=0.1):\n    sentences = text.split('. ')\n    chunks, current = [], []\n    \n    for sent in sentences:\n        current.append(sent)\n        chunk_text = '. '.join(current)\n        \n        if len(chunk_text.split()) > max_chunk_size:\n            chunks.append(chunk_text)\n            overlap_size = int(len(current) * overlap)\n            current = current[-overlap_size:] if overlap_size > 0 else []\n    \n    if current:\n        chunks.append('. 
'.join(current))\n    \n    return chunks\n","python","language-python shiki shiki-themes github-dark",[225],{"type":33,"tag":226,"props":227,"children":228},"code",{"__ignoreMap":17},[229,257,279,289,319,327,376,403,420,429,453,462,485,494,523,532,575,632,640,654,672,680],{"type":33,"tag":230,"props":231,"children":234},"span",{"class":232,"line":233},"line",1,[235,241,247,252],{"type":33,"tag":230,"props":236,"children":238},{"style":237},"--shiki-default:#F97583",[239],{"type":38,"value":240},"from",{"type":33,"tag":230,"props":242,"children":244},{"style":243},"--shiki-default:#E1E4E8",[245],{"type":38,"value":246}," langchain.text_splitter ",{"type":33,"tag":230,"props":248,"children":249},{"style":237},[250],{"type":38,"value":251},"import",{"type":33,"tag":230,"props":253,"children":254},{"style":243},[255],{"type":38,"value":256}," RecursiveCharacterTextSplitter\n",{"type":33,"tag":230,"props":258,"children":260},{"class":232,"line":259},2,[261,265,270,274],{"type":33,"tag":230,"props":262,"children":263},{"style":237},[264],{"type":38,"value":240},{"type":33,"tag":230,"props":266,"children":267},{"style":243},[268],{"type":38,"value":269}," sentence_transformers ",{"type":33,"tag":230,"props":271,"children":272},{"style":237},[273],{"type":38,"value":251},{"type":33,"tag":230,"props":275,"children":276},{"style":243},[277],{"type":38,"value":278}," SentenceTransformer\n",{"type":33,"tag":230,"props":280,"children":282},{"class":232,"line":281},3,[283],{"type":33,"tag":230,"props":284,"children":286},{"emptyLinePlaceholder":285},true,[287],{"type":38,"value":288},"\n",{"type":33,"tag":230,"props":290,"children":292},{"class":232,"line":291},4,[293,298,303,308,314],{"type":33,"tag":230,"props":294,"children":295},{"style":243},[296],{"type":38,"value":297},"model ",{"type":33,"tag":230,"props":299,"children":300},{"style":237},[301],{"type":38,"value":302},"=",{"type":33,"tag":230,"props":304,"children":305},{"style":243},[306],{"type":38,"value":307}," 
SentenceTransformer(",{"type":33,"tag":230,"props":309,"children":311},{"style":310},"--shiki-default:#9ECBFF",[312],{"type":38,"value":313},"'all-MiniLM-L6-v2'",{"type":33,"tag":230,"props":315,"children":316},{"style":243},[317],{"type":38,"value":318},")\n",{"type":33,"tag":230,"props":320,"children":322},{"class":232,"line":321},5,[323],{"type":33,"tag":230,"props":324,"children":325},{"emptyLinePlaceholder":285},[326],{"type":38,"value":288},{"type":33,"tag":230,"props":328,"children":330},{"class":232,"line":329},6,[331,336,342,347,351,357,362,366,371],{"type":33,"tag":230,"props":332,"children":333},{"style":237},[334],{"type":38,"value":335},"def",{"type":33,"tag":230,"props":337,"children":339},{"style":338},"--shiki-default:#B392F0",[340],{"type":38,"value":341}," semantic_chunk",{"type":33,"tag":230,"props":343,"children":344},{"style":243},[345],{"type":38,"value":346},"(text, max_chunk_size",{"type":33,"tag":230,"props":348,"children":349},{"style":237},[350],{"type":38,"value":302},{"type":33,"tag":230,"props":352,"children":354},{"style":353},"--shiki-default:#79B8FF",[355],{"type":38,"value":356},"600",{"type":33,"tag":230,"props":358,"children":359},{"style":243},[360],{"type":38,"value":361},", overlap",{"type":33,"tag":230,"props":363,"children":364},{"style":237},[365],{"type":38,"value":302},{"type":33,"tag":230,"props":367,"children":368},{"style":353},[369],{"type":38,"value":370},"0.1",{"type":33,"tag":230,"props":372,"children":373},{"style":243},[374],{"type":38,"value":375},"):\n",{"type":33,"tag":230,"props":377,"children":379},{"class":232,"line":378},7,[380,385,389,394,399],{"type":33,"tag":230,"props":381,"children":382},{"style":243},[383],{"type":38,"value":384},"    sentences ",{"type":33,"tag":230,"props":386,"children":387},{"style":237},[388],{"type":38,"value":302},{"type":33,"tag":230,"props":390,"children":391},{"style":243},[392],{"type":38,"value":393}," 
text.split(",{"type":33,"tag":230,"props":395,"children":396},{"style":310},[397],{"type":38,"value":398},"'. '",{"type":33,"tag":230,"props":400,"children":401},{"style":243},[402],{"type":38,"value":318},{"type":33,"tag":230,"props":404,"children":405},{"class":232,"line":27},[406,411,415],{"type":33,"tag":230,"props":407,"children":408},{"style":243},[409],{"type":38,"value":410},"    chunks, current ",{"type":33,"tag":230,"props":412,"children":413},{"style":237},[414],{"type":38,"value":302},{"type":33,"tag":230,"props":416,"children":417},{"style":243},[418],{"type":38,"value":419}," [], []\n",{"type":33,"tag":230,"props":421,"children":423},{"class":232,"line":422},9,[424],{"type":33,"tag":230,"props":425,"children":426},{"style":243},[427],{"type":38,"value":428},"    \n",{"type":33,"tag":230,"props":430,"children":432},{"class":232,"line":431},10,[433,438,443,448],{"type":33,"tag":230,"props":434,"children":435},{"style":237},[436],{"type":38,"value":437},"    for",{"type":33,"tag":230,"props":439,"children":440},{"style":243},[441],{"type":38,"value":442}," sent ",{"type":33,"tag":230,"props":444,"children":445},{"style":237},[446],{"type":38,"value":447},"in",{"type":33,"tag":230,"props":449,"children":450},{"style":243},[451],{"type":38,"value":452}," sentences:\n",{"type":33,"tag":230,"props":454,"children":456},{"class":232,"line":455},11,[457],{"type":33,"tag":230,"props":458,"children":459},{"style":243},[460],{"type":38,"value":461},"        current.append(sent)\n",{"type":33,"tag":230,"props":463,"children":465},{"class":232,"line":464},12,[466,471,475,480],{"type":33,"tag":230,"props":467,"children":468},{"style":243},[469],{"type":38,"value":470},"        chunk_text ",{"type":33,"tag":230,"props":472,"children":473},{"style":237},[474],{"type":38,"value":302},{"type":33,"tag":230,"props":476,"children":477},{"style":310},[478],{"type":38,"value":479}," '. 
'",{"type":33,"tag":230,"props":481,"children":482},{"style":243},[483],{"type":38,"value":484},".join(current)\n",{"type":33,"tag":230,"props":486,"children":488},{"class":232,"line":487},13,[489],{"type":33,"tag":230,"props":490,"children":491},{"style":243},[492],{"type":38,"value":493},"        \n",{"type":33,"tag":230,"props":495,"children":497},{"class":232,"line":496},14,[498,503,508,513,518],{"type":33,"tag":230,"props":499,"children":500},{"style":237},[501],{"type":38,"value":502},"        if",{"type":33,"tag":230,"props":504,"children":505},{"style":353},[506],{"type":38,"value":507}," len",{"type":33,"tag":230,"props":509,"children":510},{"style":243},[511],{"type":38,"value":512},"(chunk_text.split()) ",{"type":33,"tag":230,"props":514,"children":515},{"style":237},[516],{"type":38,"value":517},">",{"type":33,"tag":230,"props":519,"children":520},{"style":243},[521],{"type":38,"value":522}," max_chunk_size:\n",{"type":33,"tag":230,"props":524,"children":526},{"class":232,"line":525},15,[527],{"type":33,"tag":230,"props":528,"children":529},{"style":243},[530],{"type":38,"value":531},"            chunks.append(chunk_text)\n",{"type":33,"tag":230,"props":533,"children":535},{"class":232,"line":534},16,[536,541,545,550,555,560,565,570],{"type":33,"tag":230,"props":537,"children":538},{"style":243},[539],{"type":38,"value":540},"            overlap_size ",{"type":33,"tag":230,"props":542,"children":543},{"style":237},[544],{"type":38,"value":302},{"type":33,"tag":230,"props":546,"children":547},{"style":353},[548],{"type":38,"value":549}," int",{"type":33,"tag":230,"props":551,"children":552},{"style":243},[553],{"type":38,"value":554},"(",{"type":33,"tag":230,"props":556,"children":557},{"style":353},[558],{"type":38,"value":559},"len",{"type":33,"tag":230,"props":561,"children":562},{"style":243},[563],{"type":38,"value":564},"(current) 
",{"type":33,"tag":230,"props":566,"children":567},{"style":237},[568],{"type":38,"value":569},"*",{"type":33,"tag":230,"props":571,"children":572},{"style":243},[573],{"type":38,"value":574}," overlap)\n",{"type":33,"tag":230,"props":576,"children":578},{"class":232,"line":577},17,[579,584,588,593,598,603,608,613,617,622,627],{"type":33,"tag":230,"props":580,"children":581},{"style":243},[582],{"type":38,"value":583},"            current ",{"type":33,"tag":230,"props":585,"children":586},{"style":237},[587],{"type":38,"value":302},{"type":33,"tag":230,"props":589,"children":590},{"style":243},[591],{"type":38,"value":592}," current[",{"type":33,"tag":230,"props":594,"children":595},{"style":237},[596],{"type":38,"value":597},"-",{"type":33,"tag":230,"props":599,"children":600},{"style":243},[601],{"type":38,"value":602},"overlap_size:] ",{"type":33,"tag":230,"props":604,"children":605},{"style":237},[606],{"type":38,"value":607},"if",{"type":33,"tag":230,"props":609,"children":610},{"style":243},[611],{"type":38,"value":612}," overlap_size ",{"type":33,"tag":230,"props":614,"children":615},{"style":237},[616],{"type":38,"value":517},{"type":33,"tag":230,"props":618,"children":619},{"style":353},[620],{"type":38,"value":621}," 0",{"type":33,"tag":230,"props":623,"children":624},{"style":237},[625],{"type":38,"value":626}," else",{"type":33,"tag":230,"props":628,"children":629},{"style":243},[630],{"type":38,"value":631}," []\n",{"type":33,"tag":230,"props":633,"children":635},{"class":232,"line":634},18,[636],{"type":33,"tag":230,"props":637,"children":638},{"style":243},[639],{"type":38,"value":428},{"type":33,"tag":230,"props":641,"children":643},{"class":232,"line":642},19,[644,649],{"type":33,"tag":230,"props":645,"children":646},{"style":237},[647],{"type":38,"value":648},"    if",{"type":33,"tag":230,"props":650,"children":651},{"style":243},[652],{"type":38,"value":653}," 
current:\n",{"type":33,"tag":230,"props":655,"children":657},{"class":232,"line":656},20,[658,663,667],{"type":33,"tag":230,"props":659,"children":660},{"style":243},[661],{"type":38,"value":662},"        chunks.append(",{"type":33,"tag":230,"props":664,"children":665},{"style":310},[666],{"type":38,"value":398},{"type":33,"tag":230,"props":668,"children":669},{"style":243},[670],{"type":38,"value":671},".join(current))\n",{"type":33,"tag":230,"props":673,"children":675},{"class":232,"line":674},21,[676],{"type":33,"tag":230,"props":677,"children":678},{"style":243},[679],{"type":38,"value":428},{"type":33,"tag":230,"props":681,"children":683},{"class":232,"line":682},22,[684,689],{"type":33,"tag":230,"props":685,"children":686},{"style":237},[687],{"type":38,"value":688},"    return",{"type":33,"tag":230,"props":690,"children":691},{"style":243},[692],{"type":38,"value":693}," chunks\n",{"type":33,"tag":34,"props":695,"children":696},{},[697],{"type":38,"value":698},"Pushing overlap from 10% to 20% produced no further retrieval gain but increased storage cost by 18%. In production, 10% was our sweet spot.",{"type":33,"tag":41,"props":700,"children":702},{"id":701},"evaluation-setup-no-blind-spots-in-production",[703],{"type":38,"value":704},"Evaluation Setup: No Blind Spots in Production",{"type":33,"tag":34,"props":706,"children":707},{},[708],{"type":38,"value":709},"After deploying a RAG system, \"we'll check if users complain\" doesn't work in production. The eval pipeline must run continuously, as automated regression tests triggered by new documents, model changes, and chunking updates. 
We run this metric set on every commit inside CI\u002FCD:",{"type":33,"tag":34,"props":711,"children":712},{},[713],{"type":33,"tag":181,"props":714,"children":715},{},[716],{"type":38,"value":717},"Retrieval metrics:",{"type":33,"tag":719,"props":720,"children":721},"ul",{},[722,727,732],{"type":33,"tag":177,"props":723,"children":724},{},[725],{"type":38,"value":726},"Retrieval@5, @10 (on ground truth pairs)",{"type":33,"tag":177,"props":728,"children":729},{},[730],{"type":38,"value":731},"Mean Reciprocal Rank (MRR) — rank of correct document",{"type":33,"tag":177,"props":733,"children":734},{},[735],{"type":38,"value":736},"NDCG@10 (ranking quality)",{"type":33,"tag":34,"props":738,"children":739},{},[740],{"type":33,"tag":181,"props":741,"children":742},{},[743],{"type":38,"value":744},"End-to-end metrics:",{"type":33,"tag":719,"props":746,"children":747},{},[748,753,758],{"type":33,"tag":177,"props":749,"children":750},{},[751],{"type":38,"value":752},"Answer correctness (LLM-as-judge: GPT-4 evaluates the answer)",{"type":33,"tag":177,"props":754,"children":755},{},[756],{"type":38,"value":757},"Citation accuracy (penalize info not in sources)",{"type":33,"tag":177,"props":759,"children":760},{},[761],{"type":38,"value":762},"Latency p50\u002Fp95\u002Fp99",{"type":33,"tag":34,"props":764,"children":765},{},[766],{"type":38,"value":767},"How we build eval datasets: sample 500 queries from production, manually tag ground truth documents, then measure every change against this set. 
The dataset updates monthly because user query distribution shifts — eval scores from 3 months ago don't reflect today's production performance.",{"type":33,"tag":34,"props":769,"children":770},{},[771],{"type":38,"value":772},"For LLM-as-judge, an example prompt:",{"type":33,"tag":219,"props":774,"children":776},{"code":775},"You are a RAG system evaluation model.\nAnalyze this triplet:\n\nUSER_QUERY: \"{query}\"\nRETRIEVED_CONTEXT: \"{context}\"\nGENERATED_ANSWER: \"{answer}\"\n\nRate:\n1. Does the answer correctly address the query? (0–10)\n2. Is every fact in the answer sourced in context? (0–10, give 0 if out-of-source info exists)\n3. Does the answer include unnecessary details? (0–10, 10=concise)\n\nJSON output: {{\"correctness\": X, \"grounding\": Y, \"conciseness\": Z}}\n",[777],{"type":33,"tag":226,"props":778,"children":779},{"__ignoreMap":17},[780],{"type":38,"value":775},{"type":33,"tag":34,"props":782,"children":783},{},[784],{"type":38,"value":785},"We run this eval on every pull request — if retrieval@5 drops more than 2%, the merge is blocked.",{"type":33,"tag":41,"props":787,"children":789},{"id":788},"hyperparameter-tuning-top-k-and-reranking",[790],{"type":38,"value":791},"Hyperparameter Tuning: Top-K and Reranking",{"type":33,"tag":34,"props":793,"children":794},{},[795],{"type":38,"value":796},"After embedding search, you retrieve top-K documents. K=5? 10? 20? Larger K means more context but more tokens sent to the LLM — cost and latency rise, and noise multiplies (LLM hits \"lost in the middle\" where middle-context facts get lost).",{"type":33,"tag":34,"props":798,"children":799},{},[800,802,807],{"type":38,"value":801},"Our sweet spot: ",{"type":33,"tag":181,"props":803,"children":804},{},[805],{"type":38,"value":806},"K=10 embedding retrieval + reranker to select top-3",{"type":38,"value":808},". A reranker (Cohere rerank-english-v2.0 or cross-encoder\u002Fms-marco-MiniLM) does deeper semantic matching between query and document. 
It gives 7–12% better ranking than embedding cosine similarity alone but adds latency per document (one forward pass per doc).",{"type":33,"tag":34,"props":810,"children":811},{},[812],{"type":38,"value":813},"Pipeline:",{"type":33,"tag":173,"props":815,"children":816},{},[817,822,827],{"type":33,"tag":177,"props":818,"children":819},{},[820],{"type":38,"value":821},"Embedding retrieves top-10 (~80ms)",{"type":33,"tag":177,"props":823,"children":824},{},[825],{"type":38,"value":826},"Reranker re-ranks 10 docs, picks top-3 (~120ms)",{"type":33,"tag":177,"props":828,"children":829},{},[830],{"type":38,"value":831},"Send top-3 as LLM prompt context",{"type":33,"tag":34,"props":833,"children":834},{},[835],{"type":38,"value":836},"Retrieval-stage latency rose from about 80ms to about 200ms (2.5× embedding-only), but answer correctness jumped from 87% to 94%. Our user-facing latency SLA is 500ms, so this tradeoff is acceptable. If the SLA were tighter, we could put the reranker in an async queue, serve the embedding top-3 first, and write reranked results to the cache in the background.",{"type":33,"tag":63,"props":838,"children":840},{"id":839},"rerankings-real-contribution-ab-test-results",[841],{"type":38,"value":842},"Reranking's Real Contribution: A\u002FB Test Results",{"type":33,"tag":34,"props":844,"children":845},{},[846,848,857],{"type":38,"value":847},"For 7 days, 50% of traffic went embedding-only, 50% went embedding+rerank. 
Using ",{"type":33,"tag":849,"props":850,"children":854},"a",{"href":851,"rel":852},"https:\u002F\u002Fwww.roibase.com.tr\u002Fru\u002Ffirstparty",[853],"nofollow",[855],{"type":38,"value":856},"First-Party Data & Measurement Architecture",{"type":38,"value":858},", we collected metrics per segment:",{"type":33,"tag":70,"props":860,"children":861},{},[862,888],{"type":33,"tag":74,"props":863,"children":864},{},[865],{"type":33,"tag":78,"props":866,"children":867},{},[868,873,878,883],{"type":33,"tag":82,"props":869,"children":870},{},[871],{"type":38,"value":872},"Metric",{"type":33,"tag":82,"props":874,"children":875},{},[876],{"type":38,"value":877},"Embedding Only",{"type":33,"tag":82,"props":879,"children":880},{},[881],{"type":38,"value":882},"Embedding + Rerank",{"type":33,"tag":82,"props":884,"children":885},{},[886],{"type":38,"value":887},"Delta",{"type":33,"tag":98,"props":889,"children":890},{},[891,914,937,960],{"type":33,"tag":78,"props":892,"children":893},{},[894,899,904,909],{"type":33,"tag":105,"props":895,"children":896},{},[897],{"type":38,"value":898},"User \"helpful\" rating",{"type":33,"tag":105,"props":900,"children":901},{},[902],{"type":38,"value":903},"72%",{"type":33,"tag":105,"props":905,"children":906},{},[907],{"type":38,"value":908},"81%",{"type":33,"tag":105,"props":910,"children":911},{},[912],{"type":38,"value":913},"+9pp",{"type":33,"tag":78,"props":915,"children":916},{},[917,922,927,932],{"type":33,"tag":105,"props":918,"children":919},{},[920],{"type":38,"value":921},"Follow-up query rate",{"type":33,"tag":105,"props":923,"children":924},{},[925],{"type":38,"value":926},"34%",{"type":33,"tag":105,"props":928,"children":929},{},[930],{"type":38,"value":931},"28%",{"type":33,"tag":105,"props":933,"children":934},{},[935],{"type":38,"value":936},"-6pp (good — first answer was 
enough)",{"type":33,"tag":78,"props":938,"children":939},{},[940,945,950,955],{"type":33,"tag":105,"props":941,"children":942},{},[943],{"type":38,"value":944},"p95 latency",{"type":33,"tag":105,"props":946,"children":947},{},[948],{"type":38,"value":949},"180ms",{"type":33,"tag":105,"props":951,"children":952},{},[953],{"type":38,"value":954},"240ms",{"type":33,"tag":105,"props":956,"children":957},{},[958],{"type":38,"value":959},"+60ms",{"type":33,"tag":78,"props":961,"children":962},{},[963,968,973,978],{"type":33,"tag":105,"props":964,"children":965},{},[966],{"type":38,"value":967},"Cost\u002Fquery",{"type":33,"tag":105,"props":969,"children":970},{},[971],{"type":38,"value":972},"$0.003",{"type":33,"tag":105,"props":974,"children":975},{},[976],{"type":38,"value":977},"$0.0042",{"type":33,"tag":105,"props":979,"children":980},{},[981],{"type":38,"value":982},"+40%",{"type":33,"tag":34,"props":984,"children":985},{},[986],{"type":38,"value":987},"Reranking is essential for quality retrieval in production — we offset the cost increase with batch processing and caching as query volume grew.",{"type":33,"tag":41,"props":989,"children":991},{"id":990},"cache-and-incremental-update-real-cost-savings-live-here",[992],{"type":38,"value":993},"Cache and Incremental Update: Real Cost Savings Live Here",{"type":33,"tag":34,"props":995,"children":996},{},[997],{"type":38,"value":998},"Cost optimization doesn't live in model selection; it lives in cache strategy. No need to embed + retrieve when the same query returns. We built a tiered cache on Redis:",{"type":33,"tag":173,"props":1000,"children":1001},{},[1002,1012,1022],{"type":33,"tag":177,"props":1003,"children":1004},{},[1005,1010],{"type":33,"tag":181,"props":1006,"children":1007},{},[1008],{"type":38,"value":1009},"Query embedding cache",{"type":38,"value":1011}," — every unique query's embedding vector cached 24 hours. 
Hit rate: 41% (queries repeat: \"pricing,\" \"integration guide\").",{"type":33,"tag":177,"props":1013,"children":1014},{},[1015,1020],{"type":33,"tag":181,"props":1016,"children":1017},{},[1018],{"type":38,"value":1019},"Retrieval result cache",{"type":38,"value":1021}," — query + top-K document IDs cached 6 hours. Hit rate: 28%.",{"type":33,"tag":177,"props":1023,"children":1024},{},[1025,1030],{"type":33,"tag":181,"props":1026,"children":1027},{},[1028],{"type":38,"value":1029},"Generated answer cache",{"type":38,"value":1031}," — full answer cached 1 hour (invalidated after document updates). Hit rate: 19%.",{"type":33,"tag":34,"props":1033,"children":1034},{},[1035],{"type":38,"value":1036},"On an answer-cache hit, latency drops from 200ms to 15ms and the marginal cost is close to zero. The three hit rates sum to ~88%, but the tiers save different amounts: only a generated-answer hit skips the LLM call, while an embedding or retrieval hit skips just its own stage.",{"type":33,"tag":34,"props":1038,"children":1039},{},[1040],{"type":38,"value":1041},"Incremental updates: when a new document arrives, don't re-embed the entire corpus; embed just the new document. A vector database insert (Pinecone\u002FWeaviate) takes under 50ms. For changed documents, update only that document's chunks. This way, we add 500 documents a day with zero downtime.",{"type":33,"tag":41,"props":1043,"children":1045},{"id":1044},"observability-in-production-tools-for-rag-debugging",[1046],{"type":38,"value":1047},"Observability in Production: Tools for RAG Debugging",{"type":33,"tag":34,"props":1049,"children":1050},{},[1051],{"type":38,"value":1052},"When a user says \"you gave the wrong answer,\" how do you debug? Our stack:",{"type":33,"tag":719,"props":1054,"children":1055},{},[1056,1066,1076],{"type":33,"tag":177,"props":1057,"children":1058},{},[1059,1064],{"type":33,"tag":181,"props":1060,"children":1061},{},[1062],{"type":38,"value":1063},"LangSmith",{"type":38,"value":1065}," — traces every step in each RAG chain: embedding latency, retrieval result, LLM prompt\u002Fresponse, token count. 
Replay the full pipeline by query ID.",{"type":33,"tag":177,"props":1067,"children":1068},{},[1069,1074],{"type":33,"tag":181,"props":1070,"children":1071},{},[1072],{"type":38,"value":1073},"Custom dashboard",{"type":38,"value":1075}," (Grafana + Prometheus) — retrieval@5 score, cache hit rate, p95 latency, cost\u002Fquery in real time.",{"type":33,"tag":177,"props":1077,"children":1078},{},[1079,1084],{"type":33,"tag":181,"props":1080,"children":1081},{},[1082],{"type":38,"value":1083},"Error budget",{"type":38,"value":1085}," — tolerating 2% weekly retrieval failures (e.g., no document found). Crossing this triggers alerts.",{"type":33,"tag":34,"props":1087,"children":1088},{},[1089],{"type":38,"value":1090},"Open-source alternatives to LangSmith: Helicone, Langfuse. The point: every query's full trace must be logged in production or you can't answer \"why was the answer wrong?\"",{"type":33,"tag":34,"props":1092,"children":1093},{},[1094],{"type":38,"value":1095},"RAG complexity: a single latency spike or retrieval failure cascades. Debugging needs observability tooling as much as it needs infrastructure.",{"type":33,"tag":1097,"props":1098,"children":1099},"hr",{},[],{"type":33,"tag":34,"props":1101,"children":1102},{},[1103],{"type":38,"value":1104},"In production RAG, cost optimization is the second step. First, lift retrieval quality to 90%+: test your embedding model with eval, tune chunking to semantic boundaries, add reranking, build continuous eval pipelines. Once quality stabilizes, cut costs with cache, batch processing, and incremental updates. 
Do it backwards and you get a cheap but unusable system: once users see hallucinations, the cost of that failure is roughly 10× whatever the cheaper retrieval saved.",{"type":33,"tag":1106,"props":1107,"children":1108},"style",{},[1109],{"type":38,"value":1110},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}",{"title":17,"searchDepth":281,"depth":281,"links":1112},[1113,1116,1117,1118,1121,1122],{"id":43,"depth":259,"text":46,"children":1114},[1115],{"id":65,"depth":281,"text":68},{"id":158,"depth":259,"text":161},{"id":701,"depth":259,"text":704},{"id":788,"depth":259,"text":791,"children":1119},[1120],{"id":839,"depth":281,"text":842},{"id":990,"depth":259,"text":993},{"id":1044,"depth":259,"text":1047},"markdown","content:ru:ai:production-rag-retrieval-quality-priority.md","content","ru\u002Fai\u002Fproduction-rag-retrieval-quality-priority.md","ru\u002Fai\u002Fproduction-rag-retrieval-quality-priority","md",1782079494657]