[{"data":1,"prerenderedAt":1119},["ShallowReactive",2],{"article-alternates":3,"article-\u002Fde\u002Fai\u002Frag-production-retrieval-qualitaet-vor-kosten":13},{"i18nKey":4,"paths":5},"ai-003-2026-06",{"de":6,"en":7,"es":8,"fr":9,"it":10,"ru":11,"tr":12},"\u002Fde\u002Fai\u002Frag-production-retrieval-qualitaet-vor-kosten","\u002Fen\u002Fai\u002Frag-production-retrieval-quality-over-cost","\u002Fes\u002Fai\u002Frag-production-calidad-recuperacion-primero","\u002Ffr\u002Fai\u002Fproduction-rag-retrieval-quality-before-cost","\u002Fit\u002Fai\u002Frag-production-retrieval-quality-first","\u002Fru\u002Fai\u002Frag-production-retrieval-quality-first","\u002Ftr\u002Fai\u002Fproductionda-rag-retrieval-kalitesi-costtan-once-gelir",{"_path":6,"_dir":14,"_draft":15,"_partial":15,"_locale":16,"title":17,"description":18,"publishedAt":19,"modifiedAt":19,"category":14,"i18nKey":4,"tags":20,"readingTime":26,"author":27,"body":28,"_type":1113,"_id":1114,"_source":1115,"_file":1116,"_stem":1117,"_extension":1118},"ai",false,"","RAG in der Produktion: Retrieval-Qualität vor Kostenoptimierung","Embedding-Modelle, Chunking-Strategien und Evaluierungs-Setup bestimmen die Retrieval-Qualität in produktiven RAG-Systemen. Qualität zuerst, dann Kosteneinsparungen.","2026-06-20",[21,22,23,24,25],"rag","retrieval","embedding-modelle","chunking-strategie","llm-eval",9,"Roibase",{"type":29,"children":30,"toc":1101},"root",[31,39,46,51,56,61,68,155,161,166,171,207,212,217,693,698,704,709,717,736,744,762,767,772,780,785,791,796,808,813,831,836,842,858,982,987,993,998,1031,1036,1041,1047,1052,1085,1090,1095],{"type":32,"tag":33,"props":34,"children":35},"element","p",{},[36],{"type":37,"value":38},"text","In der Produktion RAG (Retrieval-Augmented Generation) einzuführen bedeutet für die meisten Teams, mit Kostenoptimierung zu beginnen. 
Zunächst wird ein günstiges Embedding-Modell gewählt, dann wird die Chunk-Größe auf 512 Token festgelegt, am Ende kommt die Frage: „Warum halluziniert das System?\" Die Logik muss umgekehrt werden: Retrieval-Qualität ist das Rückgrat des Systems, Kosteneffizienz ist eine Variable für spätere Iterationen. 2026 ist RAG nicht mehr Proof-of-Concept — produktive Systeme verarbeiten täglich Millionen von Anfragen, und Nutzer fordern „Quellenangaben\" ein. Falsches Retrieval ist ein Problem, bevor der LLM-Prompt überhaupt formuliert wird.",{"type":32,"tag":40,"props":41,"children":43},"h2",{"id":42},"embedding-modell-größen-qualitäts-tradeoff-ist-nicht-parametrisch",[44],{"type":37,"value":45},"Embedding-Modell: Größen-Qualitäts-Tradeoff ist nicht parametrisch",{"type":32,"tag":33,"props":47,"children":48},{},[49],{"type":37,"value":50},"Die Reduzierung der Embedding-Dimension verringert die Retrieval-Latenz, opfert aber Suchgenauigkeit. text-embedding-ada-002 nutzt 1536 Dimensionen, text-embedding-3-small kann zwischen 512–1536 konfiguriert werden. Wählt man eine kleinere Dimension, überschneiden sich Vektoren aus unterschiedlichen semantischen Bereichen — der Abstand zwischen „user authentication\" und „user onboarding\" verringert sich künstlich.",{"type":32,"tag":33,"props":52,"children":53},{},[54],{"type":37,"value":55},"Wir haben in der Produktion zunächst eine Test-Pipeline aufgebaut: 200 echte Nutzer-Anfragen + Ground-Truth-Dokument-Paare. Wir evaluierten jedes Modell mit Retrieval@5 und Retrieval@10 Metriken. ada-002 (1536 Dim) und embedding-3-small (1536 Dim) zeigten keinen Qualitätsunterschied, aber 18 % Latenzunterschied. Als wir embedding-3-small auf 768 Dimensionen reduzierten, verbesserte sich die Latenz um 32 %, aber der Retrieval@5-Score fiel von 91 % auf 84 % — ein Verlust von 7 Punkten bedeutet, dass bei 100 Anfragen 7 den falschen Kontext erhalten. 
Der Kostenvorteil rechtfertigt diesen Qualitätsverlust nicht.",{"type":32,"tag":33,"props":57,"children":58},{},[59],{"type":37,"value":60},"Alternative: Domain-spezifisches Fine-Tuning. Voyage-AI oder Cohere-Modelle lassen sich auf dem eigenen Corpus fine-tunen. Nach 50k gelabelten Beispielen und zwei Wochen Iteration stieg der Retrieval@10-Score von 91 % auf 96 %. Das Fine-Tuning kostet etwa 4.000 EUR, aber die Kosten pro Query bleiben identisch — bei wachsendem Volume wächst der Marginalgewinn. Statt Kostenoptimierung bei generischen Modellen sollte man Qualitätsgewinn mit domänenspezifischen Modellen anstreben und dann die Kosten durch Cache und Batch-Mechanismen senken.",{"type":32,"tag":62,"props":63,"children":65},"h3",{"id":64},"reife-index-welche-phase-liegt-ihrer-embedding-strategie-zugrunde",[66],{"type":37,"value":67},"Reife-Index: Welche Phase liegt Ihrer Embedding-Strategie zugrunde?",{"type":32,"tag":69,"props":70,"children":71},"table",{},[72,96],{"type":32,"tag":73,"props":74,"children":75},"thead",{},[76],{"type":32,"tag":77,"props":78,"children":79},"tr",{},[80,86,91],{"type":32,"tag":81,"props":82,"children":83},"th",{},[84],{"type":37,"value":85},"Phase",{"type":32,"tag":81,"props":87,"children":88},{},[89],{"type":37,"value":90},"Modell-Strategie",{"type":32,"tag":81,"props":92,"children":93},{},[94],{"type":37,"value":95},"Metrik-Ziel",{"type":32,"tag":97,"props":98,"children":99},"tbody",{},[100,119,137],{"type":32,"tag":77,"props":101,"children":102},{},[103,109,114],{"type":32,"tag":104,"props":105,"children":106},"td",{},[107],{"type":37,"value":108},"MVP (0–10k Anfragen\u002FTag)",{"type":32,"tag":104,"props":110,"children":111},{},[112],{"type":37,"value":113},"OpenAI ada-002 Standard",{"type":32,"tag":104,"props":115,"children":116},{},[117],{"type":37,"value":118},"Retrieval@5 > 80 
%",{"type":32,"tag":77,"props":120,"children":121},{},[122,127,132],{"type":32,"tag":104,"props":123,"children":124},{},[125],{"type":37,"value":126},"Skalierung (10k–100k\u002FTag)",{"type":32,"tag":104,"props":128,"children":129},{},[130],{"type":37,"value":131},"embedding-3-small 1536 Dim",{"type":32,"tag":104,"props":133,"children":134},{},[135],{"type":37,"value":136},"Retrieval@5 > 85 %, p95-Latenz \u003C 200ms",{"type":32,"tag":77,"props":138,"children":139},{},[140,145,150],{"type":32,"tag":104,"props":141,"children":142},{},[143],{"type":37,"value":144},"Optimiert (100k+\u002FTag)",{"type":32,"tag":104,"props":146,"children":147},{},[148],{"type":37,"value":149},"Fine-tuned Voyage\u002FCohere",{"type":32,"tag":104,"props":151,"children":152},{},[153],{"type":37,"value":154},"Retrieval@10 > 93 %, Batch-Verarbeitung",{"type":32,"tag":40,"props":156,"children":158},{"id":157},"chunking-strategie-nicht-feste-token-sondern-semantische-grenzen",[159],{"type":37,"value":160},"Chunking-Strategie: Nicht feste Token, sondern semantische Grenzen",{"type":32,"tag":33,"props":162,"children":163},{},[164],{"type":37,"value":165},"512-Token-Chunks werden wie ein Standard präsentiert, sind aber das historische Limit des LLM-Context-Fensters, nicht der optimale Punkt für Retrieval-Qualität. Sind Chunks zu klein, geht Kontext verloren; zu groß, entsteht Rauschen im Embedding. Die meisten Teams chunken nach Markdown-Überschriften oder Absätzen, aber die echte Frage lautet: Erhält die Chunking-Einheit die semantische Struktur des Dokuments?",{"type":32,"tag":33,"props":167,"children":168},{},[169],{"type":37,"value":170},"Wir testeten folgende Strategien:",{"type":32,"tag":172,"props":173,"children":174},"ol",{},[175,187,197],{"type":32,"tag":176,"props":177,"children":178},"li",{},[179,185],{"type":32,"tag":180,"props":181,"children":182},"strong",{},[183],{"type":37,"value":184},"Feste 512 Token",{"type":37,"value":186}," — Baseline. 
Retrieval@5: 82 %.",{"type":32,"tag":176,"props":188,"children":189},{},[190,195],{"type":32,"tag":180,"props":191,"children":192},{},[193],{"type":37,"value":194},"Markdown-Überschrift Split",{"type":37,"value":196}," — Chunk-Grenzen bei H2\u002FH3. Retrieval@5: 87 % (+5 Punkte). Latenz unverändert.",{"type":32,"tag":176,"props":198,"children":199},{},[200,205],{"type":32,"tag":180,"props":201,"children":202},{},[203],{"type":37,"value":204},"Semantisches Chunking",{"type":37,"value":206}," (statt einfaches RecursiveCharacterTextSplitter: sentence-transformers mit Ähnlichkeitsberechnung) — neuer Chunk wenn die Satz-Ähnlichkeit sinkt. Retrieval@5: 91 % (+9 Punkte). Latenz +15 %, aber „relevante Information nicht gefunden\"-Fehler sanken um 22 %.",{"type":32,"tag":33,"props":208,"children":209},{},[210],{"type":37,"value":211},"Bei semantischem Chunking war die Overlap-Quote kritisch. 10 % Overlap (letzte 50 Token werden im nächsten Chunk wiederholt) erhöhte Retrieval@10 von 91 % auf 94 %. Der Grund: Informationen, die an einer Chunk-Grenze abgeschnitten werden (z. B. „diese Metrik ist im Q4 um 18 % gestiegen\"), bleiben durch Overlap vollständig in mindestens einem Chunk.",{"type":32,"tag":33,"props":213,"children":214},{},[215],{"type":37,"value":216},"Code-Beispiel (Python):",{"type":32,"tag":218,"props":219,"children":223},"pre",{"code":220,"language":221,"meta":16,"className":222,"style":16},"from langchain.text_splitter import RecursiveCharacterTextSplitter\nfrom sentence_transformers import SentenceTransformer\n\nmodel = SentenceTransformer('all-MiniLM-L6-v2')\n\ndef semantic_chunk(text, max_chunk_size=600, overlap=0.1):\n    sentences = text.split('. ')\n    chunks, current = [], []\n    \n    for sent in sentences:\n        current.append(sent)\n        chunk_text = '. 
'.join(current)\n        \n        if len(chunk_text.split()) > max_chunk_size:\n            chunks.append(chunk_text)\n            overlap_size = int(len(current) * overlap)\n            current = current[-overlap_size:] if overlap_size > 0 else []\n    \n    if current:\n        chunks.append('. '.join(current))\n    \n    return chunks\n","python","language-python shiki shiki-themes github-dark",[224],{"type":32,"tag":225,"props":226,"children":227},"code",{"__ignoreMap":16},[228,256,278,288,318,326,375,402,420,428,452,461,484,493,522,531,574,631,639,653,671,679],{"type":32,"tag":229,"props":230,"children":233},"span",{"class":231,"line":232},"line",1,[234,240,246,251],{"type":32,"tag":229,"props":235,"children":237},{"style":236},"--shiki-default:#F97583",[238],{"type":37,"value":239},"from",{"type":32,"tag":229,"props":241,"children":243},{"style":242},"--shiki-default:#E1E4E8",[244],{"type":37,"value":245}," langchain.text_splitter ",{"type":32,"tag":229,"props":247,"children":248},{"style":236},[249],{"type":37,"value":250},"import",{"type":32,"tag":229,"props":252,"children":253},{"style":242},[254],{"type":37,"value":255}," RecursiveCharacterTextSplitter\n",{"type":32,"tag":229,"props":257,"children":259},{"class":231,"line":258},2,[260,264,269,273],{"type":32,"tag":229,"props":261,"children":262},{"style":236},[263],{"type":37,"value":239},{"type":32,"tag":229,"props":265,"children":266},{"style":242},[267],{"type":37,"value":268}," sentence_transformers ",{"type":32,"tag":229,"props":270,"children":271},{"style":236},[272],{"type":37,"value":250},{"type":32,"tag":229,"props":274,"children":275},{"style":242},[276],{"type":37,"value":277}," 
SentenceTransformer\n",{"type":32,"tag":229,"props":279,"children":281},{"class":231,"line":280},3,[282],{"type":32,"tag":229,"props":283,"children":285},{"emptyLinePlaceholder":284},true,[286],{"type":37,"value":287},"\n",{"type":32,"tag":229,"props":289,"children":291},{"class":231,"line":290},4,[292,297,302,307,313],{"type":32,"tag":229,"props":293,"children":294},{"style":242},[295],{"type":37,"value":296},"model ",{"type":32,"tag":229,"props":298,"children":299},{"style":236},[300],{"type":37,"value":301},"=",{"type":32,"tag":229,"props":303,"children":304},{"style":242},[305],{"type":37,"value":306}," SentenceTransformer(",{"type":32,"tag":229,"props":308,"children":310},{"style":309},"--shiki-default:#9ECBFF",[311],{"type":37,"value":312},"'all-MiniLM-L6-v2'",{"type":32,"tag":229,"props":314,"children":315},{"style":242},[316],{"type":37,"value":317},")\n",{"type":32,"tag":229,"props":319,"children":321},{"class":231,"line":320},5,[322],{"type":32,"tag":229,"props":323,"children":324},{"emptyLinePlaceholder":284},[325],{"type":37,"value":287},{"type":32,"tag":229,"props":327,"children":329},{"class":231,"line":328},6,[330,335,341,346,350,356,361,365,370],{"type":32,"tag":229,"props":331,"children":332},{"style":236},[333],{"type":37,"value":334},"def",{"type":32,"tag":229,"props":336,"children":338},{"style":337},"--shiki-default:#B392F0",[339],{"type":37,"value":340}," semantic_chunk",{"type":32,"tag":229,"props":342,"children":343},{"style":242},[344],{"type":37,"value":345},"(text, max_chunk_size",{"type":32,"tag":229,"props":347,"children":348},{"style":236},[349],{"type":37,"value":301},{"type":32,"tag":229,"props":351,"children":353},{"style":352},"--shiki-default:#79B8FF",[354],{"type":37,"value":355},"600",{"type":32,"tag":229,"props":357,"children":358},{"style":242},[359],{"type":37,"value":360},", 
overlap",{"type":32,"tag":229,"props":362,"children":363},{"style":236},[364],{"type":37,"value":301},{"type":32,"tag":229,"props":366,"children":367},{"style":352},[368],{"type":37,"value":369},"0.1",{"type":32,"tag":229,"props":371,"children":372},{"style":242},[373],{"type":37,"value":374},"):\n",{"type":32,"tag":229,"props":376,"children":378},{"class":231,"line":377},7,[379,384,388,393,398],{"type":32,"tag":229,"props":380,"children":381},{"style":242},[382],{"type":37,"value":383},"    sentences ",{"type":32,"tag":229,"props":385,"children":386},{"style":236},[387],{"type":37,"value":301},{"type":32,"tag":229,"props":389,"children":390},{"style":242},[391],{"type":37,"value":392}," text.split(",{"type":32,"tag":229,"props":394,"children":395},{"style":309},[396],{"type":37,"value":397},"'. '",{"type":32,"tag":229,"props":399,"children":400},{"style":242},[401],{"type":37,"value":317},{"type":32,"tag":229,"props":403,"children":405},{"class":231,"line":404},8,[406,411,415],{"type":32,"tag":229,"props":407,"children":408},{"style":242},[409],{"type":37,"value":410},"    chunks, current ",{"type":32,"tag":229,"props":412,"children":413},{"style":236},[414],{"type":37,"value":301},{"type":32,"tag":229,"props":416,"children":417},{"style":242},[418],{"type":37,"value":419}," [], []\n",{"type":32,"tag":229,"props":421,"children":422},{"class":231,"line":26},[423],{"type":32,"tag":229,"props":424,"children":425},{"style":242},[426],{"type":37,"value":427},"    \n",{"type":32,"tag":229,"props":429,"children":431},{"class":231,"line":430},10,[432,437,442,447],{"type":32,"tag":229,"props":433,"children":434},{"style":236},[435],{"type":37,"value":436},"    for",{"type":32,"tag":229,"props":438,"children":439},{"style":242},[440],{"type":37,"value":441}," sent ",{"type":32,"tag":229,"props":443,"children":444},{"style":236},[445],{"type":37,"value":446},"in",{"type":32,"tag":229,"props":448,"children":449},{"style":242},[450],{"type":37,"value":451}," 
sentences:\n",{"type":32,"tag":229,"props":453,"children":455},{"class":231,"line":454},11,[456],{"type":32,"tag":229,"props":457,"children":458},{"style":242},[459],{"type":37,"value":460},"        current.append(sent)\n",{"type":32,"tag":229,"props":462,"children":464},{"class":231,"line":463},12,[465,470,474,479],{"type":32,"tag":229,"props":466,"children":467},{"style":242},[468],{"type":37,"value":469},"        chunk_text ",{"type":32,"tag":229,"props":471,"children":472},{"style":236},[473],{"type":37,"value":301},{"type":32,"tag":229,"props":475,"children":476},{"style":309},[477],{"type":37,"value":478}," '. '",{"type":32,"tag":229,"props":480,"children":481},{"style":242},[482],{"type":37,"value":483},".join(current)\n",{"type":32,"tag":229,"props":485,"children":487},{"class":231,"line":486},13,[488],{"type":32,"tag":229,"props":489,"children":490},{"style":242},[491],{"type":37,"value":492},"        \n",{"type":32,"tag":229,"props":494,"children":496},{"class":231,"line":495},14,[497,502,507,512,517],{"type":32,"tag":229,"props":498,"children":499},{"style":236},[500],{"type":37,"value":501},"        if",{"type":32,"tag":229,"props":503,"children":504},{"style":352},[505],{"type":37,"value":506}," len",{"type":32,"tag":229,"props":508,"children":509},{"style":242},[510],{"type":37,"value":511},"(chunk_text.split()) ",{"type":32,"tag":229,"props":513,"children":514},{"style":236},[515],{"type":37,"value":516},">",{"type":32,"tag":229,"props":518,"children":519},{"style":242},[520],{"type":37,"value":521}," max_chunk_size:\n",{"type":32,"tag":229,"props":523,"children":525},{"class":231,"line":524},15,[526],{"type":32,"tag":229,"props":527,"children":528},{"style":242},[529],{"type":37,"value":530},"            chunks.append(chunk_text)\n",{"type":32,"tag":229,"props":532,"children":534},{"class":231,"line":533},16,[535,540,544,549,554,559,564,569],{"type":32,"tag":229,"props":536,"children":537},{"style":242},[538],{"type":37,"value":539},"            
overlap_size ",{"type":32,"tag":229,"props":541,"children":542},{"style":236},[543],{"type":37,"value":301},{"type":32,"tag":229,"props":545,"children":546},{"style":352},[547],{"type":37,"value":548}," int",{"type":32,"tag":229,"props":550,"children":551},{"style":242},[552],{"type":37,"value":553},"(",{"type":32,"tag":229,"props":555,"children":556},{"style":352},[557],{"type":37,"value":558},"len",{"type":32,"tag":229,"props":560,"children":561},{"style":242},[562],{"type":37,"value":563},"(current) ",{"type":32,"tag":229,"props":565,"children":566},{"style":236},[567],{"type":37,"value":568},"*",{"type":32,"tag":229,"props":570,"children":571},{"style":242},[572],{"type":37,"value":573}," overlap)\n",{"type":32,"tag":229,"props":575,"children":577},{"class":231,"line":576},17,[578,583,587,592,597,602,607,612,616,621,626],{"type":32,"tag":229,"props":579,"children":580},{"style":242},[581],{"type":37,"value":582},"            current ",{"type":32,"tag":229,"props":584,"children":585},{"style":236},[586],{"type":37,"value":301},{"type":32,"tag":229,"props":588,"children":589},{"style":242},[590],{"type":37,"value":591}," current[",{"type":32,"tag":229,"props":593,"children":594},{"style":236},[595],{"type":37,"value":596},"-",{"type":32,"tag":229,"props":598,"children":599},{"style":242},[600],{"type":37,"value":601},"overlap_size:] ",{"type":32,"tag":229,"props":603,"children":604},{"style":236},[605],{"type":37,"value":606},"if",{"type":32,"tag":229,"props":608,"children":609},{"style":242},[610],{"type":37,"value":611}," overlap_size ",{"type":32,"tag":229,"props":613,"children":614},{"style":236},[615],{"type":37,"value":516},{"type":32,"tag":229,"props":617,"children":618},{"style":352},[619],{"type":37,"value":620}," 0",{"type":32,"tag":229,"props":622,"children":623},{"style":236},[624],{"type":37,"value":625}," else",{"type":32,"tag":229,"props":627,"children":628},{"style":242},[629],{"type":37,"value":630}," 
[]\n",{"type":32,"tag":229,"props":632,"children":634},{"class":231,"line":633},18,[635],{"type":32,"tag":229,"props":636,"children":637},{"style":242},[638],{"type":37,"value":427},{"type":32,"tag":229,"props":640,"children":642},{"class":231,"line":641},19,[643,648],{"type":32,"tag":229,"props":644,"children":645},{"style":236},[646],{"type":37,"value":647},"    if",{"type":32,"tag":229,"props":649,"children":650},{"style":242},[651],{"type":37,"value":652}," current:\n",{"type":32,"tag":229,"props":654,"children":656},{"class":231,"line":655},20,[657,662,666],{"type":32,"tag":229,"props":658,"children":659},{"style":242},[660],{"type":37,"value":661},"        chunks.append(",{"type":32,"tag":229,"props":663,"children":664},{"style":309},[665],{"type":37,"value":397},{"type":32,"tag":229,"props":667,"children":668},{"style":242},[669],{"type":37,"value":670},".join(current))\n",{"type":32,"tag":229,"props":672,"children":674},{"class":231,"line":673},21,[675],{"type":32,"tag":229,"props":676,"children":677},{"style":242},[678],{"type":37,"value":427},{"type":32,"tag":229,"props":680,"children":682},{"class":231,"line":681},22,[683,688],{"type":32,"tag":229,"props":684,"children":685},{"style":236},[686],{"type":37,"value":687},"    return",{"type":32,"tag":229,"props":689,"children":690},{"style":242},[691],{"type":37,"value":692}," chunks\n",{"type":32,"tag":33,"props":694,"children":695},{},[696],{"type":37,"value":697},"Als wir den Overlap von 10 % auf 20 % erhöhten, stagnierte der Retrieval-Gewinn, aber die Speicherkosten stiegen um 18 %. 
In der Produktion war 10 % unser Optimalpunkt.",{"type":32,"tag":40,"props":699,"children":701},{"id":700},"evaluierungs-setup-keine-blinden-flecken-in-der-produktion",[702],{"type":37,"value":703},"Evaluierungs-Setup: Keine blinden Flecken in der Produktion",{"type":32,"tag":33,"props":705,"children":706},{},[707],{"type":37,"value":708},"Ein RAG-System zu deployen und zu sagen „wir schauen, wenn Nutzer sich beschweren\" funktioniert in der Produktion nicht. Die Evaluierungs-Pipeline muss kontinuierlich laufen: neue Dokumente, Modell-Wechsel, Chunking-Updates — alles mit automatisierten Regressionstests. Wir führen folgende Metrik-Sets in jedem Commit durch CI\u002FCD aus:",{"type":32,"tag":33,"props":710,"children":711},{},[712],{"type":32,"tag":180,"props":713,"children":714},{},[715],{"type":37,"value":716},"Retrieval-Metriken:",{"type":32,"tag":718,"props":719,"children":720},"ul",{},[721,726,731],{"type":32,"tag":176,"props":722,"children":723},{},[724],{"type":37,"value":725},"Retrieval@5, @10 (basierend auf Ground-Truth-Paaren)",{"type":32,"tag":176,"props":727,"children":728},{},[729],{"type":37,"value":730},"Mean Reciprocal Rank (MRR) — an welcher Position kam das korrekte Dokument?",{"type":32,"tag":176,"props":732,"children":733},{},[734],{"type":37,"value":735},"NDCG@10 (Ranking-Qualität)",{"type":32,"tag":33,"props":737,"children":738},{},[739],{"type":32,"tag":180,"props":740,"children":741},{},[742],{"type":37,"value":743},"End-to-End-Metriken:",{"type":32,"tag":718,"props":745,"children":746},{},[747,752,757],{"type":32,"tag":176,"props":748,"children":749},{},[750],{"type":37,"value":751},"Answer Correctness (LLM-as-Judge: GPT-4 bewertet die Antwort)",{"type":32,"tag":176,"props":753,"children":754},{},[755],{"type":37,"value":756},"Citation Accuracy (Punkt-Abzug, wenn Informationen außerhalb der Quelle stammen)",{"type":32,"tag":176,"props":758,"children":759},{},[760],{"type":37,"value":761},"Latenz 
p50\u002Fp95\u002Fp99",{"type":32,"tag":33,"props":763,"children":764},{},[765],{"type":37,"value":766},"Den Eval-Datensatz konstruieren wir so: 500 Sample-Anfragen aus der Produktion, manuelles Labeling der Ground-Truth-Dokumente, dann Messung aller Änderungen gegen diesen Satz. Der Datensatz wird monatlich aktualisiert, weil sich die Nutzer-Query-Verteilung ändert — ein Eval-Score von vor 3 Monaten spiegelt die heutige Produktion nicht wider.",{"type":32,"tag":33,"props":768,"children":769},{},[770],{"type":37,"value":771},"Beispiel-Prompt für LLM-as-Judge:",{"type":32,"tag":218,"props":773,"children":775},{"code":774},"Du bist ein Evaluierungsmodell für ein RAG-System.\nAnalysiere folgendes Tripel:\n\nUSER_QUERY: \"{query}\"\nRETRIEVED_CONTEXT: \"{context}\"\nGENERATED_ANSWER: \"{answer}\"\n\nBewerte:\n1. Beantwortet die Antwort die Abfrage korrekt? (0–10)\n2. Stammen alle Informationen in der Antwort aus dem Kontext? (0–10, ohne Quellentext = 0)\n3. Enthält die Antwort unnötige Details? (0–10, 10 = prägnant)\n\nJSON-Ausgabe: {{\"correctness\": X, \"grounding\": Y, \"conciseness\": Z}}\n",[776],{"type":32,"tag":225,"props":777,"children":778},{"__ignoreMap":16},[779],{"type":37,"value":774},{"type":32,"tag":33,"props":781,"children":782},{},[783],{"type":37,"value":784},"Dieser Eval läuft bei jedem Pull Request — sinkt der Retrieval@5-Score um mehr als 2 %, wird der Merge blockiert.",{"type":32,"tag":40,"props":786,"children":788},{"id":787},"hyperparameter-tuning-top-k-und-reranking",[789],{"type":37,"value":790},"Hyperparameter-Tuning: Top-K und Reranking",{"type":32,"tag":33,"props":792,"children":793},{},[794],{"type":37,"value":795},"Nach dem Embedding-Search werden Sie Top-K Dokumente abrufen. K=5, 10 oder 20? Ein größeres K bedeutet mehr Kontext, aber auch mehr Token zum LLM — Kosten und Latenz steigen, und Rauschen wächst. 
Der LLM erfährt das „Lost in the Middle\"-Problem: Er übersieht Informationen in der Mitte eines langen Kontexts.",{"type":32,"tag":33,"props":797,"children":798},{},[799,801,806],{"type":37,"value":800},"Unser optimaler Punkt: ",{"type":32,"tag":180,"props":802,"children":803},{},[804],{"type":37,"value":805},"K=10 Embedding-Retrieval + Reranker-Modell für Top-3 Auswahl",{"type":37,"value":807},". Der Reranker (Cohere rerank-english-v2.0 oder cross-encoder\u002Fms-marco-MiniLM) führt ein tieferes semantisches Matching zwischen Abfrage und Dokumenten durch. Das Ranking ist 7–12 % besser als nur Cosine-Similarity, verursacht aber zusätzliche Latenz (Forward Pass für jedes Dokument).",{"type":32,"tag":33,"props":809,"children":810},{},[811],{"type":37,"value":812},"Pipeline:",{"type":32,"tag":172,"props":814,"children":815},{},[816,821,826],{"type":32,"tag":176,"props":817,"children":818},{},[819],{"type":37,"value":820},"Embedding-Top-10 abrufen (~80ms)",{"type":32,"tag":176,"props":822,"children":823},{},[824],{"type":37,"value":825},"Reranker: 10 Dokumente neu sortieren, Top-3 wählen (~120ms)",{"type":32,"tag":176,"props":827,"children":828},{},[829],{"type":37,"value":830},"Top-3 als Kontext an LLM senden",{"type":32,"tag":33,"props":832,"children":833},{},[834],{"type":37,"value":835},"Die Gesamtlatenz ist damit etwa 150 % höher als bei reinem Embedding-Retrieval (80ms → 200ms), aber Answer Correctness stieg von 87 % auf 94 %. Unser User-Facing-Latenz-SLA ist 500ms, dieser Tradeoff ist akzeptabel. 
Bei straffer Anforderung könnten wir den Reranker in eine asynchrone Queue auslagern, zunächst mit Embedding-Top-3 antworten und das Reranking-Ergebnis im Hintergrund cachen.",{"type":32,"tag":62,"props":837,"children":839},{"id":838},"echter-reranking-impact-ab-test-ergebnisse",[840],{"type":37,"value":841},"Echter Reranking-Impact: A\u002FB-Test-Ergebnisse",{"type":32,"tag":33,"props":843,"children":844},{},[845,847,856],{"type":37,"value":846},"Über 7 Tage richteten wir 50 % Traffic an Embedding-Only und 50 % an Embedding+Rerank. Mit ",{"type":32,"tag":848,"props":849,"children":853},"a",{"href":850,"rel":851},"https:\u002F\u002Fwww.roibase.com.tr\u002Fde\u002Ffirstparty",[852],"nofollow",[854],{"type":37,"value":855},"First-Party-Daten und Messung-Architektur",{"type":37,"value":857}," verfolgten wir jede Query nach Segment:",{"type":32,"tag":69,"props":859,"children":860},{},[861,887],{"type":32,"tag":73,"props":862,"children":863},{},[864],{"type":32,"tag":77,"props":865,"children":866},{},[867,872,877,882],{"type":32,"tag":81,"props":868,"children":869},{},[870],{"type":37,"value":871},"Metrik",{"type":32,"tag":81,"props":873,"children":874},{},[875],{"type":37,"value":876},"Nur Embedding",{"type":32,"tag":81,"props":878,"children":879},{},[880],{"type":37,"value":881},"Embedding + Rerank",{"type":32,"tag":81,"props":883,"children":884},{},[885],{"type":37,"value":886},"Delta",{"type":32,"tag":97,"props":888,"children":889},{},[890,913,936,959],{"type":32,"tag":77,"props":891,"children":892},{},[893,898,903,908],{"type":32,"tag":104,"props":894,"children":895},{},[896],{"type":37,"value":897},"„Hilfreich\"-Rating durch Nutzer",{"type":32,"tag":104,"props":899,"children":900},{},[901],{"type":37,"value":902},"72 %",{"type":32,"tag":104,"props":904,"children":905},{},[906],{"type":37,"value":907},"81 
%",{"type":32,"tag":104,"props":909,"children":910},{},[911],{"type":37,"value":912},"+9pp",{"type":32,"tag":77,"props":914,"children":915},{},[916,921,926,931],{"type":32,"tag":104,"props":917,"children":918},{},[919],{"type":37,"value":920},"Follow-up-Query-Rate",{"type":32,"tag":104,"props":922,"children":923},{},[924],{"type":37,"value":925},"34 %",{"type":32,"tag":104,"props":927,"children":928},{},[929],{"type":37,"value":930},"28 %",{"type":32,"tag":104,"props":932,"children":933},{},[934],{"type":37,"value":935},"-6pp (gut — erste Antwort genügte)",{"type":32,"tag":77,"props":937,"children":938},{},[939,944,949,954],{"type":32,"tag":104,"props":940,"children":941},{},[942],{"type":37,"value":943},"p95-Latenz",{"type":32,"tag":104,"props":945,"children":946},{},[947],{"type":37,"value":948},"180ms",{"type":32,"tag":104,"props":950,"children":951},{},[952],{"type":37,"value":953},"240ms",{"type":32,"tag":104,"props":955,"children":956},{},[957],{"type":37,"value":958},"+60ms",{"type":32,"tag":77,"props":960,"children":961},{},[962,967,972,977],{"type":32,"tag":104,"props":963,"children":964},{},[965],{"type":37,"value":966},"Kosten pro Query",{"type":32,"tag":104,"props":968,"children":969},{},[970],{"type":37,"value":971},"0,003 EUR",{"type":32,"tag":104,"props":973,"children":974},{},[975],{"type":37,"value":976},"0,0042 EUR",{"type":32,"tag":104,"props":978,"children":979},{},[980],{"type":37,"value":981},"+40 %",{"type":32,"tag":33,"props":983,"children":984},{},[985],{"type":37,"value":986},"Reranking ist in der Produktion für hochwertiges Retrieval notwendig — Kostenerhöhungen reduzieren wir durch Batch-Verarbeitung und Caching mit wachsendem Volume.",{"type":32,"tag":40,"props":988,"children":990},{"id":989},"cache-und-inkrementelle-aktualisierungen-echter-kostenvorteil-liegt-hier",[991],{"type":37,"value":992},"Cache und inkrementelle Aktualisierungen: Echter Kostenvorteil liegt 
hier",{"type":32,"tag":33,"props":994,"children":995},{},[996],{"type":37,"value":997},"Kostenoptimierung passiert nicht bei der Modellwahl, sondern bei der Cache-Strategie. Wird die gleiche Abfrage erneut gestellt, müssen Sie Embedding + Retrieval nicht wiederholen. Wir konstruierten folgende mehrstufige Cache-Struktur auf Redis:",{"type":32,"tag":172,"props":999,"children":1000},{},[1001,1011,1021],{"type":32,"tag":176,"props":1002,"children":1003},{},[1004,1009],{"type":32,"tag":180,"props":1005,"children":1006},{},[1007],{"type":37,"value":1008},"Query-Embedding-Cache",{"type":37,"value":1010}," — jeder eindeutige Query speichert seinen Embedding-Vektor für 24 Stunden. Hit-Rate: 41 % (Nutzer-Queries sind repetitiv: „Preise\", „Integrationsleitfaden\").",{"type":32,"tag":176,"props":1012,"children":1013},{},[1014,1019],{"type":32,"tag":180,"props":1015,"children":1016},{},[1017],{"type":37,"value":1018},"Retrieval-Result-Cache",{"type":37,"value":1020}," — Query + Top-K Dokument-IDs für 6 Stunden. Hit-Rate: 28 %.",{"type":32,"tag":176,"props":1022,"children":1023},{},[1024,1029],{"type":32,"tag":180,"props":1025,"children":1026},{},[1027],{"type":37,"value":1028},"Generated-Answer-Cache",{"type":37,"value":1030}," — komplette Antwort für 1 Stunde (invalidiert nach Dokument-Update). Hit-Rate: 19 %.",{"type":32,"tag":33,"props":1032,"children":1033},{},[1034],{"type":37,"value":1035},"Bei Cache-Hit sinkt die Latenz von 200ms auf 15ms, Kosten sind null. Combined Hit-Rate ~88 % — nur 12 % des Production-Traffic führt tatsächlich Embedding + LLM aus.",{"type":32,"tag":33,"props":1037,"children":1038},{},[1039],{"type":37,"value":1040},"Inkrementelle Updates: statt das gesamte Corpus neu einzubetten, wenn neue Dokumente hinzukommen, verarbeiten wir nur die neuen. Vector-DB-Insert (Pinecone\u002FWeaviate) vollzieht sich unter 50ms. Ändert sich ein altes Dokument, aktualisieren wir nur dessen Chunks. 
So können täglich 500 Dokumente hinzugefügt werden, das System läuft ohne Ausfallzeit.",{"type":32,"tag":40,"props":1042,"children":1044},{"id":1043},"beobachtbarkeit-in-der-produktion-rag-debugging-tools",[1045],{"type":37,"value":1046},"Beobachtbarkeit in der Produktion: RAG-Debugging-Tools",{"type":32,"tag":33,"props":1048,"children":1049},{},[1050],{"type":37,"value":1051},"Wenn ein Nutzer sagt „falsche Antwort\", wie debuggen Sie? Unser Stack:",{"type":32,"tag":718,"props":1053,"children":1054},{},[1055,1065,1075],{"type":32,"tag":176,"props":1056,"children":1057},{},[1058,1063],{"type":32,"tag":180,"props":1059,"children":1060},{},[1061],{"type":37,"value":1062},"LangSmith",{"type":37,"value":1064}," — speichert Traces für jeden RAG-Chain-Schritt: Embedding-Latenz, Retrieval-Resultat, LLM-Prompt\u002FResponse, Token-Count. Mit Query-ID können wir die gesamte Pipeline nachspielen.",{"type":32,"tag":176,"props":1066,"children":1067},{},[1068,1073],{"type":32,"tag":180,"props":1069,"children":1070},{},[1071],{"type":37,"value":1072},"Custom-Dashboard",{"type":37,"value":1074}," (Grafana + Prometheus) — Retrieval@5-Score, Cache-Hit-Rate, p95-Latenz, Kosten pro Query werden echtzeit überwacht.",{"type":32,"tag":176,"props":1076,"children":1077},{},[1078,1083],{"type":32,"tag":180,"props":1079,"children":1080},{},[1081],{"type":37,"value":1082},"Error Budget",{"type":37,"value":1084}," — 2 % Retrieval-Fehlertoleranz pro Woche (z. B. Dokument nicht gefunden). Wird diese Schwelle überschritten, gibt es einen Alert.",{"type":32,"tag":33,"props":1086,"children":1087},{},[1088],{"type":37,"value":1089},"LangSmith-Alternativen sind Open-Source-Tools wie Helicone, Langfuse. 
Das Entscheidende: Jeder Query in der Produktion muss vollständig getraced sein, sonst können Sie die Frage „warum falsche Antwort?\" nicht beantworten.",{"type":32,"tag":33,"props":1091,"children":1092},{},[1093],{"type":37,"value":1094},"Die",{"type":32,"tag":1096,"props":1097,"children":1098},"style",{},[1099],{"type":37,"value":1100},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}",{"title":16,"searchDepth":280,"depth":280,"links":1102},[1103,1106,1107,1108,1111,1112],{"id":42,"depth":258,"text":45,"children":1104},[1105],{"id":64,"depth":280,"text":67},{"id":157,"depth":258,"text":160},{"id":700,"depth":258,"text":703},{"id":787,"depth":258,"text":790,"children":1109},[1110],{"id":838,"depth":280,"text":841},{"id":989,"depth":258,"text":992},{"id":1043,"depth":258,"text":1046},"markdown","content:de:ai:rag-production-retrieval-qualitaet-vor-kosten.md","content","de\u002Fai\u002Frag-production-retrieval-qualitaet-vor-kosten.md","de\u002Fai\u002Frag-production-retrieval-qualitaet-vor-kosten","md",1782079488920]