[{"data":1,"prerenderedAt":1128},["ShallowReactive",2],{"article-alternates":3,"article-\u002Fes\u002Fai\u002Frag-production-calidad-recuperacion-primero":13},{"i18nKey":4,"paths":5},"ai-003-2026-06",{"de":6,"en":7,"es":8,"fr":9,"it":10,"ru":11,"tr":12},"\u002Fde\u002Fai\u002Frag-production-retrieval-qualitaet-vor-kosten","\u002Fen\u002Fai\u002Frag-production-retrieval-quality-over-cost","\u002Fes\u002Fai\u002Frag-production-calidad-recuperacion-primero","\u002Ffr\u002Fai\u002Fproduction-rag-retrieval-quality-before-cost","\u002Fit\u002Fai\u002Frag-production-retrieval-quality-first","\u002Fru\u002Fai\u002Frag-production-retrieval-quality-first","\u002Ftr\u002Fai\u002Fproductionda-rag-retrieval-kalitesi-costtan-once-gelir",{"_path":8,"_dir":14,"_draft":15,"_partial":15,"_locale":16,"title":17,"description":18,"publishedAt":19,"modifiedAt":19,"category":14,"i18nKey":4,"tags":20,"readingTime":26,"author":27,"body":28,"_type":1122,"_id":1123,"_source":1124,"_file":1125,"_stem":1126,"_extension":1127},"ai",false,"","RAG en Producción: La Calidad de Recuperación Viene Antes que el Costo","Cómo el modelo de embedding, la estrategia de chunking y el setup de evaluación determinan la calidad de recuperación en sistemas RAG production. Primero calidad, después optimización de costos.","2026-06-20",[21,22,23,24,25],"rag","retrieval","embedding-models","chunking-strategy","llm-eval",8,"Roibase",{"type":29,"children":30,"toc":1110},"root",[31,39,46,51,56,61,68,155,161,166,171,207,212,217,693,698,704,709,717,736,744,762,767,772,780,785,791,796,808,813,831,836,842,858,982,987,993,998,1031,1036,1041,1047,1052,1085,1090,1095,1099,1104],{"type":32,"tag":33,"props":34,"children":35},"element","p",{},[36],{"type":37,"value":38},"text","En RAG (Retrieval-Augmented Generation) production, la mayoría de equipos comienzan con optimización de costos. 
Primero eligen un modelo de embedding económico, luego fijan el tamaño de chunk en 512 tokens, y finalmente surge la pregunta: \"¿por qué está alucinando?\" Hay que invertir el orden: la calidad de recuperación es la columna vertebral del sistema; el costo es una variable a optimizar en iteraciones posteriores. En 2026, RAG ya no es un proof-of-concept: hay sistemas en production que procesan millones de queries diarias y los usuarios piden \"muestra la fuente\". Una recuperación incorrecta arruina la respuesta antes de que el contexto llegue al prompt del LLM.",{"type":32,"tag":40,"props":41,"children":43},"h2",{"id":42},"modelo-de-embedding-el-tradeoff-tamaño-calidad-no-es-paramétrico",[44],{"type":37,"value":45},"Modelo de Embedding: El Tradeoff Tamaño-Calidad No es Paramétrico",{"type":32,"tag":33,"props":47,"children":48},{},[49],{"type":37,"value":50},"Reducir la dimensión del embedding disminuye la latencia pero sacrifica precisión de búsqueda. text-embedding-ada-002 tiene 1536 dimensiones; text-embedding-3-small se puede ajustar entre 512 y 1536. Si eliges una dimensión pequeña, los vectores de dominios semánticos diferentes se solapan: la distancia entre \"user authentication\" y \"user onboarding\" se reduce artificialmente.",{"type":32,"tag":33,"props":52,"children":53},{},[54],{"type":37,"value":55},"En production, primero construimos un pipeline de pruebas: 200 queries de usuarios reales + pares de documentos ground truth. Medimos cada modelo con las métricas retrieval@5 y retrieval@10. Entre ada-002 (1536 dim) y embedding-3-small (1536 dim) no hay diferencia de calidad, pero la latencia varía un 18%. Cuando redujimos embedding-3-small a 768 dimensiones, la latencia mejoró un 32% pero el score retrieval@5 bajó del 91% al 84%: 7 puntos de caída, es decir, de cada 100 queries, 7 más recibirían contexto incorrecto. La ganancia en costo\u002Flatencia no compensa esta pérdida.",{"type":32,"tag":33,"props":57,"children":58},{},[59],{"type":37,"value":60},"Alternativa: fine-tuning domain-specific. 
Puedes ajustar modelos de Voyage AI o Cohere embed con tu corpus propio. Después de 50k ejemplos etiquetados + 2 semanas de iteración, el score retrieval@10 subió del 91% al 96%. El costo del fine-tuning es ~$4k pero el costo por query permanece igual: conforme aumenta el volumen, la ganancia marginal crece. En lugar de optimizar costos con un modelo genérico, mejora la calidad con un modelo específico del dominio y luego reduce costos mediante cache y procesamiento batch.",{"type":32,"tag":62,"props":63,"children":65},"h3",{"id":64},"índice-de-madurez-en-qué-etapa-está-tu-selección-de-embedding",[66],{"type":37,"value":67},"Índice de Madurez: ¿En Qué Etapa Está Tu Selección de Embedding?",{"type":32,"tag":69,"props":70,"children":71},"table",{},[72,96],{"type":32,"tag":73,"props":74,"children":75},"thead",{},[76],{"type":32,"tag":77,"props":78,"children":79},"tr",{},[80,86,91],{"type":32,"tag":81,"props":82,"children":83},"th",{},[84],{"type":37,"value":85},"Etapa",{"type":32,"tag":81,"props":87,"children":88},{},[89],{"type":37,"value":90},"Estrategia de Modelo",{"type":32,"tag":81,"props":92,"children":93},{},[94],{"type":37,"value":95},"Objetivo de Métrica",{"type":32,"tag":97,"props":98,"children":99},"tbody",{},[100,119,137],{"type":32,"tag":77,"props":101,"children":102},{},[103,109,114],{"type":32,"tag":104,"props":105,"children":106},"td",{},[107],{"type":37,"value":108},"MVP (0-10k queries\u002Fdía)",{"type":32,"tag":104,"props":110,"children":111},{},[112],{"type":37,"value":113},"OpenAI ada-002 default",{"type":32,"tag":104,"props":115,"children":116},{},[117],{"type":37,"value":118},"Retrieval@5 > 80%",{"type":32,"tag":77,"props":120,"children":121},{},[122,127,132],{"type":32,"tag":104,"props":123,"children":124},{},[125],{"type":37,"value":126},"Scale (10k-100k\u002Fdía)",{"type":32,"tag":104,"props":128,"children":129},{},[130],{"type":37,"value":131},"embedding-3-small 1536 
dim",{"type":32,"tag":104,"props":133,"children":134},{},[135],{"type":37,"value":136},"Retrieval@5 > 85%, latencia p95 \u003C 200ms",{"type":32,"tag":77,"props":138,"children":139},{},[140,145,150],{"type":32,"tag":104,"props":141,"children":142},{},[143],{"type":37,"value":144},"Optimized (100k+\u002Fdía)",{"type":32,"tag":104,"props":146,"children":147},{},[148],{"type":37,"value":149},"Voyage\u002FCohere fine-tuned",{"type":32,"tag":104,"props":151,"children":152},{},[153],{"type":37,"value":154},"Retrieval@10 > 93%, procesamiento batch",{"type":32,"tag":40,"props":156,"children":158},{"id":157},"estrategia-de-chunking-no-tokens-fijos-límites-semánticos",[159],{"type":37,"value":160},"Estrategia de Chunking: No Tokens Fijos, Límites Semánticos",{"type":32,"tag":33,"props":162,"children":163},{},[164],{"type":37,"value":165},"El chunk de 512 tokens se presenta como estándar universal, pero es un artefacto de los context windows históricos de los LLMs, no el punto óptimo para la calidad de recuperación. Los chunks muy pequeños pierden contexto; los muy grandes introducen ruido en el embedding. La mayoría de equipos divide por headers markdown o párrafos, pero la pregunta real es: ¿tu unidad de chunking preserva la estructura semántica del documento?",{"type":32,"tag":33,"props":167,"children":168},{},[169],{"type":37,"value":170},"En nuestro sistema probamos las siguientes estrategias:",{"type":32,"tag":172,"props":173,"children":174},"ol",{},[175,187,197],{"type":32,"tag":176,"props":177,"children":178},"li",{},[179,185],{"type":32,"tag":180,"props":181,"children":182},"strong",{},[183],{"type":37,"value":184},"512 tokens fijos",{"type":37,"value":186},": baseline. Retrieval@5: 82%.",{"type":32,"tag":176,"props":188,"children":189},{},[190,195],{"type":32,"tag":180,"props":191,"children":192},{},[193],{"type":37,"value":194},"Chunking por heading markdown",{"type":37,"value":196},": divide en límites de H2\u002FH3. Retrieval@5: 87% (+5 puntos). 
Latencia sin cambios.",{"type":32,"tag":176,"props":198,"children":199},{},[200,205],{"type":32,"tag":180,"props":201,"children":202},{},[203],{"type":37,"value":204},"Semantic chunking",{"type":37,"value":206}," (en lugar de RecursiveCharacterTextSplitter de LangChain, usamos sentence-transformers para calcular similitud): crea un nuevo chunk cuando la similitud entre oraciones cae. Retrieval@5: 91% (+9 puntos). La latencia aumenta un 15% pero el error \"información relevante no encontrada\" bajó un 22%.",{"type":32,"tag":33,"props":208,"children":209},{},[210],{"type":37,"value":211},"En semantic chunking aprendimos que la tasa de overlap es crítica. Un overlap del 10% (es decir, los últimos 50 tokens del chunk anterior se repiten en el siguiente) elevó retrieval@10 del 91% al 94%. Esto se debe a que la información cortada en un boundary (ej. \"esta métrica creció un 18% en Q4\") se mantiene completa en al menos un chunk gracias al overlap.",{"type":32,"tag":33,"props":213,"children":214},{},[215],{"type":37,"value":216},"Ejemplo de código (Python). Es una versión simplificada: agrupa oraciones hasta un máximo de palabras y aplica el overlap por oraciones; el corte por caída de similitud con sentence-transformers no aparece en este fragmento:",{"type":32,"tag":218,"props":219,"children":223},"pre",{"code":220,"language":221,"meta":16,"className":222,"style":16},"from langchain.text_splitter import RecursiveCharacterTextSplitter\nfrom sentence_transformers import SentenceTransformer\n\nmodel = SentenceTransformer('all-MiniLM-L6-v2')\n\ndef semantic_chunk(text, max_chunk_size=600, overlap=0.1):\n    sentences = text.split('. ')\n    chunks, current = [], []\n    \n    for sent in sentences:\n        current.append(sent)\n        chunk_text = '. '.join(current)\n        \n        if len(chunk_text.split()) > max_chunk_size:\n            chunks.append(chunk_text)\n            overlap_size = int(len(current) * overlap)\n            current = current[-overlap_size:] if overlap_size > 0 else []\n    \n    if current:\n        chunks.append('. 
'.join(current))\n    \n    return chunks\n","python","language-python shiki shiki-themes github-dark",[224],{"type":32,"tag":225,"props":226,"children":227},"code",{"__ignoreMap":16},[228,256,278,288,318,326,375,402,419,428,452,461,484,493,522,531,574,631,639,653,671,679],{"type":32,"tag":229,"props":230,"children":233},"span",{"class":231,"line":232},"line",1,[234,240,246,251],{"type":32,"tag":229,"props":235,"children":237},{"style":236},"--shiki-default:#F97583",[238],{"type":37,"value":239},"from",{"type":32,"tag":229,"props":241,"children":243},{"style":242},"--shiki-default:#E1E4E8",[244],{"type":37,"value":245}," langchain.text_splitter ",{"type":32,"tag":229,"props":247,"children":248},{"style":236},[249],{"type":37,"value":250},"import",{"type":32,"tag":229,"props":252,"children":253},{"style":242},[254],{"type":37,"value":255}," RecursiveCharacterTextSplitter\n",{"type":32,"tag":229,"props":257,"children":259},{"class":231,"line":258},2,[260,264,269,273],{"type":32,"tag":229,"props":261,"children":262},{"style":236},[263],{"type":37,"value":239},{"type":32,"tag":229,"props":265,"children":266},{"style":242},[267],{"type":37,"value":268}," sentence_transformers ",{"type":32,"tag":229,"props":270,"children":271},{"style":236},[272],{"type":37,"value":250},{"type":32,"tag":229,"props":274,"children":275},{"style":242},[276],{"type":37,"value":277}," SentenceTransformer\n",{"type":32,"tag":229,"props":279,"children":281},{"class":231,"line":280},3,[282],{"type":32,"tag":229,"props":283,"children":285},{"emptyLinePlaceholder":284},true,[286],{"type":37,"value":287},"\n",{"type":32,"tag":229,"props":289,"children":291},{"class":231,"line":290},4,[292,297,302,307,313],{"type":32,"tag":229,"props":293,"children":294},{"style":242},[295],{"type":37,"value":296},"model ",{"type":32,"tag":229,"props":298,"children":299},{"style":236},[300],{"type":37,"value":301},"=",{"type":32,"tag":229,"props":303,"children":304},{"style":242},[305],{"type":37,"value":306}," 
SentenceTransformer(",{"type":32,"tag":229,"props":308,"children":310},{"style":309},"--shiki-default:#9ECBFF",[311],{"type":37,"value":312},"'all-MiniLM-L6-v2'",{"type":32,"tag":229,"props":314,"children":315},{"style":242},[316],{"type":37,"value":317},")\n",{"type":32,"tag":229,"props":319,"children":321},{"class":231,"line":320},5,[322],{"type":32,"tag":229,"props":323,"children":324},{"emptyLinePlaceholder":284},[325],{"type":37,"value":287},{"type":32,"tag":229,"props":327,"children":329},{"class":231,"line":328},6,[330,335,341,346,350,356,361,365,370],{"type":32,"tag":229,"props":331,"children":332},{"style":236},[333],{"type":37,"value":334},"def",{"type":32,"tag":229,"props":336,"children":338},{"style":337},"--shiki-default:#B392F0",[339],{"type":37,"value":340}," semantic_chunk",{"type":32,"tag":229,"props":342,"children":343},{"style":242},[344],{"type":37,"value":345},"(text, max_chunk_size",{"type":32,"tag":229,"props":347,"children":348},{"style":236},[349],{"type":37,"value":301},{"type":32,"tag":229,"props":351,"children":353},{"style":352},"--shiki-default:#79B8FF",[354],{"type":37,"value":355},"600",{"type":32,"tag":229,"props":357,"children":358},{"style":242},[359],{"type":37,"value":360},", overlap",{"type":32,"tag":229,"props":362,"children":363},{"style":236},[364],{"type":37,"value":301},{"type":32,"tag":229,"props":366,"children":367},{"style":352},[368],{"type":37,"value":369},"0.1",{"type":32,"tag":229,"props":371,"children":372},{"style":242},[373],{"type":37,"value":374},"):\n",{"type":32,"tag":229,"props":376,"children":378},{"class":231,"line":377},7,[379,384,388,393,398],{"type":32,"tag":229,"props":380,"children":381},{"style":242},[382],{"type":37,"value":383},"    sentences ",{"type":32,"tag":229,"props":385,"children":386},{"style":236},[387],{"type":37,"value":301},{"type":32,"tag":229,"props":389,"children":390},{"style":242},[391],{"type":37,"value":392}," 
text.split(",{"type":32,"tag":229,"props":394,"children":395},{"style":309},[396],{"type":37,"value":397},"'. '",{"type":32,"tag":229,"props":399,"children":400},{"style":242},[401],{"type":37,"value":317},{"type":32,"tag":229,"props":403,"children":404},{"class":231,"line":26},[405,410,414],{"type":32,"tag":229,"props":406,"children":407},{"style":242},[408],{"type":37,"value":409},"    chunks, current ",{"type":32,"tag":229,"props":411,"children":412},{"style":236},[413],{"type":37,"value":301},{"type":32,"tag":229,"props":415,"children":416},{"style":242},[417],{"type":37,"value":418}," [], []\n",{"type":32,"tag":229,"props":420,"children":422},{"class":231,"line":421},9,[423],{"type":32,"tag":229,"props":424,"children":425},{"style":242},[426],{"type":37,"value":427},"    \n",{"type":32,"tag":229,"props":429,"children":431},{"class":231,"line":430},10,[432,437,442,447],{"type":32,"tag":229,"props":433,"children":434},{"style":236},[435],{"type":37,"value":436},"    for",{"type":32,"tag":229,"props":438,"children":439},{"style":242},[440],{"type":37,"value":441}," sent ",{"type":32,"tag":229,"props":443,"children":444},{"style":236},[445],{"type":37,"value":446},"in",{"type":32,"tag":229,"props":448,"children":449},{"style":242},[450],{"type":37,"value":451}," sentences:\n",{"type":32,"tag":229,"props":453,"children":455},{"class":231,"line":454},11,[456],{"type":32,"tag":229,"props":457,"children":458},{"style":242},[459],{"type":37,"value":460},"        current.append(sent)\n",{"type":32,"tag":229,"props":462,"children":464},{"class":231,"line":463},12,[465,470,474,479],{"type":32,"tag":229,"props":466,"children":467},{"style":242},[468],{"type":37,"value":469},"        chunk_text ",{"type":32,"tag":229,"props":471,"children":472},{"style":236},[473],{"type":37,"value":301},{"type":32,"tag":229,"props":475,"children":476},{"style":309},[477],{"type":37,"value":478}," '. 
'",{"type":32,"tag":229,"props":480,"children":481},{"style":242},[482],{"type":37,"value":483},".join(current)\n",{"type":32,"tag":229,"props":485,"children":487},{"class":231,"line":486},13,[488],{"type":32,"tag":229,"props":489,"children":490},{"style":242},[491],{"type":37,"value":492},"        \n",{"type":32,"tag":229,"props":494,"children":496},{"class":231,"line":495},14,[497,502,507,512,517],{"type":32,"tag":229,"props":498,"children":499},{"style":236},[500],{"type":37,"value":501},"        if",{"type":32,"tag":229,"props":503,"children":504},{"style":352},[505],{"type":37,"value":506}," len",{"type":32,"tag":229,"props":508,"children":509},{"style":242},[510],{"type":37,"value":511},"(chunk_text.split()) ",{"type":32,"tag":229,"props":513,"children":514},{"style":236},[515],{"type":37,"value":516},">",{"type":32,"tag":229,"props":518,"children":519},{"style":242},[520],{"type":37,"value":521}," max_chunk_size:\n",{"type":32,"tag":229,"props":523,"children":525},{"class":231,"line":524},15,[526],{"type":32,"tag":229,"props":527,"children":528},{"style":242},[529],{"type":37,"value":530},"            chunks.append(chunk_text)\n",{"type":32,"tag":229,"props":532,"children":534},{"class":231,"line":533},16,[535,540,544,549,554,559,564,569],{"type":32,"tag":229,"props":536,"children":537},{"style":242},[538],{"type":37,"value":539},"            overlap_size ",{"type":32,"tag":229,"props":541,"children":542},{"style":236},[543],{"type":37,"value":301},{"type":32,"tag":229,"props":545,"children":546},{"style":352},[547],{"type":37,"value":548}," int",{"type":32,"tag":229,"props":550,"children":551},{"style":242},[552],{"type":37,"value":553},"(",{"type":32,"tag":229,"props":555,"children":556},{"style":352},[557],{"type":37,"value":558},"len",{"type":32,"tag":229,"props":560,"children":561},{"style":242},[562],{"type":37,"value":563},"(current) 
",{"type":32,"tag":229,"props":565,"children":566},{"style":236},[567],{"type":37,"value":568},"*",{"type":32,"tag":229,"props":570,"children":571},{"style":242},[572],{"type":37,"value":573}," overlap)\n",{"type":32,"tag":229,"props":575,"children":577},{"class":231,"line":576},17,[578,583,587,592,597,602,607,612,616,621,626],{"type":32,"tag":229,"props":579,"children":580},{"style":242},[581],{"type":37,"value":582},"            current ",{"type":32,"tag":229,"props":584,"children":585},{"style":236},[586],{"type":37,"value":301},{"type":32,"tag":229,"props":588,"children":589},{"style":242},[590],{"type":37,"value":591}," current[",{"type":32,"tag":229,"props":593,"children":594},{"style":236},[595],{"type":37,"value":596},"-",{"type":32,"tag":229,"props":598,"children":599},{"style":242},[600],{"type":37,"value":601},"overlap_size:] ",{"type":32,"tag":229,"props":603,"children":604},{"style":236},[605],{"type":37,"value":606},"if",{"type":32,"tag":229,"props":608,"children":609},{"style":242},[610],{"type":37,"value":611}," overlap_size ",{"type":32,"tag":229,"props":613,"children":614},{"style":236},[615],{"type":37,"value":516},{"type":32,"tag":229,"props":617,"children":618},{"style":352},[619],{"type":37,"value":620}," 0",{"type":32,"tag":229,"props":622,"children":623},{"style":236},[624],{"type":37,"value":625}," else",{"type":32,"tag":229,"props":627,"children":628},{"style":242},[629],{"type":37,"value":630}," []\n",{"type":32,"tag":229,"props":632,"children":634},{"class":231,"line":633},18,[635],{"type":32,"tag":229,"props":636,"children":637},{"style":242},[638],{"type":37,"value":427},{"type":32,"tag":229,"props":640,"children":642},{"class":231,"line":641},19,[643,648],{"type":32,"tag":229,"props":644,"children":645},{"style":236},[646],{"type":37,"value":647},"    if",{"type":32,"tag":229,"props":649,"children":650},{"style":242},[651],{"type":37,"value":652}," 
current:\n",{"type":32,"tag":229,"props":654,"children":656},{"class":231,"line":655},20,[657,662,666],{"type":32,"tag":229,"props":658,"children":659},{"style":242},[660],{"type":37,"value":661},"        chunks.append(",{"type":32,"tag":229,"props":663,"children":664},{"style":309},[665],{"type":37,"value":397},{"type":32,"tag":229,"props":667,"children":668},{"style":242},[669],{"type":37,"value":670},".join(current))\n",{"type":32,"tag":229,"props":672,"children":674},{"class":231,"line":673},21,[675],{"type":32,"tag":229,"props":676,"children":677},{"style":242},[678],{"type":37,"value":427},{"type":32,"tag":229,"props":680,"children":682},{"class":231,"line":681},22,[683,688],{"type":32,"tag":229,"props":684,"children":685},{"style":236},[686],{"type":37,"value":687},"    return",{"type":32,"tag":229,"props":689,"children":690},{"style":242},[691],{"type":37,"value":692}," chunks\n",{"type":32,"tag":33,"props":694,"children":695},{},[696],{"type":37,"value":697},"Cuando aumentamos el overlap del 10% al 20%, la ganancia en retrieval se estancó pero el costo de storage creció un 18%. En production, el 10% fue el punto óptimo.",{"type":32,"tag":40,"props":699,"children":701},{"id":700},"setup-de-evaluación-sin-puntos-ciegos-en-production",[702],{"type":37,"value":703},"Setup de Evaluación: Sin Puntos Ciegos en Production",{"type":32,"tag":33,"props":705,"children":706},{},[707],{"type":37,"value":708},"Después de desplegar el sistema RAG, la mentalidad \"revisaremos si el usuario se queja\" no funciona en production. El pipeline de evaluación debe ejecutarse continuamente, con pruebas de regresión automáticas cuando se añaden nuevos documentos, cuando cambia el modelo de embedding o cuando se actualiza la estrategia de chunking. 
Este conjunto de métricas se ejecuta en cada commit en CI\u002FCD:",{"type":32,"tag":33,"props":710,"children":711},{},[712],{"type":32,"tag":180,"props":713,"children":714},{},[715],{"type":37,"value":716},"Métricas de retrieval:",{"type":32,"tag":718,"props":719,"children":720},"ul",{},[721,726,731],{"type":32,"tag":176,"props":722,"children":723},{},[724],{"type":37,"value":725},"Retrieval@5, @10 (sobre pares ground truth)",{"type":32,"tag":176,"props":727,"children":728},{},[729],{"type":37,"value":730},"Mean Reciprocal Rank (MRR) — en qué posición llegó el documento correcto",{"type":32,"tag":176,"props":732,"children":733},{},[734],{"type":37,"value":735},"NDCG@10 (calidad del ranking)",{"type":32,"tag":33,"props":737,"children":738},{},[739],{"type":32,"tag":180,"props":740,"children":741},{},[742],{"type":37,"value":743},"Métricas end-to-end:",{"type":32,"tag":718,"props":745,"children":746},{},[747,752,757],{"type":32,"tag":176,"props":748,"children":749},{},[750],{"type":37,"value":751},"Answer correctness (LLM-as-judge: GPT-4 evalúa la respuesta generada)",{"type":32,"tag":176,"props":753,"children":754},{},[755],{"type":37,"value":756},"Citation accuracy (penalización si contiene información no en la fuente)",{"type":32,"tag":176,"props":758,"children":759},{},[760],{"type":37,"value":761},"Latencia p50\u002Fp95\u002Fp99",{"type":32,"tag":33,"props":763,"children":764},{},[765],{"type":37,"value":766},"¿Cómo construimos el dataset de evaluación? Tomamos 500 queries del production, etiquetamos manualmente los documentos ground truth, luego medimos cada cambio contra este set. 
El dataset se actualiza mensualmente porque la distribución de queries de usuarios cambia: un score de eval de hace 3 meses no refleja el performance de production de hoy.",{"type":32,"tag":33,"props":768,"children":769},{},[770],{"type":37,"value":771},"Para LLM-as-judge, usamos este prompt:",{"type":32,"tag":218,"props":773,"children":775},{"code":774},"Eres un modelo evaluador de sistemas RAG.\nAnaliza la siguiente tríada:\n\nUSER_QUERY: \"{query}\"\nRETRIEVED_CONTEXT: \"{context}\"\nGENERATED_ANSWER: \"{answer}\"\n\nEvalúa:\n1. ¿La respuesta contesta correctamente la query? (0-10)\n2. ¿Toda la información en la respuesta está en el contexto? (0-10, 0 si hay información no fundamentada)\n3. ¿La respuesta evita detalles innecesarios? (0-10, 10=concisa)\n\nOutput JSON: {{\"correctness\": X, \"grounding\": Y, \"conciseness\": Z}}\n",[776],{"type":32,"tag":225,"props":777,"children":778},{"__ignoreMap":16},[779],{"type":37,"value":774},{"type":32,"tag":33,"props":781,"children":782},{},[783],{"type":37,"value":784},"Ejecutamos esta evaluación en cada pull request: si el score retrieval@5 cae más de un 2%, el merge se bloquea.",{"type":32,"tag":40,"props":786,"children":788},{"id":787},"ajuste-de-hiperparámetros-top-k-y-reranking",[789],{"type":37,"value":790},"Ajuste de Hiperparámetros: Top-K y Reranking",{"type":32,"tag":33,"props":792,"children":793},{},[794],{"type":37,"value":795},"Después de la búsqueda por embedding, recuperas los top-K documentos. ¿K=5, 10 o 20? 
Mayor K significa más contexto pero también más tokens enviados al LLM: tanto el costo como la latencia aumentan y, además, el ruido se multiplica (el LLM sufre el problema \"lost in the middle\": pierde información en el medio de contextos largos).",{"type":32,"tag":33,"props":797,"children":798},{},[799,801,806],{"type":37,"value":800},"Lo que encontramos óptimo: ",{"type":32,"tag":180,"props":802,"children":803},{},[804],{"type":37,"value":805},"K=10 en retrieval por embedding + modelo reranker para seleccionar top-3",{"type":37,"value":807},". El reranker (Cohere rerank-english-v2.0 o cross-encoder\u002Fms-marco-MiniLM) hace un matching semántico más profundo entre query y documento. Proporciona un ranking un 7-12% mejor que la similitud coseno de embeddings, pero añade latencia (un forward pass por cada documento).",{"type":32,"tag":33,"props":809,"children":810},{},[811],{"type":37,"value":812},"Pipeline:",{"type":32,"tag":172,"props":814,"children":815},{},[816,821,826],{"type":32,"tag":176,"props":817,"children":818},{},[819],{"type":37,"value":820},"Embedding retrieval top-10 (~80ms)",{"type":32,"tag":176,"props":822,"children":823},{},[824],{"type":37,"value":825},"Reranker reordena los 10 documentos, selecciona top-3 (~120ms)",{"type":32,"tag":176,"props":827,"children":828},{},[829],{"type":37,"value":830},"Envía top-3 como contexto al prompt del LLM",{"type":32,"tag":33,"props":832,"children":833},{},[834],{"type":37,"value":835},"La latencia total pasó de ~80ms a ~200ms respecto a embedding-only (2,5 veces más), pero answer correctness subió del 87% al 94%. Nuestro SLA de latencia visible es 500ms, así que este tradeoff es aceptable. 
Si el SLA fuera más restrictivo, podríamos mover el reranker a una cola async y servir primero el top-3 de embedding, escribiendo el resultado del rerank en cache en background.",{"type":32,"tag":62,"props":837,"children":839},{"id":838},"impacto-real-del-reranking-resultados-de-ab-test",[840],{"type":37,"value":841},"Impacto Real del Reranking: Resultados de A\u002FB Test",{"type":32,"tag":33,"props":843,"children":844},{},[845,847,856],{"type":37,"value":846},"Durante 7 días, el 50% del tráfico se enrutó a embedding-only y el otro 50% a embedding+rerank. Usando ",{"type":32,"tag":848,"props":849,"children":853},"a",{"href":850,"rel":851},"https:\u002F\u002Fwww.roibase.com.tr\u002Fes\u002Ffirstparty",[852],"nofollow",[854],{"type":37,"value":855},"arquitectura de medición first-party",{"type":37,"value":857},", capturamos métricas de cada query por segmento:",{"type":32,"tag":69,"props":859,"children":860},{},[861,887],{"type":32,"tag":73,"props":862,"children":863},{},[864],{"type":32,"tag":77,"props":865,"children":866},{},[867,872,877,882],{"type":32,"tag":81,"props":868,"children":869},{},[870],{"type":37,"value":871},"Métrica",{"type":32,"tag":81,"props":873,"children":874},{},[875],{"type":37,"value":876},"Solo Embedding",{"type":32,"tag":81,"props":878,"children":879},{},[880],{"type":37,"value":881},"Embedding + Rerank",{"type":32,"tag":81,"props":883,"children":884},{},[885],{"type":37,"value":886},"Delta",{"type":32,"tag":97,"props":888,"children":889},{},[890,913,936,959],{"type":32,"tag":77,"props":891,"children":892},{},[893,898,903,908],{"type":32,"tag":104,"props":894,"children":895},{},[896],{"type":37,"value":897},"Rating \"útil\" del 
usuario",{"type":32,"tag":104,"props":899,"children":900},{},[901],{"type":37,"value":902},"72%",{"type":32,"tag":104,"props":904,"children":905},{},[906],{"type":37,"value":907},"81%",{"type":32,"tag":104,"props":909,"children":910},{},[911],{"type":37,"value":912},"+9pp",{"type":32,"tag":77,"props":914,"children":915},{},[916,921,926,931],{"type":32,"tag":104,"props":917,"children":918},{},[919],{"type":37,"value":920},"Tasa de follow-up queries",{"type":32,"tag":104,"props":922,"children":923},{},[924],{"type":37,"value":925},"34%",{"type":32,"tag":104,"props":927,"children":928},{},[929],{"type":37,"value":930},"28%",{"type":32,"tag":104,"props":932,"children":933},{},[934],{"type":37,"value":935},"-6pp (bueno — la respuesta inicial fue suficiente)",{"type":32,"tag":77,"props":937,"children":938},{},[939,944,949,954],{"type":32,"tag":104,"props":940,"children":941},{},[942],{"type":37,"value":943},"Latencia p95",{"type":32,"tag":104,"props":945,"children":946},{},[947],{"type":37,"value":948},"180ms",{"type":32,"tag":104,"props":950,"children":951},{},[952],{"type":37,"value":953},"240ms",{"type":32,"tag":104,"props":955,"children":956},{},[957],{"type":37,"value":958},"+60ms",{"type":32,"tag":77,"props":960,"children":961},{},[962,967,972,977],{"type":32,"tag":104,"props":963,"children":964},{},[965],{"type":37,"value":966},"Costo\u002Fquery",{"type":32,"tag":104,"props":968,"children":969},{},[970],{"type":37,"value":971},"$0.003",{"type":32,"tag":104,"props":973,"children":974},{},[975],{"type":37,"value":976},"$0.0042",{"type":32,"tag":104,"props":978,"children":979},{},[980],{"type":37,"value":981},"+40%",{"type":32,"tag":33,"props":983,"children":984},{},[985],{"type":37,"value":986},"El reranking es obligatorio en production para retrieval de calidad — reducimos el costo incrementado mediante batch processing y cache conforme crece el 
volumen.",{"type":32,"tag":40,"props":988,"children":990},{"id":989},"cache-e-incrementalización-aquí-es-donde-viene-la-ganancia-real-de-costo",[991],{"type":37,"value":992},"Cache e Incrementalización: De Aquí Viene el Ahorro Real de Costo",{"type":32,"tag":33,"props":994,"children":995},{},[996],{"type":37,"value":997},"La optimización de costos no está en la selección de modelos sino en la estrategia de cache. Si la misma query llega de nuevo, no necesitas repetir embedding + retrieval. Construimos esta estructura de cache en capas sobre Redis:",{"type":32,"tag":172,"props":999,"children":1000},{},[1001,1011,1021],{"type":32,"tag":176,"props":1002,"children":1003},{},[1004,1009],{"type":32,"tag":180,"props":1005,"children":1006},{},[1007],{"type":37,"value":1008},"Query embedding cache",{"type":37,"value":1010},": cada query única tiene su vector embedding cacheado 24 horas. Hit rate 41% (porque las queries de usuarios son repetitivas: \"pricing\", \"integration guide\", etc.).",{"type":32,"tag":176,"props":1012,"children":1013},{},[1014,1019],{"type":32,"tag":180,"props":1015,"children":1016},{},[1017],{"type":37,"value":1018},"Retrieval result cache",{"type":37,"value":1020},": los pares de query + IDs de documentos top-K se cachean 6 horas. Hit rate 28%.",{"type":32,"tag":176,"props":1022,"children":1023},{},[1024,1029],{"type":32,"tag":180,"props":1025,"children":1026},{},[1027],{"type":37,"value":1028},"Generated answer cache",{"type":37,"value":1030},": la respuesta completa se cachea 1 hora (se invalida después de actualizaciones de documentos). Hit rate 19%.",{"type":32,"tag":33,"props":1032,"children":1033},{},[1034],{"type":37,"value":1035},"En un cache hit, la latencia cae de 200ms a 15ms, con costo cero. 
El hit rate combinado es ~88% (41% + 28% + 19%): solo el 12% del tráfico de production requiere llamadas reales a embedding + LLM.",{"type":32,"tag":33,"props":1037,"children":1038},{},[1039],{"type":37,"value":1040},"Incrementalización: cuando se añade un documento nuevo, no reembedeas todo el corpus, solo el documento nuevo. La operación insert en la vector database (Pinecone\u002FWeaviate) toma \u003C 50ms. Si un documento existente cambia, solo actualizas los chunks de ese documento. Así podemos integrar 500 documentos diarios sin downtime del sistema.",{"type":32,"tag":40,"props":1042,"children":1044},{"id":1043},"observabilidad-en-production-herramientas-necesarias-para-debugging-de-rag",[1045],{"type":37,"value":1046},"Observabilidad en Production: Herramientas Necesarias para Debugging de RAG",{"type":32,"tag":33,"props":1048,"children":1049},{},[1050],{"type":37,"value":1051},"Cuando un usuario dice \"me dio una respuesta incorrecta\", ¿cómo debuggeas? Nuestro stack:",{"type":32,"tag":718,"props":1053,"children":1054},{},[1055,1065,1075],{"type":32,"tag":176,"props":1056,"children":1057},{},[1058,1063],{"type":32,"tag":180,"props":1059,"children":1060},{},[1061],{"type":37,"value":1062},"LangSmith",{"type":37,"value":1064},": mantiene el trace de cada paso del RAG chain: latencia de embedding, resultado de retrieval, prompt\u002Fresponse del LLM, token count. Puedes reproducir cualquier query por su ID.",{"type":32,"tag":176,"props":1066,"children":1067},{},[1068,1073],{"type":32,"tag":180,"props":1069,"children":1070},{},[1071],{"type":37,"value":1072},"Dashboard custom",{"type":37,"value":1074}," (Grafana + Prometheus): monitoreo en tiempo real de retrieval@5 score, cache hit rate, latencia p95, costo\u002Fquery.",{"type":32,"tag":176,"props":1076,"children":1077},{},[1078,1083],{"type":32,"tag":180,"props":1079,"children":1080},{},[1081],{"type":37,"value":1082},"Error budget",{"type":37,"value":1084},": tolerancia semanal del 2% de fallos de retrieval (ej. 
documento no encontrado). Si se excede este threshold, se dispara una alerta.",{"type":32,"tag":33,"props":1086,"children":1087},{},[1088],{"type":37,"value":1089},"Alternativas open-source a LangSmith: Helicone, Langfuse. Lo importante es esto: en production debe mantenerse el trace completo de cada query; de lo contrario no puedes responder \"¿por qué la respuesta fue incorrecta?\"",{"type":32,"tag":33,"props":1091,"children":1092},{},[1093],{"type":37,"value":1094},"La complejidad del RAG está aquí: un simple spike de latencia o un error de retrieval causa un efecto cascada. La herramienta de observabilidad es tan crítica como la infraestructura.",{"type":32,"tag":1096,"props":1097,"children":1098},"hr",{},[],{"type":32,"tag":33,"props":1100,"children":1101},{},[1102],{"type":37,"value":1103},"En RAG production, la optimización de costos es el segundo paso. Primero eleva la calidad de retrieval a niveles del 90% o más: prueba el modelo de embedding con evaluación, ajusta la estrategia de chunking según límites semánticos, añade un reranker y construye un pipeline de evaluación continua. Una vez que la calidad está establecida, reduce costos mediante cache, procesamiento batch e incrementalización. 
Si lo haces al revés, terminas con un sistema económico pero inutilizable — cuando el usuario ve una alucinación, tu pérdida de costo es 10 veces mayor que el error de retrieval.",{"type":32,"tag":1105,"props":1106,"children":1107},"style",{},[1108],{"type":37,"value":1109},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}",{"title":16,"searchDepth":280,"depth":280,"links":1111},[1112,1115,1116,1117,1120,1121],{"id":42,"depth":258,"text":45,"children":1113},[1114],{"id":64,"depth":280,"text":67},{"id":157,"depth":258,"text":160},{"id":700,"depth":258,"text":703},{"id":787,"depth":258,"text":790,"children":1118},[1119],{"id":838,"depth":280,"text":841},{"id":989,"depth":258,"text":992},{"id":1043,"depth":258,"text":1046},"markdown","content:es:ai:rag-production-calidad-recuperacion-primero.md","content","es\u002Fai\u002Frag-production-calidad-recuperacion-primero.md","es\u002Fai\u002Frag-production-calidad-recuperacion-primero","md",1782079490189]