RAG systems have become mainstream in production since 2024. Companies are building embedding + vector DB stacks to feed their own document corpora into LLMs. Yet most pilot projects hit the same wall: retrieval quality drops, answers become inconsistent, costs spiral. The culprit is usually hasty decisions on embedding model selection, chunking strategy, and eval setup. This piece walks through the decisions in your RAG pipeline that are hard to reverse once you're in production.
Embedding Model: Alignment, Not Dimension
Your first instinct when choosing an embedding model is to ask "which one has the highest MTEB score?" But benchmark rankings don't guarantee production performance. What matters is how well the model aligns with your document types and query patterns.
When we compared OpenAI's text-embedding-3-large (3072 dim) with Cohere's embed-v3 (1024 dim), Cohere delivered more consistent recall@10 on marketing documents (blogs, case studies, landing pages), because its training set was heavy on business content. OpenAI's larger model scored well on general benchmarks, but our domain-specific query distribution looked nothing like those benchmarks.
Another example: bge-large-en-v1.5 (1024 dim, self-hosted) is sufficient for legal documents. But on a multilingual corpus, multilingual-e5-large (1024 dim) clearly wins. Model size isn't always a quality signal—alignment between training data and your domain is more critical.
Selection criteria:
- Not MTEB score, but recall@5 / MRR metric on your own eval set
- Latency (self-hosted vs API)—batch embedding time for 512 documents
- Cost per 1M tokens—OpenAI 3-large costs $0.13, Cohere v3 $0.10, self-hosted $0 but infrastructure overhead exists
If your document set contains domain-specific jargon (pharma, finance, legal), fine-tuning an embedding model or adapting sentence transformers to your data increases retrieval quality by 15–20%. This falls under data analysis & insights engineering—you need to build a training pipeline and monitor data quality.
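As a sketch of what that training pipeline can look like: the snippet below fine-tunes a base model with sentence-transformers' classic fit API, assuming you've exported (query, relevant passage) pairs from your own corpus. The file name, batch size, and epoch count are illustrative, not prescriptive.

```python
# Minimal fine-tuning sketch with sentence-transformers.
# Assumes a JSONL file of {"query": ..., "positive": ...} pairs built from your own data;
# "train_pairs.jsonl" and the hyperparameters below are placeholders.
import json
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

train_examples = []
with open("train_pairs.jsonl") as f:
    for line in f:
        pair = json.loads(line)
        # MultipleNegativesRankingLoss treats other in-batch positives as negatives,
        # so (query, positive) pairs are enough to get started.
        train_examples.append(InputExample(texts=[pair["query"], pair["positive"]]))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="bge-large-finetuned",
)
```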
Chunking Strategy: Fixed Size Doesn't Work
Most RAG implementations start with a "512-token overlapping window" as the default. This barely works for markdown blogs and breaks immediately on a mixed-format corpus (PDF, HTML, JSON).
Fixed-size chunking problems:
- Headings get split, semantic integrity is lost
- Tables and code blocks are severed mid-block
- Overlap duplicates the same context across neighboring chunks, adding retrieval noise
The alternative is semantic chunking: split documents along sentence boundaries, heading hierarchy, and structural units. Replace LangChain's RecursiveCharacterTextSplitter with MarkdownTextSplitter or a custom parser. For PDFs, use pdfplumber to separate tables from body text and apply a different chunking strategy to each.
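Here's a minimal sketch of what a custom heading-aware splitter can look like for markdown sources. It splits on heading boundaries first and falls back to sentence boundaries only when a section exceeds the token budget; the 512-token budget and the tiktoken cl100k_base tokenizer are assumptions, not requirements.

```python
# Heading-aware chunking sketch for markdown: split on headings first,
# then fall back to sentence boundaries when a section exceeds the budget.
import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def num_tokens(text: str) -> int:
    return len(enc.encode(text))

def semantic_chunks(markdown: str, max_tokens: int = 512) -> list[str]:
    # Split into sections at heading lines (#, ##, ###, ...).
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        if num_tokens(section) <= max_tokens:
            chunks.append(section.strip())
            continue
        # Oversized section: accumulate whole sentences up to the budget.
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", section):
            if current and num_tokens(current + " " + sentence) > max_tokens:
                chunks.append(current.strip())
                current = sentence
            else:
                current += " " + sentence
        if current.strip():
            chunks.append(current.strip())
    return chunks
```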
For an e-commerce firm, we split product documents into three chunk types:
- Title + short description: 128 tokens, lightweight for retrieval
- Technical specs + table: 256 tokens, structured data
- Long-form content (blog, guide): 512 tokens, semantically split
Each chunk type got metadata (chunk_type, source_page). During retrieval, we filtered by chunk_type based on query type. For example, "product comparison" queries only looked at technical_specs chunks. This improved precision@3 by 18%.
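Retrieval-time routing can be as simple as a metadata filter. The sketch below uses a deliberately naive keyword rule as the query classifier, and `vector_db.search(filter=...)` stands in for whatever metadata-filter syntax your vector store actually exposes.

```python
# Sketch of chunk_type routing at retrieval time. classify_query() is a naive
# keyword rule; replace it with whatever query classifier you actually use.
def classify_query(query: str):
    q = query.lower()
    if any(kw in q for kw in ("compare", "vs", "specs", "dimensions")):
        return "technical_specs"
    if any(kw in q for kw in ("how to", "guide", "tutorial")):
        return "long_form"
    return None  # no filter: search all chunk types

def retrieve(query: str, query_embedding, top_k: int = 10):
    chunk_type = classify_query(query)
    metadata_filter = {"chunk_type": chunk_type} if chunk_type else None
    # filter= is a placeholder for your vector store's metadata-filter argument
    return vector_db.search(query_embedding, top_k=top_k, filter=metadata_filter)
```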
Overlap Strategy: How Much Is Enough?
Overlap is usually recommended at 10–20%, but that's arbitrary. Test results: 50-token overlap on 512-token chunks preserves semantic continuity. 100-token overlap increases retrieval latency by 12% without quality gains. The sweet spot varies by domain—test it on your own eval set.
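A sweep like the following makes that test concrete. It assumes hypothetical `chunk_documents()` and `build_index()` helpers for your own pipeline and reuses the `evaluate_retrieval()` metric defined in the eval section below.

```python
# Overlap sweep sketch: rebuild the index per overlap value, then compare
# recall@5 and latency on the golden eval set. chunk_documents() and
# build_index() are hypothetical stand-ins for your own pipeline.
import time

for overlap in (0, 25, 50, 100):
    chunks = chunk_documents(corpus, chunk_size=512, overlap=overlap)
    index = build_index(chunks)

    recalls, latencies = [], []
    for item in eval_set:  # the golden dataset from eval_set.json
        start = time.perf_counter()
        retrieved = index.search(item["query"], top_k=5)
        latencies.append(time.perf_counter() - start)
        metrics = evaluate_retrieval(item["query"], [d.id for d in retrieved], item["expected_docs"])
        recalls.append(metrics["recall@5"])

    p50_ms = sorted(latencies)[len(latencies) // 2] * 1000
    print(f"overlap={overlap}: recall@5={sum(recalls) / len(recalls):.2f}, p50 latency={p50_ms:.0f}ms")
```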
Eval Setup: Build It Before Production
Most RAG systems go to production on the strength of a "looks good visually" check. But without a structured eval pipeline measuring retrieval quality, you won't know whether the system can be trusted on its first 1,000 queries.
Minimal eval pipeline:
`eval_set.json`, the golden dataset:

```json
[
  {
    "query": "How do I ensure GDPR-compliant user consent?",
    "expected_docs": ["doc_42", "doc_89"],
    "expected_answer_contains": ["cookie notice", "explicit consent"]
  },
  ...
]
```
```python
# eval metrics
def evaluate_retrieval(query, retrieved_docs, expected_docs):
    # Retrieval metrics against the golden set: recall@5 and MRR
    recall_at_5 = len(set(retrieved_docs[:5]) & set(expected_docs)) / len(expected_docs)
    mrr = 1 / (retrieved_docs.index(expected_docs[0]) + 1) if expected_docs[0] in retrieved_docs else 0
    return {"recall@5": recall_at_5, "mrr": mrr}

def evaluate_generation(generated_answer, expected_contains):
    # LLM-as-judge: ask Claude whether the answer covers the expected content
    prompt = f"Expected: {expected_contains}\nGenerated: {generated_answer}\nScore 0-1:"
    score = claude_api(prompt)  # placeholder for your LLM client call
    return float(score)
```
Eval frequency: After every embedding model change, chunking tweak, or strategy shift. Run it automatically in CI/CD. Block deployment if recall@5 drops below 0.7.
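A minimal CI gate can be a single script that runs the eval set and exits non-zero below the threshold, which is what makes the CI job fail and the deploy block. Here, `run_retrieval()` is a placeholder for your pipeline's retrieval call.

```python
# ci_eval_gate.py: run the golden eval set and block the deploy if recall@5 drops.
# run_retrieval() is a placeholder; evaluate_retrieval() is the metric defined above.
import json
import sys

RECALL_THRESHOLD = 0.7

with open("eval_set.json") as f:
    eval_set = json.load(f)

recalls = []
for item in eval_set:
    retrieved_ids = run_retrieval(item["query"], top_k=5)  # your pipeline's retrieval call
    metrics = evaluate_retrieval(item["query"], retrieved_ids, item["expected_docs"])
    recalls.append(metrics["recall@5"])

avg_recall = sum(recalls) / len(recalls)
print(f"recall@5 over {len(eval_set)} queries: {avg_recall:.3f}")

if avg_recall < RECALL_THRESHOLD:
    print(f"FAIL: recall@5 below {RECALL_THRESHOLD}, blocking deploy")
    sys.exit(1)  # non-zero exit fails the CI job
```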
In practice: we built a 200-query eval set for a customer. The eval pipeline ran automatically on every commit. One chunking change lifted recall@5 from 0.68 to 0.81, but p95 latency jumped from 340ms to 520ms. Once we visualized the cost/latency tradeoff on the dashboard, we reverted the chunking and tried a different approach. Without eval, we'd never have seen this tradeoff.
Hybrid Search: Sparse + Dense Retrieval
Relying only on vector similarity fails on edge cases. For example, queries requiring exact keyword matches (product codes, API endpoints) score low on vector search. This is where hybrid search comes in: combine BM25 (sparse) + embedding (dense) scores.
```python
# Hybrid retrieval example
bm25_results = bm25_index.search(query, top_k=20)
vector_results = vector_db.search(query_embedding, top_k=20)

# RRF (Reciprocal Rank Fusion): 1 / (k + rank), with rank starting at 1
def rrf_score(rank, k=60):
    return 1 / (k + rank)

combined_scores = {}
for rank, doc in enumerate(bm25_results, start=1):
    combined_scores[doc.id] = combined_scores.get(doc.id, 0) + rrf_score(rank)
for rank, doc in enumerate(vector_results, start=1):
    combined_scores[doc.id] = combined_scores.get(doc.id, 0) + rrf_score(rank)

final_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)[:5]
```
Test result: hybrid search boosted recall@5 by 22% on technical queries. But latency doubled because you're hitting two separate indexes. If that tradeoff is acceptable (internal tool, sub-500ms is fine), hybrid search works in production.
Reranking: Second-Pass Filtering
Initial retrieval (BM25 + vector) returns 20–50 documents. But not all fit into LLM context (token limits + cost). A reranker model rescores each document's relevance to the query and picks the top-5.
Models like Cohere's rerank-english-v2.0 or bge-reranker-large are standard. Rerankers use cross-encoder architecture—encoding query + document together—so they're more expensive than embeddings but far more accurate.
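As a sketch of the self-hosted route, sentence-transformers' CrossEncoder wraps bge-reranker-large in a few lines; a hosted reranker like Cohere's follows the same pattern of scoring (query, document) pairs and keeping the top-n.

```python
# Second-pass reranking sketch with a self-hosted cross-encoder.
# The cross-encoder scores each (query, document) pair jointly, which is why it is
# slower but more accurate than comparing precomputed embeddings.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query: str, candidates: list[str], top_n: int = 5):
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]
```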
A benchmark from our own work, reranking over 50 candidate documents:
- Recall@5: 0.73 → 0.89
- Latency: +180ms (acceptable)
- Cost: +$0.002 per retrieval (Cohere API)
If budget is tight, self-hosted rerankers are an option but require GPU inference. At that point, you need to do the math: self-hosted infrastructure cost vs API cost.
Context Window Optimization: Fewer Documents, Better Answers
Sending 20 documents to an LLM doesn't always produce better answers. Long context triggers the "lost in the middle" problem—the model skips information in the middle. Test result: feeding GPT-4 Turbo 5 documents produces better answers than 15 documents (11% BLEU score difference).
Optimization strategy:
- Use reranking to pick top-5
- Drop any document with relevance score < 0.6
- Send the remaining 3–5 documents to LLM context
This cuts input token cost (70% reduction) and improves answer quality. In production, you're balancing the cost/latency/quality triangle—your eval pipeline makes this visible.
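The filter-then-truncate step itself is small. The sketch below assumes reranker scores normalized to 0–1 (as Cohere's relevance scores are); for a raw-logit reranker like bge-reranker, calibrate the threshold on your own eval set rather than reusing 0.6.

```python
# Filter-then-truncate sketch: drop low-relevance documents, cap at 5 documents,
# then assemble the prompt context. The 0.6 threshold assumes 0-1 scores.
def build_context(ranked_docs: list[tuple[str, float]],
                  min_score: float = 0.6,
                  max_docs: int = 5) -> str:
    kept = [(doc, score) for doc, score in ranked_docs if score >= min_score][:max_docs]
    # Number the passages so the LLM can refer to the one it used.
    return "\n\n".join(f"[{i + 1}] {doc}" for i, (doc, _) in enumerate(kept))
```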
Production Monitoring: Retrieval Drift
Retrieval quality degrades over time—as new documents are added, as query distribution shifts. Set up a dashboard to track retrieval drift:
| Metric | Target | Alert Threshold |
|---|---|---|
| Recall@5 (weekly eval) | > 0.75 | < 0.70 |
| P95 latency | < 400ms | > 600ms |
| Zero-result queries (%) | < 5% | > 10% |
| Average relevance score | > 0.65 | < 0.55 |
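A weekly job over your production query logs is enough to feed those alerts. In the sketch below, `query_log` and `alert()` are placeholders for your logging store and alerting hook; the thresholds mirror the table above.

```python
# Weekly drift check sketch. query_log is a placeholder for however you log
# retrievals in production (query, results, relevance scores); alert() stands in
# for your alerting hook.
def weekly_drift_check(query_log: list[dict]):
    zero_result_rate = sum(1 for q in query_log if not q["results"]) / len(query_log)
    scores = [s for q in query_log for s in q["relevance_scores"]]
    avg_relevance = sum(scores) / len(scores) if scores else 0.0

    if zero_result_rate > 0.10:
        alert(f"zero-result queries at {zero_result_rate:.1%} (threshold 10%)")
    if avg_relevance < 0.55:
        alert(f"average relevance score at {avg_relevance:.2f} (threshold 0.55)")
```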
If you spot recall drift:
- Refresh your eval set (add new query patterns)
- Fine-tune your embedding model or swap it
- Review your chunking strategy
This monitoring fits into first-party data & measurement architecture—your RAG system is a data pipeline and needs observability.
Cost vs Quality Tradeoff: Pragmatic Choices
Every production RAG decision involves a cost/quality/latency tradeoff. Some pragmatic picks:
- Embedding model: Swap OpenAI 3-large for Cohere v3 → 30% cost cut, 2% quality loss (acceptable)
- Reranking: Rerank every query vs only ambiguous ones → 40% latency drop
- Hybrid search: BM25 + vector vs vector alone (if exact match doesn't matter) → 50% latency drop
- Context window: 10 docs vs 5 docs → 60% token cost cut, 8% quality gain
Without an eval pipeline, you don't see these tradeoffs. You change embedding models, get cheaper, and miss the 15% drop in retrieval quality.
Before shipping RAG to production, take embedding models, chunking strategies, and eval setup seriously. Cost optimization comes second—first, nail retrieval quality and keep it stable, then reduce costs. Otherwise, the system's unreliability surfaces to users and adoption tanks.