RAG systems have become widespread in production since 2024. Companies are building embedding + vector DB stacks to feed their own document corpus to LLMs. But most pilot projects hit the same wall: retrieval quality is low, answers are inconsistent, and costs spiral. The problem usually comes down to hasty decisions about the embedding model, the chunking strategy, and the eval setup. This post walks through the decisions that have no undo button once your RAG pipeline ships to production.
Embedding Model: Alignment Over Dimensionality
The first instinct when choosing an embedding model is to ask which one has the highest MTEB score. But benchmark rankings don't guarantee production performance. What matters is how well the model aligns with your document types and query patterns.
When we compared OpenAI's text-embedding-3-large (3072 dim) against Cohere's embed-v3 (1024 dim), Cohere delivered more consistent recall@10 on marketing documents (blogs, case studies, landing pages) because its training set was heavy on business content. The larger OpenAI model scored well on general benchmarks, but our domain-specific query distribution looked nothing like those benchmarks.
Another example: bge-large-en-v1.5 (1024 dim, self-hosted) is sufficient for legal documents, but on a multilingual corpus, multilingual-e5-large (1024 dim) clearly outperforms it. Model size isn't always a quality signal; training data overlap with your domain matters more.
Selection criteria:
- Not the MTEB score, but recall@5 / MRR on your own eval set
- Latency (self-hosted vs API): batch embedding time for 512 documents (see the timing sketch after this list)
- Cost per 1M tokens: OpenAI 3-large $0.13, Cohere v3 $0.10; self-hosted has no per-token fee, but you still pay for the infrastructure
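For the latency check, here is a minimal timing sketch assuming a self-hosted sentence-transformers model; load_sample_docs is a hypothetical helper that pulls 512 documents from your corpus:
# Batch embedding timing sketch (self-hosted, sentence-transformers)
import time
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
docs = load_sample_docs(n=512)  # hypothetical helper: 512 documents from your corpus
start = time.perf_counter()
embeddings = model.encode(docs, batch_size=64, show_progress_bar=False)
print(f"512 docs embedded in {time.perf_counter() - start:.1f}s, dim={embeddings.shape[1]}")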
If your document set has domain-specific jargon (pharma, finance, legal), fine-tuning an embedding model (e.g., a sentence-transformers model) on your own data lifts retrieval quality by 15–20%. This falls under data analytics & insight engineering: you need a training pipeline and data-quality monitoring.
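A minimal fine-tuning sketch, assuming the classic sentence-transformers training API and a train_pairs list of (query, relevant_passage) pairs mined from your own corpus:
# Embedding fine-tuning sketch with in-batch negatives
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
examples = [InputExample(texts=[query, passage]) for query, passage in train_pairs]
loader = DataLoader(examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)  # other in-batch examples act as negatives
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("bge-large-finetuned")
Re-run the same recall@5 / MRR eval on the fine-tuned checkpoint before swapping it into production.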
Chunking Strategy: Fixed Size Doesn't Scale
Most RAG implementations start with a 512-token overlapping window as the default. This barely works for markdown blogs and breaks immediately on a mixed-format corpus (PDF, HTML, JSON).
Problems with fixed-size chunking:
- Headers get split, semantic integrity lost
- Tables, code blocks cut in half
- Overlap duplicates context across chunk boundaries, increasing retrieval noise
Alternative: semantic chunking. Split on sentence boundaries and heading hierarchy to preserve semantic units. Use LangChain's MarkdownTextSplitter instead of RecursiveCharacterTextSplitter. Parse PDFs with pdfplumber to separate tables from text and apply a different strategy to each.
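A minimal sketch of that split, assuming markdown_doc holds a markdown string and manual.pdf is one of your PDFs; a production version would also deduplicate table text from the page body:
# Semantic chunking sketch: markdown by structure, PDF tables kept whole
import pdfplumber
from langchain.text_splitter import MarkdownTextSplitter
md_splitter = MarkdownTextSplitter(chunk_size=512, chunk_overlap=50)
md_chunks = md_splitter.split_text(markdown_doc)  # splits on headings/paragraphs before falling back to size
pdf_chunks = []
with pdfplumber.open("manual.pdf") as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():  # keep tables whole instead of cutting rows in half
            pdf_chunks.append({"type": "table", "text": str(table), "page": page.page_number})
        pdf_chunks.append({"type": "text", "text": page.extract_text() or "", "page": page.page_number})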
For an e-commerce company's RAG stack, we split product documentation into three chunk types:
- Title + short description: 128 tokens, lightweight for retrieval
- Technical specs + table: 256 tokens, structured data
- Long-form content (blog, guide): 512 tokens, semantic split
We added metadata to each chunk (chunk_type, source_page). During retrieval, we filtered by chunk_type based on query type. For example, "product comparison" queries only looked at technical_specs chunks. This lifted precision@3 by 18%.
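A sketch of that routing; the vector_db client and its filter syntax are stand-ins for your own store, and the router here is a toy keyword rule:
# Metadata-filtered retrieval sketch
def route_chunk_type(query: str):
    # Toy router: real setups use a small classifier or explicit query tagging
    if "compare" in query.lower() or " vs " in query.lower():
        return "technical_specs"
    return None  # no filter, search all chunk types
chunk_type = route_chunk_type(query)
metadata_filter = {"chunk_type": chunk_type} if chunk_type else None
results = vector_db.search(query_embedding, top_k=10, filter=metadata_filter)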
Overlap Strategy: How Much Is Enough?
Overlap is usually recommended at 10–20% but that's arbitrary. Our test: 50-token overlap on 512-token chunks preserves semantic continuity. 100-token overlap bumped retrieval latency 12% with no quality gain. The sweet spot varies by domain—test on your eval set.
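A sweep like the one below makes that test concrete; chunk_corpus, build_index, and run_eval stand in for your own pipeline functions and the eval set described in the next section:
# Overlap sweep sketch: measure quality and latency per overlap setting
for overlap in [0, 25, 50, 100]:
    chunks = chunk_corpus(docs, chunk_size=512, overlap=overlap)
    index = build_index(chunks)
    metrics = run_eval(index, eval_set)  # e.g. recall@5, MRR, p95 latency
    print(f"overlap={overlap}: {metrics}")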
Eval Setup: Must Exist Before Production
Most RAG systems go to production on a "looks good visually" check. But without a structured eval setup to measure retrieval quality, the system won't be trustworthy past the first 1,000 queries.
Minimal eval pipeline:
# eval_set.json — golden dataset
[
  {
    "query": "How to collect user consent in GDPR-compliant way?",
    "expected_docs": ["doc_42", "doc_89"],
    "expected_answer_contains": ["cookie notice", "explicit consent"]
  },
  ...
]
# eval metrics
def evaluate_retrieval(query, retrieved_docs, expected_docs):
    # Recall@5: share of expected docs that appear in the top-5 retrieved results
    recall_at_5 = len(set(retrieved_docs[:5]) & set(expected_docs)) / len(expected_docs)
    # MRR: reciprocal rank of the first retrieved doc that is relevant
    relevant_ranks = [i + 1 for i, doc_id in enumerate(retrieved_docs) if doc_id in expected_docs]
    mrr = 1 / relevant_ranks[0] if relevant_ranks else 0.0
    return {"recall@5": recall_at_5, "mrr": mrr}
def evaluate_generation(generated_answer, expected_contains):
    # LLM-as-judge: ask Claude "does this answer cover the expected content?"
    prompt = f"Expected: {expected_contains}\nGenerated: {generated_answer}\nScore 0-1:"
    score = claude_api(prompt)  # claude_api: thin wrapper around your LLM client, returns a number as text
    return float(score)
Eval frequency: after every embedding model change and every chunking strategy tweak. Run it automatically in CI/CD, and block the deploy if recall@5 drops below 0.7.
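A minimal CI gate might look like this; the file name and the run_eval helper are illustrative:
# ci_eval_gate.py: hypothetical deploy gate, run on every commit
import json
import sys
with open("eval_set.json") as f:
    eval_set = json.load(f)
# run_eval is your own pipeline entry point: retrieve for the query, compute recall@5
results = [run_eval(item["query"], item["expected_docs"]) for item in eval_set]
mean_recall = sum(r["recall@5"] for r in results) / len(results)
if mean_recall < 0.70:
    print(f"recall@5 = {mean_recall:.2f}, below 0.70: blocking deploy")
    sys.exit(1)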
Real scenario: we built a 200-query eval set for a customer, and the eval pipeline ran automatically on every commit. One chunking change lifted recall@5 from 0.68 to 0.81, but p95 latency went from 340ms to 520ms. Without the eval pipeline, this quality-latency tradeoff would have been invisible.
Hybrid Search: Sparse + Dense Retrieval Combined
Relying only on vector similarity fails on edge cases. For example, queries needing exact keyword matches (product codes, API endpoint names) score low in vector search. This is where hybrid search enters: combine BM25 (sparse) + embedding (dense) scores.
# Hybrid retrieval example (bm25_index and vector_db are stand-ins for your own clients)
bm25_results = bm25_index.search(query, top_k=20)
vector_results = vector_db.search(query_embedding, top_k=20)
# RRF (Reciprocal Rank Fusion): ranks start at 1
def rrf_score(rank, k=60):
    return 1 / (k + rank)
combined_scores = {}
for rank, doc in enumerate(bm25_results, start=1):
    combined_scores[doc.id] = combined_scores.get(doc.id, 0) + rrf_score(rank)
for rank, doc in enumerate(vector_results, start=1):
    combined_scores[doc.id] = combined_scores.get(doc.id, 0) + rrf_score(rank)
final_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)[:5]
Test result: hybrid search lifted recall@5 by 22% on technical queries. But latency doubled because you're querying two separate indexes. If that tradeoff is acceptable (e.g., an internal tool with a sub-500ms latency budget), hybrid search works in production.
Reranking: Second-Stage Filtering
First-stage retrieval (BM25 + vector) returns 20–50 documents, but not all of them fit in the LLM context (cost + token limits). A reranker model steps in: it rescores each document's relevance to the query and keeps the top 5.
Models like Cohere's rerank-english-v2.0 or bge-reranker-large do this. Rerankers use cross-encoder architecture—they encode query + document together, so they're pricier than embeddings but more accurate.
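A self-hosted sketch, assuming the sentence-transformers CrossEncoder wrapper and the BAAI/bge-reranker-large checkpoint, with candidate_docs coming from first-stage retrieval:
# Cross-encoder reranking sketch: score each (query, document) pair jointly
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-large")  # realistically needs a GPU for low latency
pairs = [(query, doc.text) for doc in candidate_docs]
scores = reranker.predict(pairs)  # one relevance score per pair, roughly in the 0-1 range
reranked = sorted(zip(candidate_docs, scores), key=lambda x: x[1], reverse=True)
top_docs = [doc for doc, score in reranked[:5]]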
Benchmark: applying reranking over 50 documents:
- Recall@5: 0.73 → 0.89
- Latency: +180ms (acceptable)
- Cost: +$0.002 per retrieval (Cohere API)
If budget is tight, use a self-hosted reranker—but you need GPU inference. At this point, calculate self-hosted infra cost vs API cost.
Context Window Optimization: Fewer Documents, Better Answers
Sending 20 documents to an LLM doesn't always produce better answers. Long context triggers the "lost in the middle" problem, where the model skips information buried in the middle of the prompt. Test result: giving GPT-4 Turbo 5 documents produced better answers than giving it 15 (an 11% BLEU score difference).
Optimization strategy:
- Use reranker to pick top-5
- Drop documents with relevance score < 0.6
- Send remaining 3–5 documents to LLM context
This cuts input token cost by 70% and improves answer quality. In production, you're balancing the cost/latency/quality triangle—eval pipeline makes this visible.
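A sketch of that pruning step, assuming reranked is a list of (doc, relevance_score) pairs with scores roughly in the 0-1 range:
# Context pruning sketch: keep only confidently relevant documents, capped at 5
RELEVANCE_THRESHOLD = 0.6
context_docs = [doc for doc, score in reranked if score >= RELEVANCE_THRESHOLD][:5]
context = "\n\n".join(doc.text for doc in context_docs)  # this is what goes into the LLM prompt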
Production Monitoring: Retrieval Drift
Retrieval quality can degrade over time as new documents are added and the query distribution shifts. Set up a retrieval drift dashboard:
| Metric | Target | Alarm Threshold |
|---|---|---|
| Recall@5 (weekly eval) | > 0.75 | < 0.70 |
| P95 latency | < 400ms | > 600ms |
| Zero-result queries (%) | < 5% | > 10% |
| Average relevance score | > 0.65 | < 0.55 |
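A weekly check against those alarm thresholds can be a few lines of Python; the metric names, the weekly_metrics source, and send_alert are illustrative:
# Retrieval drift alarm sketch
THRESHOLDS = {
    "recall_at_5": ("min", 0.70),
    "p95_latency_ms": ("max", 600),
    "zero_result_rate": ("max", 0.10),
    "avg_relevance": ("min", 0.55),
}
for metric, (direction, limit) in THRESHOLDS.items():
    value = weekly_metrics[metric]  # produced by your weekly eval / logging job
    breached = value < limit if direction == "min" else value > limit
    if breached:
        send_alert(f"Retrieval drift: {metric}={value} breached threshold {limit}")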
If you see recall drift:
- Update your eval set (add new query patterns)
- Fine-tune the embedding model or swap it
- Revisit chunking strategy
This monitoring falls under first-party data & measurement architecture—a RAG system is also a data pipeline and must be observable.
Cost vs Quality Tradeoff: Pragmatic Choices
In production RAG, every decision involves a cost/quality/latency tradeoff. Some pragmatic moves:
- Embedding model: Use Cohere v3 instead of OpenAI 3-large → 30% cost savings, 2% quality loss (acceptable)
- Reranking: Rerank only ambiguous queries instead of all → 40% latency reduction
- Hybrid search: Vector-only instead of BM25 + vector (if exact match isn't critical) → 50% latency reduction
- Context window: 5 documents instead of 10 → 60% token cost reduction, 8% quality improvement
To see these tradeoffs, you need an eval pipeline. Otherwise you say "I swapped the embedding model, it's cheaper now" but miss the 15% retrieval quality drop.
Before shipping your RAG system to production, take embedding model, chunking strategy, and eval setup seriously. Cost optimization comes second—first stabilize retrieval quality, then optimize cost. Otherwise, the system's unreliability hits your users and adoption suffers.