Embedding models evolve. You upgraded from OpenAI's text-embedding-3-small to text-embedding-3-large: do you regenerate all vectors? Is a year-old content index still valid, or has the semantic space shifted? Running a RAG pipeline in production forces these questions immediately. Embedding drift, the semantic distance between evolving model representations and stale indices, silently erodes retrieval accuracy. This article outlines re-indexing strategies, the cost trade-offs of model migration, and vector versioning practices.
The Anatomy of Drift: Why Embedding Space Shifts
An embedding model doesn't just convert input to vectors; it defines the geometry of a latent space. Model updates, fine-tuning on new domain data, or an architectural migration (e.g., Sentence-BERT to BGE-M3) all change that geometry, and switching to a different model yields an entirely different space. The result: if documents were encoded with the old model and queries with the new one, cosine similarity no longer reflects the original semantic relationships.
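A toy illustration of why mixing spaces breaks scoring, under a deliberately simple assumption: the "new model" is the old one followed by a 90-degree rotation in 2-D. Real model changes are not pure rotations (and usually change dimensionality too), so treat this as intuition, not as a model of any specific embedding. The vectors below are made up.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rotate_2d(v: list[float], degrees: float) -> list[float]:
    """Rotate a 2-D vector counter-clockwise (our stand-in for a model change)."""
    t = math.radians(degrees)
    return [v[0] * math.cos(t) - v[1] * math.sin(t),
            v[0] * math.sin(t) + v[1] * math.cos(t)]

# Hypothetical toy embeddings from the "old" model.
doc_old = [1.0, 0.2]     # indexed last year
query_old = [0.9, 0.3]   # a query that should match doc_old

# The "new" model places the same query text 90 degrees away.
query_new = rotate_2d(query_old, 90)

same_space = cosine(query_old, doc_old)                    # high
mixed_spaces = cosine(query_new, doc_old)                  # the match is lost
reindexed = cosine(query_new, rotate_2d(doc_old, 90))      # high again
print(round(same_space, 3), round(mixed_spaces, 3), round(reindexed, 3))
```

Re-encoding the document with the same transformation restores the original similarity, which is the whole argument for re-indexing in one line.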
There are two scenarios: intra-model drift (a version change within a model family) and inter-model drift (a move between model families). OpenAI's ada-002 to text-embedding-3-small is inter-model; text-embedding-3-small to text-embedding-3-large is intra-model, but both require re-indexing. The difference is magnitude: querying an index built by a different model family can cut retrieval accuracy by ~40%, a same-family mismatch by ~5-10% (observations on MTEB benchmark tasks).
Drift is hard to detect because the system fails silently. Query latency doesn't increase and no errors are thrown; less relevant documents simply move into the top results. This is why production retrieval-quality metrics (nDCG, recall@k) are non-negotiable. Without user feedback or offline evaluation, you notice drift only after a 15-20% accuracy loss, and by then revenue is already affected.
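A minimal offline-evaluation sketch, assuming you keep a small labeled set of queries with known relevant document IDs (the IDs and labels in any usage are hypothetical). It computes the two metrics named above; running it on a schedule against the live index and alerting when the numbers fall is one lightweight way to catch drift before users do.

```python
import math

def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, float], k: int) -> float:
    """nDCG@k with graded relevance labels (unlabeled IDs count as 0)."""
    def dcg(gains: list[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    actual = dcg([relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]])
    ideal = dcg(sorted(relevance.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

This uses the linear-gain form of DCG; some libraries use `2**rel - 1` as the gain instead, so compare numbers only within one implementation.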
Re-indexing Strategies: Full Rebuild and Incremental Hybrid
Re-indexing takes three forms: full rebuild, incremental re-index, shadow index.
Full rebuild: encode the entire corpus with the new model, write it to a new collection, then atomically switch production traffic. Advantage: guaranteed semantic consistency. Disadvantage: cost. 10M documents at an average of 400 tokens is ~4B tokens; with text-embedding-3-large at OpenAI's $0.13 per 1M tokens, that is ~$520. On Pinecone/Weaviate, 10M 1536-dim vectors make a ~60 GB index, ~$150/month hosting (Pinecone p2 pod). Total for the first pass: ~$650-700.
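A back-of-the-envelope calculator for the figures above, assuming the list price quoted ($0.13 per 1M tokens) and float32 storage. It counts raw vector bytes only; HNSW graph overhead, metadata, and hosting price depend on the provider and are left out.

```python
def embedding_cost_usd(num_docs: int, avg_tokens: int, usd_per_million_tokens: float) -> float:
    """One-off cost of encoding the whole corpus."""
    total_tokens = num_docs * avg_tokens
    return total_tokens / 1_000_000 * usd_per_million_tokens

def raw_index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in decimal GB (float32 by default), excluding graph and metadata."""
    return num_vectors * dims * bytes_per_value / 1e9

cost = embedding_cost_usd(10_000_000, 400, 0.13)  # 4B tokens
size = raw_index_size_gb(10_000_000, 1536)
print(f"encoding: ${cost:,.0f}, raw vectors: {size:.1f} GB")
```

Swapping in `dims=3072` doubles the storage figure, which is the starting point for the dimensionality trade-off discussed later.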
Incremental re-index: only new or modified documents get new embeddings; old documents keep their old embeddings. Advantage: ~70% cost reduction (assuming 30% of the corpus is added or changed within 6 months). Disadvantage: a hybrid space. Queries are encoded with the new model while some documents still carry old-model vectors, so cosine scores across the two groups are not comparable, and if the models' outputs aren't normalized, magnitude bias skews the ranking as well.
Shadow index: build the new model's index alongside the production index. Route real queries to both and compare the results, while users see only the old index. Switch production once the new index passes an accuracy threshold. Advantage: risk-free A/B comparison. Disadvantage: double cost, plus 30-40% added latency when both queries and the result aggregation run in the request path.
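A sketch of the shadow pattern, assuming each index is wrapped in a plain search function: `primary` and `shadow` are hypothetical callables, not a vector-DB API. The shadow query runs synchronously here for simplicity; moving it off the request path (a background task, or only a sample of traffic) is one way to keep the user-facing path as fast as before.

```python
from typing import Callable

SearchFn = Callable[[str, int], list[str]]  # (query, k) -> ranked doc IDs

def overlap_at_k(a: list[str], b: list[str], k: int) -> float:
    """Jaccard overlap of two top-k result lists (1.0 = identical sets)."""
    sa, sb = set(a[:k]), set(b[:k])
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def shadow_search(query: str, k: int, primary: SearchFn, shadow: SearchFn,
                  log: list[dict]) -> list[str]:
    """Serve results from the primary index; record how the shadow index differs."""
    served = primary(query, k)
    candidate = shadow(query, k)  # in production, consider running this asynchronously
    log.append({
        "query": query,
        "overlap": overlap_at_k(served, candidate, k),
        "served": served,
        "shadow": candidate,
    })
    return served  # users only ever see the primary index
```

Overlap alone says the two indexes disagree, not which one is better; the logged result lists still need relevance judgments (offline labels or click feedback) before a promotion decision.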
Our choice: shadow index, then full rebuild. We evaluate with the shadow index for two weeks; if nDCG@10 improves by more than 5%, we switch production and drop the old index. We use incremental re-indexing only for minor model bumps (ada-002 v1 → v2).
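The promotion rule can be written as a small gate. This sketch assumes "improvement >5%" means a relative gain in mean nDCG@10 over the same queries scored against both indexes; an absolute 5-point gain is the other plausible reading, so adjust the comparison if that is what you track.

```python
from statistics import mean

def should_promote(ndcg_old: list[float], ndcg_new: list[float],
                   min_relative_gain: float = 0.05) -> bool:
    """Promote the shadow index if mean nDCG@10 improves by more than the threshold."""
    if len(ndcg_old) != len(ndcg_new) or not ndcg_old:
        raise ValueError("need paired, non-empty per-query scores")
    old, new = mean(ndcg_old), mean(ndcg_new)
    # Relative gain; a zero baseline can't be compared this way.
    return old > 0 and (new - old) / old > min_relative_gain
```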
Model Migration Cost-Tradeoff: Dimensionality and Inference
New embedding models typically use more dimensions: ada-002 (1536-dim) → text-embedding-3-large (3072-dim). The higher dimension increases two costs: storage and query latency.
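Storage: on Pinecone's pod architecture, a 3072-dim vector uses 2× the disk of a 1536-dim one (float32: 3072 × 4 bytes ≈ 12 KB per vector), so 10M vectors take ~120 GB. That outgrows the ~60 GB p2 setup costed above; moving to a larger pod configuration runs ~$500/month. Alternative: quantization. Scalar (int8) quantization stores one byte per dimension instead of four, a 75% storage reduction for ~2-3% recall loss; product or binary quantization compresses further, usually at a higher recall cost.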
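The 75% figure corresponds to scalar (int8) quantization: one byte per dimension instead of four. A minimal per-vector min/max sketch of the idea follows. It is illustrative only; vector databases that offer quantization implement their own schemes, and this version also needs two extra floats (`lo`, `scale`) stored per vector.

```python
from array import array

def scalar_quantize(vec: list[float]) -> tuple[array, float, float]:
    """Map float values to signed 8-bit codes using the vector's own min/max range."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # avoid division by zero for constant vectors
    # (v - lo) / scale lies in [0, 255]; shift to the int8 range [-128, 127].
    codes = array("b", (round((v - lo) / scale) - 128 for v in vec))
    return codes, lo, scale

def dequantize(codes: array, lo: float, scale: float) -> list[float]:
    """Approximate reconstruction of the original floats (error <= scale / 2)."""
    return [(c + 128) * scale + lo for c in codes]

vec = [0.1, -0.4, 0.25, 0.9, -0.9]
codes, lo, scale = scalar_quantize(vec)
print(array("f", vec).itemsize, "bytes/dim ->", codes.itemsize, "byte/dim")
```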
Query latency: a higher dimension means more distance computation during HNSW traversal. Going from 1536-dim to 3072-dim pushes p95 from 45 ms to 70 ms (extrapolated from Pinecone's documentation). If your SLA target is <50 ms, that is unacceptable. Solution: dimension reduction. Use text-embedding-3-large's `dimensions` parameter to shorten vectors to 1536. Trade-off: 1-2% accuracy loss, with latency staying flat.
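If you already hold full 3072-dim vectors, OpenAI's documentation describes shortening text-embedding-3 vectors manually by truncating them and then L2-normalizing; passing the `dimensions` parameter at request time is the recommended route for new embeddings. A sketch of the manual path (the commented call uses the official `openai` Python SDK and is shown for reference only):

```python
import math

def shorten_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

# Request-time alternative with the official SDK:
#   client.embeddings.create(model="text-embedding-3-large",
#                            input=texts, dimensions=1536)
```

Re-normalizing matters because a truncated vector is no longer unit length, and an index scoring by dot product would otherwise mix magnitudes.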
Cost-tradeoff matrix:
| Option | Storage (10M vectors) | p95 latency | Accuracy drop (vs. new model, full 3072-dim) |
|---|---|---|---|
| 1536-dim (old model) | 60 GB | 45 ms | n/a (different model) |
| 3072-dim (new model, full) | 120 GB | 70 ms | Reference |
| 3072-dim + quantization | 30 GB | 65 ms | -2% recall |
| 1536-dim (new model, reduced) | 60 GB | 48 ms | -1% recall |
Our choice: reduce the new model to 1536-dim. Accuracy loss is minimal and infrastructure cost stays flat. If you track an end metric for the downstream task (e.g., citation rate in a GEO pipeline), evaluate 1536 vs. 3072 offline against that metric directly; a 1% retrieval difference usually doesn't move final metrics.
Versioning: Storing Embedding Provenance in Metadata
Treat your vector DB like an audit log: each vector carries a timestamp and model-version metadata. Weaviate stores these as object properties, Qdrant as payload fields:
```json
{
  "id": "doc-12345",
  "vector": [...],
  "metadata": {
    "model": "text-embedding-3-large",
    "model_version": "2024-04",
    "indexed_at": "2026-01-15T10:30:00Z",
    "content_hash": "a3f8c..."
  }
}
```
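One way to produce records in this shape at write time, assuming a plain dict that you then hand to your database's upsert call (as Weaviate properties or a Qdrant payload). `build_record` is a hypothetical helper; the field names simply mirror the example above.

```python
import hashlib
from datetime import datetime, timezone

def build_record(doc_id: str, vector: list[float], content: str,
                 model: str, model_version: str) -> dict:
    """Assemble a vector record that carries its own provenance."""
    return {
        "id": doc_id,
        "vector": vector,
        "metadata": {
            "model": model,
            "model_version": model_version,
            # ISO 8601 UTC timestamp, e.g. "2026-01-15T10:30:00Z"
            "indexed_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            # Hash of the exact text that was embedded, for idempotent re-indexing
            "content_hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        },
    }
```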
This metadata serves three purposes:
- Incremental re-index filtering: query `model_version != current` to find documents that need updating.
- Drift detection: at query time, log when more than 30% of results come from stale model versions, and auto-trigger a re-index.
- Rollback: if the new model breaks production, filter on metadata to fall back to the old embeddings (as long as the shadow index still exists).
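The first two uses reduce to simple metadata checks. A sketch over records shaped like the JSON above, assuming results arrive as plain dicts; in a real deployment the stale filter would usually be pushed down into the database's own metadata filter rather than run client-side. The function names are hypothetical.

```python
def stale_ids(records: list[dict], current_version: str) -> list[str]:
    """IDs whose embeddings came from a model version other than the current one."""
    return [r["id"] for r in records
            if r["metadata"].get("model_version") != current_version]

def stale_share(results: list[dict], current_version: str) -> float:
    """Fraction of a result list that comes from stale model versions."""
    if not results:
        return 0.0
    return len(stale_ids(results, current_version)) / len(results)

def needs_reindex(results: list[dict], current_version: str,
                  threshold: float = 0.30) -> bool:
    """Drift alarm: more than `threshold` of the results are stale."""
    return stale_share(results, current_version) > threshold
```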
Metadata overhead is small: ~100 bytes per vector, so 10M documents add ~1 GB. The operational flexibility, however, is large, and it is essential for multi-tenant systems where tenants use different model versions.
Content Hash for Idempotency: Avoiding Redundant Re-indexing
A separate problem: re-indexing content that hasn't changed. Your CMS pipeline fetches all blog posts nightly and sends them to the index; 90% are unchanged, 10% updated. Re-encoding the entire corpus is wasteful.
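Solution: hash the document content with SHA-256 and store the hash in metadata. Before re-indexing, compare hashes and skip re-encoding on a match:
```python
import hashlib

def should_reindex(doc_id, new_content, vector_db):
    # vector_db.get_metadata stands in for your client's metadata lookup
    existing = vector_db.get_metadata(doc_id)
    if not existing:
        return True  # never indexed: encode it
    new_hash = hashlib.sha256(new_content.encode("utf-8")).hexdigest()
    # Re-encode only if the content actually changed
    return new_hash != existing.get("content_hash")
```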
This pattern cuts encoding cost 70-80% in daily incremental pipelines. But caveat: if model_version changes, skip the hash check entirely. Logic: if model_version != current OR content_hash != existing → re-index.
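The combined rule folds into a single check. This sketch takes the stored metadata as a plain dict (or `None` for a document that was never indexed), so it doesn't depend on any particular database client; `needs_reencode` is a hypothetical name.

```python
import hashlib
from typing import Optional

def needs_reencode(new_content: str, existing_meta: Optional[dict],
                   current_model_version: str) -> bool:
    """Re-encode if the doc is new, the model changed, or the content changed."""
    if existing_meta is None:
        return True  # never indexed
    if existing_meta.get("model_version") != current_model_version:
        return True  # model bump: a matching hash is irrelevant
    new_hash = hashlib.sha256(new_content.encode("utf-8")).hexdigest()
    return new_hash != existing_meta.get("content_hash")
```

Checking the model version first means a model bump forces a full re-encode without hashing anything, which matches the caveat above.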
The Counterargument: Cost of Delaying Re-indexing
Some teams defer re-indexing for 6-12 months on the grounds that "old embeddings are good enough." The risk: if the new model is fine-tuned for your domain (e.g., e-commerce product descriptions), it may deliver 20-30% better retrieval. That gain carries downstream: in one Roibase project with Data Analytics & Insights Engineering, upgrading the RAG pipeline's embedding model lifted product-recommendation click-through by 18% (A/B test, 14 days, n=50K users).
There is a trade-off, though: downtime risk during the switch. Non-atomic transitions expose users to temporary inconsistency (some documents on the new model, some on the old). Solution: blue-green deployment. Stage the new index separately, then switch via DNS or the load balancer in about 10 seconds. Collection aliases, where your vector DB supports them (e.g., Qdrant or Weaviate), simplify this.
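The blue-green switch boils down to repointing one name. A toy in-memory sketch of the idea: `AliasRegistry` is hypothetical, and in practice you would call your database's own alias API, whose operations differ by product. Readers always resolve the alias, so each query sees either the old index or the new one, never a mix.

```python
import threading
from typing import Optional

class AliasRegistry:
    """In-memory stand-in for collection aliases: a name that points at a physical index."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._aliases: dict[str, str] = {}

    def resolve(self, alias: str) -> str:
        """Return the physical index currently behind the alias."""
        with self._lock:
            return self._aliases[alias]

    def swap(self, alias: str, new_target: str) -> Optional[str]:
        """Repoint the alias in one step; return the previous target for rollback."""
        with self._lock:
            previous = self._aliases.get(alias)
            self._aliases[alias] = new_target
            return previous
```

Keep the returned previous target (and the old index) until the new one has proven itself in production; swapping back is the rollback.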
Closing: Embedding Hygiene as Production Practice
Embedding drift is inevitable: models evolve, domain data shifts, and semantic space moves. Treat your vector DB not as a static artifact but as a continuously maintained system. Minimum hygiene checklist: (1) store the model version in metadata, (2) monitor retrieval quality weekly (a lightweight offline eval suffices), (3) test migrations via a shadow index, (4) establish content-hash idempotency. If re-indexing costs are prohibitive, go hybrid (incremental + reduced dimensionality), but measure the accuracy loss downstream instead of guessing. Left unchecked, embedding drift silently erodes search accuracy by 15-20%, and by the time you detect it, user behavior has already shifted.