Embedding Drift: Managing Vector DBs in Production

Deploy a RAG system to production and everything works fine for the first month. In month three, OpenAI releases text-embedding-4 to replace text-embedding-3-large. You test it, results show 4% higher recall. But your 12 million documents are still indexed against the old embedding model. Re-indexing costs 18 hours and $6,400 in API fees. This is where embedding drift begins — you upgrade the model but the vector store lags behind, query embeddings and stored embeddings occupy different manifolds, retrieval accuracy silently decays. This piece explains the cost-quality tradeoff for model migration, how to design incremental re-indexing, and how to measure drift in production.

What is Embedding Drift and Why It Matters

Embedding drift occurs when the query embedding model differs from the document embedding model. Index documents with model A, then query with model B — cosine similarity loses meaning. Both models operate in different vector spaces, so "similarity" scores become unreliable.

This appears in three main scenarios: (1) embedding provider releases a new version (OpenAI ada-002 → text-embedding-3-small reduced dimensionality by 12% but no binary compatibility), (2) switching to a fine-tuned model (domain-specific training beats generic models but requires re-embedding the entire corpus), (3) multilingual model change (swapping sentence-transformers/paraphrase-multilingual-mpnet-base-v2 for intfloat/multilingual-e5-large improves retrieval@10 by 8% but no 1:1 mapping exists).

In production, drift is hard to detect because metrics degrade gradually. Users report "slightly worse results" in week one, support tickets rise 15% by week two, retention drops by week three. The early signal for drift is this: mean similarity score of new queries falls below the baseline from indexing time. If mean cosine similarity was 0.78 at indexing, dropping to 0.71 at query time signals model incompatibility.

Cost Tradeoff: Re-index vs Dual Model

Think about re-indexing cost in three components: (1) API call cost (OpenAI text-embedding-3-large $0.13 per 1M tokens, Cohere embed-v3 $0.10), (2) compute time (12M documents × 512 token average = 6.1B tokens ≈ 18 hours parallel batch processing), (3) downtime risk (without atomic switchover, user queries hit a half-indexed state).

Alternative: dual model strategy — build a separate index for the new model and A/B test the transition. This doubles storage cost but eliminates risk. Once the new index is ready, shift traffic 10% → 50% → 100%. If regression appears, rollback happens instantly. However, dual indexing costs 2× vector storage (Pinecone p1.x1 pod costs $0.096/hour, 12M 1536-dim vectors = ~18GB ≈ 2 pods = $140/month, dual index = $280/month).

Incremental Re-indexing: Hot/Cold Partitioning

Instead of re-indexing the entire corpus overnight, partition by query frequency. Documents queried in the last 30 days are "hot," the rest are "cold." Hot partitions typically represent 15-25% of the corpus but handle 80% of query traffic.

The strategy: re-embed the hot partition with the new model first (18 hours becomes 3 hours, $6,400 becomes $1,200). At query time, shard routing directs new queries to the hot index first, falling back to cold if needed. This delivers 80% accuracy improvement on day one, 100% within 2-3 weeks of rolling re-indexing.

Track partitions with a simple PostgreSQL table:

CREATE TABLE doc_partition (
  doc_id UUID PRIMARY KEY,
  partition TEXT CHECK (partition IN ('hot', 'cold')),
  last_queried_at TIMESTAMPTZ,
  embedding_model TEXT,
  embedding_version TEXT,
  re_indexed_at TIMESTAMPTZ
);

CREATE INDEX idx_partition_model 
  ON doc_partition(partition, embedding_model);

Query routing logic:

def retrieve(query: str, model: str, k: int = 10):
    query_emb = embed(query, model)
    
    # search hot partition first
    hot_results = vector_db.search(
        collection="hot",
        vector=query_emb,
        limit=k,
        filter={"embedding_model": model}
    )
    
    if len(hot_results) >= k:
        return hot_results
    
    # backfill from cold if needed
    cold_results = vector_db.search(
        collection="cold",
        vector=query_emb,
        limit=k - len(hot_results),
        filter={"embedding_model": model}
    )
    
    return merge_results(hot_results, cold_results)

This approach mirrors the "event-driven incremental sync" logic used in Roibase's first-party data architecture work — sync changing subsets continuously instead of copying all data at once.

Drift Detection: Embedding Space Monitoring

Measure drift in production with three metrics:

Metric	Threshold	Meaning
Mean similarity shift	baseline − 0.05	Distance between query embeddings and index increased
Top-k stability	<90% overlap	Same query returns different results (model change effect)
OOV (out-of-vocabulary) rate	>2%	New model doesn't recognize terms from old corpus

Calculate mean similarity shift with daily batch jobs — take queries from the last 24 hours, embed with both the old and new models, compare cosine similarity against stored embeddings. If the new model yields 0.73 similarity and the old one 0.78, there's 0.05 drift — re-indexing signal.

For top-k stability, run the same test query set (100-200 queries) daily with both models, compare the top 10 results. If overlap drops below 85%, model migration is needed.

Model Migration Strategy: Blue-Green Deployment

When changing models, perform atomic switchover via blue-green deployment. Old index is "blue," new index is "green." Traffic initially hits blue while you populate green in the background. Once green is ready, shift traffic to green in 5 minutes. If issues arise, rollback to blue immediately.

Concrete steps:

T-0: Start generating embeddings with the new model, build parallel index (green_index).
T+18h: Green index 100% ready. Blue index still live.
T+18h 5m: Add MODEL_VERSION=green flag to query router, shift 10% of traffic to green.
T+18h 30m: No errors, shift 50%.
T+19h: 100% green. Blue index moves to read-only (7-day backup).
T+7 days: Blue index deleted.

This approach is especially critical for e-commerce search — at a Roibase customer (cosmetics category, 2.4M products, 80K queries/day), model migration via blue-green resulted in 0.2% session loss (rollback completed in 12 seconds).

Cost Optimization: Batch + Cache

Two techniques to reduce re-indexing cost:

Batch API usage: OpenAI batch API costs 50% less than standard API ($0.13 → $0.065 per 1M tokens). It's async — responses arrive in 1-24 hours, sufficient for re-indexing since realtime isn't required. Sending 12M documents to batch reduces cost from $6,400 to $3,200.

Semantic cache: If the same document is indexed multiple times with different metadata (e.g., same product description, different SKUs), cache the embedding. Deduplicate via MD5 hash. At Roibase, this yields 12-18% cost reduction (especially in fashion/beauty segments where descriptions are similar).

import hashlib
from functools import lru_cache

@lru_cache(maxsize=100_000)
def cached_embed(text: str, model: str) -> list[float]:
    cache_key = hashlib.md5(f"{model}:{text}".encode()).hexdigest()
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    
    emb = openai.Embedding.create(input=text, model=model)
    redis.setex(cache_key, 86400 * 7, json.dumps(emb))
    return emb

Fine-Tuned Model Migration: Domain Adaptation Tradeoff

Replacing generic embeddings with domain-specific fine-tuned models improves retrieval@10 by 8-15% (e.g., legal domain: paraphrase-mpnet-base-v2 → legal-bert-base-uncased with contrastive learning). But fine-tuning carries costs: (1) labeled data collection (1,000-5,000 query-document pairs), (2) GPU time (A100, 8 hours ≈ $60), (3) full corpus re-indexing.

Tradeoff analysis: if retrieval accuracy improves 10% and drives 2% conversion lift (e.g., recommending the right article increases form completion by 2%), at 100K queries/month × 0.02 × $50 AOV = $100K monthly lift, the $10K fine-tuning + re-indexing cost pays back in a month.

But fine-tuned models have ongoing maintenance costs — re-training every 6 months as domain shifts. This creates continuous re-indexing cycles. Alternative: adapter layers — attach a small fine-tuned layer atop the base model so base embeddings remain fixed, only query-time projection changes. Accuracy gain drops from 15% to 8% but eliminates re-indexing.

The Counter Case: Is Re-indexing Ever Unnecessary?

Sometimes not re-indexing is the right call. If (1) model change is minor (e.g., OpenAI ada-002 vs text-embedding-3-small empirical recall difference <2%), (2) corpus is static (no new documents), (3) query patterns don't shift — drift stays minimal.

Especially in B2B SaaS (internal knowledge bases, documentation search), corpus updates happen 1-2 times yearly. Outside major upgrades (BERT → MPNet), skipping re-indexing makes sense. Instead, ensemble at query time — retrieve with both old and new models, merge results via reciprocal rank fusion. This adds 3-5% latency but costs less than re-indexing.

Decision tree:

Corpus >5M documents + new model >5% accuracy gain → incremental re-index with hot/cold partitioning
Corpus <1M + >10% gain → blue-green full re-index
Corpus <1M + <5% gain → ensemble + defer re-indexing
Fine-tuned model + conversion impact >10× cost → re-index
Fine-tuned model + conversion impact <3× cost → adapter layer or skip

Roibase's GEO work faces similar tradeoffs — when optimizing LLM citations, which content must be regenerated versus which is sufficient as-is? That also demands cost-impact analysis.

Drift Prevention: Version Pinning and Contract Testing

The best defense against embedding drift in production is model version pinning and API contract tests. If you use OpenAI's text-embedding-3-large, keep the model ID fixed in config, no auto-upgrades. Test new versions manually before adopting.

Contract test example:

def test_embedding_compatibility():
    test_docs = [
        "machine learning model training",
        "vector database indexing",
        "semantic search optimization"
    ]
    
    # baseline embedding (production model)
    baseline = [embed(doc, model="text-embedding-3-large") for doc in test_docs]
    
    # compare against candidate model
    candidate = [embed(doc, model="text-embedding-4") for doc in test_docs]
    
    # check cosine similarity
    for i, doc in enumerate(test_docs):
        sim = cosine_similarity(baseline[i], candidate[i])
        assert sim > 0.95, f"Embedding drift detected: {doc}, sim={sim}"

Run this in CI/CD on every model update. If drift exceeds 5%, block deployment and trigger manual review.

Schedule re-indexing during planned maintenance windows — Sunday night 02:00-06:00, for example. This prevents drift accumulation and avoids daytime impact. At Roibase, similar windows run jobs like CDP syncs at 03:00, keeping daytime query latency unaffected.