Vector Databases

Conceptual Explanation

The Embedding Space

An embedding model converts a piece of text (a sentence, a paragraph, a document) into a vector of floating-point numbers, typically 768 to 3072 dimensions. Two texts that are semantically similar produce vectors that are close together in this high-dimensional space.

text
"The patient has hypertension"        → [0.23, -0.41, 0.87, ...]  (768 numbers)
"The patient has high blood pressure" → [0.24, -0.39, 0.85, ...]  (768 numbers)
"The weather is sunny today"          → [0.71, 0.22, -0.54, ...]  (768 numbers)

The first two vectors are close (small angular distance); the third is far. Retrieving "the 10 most similar vectors to this query vector" is the fundamental vector database operation.
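As a rough illustration, cosine similarity can be computed directly on the truncated example vectors. Only the three components shown are used, so the values are illustrative and not real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First three components of the example vectors (illustrative values only).
hypertension = [0.23, -0.41, 0.87]
high_blood_pressure = [0.24, -0.39, 0.85]
weather = [0.71, 0.22, -0.54]

sim_related = cosine_similarity(hypertension, high_blood_pressure)
sim_unrelated = cosine_similarity(hypertension, weather)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The related pair scores close to 1, while the unrelated pair scores below 0, which is the "small angular distance" versus "far" distinction described above.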

An exact nearest neighbor search over 10 million 1536-dimensional vectors requires computing the distance between the query vector and all 10 million stored vectors: approximately 30 billion floating-point operations per query (10 million vectors × 1,536 dimensions × 2 operations) and a scan of roughly 61 GB of float32 data. This takes on the order of seconds on a single modern server.
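The cost figure can be reproduced with a few lines of arithmetic, alongside a minimal brute-force search. This is a sketch that assumes float32 storage (4 bytes per value) and Euclidean distance; the function names are illustrative.

```python
import heapq
import math

def exact_search_cost(num_vectors, dim, bytes_per_value=4):
    """Return (floating-point operations, bytes scanned) for one brute-force query.

    Each dimension costs one multiply and one add, hence the factor of 2.
    """
    flops = 2 * num_vectors * dim
    bytes_scanned = num_vectors * dim * bytes_per_value
    return flops, bytes_scanned

def exact_top_k(query, vectors, k):
    """Brute-force k nearest neighbors by Euclidean distance: O(N * D) per query."""
    scored = ((math.dist(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nsmallest(k, scored)]

flops, bytes_scanned = exact_search_cost(10_000_000, 1536)
print(f"{flops / 1e9:.1f} GFLOPs, {bytes_scanned / 1e9:.1f} GB scanned per query")

toy = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [5.0, 5.0]]
print(exact_top_k([0.0, 0.1], toy, k=2))
```

In practice the memory scan often dominates the arithmetic, which is why skipping most stored vectors matters more than speeding up each distance computation.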

Approximate nearest neighbor (ANN) algorithms build an index structure at insert time that allows the database to skip most comparisons at query time, finding the approximate nearest neighbors in milliseconds.

Core Architecture

HNSW: Hierarchical Navigable Small World

HNSW is the dominant ANN algorithm in production vector databases (used by pgvector, Weaviate, ChromaDB, and optionally Pinecone). It builds a multi-layer graph where:

  • Layer 0 contains all vectors
  • Higher layers contain progressively fewer vectors (randomly sampled)
  • Each vector connects to its nearest neighbors at each layer

Query traversal starts at the top layer (sparse, fast to navigate), finds approximate nearest neighbors, then descends to progressively denser layers to refine the result.
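The layered structure can be sketched by simulating how vectors are assigned to layers. The sketch below assumes the level rule from the original HNSW paper, level = floor(-ln(U) · mL) with mL = 1 / ln(M); production implementations may differ in detail, and the function names are illustrative.

```python
import math
import random

def assign_levels(num_vectors, m=16, seed=0):
    """Draw a top layer for each vector: level = floor(-ln(U) * mL), mL = 1 / ln(M)."""
    rng = random.Random(seed)
    m_l = 1.0 / math.log(m)
    # 1 - random() lies in (0, 1], which keeps log() defined.
    return [int(-math.log(1.0 - rng.random()) * m_l) for _ in range(num_vectors)]

def layer_sizes(levels):
    """Count the vectors present in each layer (a vector at level L is in layers 0..L)."""
    top = max(levels)
    return [sum(1 for lv in levels if lv >= layer) for layer in range(top + 1)]

levels = assign_levels(10_000, m=16)
sizes = layer_sizes(levels)
for layer, size in enumerate(sizes):
    print(f"layer {layer}: {size} vectors")
```

With M = 16, roughly one vector in sixteen reaches layer 1, and each further layer is thinner again, which is what makes the top layers fast to navigate.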

HNSW parameters that affect the accuracy/performance trade-off:

| Parameter | Effect | Typical production value |
| --- | --- | --- |
| M (max connections per node) | Higher = more accurate, more memory | 16–32 |
| ef_construction (build-time search width) | Higher = more accurate index, slower build | 200–400 |
| ef (query-time search width) | Higher = more accurate, slower query | 100–200 |

HNSW trade-offs:

  • Memory: the raw vectors plus the graph links, which add roughly M × 8–10 bytes per vector (about 130–320 bytes at M = 16–32)
  • Build time: O(N log N), which is slow for large collections
  • Query time: excellent, typically 1–10 ms for millions of vectors
  • Accuracy: tunable; typically 95–99% recall at production settings

IVF: Inverted File Index

IVF partitions the vector space into clusters (using k-means) and stores an inverted index mapping each cluster to its member vectors. At query time, only the nearest clusters are searched.
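A minimal pure-Python sketch of the idea follows. It assumes plain Lloyd's k-means and Euclidean distance; `nprobe` (the number of clusters scanned per query) is a name borrowed from common IVF implementations, and the data is random toy data.

```python
import math
import random

def kmeans(vectors, num_clusters, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns the list of centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, num_clusters)]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in vectors:
            nearest = min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
            groups[nearest].append(v)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_ivf(vectors, centroids):
    """Inverted index: cluster id -> ids of the vectors assigned to that cluster."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(lists, key=lambda c: math.dist(v, centroids[c]))
        lists[nearest].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(lists, key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]

rng = random.Random(42)
data = [[rng.random(), rng.random()] for _ in range(500)]
centroids = kmeans(data, num_clusters=8)
lists = build_ivf(data, centroids)
query = [0.5, 0.5]
approx = ivf_search(query, data, centroids, lists, k=5, nprobe=2)
exact = ivf_search(query, data, centroids, lists, k=5, nprobe=8)  # all clusters scanned
print(approx, exact)
```

Raising `nprobe` trades speed for recall; scanning every cluster degenerates into exact search.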

IVF is preferred when:

  • The collection is very large (hundreds of millions of vectors) and memory is constrained
  • Build time must be minimized (IVF builds faster than HNSW)

Pinecone uses a proprietary variant of IVF-based indexing.

IVF trade-offs:

  • Lower recall than HNSW at equivalent speed (approximately 90–95%)
  • Requires training on a representative sample of the data before building the index
  • More complex to tune than HNSW

Metadata Filtering

Production RAG pipelines require filtered vector search: retrieving semantically similar documents from a specific subset defined by metadata conditions.

python
# Example: retrieve similar documents from a specific source within a date range.
# vector_store and query_vector are assumed to exist; the filter uses MongoDB-style operators.
results = vector_store.query(
    query_embedding=query_vector,
    n_results=5,
    where={
        "$and": [
            {"source": {"$eq": "clinical-guidelines"}},
            {"effective_date": {"$gte": "2024-01-01"}},
            {"evidence_grade": {"$in": ["A", "B"]}}
        ]
    }
)

Pre-filtering vs. post-filtering:

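  • Pre-filtering: Apply the metadata filter before the ANN search. Every result satisfies the filter, so k results are returned whenever k matches exist. It can be slower: a highly selective filter leaves a small subset that the index cannot traverse efficiently, so the engine may fall back to exact search over that subset.
  • Post-filtering: Run the ANN search, then filter the results. Faster, but may return fewer than k results if many are filtered out.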

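Different vector databases handle this trade-off differently. Weaviate builds an allow-list from the filter and passes it to the HNSW search (pre-filtering); Pinecone describes its approach as single-stage filtering, in which the filter is applied during the index search. Behavior varies by product and version, so check the documentation of the database in use.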
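The difference between the two strategies can be sketched with a toy collection. Exact search stands in for the ANN index here, and the documents, field names, and helper functions are all hypothetical.

```python
import heapq
import math

def top_k(query, items, k):
    """Exact top-k by Euclidean distance; a stand-in here for an ANN index lookup."""
    return heapq.nsmallest(k, items, key=lambda item: math.dist(query, item["vector"]))

def pre_filter_search(query, items, k, predicate):
    """Filter first, then search only the matching subset."""
    return top_k(query, [item for item in items if predicate(item)], k)

def post_filter_search(query, items, k, predicate):
    """Search first, then drop non-matching hits; may return fewer than k."""
    return [item for item in top_k(query, items, k) if predicate(item)]

def is_grade_a(item):
    return item["evidence_grade"] == "A"

# Hypothetical documents: ids 0-9, where only the three farthest are graded "A".
docs = [
    {"id": i, "vector": [float(i), 0.0], "evidence_grade": "A" if i >= 7 else "C"}
    for i in range(10)
]
query = [0.0, 0.0]

pre = pre_filter_search(query, docs, k=3, predicate=is_grade_a)
post = post_filter_search(query, docs, k=3, predicate=is_grade_a)
print([d["id"] for d in pre])   # the three grade-A documents
print([d["id"] for d in post])  # empty: the three nearest hits were all grade C
```

A common mitigation for post-filtering is to over-fetch (request several times k candidates) before filtering, at the cost of extra work per query.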

Common Mistakes

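1. Using the wrong distance metric. Cosine similarity is the usual metric for text embeddings from most models (OpenAI, Voyage, Cohere). When embeddings are normalized to unit length, cosine similarity, dot product, and Euclidean (L2) distance all produce the same ranking; the choice matters when vectors are not normalized, in which case the metric should match the one the model was trained with. Using the wrong metric produces subtly incorrect retrieval results that are difficult to diagnose.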

2. Mismatching embedding models between indexing and query. Embeddings from model A are not comparable to embeddings from model B. If the query uses a different model than the indexing pipeline, retrieval will be effectively random. This is a silent failure: no error is raised.

3. Chunking too large or too small. Chunks that are too large (full documents) retrieve too much irrelevant context. Chunks that are too small (single sentences) lose surrounding context. The right chunk size is use-case dependent: 200–500 tokens is a reasonable starting range for most clinical document RAG.

4. Not indexing metadata at insert time. Metadata filtering that is not indexed requires a full scan of the collection at query time. Index all filter dimensions at insert time.

5. Skipping embedding model warm-up. Cold-start latency for embedding model inference can be 3–10 seconds. Production RAG pipelines must keep the embedding model warm with periodic heartbeat requests.
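For mistake 1, the relationship between the metrics can be checked numerically. The sketch below uses random toy vectors (not real embeddings) to show that, once vectors are scaled to unit length, squared Euclidean distance equals 2 − 2·cos, so both metrics rank neighbors identically.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_by_cosine(query, vectors):
    return sorted(range(len(vectors)), key=lambda i: -cosine_similarity(query, vectors[i]))

def rank_by_l2(query, vectors):
    return sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))

rng = random.Random(7)
# Toy vectors with very different magnitudes, so normalization actually matters.
raw = [[rng.uniform(-1, 1) * scale for _ in range(8)] for scale in (0.1, 1, 5, 20) for _ in range(5)]
query = [rng.uniform(-1, 1) for _ in range(8)]

unit = [normalize(v) for v in raw]
unit_query = normalize(query)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b).
identity_holds = all(
    abs(math.dist(unit_query, u) ** 2 - (2 - 2 * cosine_similarity(unit_query, u))) < 1e-9
    for u in unit
)
same_when_normalized = rank_by_cosine(unit_query, unit) == rank_by_l2(unit_query, unit)
same_when_raw = rank_by_cosine(query, raw) == rank_by_l2(query, raw)
print(identity_holds, same_when_normalized, same_when_raw)
```

On unnormalized vectors the two rankings generally diverge, because Euclidean distance is sensitive to vector length while cosine similarity is not.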

Best Practices

  • Always use the same embedding model at index time and query time โ€” treat it as a versioned dependency
  • Index all metadata fields that will be used for filtering at insert time
  • Use cosine similarity for text embeddings unless the embedding model specifies otherwise
  • Implement collection-level partitioning for different document sources rather than using metadata filters alone
  • For clinical RAG: never index PHI in the vector store; retrieve patient-specific data via FHIR at query time
  • Plan embedding model migration from day one: keep source documents available so re-indexing is possible
  • Choose a vector database that matches the operational capability of the team managing it
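One way to treat the embedding model as a versioned dependency is to record it with the collection and reject vectors from any other model. The sketch below is a hypothetical wrapper, not the API of any particular vector database; the class names and the placeholder model name are assumptions.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EmbeddingModelSpec:
    """Identifies the model that produced a set of vectors."""
    name: str
    version: str
    dimensions: int

class EmbeddingModelMismatch(Exception):
    """Raised when a vector comes from a different model than the index."""

@dataclass
class GuardedCollection:
    """Hypothetical wrapper that records the embedding model alongside the vectors."""
    model: EmbeddingModelSpec
    vectors: list = field(default_factory=list)

    def add(self, vector, model):
        self._check(vector, model)
        self.vectors.append(vector)

    def check_query(self, vector, model):
        self._check(vector, model)

    def _check(self, vector, model):
        if model != self.model:
            raise EmbeddingModelMismatch(f"index uses {self.model}, got {model}")
        if len(vector) != self.model.dimensions:
            raise ValueError(f"expected {self.model.dimensions} dimensions, got {len(vector)}")

index_model = EmbeddingModelSpec("example-embedder", "v1", 3)  # placeholder model name
other_model = EmbeddingModelSpec("example-embedder", "v2", 3)

collection = GuardedCollection(model=index_model)
collection.add([0.1, 0.2, 0.3], index_model)

try:
    collection.check_query([0.1, 0.2, 0.3], other_model)
    mismatch_detected = False
except EmbeddingModelMismatch:
    mismatch_detected = True
print(mismatch_detected)
```

A check like this turns the otherwise silent mismatch into an explicit error, and the stored model spec also tells a migration job which collections need re-indexing.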

Alternatives

| Option | Best For | Not Suitable For |
| --- | --- | --- |
| ChromaDB | POC, small production, developer-friendly | Large scale (>5M vectors), HIPAA BAA requirement |
| Pinecone | Fully managed, large scale, minimal ops | PostgreSQL-centric shops, strict data residency |
| Weaviate | Multi-modal, GraphQL API, enterprise features | Teams preferring minimal new infrastructure |
| pgvector | PostgreSQL shops, HIPAA via cloud provider, hybrid | Large scale without dedicated PostgreSQL investment |
| Elasticsearch dense_vector | Teams with existing Elasticsearch investment | New deployments (higher operational complexity) |
| Milvus | Largest scale (billions of vectors) | Small teams without dedicated infrastructure |

Trade-offs

Managed vs. self-hosted: Managed vector databases (Pinecone, Weaviate Cloud) eliminate operational burden but introduce vendor lock-in and data residency constraints. Self-hosted (ChromaDB, pgvector, Weaviate self-hosted) provides control but requires infrastructure expertise and operational investment.

HNSW accuracy vs. memory: HNSW's graph structure requires roughly M × 8–10 bytes per vector beyond the raw embedding storage, or about 130–320 bytes at M = 16–32. For 10 million 1536-dimensional float32 vectors, that is approximately 1.3–3.2 GB of graph overhead on top of about 61 GB of raw vectors. Production systems must size memory for both.
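A back-of-the-envelope sizing helper is sketched below. It assumes float32 vectors and the rule of thumb from the hnswlib documentation that graph links cost roughly M × 8–10 bytes per vector; actual overhead depends on the implementation, so the output is an estimate only.

```python
def hnsw_memory_estimate(num_vectors, dim, m, bytes_per_value=4, link_bytes_per_m=(8, 10)):
    """Estimate raw vector storage and graph overhead in bytes.

    Assumes float32 vectors and a graph cost of roughly m * 8-10 bytes per vector.
    Returns (raw_bytes, (graph_bytes_low, graph_bytes_high)).
    """
    raw = num_vectors * dim * bytes_per_value
    low, high = (num_vectors * m * b for b in link_bytes_per_m)
    return raw, (low, high)

for m in (16, 32):
    raw, (low, high) = hnsw_memory_estimate(10_000_000, 1536, m)
    print(f"M={m}: raw {raw / 1e9:.1f} GB, graph {low / 1e9:.2f}-{high / 1e9:.2f} GB")
```

For high-dimensional embeddings the raw vectors dominate, so dimensionality reduction or quantization usually saves more memory than lowering M.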

Approximate vs. exact search: ANN search may miss the true nearest neighbor, with a probability of roughly one minus the recall. For most RAG use cases, 95–98% recall is sufficient. For high-stakes retrieval (drug interaction lookup, critical clinical guideline), consider a hybrid: ANN for candidate retrieval followed by exact reranking of the top candidates.
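The hybrid pattern can be sketched as follows. A deliberately noisy ranking stands in for the ANN index, and the exact rerank recomputes true distances over the over-fetched candidates; the function names, noise level, and random data are all illustrative assumptions.

```python
import heapq
import math
import random

def exact_rerank(query, candidate_ids, vectors, k):
    """Recompute exact distances for the candidates and keep the k closest."""
    return heapq.nsmallest(k, candidate_ids, key=lambda i: math.dist(query, vectors[i]))

def noisy_candidates(query, vectors, num_candidates, rng, noise=0.05):
    """Stand-in for an ANN index: ranks by a deliberately perturbed distance."""
    return heapq.nsmallest(
        num_candidates,
        range(len(vectors)),
        key=lambda i: math.dist(query, vectors[i]) + rng.uniform(0, noise),
    )

rng = random.Random(3)
vectors = [[rng.random() for _ in range(4)] for _ in range(1000)]
query = [0.5] * 4

# Over-fetch 50 candidates, then keep the 5 that are truly closest among them.
candidates = noisy_candidates(query, vectors, num_candidates=50, rng=rng)
reranked = exact_rerank(query, candidates, vectors, k=5)
true_top5 = exact_rerank(query, range(len(vectors)), vectors, k=5)
overlap = len(set(reranked) & set(true_top5))
print(f"{overlap} of 5 true neighbors recovered after reranking 50 candidates")
```

Reranking only fixes the order within the candidate set; a true neighbor that the ANN stage never returns is still lost, which is why the candidate count is set well above k.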

Interview Questions

Q: Explain why approximate nearest neighbor search is used in vector databases instead of exact nearest neighbor search, and what trade-offs this introduces.

Category: Architecture | Difficulty: Senior | Role: AI Architect / ML Engineer

Answer Framework:

Exact nearest neighbor search in high-dimensional space requires computing the distance between the query vector and every stored vector. For 10 million vectors at 1536 dimensions, this is approximately 30 billion floating-point operations per query, taking several seconds on modern hardware, which is unacceptable for an interactive RAG system.

Approximate nearest neighbor (ANN) algorithms solve this by building an index structure at insert time that allows the database to skip most distance computations at query time. HNSW, the dominant algorithm, builds a hierarchical graph where each vector connects to its nearest neighbors; query traversal navigates this graph in O(log N) steps to find approximate nearest neighbors.

The trade-off is recall accuracy: ANN may miss the true nearest neighbor, with a probability that decreases as the ef (search width) parameter increases. At ef=100, HNSW typically achieves 95–99% recall, meaning that on average 1–5% of the true top-k neighbors are missing from the results. For RAG use cases, this accuracy level is sufficient because retrieving the 2nd-most-relevant chunk instead of the 1st-most-relevant rarely changes the final LLM output quality.
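Recall is usually measured by comparing ANN results against exact results for a sample of queries. A minimal sketch, using hypothetical result ids for two queries:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

def mean_recall(approx_results, exact_results):
    """Average recall@k over a batch of queries."""
    pairs = list(zip(approx_results, exact_results))
    return sum(recall_at_k(a, e) for a, e in pairs) / len(pairs)

# Hypothetical results for two queries with k = 5.
exact_results = [[1, 2, 3, 4, 5], [10, 11, 12, 13, 14]]
approx_results = [[1, 2, 3, 4, 9], [10, 11, 12, 13, 14]]
print(mean_recall(approx_results, exact_results))
```

Here the first query recovers 4 of 5 true neighbors and the second recovers all 5, so the mean recall is 0.9. Tuning ef amounts to rerunning this measurement at different settings until the recall target is met.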

Key Points to Hit:

  • Exact search computational cost at scale (O(N·D) per query)
  • HNSW graph structure and traversal
  • Recall accuracy as a tunable parameter (ef)
  • Trade-off context: 95–99% recall is sufficient for RAG

Red Flags:

  • Claiming ANN is exact
  • Not knowing that recall is tunable

Q: A clinical RAG system is returning outdated treatment guidelines. What are the most likely root causes and how would you diagnose them?

Category: System Design | Difficulty: Principal | Role: AI Architect

Answer Framework:

Three root causes with distinct diagnostic approaches:

1. Index not updated: The most common cause. The vector store was indexed once and never updated. Diagnose by checking the effective_date metadata of the retrieved chunks: if all chunks predate recent guideline updates, the index needs refreshing. Fix: implement a scheduled re-indexing pipeline triggered by guideline publication events.

2. Metadata filter not applied: The retrieval query is not filtering by date, so old and new versions of the same guideline are both in the index and the old version is winning on embedding similarity. Diagnose by running the query without any filter and examining the effective_date distribution in the results. Fix: add effective_date >= [last_major_update] to the metadata filter, or implement version deduplication that removes superseded documents.

3. Wrong embedding model: A high-quality embedding model captures the semantic content of clinical recommendations more accurately than a general-purpose model, so newer, more specific guidelines are retrieved more reliably. If the embedding model was downgraded (for example, to reduce cost), older documents whose embeddings happen to match the query better may rank higher. Diagnose by checking which embedding model was used for indexing and for queries, and when it changed.
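The first two diagnoses can be sketched as small helpers over the retrieved chunks. The sample records are made up for illustration, and `guideline_id` is a hypothetical metadata field that ties together versions of the same guideline; only `effective_date` appears in the text above.

```python
from collections import Counter
from datetime import date

def effective_year_histogram(chunks):
    """Count retrieved chunks per effective year to spot a stale index."""
    return Counter(date.fromisoformat(c["effective_date"]).year for c in chunks)

def drop_superseded(chunks):
    """Keep only chunks from the newest effective_date of each guideline.

    "guideline_id" is a hypothetical metadata field identifying versions of one document.
    """
    newest = {}
    for c in chunks:
        current = newest.get(c["guideline_id"])
        # ISO 8601 date strings sort chronologically, so string comparison is enough.
        if current is None or c["effective_date"] > current:
            newest[c["guideline_id"]] = c["effective_date"]
    return [c for c in chunks if c["effective_date"] == newest[c["guideline_id"]]]

# Made-up retrieval results for illustration.
retrieved = [
    {"guideline_id": "htn", "effective_date": "2017-11-13", "text": "older recommendation"},
    {"guideline_id": "htn", "effective_date": "2024-03-01", "text": "current recommendation"},
    {"guideline_id": "lipids", "effective_date": "2023-06-15", "text": "lipid recommendation"},
]
print(effective_year_histogram(retrieved))
print([c["text"] for c in drop_superseded(retrieved)])
```

Deduplicating at query time is a stopgap; removing or flagging superseded versions at index time keeps them from competing on similarity in the first place.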

Key Points to Hit:

  • Three distinct root causes
  • Diagnostic approach for each
  • Metadata filtering as the standard mitigation
  • Index update cadence as a patient safety requirement (not just an operational concern)

Key Takeaways

  • Vector databases store high-dimensional embeddings and enable semantic similarity search through ANN algorithms
  • HNSW is the dominant indexing algorithm: high accuracy, high memory, fast queries
  • The same embedding model must be used at index time and query time; a mismatch produces effectively random results
  • ChromaDB suits POC and small production; pgvector suits PostgreSQL shops; Pinecone and Weaviate suit large-scale managed production
  • PHI in vector content creates a HIPAA data store, and not all vector databases provide HIPAA BAAs
  • For clinical RAG: never index PHI; retrieve patient-specific context via FHIR API at query time
  • Index update cadence is a patient safety requirement, not an operational nicety
  • Metadata filtering requires metadata indexing at insert time