Vector Databases

Executive Summary

Vector databases are the infrastructure layer that makes semantic search and Retrieval-Augmented Generation possible at production scale. Unlike traditional relational databases that match records by exact field equality, vector databases store high-dimensional numerical representations of content and retrieve the records most semantically similar to a query, enabling AI systems to find relevant documents, passages, or records even when no exact keyword matches exist. For enterprise AI architects, the choice of vector database determines retrieval latency, operational complexity, cost at scale, and compliance posture. This chapter covers the internal architecture of vector databases, the algorithmic trade-offs that govern their performance, and the selection criteria that distinguish the right choice for a given enterprise deployment.

Learning Objectives

  • Explain how vector databases store and retrieve high-dimensional embeddings using approximate nearest neighbor algorithms
  • Compare the four primary vector database options (ChromaDB, Pinecone, Weaviate, pgvector) across the dimensions that matter for enterprise selection
  • Design a vector database schema that supports metadata filtering, source attribution, and access control for a production RAG pipeline
  • Identify the performance and cost trade-offs of HNSW vs. IVF indexing strategies
  • Evaluate HIPAA suitability of each vector database option for healthcare AI deployments

Business Problem

Enterprise AI systems must retrieve relevant information from large, growing document collections in under 200 milliseconds, while respecting access controls, supporting metadata filtering, and maintaining accuracy as the collection grows to millions of documents. Traditional search infrastructure (Elasticsearch, SQL LIKE queries) operates on keyword overlap, not semantic meaning: a search for "myocardial infarction" will not find documents about "heart attack" unless both terms appear. Embedding-based retrieval solves this by mapping both queries and documents into a shared semantic space where proximity means relevance.

The vector database is the infrastructure that makes this retrieval fast enough and cost-effective enough to operate at enterprise scale.

Conceptual Explanation

The Embedding Space

An embedding model converts a piece of text (a sentence, a paragraph, a document) into a vector of floating-point numbers, typically 768 to 3072 dimensions. Two texts that are semantically similar will produce vectors that are close together in this high-dimensional space.

"The patient has hypertension"         → [0.23, -0.41, 0.87, ...]  (768 numbers)
"The patient has high blood pressure"  → [0.24, -0.39, 0.85, ...]  (768 numbers)
"The weather is sunny today"           → [0.71, 0.22, -0.54, ...]  (768 numbers)

The first two vectors are close (small angular distance); the third is far. Retrieving "the 10 most similar vectors to this query vector" is the fundamental vector database operation.
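As a minimal sketch of what "close" means here, the snippet below computes cosine similarity over only the three components printed above. The truncation to three dimensions and the variable names are assumptions made for illustration; a real comparison would use all 768 numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Only the first three components shown in the text (illustrative truncation).
hypertension = [0.23, -0.41, 0.87]
high_blood_pressure = [0.24, -0.39, 0.85]
sunny_weather = [0.71, 0.22, -0.54]

similar = cosine_similarity(hypertension, high_blood_pressure)
dissimilar = cosine_similarity(hypertension, sunny_weather)
print(f"hypertension vs. high blood pressure: {similar:.4f}")
print(f"hypertension vs. sunny weather:       {dissimilar:.4f}")
```

On these truncated vectors the first pair scores close to 1 and the second pair scores below 0, which matches the intuition described above.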

An exact nearest neighbor search over 10 million 1536-dimensional vectors requires computing the distance between the query vector and all 10 million stored vectors, approximately 30 billion floating-point operations per query. This takes several seconds on modern hardware.
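The arithmetic behind that figure can be checked with a short sketch. It assumes float32 storage and counts one multiply and one add per dimension; the function names are hypothetical, and the brute-force search is a toy, not how a production system would scan 10 million vectors.

```python
import heapq


def exact_search_cost(num_vectors, dims, bytes_per_float=4):
    """Return (floating-point operations, bytes scanned) for one exact query."""
    flops = 2 * num_vectors * dims  # one multiply and one add per dimension
    bytes_scanned = num_vectors * dims * bytes_per_float
    return flops, bytes_scanned


def exact_top_k(query, vectors, k):
    """Brute-force search: score every stored vector, keep the k closest indices."""
    def sq_dist(v):
        return sum((q - x) ** 2 for q, x in zip(query, v))
    return heapq.nsmallest(k, range(len(vectors)), key=lambda i: sq_dist(vectors[i]))


flops, scanned = exact_search_cost(10_000_000, 1536)
print(f"{flops / 1e9:.2f} billion floating-point operations per query")
print(f"{scanned / 1e9:.2f} GB of vector data read per query")
```

Under these assumptions every query also has to read the full set of stored vectors, so memory bandwidth matters as much as arithmetic throughput.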

Approximate nearest neighbor (ANN) algorithms build an index structure at insert time that allows the database to skip most comparisons at query time, finding the approximate nearest neighbors in milliseconds.

Core Architecture

HNSW: Hierarchical Navigable Small World

HNSW is the dominant ANN algorithm in production vector databases (used by pgvector, Weaviate, ChromaDB, and optionally Pinecone). It builds a multi-layer graph where:

  • Layer 0 contains all vectors
  • Higher layers contain progressively fewer vectors (randomly sampled)
  • Each vector connects to its nearest neighbors at each layer

Query traversal starts at the top layer (sparse, fast to navigate), finds approximate nearest neighbors, then descends to progressively denser layers to refine the result.
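One common way to realize the random sampling of higher layers is the exponentially decaying level draw described in the HNSW literature, sketched below. Whether a given database uses exactly this scheme is an assumption; the function names are hypothetical, and only layer assignment is shown, not graph construction or search.

```python
import math
import random


def draw_max_layer(m, rng):
    """Draw the highest layer a new vector joins: floor(-ln(u) / ln(M))."""
    m_l = 1.0 / math.log(m)
    u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)


def layer_populations(num_vectors, m, seed=0):
    """Count how many vectors are present in each layer."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(num_vectors):
        top = draw_max_layer(m, rng)
        # A vector whose top layer is `top` appears in layers 0..top.
        for layer in range(top + 1):
            counts[layer] = counts.get(layer, 0) + 1
    return counts


for layer, count in sorted(layer_populations(100_000, m=16).items()):
    print(f"layer {layer}: {count} vectors")
```

With this draw, each layer holds roughly 1/M of the vectors in the layer below it, which is why the top layers are sparse enough to navigate quickly.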

HNSW parameters that affect the accuracy/performance trade-off:

| Parameter | Effect | Typical production value |
|---|---|---|
| M (max connections per node) | Higher = more accurate, more memory | 16–32 |
| ef_construction (build-time search width) | Higher = more accurate index, slower build | 200–400 |
| ef (query-time search width) | Higher = more accurate, slower query | 100–200 |

HNSW trade-offs:

  • Memory: the index is normally held in RAM; budget approximately 1.5–2× the raw vector storage to cover the graph structure and working overhead
  • Build time: O(N log N), so large collections build slowly
  • Query time: excellent, typically 1–10ms for millions of vectors
  • Accuracy: tunable; typically 95–99% recall at production settings

IVF: Inverted File Index

IVF partitions the vector space into clusters (using k-means) and stores an inverted index mapping each cluster to its member vectors. At query time, only the nearest clusters are searched.
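The partition-then-probe idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions: a basic Lloyd's k-means stands in for centroid training, the parameter name nprobe (number of clusters searched) is borrowed from common usage rather than from any specific product, and all function names are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(v, centroids):
    return min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))


def train_centroids(sample, k, iterations=10, seed=0):
    """Plain Lloyd's k-means over a training sample."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(sample, k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for v in sample:
            buckets[nearest_centroid(v, centroids)].append(v)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def build_ivf(vectors, centroids):
    """Inverted index: cluster id -> ids of the vectors assigned to it."""
    inverted = {c: [] for c in range(len(centroids))}
    for idx, v in enumerate(vectors):
        inverted[nearest_centroid(v, centroids)].append(idx)
    return inverted


def ivf_search(query, vectors, centroids, inverted, k, nprobe):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in inverted[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]


rng = random.Random(1)
vectors = [[rng.random(), rng.random()] for _ in range(200)]
centroids = train_centroids(vectors, k=4)
inverted = build_ivf(vectors, centroids)
print(ivf_search([0.5, 0.5], vectors, centroids, inverted, k=5, nprobe=1))
```

Raising nprobe searches more clusters, which raises recall and query cost together; probing every cluster degenerates to exact search.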

IVF is preferred when:

  • The collection is very large (hundreds of millions of vectors) and memory is constrained
  • Build time must be minimized (IVF builds faster than HNSW)

Pinecone uses a proprietary variant of IVF-based indexing.

IVF trade-offs:

  • Lower recall than HNSW at equivalent speed (approximately 90–95%)
  • Requires training on a representative sample of the data before building the index
  • More complex to tune than HNSW

Metadata Filtering

Production RAG pipelines require filtered vector search: retrieving semantically similar documents from a specific subset defined by metadata conditions.

Implementation code omitted in the Playbook edition. For complete code examples, production patterns, and advanced implementation details, see the Enterprise AI Technical Reference.

Pre-filtering vs. post-filtering:

  • Pre-filtering: Apply the metadata filter before ANN search. More accurate (ANN only searches the filtered subset) but can be slower: the filter must be evaluated first, and a highly selective filter can leave a subset so small that the database falls back to exact search over it.
  • Post-filtering: Run ANN search, then filter the results. Faster, but may return fewer than k results if many are filtered out.

Different vector databases handle this trade-off differently: Weaviate uses pre-filtering with HNSW; Pinecone uses post-filtering by default with optional hybrid.
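The difference between the two strategies can be sketched with a toy in-memory store. Exact search stands in for the ANN step, and the record layout, field names, and function names are all hypothetical; no vendor API is being modeled.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, records, k):
    """Stand-in for the ANN step: exact top-k by squared distance."""
    return sorted(records, key=lambda r: sq_dist(query, r["vector"]))[:k]


def pre_filter_search(query, records, predicate, k):
    """Filter on metadata first, then search only the allowed subset."""
    allowed = [r for r in records if predicate(r["metadata"])]
    return top_k(query, allowed, k)


def post_filter_search(query, records, predicate, k):
    """Search everything first, then drop results that fail the filter."""
    nearest = top_k(query, records, k)
    return [r for r in nearest if predicate(r["metadata"])]


records = [
    {"id": "a", "vector": [0.0, 0.1], "metadata": {"source": "cardiology"}},
    {"id": "b", "vector": [0.15, 0.0], "metadata": {"source": "oncology"}},
    {"id": "c", "vector": [0.2, 0.2], "metadata": {"source": "oncology"}},
    {"id": "d", "vector": [0.9, 0.9], "metadata": {"source": "cardiology"}},
]


def is_cardiology(metadata):
    return metadata["source"] == "cardiology"


query = [0.0, 0.0]
print([r["id"] for r in pre_filter_search(query, records, is_cardiology, k=2)])
print([r["id"] for r in post_filter_search(query, records, is_cardiology, k=2)])
```

With k=2, pre-filtering returns two cardiology records, while post-filtering returns only one because an oncology record occupied a slot in the unfiltered top 2. A common mitigation for post-filtering is to over-fetch (search for more than k, then filter).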

Architecture Diagram

Enterprise Considerations

Scalability ceiling: ChromaDB's embedded mode has a practical upper limit of approximately 2–5 million vectors before query latency degrades. Above this threshold, Pinecone or Weaviate (with dedicated nodes) provide better performance scaling.

Multi-tenancy: Enterprise AI platforms serving multiple departments or clients require namespace isolation. Pinecone provides namespaces natively; pgvector requires schema-per-tenant or row-level security; ChromaDB uses separate collections per tenant. The choice affects both operational complexity and access control design.

Backup and disaster recovery: Vector databases contain both the original metadata and the computed embeddings. Embeddings can be recomputed from source documents if the embedding model is available, but this takes significant time for large collections. Backup strategy must cover both the document source and the index.

Embedding model versioning: When the embedding model changes (a higher-quality model becomes available), all existing vectors must be recomputed and re-indexed. This is a multi-hour to multi-day operation for large collections. Production systems must plan for embedding model migrations without service interruption.

Healthcare Example

Educational example. Not intended for clinical use.

The Reference Healthcare Organization's HMS AI Platform uses a partitioned vector store architecture with three collections/namespaces:

| Partition | Content | Database | Contains PHI? | Update Cadence |
|---|---|---|---|---|
| clinical-guidelines | ACC, AHA, USPSTF, CDC guidelines | Weaviate Cloud (HIPAA BAA) | No (de-identified) | Quarterly |
| hospital-formulary | HMS-specific drug list, dosing, interactions | pgvector (within HMS Azure environment) | No | Per pharmacy committee |
| clinical-protocols | HMS clinical protocols | pgvector (within HMS Azure environment) | No | Per protocol committee |

Patient-specific clinical notes are not indexed in the vector store. The RAG pipeline retrieves from static knowledge bases; patient-specific context is retrieved via FHIR API at query time and injected directly into the prompt. This architecture ensures no PHI enters the vector index.

Common Mistakes

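1. Using the wrong distance metric. Cosine similarity is the usual metric for text embeddings from most models (OpenAI, Voyage, Cohere). When embeddings are normalized to unit length, cosine similarity, dot product, and Euclidean (L2) distance all produce the same ranking; when they are not normalized, the rankings can differ, so use the metric the embedding model's documentation specifies. Using the wrong metric produces subtly incorrect retrieval results that are difficult to diagnose.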

2. Mismatching embedding models between indexing and query. Embeddings from model A are not comparable to embeddings from model B. If the query uses a different model than the indexing pipeline, retrieval will be effectively random. This is a silent failure: when the two models share the same dimensionality, no error is raised.

3. Chunking too large or too small. Chunks that are too large (full documents) retrieve too much irrelevant context. Chunks that are too small (single sentences) lose surrounding context. The right chunk size is use-case dependent: 200–500 tokens is a reasonable starting range for most clinical document RAG.

4. Not indexing metadata at insert time. Metadata filtering that is not indexed requires a full scan of the collection at query time. Index all filter dimensions at insert time.

5. Skipping embedding model warm-up. Cold-start latency for embedding model inference can be 3–10 seconds. Production RAG pipelines must keep the embedding model warm with periodic heartbeat requests.
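To illustrate mistake 1, the toy sketch below shows cosine similarity and Euclidean distance ranking the same two documents in opposite orders when the vectors are not normalized, and agreeing once they are scaled to unit length. The vectors and function names are invented for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_cosine(query, docs):
    """Document indices, most similar first."""
    return sorted(range(len(docs)), key=lambda i: -cosine_similarity(query, docs[i]))


def rank_by_l2(query, docs):
    """Document indices, closest first."""
    return sorted(range(len(docs)), key=lambda i: euclidean(query, docs[i]))


query = [1.0, 0.0]
docs = [[10.0, 1.0], [0.9, 0.6]]  # doc 0 points the same way as the query but is long

print("raw cosine ranking:   ", rank_by_cosine(query, docs))
print("raw L2 ranking:       ", rank_by_l2(query, docs))

unit_docs = [normalize(d) for d in docs]
print("normalized L2 ranking:", rank_by_l2(normalize(query), unit_docs))
```

Before normalization the two metrics disagree about which document is nearest; after normalization the Euclidean ranking matches the cosine ranking.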

Best Practices

  • Always use the same embedding model at index time and query time; treat it as a versioned dependency
  • Index all metadata fields that will be used for filtering at insert time
  • Use cosine similarity for text embeddings unless the embedding model specifies otherwise
  • Implement collection-level partitioning for different document sources rather than using metadata filters alone
  • For clinical RAG: never index PHI in the vector store; retrieve patient-specific data via FHIR at query time
  • Plan embedding model migration from day one: keep source documents available so re-indexing is possible
  • Choose a vector database that matches the operational capability of the team managing it
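Treating the embedding model as a versioned dependency can be as simple as recording which model built the index and refusing to query with anything else. The sketch below is one hypothetical way to do that; the class, exception, and model names are invented for illustration and do not correspond to any vendor API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingModelVersion:
    """Identity of the model that produced a set of vectors (hypothetical record)."""
    name: str
    version: str
    dimensions: int


class EmbeddingModelMismatch(Exception):
    """Raised when the query-time model differs from the index-time model."""


def assert_same_model(index_model, query_model):
    """Turn a silent retrieval failure into a loud one."""
    if index_model != query_model:
        raise EmbeddingModelMismatch(
            f"index built with {index_model}, query embedded with {query_model}"
        )


# Stored alongside the collection at index time (placeholder model name).
index_model = EmbeddingModelVersion("example-embed", "2", 768)

# Checked before every query, or once at service start-up.
assert_same_model(index_model, EmbeddingModelVersion("example-embed", "2", 768))
```

Comparing the full record, not just the dimensionality, matters because two models with the same output size still produce incomparable vectors.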

Alternatives

| Option | Best For | Not Suitable For |
|---|---|---|
| ChromaDB | POC, small production, developer-friendly | Large scale (>5M vectors), HIPAA BAA requirement |
| Pinecone | Fully managed, large scale, minimal ops | PostgreSQL-centric shops, strict data residency |
| Weaviate | Multi-modal, GraphQL API, enterprise features | Teams preferring minimal new infrastructure |
| pgvector | PostgreSQL shops, HIPAA via cloud provider, hybrid | Large scale without dedicated PostgreSQL investment |
| Elasticsearch dense_vector | Teams with existing Elasticsearch investment | New deployments (higher operational complexity) |
| Milvus | Largest scale (billions of vectors) | Small teams without dedicated infrastructure |

Trade-offs

Managed vs. self-hosted: Managed vector databases (Pinecone, Weaviate Cloud) eliminate operational burden but introduce vendor lock-in and data residency constraints. Self-hosted (ChromaDB, pgvector, Weaviate self-hosted) provides control but requires infrastructure expertise and operational investment.

HNSW accuracy vs. memory: HNSW's graph structure requires approximately 50–100 bytes per vector beyond the raw embedding storage, and the figure rises with larger values of M. For 10 million 1536-dimensional vectors (about 61GB of raw float32 embeddings), that graph overhead adds approximately 500MB–1GB of RAM. Production systems must size memory for both the raw vectors and the graph.
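The sizing arithmetic can be sketched as follows. The per-vector graph overhead is passed in as an input, since it depends on the implementation and on M, and float32 storage is an assumption; the function name is hypothetical.

```python
def hnsw_memory_estimate(num_vectors, dims, graph_bytes_per_vector, bytes_per_float=4):
    """Return (raw embedding bytes, graph overhead bytes)."""
    raw = num_vectors * dims * bytes_per_float
    graph = num_vectors * graph_bytes_per_vector
    return raw, graph


for overhead in (50, 100):
    raw, graph = hnsw_memory_estimate(10_000_000, 1536, overhead)
    print(
        f"{overhead} bytes/vector: raw {raw / 1e9:.2f} GB, "
        f"graph {graph / 1e9:.2f} GB, total {(raw + graph) / 1e9:.2f} GB"
    )
```

Under these assumptions the raw embeddings dominate the footprint, so reducing dimensionality or precision usually saves more memory than tuning the graph.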

Approximate vs. exact search: ANN search misses some true nearest neighbors; the miss rate is roughly one minus recall, so it falls as recall rises. For most RAG use cases, 95–98% recall is sufficient. For high-stakes retrieval (drug interaction lookup, critical clinical guideline), consider a hybrid: ANN for candidate retrieval followed by exact reranking of the top candidates.
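A minimal sketch of the hybrid pattern and of how recall is measured is shown below. The candidate list is hard-coded to stand in for an ANN result that missed one true neighbor; the data and function names are invented for illustration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate result contains."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def exact_rerank(query, candidate_ids, vectors, k):
    """Re-score ANN candidates with exact distances and keep the best k."""
    return sorted(candidate_ids, key=lambda i: sq_dist(query, vectors[i]))[:k]


vectors = [[0.0, 0.0], [1.0, 0.0], [0.2, 0.1], [0.5, 0.5], [0.1, 0.0]]
query = [0.0, 0.0]

# Stand-in for an ANN result: over-fetched candidates, but vector 0 was missed.
ann_candidates = [3, 1, 4, 2]

reranked = exact_rerank(query, ann_candidates, vectors, k=2)
true_top = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:2]
print("reranked:", reranked)
print("true top-2:", true_top)
print("recall@2:", recall_at_k(reranked, true_top))
```

Reranking fixes the ordering among the candidates but cannot recover a neighbor the ANN step never returned, which is why the candidate pool is usually fetched several times larger than k.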

Interview Questions

Q: Explain why approximate nearest neighbor search is used in vector databases instead of exact nearest neighbor search, and what trade-offs this introduces.

Category: Architecture | Difficulty: Senior | Role: AI Architect / ML Engineer

Answer Framework:

Exact nearest neighbor search in high-dimensional space requires computing the distance between the query vector and every stored vector. For 10 million vectors at 1536 dimensions, this is approximately 30 billion floating-point operations per query, which takes several seconds on modern hardware and is unacceptable for an interactive RAG system.

Approximate nearest neighbor (ANN) algorithms solve this by building an index structure at insert time that allows the database to skip most distance computations at query time. HNSW, the dominant algorithm, builds a hierarchical graph where each vector connects to its nearest neighbors; query traversal navigates this graph in O(log N) steps to find approximate nearest neighbors.

The trade-off is recall: ANN may miss a true nearest neighbor, with a probability that decreases as the ef (search width) parameter increases. At ef=100, HNSW typically achieves 95–99% recall, meaning roughly 1–5% of the true nearest neighbors are missing from the results. For RAG use cases, this accuracy level is sufficient because retrieving the 2nd-most-relevant chunk instead of the 1st-most-relevant rarely changes the final LLM output quality.

Key Points to Hit:

  • Exact search computational cost at scale (O(N·D) per query)
  • HNSW graph structure and traversal
  • Recall accuracy as a tunable parameter (ef)
  • Trade-off context: 95–99% recall is sufficient for RAG

Red Flags:

  • Claiming ANN is exact
  • Not knowing that recall is tunable

Q: A clinical RAG system is returning outdated treatment guidelines. What are the most likely root causes and how would you diagnose them?

Category: System Design | Difficulty: Principal | Role: AI Architect

Answer Framework:

Three root causes with distinct diagnostic approaches:

1. Index not updated: The most common cause. The vector store was indexed once and never updated. Diagnose by checking the effective_date metadata of the retrieved chunks: if all chunks predate recent guideline updates, the index needs refreshing. Fix: implement a scheduled re-indexing pipeline triggered by guideline publication events.

2. Metadata filter not applied: The retrieval query is not filtering by date, so old and new versions of the same guideline are both in the index and the old version is winning on embedding similarity. Diagnose by running the query without any filter and examining the effective_date distribution in the results. Fix: add effective_date >= [last_major_update] to the metadata filter, or implement version deduplication that removes superseded documents.

3. Wrong embedding model: A high-quality embedding model captures the semantic content of clinical recommendations more accurately than a general-purpose model, meaning newer more specific guidelines retrieve better. If the embedding model was downgraded (to reduce cost), older documents whose embeddings better matched the query might rank higher. Diagnose by checking which embedding model was used and when.
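The two fixes named under the second root cause can be sketched over plain dictionaries. The effective_date field comes from the discussion above; the guideline_id field, the sample records, and the function names are hypothetical, and a real system would apply these rules in the vector database's metadata filter or in the indexing pipeline.

```python
from datetime import date


def filter_by_effective_date(chunks, cutoff):
    """Keep only chunks whose guideline took effect on or after the cutoff."""
    return [c for c in chunks if c["effective_date"] >= cutoff]


def drop_superseded(chunks):
    """Keep only chunks from the newest version of each guideline."""
    latest = {}
    for c in chunks:
        gid = c["guideline_id"]
        if gid not in latest or c["effective_date"] > latest[gid]:
            latest[gid] = c["effective_date"]
    return [c for c in chunks if c["effective_date"] == latest[c["guideline_id"]]]


retrieved = [
    {"guideline_id": "htn", "effective_date": date(2017, 11, 13), "text": "older version"},
    {"guideline_id": "htn", "effective_date": date(2023, 6, 1), "text": "newer version"},
    {"guideline_id": "lipids", "effective_date": date(2018, 11, 10), "text": "only version"},
]

print([c["text"] for c in drop_superseded(retrieved)])
print([c["text"] for c in filter_by_effective_date(retrieved, date(2020, 1, 1))])
```

The two approaches behave differently: a fixed date cutoff also drops the lipids guideline, which is old but still current, while deduplication keeps it. That difference is worth checking before choosing a single global cutoff.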

Key Points to Hit:

  • Three distinct root causes
  • Diagnostic approach for each
  • Metadata filtering as the standard mitigation
  • Index update cadence as a patient safety requirement (not just an operational concern)

Key Takeaways

  • Vector databases store high-dimensional embeddings and enable semantic similarity search through ANN algorithms
  • HNSW is the dominant indexing algorithm: high accuracy, high memory, fast queries
  • The same embedding model must be used at index time and query time; a mismatch produces effectively random results
  • ChromaDB suits POC and small production; pgvector suits PostgreSQL shops; Pinecone and Weaviate suit large-scale managed production
  • PHI in vector content makes the vector store a HIPAA-regulated data store; not all vector database vendors provide HIPAA BAAs
  • For clinical RAG: never index PHI; retrieve patient-specific context via FHIR API at query time
  • Index update cadence is a patient safety requirement, not an operational nicety
  • Metadata filtering requires metadata indexing at insert time

Further Reading