Vector Databases

Executive Summary

Vector databases are the infrastructure layer that makes semantic search and Retrieval-Augmented Generation possible at production scale. Unlike traditional relational databases that match records by exact field equality, vector databases store high-dimensional numerical representations of content and retrieve the records most semantically similar to a query, enabling AI systems to find relevant documents, passages, or records even when no exact keyword matches exist. For enterprise AI architects, the choice of vector database determines retrieval latency, operational complexity, cost at scale, and compliance posture. This chapter covers the internal architecture of vector databases, the algorithmic trade-offs that govern their performance, and the selection criteria that distinguish the right choice for a given enterprise deployment.

Learning Objectives

  • Explain how vector databases store and retrieve high-dimensional embeddings using approximate nearest neighbor algorithms
  • Compare the four primary vector database options (ChromaDB, Pinecone, Weaviate, pgvector) across the dimensions that matter for enterprise selection
  • Design a vector database schema that supports metadata filtering, source attribution, and access control for a production RAG pipeline
  • Identify the performance and cost trade-offs of HNSW vs. IVF indexing strategies
  • Evaluate HIPAA suitability of each vector database option for healthcare AI deployments

Business Problem

Enterprise AI systems must retrieve relevant information from large, growing document collections in under 200 milliseconds, while respecting access controls, supporting metadata filtering, and maintaining accuracy as the collection grows to millions of documents. Traditional search infrastructure (Elasticsearch, SQL LIKE queries) operates on keyword overlap, not semantic meaning: a search for "myocardial infarction" will not find documents about "heart attack" unless both terms appear. Embedding-based retrieval solves this by mapping both queries and documents into a shared semantic space where proximity means relevance.

The vector database is the infrastructure that makes this retrieval fast enough and cost-effective enough to operate at enterprise scale.

Why Vector Databases Exist

Before purpose-built vector databases, AI teams tried to implement similarity search using makeshift approaches: storing embeddings as JSON in PostgreSQL and computing cosine similarity in application code (too slow above 10,000 records), or using Elasticsearch's dense_vector field (functional but operationally complex). Neither approach provided the combination of performance, filtering, and operational simplicity needed for production RAG.

The first purpose-built vector databases (Pinecone, Milvus, Weaviate) rose to prominence in 2021–2022 as the RAG pattern became widely adopted. They addressed the core challenge: approximate nearest neighbor (ANN) search over hundreds of millions of high-dimensional vectors, with filtering and access control, at query latency under 100ms.

The "approximate" qualifier matters: finding the exact nearest neighbor requires comparing the query against every stored vector, which is too slow at scale. ANN algorithms trade a small accuracy reduction for a large performance gain, and in practice the top-k results from ANN search are accurate enough for RAG use cases.

Conceptual Explanation

The Embedding Space

An embedding model converts a piece of text (a sentence, a paragraph, a document) into a vector of floating-point numbers, typically 768 to 3072 dimensions. Two texts that are semantically similar will produce vectors that are close together in this high-dimensional space.

text
"The patient has hypertension"        → [0.23, -0.41, 0.87, ...]  (768 numbers)
"The patient has high blood pressure" → [0.24, -0.39, 0.85, ...]  (768 numbers)
"The weather is sunny today"          → [0.71, 0.22, -0.54, ...]  (768 numbers)

The first two vectors are close (small angular distance); the third is far. Retrieving "the 10 most similar vectors to this query vector" is the fundamental vector database operation.
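The notion of "close" can be made concrete with a few lines of Python. This is a minimal sketch that uses only the three leading components shown above (real embeddings have hundreds of dimensions), so the numbers are illustrative rather than what a real model would return.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Truncated to the three components printed in the text (illustrative only).
hypertension = [0.23, -0.41, 0.87]
high_blood_pressure = [0.24, -0.39, 0.85]
weather = [0.71, 0.22, -0.54]

sim_related = cosine_similarity(hypertension, high_blood_pressure)
sim_unrelated = cosine_similarity(hypertension, weather)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The two clinical sentences score close to 1.0, while the weather sentence scores below zero on these truncated vectors.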

An exact nearest neighbor search over 10 million 1536-dimensional vectors requires computing the distance between the query vector and all 10 million stored vectors: approximately 30 billion floating-point operations per query. This takes several seconds on modern hardware.
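The 30 billion figure follows from counting one multiply and one add per dimension for every stored vector. The sketch below reproduces that arithmetic and shows what a brute-force search looks like; the function names are illustrative, and the dot product stands in for whichever distance the deployment uses.

```python
import heapq


def exact_search_flops(num_vectors: int, dimensions: int) -> int:
    """One multiply and one add per dimension, for every stored vector."""
    return 2 * num_vectors * dimensions


def exact_top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Brute-force search by dot product: scores every stored vector."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    return [idx for _, idx in heapq.nlargest(k, scored)]


flops = exact_search_flops(10_000_000, 1536)
print(f"{flops / 1e9:.1f} billion floating-point operations per query")
```

The cost grows linearly with both collection size and dimensionality, which is the scaling an ANN index is designed to avoid.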

ANN algorithms build an index structure at insert time that allows the database to skip most comparisons at query time, finding the approximate nearest neighbors in milliseconds.

Core Architecture

HNSW: Hierarchical Navigable Small World

HNSW is the dominant ANN algorithm in production vector databases (used by pgvector, Weaviate, ChromaDB, and optionally Pinecone). It builds a multi-layer graph where:

  • Layer 0 contains all vectors
  • Higher layers contain progressively fewer vectors (randomly sampled)
  • Each vector connects to its nearest neighbors at each layer

Query traversal starts at the top layer (sparse, fast to navigate), finds approximate nearest neighbors, then descends to progressively denser layers to refine the result.
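The layered descent can be sketched in a few lines. This is a simplified illustration, not a production implementation: it keeps a single best node per layer, whereas real HNSW keeps a candidate list of size ef on the bottom layer. The graph layout and names are assumptions made for the example.

```python
import math


def greedy_search(graph, vectors, query, entry_point):
    """Move to the neighbor closest to the query until no neighbor improves."""
    current = entry_point
    current_dist = math.dist(vectors[current], query)
    while True:
        best = min(
            graph[current],
            key=lambda n: math.dist(vectors[n], query),
            default=current,
        )
        best_dist = math.dist(vectors[best], query)
        if best_dist >= current_dist:
            return current
        current, current_dist = best, best_dist


def search_layers(layers, vectors, query, entry_point):
    """Descend from the sparsest layer to layer 0, reusing each result as the next entry."""
    node = entry_point
    for graph in layers:  # ordered top layer first, layer 0 last
        node = greedy_search(graph, vectors, query, node)
    return node


# Five points on a line; the top layer holds only the two endpoints.
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0), 4: (4.0, 0.0)}
top_layer = {0: [4], 4: [0]}
layer_0 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

nearest = search_layers([top_layer, layer_0], vectors, (2.9, 0.0), entry_point=0)
```

The top layer jumps from node 0 to node 4 in one hop, and layer 0 then refines to node 3, the stored point nearest the query.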

HNSW parameters that affect the accuracy/performance trade-off:

| Parameter | Effect | Production Default |
| --- | --- | --- |
| M (max connections per node) | Higher = more accurate, more memory | 16–32 |
| ef_construction (build-time search width) | Higher = more accurate index, slower build | 200–400 |
| ef (query-time search width) | Higher = more accurate, slower query | 100–200 |

HNSW trade-offs:

  • Memory: raw vector storage plus graph links, typically all held in RAM. Link overhead grows with M; relative to raw storage it is large for short vectors and small for high-dimensional embeddings
  • Build time: O(N log N); index builds are slow for large collections
  • Query time: excellent, typically 1–10ms for millions of vectors
  • Accuracy: tunable; typically 95–99% recall at production settings
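Recall figures like these are measured by comparing ANN results against exact search on a sample of queries. The helper below is a minimal sketch that uses the overlap definition of recall (the fraction of the true top-k that the index returned); the function names are illustrative.

```python
def recall_at_k(approx_ids: list[int], exact_ids: list[int]) -> float:
    """Fraction of the true top-k neighbors that the ANN index also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def mean_recall(approx_results: list[list[int]], exact_results: list[list[int]]) -> float:
    """Average recall over a batch of queries."""
    pairs = list(zip(approx_results, exact_results))
    return sum(recall_at_k(a, e) for a, e in pairs) / len(pairs)


# One query where the index found 3 of the 4 true neighbors.
print(recall_at_k([1, 2, 3, 4], [1, 2, 3, 5]))
```

Re-running this measurement while varying ef shows where extra query latency stops buying meaningful recall for a given collection.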

IVF: Inverted File Index

IVF partitions the vector space into clusters (using k-means) and stores an inverted index mapping each cluster to its member vectors. At query time, only the nearest clusters are searched.

IVF is preferred when:

  • The collection is very large (hundreds of millions of vectors) and memory is constrained
  • Build time must be minimized (IVF builds faster than HNSW)

Pinecone uses a proprietary variant of IVF-based indexing.

IVF trade-offs:

  • Lower recall than HNSW at equivalent speed (approximately 90–95%)
  • Requires training on a representative sample of the data before building the index
  • More complex to tune than HNSW
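The cluster-then-probe idea can be sketched as follows. To keep the example short, the centroids are assumed to come from a prior k-means run (the training step is omitted), and the parameter name nprobe follows common IVF usage rather than any specific product's API.

```python
import math


def nearest(point, candidates, n=1):
    """Indices of the n candidates closest to point (Euclidean distance)."""
    order = sorted(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))
    return order[:n]


def build_ivf(vectors, centroids):
    """Inverted lists: centroid index -> ids of the vectors assigned to it."""
    lists = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        lists[nearest(vec, centroids)[0]].append(vec_id)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe clusters whose centroids are nearest the query."""
    candidates = [vid for c in nearest(query, centroids, nprobe) for vid in lists[c]]
    candidates.sort(key=lambda vid: math.dist(query, vectors[vid]))
    return candidates[:k]


centroids = [(0.0, 0.0), (10.0, 10.0)]  # assumed output of a prior k-means run
vectors = [(0.1, 0.2), (0.3, 0.1), (9.8, 10.1), (10.2, 9.9), (4.9, 4.9)]
lists = build_ivf(vectors, centroids)

query = (5.2, 5.2)
one_probe = ivf_search(query, vectors, centroids, lists, k=1, nprobe=1)
two_probes = ivf_search(query, vectors, centroids, lists, k=1, nprobe=2)
```

With one probe the search misses the true nearest vector (id 4), because it sits just across the cluster boundary; probing both clusters finds it. This boundary effect is the source of IVF's recall loss, and nprobe is the knob that trades speed for recall.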

Metadata Filtering

Production RAG pipelines require filtered vector search, which retrieves semantically similar documents from a specific subset defined by metadata conditions:

python
# Example: retrieve similar documents from a specific source within a date range
results = vector_store.query(
    query_embedding=query_vector,
    n_results=5,
    where={
        "$and": [
            {"source": {"$eq": "clinical-guidelines"}},
            {"effective_date": {"$gte": "2024-01-01"}},
            {"evidence_grade": {"$in": ["A", "B"]}}
        ]
    }
)

Pre-filtering vs. post-filtering:

  • Pre-filtering: Apply the metadata filter before ANN search. More accurate (ANN searches only the filtered subset) but slower: the filter must be evaluated first, and a small filtered subset may fall back to exact search.
  • Post-filtering: Run ANN search, then filter the results. Faster, but may return fewer than k results if many candidates are filtered out.

Different vector databases handle this trade-off differently: Weaviate uses pre-filtering with HNSW; Pinecone uses post-filtering by default with optional hybrid.
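The difference is easiest to see on a toy collection. In this sketch a sorted linear scan stands in for the ANN call, and the item layout and function names are assumptions for illustration only.

```python
def top_k(query, items, k):
    """Stand-in for an ANN call: ranks items by squared distance to the query."""
    ranked = sorted(
        items,
        key=lambda it: sum((q - v) ** 2 for q, v in zip(query, it["vector"])),
    )
    return ranked[:k]


def post_filter_search(query, items, k, predicate):
    """Search first, filter afterwards: may return fewer than k hits."""
    return [it for it in top_k(query, items, k) if predicate(it)]


def pre_filter_search(query, items, k, predicate):
    """Filter first, then search only the matching subset."""
    return top_k(query, [it for it in items if predicate(it)], k)


def is_guideline(item):
    return item["source"] == "clinical-guidelines"


items = [
    {"id": "a", "vector": (0.0, 0.0), "source": "news"},
    {"id": "b", "vector": (0.1, 0.0), "source": "news"},
    {"id": "c", "vector": (0.2, 0.0), "source": "clinical-guidelines"},
    {"id": "d", "vector": (0.9, 0.0), "source": "clinical-guidelines"},
]

post_hits = post_filter_search((0.0, 0.0), items, 2, is_guideline)
pre_hits = pre_filter_search((0.0, 0.0), items, 2, is_guideline)
```

Here the two nearest items are both filtered out, so post-filtering returns nothing while pre-filtering returns both guideline chunks. A common mitigation for post-filtering is to over-fetch (request more than k candidates) before applying the filter.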

Components

Core Vector Database Components

| Component | Function | Production Consideration |
| --- | --- | --- |
| Embedding storage | Persists vector arrays alongside metadata | Float32 vs. Float16 storage (memory trade-off) |
| ANN index | Data structure enabling fast similarity search | HNSW vs. IVF selection based on scale |
| Metadata store | Key-value pairs attached to each vector | Filtering, access control, source attribution |
| Query engine | Executes ANN search with optional filtering | Pre-filter vs. post-filter strategy |
| Collection/namespace | Logical partition of vectors | Multi-tenancy, access control |
| Replication | High availability | Critical for production SLA |

Vector Database Selection Framework

python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VectorDatabaseSelectionCriteria:
    """
    Framework for vector database selection in enterprise AI deployments.
    """
    # Scale requirements
    document_count_current: int
    document_count_projected_2yr: int
    
    # Performance requirements
    query_latency_p95_ms: float     # Required p95 query latency
    writes_per_second: int           # Embedding ingestion rate
    
    # Operational requirements
    managed_service_required: bool   # True if team cannot manage infrastructure
    existing_infrastructure: Optional[str]  # "postgresql" | "azure" | "aws" | None
    
    # Compliance requirements
    hipaa_baa_required: bool
    data_residency_region: Optional[str]
    phi_in_vectors: bool             # Will vector content contain PHI?
    
    # Cost sensitivity
    monthly_budget_usd: Optional[float]
    
    def recommend(self) -> dict:
        scores = {
            "chromadb": 0,
            "pinecone": 0,
            "weaviate": 0,
            "pgvector": 0
        }
        
        # Scale scoring
        if self.document_count_projected_2yr < 500_000:
            scores["chromadb"] += 2
            scores["pgvector"] += 2
        elif self.document_count_projected_2yr < 10_000_000:
            scores["pgvector"] += 1
            scores["weaviate"] += 2
            scores["pinecone"] += 2
        else:
            scores["pinecone"] += 3
            scores["weaviate"] += 2
        
        # Managed service preference
        if self.managed_service_required:
            scores["pinecone"] += 3
            scores["chromadb"] -= 1  # Requires self-management at scale
        
        # PostgreSQL integration
        if self.existing_infrastructure == "postgresql":
            scores["pgvector"] += 3
        
        # HIPAA
        if self.hipaa_baa_required:
            scores["chromadb"] = min(scores["chromadb"], 0)  # No HIPAA BAA
            scores["pinecone"] += 1   # HIPAA BAA available
            scores["weaviate"] += 1   # HIPAA BAA available (Weaviate Cloud)
            scores["pgvector"] += 1   # Via cloud provider BAA
        
        top = max(scores, key=scores.get)
        return {"recommendation": top, "scores": scores}

Architecture Diagram

Implementation Patterns

ChromaDB: Development and POC Pattern

ChromaDB is the preferred choice for development, POC, and small-to-medium production deployments. It runs embedded (in-process, no separate server) or as a standalone server and requires no infrastructure expertise to operate.

python
import chromadb
from chromadb.config import Settings

# Educational Example: Not for clinical use

def create_clinical_knowledge_store(
    collection_name: str,
    persist_directory: str
) -> chromadb.Collection:
    """
    Create a ChromaDB collection for clinical knowledge retrieval.
    Suitable for POC and small production deployments.
    """
    client = chromadb.PersistentClient(
        path=persist_directory,
        settings=Settings(
            anonymized_telemetry=False  # Disable telemetry for enterprise deployments
        )
    )
    
    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={
            "hnsw:space": "cosine",           # Cosine similarity for text embeddings
            "hnsw:construction_ef": 200,       # Build accuracy
            "hnsw:M": 16,                      # Graph connectivity
            "hnsw:search_ef": 100,             # Query accuracy
        }
    )
    return collection


def index_clinical_document(
    collection: chromadb.Collection,
    document_id: str,
    content: str,
    embedding: list[float],
    metadata: dict
) -> None:
    """
    Index a clinical document with required metadata for RAG.
    
    Required metadata fields for clinical RAG:
    - source: organization that produced the document
    - document_type: guideline | formulary | protocol | policy
    - effective_date: when this version became current (ISO 8601)
    - evidence_grade: A | B | C | Expert (for guidelines)
    - section: section heading within the source document
    """
    required_metadata_fields = ["source", "document_type", "effective_date", "section"]
    missing = [f for f in required_metadata_fields if f not in metadata]
    if missing:
        raise ValueError(f"Missing required metadata fields: {missing}")
    
    collection.add(
        ids=[document_id],
        embeddings=[embedding],
        documents=[content],
        metadatas=[metadata]
    )


def retrieve_similar_chunks(
    collection: chromadb.Collection,
    query_embedding: list[float],
    n_results: int = 5,
    source_filter: str = None,
    min_date: str = None
) -> list[dict]:
    """
    Retrieve semantically similar chunks with optional metadata filtering.
    """
    where_clause = {}
    
    if source_filter and min_date:
        where_clause = {
            "$and": [
                {"source": {"$eq": source_filter}},
                {"effective_date": {"$gte": min_date}}
            ]
        }
    elif source_filter:
        where_clause = {"source": {"$eq": source_filter}}
    elif min_date:
        where_clause = {"effective_date": {"$gte": min_date}}
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        where=where_clause if where_clause else None,
        include=["documents", "metadatas", "distances"]
    )
    
    return [
        {
            "content": results["documents"][0][i],
            "metadata": results["metadatas"][0][i],
            "distance": results["distances"][0][i],
            "relevance_score": 1 - results["distances"][0][i]  # cosine: lower distance = higher relevance
        }
        for i in range(len(results["documents"][0]))
    ]

pgvector: PostgreSQL-Native Pattern

pgvector extends PostgreSQL with vector storage and HNSW/IVF indexing. The primary advantage is operational: organizations with PostgreSQL expertise can manage pgvector without learning a new database system.

sql
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Clinical knowledge table with vector + metadata
-- Educational Example: Not for clinical use
CREATE TABLE clinical_documents (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content         TEXT NOT NULL,
    embedding       VECTOR(1536),    -- OpenAI ada-002 or equivalent dimensionality
    source          TEXT NOT NULL,
    document_type   TEXT NOT NULL CHECK (document_type IN ('guideline', 'formulary', 'protocol', 'policy')),
    effective_date  DATE NOT NULL,
    evidence_grade  TEXT,
    section         TEXT,
    organization    TEXT NOT NULL,
    created_at      TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index for fast approximate nearest neighbor search
-- vector_cosine_ops serves cosine-distance (<=>) queries; L2 (<->) queries need a vector_l2_ops index
CREATE INDEX clinical_documents_embedding_idx
ON clinical_documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Composite index for metadata-filtered queries
CREATE INDEX clinical_documents_filter_idx
ON clinical_documents (source, document_type, effective_date);

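-- Retrieval query with metadata filtering
-- Returns up to 5 of the most similar guideline chunks effective on or after 2024-01-01
-- Note: when the HNSW index is used, pgvector applies the WHERE clause after the index scan,
-- so fewer than 5 rows may be returned for selective filters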
SELECT
    id,
    content,
    source,
    evidence_grade,
    effective_date,
    1 - (embedding <=> $1::vector) AS relevance_score
FROM clinical_documents
WHERE
    document_type = 'guideline'
    AND effective_date >= '2024-01-01'
ORDER BY embedding <=> $1::vector
LIMIT 5;

Enterprise Considerations

Scalability ceiling: ChromaDB's embedded mode has a practical upper limit of approximately 2–5 million vectors before query latency degrades. Above this threshold, Pinecone or Weaviate (with dedicated nodes) provide better performance scaling.

Multi-tenancy: Enterprise AI platforms serving multiple departments or clients require namespace isolation. Pinecone provides namespaces natively; pgvector requires schema-per-tenant or row-level security; ChromaDB uses separate collections per tenant. The choice affects both operational complexity and access control design.

Backup and disaster recovery: Vector databases contain both the original metadata and the computed embeddings. Embeddings can be recomputed from source documents if the embedding model is available, but this takes significant time for large collections. Backup strategy must cover both the document source and the index.

Embedding model versioning: When the embedding model changes (a higher-quality model becomes available), all existing vectors must be recomputed and re-indexed. This is a multi-hour to multi-day operation for large collections. Production systems must plan for embedding model migrations without service interruption.
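One way to migrate without interruption is to build the new index alongside the old one and switch readers over with an alias. The sketch below is a hypothetical in-memory illustration of that pattern; the class names, the collection naming scheme, and the model identifiers are all assumptions, and a real deployment would persist the alias table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexVersion:
    collection: str  # physical collection holding the vectors
    model_id: str    # embedding model that produced them


class AliasTable:
    """Hypothetical in-memory alias table; a real deployment would persist it."""

    def __init__(self) -> None:
        self._active: dict[str, IndexVersion] = {}

    def activate(self, alias: str, model_id: str) -> IndexVersion:
        """Point the alias at the collection built with model_id."""
        version = IndexVersion(f"{alias}__{model_id}", model_id)
        self._active[alias] = version
        return version

    def resolve(self, alias: str) -> IndexVersion:
        """Readers call this per query to get the collection and its model."""
        return self._active[alias]


aliases = AliasTable()
aliases.activate("clinical-guidelines", "embed-v1")  # current production index
# ... re-embed all source documents into clinical-guidelines__embed-v2 ...
aliases.activate("clinical-guidelines", "embed-v2")  # cut over once verified
active = aliases.resolve("clinical-guidelines")
```

Because the alias returns the model identifier together with the collection name, the query path can embed with the matching model, which also guards against the model-mismatch failure described under Common Mistakes.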

Security Considerations

PHI in vector content: If vector database content includes clinical text containing PHI, the vector database itself is a PHI data store requiring HIPAA controls: encryption at rest, encryption in transit, access control, and audit logging. ChromaDB does not provide a HIPAA BAA. Pinecone and Weaviate Cloud provide HIPAA BAAs (verify current availability). pgvector inherits the HIPAA posture of the PostgreSQL deployment.

Vector reconstruction risk: Raw embeddings can be partially inverted to reconstruct approximate source text. For documents containing PHI, this means the vector store is not a safe de-identified environment; it must be treated with the same access controls as the source text.

Access control at query time: Vector database access control must be enforced at the application layer if the vector database itself does not support row-level access control. A RAG pipeline that serves multiple user types (physicians, administrators, patients) must filter retrieved results to those the requesting user is authorized to see.
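A minimal sketch of application-layer enforcement follows. The role-to-document-type policy is hypothetical, and the filter dictionary reuses the operator syntax from the metadata filtering example earlier in the chapter; adapt both to the actual policy and database.

```python
# Hypothetical policy: which document_type values each role may read.
ROLE_DOCUMENT_TYPES = {
    "physician": {"guideline", "formulary", "protocol", "policy"},
    "administrator": {"policy"},
}


def role_where_clause(role: str) -> dict:
    """Metadata filter for the vector query, so ranking happens inside the allowed set."""
    allowed = ROLE_DOCUMENT_TYPES.get(role)
    if not allowed:
        raise PermissionError(f"role {role!r} may not query this store")
    return {"document_type": {"$in": sorted(allowed)}}


def authorized_only(chunks: list[dict], role: str) -> list[dict]:
    """Defense in depth: drop any retrieved chunk the role may not see."""
    allowed = ROLE_DOCUMENT_TYPES.get(role, set())
    return [c for c in chunks if c["metadata"].get("document_type") in allowed]
```

Passing the role filter into the query keeps the result count at k, while the second check catches anything that slips through, for example a chunk with missing metadata. Unknown roles are denied by default.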

Healthcare Example

Educational Example: Not intended for clinical use.

The Reference Healthcare Organization's HMS AI Platform uses a partitioned vector store architecture with three collections/namespaces:

| Partition | Content | Database | Contains PHI? | Update Cadence |
| --- | --- | --- | --- | --- |
| clinical-guidelines | ACC, AHA, USPSTF, CDC guidelines | Weaviate Cloud (HIPAA BAA) | No (de-identified) | Quarterly |
| hospital-formulary | HMS-specific drug list, dosing, interactions | pgvector (within HMS Azure environment) | No | Per pharmacy committee |
| clinical-protocols | HMS clinical protocols | pgvector (within HMS Azure environment) | No | Per protocol committee |

Patient-specific clinical notes are not indexed in the vector store. The RAG pipeline retrieves from static knowledge bases; patient-specific context is retrieved via FHIR API at query time and injected directly into the prompt. This architecture ensures no PHI enters the vector index.

Common Mistakes

1. Using the wrong distance metric. Cosine similarity is the usual metric for text embeddings from most models (OpenAI, Voyage, Cohere). For unit-normalized vectors, cosine similarity, dot product, and Euclidean (L2) distance all produce the same ranking, so the choice matters mainly when vectors are not normalized; in that case, use the metric the embedding model specifies. Using the wrong metric produces subtly incorrect retrieval results that are difficult to diagnose.

2. Mismatching embedding models between indexing and query. Embeddings from model A are not comparable to embeddings from model B. If the query uses a different model than the indexing pipeline, retrieval will be effectively random. This is a silent failure: no error is raised.

3. Chunking too large or too small. Chunks that are too large (full documents) retrieve too much irrelevant context. Chunks that are too small (single sentences) lose surrounding context. The right chunk size is use-case dependent: 200–500 tokens is the starting range for most clinical document RAG.

4. Not indexing metadata at insert time. Metadata filtering that is not indexed requires a full scan of the collection at query time. Index all filter dimensions at insert time.

5. Skipping embedding model warm-up. Cold-start latency for embedding model inference can be 3–10 seconds. Production RAG pipelines must keep the embedding model warm with periodic heartbeat requests.
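For the chunk-size mistake (item 3), a sliding window with overlap is a common starting point. The sketch below counts whitespace-separated words as a rough proxy for tokens, which is an assumption: a production pipeline would count tokens with the embedding model's own tokenizer. The default sizes are illustrative values within the range given above.

```python
def chunk_tokens(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size words that share overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()  # whitespace words as a rough proxy for model tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_tokens(sample, chunk_size=4, overlap=1))
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of indexing some text twice.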

Best Practices

  • Always use the same embedding model at index time and query time; treat it as a versioned dependency
  • Index all metadata fields that will be used for filtering at insert time
  • Use cosine similarity for text embeddings unless the embedding model specifies otherwise
  • Implement collection-level partitioning for different document sources rather than using metadata filters alone
  • For clinical RAG: never index PHI in the vector store; retrieve patient-specific data via FHIR at query time
  • Plan embedding model migration from day one: keep source documents available so re-indexing is possible
  • Choose a vector database that matches the operational capability of the team managing it

Alternatives

| Option | Best For | Not Suitable For |
| --- | --- | --- |
| ChromaDB | POC, small production, developer-friendly | Large scale (>5M vectors), HIPAA BAA requirement |
| Pinecone | Fully managed, large scale, minimal ops | PostgreSQL-centric shops, strict data residency |
| Weaviate | Multi-modal, GraphQL API, enterprise features | Teams preferring minimal new infrastructure |
| pgvector | PostgreSQL shops, HIPAA via cloud provider, hybrid | Large scale without dedicated PostgreSQL investment |
| Elasticsearch dense_vector | Teams with existing Elasticsearch investment | New deployments (higher operational complexity) |
| Milvus | Largest scale (billions of vectors) | Small teams without dedicated infrastructure |

Trade-offs

Managed vs. self-hosted: Managed vector databases (Pinecone, Weaviate Cloud) eliminate operational burden but introduce vendor lock-in and data residency constraints. Self-hosted (ChromaDB, pgvector, Weaviate self-hosted) provides control but requires infrastructure expertise and operational investment.

HNSW accuracy vs. memory: HNSW's graph structure requires approximately 50–100 bytes per vector beyond the raw embedding storage. For 10 million 1536-dimensional vectors, HNSW overhead adds approximately 500 MB–1 GB of RAM on top of roughly 61 GB of raw float32 vectors. Production systems must size memory accordingly.
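The sizing arithmetic can be reproduced with a small helper. This is a sketch that takes the per-vector graph overhead quoted above as an input assumption; actual overhead depends on M and on the specific implementation.

```python
def hnsw_memory_estimate(
    num_vectors: int,
    dimensions: int,
    bytes_per_value: int = 4,  # float32
    graph_bytes_range: tuple[int, int] = (50, 100),  # assumed per-vector overhead
) -> dict[str, int]:
    """Raw embedding storage plus a low/high estimate of graph overhead, in bytes."""
    raw = num_vectors * dimensions * bytes_per_value
    low, high = (num_vectors * b for b in graph_bytes_range)
    return {"raw_bytes": raw, "graph_bytes_low": low, "graph_bytes_high": high}


est = hnsw_memory_estimate(10_000_000, 1536)
print(f"raw vectors: {est['raw_bytes'] / 1e9:.1f} GB")
print(f"graph links: {est['graph_bytes_low'] / 1e9:.1f} to {est['graph_bytes_high'] / 1e9:.1f} GB")
```

At this dimensionality the raw vectors dominate the footprint, so halving bytes_per_value (Float16 storage) saves far more memory than tuning the graph.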

Approximate vs. exact search: ANN search misses some true nearest neighbors; the miss rate is roughly one minus recall. For most RAG use cases, 95–98% recall is sufficient. For high-stakes retrieval (drug interaction lookup, critical clinical guideline), consider a hybrid: ANN for candidate retrieval followed by exact reranking of the top candidates.
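The hybrid pattern amounts to asking the index for more candidates than needed and then rescoring them exactly. A minimal sketch of the reranking step, assuming the ANN call has already returned (id, vector) pairs; the function names are illustrative.

```python
import math


def cosine(a, b) -> float:
    """Exact cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def rerank_exact(query, candidates, k):
    """Rescore oversampled ANN candidates exactly and keep the best k ids.

    candidates: list of (doc_id, vector) pairs returned by the ANN index.
    """
    scored = sorted(candidates, key=lambda c: cosine(query, c[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]


ann_candidates = [("a", (0.5, 0.5)), ("b", (1.0, 0.1)), ("c", (0.0, 1.0))]
top_two = rerank_exact((1.0, 0.0), ann_candidates, k=2)
```

Reranking only corrects the order of candidates the index returned; it cannot recover a document the ANN stage missed, so the oversampling factor still matters.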

Interview Questions

Q: Explain why approximate nearest neighbor search is used in vector databases instead of exact nearest neighbor search, and what trade-offs this introduces.

Category: Architecture Difficulty: Senior Role: AI Architect / ML Engineer

Answer Framework:

Exact nearest neighbor search in high-dimensional space requires computing the distance between the query vector and every stored vector. For 10 million vectors at 1536 dimensions, this is approximately 30 billion floating-point operations per query, taking several seconds on modern hardware, which is unacceptable for an interactive RAG system.

Approximate nearest neighbor (ANN) algorithms solve this by building an index structure at insert time that allows the database to skip most distance computations at query time. HNSW, the dominant algorithm, builds a hierarchical graph where each vector connects to its nearest neighbors; query traversal navigates this graph in O(log N) steps to find approximate nearest neighbors.

The trade-off is recall accuracy: ANN may miss the true nearest neighbor with a probability that decreases as the ef (search width) parameter increases. At ef=100, HNSW typically achieves 95–99% recall, meaning 1–5% of queries might miss the true best match. For RAG use cases, this accuracy level is sufficient because retrieving the 2nd-most-relevant chunk instead of the 1st-most-relevant rarely changes the final LLM output quality.

Key Points to Hit:

  • Exact search computational cost at scale (O(N·D) per query)
  • HNSW graph structure and traversal
  • Recall accuracy as a tunable parameter (ef)
  • Trade-off context: 95–99% recall is sufficient for RAG

Red Flags:

  • Claiming ANN is exact
  • Not knowing that recall is tunable

Q: A clinical RAG system is returning outdated treatment guidelines. What are the most likely root causes and how would you diagnose them?

Category: System Design Difficulty: Principal Role: AI Architect

Answer Framework:

Three root causes with distinct diagnostic approaches:

1. Index not updated: The most common cause. The vector store was indexed once and never updated. Diagnose by checking the effective_date metadata of the retrieved chunks: if all chunks predate recent guideline updates, the index needs refreshing. Fix: implement a scheduled re-indexing pipeline triggered by guideline publication events.

2. Metadata filter not applied: The retrieval query is not filtering by date, so old and new versions of the same guideline are both in the index and the old version is winning on embedding similarity. Diagnose by running the query without any filter and examining the effective_date distribution in results. Fix: add effective_date >= [last_major_update] to the metadata filter, or implement version deduplication that removes superseded documents.

3. Wrong embedding model: A high-quality embedding model captures the semantic content of clinical recommendations more accurately than a general-purpose model, so newer, more specific guidelines rank higher under it. If the embedding model was downgraded (for example, to reduce cost), older documents may now match the query more closely and outrank them. Diagnose by checking which embedding model produced the stored vectors, which model embeds queries, and when either changed.
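The version deduplication mentioned under the second root cause can be sketched as a post-retrieval step. The guideline_id metadata field is hypothetical (the schema earlier in the chapter does not define it); the sketch assumes each chunk carries some stable identifier for the guideline it belongs to and an ISO 8601 effective_date, which compares correctly as a string.

```python
def latest_versions(chunks: list[dict]) -> list[dict]:
    """Keep only chunks from the newest effective_date of each guideline."""
    newest: dict[str, str] = {}
    for chunk in chunks:
        meta = chunk["metadata"]
        gid = meta["guideline_id"]  # hypothetical stable identifier per guideline
        if gid not in newest or meta["effective_date"] > newest[gid]:
            newest[gid] = meta["effective_date"]
    return [
        c for c in chunks
        if c["metadata"]["effective_date"] == newest[c["metadata"]["guideline_id"]]
    ]


retrieved = [
    {"id": "c1", "metadata": {"guideline_id": "guideline-a", "effective_date": "2019-03-01"}},
    {"id": "c2", "metadata": {"guideline_id": "guideline-a", "effective_date": "2025-01-15"}},
    {"id": "c3", "metadata": {"guideline_id": "guideline-b", "effective_date": "2021-06-30"}},
]
current = latest_versions(retrieved)
```

Filtering after retrieval only hides superseded chunks that happened to be returned together with a newer version; removing superseded versions from the index at ingestion time is the more robust fix.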

Key Points to Hit:

  • Three distinct root causes
  • Diagnostic approach for each
  • Metadata filtering as the standard mitigation
  • Index update cadence as a patient safety requirement (not just an operational concern)

Key Takeaways

  • Vector databases store high-dimensional embeddings and enable semantic similarity search through ANN algorithms
  • HNSW is the dominant indexing algorithm: high accuracy, high memory, fast queries
  • The same embedding model must be used at index time and query time; a mismatch produces effectively random results
  • ChromaDB suits POC and small production; pgvector suits PostgreSQL shops; Pinecone and Weaviate suit large-scale managed production
  • PHI in vector content creates a HIPAA data store; not all vector databases provide HIPAA BAAs
  • For clinical RAG: never index PHI; retrieve patient-specific context via FHIR API at query time
  • Index update cadence is a patient safety requirement, not an operational nicety
  • Metadata filtering requires metadata indexing at insert time

Glossary

ANN (Approximate Nearest Neighbor): A class of algorithms that find the k vectors most similar to a query vector in sublinear time (roughly O(log N) for graph indexes such as HNSW), with a tunable accuracy trade-off.

HNSW (Hierarchical Navigable Small World): The dominant ANN algorithm, implemented as a multi-layer proximity graph that supports incremental insertion and high-recall queries at enterprise scale.

IVF (Inverted File Index): An ANN algorithm that partitions vectors into clusters and restricts search to the nearest clusters, trading recall accuracy for memory efficiency.

Recall@k: The fraction of the true k nearest neighbors that appear in the top-k ANN results, averaged over queries. A key accuracy metric for vector database evaluation.

Cosine Similarity: A similarity measure between vectors based on the angle between them rather than their magnitude; cosine distance is 1 minus cosine similarity. The standard metric for text embeddings.

Namespace / Collection: A logical partition within a vector database that isolates vectors for multi-tenancy, access control, or topic separation.

Further Reading