Data Pipelines for AI

Executive Summary

AI systems are only as good as the data they are built on. The data pipeline is the infrastructure that transforms raw source data — clinical notes, policy documents, transactional records, structured databases — into the embeddings, fine-tuning datasets, and evaluation corpora that AI systems consume. At enterprise scale, data pipelines for AI are significantly more complex than traditional ETL pipelines: they must handle unstructured text, manage document versioning, coordinate embedding computation across heterogeneous sources, and maintain data quality in a domain where pipeline failures produce silent quality degradation rather than system errors. This chapter covers the architecture of AI data pipelines from ingestion through embedding and indexing, with production patterns for the healthcare context.

Learning Objectives

Design an end-to-end AI data pipeline from source document ingestion through vector store indexing
Select appropriate chunking strategies for different document types and use cases
Implement pipeline orchestration with error handling, retry logic, and quality validation
Design pipeline monitoring that detects silent failures (quality degradation) as well as hard failures
Apply healthcare-specific pipeline requirements for PHI handling and document provenance

Business Problem

An AI system that retrieves from a stale or incomplete knowledge base produces worse outcomes than one that retrieves nothing — because the AI presents outdated information with apparent confidence. In clinical settings, a knowledge base that reflects guidelines from 18 months ago will produce clinical recommendations that no longer reflect current standard of care. The data pipeline is what keeps the knowledge base current, complete, and trustworthy.

Failures that a traditional ETL pipeline would surface as job errors are silent in AI contexts: a missed document, an incorrect chunk boundary, or a stale embedding produces no error. It simply degrades the AI's output quality in ways that are difficult to detect without continuous evaluation.

Why This Technology Exists

The first RAG systems used manual, one-time document loading: engineers uploaded PDF files to a script that chunked and embedded them. This approach fails in production for three reasons: source documents change (guidelines are updated, formularies are revised), new documents must be added without full re-indexing, and the scale of document collections grows beyond manual management.

Purpose-built AI data pipelines address these requirements: scheduled ingestion that detects new and updated documents, incremental indexing that adds new chunks without re-embedding the entire collection, and quality validation that detects pipeline failures before they reach the production RAG system.
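The change detection step can be sketched roughly as follows. This is a minimal illustration, assuming documents are already extracted to plain text and keyed by a stable document ID; `content_hash`, `classify_documents`, and the `seen_hashes` mapping are hypothetical names, not part of any specific framework.

```python
import hashlib

def content_hash(text: str) -> str:
    """Return a SHA-256 digest of the text after collapsing whitespace."""
    normalized = " ".join(text.split())  # reflowed text should hash the same
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def classify_documents(current: dict[str, str], seen_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current source documents with hashes recorded on the previous run."""
    result = {"new": [], "updated": [], "unchanged": [], "removed": []}
    for doc_id, text in current.items():
        digest = content_hash(text)
        if doc_id not in seen_hashes:
            result["new"].append(doc_id)
        elif seen_hashes[doc_id] != digest:
            result["updated"].append(doc_id)
        else:
            result["unchanged"].append(doc_id)
    # Documents that disappeared from the source need their chunks deleted.
    result["removed"] = [doc_id for doc_id in seen_hashes if doc_id not in current]
    return result
```

Only the `new` and `updated` groups need chunking and embedding on an incremental run; `updated` and `removed` both imply deleting previously indexed chunks.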

Core Architecture

graph TD
    subgraph "Source Layer"
        S1["Clinical Guidelines\n(ACC, AHA, USPSTF PDFs)"]
        S2["Hospital Formulary\n(Pharmacy database)"]
        S3["Clinical Protocols\n(SharePoint / intranet)"]
        S4["EHR Documents\n(via FHIR DocumentReference)"]
    end
    subgraph "Ingestion Layer"
        CRAWL["Document Crawler\nDetect new / updated docs"]
        EXTRACT["Text Extraction\nPDF, DOCX, HTML → plain text"]
        META["Metadata Extraction\nSource, date, version, author"]
        DEDUP["Deduplication\nContent hash; skip unchanged docs"]
    end
    subgraph "Processing Layer"
        CHUNK["Chunking Engine\nStrategy per document type"]
        CLEAN["Text Cleaning\nRemove headers, footers, artifacts"]
        EMB["Embedding Service\nBatch embedding computation"]
        VALIDATE["Quality Validation\nChunk size, embedding shape, metadata completeness"]
    end
    subgraph "Storage Layer"
        VECSTORE["Vector Store\n(Weaviate / pgvector)"]
        DOCSTORE["Document Store\n(Source text + metadata)"]
        AUDIT["Pipeline Audit Log\nIngestion events, quality scores"]
    end
    S1 & S2 & S3 & S4 --> CRAWL
    CRAWL --> EXTRACT --> META --> DEDUP
    DEDUP --> CHUNK --> CLEAN --> EMB --> VALIDATE
    VALIDATE --> VECSTORE & DOCSTORE & AUDIT
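One way to wire the layers together is a small runner that passes each document through an ordered list of stages and records every step in the audit log. The sketch below is an assumption about how such orchestration might look; `Document`, `run_pipeline`, and the two example stages are hypothetical and stand in for real extraction, cleaning, and validation components.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Document:
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)

Stage = Callable[[Document], Document]

def run_pipeline(doc: Document, stages: list, audit_log: list) -> Optional[Document]:
    """Run a document through (name, stage) pairs in order, auditing each step."""
    for name, stage in stages:
        try:
            doc = stage(doc)
            audit_log.append({"doc_id": doc.doc_id, "stage": name, "status": "ok"})
        except Exception as exc:
            # Record the failure and stop; the document never reaches the index.
            audit_log.append({"doc_id": doc.doc_id, "stage": name,
                              "status": "failed", "error": str(exc)})
            return None
    return doc

def clean(doc: Document) -> Document:
    """Example stage: drop lines that look like page-number footers."""
    lines = [ln for ln in doc.text.splitlines() if not ln.startswith("Page ")]
    return Document(doc.doc_id, "\n".join(lines), doc.metadata)

def require_source(doc: Document) -> Document:
    """Example stage: reject documents without provenance metadata."""
    if "source" not in doc.metadata:
        raise ValueError("missing source metadata")
    return doc
```

Because every stage outcome lands in the audit log, a document that silently drops out of the index can be traced to the stage that rejected it.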

Enterprise Considerations

Pipeline scheduling: Clinical knowledge bases require update cadences driven by clinical safety requirements:

Hospital formulary: within 5 business days of any formulary change (patient safety)
Clinical guidelines: quarterly or on publication of major updates
Hospital protocols: within 5 business days of committee approval
Drug interaction database: continuous (licensed data feed)
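The business-day windows above can be checked mechanically. The following is a rough sketch, assuming the pipeline knows the date a source changed and that weekends are the only non-business days (holidays are ignored here); `MAX_BUSINESS_DAYS` and both function names are hypothetical.

```python
from datetime import date, timedelta

# Windows taken from the cadence list above; keys are illustrative labels.
MAX_BUSINESS_DAYS = {"formulary": 5, "protocol": 5}

def business_days_between(start: date, end: date) -> int:
    """Count weekdays after `start` up to and including `end` (holidays ignored)."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            days += 1
    return days

def is_overdue(source_type: str, changed_on: date, today: date) -> bool:
    """True when a still-unindexed change has exceeded its allowed window."""
    return business_days_between(changed_on, today) > MAX_BUSINESS_DAYS[source_type]
```

A scheduler could run `is_overdue` for every pending change and raise an alert rather than waiting for someone to notice stale retrieval results.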

Orphan chunk cleanup: When a source document is updated or removed, the old chunks must be deleted from the vector store. Failure to delete creates a mix of current and outdated content that degrades retrieval quality and may produce incorrect clinical recommendations.
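A minimal sketch of orphan cleanup as an atomic replacement follows. It uses SQLite from the Python standard library purely as a stand-in for a transactional store (the same delete-then-insert pattern applies to a pgvector table in PostgreSQL); the `chunks` table layout and `replace_document_chunks` are assumptions for illustration, and embeddings are omitted.

```python
import sqlite3

def replace_document_chunks(conn: sqlite3.Connection, doc_id: str,
                            version: str, chunks: list) -> None:
    """Delete every existing chunk for doc_id and insert the new ones atomically."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM chunks WHERE doc_id = ?", (doc_id,))
        conn.executemany(
            "INSERT INTO chunks (doc_id, version, position, text) VALUES (?, ?, ?, ?)",
            [(doc_id, version, i, text) for i, text in enumerate(chunks)],
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (doc_id TEXT, version TEXT, position INTEGER, text TEXT NOT NULL)"
)
# Index an old version with three chunks, then replace it with a shorter new version.
replace_document_chunks(conn, "guideline-1", "v1", ["old a", "old b", "old c"])
replace_document_chunks(conn, "guideline-1", "v2", ["new a", "new b"])
```

Because the delete and the inserts share one transaction, a failure partway through leaves the previous version fully intact instead of a partial mix of versions.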

Embedding model migration: When the embedding model is upgraded, all existing chunks must be re-embedded with the new model before the new model is used at query time. A phased migration maintains a shadow index during re-embedding and switches query traffic atomically.
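The shadow-index pattern can be illustrated with an in-memory registry in which queries go through a stable alias. This is a simplified sketch under stated assumptions: `IndexRegistry` is hypothetical, the two `embed_*` functions are toy placeholders for real embedding models, and a production vector store would provide its own alias or collection-switching mechanism.

```python
from typing import Callable

class IndexRegistry:
    """Maps a stable alias to whichever physical index currently serves queries."""

    def __init__(self) -> None:
        self.indexes: dict = {}   # index name -> {chunk_id: vector}
        self.aliases: dict = {}   # alias -> index name

    def build_shadow(self, name: str, chunks: dict, embed: Callable) -> None:
        """Re-embed every chunk into a new index that receives no query traffic yet."""
        self.indexes[name] = {cid: embed(text) for cid, text in chunks.items()}

    def promote(self, alias: str, name: str, expected_count: int) -> None:
        """Switch the alias only if the shadow index is complete."""
        if len(self.indexes[name]) != expected_count:
            raise RuntimeError(f"{name} is incomplete; refusing to switch traffic")
        self.aliases[alias] = name  # one assignment: readers see old or new, never a mix

    def live_index(self, alias: str) -> dict:
        return self.indexes[self.aliases[alias]]

# Toy stand-ins for two embedding model versions with different dimensions.
def embed_v1(text: str) -> list:
    return [float(len(text))]

def embed_v2(text: str) -> list:
    return [float(len(text)), float(text.count(" "))]
```

The query path must use the embedding model that matches the live index, so the model switch and the alias switch belong in the same deployment step.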

Pipeline observability: Silent failures are the primary risk in AI data pipelines. Monitor the document ingestion rate, the chunk validation rejection rate, the embedding API error rate, the index size (it should change only in step with source additions, updates, and deletions; an unexplained drop or jump is a warning sign), and query-time retrieval quality (via periodic golden query evaluation).
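Golden query evaluation can be as simple as a hit rate over a curated set of queries with known relevant chunks. The sketch below assumes a `retrieve` callable that returns ranked chunk IDs for a query; that callable, the function names, and the example threshold are hypothetical.

```python
from typing import Callable

def golden_query_hit_rate(golden: dict, retrieve: Callable, k: int = 5) -> float:
    """Fraction of golden queries whose top-k results include an expected chunk.

    `golden` maps each query string to the set of chunk IDs considered relevant.
    """
    if not golden:
        raise ValueError("golden set is empty")
    hits = 0
    for query, expected_ids in golden.items():
        top_k = retrieve(query)[:k]
        if expected_ids & set(top_k):
            hits += 1
    return hits / len(golden)

def quality_alert(hit_rate: float, threshold: float = 0.9) -> bool:
    """True when retrieval quality has dropped below the (illustrative) threshold."""
    return hit_rate < threshold
```

Running this after every pipeline run turns a silent degradation (for example, a guideline whose chunks were dropped) into a measurable drop in hit rate.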

Common Mistakes

1. Chunking at fixed token sizes without regard for semantic boundaries. Fixed-size chunking frequently splits a clinical recommendation in the middle, creating chunks that lack context. Always try section-boundary chunking first for structured clinical documents.

2. Not implementing orphan chunk cleanup. When a document is updated, the old chunks remain in the index. The vector store now contains both the old and new version of the guideline. The AI may retrieve either, producing inconsistent recommendations.

3. Not validating chunk metadata completeness. Chunks missing effective_date cannot be filtered by date in retrieval. Validate metadata completeness before indexing.

4. Embedding batch size too large. Large embedding batches take longer to process and are harder to retry on failure, because one failed request forces the whole batch to be resubmitted. A batch size of 10–20 documents is a reasonable starting point that balances throughput and error recovery.

5. No quality monitoring on pipeline output. Monitoring only for pipeline errors (exceptions, HTTP errors) misses quality failures: chunks that are too short, chunks with missing sections, chunks with incorrect metadata.
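For mistake 4, the following sketch shows small embedding batches with per-batch retry and exponential backoff. It assumes an `embed_batch` callable that takes a list of texts and returns one vector per text; that callable, the default batch size of 16, and the delay values are illustrative assumptions rather than any provider's API.

```python
import time
from typing import Callable, Iterator

def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_with_retry(texts: list, embed_batch: Callable, batch_size: int = 16,
                     max_attempts: int = 3, base_delay: float = 0.5,
                     sleep: Callable = time.sleep) -> list:
    """Embed texts in small batches, retrying only the batch that failed."""
    vectors = []
    for batch in batched(texts, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                vectors.extend(embed_batch(batch))
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up; the caller should log and alert
                sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    return vectors
```

With small batches, a transient failure costs one resubmission of a handful of documents rather than a restart of the whole run.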

Best Practices

Use section-boundary chunking for structured clinical documents; fixed-size with overlap for unstructured text
Implement change detection (content hash) to skip unchanged documents in incremental runs
Validate chunk quality (size, metadata completeness) before embedding
Delete orphan chunks when source documents are updated or removed
Monitor pipeline output quality, not just pipeline errors
Establish update cadences driven by clinical safety requirements (formulary: 5 business days; guidelines: quarterly)
Never mix patient-specific clinical notes into shared knowledge base indexes
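The validation practice in this list (size and metadata completeness checked before embedding) might look like the sketch below. The required field names follow the metadata mentioned in this chapter, while the character bounds and function names are hypothetical and should be tuned per document type.

```python
REQUIRED_METADATA = ("doc_id", "source", "version", "effective_date")
MIN_CHARS, MAX_CHARS = 50, 4000  # illustrative bounds, not recommended values

def validate_chunk(chunk: dict) -> list:
    """Return a list of problems; an empty list means the chunk may be indexed."""
    problems = []
    text = chunk.get("text", "")
    if len(text) < MIN_CHARS:
        problems.append("too_short")
    if len(text) > MAX_CHARS:
        problems.append("too_long")
    meta = chunk.get("metadata", {})
    problems.extend(f"missing_{name}" for name in REQUIRED_METADATA if not meta.get(name))
    return problems

def validate_batch(chunks: list) -> tuple:
    """Split chunks into accepted and rejected, and report the rejection rate."""
    accepted, rejected = [], []
    for chunk in chunks:
        problems = validate_chunk(chunk)
        if problems:
            rejected.append((chunk, problems))
        else:
            accepted.append(chunk)
    rate = len(rejected) / len(chunks) if chunks else 0.0
    return accepted, rejected, rate
```

The rejection rate is worth exporting as a metric: a sudden rise usually points to a change in source formatting or a broken extraction step rather than to bad documents.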

Trade-offs

Chunk size: Smaller chunks are more precise but lose context. Larger chunks retain context but reduce retrieval precision. The right chunk size is use-case and document-type specific; benchmark against the target use case.
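Benchmarking requires chunkers whose strategy and size are parameters. The sketch below shows both strategies discussed in this chapter, assuming that text extraction emits markdown-style `#` headings for structured documents and that whitespace-separated words approximate tokens (a real pipeline would use the embedding model's tokenizer). All function names and the default `size` and `overlap` values are hypothetical.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")  # assumption: headings look like "## Title"

def chunk_by_section(text: str) -> list:
    """One chunk per section, keeping each heading together with its body."""
    chunks, current = [], []
    for line in text.splitlines():
        if HEADING.match(line) and any(part.strip() for part in current):
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if any(part.strip() for part in current):
        chunks.append("\n".join(current).strip())
    return chunks

def chunk_fixed(text: str, size: int, overlap: int) -> list:
    """Fixed-size windows of words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    tokens = text.split()
    chunks = []
    for start in range(0, len(tokens), size - overlap):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

def chunk_document(text: str, structured: bool, size: int = 200, overlap: int = 20) -> list:
    """Prefer section boundaries for structured documents; otherwise use fixed windows."""
    if structured:
        sections = chunk_by_section(text)
        if len(sections) > 1:
            return sections
    return chunk_fixed(text, size, overlap)
```

Sweeping `size` and `overlap` over a few candidate values and scoring each configuration against a golden query set gives an empirical basis for the choice.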

Incremental vs. full re-index: Incremental updates are faster but accumulate complexity (orphan chunks, version mixing). Full re-indexing is slower but produces a clean index. Schedule full re-indexing quarterly for production clinical knowledge bases.

Interview Questions

Q: A clinical RAG system occasionally retrieves outdated treatment guidelines even though new guidelines have been indexed. What are the most likely causes and how would you fix them?

Category: System Design | Difficulty: Senior | Role: AI Architect

Answer Framework:

Three root causes in order of likelihood:

Orphan chunks: The old guideline chunks were not deleted when the new version was indexed. Both old and new exist in the vector store; depending on query similarity, either may be retrieved. Fix: implement atomic version replacement — delete all old chunks for a document ID before indexing new chunks, in a single transaction.

Metadata date filter not applied: The retrieval query does not filter by effective_date, so old chunks can rank higher than new chunks based on embedding similarity alone (old phrasing may match queries better if the terminology changed). Fix: apply an effective_date >= [cutoff] filter in retrieval; or version-tag chunks and filter by current version only.

Embedding similarity favoring older phrasing: If the guideline changed its recommended terminology (e.g., new dosing terminology), the new chunk may score lower on similarity to queries that use the old terminology. Fix: query expansion to include both old and new clinical terms, or a hybrid retrieval approach combining semantic and keyword search.
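The date and version filters from the second root cause can be sketched as post-retrieval filters over ranked candidates. This assumes each candidate is a dict carrying `doc_id` and an `effective_date` stored as a `datetime.date`; both function names are hypothetical, and in practice the filter is usually pushed down into the vector store query rather than applied afterwards.

```python
from datetime import date

def filter_by_effective_date(ranked: list, cutoff: date) -> list:
    """Drop chunks older than the cutoff or missing a date, preserving rank order."""
    return [c for c in ranked
            if c.get("effective_date") is not None and c["effective_date"] >= cutoff]

def current_version_only(ranked: list) -> list:
    """Keep only chunks from the newest effective_date seen for each doc_id.

    Limitation: this compares versions only among the retrieved candidates, so a
    store-side "current version" flag is more robust than this client-side check.
    """
    newest = {}
    for c in ranked:
        seen = newest.get(c["doc_id"])
        if seen is None or c["effective_date"] > seen:
            newest[c["doc_id"]] = c["effective_date"]
    return [c for c in ranked if c["effective_date"] == newest[c["doc_id"]]]
```

These filters are defense in depth: they hide orphan chunks from the model, but the orphans still need to be deleted from the index.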

Key Points to Hit:

Orphan chunks as the most common root cause
Atomic replacement pattern as the fix
Metadata date filtering as defense in depth
Embedding similarity as a secondary cause

Key Takeaways

AI data pipelines must handle document versioning, orphan chunk cleanup, and incremental updates — not just initial ingestion
Section-boundary chunking outperforms fixed-size chunking for structured clinical documents
Silent quality failures are the primary risk: pipeline monitoring must include output quality checks, not just error rates
Clinical knowledge base update cadences are patient safety requirements, not operational preferences
PHI in source documents requires treating the entire pipeline (and vector store) as a HIPAA data store
Embedding model migration requires re-indexing all chunks before switching query traffic
