Data Pipelines for AI
Executive Summary
AI systems are only as good as the data they are built on. The data pipeline is the infrastructure that transforms raw source data — clinical notes, policy documents, transactional records, structured databases — into the embeddings, fine-tuning datasets, and evaluation corpora that AI systems consume. At enterprise scale, data pipelines for AI are significantly more complex than traditional ETL pipelines: they must handle unstructured text, manage document versioning, coordinate embedding computation across heterogeneous sources, and maintain data quality in a domain where pipeline failures produce silent quality degradation rather than system errors. This chapter covers the architecture of AI data pipelines from ingestion through embedding and indexing, with production patterns for the healthcare context.
Learning Objectives
- Design an end-to-end AI data pipeline from source document ingestion through vector store indexing
- Select appropriate chunking strategies for different document types and use cases
- Implement pipeline orchestration with error handling, retry logic, and quality validation
- Design pipeline monitoring that detects silent failures (quality degradation) as well as hard failures
- Apply healthcare-specific pipeline requirements for PHI handling and document provenance
Business Problem
An AI system that retrieves from a stale or incomplete knowledge base produces worse outcomes than one that retrieves nothing — because the AI presents outdated information with apparent confidence. In clinical settings, a knowledge base that reflects guidelines from 18 months ago will produce clinical recommendations that no longer reflect current standard of care. The data pipeline is what keeps the knowledge base current, complete, and trustworthy.
Failures in AI data pipelines are silent in a way that traditional ETL failures are not: a missed document, an incorrect chunk boundary, or a stale embedding produces no error. It simply degrades the AI's output quality in ways that are difficult to detect without continuous evaluation.
Why This Technology Exists
The first RAG systems used manual, one-time document loading: engineers uploaded PDF files to a script that chunked and embedded them. This approach fails in production for three reasons: source documents change (guidelines are updated, formularies are revised), new documents must be added without full re-indexing, and the scale of document collections grows beyond manual management.
Purpose-built AI data pipelines address these requirements: scheduled ingestion that detects new and updated documents, incremental indexing that adds new chunks without re-embedding the entire collection, and quality validation that detects pipeline failures before they reach the production RAG system.
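As a minimal sketch of the change detection behind incremental indexing, the following compares content hashes from the previous run against the current source snapshot. The function names and the plan layout are illustrative assumptions, not part of any particular pipeline framework.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a stable SHA-256 digest of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_run(previous, current_docs):
    """Decide what an incremental run has to do.

    previous:     doc_id -> hash recorded at the last successful run
    current_docs: doc_id -> full document text fetched in this run
    """
    plan = {"new": [], "changed": [], "unchanged": [], "removed": []}
    for doc_id, text in current_docs.items():
        digest = content_hash(text)
        if doc_id not in previous:
            plan["new"].append(doc_id)          # chunk, embed, index
        elif previous[doc_id] != digest:
            plan["changed"].append(doc_id)      # replace old chunks
        else:
            plan["unchanged"].append(doc_id)    # skip: no re-embedding needed
    # Documents that disappeared from the source must have their chunks deleted.
    plan["removed"] = [doc_id for doc_id in previous if doc_id not in current_docs]
    return plan
```

Only the "new" and "changed" documents need embedding in a given run; "removed" documents feed the cleanup step rather than being ignored.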
Core Architecture
Enterprise Considerations
Pipeline scheduling: Clinical knowledge bases require update cadences driven by clinical safety requirements:
- Hospital formulary: within 5 business days of any formulary change (patient safety)
- Clinical guidelines: quarterly or on publication of major updates
- Hospital protocols: within 5 business days of committee approval
- Drug interaction database: continuous (licensed data feed)
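One way to make the business-day cadences above checkable is a small staleness test, sketched here under stated assumptions: the `SLA_BUSINESS_DAYS` table and function names are hypothetical, holidays are ignored, and the quarterly and continuous cadences are left out because they are schedule-driven rather than change-driven.

```python
from datetime import date, timedelta

# Hypothetical SLA table covering only the change-driven cadences listed above.
SLA_BUSINESS_DAYS = {
    "formulary": 5,
    "protocol": 5,
}


def business_days_between(start: date, end: date) -> int:
    """Count weekdays after `start` up to and including `end` (holidays ignored)."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            days += 1
    return days


def is_overdue(source_type: str, source_changed: date, today: date) -> bool:
    """True if a not-yet-indexed source change has exceeded its cadence."""
    limit = SLA_BUSINESS_DAYS[source_type]
    return business_days_between(source_changed, today) > limit
```

A scheduler could run this check daily against the list of detected but unindexed changes and page an owner when any entry is overdue.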
Orphan chunk cleanup: When a source document is updated or removed, the old chunks must be deleted from the vector store. Failure to delete creates a mix of current and outdated content that degrades retrieval quality and may produce incorrect clinical recommendations.
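The cleanup can be sketched as a delete-then-insert inside one transaction. This example uses an in-memory SQLite table as a stand-in for the vector store; the table layout and function names are assumptions, and a real vector database may offer different (or weaker) transactional guarantees.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory chunk table (a stand-in for a real vector store)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunks (doc_id TEXT NOT NULL, version TEXT NOT NULL, text TEXT NOT NULL)"
    )
    return conn


def replace_document(conn, doc_id, version, chunks):
    """Delete every old chunk for doc_id and insert the new ones atomically."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM chunks WHERE doc_id = ?", (doc_id,))
        conn.executemany(
            "INSERT INTO chunks (doc_id, version, text) VALUES (?, ?, ?)",
            [(doc_id, version, text) for text in chunks],
        )


def remove_document(conn, doc_id):
    """Delete all chunks of a document that no longer exists at the source."""
    with conn:
        conn.execute("DELETE FROM chunks WHERE doc_id = ?", (doc_id,))
```

Because the delete and the inserts share a transaction, a failure partway through leaves the previous version intact rather than an empty or mixed index.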
Embedding model migration: When the embedding model is upgraded, all existing chunks must be re-embedded with the new model before the new model is used at query time. A phased migration maintains a shadow index during re-embedding and switches query traffic atomically.
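A phased migration of this kind might look like the following sketch. `IndexRegistry` and its methods are hypothetical names, the indexes are plain dictionaries, and the embedding functions are supplied by the caller; the point is only the ordering: build the shadow index, verify it, then switch.

```python
class IndexRegistry:
    """Holds named indexes and a pointer to the one serving queries."""

    def __init__(self):
        self.indexes = {}    # index name -> {chunk_id: vector}
        self.embedders = {}  # index name -> embedding function that built it
        self.active = None   # name of the index serving query traffic

    def build(self, name, embed, chunks):
        """Embed every chunk into a new index; query traffic is not affected."""
        self.indexes[name] = {cid: embed(text) for cid, text in chunks.items()}
        self.embedders[name] = embed

    def promote(self, name, expected_count):
        """Switch query traffic only if the shadow index is complete."""
        if len(self.indexes[name]) != expected_count:
            raise RuntimeError(f"{name} is incomplete; keeping {self.active}")
        self.active = name  # one assignment: queries see old or new, never a mix

    def embed_query(self, text):
        """Queries are embedded with the model that built the active index."""
        return self.embedders[self.active](text)
```

Tying the query embedder to the active index is the design choice that matters here: vectors from different embedding models are not comparable, so the two must switch together.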
Pipeline observability: Silent failures are the primary risk in AI data pipelines. Monitor: document ingestion rate, chunk validation rejection rate, embedding API error rate, index size (it should track the source corpus: an unexplained drop suggests lost chunks, while growth without new source documents suggests orphan chunks), and query-time retrieval quality (via periodic golden query evaluation).
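A periodic golden query evaluation can be as simple as the sketch below. The golden set format, the `retrieve` callable, and the 0.9 alert threshold are all assumptions to be replaced with whatever the production system exposes.

```python
def golden_hit_rate(golden, retrieve, k=5):
    """Fraction of golden queries whose expected document appears in the top k.

    golden:   list of (query, expected_doc_id) pairs curated by reviewers
    retrieve: function(query, k) -> ranked list of doc_ids (system under test)
    """
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(
        1 for query, expected in golden if expected in retrieve(query, k)[:k]
    )
    return hits / len(golden)


def check_retrieval_quality(golden, retrieve, k=5, min_hit_rate=0.9):
    """Return the measured hit rate and whether it falls below the threshold."""
    rate = golden_hit_rate(golden, retrieve, k)
    return {"hit_rate": rate, "alert": rate < min_hit_rate}
```

Running this after every pipeline run turns a silent quality regression into an explicit alert, which exception-based monitoring alone would not produce.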
Common Mistakes
1. Chunking at fixed token sizes without regard for semantic boundaries. Fixed-size chunking frequently splits a clinical recommendation in the middle, creating chunks that lack context. Always try section-boundary chunking first for structured clinical documents.
2. Not implementing orphan chunk cleanup. When a document is updated, the old chunks remain in the index. The vector store now contains both the old and new version of the guideline. The AI may retrieve either, producing inconsistent recommendations.
3. Not validating chunk metadata completeness. Chunks missing effective_date cannot be filtered by date in retrieval. Validate metadata completeness before indexing.
4. Embedding batch size too large. Large embedding batches take longer to process and are harder to retry on failure, because one failed request forces the whole batch to be resubmitted. A batch size of 10–20 documents is a reasonable starting point for balancing throughput and error recovery.
5. No quality monitoring on pipeline output. Monitoring only for pipeline errors (exceptions, HTTP errors) misses quality failures: chunks that are too short, chunks with missing sections, chunks with incorrect metadata.
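To illustrate the first mistake, here is a hedged sketch of section-boundary chunking that falls back to fixed-size windows with overlap. It assumes headings appear as lines starting with `## ` and counts words rather than tokens; both are simplifications, and the function names are illustrative.

```python
import re

HEADING = re.compile(r"^## +(.+)$", re.MULTILINE)  # assumed heading convention


def fixed_size_chunks(words, size, overlap):
    """Sliding window over words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def chunk_document(text, max_words=200, overlap=20):
    """Split on section headings first; window only sections that are too long."""
    matches = list(HEADING.finditer(text))
    first = matches[0].start() if matches else len(text)
    sections = [(None, text[:first])]  # preamble, or the whole text if no headings
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((match.group(1).strip(), text[match.end():end]))
    return [
        {"section": title, "text": piece}
        for title, body in sections
        for piece in fixed_size_chunks(body.split(), max_words, overlap)
    ]
```

A section that fits within `max_words` stays whole, so a recommendation is only split when its section is genuinely too long, and even then the overlap carries some context across the boundary.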
Best Practices
- Use section-boundary chunking for structured clinical documents; fixed-size with overlap for unstructured text
- Implement change detection (content hash) to skip unchanged documents in incremental runs
- Validate chunk quality (size, metadata completeness) before embedding
- Delete orphan chunks when source documents are updated or removed
- Monitor pipeline output quality, not just pipeline errors
- Establish update cadences driven by clinical safety requirements (formulary: 5 business days; guidelines: quarterly)
- Never mix patient-specific clinical notes into shared knowledge base indexes
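The validation practice above could be sketched as a gate that runs before embedding. The required field names (other than effective_date, which the text mentions), the size limits, and the chunk dictionary layout are assumptions for illustration.

```python
# Assumed metadata schema; only effective_date is named in the text.
REQUIRED_METADATA = ("doc_id", "section", "effective_date")


def validate_chunk(chunk, min_chars=50, max_chars=4000):
    """Return a list of problems; an empty list means the chunk may be embedded."""
    problems = []
    text = chunk.get("text", "")
    if len(text.strip()) < min_chars:
        problems.append("too_short")
    if len(text) > max_chars:
        problems.append("too_long")
    metadata = chunk.get("metadata", {})
    for field in REQUIRED_METADATA:
        if not metadata.get(field):
            problems.append(f"missing_{field}")
    return problems


def split_valid(chunks):
    """Partition chunks into accepted and rejected, and report the rejection rate."""
    accepted, rejected = [], []
    for chunk in chunks:
        problems = validate_chunk(chunk)
        if problems:
            rejected.append((chunk, problems))
        else:
            accepted.append(chunk)
    rate = len(rejected) / len(chunks) if chunks else 0.0
    return accepted, rejected, rate
```

The rejection rate returned here is the kind of output-quality signal worth tracking over time: a sudden jump usually points at a parser or source format change rather than at individual bad documents.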
Trade-offs
Chunk size: Smaller chunks are more precise but lose context. Larger chunks retain context but reduce retrieval precision. The right chunk size is use-case and document-type specific; benchmark against the target use case.
Incremental vs. full re-index: Incremental updates are faster but accumulate complexity (orphan chunks, version mixing). Full re-indexing is slower but produces a clean index. Schedule full re-indexing quarterly for production clinical knowledge bases.
Interview Questions
Q: A clinical RAG system occasionally retrieves outdated treatment guidelines even though new guidelines have been indexed. What are the most likely causes and how would you fix them?
Category: System Design | Difficulty: Senior | Role: AI Architect
Answer Framework:
Three root causes in order of likelihood:
1. Orphan chunks: The old guideline chunks were not deleted when the new version was indexed. Both old and new exist in the vector store; depending on query similarity, either may be retrieved. Fix: implement atomic version replacement, deleting all old chunks for a document ID before indexing new chunks, in a single transaction.
2. Metadata date filter not applied: The retrieval query does not filter by effective_date, so old chunks can rank higher than new chunks based on embedding similarity alone (old phrasing may match queries better if the terminology changed). Fix: apply an effective_date >= [cutoff] filter in retrieval; or version-tag chunks and filter by current version only.
3. Embedding similarity favoring older phrasing: If the guideline changed its recommended terminology (e.g., new dosing terminology), the new chunk may score lower on similarity to queries that use the old terminology. Fix: query expansion to include both old and new clinical terms, or a hybrid retrieval approach combining semantic and keyword search.
Key Points to Hit:
- Orphan chunks as the most common root cause
- Atomic replacement pattern as the fix
- Metadata date filtering as defense in depth
- Embedding similarity as a secondary cause
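The date filter and hybrid scoring from the answer framework can be illustrated with a small sketch. It assumes effective_date is stored as an ISO-8601 string (so string comparison orders dates correctly), uses a naive keyword overlap in place of a real keyword index, and the `alpha` weight is an arbitrary illustrative value.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query, text):
    """Fraction of query terms that appear in the chunk text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def retrieve(query, query_vec, chunks, cutoff, k=3, alpha=0.7):
    """Filter by effective_date first, then rank by a blended score."""
    # Assumes ISO-8601 date strings, which compare correctly as plain strings.
    eligible = [c for c in chunks if c["effective_date"] >= cutoff]
    scored = [
        (
            alpha * cosine(query_vec, c["vector"])
            + (1 - alpha) * keyword_overlap(query, c["text"]),
            c,
        )
        for c in eligible
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]
```

Filtering before ranking is what makes the date check defense in depth: an orphaned chunk that survived cleanup is excluded regardless of how well its old phrasing matches the query.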
Key Takeaways
- AI data pipelines must handle document versioning, orphan chunk cleanup, and incremental updates — not just initial ingestion
- Section-boundary chunking outperforms fixed-size chunking for structured clinical documents
- Silent quality failures are the primary risk: pipeline monitoring must include output quality checks, not just error rates
- Clinical knowledge base update cadences are patient safety requirements, not operational preferences
- PHI in source documents requires treating the entire pipeline (and vector store) as a HIPAA data store
- Embedding model migration requires re-indexing all chunks before switching query traffic
Further Reading
- Vector Databases — The storage layer data pipelines populate
- Clinical RAG — Healthcare application of data pipeline patterns
- Orchestration and Workflow — Scheduling and orchestrating data pipelines at enterprise scale