Data Pipelines for AI

Core Architecture

graph TD
    subgraph "Source Layer"
        S1["Clinical Guidelines\n(ACC, AHA, USPSTF PDFs)"]
        S2["Hospital Formulary\n(Pharmacy database)"]
        S3["Clinical Protocols\n(SharePoint / intranet)"]
        S4["EHR Documents\n(via FHIR DocumentReference)"]
    end
    subgraph "Ingestion Layer"
        CRAWL["Document Crawler\nDetect new / updated docs"]
        EXTRACT["Text Extraction\nPDF, DOCX, HTML → plain text"]
        META["Metadata Extraction\nSource, date, version, author"]
        DEDUP["Deduplication\nContent hash; skip unchanged docs"]
    end
    subgraph "Processing Layer"
        CHUNK["Chunking Engine\nStrategy per document type"]
        CLEAN["Text Cleaning\nRemove headers, footers, artifacts"]
        EMB["Embedding Service\nBatch embedding computation"]
        VALIDATE["Quality Validation\nChunk size, embedding shape, metadata completeness"]
    end
    subgraph "Storage Layer"
        VECSTORE["Vector Store\n(Weaviate / pgvector)"]
        DOCSTORE["Document Store\n(Source text + metadata)"]
        AUDIT["Pipeline Audit Log\nIngestion events, quality scores"]
    end
    S1 & S2 & S3 & S4 --> CRAWL
    CRAWL --> EXTRACT --> META --> DEDUP
    DEDUP --> CHUNK --> CLEAN --> EMB --> VALIDATE
    VALIDATE --> VECSTORE & DOCSTORE & AUDIT

Common Mistakes

1. Chunking at fixed token sizes without regard for semantic boundaries. Fixed-size chunking frequently splits a clinical recommendation in the middle, creating chunks that lack context. Always try section-boundary chunking first for structured clinical documents.

2. Not implementing orphan chunk cleanup. When a document is updated and its old chunks are not explicitly deleted, they remain in the index alongside the new ones. The vector store then contains both the old and new versions of the guideline, and the AI may retrieve either, producing inconsistent recommendations.

3. Not validating chunk metadata completeness. Chunks missing effective_date cannot be filtered by date in retrieval. Validate metadata completeness before indexing.

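4. Making embedding batches too large. Large embedding batches take longer to process and are harder to retry on failure, because one failed request forces the whole batch to be recomputed. A batch size of 10–20 documents is a reasonable starting point that balances throughput against error recovery; tune it against the embedding service's limits.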

5. No quality monitoring on pipeline output. Monitoring only for pipeline errors (exceptions, HTTP errors) misses quality failures: chunks that are too short, chunks with missing sections, chunks with incorrect metadata.
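
The batching point in mistake 4 can be illustrated with a minimal sketch. It assumes a hypothetical `embed_batch` callable that stands in for whatever embedding service the pipeline uses; the function names, default batch size, and retry policy are illustrative, not part of any specific library.

```python
import time
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_with_retry(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],  # hypothetical service call
    batch_size: int = 16,
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> list[list[float]]:
    """Embed texts in small batches, retrying only the batch that failed."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                result = embed_batch(batch)
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last attempt
                time.sleep(backoff_seconds * attempt)  # linear backoff
        if len(result) != len(batch):
            raise ValueError("embedding count does not match batch size")
        vectors.extend(result)
    return vectors
```

Because a failure only re-sends the current batch, smaller batches bound the amount of work lost per error, at the cost of more round trips.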

Best Practices

Use section-boundary chunking for structured clinical documents; fixed-size with overlap for unstructured text
Implement change detection (content hash) to skip unchanged documents in incremental runs
Validate chunk quality (size, metadata completeness) before embedding
Delete orphan chunks when source documents are updated or removed
Monitor pipeline output quality, not just pipeline errors
Establish update cadences driven by clinical safety requirements (formulary: 5 business days; guidelines: quarterly)
Never mix patient-specific clinical notes into shared knowledge base indexes
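
The validation and monitoring practices above could look roughly like the following sketch. The `Chunk` type, the required metadata keys (`source`, `effective_date`, `version`), and the size thresholds are assumptions chosen for illustration; a real pipeline would derive them from its own schema and benchmarks.

```python
from dataclasses import dataclass, field

# Assumed required fields; adjust to the pipeline's actual metadata schema.
REQUIRED_METADATA = ("source", "effective_date", "version")


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_issues(chunk: Chunk, min_chars: int = 200, max_chars: int = 4000) -> list[str]:
    """Return a list of quality problems; an empty list means the chunk passes."""
    issues = []
    length = len(chunk.text.strip())
    if length < min_chars:
        issues.append("too_short")
    if length > max_chars:
        issues.append("too_long")
    for key in REQUIRED_METADATA:
        if not chunk.metadata.get(key):  # missing or empty value
            issues.append(f"missing_{key}")
    return issues


def quality_report(chunks: list[Chunk]) -> dict:
    """Aggregate issue counts so output quality can be tracked per run."""
    counts: dict[str, int] = {}
    rejected = 0
    for chunk in chunks:
        issues = chunk_issues(chunk)
        rejected += bool(issues)
        for issue in issues:
            counts[issue] = counts.get(issue, 0) + 1
    return {"total": len(chunks), "rejected": rejected, "issues": counts}
```

Writing the report to the audit log on every run makes a rising rejection rate visible even when the pipeline raises no exceptions.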

Trade-offs

Chunk size: Smaller chunks are more precise but lose context. Larger chunks retain context but reduce retrieval precision. The right chunk size is use-case and document-type specific; benchmark against the target use case.
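
To make the trade-off concrete, here is a minimal sketch of section-boundary chunking that falls back to fixed-size windows with overlap when a section exceeds a size limit. It assumes sections start with numbered headings such as `1. Screening` or `2.1 Dosing`; that heading pattern, the function names, and the character-based limits are illustrative assumptions, and a body line that begins with a number would be misread as a heading.

```python
import re

# Assumed heading style: a line starting with "1.", "2.1", "3.2.1 ", etc.
HEADING = re.compile(r"^\d+(?:\.\d+)*\.?[ \t]+\S.*$", re.MULTILINE)


def fixed_size_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Sliding character windows; each window repeats `overlap` chars of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def section_chunks(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split at headings; only oversized sections are cut into fixed-size windows."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    chunks = []
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        section = text[begin:end].strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            chunks.extend(fixed_size_chunks(section, max_chars, overlap))
    return chunks
```

Varying `max_chars` and `overlap` against a held-out set of representative queries is one way to run the benchmark the paragraph above recommends.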

Incremental vs. full re-index: Incremental updates are faster but accumulate complexity (orphan chunks, version mixing). Full re-indexing is slower but produces a clean index. Schedule full re-indexing quarterly for production clinical knowledge bases.
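
A minimal sketch of how an incremental run might decide what to touch, assuming the pipeline keeps a mapping of document ID to the content hash it last indexed. The function names and the plan structure are hypothetical; SHA-256 is one reasonable hash choice, not one the pipeline above is known to use.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of the extracted document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_run(
    source_docs: dict[str, str],      # doc_id -> current extracted text
    indexed_hashes: dict[str, str],   # doc_id -> hash stored at last indexing
) -> dict[str, list[str]]:
    """Classify documents so unchanged ones are skipped and stale ones are cleaned up."""
    plan: dict[str, list[str]] = {"new": [], "changed": [], "unchanged": [], "removed": []}
    for doc_id, text in source_docs.items():
        previous = indexed_hashes.get(doc_id)
        if previous is None:
            plan["new"].append(doc_id)
        elif previous != content_hash(text):
            plan["changed"].append(doc_id)  # old chunks must be replaced
        else:
            plan["unchanged"].append(doc_id)
    # Indexed documents that no longer exist at the source leave orphan chunks.
    plan["removed"] = [d for d in indexed_hashes if d not in source_docs]
    return {key: sorted(ids) for key, ids in plan.items()}
```

The `changed` and `removed` lists are exactly where incremental runs accumulate orphan chunks if deletion is skipped, which is the complexity a periodic full re-index clears out.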

Interview Questions

Q: A clinical RAG system occasionally retrieves outdated treatment guidelines even though new guidelines have been indexed. What are the most likely causes and how would you fix them?

Category: System Design
Difficulty: Senior
Role: AI Architect

Answer Framework:

Three root causes in order of likelihood:

Orphan chunks: The old guideline chunks were not deleted when the new version was indexed. Both old and new exist in the vector store; depending on query similarity, either may be retrieved. Fix: implement atomic version replacement — delete all old chunks for a document ID before indexing new chunks, in a single transaction.

Metadata date filter not applied: The retrieval query does not filter by effective_date, so old chunks can rank higher than new chunks based on embedding similarity alone (old phrasing may match queries better if the terminology changed). Fix: apply an effective_date >= [cutoff] filter in retrieval, or version-tag chunks and filter by current version only.

Embedding similarity favoring older phrasing: If the guideline changed its recommended terminology (e.g., new dosing terminology), the new chunk may score lower on similarity to queries that use the old terminology. Fix: query expansion to include both old and new clinical terms, or a hybrid retrieval approach combining semantic and keyword search.
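
The first two fixes can be sketched together. This example uses an in-memory SQLite table purely as a stand-in for the vector store, so that the transaction semantics are easy to see; the table layout and function names are assumptions, and a real vector store may offer different (or weaker) transactional guarantees.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """In-memory stand-in for the chunk index (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunks (doc_id TEXT, version TEXT, effective_date TEXT, text TEXT)"
    )
    return conn


def replace_document(conn, doc_id, version, effective_date, texts):
    """Delete every old chunk for doc_id and insert the new ones in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM chunks WHERE doc_id = ?", (doc_id,))
        conn.executemany(
            "INSERT INTO chunks VALUES (?, ?, ?, ?)",
            [(doc_id, version, effective_date, t) for t in texts],
        )


def candidate_chunks(conn, cutoff_date):
    """Defense in depth: only return chunks effective on or after cutoff_date."""
    rows = conn.execute(
        "SELECT text FROM chunks WHERE effective_date >= ? ORDER BY rowid",
        (cutoff_date,),  # ISO-8601 date strings compare correctly as text
    )
    return [row[0] for row in rows]
```

If the insert fails partway, the rollback restores the previous version's chunks, so readers never see an index with the old version deleted and the new one only partially written.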

Key Points to Hit:

Orphan chunks as the most common root cause
Atomic replacement pattern as the fix
Metadata date filtering as defense in depth
Embedding similarity as a secondary cause

Key Takeaways

AI data pipelines must handle document versioning, orphan chunk cleanup, and incremental updates — not just initial ingestion
Section-boundary chunking outperforms fixed-size chunking for structured clinical documents
Silent quality failures are the primary risk: pipeline monitoring must include output quality checks, not just error rates
Clinical knowledge base update cadences are patient safety requirements, not operational preferences
PHI in source documents requires treating the entire pipeline (and vector store) as a HIPAA data store
Embedding model migration requires re-indexing all chunks before switching query traffic
