Data Pipelines for AI

Core Architecture

Common Mistakes

1. Chunking at fixed token sizes without regard for semantic boundaries. Fixed-size chunking frequently splits a clinical recommendation in the middle, creating chunks that lack context. Always try section-boundary chunking first for structured clinical documents.

2. Not implementing orphan chunk cleanup. When a document is updated, the old chunks remain in the index. The vector store now contains both the old and new version of the guideline. The AI may retrieve either, producing inconsistent recommendations.

3. Not validating chunk metadata completeness. Chunks missing effective_date cannot be filtered by date in retrieval. Validate metadata completeness before indexing.

4. Embedding batch size too large. Large embedding batches take longer to process and are harder to retry on failure, because a single failed request forces the whole batch to be re-sent. A batch size of 10–20 documents is a reasonable starting point that balances throughput and error recovery.

5. No quality monitoring on pipeline output. Monitoring only for pipeline errors (exceptions, HTTP errors) misses quality failures: chunks that are too short, chunks with missing sections, chunks with incorrect metadata.
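Mistakes 3 and 5 can both be caught by a gate that inspects chunks rather than exceptions. The following is a minimal sketch, not a definitive implementation: the field names in REQUIRED_FIELDS, the MIN_CHARS threshold, the chunk dictionary layout, and the sample chunks are all assumptions made for illustration.

```python
REQUIRED_FIELDS = ("doc_id", "section", "effective_date")  # hypothetical schema
MIN_CHARS = 40  # hypothetical threshold; tune per document type

def chunk_problems(chunk):
    """Return a list of quality problems for one chunk (empty list means OK)."""
    problems = []
    if len(chunk.get("text", "").strip()) < MIN_CHARS:
        problems.append("too_short")
    meta = chunk.get("metadata", {})
    for field in REQUIRED_FIELDS:
        if meta.get(field) in (None, ""):
            problems.append(f"missing_{field}")
    return problems

def quality_report(chunks):
    """Split chunks into indexable and rejected, plus a reject rate to alert on."""
    ok, rejected = [], []
    for chunk in chunks:
        problems = chunk_problems(chunk)
        if problems:
            rejected.append((chunk, problems))
        else:
            ok.append(chunk)
    rate = len(rejected) / len(chunks) if chunks else 0.0
    return ok, rejected, rate

# Made-up sample chunks, for illustration only.
chunks = [
    {"text": "Start at the lowest dose and titrate weekly as tolerated.",
     "metadata": {"doc_id": "guideline-7", "section": "Dosing",
                  "effective_date": "2024-01-15"}},
    {"text": "See table.",
     "metadata": {"doc_id": "guideline-7", "section": "Dosing",
                  "effective_date": "2024-01-15"}},
    {"text": "Monitor renal function at baseline and after each dose change.",
     "metadata": {"doc_id": "guideline-7", "section": "Monitoring"}},
]
ok, rejected, rate = quality_report(chunks)
print(len(ok), [problems for _, problems in rejected], round(rate, 2))
```

Only the chunks in ok would go on to embedding. The reject rate is the number to chart and alert on, since a run can finish without a single exception and still reject most of its output.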

Best Practices

  • Use section-boundary chunking for structured clinical documents; fixed-size with overlap for unstructured text
  • Implement change detection (content hash) to skip unchanged documents in incremental runs
  • Validate chunk quality (size, metadata completeness) before embedding
  • Delete orphan chunks when source documents are updated or removed
  • Monitor pipeline output quality, not just pipeline errors
  • Establish update cadences driven by clinical safety requirements (formulary: 5 business days; guidelines: quarterly)
  • Never mix patient-specific clinical notes into shared knowledge base indexes
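The change-detection and orphan-deletion practices above can be planned together at the start of an incremental run. This is a sketch under stated assumptions: sources and indexed_hashes are hypothetical in-memory mappings standing in for the document source and for whatever state the pipeline persisted on its previous run.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_incremental_run(sources, indexed_hashes):
    """
    sources:        doc_id -> current source text
    indexed_hashes: doc_id -> hash recorded when the document was last indexed
    """
    unchanged, to_reindex = [], []
    for doc_id, text in sources.items():
        if indexed_hashes.get(doc_id) == content_hash(text):
            unchanged.append(doc_id)   # skip: nothing to do
        else:
            to_reindex.append(doc_id)  # new or modified: replace its chunks
    # Documents that disappeared from the source leave orphan chunks behind.
    removed = [doc_id for doc_id in indexed_hashes if doc_id not in sources]
    return unchanged, to_reindex, removed

indexed_hashes = {
    "formulary": content_hash("formulary v1"),
    "guideline-7": content_hash("guideline v1"),
    "guideline-9": content_hash("retired guideline"),
}
sources = {
    "formulary": "formulary v1",    # unchanged
    "guideline-7": "guideline v2",  # modified
    "guideline-12": "new guideline" # new
}
unchanged, to_reindex, removed = plan_incremental_run(sources, indexed_hashes)
print(unchanged, to_reindex, removed)
```

Chunks belonging to every document in to_reindex and removed need deleting from the index; only to_reindex documents are chunked and embedded again.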

Trade-offs

Chunk size: Smaller chunks are more precise but lose context. Larger chunks retain context but reduce retrieval precision. The right chunk size is use-case and document-type specific; benchmark against the target use case.
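To benchmark chunk sizes, it helps to have a chunker whose size limit is a parameter. The sketch below splits at section headings first and falls back to fixed-size word windows with overlap only when a section exceeds max_words. The heading pattern (numbered headings such as "1. Indications" or "2.1 Dosing"), the word-based size measure, and the sample text are assumptions; real documents will need their own boundary rules.

```python
import re

# Assumption: a section heading is a line like "1. Indications" or "2.1 Dosing".
HEADING = re.compile(r"^\d+(?:\.\d+)*\.?[ \t]+[A-Z].*$", re.MULTILINE)

def split_sections(text):
    """Split text at heading lines; each section keeps its own heading."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts = [0] + starts  # keep any preamble before the first heading
    bounds = zip(starts, starts[1:] + [len(text)])
    sections = (text[a:b].strip() for a, b in bounds)
    return [s for s in sections if s]

def window(words, size, overlap):
    """Fixed-size word windows; consecutive windows share `overlap` words."""
    step = size - overlap  # requires overlap < size
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]

def chunk(text, max_words=200, overlap=20):
    chunks = []
    for section in split_sections(text):
        words = section.split()
        if len(words) <= max_words:
            chunks.append(section)  # whole section fits in one chunk
        else:
            # Oversized section: only the first window carries the heading.
            chunks.extend(window(words, max_words, overlap))
    return chunks

# Made-up sample text, for illustration only.
guideline = (
    "1. Indications\nUse in adults with condition X.\n"
    "2. Dosing\n" + " ".join(f"w{i}" for i in range(10)) + "\n"
)
for piece in chunk(guideline, max_words=8, overlap=2):
    print(repr(piece))
```

Running the same corpus through chunk with several max_words values, then measuring retrieval quality on representative queries, is one way to do the benchmarking described above.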

Incremental vs. full re-index: Incremental updates are faster but accumulate complexity (orphan chunks, version mixing). Full re-indexing is slower but produces a clean index. Schedule full re-indexing quarterly for production clinical knowledge bases.

Interview Questions

Q: A clinical RAG system occasionally retrieves outdated treatment guidelines even though new guidelines have been indexed. What are the most likely causes and how would you fix them?

Category: System Design | Difficulty: Senior | Role: AI Architect

Answer Framework:

Three root causes in order of likelihood:

Orphan chunks: The old guideline chunks were not deleted when the new version was indexed. Both old and new exist in the vector store; depending on query similarity, either may be retrieved. Fix: implement atomic version replacement — delete all old chunks for a document ID before indexing new chunks, in a single transaction.
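The atomic replacement pattern can be sketched with SQLite standing in for the chunk store. This is an illustration under assumptions: the chunks table and the replace_document helper are hypothetical, and whether a given vector store offers an equivalent transactional guarantee needs to be verified for that store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks ("
    "doc_id TEXT NOT NULL, version INTEGER NOT NULL, "
    "seq INTEGER NOT NULL, text TEXT NOT NULL)"
)

def replace_document(conn, doc_id, version, texts):
    """Delete all chunks of doc_id and insert the new version in one transaction."""
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so readers never see a mix of old and new chunks.
    with conn:
        conn.execute("DELETE FROM chunks WHERE doc_id = ?", (doc_id,))
        conn.executemany(
            "INSERT INTO chunks (doc_id, version, seq, text) VALUES (?, ?, ?, ?)",
            [(doc_id, version, seq, text) for seq, text in enumerate(texts)],
        )

replace_document(conn, "guideline-7", 1, ["old dosing", "old monitoring"])
replace_document(conn, "guideline-7", 2, ["new dosing"])

# A failed replacement (None violates NOT NULL) must leave version 2 intact.
try:
    replace_document(conn, "guideline-7", 3, ["draft dosing", None])
except sqlite3.IntegrityError:
    pass

rows = conn.execute(
    "SELECT version, text FROM chunks WHERE doc_id = ? ORDER BY seq",
    ("guideline-7",),
).fetchall()
print(rows)
```

After the second call, no version 1 chunk remains to be retrieved. The failed third call shows why the transaction matters: without it, the delete would have gone through and the document would be missing or half-written.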

Metadata date filter not applied: The retrieval query does not filter by effective_date, so old chunks can rank higher than new chunks based on embedding similarity alone (old phrasing may match queries better if the terminology changed). Fix: apply an effective_date >= [cutoff] filter in retrieval, or version-tag chunks and filter by current version only.
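A date filter of this kind is applied before ranking, so an outdated chunk cannot win on similarity alone. The sketch below uses a toy in-memory list and hand-written two-dimensional vectors. The chunk layout and the retrieve function are assumptions, and a production system would push the filter into the vector store's own query rather than filter in application code.

```python
import math
from datetime import date

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, chunks, cutoff, k=3):
    """Keep chunks effective on or after cutoff, then rank by cosine similarity."""
    eligible = [
        c for c in chunks
        # Chunks with no effective_date are excluded rather than trusted.
        if c.get("effective_date") is not None and c["effective_date"] >= cutoff
    ]
    eligible.sort(key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return eligible[:k]

chunks = [
    {"id": "old", "effective_date": date(2019, 1, 1), "vector": [1.0, 0.0]},
    {"id": "new", "effective_date": date(2024, 1, 1), "vector": [0.8, 0.6]},
    {"id": "undated", "vector": [1.0, 0.0]},
]
query = [1.0, 0.0]  # phrased like the old guideline

unfiltered = retrieve(query, chunks, cutoff=date.min)
filtered = retrieve(query, chunks, cutoff=date(2023, 1, 1))
print([c["id"] for c in unfiltered], [c["id"] for c in filtered])
```

With no effective cutoff, the old chunk ranks first because it matches the query exactly. With the cutoff, only the current chunk is eligible.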

Embedding similarity favoring older phrasing: If the guideline changed its recommended terminology (e.g., new dosing terminology), the new chunk may score lower on similarity to queries that use the old terminology. Fix: query expansion to include both old and new clinical terms, or a hybrid retrieval approach combining semantic and keyword search.
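Query expansion can be as simple as issuing the query once per known term variant and merging the results. The sketch below covers only the expansion step. TERM_MAP is a hypothetical, hand-maintained mapping with one illustrative entry; a real deployment would more likely draw on a curated clinical vocabulary.

```python
# Hypothetical mapping from older terminology to current terminology.
TERM_MAP = {
    "acute renal failure": ["acute kidney injury"],
}

def expand_query(query, term_map):
    """Return the original query plus one variant per mapped replacement term."""
    variants = [query]
    lowered = query.lower()
    for old_term, new_terms in term_map.items():
        if old_term in lowered:
            for new_term in new_terms:
                variant = lowered.replace(old_term, new_term)
                if variant not in variants:
                    variants.append(variant)
    return variants

queries = expand_query("Dosing in acute renal failure", TERM_MAP)
print(queries)
```

Each variant would then be sent to retrieval and the result lists merged and de-duplicated by chunk ID, so chunks written in the new terminology get a chance to match.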

Key Points to Hit:

  • Orphan chunks as the most common root cause
  • Atomic replacement pattern as the fix
  • Metadata date filtering as defense in depth
  • Embedding similarity as a secondary cause

Key Takeaways

  • AI data pipelines must handle document versioning, orphan chunk cleanup, and incremental updates — not just initial ingestion
  • Section-boundary chunking outperforms fixed-size chunking for structured clinical documents
  • Silent quality failures are the primary risk: pipeline monitoring must include output quality checks, not just error rates
  • Clinical knowledge base update cadences are patient safety requirements, not operational preferences
  • PHI in source documents requires treating the entire pipeline (and vector store) as a HIPAA data store
  • Embedding model migration requires re-indexing all chunks before switching query traffic