Data Warehouse Integration for AI
Common Mistakes
1. Loading the entire dataset into a Pandas DataFrame. A population health dataset of 500,000 patients does not fit in a typical application server's memory as a DataFrame. Use server-side SQL aggregation, BigQuery streaming row iterators, or Snowflake result set iteration.
2. Using SELECT * on clinical warehouse tables. Clinical warehouse tables often contain dozens of columns, many of which the AI model does not need. Always SELECT only the columns required by the model; this reduces data transfer, query cost, and PHI exposure surface.
3. Not including modelversion in AI output tables. When the risk model is retrained (monthly for clinical models), scores from the new model are written alongside scores from the old model. Without a modelversion column, there is no way to distinguish them or audit which score drove a clinical decision.
Key Takeaways
- In-warehouse inference (Snowflake Cortex, BigQuery ML) is preferred over data extraction when data cannot leave the warehouse
- Row-level security must be applied to clinical data warehouse tables accessed by AI services
- All AI output tables must include model_version for reproducibility and audit
- Stream large datasets row-by-row rather than loading into DataFrames
- PHI in warehouse datasets requires the same HIPAA controls as PHI in transactional systems