Data Warehouse Integration for AI

Executive Summary

Data warehouses and lakehouse platforms — Snowflake, BigQuery, Databricks — are the primary stores for historical clinical and operational data that AI systems need for analytics AI use cases: population health risk scoring, quality measure reporting, cost analysis, and retrospective clinical research. Integrating AI with these platforms requires understanding both the data access patterns (batch extraction, in-warehouse inference, vector search extensions) and the governance requirements that apply when AI accesses large-scale patient datasets. This chapter covers the integration patterns for connecting AI systems to enterprise data warehouse infrastructure.

Learning Objectives

  • Design data extraction pipelines from Snowflake, BigQuery, and Databricks for AI processing
  • Apply in-warehouse AI inference using platform-native ML functions to avoid data movement
  • Implement row-level security and data masking for PHI in warehoused clinical datasets
  • Choose between data movement to AI and AI movement to data based on scale and compliance constraints

Business Problem

A Reference Healthcare Organization's data warehouse contains 10 years of de-identified clinical encounter data for hundreds of thousands of patients — the dataset that powers population health analysis, quality measure calculation, and clinical research. An AI system that can query this dataset to identify patients at risk of readmission, predict ED surge demand, or surface care gaps across the patient population provides value that real-time EHR integration cannot — because it operates across the entire patient population, not just the current encounter.

The integration challenge is accessing this data at AI-relevant scale: extracting 100,000 patient records for a population health model cannot use the same FHIR API patterns as real-time clinical decision support.

Enterprise Considerations

In-warehouse inference vs. data extraction: Moving 1M patient records out of the data warehouse for AI scoring is slow, expensive, and creates additional PHI exposure. Where possible, use in-warehouse AI capabilities (Snowflake Cortex, BigQuery ML, Databricks MLflow serving) to run inference inside the warehouse and land results in a warehouse table. The PHI never leaves the warehouse's access control boundary.

Row-level security for de-identified datasets: Even "de-identified" data in a clinical warehouse requires access controls. Snowflake Row Access Policies and BigQuery row-level security allow the AI platform to access only the specific cohort it is authorized to score. Never grant the AI service account unrestricted access to the entire clinical warehouse.

Model versioning and reproducibility: AI risk scores written to the data warehouse must include the model version that produced them. When a model is retrained, historical scores from the old model version must remain queryable for comparison. Use a model_version column on all AI output tables.

Common Mistakes

1. Loading the entire dataset into a Pandas DataFrame. A population health dataset of 500,000 patients does not fit in a typical application server's memory as a DataFrame. Use server-side SQL aggregation, BigQuery streaming row iterators, or Snowflake result set iteration.

2. Using SELECT * on clinical warehouse tables. Clinical warehouse tables often contain dozens of columns, many of which the AI model does not need. Always SELECT only the columns required by the model; this reduces data transfer, query cost, and PHI exposure surface.

3. Not including modelversion in AI output tables. When the risk model is retrained (monthly for clinical models), scores from the new model are written alongside scores from the old model. Without a modelversion column, there is no way to distinguish them or audit which score drove a clinical decision.

Key Takeaways

  • In-warehouse inference (Snowflake Cortex, BigQuery ML) is preferred over data extraction when data cannot leave the warehouse
  • Row-level security must be applied to clinical data warehouse tables accessed by AI services
  • All AI output tables must include model_version for reproducibility and audit
  • Stream large datasets row-by-row rather than loading into DataFrames
  • PHI in warehouse datasets requires the same HIPAA controls as PHI in transactional systems

Further Reading