Data Privacy Architecture for AI

Common Mistakes

1. Assuming Safe Harbor de-identification is sufficient for AI training. Safe Harbor removes the 18 enumerated identifier types but does not prevent a model from memorizing clinical content, nor does it prevent re-identification through rare combinations of conditions. Treat fine-tuning datasets as PHI-adjacent and apply expert determination.

2. Not marking synthetic data clearly in development environments. Unlabeled synthetic records may inadvertently be treated as real patient data in testing workflows.

3. Applying data minimization only at the application layer. Minimum necessary filtering must be applied before PHI reaches the LLM API, not after the LLM has processed it. Filtering the LLM output does nothing to reduce the PHI already exposed in the prompt.
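
The third mistake can be illustrated with a minimal Python sketch of minimum necessary filtering applied before a prompt is built. Everything named here is an assumption made for illustration: the task names, the field names, the ALLOWED_FIELDS mapping, and the helper functions are hypothetical, and the sample record is synthetic. A real allowlist would come from the organization's own minimum necessary policy.

```python
from typing import Any

# Hypothetical per-task allowlists: the only fields each task may send to the LLM.
# Task and field names are illustrative assumptions, not a standard schema.
ALLOWED_FIELDS: dict[str, frozenset[str]] = {
    "medication_summary": frozenset({"age_band", "active_medications", "allergies"}),
    "discharge_instructions": frozenset({"age_band", "diagnosis_summary", "follow_up_plan"}),
}


def minimize_record(record: dict[str, Any], task: str) -> dict[str, Any]:
    """Return only the fields allowlisted for `task`; fail closed on unknown tasks."""
    try:
        allowed = ALLOWED_FIELDS[task]
    except KeyError:
        raise ValueError(f"no allowlist defined for task {task!r}") from None
    return {key: value for key, value in record.items() if key in allowed}


def build_prompt(record: dict[str, Any], task: str) -> str:
    """Build the prompt text from the minimized record only."""
    minimized = minimize_record(record, task)
    lines = [f"{key}: {minimized[key]}" for key in sorted(minimized)]
    return f"Task: {task}\n" + "\n".join(lines)


# Synthetic record for illustration only; it contains no real patient data.
patient_record = {
    "name": "Jane Example",
    "mrn": "SYNTHETIC-0001",
    "age_band": "60-69",
    "active_medications": ["metformin", "lisinopril"],
    "allergies": ["penicillin"],
    "diagnosis_summary": "type 2 diabetes",
}

# The name, MRN, and diagnosis never enter the prompt for this task.
prompt = build_prompt(patient_record, "medication_summary")
print(prompt)
```

The allowlist is keyed by task, so a field is excluded unless a task explicitly needs it, and an unknown task raises an error rather than falling back to sending the full record. Free-text fields would still need separate handling, since an allowlisted note can itself contain identifiers.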

Key Takeaways

  • HIPAA Safe Harbor de-identification is the minimum for clinical AI training datasets — use expert determination for fine-tuning
  • LLMs can memorize training data even after de-identification; apply differential privacy for clinical fine-tuning
  • Synthetic data is the preferred approach for AI development and testing where PHI is not required
  • Data minimization must be applied before PHI reaches the LLM prompt — not after
  • PHI in AI pipelines must respect data residency boundaries specified in organizational policies and patient consents