Model Security

Executive Summary

Model security addresses threats to the AI model itself, both as a software artifact and as a knowledge store: exfiltration of the model's weights, extraction of training data through targeted queries, membership inference to determine whether specific individuals' data was used in training, and adversarial attacks that cause the model to produce systematically incorrect outputs. For organizations that self-host clinical AI models fine-tuned on clinical data, model security is not academic: a model that has memorized PHI from its training corpus and can be queried to reproduce it is, in effect, a patient data breach.

Learning Objectives

  • Identify the attack vectors for model weight exfiltration and apply appropriate access controls
  • Understand training data extraction and membership inference attacks and their implications for clinical AI
  • Apply adversarial robustness principles to clinical AI deployments
  • Design model registry governance that tracks model provenance and access

Business Problem

A Reference Healthcare Organization fine-tunes a clinical language model on de-identified clinical notes and deploys it on self-hosted GPU infrastructure. An adversary who can (1) steal the model weights, (2) query the model to extract training data, or (3) perform membership inference to determine which patients' data was in the training set has compromised patient privacy, regardless of whether the training data was formally de-identified. Model security addresses these risks.

Enterprise Considerations

Model access in multi-tenant environments: In Kubernetes-based AI deployments with multiple tenant teams, model weight files must not be accessible to tenant application containers. Models should be loaded by a dedicated inference service; tenants access inference via API only, never directly loading model weights.
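
The API-only rule above can be expressed as a small default-deny authorization check that records every decision. This is a sketch under stated assumptions: the role name, the action names, and the `ModelEntry` fields are hypothetical, and in a real Kubernetes deployment the same policy would also be enforced at the platform level (for example, by not mounting the weight volume into tenant pods) rather than in application code alone.

```python
import json
import logging
from dataclasses import dataclass

audit_log = logging.getLogger("model_access_audit")

# Hypothetical role name; map this to the real inference service account.
INFERENCE_SERVICE_ROLE = "inference-service"


@dataclass(frozen=True)
class ModelEntry:
    name: str
    classification: str  # e.g. "restricted" for models fine-tuned on clinical data


def authorize(principal: str, role: str, action: str, model: ModelEntry) -> bool:
    """Decide whether `principal` may perform `action`, and log the decision."""
    if action == "invoke_api":
        allowed = True  # tenants reach the model through the inference API
    elif action == "load_weights":
        # Only the dedicated inference service may read weight files.
        allowed = role == INFERENCE_SERVICE_ROLE
    else:
        allowed = False  # default deny for any action not listed above
    audit_log.info(
        json.dumps(
            {
                "principal": principal,
                "role": role,
                "action": action,
                "model": model.name,
                "classification": model.classification,
                "allowed": allowed,
            }
        )
    )
    return allowed


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    model = ModelEntry(name="clinical-notes-ft", classification="restricted")
    print(authorize("team-a-app", "tenant", "invoke_api", model))
    print(authorize("team-a-app", "tenant", "load_weights", model))
```

Logging denied requests as well as granted ones matters here: repeated denied `load_weights` attempts from a tenant are exactly the signal an exfiltration investigation would look for.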

Vendor model security: For organizations using managed LLM APIs (Anthropic, Azure OpenAI), the model weights are not directly accessible. The relevant model security concerns are the data the provider trained on, inference isolation between customers, and whether the provider trains on customer prompts (confirm the training-data opt-out in the BAA or enterprise agreement).

Key Takeaways

  • Models fine-tuned on clinical data can memorize and reproduce training data — conduct memorization audits before deployment
  • Membership inference allows adversaries to determine if specific patients' data was used in training, even without extracting content
  • Differential privacy (DP-SGD) provides a formal, mathematically provable bound on how much any single training record can influence the model, which limits both memorization and membership inference, at a model quality cost that must be evaluated
  • Encrypt model weights at rest using envelope encryption with KMS customer-managed keys
  • Models fine-tuned on real (even de-identified) clinical data should be classified as "restricted" with API-only access and full audit logging
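
To make the DP-SGD takeaway concrete, the core of one training step is: clip each example's gradient to a fixed L2 norm, sum the clipped gradients, add Gaussian noise scaled to that norm, and average. The sketch below shows only this aggregation step in plain Python. The function name and parameter names are illustrative assumptions, no privacy accounting (the epsilon budget) is computed, and a production deployment would more likely rely on a vetted library such as Opacus or TensorFlow Privacy than on hand-written code.

```python
import math
import random
from typing import List, Sequence


def dp_sgd_gradient(
    per_example_grads: Sequence[Sequence[float]],
    clip_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """Clip each example's gradient, sum, add Gaussian noise, and average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down gradients whose L2 norm exceeds clip_norm; leave others unchanged.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    # Per-coordinate Gaussian noise with standard deviation noise_multiplier * clip_norm.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]


# Two toy per-example gradients: the first has norm 5.0 and is clipped to norm 1.0,
# the second has norm 0.5 and passes through unchanged.
grads = [[3.0, 4.0], [0.3, 0.4]]
clipped_only = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
with_noise = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=random.Random(0))
print("clipped average:", clipped_only)
print("noisy average:  ", with_noise)
```

Clipping bounds how much any single patient record can move the model, and the noise masks what remains; together they are the source of both the privacy guarantee and the quality cost noted above, so `clip_norm` and `noise_multiplier` are the parameters to tune when evaluating that trade-off.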

Further Reading