Zero Trust Architecture for AI Systems

Executive Summary

Zero Trust security replaces the perimeter-based model ("trust everything inside the network") with continuous verification of every request regardless of network location. AI systems are a natural fit for Zero Trust because they call external LLM APIs, process PHI within internal networks, run on shared cloud infrastructure, and involve service-to-service communication across multiple components. This chapter applies Zero Trust principles to the enterprise AI architecture: every AI component authenticates explicitly, every data access is authorized to minimum necessary scope, and every action is logged and monitored.

Learning Objectives

  • Apply the three Zero Trust principles (verify explicitly, use least privilege, assume breach) to AI infrastructure
  • Design network segmentation for AI components that handles external LLM API calls without exposing the PHI data layer
  • Implement mTLS for AI service-to-service communication
  • Configure cloud-native Zero Trust controls (AWS Security Groups, Azure Private Link, GCP VPC Service Controls) for AI workloads

Why Zero Trust Matters for AI

Traditional perimeter security assumes that once a request is inside the corporate network, it can be trusted to access most internal resources. AI systems invalidate this model in two ways:

External API dependencies: LLM inference calls exit the corporate network to reach Anthropic, Azure, or Google APIs. Traditional perimeter security cannot inspect or control this traffic without intercepting TLS.

Lateral movement via agentic workflows: A compromised AI agent that has accumulated tool access can move laterally through internal systems — reading the EHR, writing to messaging systems, querying the data warehouse — without triggering traditional network-based detection, because all these calls originate from inside the network from a trusted service account.

Zero Trust addresses both: every LLM API call is authenticated and logged, and every tool call by an AI agent is authorized against a minimum-necessary policy.
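The second control, authorizing every agent tool call against a minimum-necessary policy, can be sketched as a default-deny allow-list check. This is a minimal illustration under stated assumptions: `TOOL_ACL`, `authorize_tool_call`, and the agent and tool identifiers are hypothetical names, not part of any real agent framework. A production system would more likely load the policy from a policy store and write to a tamper-evident audit log.

```python
import logging
from dataclasses import dataclass

# Educational example: not for clinical use.
# All agent names, tool names, and the ACL layout are hypothetical.

audit_log = logging.getLogger("ai.agent.audit")

# Minimum-necessary policy: each agent identity maps to the only tools it may call.
TOOL_ACL: dict[str, frozenset[str]] = {
    "discharge-summary-agent": frozenset({"ehr.read_encounter", "ehr.read_medications"}),
    "scheduling-agent": frozenset({"calendar.read", "messaging.send"}),
}


@dataclass(frozen=True)
class ToolCallDecision:
    agent_id: str
    tool: str
    allowed: bool
    reason: str


def authorize_tool_call(agent_id: str, tool: str) -> ToolCallDecision:
    """Default-deny check intended to run before every tool invocation."""
    allowed_tools = TOOL_ACL.get(agent_id)
    if allowed_tools is None:
        decision = ToolCallDecision(agent_id, tool, False, "unknown agent identity")
    elif tool not in allowed_tools:
        decision = ToolCallDecision(agent_id, tool, False, "tool not in agent ACL")
    else:
        decision = ToolCallDecision(agent_id, tool, True, "tool permitted by ACL")
    # Log allowed and denied calls alike so lateral-movement attempts are visible.
    audit_log.info(
        "agent=%s tool=%s allowed=%s reason=%s",
        decision.agent_id, decision.tool, decision.allowed, decision.reason,
    )
    return decision
```

With this policy, a discharge-summary agent that is tricked into calling `messaging.send` is denied and the attempt is logged, even though the call originates inside the network.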

Zero Trust Principles Applied to AI

python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Educational example — not for clinical use

class TrustLevel(Enum):
    NONE = 0
    AUTHENTICATED = 1      # Identity verified
    AUTHORIZED = 2         # Permission confirmed for this specific action
    VERIFIED = 3           # Identity + permission + request context validated


@dataclass
class ZeroTrustDecision:
    """Result of a Zero Trust policy evaluation for an AI request."""
    trust_level: TrustLevel
    allowed: bool
    component: str
    action: str
    reason: str
    requires_mfa: bool = False
    session_duration_minutes: Optional[int] = None


class ZeroTrustPolicyEngine:
    """
    Zero Trust policy evaluation for AI platform requests.
    
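    Every AI component request is evaluated against three policies:
    1. Verify identity (authentication)
    2. Verify authorization (minimum necessary access)
    3. Verify context (device health, network location, time of day)

    This sketch assumes identity was already verified upstream (for example
    by the identity provider) and checks device compliance, network location,
    and role-based authorization; a time-of-day check is not implemented.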
    
    Educational example — not for clinical use.
    """
    
    def evaluate_clinician_request(
        self,
        user_id: str,
        role: str,
        resource: str,
        action: str,
        device_compliance_status: str,
        network_location: str,    # "corporate_network" | "vpn" | "external"
    ) -> ZeroTrustDecision:
        """
        Evaluate a clinician's request to access the AI platform.
        Even internal requests require full authentication and authorization.
        """
        # 1. Device compliance required
        if device_compliance_status != "compliant":
            return ZeroTrustDecision(
                trust_level=TrustLevel.NONE,
                allowed=False,
                component="ai_gateway",
                action=action,
                reason="Device not compliant with MDM policy. Enroll device before accessing PHI-enabled AI features."
            )
        
        # 2. Network location check: PHI resources are blocked from external networks
        if network_location == "external" and resource.startswith("phi"):
            return ZeroTrustDecision(
                trust_level=TrustLevel.AUTHENTICATED,
                allowed=False,
                component="ai_gateway",
                action=action,
                reason="PHI-enabled AI features require a VPN or corporate network connection."
            )
        
        # 3. Role-based authorization
        allowed_resources = self._get_allowed_resources(role)
        if resource not in allowed_resources:
            return ZeroTrustDecision(
                trust_level=TrustLevel.AUTHENTICATED,
                allowed=False,
                component="ai_gateway",
                action=action,
                reason=f"Role '{role}' is not authorized for resource '{resource}'."
            )
        
        return ZeroTrustDecision(
            trust_level=TrustLevel.VERIFIED,
            allowed=True,
            component="ai_gateway",
            action=action,
            reason="Request verified",
            session_duration_minutes=60,
        )
    
    def _get_allowed_resources(self, role: str) -> list[str]:
        """Return allowed AI resources per clinical role."""
        ROLE_RESOURCE_MAP = {
            "attending_physician": [
                "phi_clinical_rag",
                "phi_discharge_summary",
                "phi_medication_review",
                "knowledge_base_query",
            ],
            "nurse": [
                "phi_clinical_rag",
                "knowledge_base_query",
            ],
            "pharmacist": [
                "phi_medication_review",
                "knowledge_base_query",
            ],
            "administrative": [
                "knowledge_base_query",  # Non-PHI queries only
            ],
        }
        return ROLE_RESOURCE_MAP.get(role, [])

Network Segmentation Architecture
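One way to make a segmentation design reviewable is to write the permitted flows between zones as an explicit allow-list and test it. The sketch below assumes the three zones used in the AWS example later in this chapter (gateway, AI processing, PHI data) plus a hypothetical `llm_egress` zone that stands for the controlled path to external LLM APIs. The zone names and the `ALLOWED_FLOWS` structure are illustrative assumptions, not a real deployment.

```python
# Educational example: not for clinical use.
# Zone names and flows are illustrative assumptions, not a real deployment.

# Directed flows that the segmentation policy permits (source -> destinations).
ALLOWED_FLOWS: dict[str, set[str]] = {
    "internet": {"ai_gateway"},
    "ai_gateway": {"ai_processing"},
    "ai_processing": {"phi_data", "llm_egress"},
    "phi_data": set(),      # the PHI data zone initiates no connections
    "llm_egress": set(),
}


def is_flow_allowed(source: str, destination: str) -> bool:
    """Default deny: a flow is allowed only if it is listed explicitly."""
    return destination in ALLOWED_FLOWS.get(source, set())


def zones_that_can_reach(target: str) -> set[str]:
    """Zones with a direct, permitted flow into the target zone."""
    return {src for src, dests in ALLOWED_FLOWS.items() if target in dests}
```

In this layout only the AI processing zone can open a connection to the PHI data zone, and the PHI data zone cannot open a connection toward the LLM egress path, so external LLM calls never originate from the data layer.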

mTLS for AI Service-to-Service Authentication

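Mutual TLS (mTLS) is the Zero Trust authentication mechanism for AI service-to-service communication. An API key is a bearer secret: anyone who copies it from a configuration file can replay it from anywhere. An mTLS certificate binds each connection to a specific service identity, because the client must prove possession of the matching private key during the handshake and the key itself is never transmitted. The private key still has to be protected, which is why short certificate lifetimes and automated rotation matter.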

python
import ssl
import httpx

# Educational example — not for clinical use

def create_mtls_client(
    client_cert_path: str,
    client_key_path: str,
    ca_cert_path: str,      # CA that issued the server's certificate
    service_url: str,
) -> httpx.Client:
    """
    Create an httpx client configured for mTLS.
    Used for RAG service → FHIR proxy communication where both
    the client and server must present valid certificates.
    
    Educational example — not for clinical use.
    """
    ssl_context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_cert_path)
    ssl_context.load_cert_chain(certfile=client_cert_path, keyfile=client_key_path)
    ssl_context.minimum_version = ssl.TLSVersion.TLSv1_3
    
    return httpx.Client(
        base_url=service_url,
        verify=ssl_context,
    )


# Certificate rotation automation (illustrative approach)
MTLS_CERT_MANAGEMENT = {
    "certificate_lifetime_days": 90,
    "rotation_trigger_days_before_expiry": 30,
    "certificate_authority": "Vault PKI (HashiCorp Vault internal CA)",
    "rotation_mechanism": "Automated via cert-manager (Kubernetes) or Vault agent",
    "components": [
        "ai-gateway",
        "rag-service",
        "fhir-proxy",
        "embedding-service",
        "async-ai-workers",
    ],
}
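The client above covers only one side of the handshake. A minimal sketch of the receiving side follows, assuming the server (for example the FHIR proxy) uses Python's standard `ssl` module directly. The function names, the allow-list, and the DNS name `rag-service.ai.internal` are hypothetical. The key point is that a valid client certificate only proves identity; the server still has to decide whether that identity is allowed to call it.

```python
import ssl
from typing import Optional

# Educational example: not for clinical use.
# The allow-list contents and DNS names below are hypothetical.

ALLOWED_CLIENT_IDENTITIES = frozenset({"rag-service.ai.internal"})


def create_mtls_server_context(
    server_cert_path: str,
    server_key_path: str,
    client_ca_path: str,    # CA that issues client certificates
) -> ssl.SSLContext:
    """Server-side TLS context that rejects clients without a valid certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH, cafile=client_ca_path)
    ctx.load_cert_chain(certfile=server_cert_path, keyfile=server_key_path)
    ctx.verify_mode = ssl.CERT_REQUIRED   # handshake fails if no client cert is presented
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx


def dns_identities(peer_cert: dict) -> set[str]:
    """Extract DNS subjectAltName entries from the dict returned by SSLSocket.getpeercert()."""
    return {value for kind, value in peer_cert.get("subjectAltName", ()) if kind == "DNS"}


def is_client_authorized(peer_cert: Optional[dict]) -> bool:
    """A valid certificate proves identity; the allow-list decides authorization."""
    if not peer_cert:
        return False
    return bool(dns_identities(peer_cert) & ALLOWED_CLIENT_IDENTITIES)
```

Separating `create_mtls_server_context` from `is_client_authorized` keeps authentication (the TLS handshake) distinct from authorization (the allow-list), which mirrors the verify-explicitly and least-privilege principles.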

Cloud-Native Zero Trust Controls

python
# AWS Security Group and VPC configuration for Zero Trust AI deployment
# Educational example — illustrative configuration patterns only

AWS_ZERO_TRUST_AI_CONFIG = {
    "vpc": {
        "description": "AI Platform VPC: private subnets have no direct route to an internet gateway",
        "subnets": {
            "ai_gateway_subnet": {
                "access": "internet-facing (via ALB only)",
                "security_group_ingress": ["443 from 0.0.0.0/0 via ALB"],
                "security_group_egress": ["443 to ai_processing_subnet_sg"],
            },
            "ai_processing_subnet": {
                "access": "private: no inbound internet access; outbound only via VPC endpoints or the inspected NAT path",
                "security_group_ingress": [
                    "443 from ai_gateway_subnet_sg",
                    "443 from ai_processing_subnet_sg (service mesh)",
                ],
                "security_group_egress": [
                    "443 to phi_data_subnet_sg",
                    "443 to VPC endpoint for Bedrock (no internet path)",
                    "443 to NAT Gateway for non-Bedrock LLM APIs (egress inspection)",
                ],
            },
            "phi_data_subnet": {
                "access": "private — no internet access",
                "security_group_ingress": ["443 from ai_processing_subnet_sg only"],
                "security_group_egress": ["None — data subnet has no egress"],
            },
        },
        "vpc_endpoints": [
            "com.amazonaws.us-east-1.bedrock-runtime",    # Private Bedrock access
            "com.amazonaws.us-east-1.secretsmanager",
            "com.amazonaws.us-east-1.logs",
        ],
    }
}

Enterprise Considerations

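Egress inspection for LLM API calls: A Zero Trust egress proxy that inspects outbound LLM API calls can enforce the policy that PHI is not sent to providers without a BAA. Because the traffic is TLS-encrypted, the inspection has to run where the prompt is visible in plaintext, either in an application-layer gateway that the AI services call or in a TLS-terminating forward proxy. Pattern matching on prompt payloads at that layer is a best-effort safety net, not a guarantee, for configurations where PHI must not leave the internal network.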
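A minimal sketch of the pattern-matching check described above. The three regular expressions, the hostnames, and the `BAA_COVERED_HOSTS` list are illustrative assumptions; real PHI detection needs far broader coverage and is usually delegated to a dedicated de-identification or DLP service.

```python
import re
from dataclasses import dataclass, field

# Educational example: not for clinical use.
# Patterns, hostnames, and the BAA list are illustrative assumptions.

PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:#\s]*\d{6,10}\b", re.IGNORECASE),
    "dob": re.compile(r"\bDOB[:\s]*\d{1,2}/\d{1,2}/\d{4}\b", re.IGNORECASE),
}

# Hypothetical hosts of LLM providers with a signed BAA.
BAA_COVERED_HOSTS = frozenset({"llm-with-baa.example.com"})


@dataclass
class EgressDecision:
    allowed: bool
    matched_patterns: list[str] = field(default_factory=list)


def inspect_llm_egress(destination_host: str, prompt: str) -> EgressDecision:
    """Block prompts that look like PHI unless the destination is BAA-covered."""
    matches = [name for name, pattern in PHI_PATTERNS.items() if pattern.search(prompt)]
    if matches and destination_host not in BAA_COVERED_HOSTS:
        return EgressDecision(allowed=False, matched_patterns=matches)
    return EgressDecision(allowed=True, matched_patterns=matches)
```

The decision records which patterns matched even when the call is allowed, so the audit log can show how often PHI-like content is sent to BAA-covered providers.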

Service mesh for AI microservices: In Kubernetes-based AI deployments, a service mesh (Istio, Linkerd) provides mTLS between all AI platform microservices without requiring each service to implement mTLS itself. A service mesh is the preferred implementation for Zero Trust service-to-service authentication at scale.

Common Mistakes

1. Implementing network perimeter security and calling it Zero Trust. Placing AI services behind a VPN or private subnet does not implement Zero Trust. Zero Trust requires identity verification on every request regardless of network location.

2. Not applying Zero Trust principles to AI agent tool calls. The AI agent that calls EHR APIs is itself a component that must authenticate (service account), be authorized (minimum necessary tool ACL), and log every action (audit log). Agents without Zero Trust controls become the most significant lateral movement risk in clinical AI.

3. No certificate rotation for mTLS. Certificates with long lifetimes or a manual rotation process tend not to be rotated in practice. Automate certificate rotation with cert-manager or Vault PKI; certificates should expire in 90 days or fewer.
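The rotation rule in the last item can be checked mechanically. The sketch below is a minimal illustration that mirrors the 90-day lifetime and 30-day rotation window used earlier in this chapter. The function name and status strings are hypothetical, and the validity dates are assumed to have been read from the certificate already (for example by cert-manager or a monitoring job).

```python
from datetime import datetime, timedelta, timezone

# Educational example: not for clinical use.
# Thresholds mirror the 90-day lifetime and 30-day rotation window in this chapter.

MAX_LIFETIME = timedelta(days=90)
ROTATE_BEFORE_EXPIRY = timedelta(days=30)


def rotation_status(not_before: datetime, not_after: datetime, now: datetime) -> str:
    """Classify a certificate as 'expired', 'lifetime_too_long', 'rotate_now', or 'ok'."""
    if now >= not_after:
        return "expired"
    if not_after - not_before > MAX_LIFETIME:
        return "lifetime_too_long"
    if not_after - now <= ROTATE_BEFORE_EXPIRY:
        return "rotate_now"
    return "ok"
```

Running a check like this against every component certificate on a schedule turns "certificates should be rotated" into an alert that fires before an outage, rather than after one.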

Key Takeaways

  • Zero Trust replaces network perimeter trust with identity-based, per-request authorization for every AI component
  • AI agentic workflows are the highest Zero Trust risk: an authenticated agent with multiple tools can move laterally without triggering network-based detection
  • mTLS is the Zero Trust authentication mechanism for service-to-service communication; it is stronger than static API keys because each connection proves possession of a private key bound to a service identity
  • Egress inspection on LLM API calls is the safety net for the policy that PHI must not be sent to providers without a BAA
  • Network segmentation should isolate the PHI data zone from the AI processing zone; the PHI data zone has no egress

Further Reading