AI Governance

Executive Summary

AI governance is the organizational infrastructure that determines whether an AI system can be trusted, audited, corrected, and held accountable — qualities that are non-negotiable in regulated industries and increasingly expected in all enterprise contexts. This chapter covers the full governance stack: responsible AI principles, model risk management, model cards, bias evaluation, audit logging, incident response, and the institutional structures that make governance operational rather than aspirational. In healthcare, where AI outputs can influence clinical decisions affecting patient safety, governance is not a compliance checkbox — it is the engineering requirement that separates deployable clinical AI from perpetual pilot status.

Learning Objectives

After reading this chapter, you will be able to:

Design an AI governance framework appropriate for a regulated enterprise environment
Produce model cards that meet enterprise and regulatory documentation standards
Identify the categories of AI risk that require formal assessment before deployment
Implement audit logging that satisfies HIPAA and Joint Commission requirements
Distinguish between AI governance as organizational practice versus AI governance as technical architecture

Business Problem

Enterprise AI systems fail governance review for predictable reasons, most of which are architectural rather than political. A clinical AI system reaches the governance committee without documentation of its training data, evaluation methodology, or failure modes. An administrative AI tool is deployed with no audit trail, making it impossible to investigate errors after the fact. A vendor AI product is integrated without a Business Associate Agreement, creating HIPAA liability. An AI system that performs well in aggregate produces systematically biased outputs for a specific patient subpopulation — a fact that is not discovered until a clinician notices a pattern.

These failures share a root cause: governance was treated as a downstream approval process rather than an upstream design constraint. In a correctly designed AI system, governance artifacts — model cards, bias evaluations, audit logs, incident procedures — are generated as part of the development process, not assembled after the fact to satisfy a review committee.

Why This Technology Exists

The formalization of AI governance in enterprise settings was accelerated by three converging forces: high-profile AI failures that produced measurable harm (biased hiring algorithms, discriminatory loan approval models, clinical AI systems that performed differently across demographic groups), regulatory pressure (EU AI Act, FDA guidance on AI/ML-based Software as a Medical Device, NIST AI Risk Management Framework), and the maturation of enterprise risk management practices to cover AI as a distinct risk category.

In healthcare, the FDA's 2021 action plan for AI/ML-based Software as a Medical Device set out the expectation that AI systems influencing clinical decisions have documented pre-market validation, post-market monitoring, and a change management protocol for model updates, requirements directly analogous to those governing other medical devices. Organizations that adopted AI governance frameworks before these regulatory developments found themselves ahead of compliance requirements rather than scrambling to retroactively document systems already in production.

Conceptual Explanation

AI governance operates at three layers that correspond to different organizational functions.

Policy Layer: The principles, standards, and requirements that define acceptable AI development and deployment. This includes responsible AI principles (fairness, transparency, accountability, safety), data governance policies that define acceptable data use for AI, and risk classification criteria that determine the level of review required for each AI system.

Process Layer: The workflows, approvals, and checkpoints that operationalize policy. This includes the model review board process (who approves clinical AI deployments?), the audit logging requirements (what events must be recorded and for how long?), the incident response procedure (what happens when an AI system produces a harmful output?), and the model change management process (how are model updates validated before reaching production?).

Technical Layer: The infrastructure that makes governance enforceable — audit logging pipelines, access control systems, model versioning registries, evaluation frameworks, and observability tools that surface governance-relevant signals (quality drift, demographic performance disparities, anomalous outputs).

Governance frameworks that operate only at the policy layer are aspirational; frameworks that extend to the technical layer are operational. The distinction is critical: the policy layer says "AI systems must be auditable"; the technical layer says "every inference call writes a structured log entry to an append-only audit store with a 7-year retention policy."

Core Architecture

An enterprise AI governance framework consists of five structural components:

AI Risk Classification: A tiering system that determines the level of governance scrutiny required for each AI use case. Clinical decision support systems in Tier 1 (highest risk) require full clinical validation, FDA classification review, and governance committee approval. Internal staff-facing tools in Tier 3 (lowest risk) require standard IT security review and operational monitoring. The risk tier determines which governance processes apply.
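One way to make the tiering enforceable is to express it as data that a deployment check can read. The following is a minimal sketch, not a prescribed implementation: the names `RiskTier`, `REQUIRED_REVIEWS`, `outstanding_reviews`, and `may_deploy` are hypothetical, the Tier 1 and Tier 3 review lists follow the paragraph above, and the Tier 2 list is an assumption added only so the mapping is complete.

```python
from enum import IntEnum


class RiskTier(IntEnum):
    TIER_1 = 1  # highest risk, e.g. clinical decision support
    TIER_2 = 2  # intermediate risk (its review list below is an assumption)
    TIER_3 = 3  # lower risk


# Tier 1 and Tier 3 entries follow the text; the Tier 2 entry is hypothetical.
REQUIRED_REVIEWS = {
    RiskTier.TIER_1: (
        "clinical_validation",
        "fda_classification_review",
        "governance_committee_approval",
    ),
    RiskTier.TIER_2: (
        "it_security_review",
        "operational_monitoring",
        "bias_evaluation",
    ),
    RiskTier.TIER_3: ("it_security_review", "operational_monitoring"),
}


def outstanding_reviews(tier: RiskTier, completed: set[str]) -> list[str]:
    """Return the reviews required for this tier that are not yet completed."""
    return [r for r in REQUIRED_REVIEWS[tier] if r not in completed]


def may_deploy(tier: RiskTier, completed: set[str]) -> bool:
    """A system may deploy only when nothing required by its tier is outstanding."""
    return not outstanding_reviews(tier, completed)
```

Keeping the mapping in one table means a change to tier requirements is a reviewed data change rather than logic scattered across deployment scripts.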

Model Documentation (Model Cards): Standardized documentation produced for every AI model entering production. Model cards record: intended use, out-of-scope uses, training data sources, evaluation methodology, performance metrics by subgroup, known limitations, and update procedures.
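The fields listed above map naturally onto a typed record whose completeness can be checked before review. This is a sketch under assumptions: the class name `ModelCard`, the extra `model_name` and `model_version` fields, and the rule that an empty field counts as missing are all illustrative choices, not a published schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ModelCard:
    # Identification fields are an assumption; the rest mirror the list in the text.
    model_name: str
    model_version: str
    intended_use: str = ""
    out_of_scope_uses: list[str] = field(default_factory=list)
    training_data_sources: list[str] = field(default_factory=list)
    evaluation_methodology: str = ""
    metrics_by_subgroup: dict[str, float] = field(default_factory=dict)
    known_limitations: list[str] = field(default_factory=list)
    update_procedure: str = ""

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_complete(self) -> bool:
        """True only when every section has content."""
        return not self.missing_sections()
```

A review workflow could refuse to schedule a model whose card reports any missing section, which turns "documentation required" into a checkable condition.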

Audit Logging Infrastructure: A technical system that records every AI inference — input, output, model version, timestamp, user identity, and confidence signal — in an immutable, queryable log. For clinical AI, audit logs must be retained for the duration required by applicable regulations and must be accessible for clinical investigation, legal discovery, and regulatory audit.
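One common way to approximate immutability in software is a hash chain, where each entry stores the hash of the entry before it so that later edits become detectable. The sketch below is illustrative only: it keeps entries in memory, the `AuditLog` class and its field names are hypothetical, and storing references to the input and output rather than their raw text is an assumption. A production system would also rely on write-once storage and access controls.

```python
import hashlib
import json
from datetime import datetime, timezone


class AuditLog:
    """In-memory, append-only log where each entry commits to the previous one."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self) -> None:
        self._entries: list[dict] = []

    @staticmethod
    def _digest(body: dict) -> str:
        # Canonical JSON so the same content always hashes the same way.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def append(self, *, input_ref: str, output_ref: str, model_version: str,
               user_id: str, confidence: float) -> dict:
        body = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "input_ref": input_ref,
            "output_ref": output_ref,
            "model_version": model_version,
            "user_id": user_id,
            "confidence": confidence,
            "prev_hash": self._entries[-1]["hash"] if self._entries else self.GENESIS,
        }
        entry = {**body, "hash": self._digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; False if any entry was altered or reordered."""
        prev = self.GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev_hash"] != prev or self._digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

The chain does not prevent tampering by itself; it makes tampering evident when `verify()` is run, which is the property an investigator or auditor needs.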

Bias and Fairness Evaluation: A structured process for evaluating AI performance across demographic subgroups before deployment and on an ongoing basis. Disparities in performance across subgroups — race, ethnicity, sex, age, payer status — must be documented and disclosed to clinical stakeholders before clinical deployment.
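At its simplest, a subgroup evaluation computes the same metric separately for each value of a demographic attribute and then reports the spread. The following sketch assumes accuracy as the metric and a list of dictionaries as the input format; the function names and the `correct` key are hypothetical. Real evaluations would typically add confidence intervals and minimum sample sizes per subgroup.

```python
from collections import defaultdict


def accuracy_by_subgroup(records: list[dict], attribute: str) -> dict[str, float]:
    """Accuracy per value of `attribute`; each record needs a boolean 'correct'."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for rec in records:
        group = rec[attribute]
        totals[group] += 1
        hits[group] += int(rec["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def largest_gap(per_group: dict[str, float]) -> float:
    """Difference between the best- and worst-performing subgroups."""
    return max(per_group.values()) - min(per_group.values())
```

Reporting the per-group dictionary alongside the gap keeps the disclosure concrete: stakeholders see which subgroup is affected, not only that a disparity exists.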

AI Incident Response: A defined procedure for responding to AI system failures, unexpected outputs, or harmful events. The procedure must specify: detection (how is the incident identified?), escalation (who is notified?), containment (can the system be immediately disabled?), investigation (how is root cause determined?), remediation (how is the model corrected?), and disclosure (who must be notified — patients, regulators, governing bodies?).
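The six steps named above can be tracked as an ordered checklist so that no stage is skipped and each has a recorded owner. This is a sketch under a simplifying assumption: stages are completed strictly in the listed order, whereas in practice some stages overlap. The `Stage` and `Incident` names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    DETECTION = 1
    ESCALATION = 2
    CONTAINMENT = 3
    INVESTIGATION = 4
    REMEDIATION = 5
    DISCLOSURE = 6


class Incident:
    """Tracks an incident through the stages in order, recording who completed each."""

    def __init__(self, summary: str) -> None:
        self.summary = summary
        self.history: list[tuple[Stage, str]] = []

    @property
    def next_stage(self) -> Optional[Stage]:
        stages = list(Stage)  # Enum iteration follows definition order
        done = len(self.history)
        return stages[done] if done < len(stages) else None

    def complete(self, stage: Stage, actor: str) -> None:
        """Mark `stage` as done by `actor`; reject out-of-order completion."""
        if stage is not self.next_stage:
            raise ValueError(f"expected {self.next_stage}, got {stage}")
        self.history.append((stage, actor))

    @property
    def closed(self) -> bool:
        return self.next_stage is None
```

The recorded history doubles as documentation for the post-incident review, since it shows who acted at each stage.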

Architecture Diagram

graph TD
    subgraph "Governance Policy Layer"
        RP["Responsible AI\nPrinciples"]
        RC["Risk Classification\nTier 1 / 2 / 3"]
        DP["Data Governance\nPolicy"]
    end
    subgraph "Governance Process Layer"
        MRB["Model Review\nBoard"]
        MC["Model Card\nProduction"]
        BFE["Bias & Fairness\nEvaluation"]
        IR["Incident Response\nProcedure"]
        CM["Change Management\nModel Update Protocol"]
    end
    subgraph "Technical Governance Layer"
        AL["Audit Logging\nInfrastructure"]
        MR["Model Registry\n+ Version Control"]
        EP["Evaluation\nPipeline"]
        OB["Observability\nDrift Detection"]
    end
    subgraph "Clinical AI (Tier 1)"
        CDS["Clinical Decision\nSupport"]
        DOC["Clinical\nDocumentation AI"]
    end
    RC -->|"Tier 1 → Full Review"| MRB
    MRB --> MC
    MRB --> BFE
    MC --> MR
    BFE --> EP
    CDS --> AL
    DOC --> AL
    AL --> OB
    OB -->|"Drift Alert"| IR
    IR --> CM
    CM --> MRB

Enterprise Considerations

Governance at Scale: A governance framework designed for 5 AI systems will not scale to 50. As the AI portfolio grows, governance processes must be automated: model card templates that auto-populate from the ML pipeline, evaluation pipelines that run bias assessments automatically as part of CI/CD, audit log anomaly detection that surfaces governance signals without manual review.

Cross-Vendor Governance: Healthcare organizations often use multiple LLM vendors for different use cases. The governance framework must be vendor-agnostic — audit logging, model cards, and bias evaluation must apply equally to a GPT-4-based administrative tool and a Claude-based clinical summarization system.

Regulatory Evolution: FDA's regulatory framework for AI/ML-based Software as a Medical Device is actively evolving. The governance framework must include a regulatory monitoring function — a designated team member responsible for tracking FDA guidance updates, EU AI Act implementation, and state-level AI regulations as they affect deployed clinical AI systems.

Third-Party AI Risk: When a clinical AI capability is embedded in a vendor product (EHR-integrated AI, radiology AI overlaid on PACS), the hospital bears the clinical and regulatory responsibility even when the model is not theirs. Third-party AI must be subject to the same model card requirements, bias evaluation, and incident response procedures as internally built AI.

Healthcare Example

Educational Example — Illustrative Workflow. Not intended for clinical decision making.

The Reference Healthcare Organization deploys a discharge summary generation AI system. The governance path from development to production:

Weeks 1–2 — Risk Classification: The AI Governance team classifies the system as Tier 1 (clinical AI that produces content reviewed by a physician before entering the medical record). Tier 1 requires: full model card, bias evaluation across demographic subgroups, HIPAA data flow review, clinical validation with 50+ physician reviews, and Model Review Board approval.

Weeks 3–6 — Model Card and Documentation: The engineering team completes the model card. Training data: 18 months of de-identified discharge summaries from Epic, filtered to encounters with attending physician sign-off. Known gap: underrepresentation of patients with complex social determinants of health. Bias evaluation: performance is measured across race, ethnicity, primary language, and payer status. A 6-point accuracy gap is identified between English-primary and Spanish-primary patients and is disclosed in the model card. Mitigation: the system adds a language flag to outputs for Spanish-primary patients until additional training data is acquired.

Weeks 7–10 — Clinical Validation: 8 hospitalists review 200 AI-generated discharge summaries against human-authored summaries in a blinded evaluation. 91% of AI summaries are rated as "acceptable without significant revision." The 9% requiring significant revision cluster around complex multi-comorbidity cases, which are documented as a known limitation.

Week 11 — Model Review Board: The board reviews the model card, bias evaluation results, clinical validation report, HIPAA data flow assessment, and BAA with the LLM vendor (Anthropic, BAA in place). The board approves production deployment with two conditions: low-confidence outputs (< 0.80) must be flagged for extended physician review; the system must present its own uncertainty to the reviewing clinician.
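The board's two conditions could be enforced at the point where a generated summary is queued for review. The sketch below is illustrative: the 0.80 threshold comes from the example, while `ReviewTask`, `route_for_review`, and the choice to display confidence as a whole-number percentage are assumptions.

```python
from dataclasses import dataclass

LOW_CONFIDENCE_THRESHOLD = 0.80  # from the board's condition in the example


@dataclass(frozen=True)
class ReviewTask:
    summary_id: str
    confidence: float
    review_level: str          # "standard" or "extended"
    displayed_confidence: str  # what the reviewing clinician sees


def route_for_review(summary_id: str, confidence: float) -> ReviewTask:
    """Assign extended review below the threshold and always surface confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    level = "extended" if confidence < LOW_CONFIDENCE_THRESHOLD else "standard"
    return ReviewTask(summary_id, confidence, level, f"{confidence:.0%}")
```

Because every task carries `displayed_confidence`, the second condition (presenting uncertainty to the clinician) holds for standard reviews as well as extended ones.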

Week 12 — Production Deployment: The system deploys with audit logging active. Every inference is logged with the encounter hash, clinician identity, model version, latency, and confidence score.

Ongoing: The governance team reviews audit logs monthly. After 30 days, 0.3% of outputs are flagged by clinicians as requiring substantial revision — below the 1% threshold that would trigger a governance review.
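The monthly check in this example reduces to a rate comparison against a fixed threshold. A minimal sketch follows; the function names are hypothetical, and treating a rate exactly equal to 1% as a trigger is an assumption, since the example only states that rates below 1% do not trigger review.

```python
REVIEW_THRESHOLD = 0.01  # the 1% threshold from the example


def flagged_rate(flagged: int, total: int) -> float:
    """Fraction of outputs that clinicians flagged for substantial revision."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= flagged <= total:
        raise ValueError("flagged must be between 0 and total")
    return flagged / total


def needs_governance_review(flagged: int, total: int,
                            threshold: float = REVIEW_THRESHOLD) -> bool:
    """True when the flagged rate reaches the threshold (boundary rule is an assumption)."""
    return flagged_rate(flagged, total) >= threshold
```

With 3 flagged outputs out of 1,000 the rate is 0.3% and no review is triggered; 20 out of 1,000 would trigger one.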

Common Mistakes

Governance as Approval Theater. The most common failure mode is a governance committee that approves AI systems based on slide decks rather than technical documentation. If the committee has not reviewed a completed model card, a bias evaluation report, and a HIPAA data flow assessment, the approval is not meaningful governance — it is approval theater that creates the appearance of oversight without the substance.

Audit Logs That Contain PHI. Writing patient names, dates of birth, or other PHI to audit logs that are accessible to non-clinical systems exposes PHI beyond what those systems need and creates HIPAA liability. Audit logs must contain hashed or de-identified identifiers, never raw PHI. The audit log design must be reviewed by legal before the first inference.
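A keyed hash is one way to keep raw identifiers out of the log while still letting authorized investigators link entries for the same encounter. The sketch below uses HMAC-SHA256 from Python's standard library; the function names, the record layout, and the decision to store the clinician identifier unhashed are assumptions. A plain unkeyed hash of a short identifier can be reversed by trying every possible value, which is why a secret key is used here. Whether the result counts as de-identified is a question for legal and compliance review.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Keyed hash of an identifier, so the log never stores the raw value."""
    if not secret_key:
        raise ValueError("secret_key must not be empty")
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def build_log_record(encounter_id: str, clinician_id: str, model_version: str,
                     secret_key: bytes) -> dict:
    """Assemble a log record that carries a hash in place of the encounter ID."""
    return {
        "encounter_hash": pseudonymize(encounter_id, secret_key),
        "clinician_id": clinician_id,  # workforce identifier, not patient data
        "model_version": model_version,
    }
```

The key would need to live in a secrets manager with restricted access, since anyone holding it can recompute hashes for known identifiers.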

No Subgroup Analysis. An AI system that achieves 90% overall accuracy but 75% accuracy for Black patients and 78% accuracy for patients with limited English proficiency has a bias problem that the aggregate metric conceals. Subgroup analysis is non-negotiable for clinical AI.

Governance Without Teeth. A governance framework that lacks enforcement mechanisms — the authority to block a deployment, require a rollback, or terminate a vendor contract — is advisory at best. The governance committee must have explicit authority over production deployments.

Best Practices

Establish risk classification criteria before the first AI use case, so every subsequent system is evaluated consistently
Treat model cards as living documents that are updated with every model version change
Automate bias evaluation as part of the CI/CD pipeline so it runs every time a model is retrained
Log AI inferences at the infrastructure layer (AI gateway), not the application layer, so logging cannot be bypassed by individual applications
Require that incident response procedures be tested annually in a simulated governance exercise
Make model cards accessible to clinical stakeholders in plain language, not just technical documentation
Require a BAA and data residency confirmation before any PHI-containing data flows to an external AI service
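The gateway-level logging practice in the list above can be illustrated with a thin wrapper that every application must call instead of calling the model directly. This is a sketch, not a reference design: `AIGateway`, its constructor parameters, and the record fields are hypothetical, and the model is represented by any callable that maps a prompt to a string.

```python
import time
from typing import Callable


class AIGateway:
    """Single entry point for model calls; every call is logged here."""

    def __init__(self, model_fn: Callable[[str], str], model_version: str,
                 sink: Callable[[dict], None]) -> None:
        self._model_fn = model_fn
        self._model_version = model_version
        self._sink = sink  # e.g. a function that writes to the audit store

    def infer(self, prompt: str, user_id: str) -> str:
        started = time.monotonic()
        status = "error"
        try:
            output = self._model_fn(prompt)
            status = "ok"
            return output
        finally:
            # Runs on success and on failure, so no call goes unrecorded.
            self._sink({
                "user_id": user_id,
                "model_version": self._model_version,
                "status": status,
                "latency_ms": round((time.monotonic() - started) * 1000, 3),
            })
```

Because the logging sits in `finally`, failed calls are recorded as well as successful ones, and applications have no code path that reaches the model without producing a log record.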

Alternatives

NIST AI Risk Management Framework (AI RMF). A voluntary framework published by NIST that provides structured guidance on AI risk identification, measurement, management, and governance. Suitable as a foundation for healthcare AI governance; complements rather than replaces healthcare-specific requirements.

EU AI Act Compliance Framework. For organizations with EU operations or patients, the EU AI Act classifies many clinical AI systems as "high-risk AI systems" subject to mandatory conformity assessment, technical documentation, and post-market monitoring. The EU framework is more prescriptive than US guidance.

ISO/IEC 42001 (AI Management Systems). An international standard for AI management systems, analogous to ISO 27001 for information security. Provides a certifiable framework that may satisfy regulatory and contractual requirements.

Trade-offs

Dimension	Light Governance	Rigorous Governance
Speed to deployment	Faster	Slower (weeks–months)
Regulatory risk	High	Managed
Clinical trust	Lower	Higher
Operational overhead	Low	High
Incident response capability	Reactive	Proactive
Audit readiness	Minimal	Comprehensive
Scalability	Does not scale	Scales with automation

Interview Questions

Q: Design an AI governance framework for a hospital deploying its first clinical AI system.

Category: System Design | Difficulty: Senior | Role: AI Architect

Answer Framework:

Begin with a risk classification system. Clinical AI that influences direct patient care decisions — documentation, alerts, diagnosis support — is Tier 1 and requires the most rigorous governance. Administrative AI — scheduling, coding, prior authorization — is Tier 2 with reduced requirements. Internal tools — staff-facing search, knowledge retrieval — are Tier 3 with standard IT governance.

For Tier 1, the framework requires: a model card completed by the engineering team before governance review, bias evaluation across the patient demographics served, HIPAA data flow documentation including vendor BAA status, clinical validation by domain-expert physicians, and formal Model Review Board approval with documented minutes.

At the technical layer: audit logging for every inference (no raw PHI in logs), model version registry with immutable artifacts, anomaly detection on output quality, and a documented rollback procedure that can execute in under 15 minutes.

Key Points to Hit:

Risk classification determines which processes apply
Technical enforcement (audit logging, model registry) makes governance operational
Bias evaluation is mandatory, not optional
Clinical validation by qualified physicians is required for Tier 1
Governance structure must have blocking authority, not just advisory authority

Q: An AI system in your hospital is producing systematically different outputs for patients of different racial groups. You discover this three months after production deployment. What do you do?

Category: Behavioral / System Design | Difficulty: Principal | Role: AI Architect

Answer Framework:

This is an active safety issue, not a post-hoc optimization. The immediate response is: contain first, investigate second, remediate third, disclose fourth.

Contain: Determine within the first hour whether the disparity is in a domain that could affect patient safety. If yes, immediately route affected cases to human review and present outputs only as drafts requiring physician review. If the bias is severe enough to constitute a patient safety risk, disable the system until remediated.

Investigate: Pull audit logs for the affected period. Segment outputs by the demographic variable showing disparity. Determine the performance gap magnitude, the number of patients affected, and whether any adverse events are potentially attributable to biased outputs.

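Remediate: Retrain with augmented data, implement post-processing corrections, or restrict the system's use to the populations where it performs equitably. Whichever option is chosen, rerun the subgroup evaluation to confirm the gap has closed before restoring full use.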

Disclose: Notify the governance committee, legal, and compliance. Evaluate whether any clinical events during the affected period require root cause analysis. Consult legal on whether affected patients or regulators must be notified.

Key Points to Hit:

Safety first — contain before investigating
Audit logs are only useful if they exist and are queryable; this is why audit infrastructure matters
Disclosure to governance is not optional
The correct response is never to quietly fix it without documentation

Key Takeaways

AI governance must be designed as an engineering constraint from the start, not assembled retrospectively for review committees
Risk classification determines governance intensity; not all AI systems require the same scrutiny
Model cards are the foundational governance artifact: one per AI system, updated with every version change
Bias evaluation across demographic subgroups is non-negotiable for clinical AI
Audit logging must be implemented at the infrastructure layer with no raw PHI in the log
The Model Review Board must have blocking authority over production deployments to be meaningful
In healthcare, AI governance intersects directly with HIPAA, FDA regulatory requirements, and clinical liability
