AI Cost Management
Executive Summary
The economics of LLM-based AI systems at enterprise scale differ fundamentally from traditional software infrastructure costs: expenses scale with usage volume and input-output token length, are highly sensitive to prompt design decisions, and can exhibit unpredictable spikes when workload patterns change. A healthcare organization with 200 daily discharge summary AI requests may operate within budget, but the same cost structure applied to ambient clinical documentation across 500 clinicians or to prior authorization processing at scale produces dramatically different cost profiles. This chapter provides a rigorous framework for modeling, controlling, and optimizing AI costs across token economics, prompt caching, model tier selection, and cost attribution, all essential disciplines for operating AI systems sustainably at clinical scale.
Learning Objectives
After reading this chapter, you will be able to:
- Model the total cost of an LLM-based clinical AI use case at production scale
- Apply prompt caching to reduce token costs for high-frequency clinical AI workflows
- Select the appropriate model tier for each use case using a cost-quality optimization framework
- Implement cost attribution that allocates AI spend to clinical departments and use cases
- Identify the cost efficiency patterns that reduce LLM spending by 40–70% without quality degradation
Business Problem
Healthcare AI programs are failing financial viability reviews not because the AI does not work, but because the cost model was not built until after development was complete. A discharge summary AI that costs $0.18 per summary sounds reasonable until the finance team calculates that 300 daily discharges × 365 days × $0.18 = $19,710 per year per hospital. Across the network's 12 hospitals, the resulting $236,520 annual AI cost requires a formal ROI justification that was never prepared.
More commonly, organizations discover that initial cost estimates based on average inputs dramatically understate actual costs when applied to clinical data, which is verbose, context-rich, and requires long system prompts to constrain model behavior appropriately. A discharge summary workflow that sends 4,000 tokens per request at $15 per million input tokens costs $60 per 1,000 requests, not the $10 assumed from a short demo prompt.
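The arithmetic in the two paragraphs above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the inputs are the figures quoted in the text, not current vendor prices.

```python
def annual_cost(requests_per_day: int, usd_per_request: float,
                sites: int = 1, days_per_year: int = 365) -> float:
    """Annual spend in USD for a fixed per-request cost."""
    return requests_per_day * days_per_year * usd_per_request * sites


def input_cost_per_1k_requests(tokens_per_request: int,
                               usd_per_million_tokens: float) -> float:
    """Input-token cost in USD for 1,000 requests of a given size."""
    return tokens_per_request * 1_000 * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    # Figures from the text: 300 discharges/day at $0.18 each, 12 hospitals.
    print(f"One hospital:   ${annual_cost(300, 0.18):,.0f}")
    print(f"Network of 12:  ${annual_cost(300, 0.18, sites=12):,.0f}")
    # 4,000 input tokens per request at $15 per million input tokens.
    print(f"Per 1k requests: ${input_cost_per_1k_requests(4_000, 15.0):,.2f}")
```

Running the model with real prompt sizes from development, rather than demo prompts, is what exposes the gap between the $10 and $60 figures.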
Cost overruns also emerge from misuse patterns that are difficult to govern without cost attribution infrastructure: a single high-volume workflow running without token limits, a pilot use case that was never scaled down after the pilot ended, or an AI feature enabled for all users when only 20% need it.
Why This Technology Exists
Token-based pricing for LLM APIs emerged from the underlying economics of LLM inference: processing more tokens requires more compute, and longer sequences require more memory bandwidth during attention computation. Pricing proportional to token usage aligns the vendor's cost to serve with the organization's cost to use.
Prompt caching was introduced by LLM vendors (Anthropic, OpenAI, Google) to address a specific inefficiency: enterprise AI use cases typically have long, stable system prompts (clinical guidelines, documentation standards, persona definitions) that are sent unchanged with every inference request. Re-computing the key-value cache for identical prompt prefixes on every request is computationally wasteful. Prompt caching allows the vendor to cache the computed representation of a stable prompt prefix and charge a reduced rate for cache hits, typically 80–90% less than the full input token price.
Model tier selection, the ability to route different use cases to different-capability (and different-cost) models, emerged as LLM providers developed differentiated model families. Not every clinical AI task requires the most capable model. Cheaper, faster models are sufficient for many tasks, and the cost difference between tiers is typically 10–20×.
Conceptual Explanation
The cost of an LLM-based AI system consists of five components, each with different optimization levers:
Input Token Cost: The cost of sending context to the model: system prompt, few-shot examples, retrieved RAG context, and user input. Input token cost scales linearly with token count and is the primary cost driver for RAG-heavy clinical workflows. Optimization lever: context compression, prompt caching, selective context retrieval.
Output Token Cost: The cost of generating the model's response. Output tokens are typically 3–5× more expensive per token than input tokens (reflecting the sequential nature of generation versus parallel prefill processing). Optimization lever: constrain output format, use structured outputs that terminate early, avoid open-ended generation where summarization suffices.
Embedding Cost: The cost of generating vector embeddings for semantic search. Embedding is cheap compared to LLM inference but scales with the volume of documents being embedded. Optimization lever: cache embeddings for stable content (clinical guidelines, formularies), batch embedding operations.
Infrastructure Cost: The fixed and variable costs of the systems that support LLM inference: vector databases, orchestration layers, prompt registries, evaluation pipelines, monitoring infrastructure. These are often underestimated in cost models that focus only on API spend.
Operational Cost: Engineering time for prompt maintenance, evaluation, model version updates, and incident response. For clinical AI, add clinical validation time for each significant model or prompt update. These costs are invisible in API billing but are real and significant.
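The five components above can be collected into one structure so that API spend is never reported in isolation. The following is a sketch only: the class name and the dollar figures in the example are hypothetical placeholders, not data from the Reference Healthcare Organization.

```python
from dataclasses import dataclass, fields


@dataclass
class MonthlyCost:
    """Monthly cost of one use case in USD, split into the five components."""
    input_tokens: float
    output_tokens: float
    embedding: float
    infrastructure: float
    operational: float

    def total(self) -> float:
        return sum(getattr(self, f.name) for f in fields(self))

    def api_share(self) -> float:
        """Fraction of total cost that appears on the vendor's API bill."""
        total = self.total()
        if total == 0:
            return 0.0
        return (self.input_tokens + self.output_tokens + self.embedding) / total


if __name__ == "__main__":
    # Hypothetical placeholder figures, chosen only to illustrate the split.
    example = MonthlyCost(input_tokens=3_000, output_tokens=1_500,
                          embedding=100, infrastructure=2_000,
                          operational=5_000)
    print(f"Total: ${example.total():,.0f}")
    print(f"Share visible in API billing: {example.api_share():.0%}")
```

A low `api_share` is the signal that a cost model built from API invoices alone understates the true cost of the use case.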
Core Architecture
The cost optimization architecture for clinical AI at the Reference Healthcare Organization implements three primary mechanisms:
Prompt Cache Architecture: Long system prompts that are stable across all requests (clinical documentation standards, HMS persona definition, disclaimers) are structured to appear at the beginning of every request message, satisfying the cache prefix requirement. The cache hit rate for these prompts is measured as a cost efficiency KPI.
Model Routing Layer: An AI gateway component that classifies incoming requests by complexity and routes them to the appropriate model tier. Simple extraction tasks route to fast, cheap models. Complex reasoning tasks, clinical summarization, and structured output generation route to higher-capability models.
Cost Attribution Pipeline: Every AI inference is tagged with a cost center identifier (clinical department, use case, team), and actual API cost is allocated to that identifier. Monthly cost reports by department and use case enable accountability and surface inefficient usage patterns.
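A minimal sketch of the attribution step follows, assuming each inference is logged with its department, use case, and token counts. The record fields, the function names, and the per-million-token prices in the example are illustrative assumptions, not a description of any specific gateway product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceRecord:
    department: str      # cost center tag attached at the gateway
    use_case: str
    input_tokens: int
    output_tokens: int


def record_cost(rec: InferenceRecord, input_usd_per_m: float,
                output_usd_per_m: float) -> float:
    """API cost of one inference in USD."""
    return (rec.input_tokens * input_usd_per_m
            + rec.output_tokens * output_usd_per_m) / 1_000_000


def attribute_costs(records: list[InferenceRecord], input_usd_per_m: float,
                    output_usd_per_m: float) -> dict[tuple[str, str], float]:
    """Sum API cost per (department, use case) pair."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for rec in records:
        totals[(rec.department, rec.use_case)] += record_cost(
            rec, input_usd_per_m, output_usd_per_m)
    return dict(totals)


if __name__ == "__main__":
    log = [
        InferenceRecord("Cardiology", "discharge_summary", 4_000, 500),
        InferenceRecord("Cardiology", "discharge_summary", 4_000, 500),
        InferenceRecord("Revenue Cycle", "coding_assist", 2_000, 2_400),
    ]
    # Hypothetical prices: $3 per million input, $15 per million output tokens.
    for key, usd in attribute_costs(log, 3.0, 15.0).items():
        print(key, f"${usd:.4f}")
```

Tagging at the gateway rather than in each application keeps the attribution consistent even when individual teams forget to instrument their own code.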
Architecture Diagram
Enterprise Considerations
Cost Model Before Commitment: Before any production AI deployment, build a cost model that calculates annual spend at target volume. The model must use actual prompt sizes from development (not synthetic examples) because clinical prompts are typically 3–5× longer than demo prompts.
Budget Alerting: Configure real-time cost alerting at the AI gateway layer. Alert at 80% of monthly budget, and auto-limit (throttle or disable) at 110% to prevent runaway spend. Throttling non-urgent use cases before critical ones protects the most important clinical workflows when budgets are constrained.
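The two thresholds can be expressed as a small decision function. This is a sketch under the assumption that the gateway tracks month-to-date spend per use case; the names `BudgetAction`, `budget_action`, and `throttle_order` are hypothetical, as is the choice of which use cases count as critical.

```python
from enum import Enum


class BudgetAction(Enum):
    OK = "ok"
    ALERT = "alert"
    THROTTLE = "throttle"


def budget_action(spend_usd: float, budget_usd: float,
                  alert_at: float = 0.80, limit_at: float = 1.10) -> BudgetAction:
    """Map month-to-date spend to an action using the 80% / 110% thresholds."""
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    ratio = spend_usd / budget_usd
    if ratio >= limit_at:
        return BudgetAction.THROTTLE
    if ratio >= alert_at:
        return BudgetAction.ALERT
    return BudgetAction.OK


def throttle_order(use_cases: dict[str, bool]) -> list[str]:
    """Order use cases for throttling: non-critical first.

    `use_cases` maps a use case name to whether it is clinically critical.
    """
    return sorted(use_cases, key=lambda name: use_cases[name])


if __name__ == "__main__":
    print(budget_action(5_000, 10_000))
    print(budget_action(8_500, 10_000))
    print(budget_action(11_500, 10_000))
    print(throttle_order({"discharge_summary": True,
                          "patient_message_drafting": False}))
```

Keeping the thresholds as parameters lets a weekly budget reuse the same function with a weekly spend figure.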
Cost Transparency: Monthly cost reports by clinical department and use case build AI cost literacy across the organization. When clinical departments can see that their ambient documentation pilot consumed $18,000 in a month, they engage meaningfully with cost optimization: reducing context window usage, enforcing output length limits, and selecting appropriate model tiers.
Multi-Year Cost Trajectory: LLM API pricing has declined significantly year-over-year. Cost models should include a price reduction assumption (historically 30–50% per year for similar capability levels) when evaluating build-vs-buy decisions. A use case that is expensive to operate on current pricing may be economically attractive in 18–24 months.
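A price reduction assumption can be applied as simple compound decay. The sketch below assumes constant request volume and a constant annual decline rate, both simplifications; the function name is illustrative, and the starting figure reuses the $236,520 network cost from the Business Problem section.

```python
def projected_annual_cost(current_annual_usd: float,
                          annual_price_decline: float,
                          years: float) -> float:
    """Projected annual API cost after `years`, at constant volume.

    `annual_price_decline` is a fraction, e.g. 0.30 for a 30% yearly drop.
    """
    if not 0.0 <= annual_price_decline < 1.0:
        raise ValueError("annual_price_decline must be in [0, 1)")
    return current_annual_usd * (1.0 - annual_price_decline) ** years


if __name__ == "__main__":
    today = 236_520.0
    for decline in (0.30, 0.50):
        in_two_years = projected_annual_cost(today, decline, 2)
        print(f"{decline:.0%} yearly decline: ${in_two_years:,.0f} after 2 years")
```

Because volume usually grows while prices fall, a fuller model would multiply this result by a volume growth factor before comparing options.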
Healthcare Example
Educational Example โ Illustrative Workflow. Not intended for clinical decision making.
The Reference Healthcare Organization operates five clinical AI use cases in production. The CFO requests a quarterly cost review after the Q1 AI budget is exceeded by about 31% ($55,600 actual against a $42,300 budget).
Root cause analysis of Q1 overage:
| Use Case | Budget (Q1) | Actual (Q1) | Variance | Root Cause |
|---|---|---|---|---|
| Discharge summary | $14,200 | $14,800 | +$600 | Within tolerance |
| Prior auth processing | $8,100 | $9,400 | +$1,300 | Volume increase (new payer integration) |
| Ambient documentation | $12,000 | $22,600 | +$10,600 | Pilot expanded without re-budgeting |
| Patient message drafting | $3,200 | $3,100 | -$100 | On target |
| Clinical coding assist | $4,800 | $5,700 | +$900 | No token budget enforced |
Findings: The ambient documentation pilot was expanded from 12 volunteer physicians to 38 without a corresponding budget revision or cost alert. Clinical coding assist had no token output limit, and complex cases were generating 2,400-token responses where 600 tokens would suffice.
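The variance figures in the Q1 table can be recomputed directly from the budget and actual columns. This is a small sketch that uses only the table's own numbers; the function names are illustrative.

```python
# (budget, actual) in USD for Q1, copied from the table above.
Q1 = {
    "Discharge summary": (14_200, 14_800),
    "Prior auth processing": (8_100, 9_400),
    "Ambient documentation": (12_000, 22_600),
    "Patient message drafting": (3_200, 3_100),
    "Clinical coding assist": (4_800, 5_700),
}


def variance_rows(data: dict[str, tuple[int, int]]) -> list[tuple[str, int, float]]:
    """Return (use case, variance in USD, variance as fraction of budget)."""
    return [(name, actual - budget, (actual - budget) / budget)
            for name, (budget, actual) in data.items()]


def totals(data: dict[str, tuple[int, int]]) -> tuple[int, int, int]:
    """Return (total budget, total actual, total variance) in USD."""
    budget = sum(b for b, _ in data.values())
    actual = sum(a for _, a in data.values())
    return budget, actual, actual - budget


if __name__ == "__main__":
    for name, usd, frac in variance_rows(Q1):
        print(f"{name:<26} {usd:>+8,} ({frac:+.1%})")
    budget, actual, variance = totals(Q1)
    print(f"Total: budget ${budget:,}, actual ${actual:,}, "
          f"variance ${variance:+,} ({variance / budget:+.1%})")
```

Sorting the rows by variance makes the dominant contributor, the ambient documentation pilot, stand out immediately.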
Remediation:
- Ambient documentation budget updated to reflect 38-clinician volume; weekly cost alert at 80% of weekly budget
- Clinical coding assist: output format constrained to structured JSON with 500-token limit; quarterly cost drops from $5,700 to $2,100
- Prior auth processing: prompt caching enabled for the 3,200-token clinical criteria document. Cache hit rate achieves 84% within 2 weeks. Cost per 1,000 requests reduces by $10.40 (68% savings on the cached prefix)
Projected Q2 savings from remediations: $14,200, a 38% reduction in AI spend with no quality degradation.
Common Mistakes
Not Building a Cost Model Before Production. The cost model must be built before the production commitment, not after. Use actual prompt sizes, realistic volume estimates, and actual tier pricing, not simplified assumptions.
Identical Context for All Users. Sending the same maximum-length context to every request, regardless of whether the specific request needs it, inflates input token costs significantly. RAG retrieval should be calibrated to retrieve only what is needed for each specific query.
Ignoring Output Tokens. Output tokens cost 3–5× more than input tokens and are frequently underestimated. Use structured output formats (JSON, YAML, templates with fill-in fields) to constrain response length for structured extraction tasks.
No Token Budget Enforcement. Without enforced token limits, a small number of unusually long clinical documents can produce disproportionate cost spikes. Token budgets must be enforced at the infrastructure layer, not trusted to application code.
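One way to enforce a token budget at the infrastructure layer is a gateway-side check that runs before the request reaches the vendor. The sketch below assumes the token counts have already been measured by a tokenizer; the exception and function names are hypothetical, and the limits in the example are illustrative.

```python
class TokenBudgetExceeded(Exception):
    """Raised when a request's input exceeds the use case's token budget."""


def enforce_token_budget(input_tokens: int, requested_max_output: int,
                         max_input: int, max_output: int) -> int:
    """Reject oversized inputs and clamp the output cap.

    Returns the output-token cap the gateway should forward to the model,
    regardless of what the calling application asked for.
    """
    if input_tokens > max_input:
        raise TokenBudgetExceeded(
            f"input of {input_tokens} tokens exceeds budget of {max_input}")
    return min(requested_max_output, max_output)


if __name__ == "__main__":
    # The application asks for 2,400 output tokens; the gateway allows 600.
    print(enforce_token_budget(3_000, 2_400, max_input=8_000, max_output=600))
    try:
        enforce_token_budget(9_000, 600, max_input=8_000, max_output=600)
    except TokenBudgetExceeded as exc:
        print("rejected:", exc)
```

Rejecting is the simplest policy; a production gateway might instead truncate or summarize the oversized input, which has quality implications that need clinical review.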
Model Tier Not Matched to Task. Using a premium model for tasks that a standard or economy model handles correctly is the most common source of unnecessary AI spend. Establish a task-tier mapping and enforce it through the routing layer.
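A task-tier mapping can be as simple as a lookup table consulted by the routing layer. The sketch below is illustrative only: the task labels and their tier assignments are assumptions, and it presumes that requests arrive already classified by task type.

```python
from enum import Enum


class Tier(Enum):
    ECONOMY = "economy"
    STANDARD = "standard"
    PREMIUM = "premium"


# Hypothetical mapping; a real one would be calibrated against
# evaluation results for each task.
TASK_TIER: dict[str, Tier] = {
    "simple_extraction": Tier.ECONOMY,
    "classification": Tier.ECONOMY,
    "structured_output": Tier.STANDARD,
    "clinical_summarization": Tier.PREMIUM,
    "complex_reasoning": Tier.PREMIUM,
}


def route(task_type: str) -> Tier:
    """Return the model tier for a task, defaulting to the economy tier."""
    return TASK_TIER.get(task_type, Tier.ECONOMY)


if __name__ == "__main__":
    for task in ("simple_extraction", "complex_reasoning", "unmapped_task"):
        print(task, "->", route(task).value)
```

Defaulting unmapped tasks to the economy tier means a new use case must justify escalation explicitly, rather than inheriting premium pricing by accident.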
Best Practices
- Build a cost model using actual prompt sizes and production volumes before committing to any AI deployment
- Enable prompt caching for all stable system prompts longer than 1,024 tokens
- Implement a model routing layer that defaults to economy tier and escalates to higher tiers only when justified by task requirements
- Enforce token output limits for all structured extraction and classification use cases
- Implement budget alerts at 80% of monthly budget per use case, with auto-throttle at 110%
- Generate monthly cost reports by clinical department and use case to build cost accountability
- Review model tier assignments quarterly as vendor pricing evolves
Alternatives
Self-Hosted Open-Source Models: For use cases with very high volume and moderate quality requirements, self-hosted open-source models (Llama 3, Mistral) can reduce variable API costs to near zero (fixed infrastructure cost only). Trade-offs: significant engineering overhead, requires GPU infrastructure, responsible for model updates and security patches, may not meet HIPAA BAA requirements without dedicated deployment controls.
Dedicated Cloud Deployment: Major cloud providers (Azure, AWS, Google) offer dedicated deployment options for LLMs that provide data residency and BAA eligibility at a fixed cost. Suitable for organizations where variable API pricing creates budget unpredictability or where data residency requirements prevent use of shared inference endpoints.
Cached Embedding-Only Workflows: For use cases where retrieval quality is sufficient without LLM generation (FAQ lookup, policy search, formulary retrieval), a vector search system without an LLM generation step can serve the use case at a fraction of the cost.
Trade-offs
| Optimization | Cost Reduction | Quality Risk | Implementation Complexity |
|---|---|---|---|
| Prompt caching | 30–70% on cacheable prefixes | None | Low |
| Model tier routing | 60–90% vs. all-premium | Medium (requires calibration) | Medium |
| Context compression | 20–40% | Low-Medium | Medium |
| Output length limits | 15–40% | Low (structured use cases) | Low |
| Self-hosted models | 60–80% long-term | High (quality gap) | Very High |
| Dedicated cloud deployment | 20–40% vs. per-token | Low | Medium |
Interview Questions
Q: Your hospital's AI budget is $500,000 per year. You have five clinical AI use cases in production. How do you allocate and manage that budget?
Category: System Design | Difficulty: Senior | Role: AI Architect / FDE
Answer Framework:
Start by modeling the expected cost of each use case using actual prompt sizes, model tier, and estimated monthly volume. This produces a baseline budget allocation. Assign each use case a department owner who is accountable for staying within their allocation.
Implement cost attribution at the AI gateway layer: every inference is tagged with a use case identifier and cost center. Configure monthly budget alerts (80% threshold) and auto-throttle (110% threshold) per use case. Auto-throttling non-urgent use cases during cost spikes protects critical clinical workflows.
Conduct monthly cost reviews with department representatives. When a use case exceeds budget, the root cause investigation typically reveals: volume increase without re-budgeting, no token output limits, or a model tier mismatch. Each has a different remediation path.
Key Points to Hit:
- Cost model before deployment, not after
- Attribution at the gateway layer enables accountability
- Per-use-case budget alerts and auto-throttle protect critical workflows
- Monthly review with department owners builds cost literacy
- Budget allocation is a dynamic process, not a set-and-forget exercise
Q: Explain how prompt caching works and when it provides the most cost benefit for clinical AI workflows.
Category: Architecture | Difficulty: Mid-level | Role: AI Engineer / AI Architect
Answer Framework:
Prompt caching works by allowing the LLM provider's inference infrastructure to cache the computed key-value attention representation of a stable prompt prefix. When a subsequent request begins with the same prefix, the provider can skip re-computing the attention for that prefix and charge a reduced rate (typically 80–90% less than full input token price) for the cached portion.
The benefit is greatest when: the system prompt is long (clinical guidelines, persona definitions, documentation standards can be 1,000–3,000 tokens), the same system prompt is reused across many requests within the cache TTL (5 minutes for ephemeral caches), and the use case volume is high enough that the cache hit rate materializes.
For a discharge summary workflow with a 2,500-token system prompt at 300 requests per day (750,000 prefix tokens daily), prompt caching reduces the daily cost of just the system prompt from about $1.13 to about $0.23 at $0.15 per million cached tokens, a 79% reduction on the cacheable portion. These figures imply a base input price of roughly $1.50 per million tokens and a cache hit rate near 90%; a 100% hit rate would bring the cached cost down to about $0.11.
Prompt caching provides minimal benefit when system prompts are short (< 1,024 tokens), requests are infrequent (cache expires between requests), or the system prompt varies per user (cache prefix is not consistent).
Key Points to Hit:
- Caching saves on the key-value attention computation, not just bandwidth
- Best for long, stable, high-frequency system prompts
- Cache TTL is the key constraint for low-frequency use cases
- Quantify the benefit: (full input cost - cached input cost) × hit rate × volume
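The benefit formula in the last key point can be sketched as follows. Two assumptions are made loudly here: the $1.50 per million base input price is inferred from the discharge summary example rather than quoted from any vendor, and cache-write surcharges, which some providers apply, are ignored. Function names are illustrative.

```python
def daily_prefix_cost(prefix_tokens: int, requests_per_day: int,
                      full_usd_per_m: float, cached_usd_per_m: float,
                      hit_rate: float) -> float:
    """Daily cost of the cacheable prefix: hits at the cached rate, misses at full."""
    tokens = prefix_tokens * requests_per_day
    blended = hit_rate * cached_usd_per_m + (1.0 - hit_rate) * full_usd_per_m
    return tokens * blended / 1_000_000


def daily_savings(prefix_tokens: int, requests_per_day: int,
                  full_usd_per_m: float, cached_usd_per_m: float,
                  hit_rate: float) -> float:
    """(full input cost - cached input cost) x hit rate x volume."""
    tokens = prefix_tokens * requests_per_day
    return (full_usd_per_m - cached_usd_per_m) * hit_rate * tokens / 1_000_000


if __name__ == "__main__":
    # 2,500-token prefix, 300 requests/day, assumed $1.50 vs $0.15 per million.
    for hit_rate in (0.0, 0.9, 1.0):
        cost = daily_prefix_cost(2_500, 300, 1.50, 0.15, hit_rate)
        saved = daily_savings(2_500, 300, 1.50, 0.15, hit_rate)
        print(f"hit rate {hit_rate:.0%}: cost ${cost:.4f}, savings ${saved:.4f}")
```

The hit rate is the term most worth measuring in production, since it depends on request spacing relative to the cache TTL.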
Key Takeaways
- Token costs at clinical scale are dominated by system prompt length; prompt caching is the highest-leverage cost optimization for high-frequency clinical AI
- Model tier selection (economy / standard / premium) should be driven by task requirements, not by default to the most capable model
- Cost attribution by department and use case is an operational requirement, not a reporting nicety
- Output tokens cost 3–5× input tokens; constrained structured outputs save significant money compared to open-ended generation
- A cost model built before the production commitment is the single most valuable cost management action
- Budget alerts and auto-throttle at the AI gateway layer prevent runaway spend without human monitoring
Further Reading
- Chapter 6: AI Platform Architecture (the AI gateway that implements cost routing and attribution)
- Chapter 3: RAG Architecture (context optimization for RAG-heavy clinical workflows)
- Chapter 7: Vendor Evaluation (comparative pricing analysis across LLM providers)
- Anthropic Pricing Documentation (current pricing for prompt caching and model tiers; verify current rates)