AI Cost Management

Conceptual Explanation

The cost of an LLM-based AI system consists of five components, each with different optimization levers:

Input Token Cost: The cost of sending context to the model — system prompt, few-shot examples, retrieved RAG context, and user input. Input token cost scales linearly with token count and is the primary cost driver for RAG-heavy clinical workflows. Optimization lever: context compression, prompt caching, selective context retrieval.

Output Token Cost: The cost of generating the model's response. Output tokens are typically 3–5× more expensive per token than input tokens (reflecting the sequential nature of generation versus parallel prefill processing). Optimization lever: constrain output format, use structured outputs that terminate early, avoid open-ended generation where summarization suffices.

Embedding Cost: The cost of generating vector embeddings for semantic search. Embedding is cheap compared to LLM inference but scales with the volume of documents being embedded. Optimization lever: cache embeddings for stable content (clinical guidelines, formularies), batch embedding operations.

Infrastructure Cost: The fixed and variable costs of the systems that support LLM inference: vector databases, orchestration layers, prompt registries, evaluation pipelines, monitoring infrastructure. These are often underestimated in cost models that focus only on API spend.

Operational Cost: Engineering time for prompt maintenance, evaluation, model version updates, and incident response. For clinical AI, add clinical validation time for each significant model or prompt update. These costs are invisible in API billing but are real and significant.
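The two token components above can be made concrete with a minimal sketch of a per-request cost calculation. The `TierPrice` class, the function names, and the prices are illustrative assumptions, not real vendor rates; embedding, infrastructure, and operational costs are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPrice:
    """Hypothetical per-million-token prices in USD (not real vendor rates)."""
    input_per_m: float
    output_per_m: float


def request_cost(input_tokens: int, output_tokens: int, price: TierPrice) -> float:
    """Token cost of one request; input and output are priced separately."""
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


def monthly_token_cost(requests_per_day: int, input_tokens: int,
                       output_tokens: int, price: TierPrice,
                       days: int = 30) -> float:
    """Scale the per-request cost to a month of steady traffic."""
    return requests_per_day * days * request_cost(input_tokens, output_tokens, price)


# Illustrative tier with output priced at 5x input, the top of the 3-5x range.
standard = TierPrice(input_per_m=3.0, output_per_m=15.0)
per_request = request_cost(input_tokens=4_000, output_tokens=500, price=standard)
per_month = monthly_token_cost(300, 4_000, 500, standard)
```

Even in this toy case the 500 output tokens account for well over a third of the request cost, which is why output length limits appear among the optimization levers.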

Core Architecture

The cost optimization architecture for clinical AI at the Reference Healthcare Organization implements three primary mechanisms:

Prompt Cache Architecture: Long system prompts that are stable across all requests — clinical documentation standards, HMS persona definition, disclaimers — are structured to appear at the beginning of every request message, satisfying the cache prefix requirement. The cache hit rate for these prompts is measured as a cost efficiency KPI.
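As a rough illustration of the prefix requirement, the sketch below assembles each prompt with the stable sections first and the per-request content last, then checks that the prefix is byte-identical across requests. The section text, function names, and the hit-rate definition (hits divided by requests) are assumptions for illustration. Real providers match on tokens rather than characters, and some require explicit cache markers.

```python
import hashlib

# Placeholder sections; real content would be the organization's own text.
STABLE_PREFIX = "\n\n".join([
    "DOCUMENTATION STANDARDS: <placeholder>",
    "PERSONA: <placeholder>",
    "DISCLAIMERS: <placeholder>",
])


def build_prompt(request_context: str, user_input: str) -> str:
    """Stable content first, variable content last, so the prefix never changes."""
    return "\n\n".join([STABLE_PREFIX, request_context, user_input])


def prefix_fingerprint(prompt: str) -> str:
    """Hash of the leading characters that are expected to be cacheable."""
    prefix = prompt[:len(STABLE_PREFIX)]
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def cache_hit_rate(cache_hits: int, total_requests: int) -> float:
    """Share of requests served from a cached prefix (the KPI named above)."""
    return cache_hits / total_requests if total_requests else 0.0


first = build_prompt("Patient A encounter notes", "Draft the discharge summary.")
second = build_prompt("Patient B encounter notes", "List active medications.")
same_prefix = prefix_fingerprint(first) == prefix_fingerprint(second)
```

A fingerprint check like this could run in a test suite to catch accidental edits that place per-request text ahead of the stable sections.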

Model Routing Layer: An AI gateway component that classifies incoming requests by complexity and routes them to the appropriate model tier. Simple extraction tasks route to fast, cheap models. Complex reasoning tasks, clinical summarization, and structured output generation route to higher-capability models.
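A minimal sketch of such a routing rule follows. The task labels, the `TASK_TIER` mapping, and the `clinically_critical` flag are hypothetical; a production gateway would classify requests with something more robust than a lookup table.

```python
from enum import Enum


class Tier(Enum):
    ECONOMY = "economy"
    STANDARD = "standard"
    PREMIUM = "premium"


# Hypothetical task-to-tier mapping; labels are illustrative only.
TASK_TIER = {
    "extraction": Tier.ECONOMY,
    "classification": Tier.ECONOMY,
    "formatting": Tier.ECONOMY,
    "summarization": Tier.STANDARD,
    "clinical_qa": Tier.STANDARD,
    "multi_step_reasoning": Tier.PREMIUM,
}


def route(task_type: str, clinically_critical: bool = False) -> Tier:
    """Pick a model tier; critical requests always escalate to premium."""
    if clinically_critical:
        return Tier.PREMIUM
    # Unmapped task types fall back to the cheapest tier (an assumption).
    return TASK_TIER.get(task_type, Tier.ECONOMY)
```

Falling back to the economy tier for unmapped task types keeps spend low by default; a more cautious deployment might instead reject unknown task types or send them for review.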

Cost Attribution Pipeline: Every AI inference is tagged with a cost center identifier (clinical department, use case, team), and actual API cost is allocated to that identifier. Monthly cost reports by department and use case enable accountability and surface inefficient usage patterns.
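The attribution step can be sketched as tagged records rolled up by cost center. The `InferenceRecord` fields and the sample values are assumptions for illustration; a real pipeline would read costs from provider billing data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceRecord:
    """One tagged inference; field names are hypothetical."""
    department: str
    use_case: str
    model_tier: str
    cost_usd: float


def monthly_report(records: list[InferenceRecord]) -> dict[tuple[str, str], float]:
    """Total cost per (department, use case), most expensive first."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for record in records:
        totals[(record.department, record.use_case)] += record.cost_usd
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Made-up sample data.
records = [
    InferenceRecord("cardiology", "discharge_summary", "standard", 0.02),
    InferenceRecord("cardiology", "discharge_summary", "standard", 0.03),
    InferenceRecord("radiology", "report_extraction", "economy", 0.01),
]
report = monthly_report(records)
```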

Architecture Diagram

flowchart TD
    A["Clinical AI Request"] --> B["AI Gateway\nCost Router"]
    B --> C{"Task Complexity\nClassification"}
    C -->|"Simple extraction\nclassification\nformatting"| D["Economy Tier\nHaiku / GPT-4o-mini\n~$0.25/M tokens"]
    C -->|"Reasoning\nsummarization\nclinical Q&A"| E["Standard Tier\nSonnet / GPT-4o\n~$3/M tokens"]
    C -->|"Complex synthesis\nmulti-step reasoning\ncritical clinical"| F["Premium Tier\nOpus / GPT-4\n~$15/M tokens"]
    D --> G["Prompt Cache\nCheck"]
    E --> G
    F --> G
    G -->|"Cache Hit\n80–90% cost reduction"| H["Cached Prefix\nResponse"]
    G -->|"Cache Miss"| I["Full Inference\nFull token cost"]
    H --> J["Response + Cost Tag\nDepartment · Use Case · Model Tier"]
    I --> J
    J --> K["Cost Attribution Store\n(per department, per use case)"]
    J --> L["Clinical Output\nto Workflow"]
    K --> M["Monthly Cost Report\nand Anomaly Detection"]

Common Mistakes

Not Building a Cost Model Before Production. The cost model must be built before the production commitment, not after. Use actual prompt sizes, realistic volume estimates, and actual tier pricing rather than simplified assumptions.

Identical Context for All Users. Sending the same maximum-length context with every request, regardless of whether that request needs it, inflates input token costs significantly. RAG retrieval should be calibrated to retrieve only what each query needs.

Ignoring Output Tokens. Output tokens cost 3–5× more than input tokens and are frequently underestimated. Use structured output formats (JSON, YAML, templates with fill-in fields) to constrain response length for structured extraction tasks.

No Token Budget Enforcement. Without enforced token limits, a small number of unusually long clinical documents can produce disproportionate cost spikes. Token budgets must be enforced at the infrastructure layer, not trusted to application code.
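One way to picture enforcement at the infrastructure layer is a gateway-side check that runs before any request reaches a model. This is a sketch under stated assumptions: the class and function names are hypothetical, and token counts are taken as given rather than computed with a specific tokenizer.

```python
from dataclasses import dataclass


class TokenBudgetExceeded(Exception):
    """Raised when a request's input exceeds the configured limit."""


@dataclass(frozen=True)
class TokenBudget:
    max_input_tokens: int
    max_output_tokens: int


def enforce_budget(input_tokens: int, requested_output_tokens: int,
                   budget: TokenBudget) -> int:
    """Reject oversized inputs and return the output cap to send to the model."""
    if input_tokens > budget.max_input_tokens:
        raise TokenBudgetExceeded(
            f"input of {input_tokens} tokens exceeds limit of "
            f"{budget.max_input_tokens}"
        )
    # The caller's request is clamped, never trusted as-is.
    return min(requested_output_tokens, budget.max_output_tokens)


extraction_budget = TokenBudget(max_input_tokens=8_000, max_output_tokens=512)
output_cap = enforce_budget(3_000, 4_000, extraction_budget)
```

Rejecting rather than silently truncating an oversized clinical document is a design choice here: truncation could drop clinically relevant text without anyone noticing.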

Model Tier Not Matched to Task. Using a premium model for tasks that a standard or economy model handles correctly is the most common source of unnecessary AI spend. Establish a task-tier mapping and enforce it through the routing layer.

Best Practices

Build a cost model using actual prompt sizes and production volumes before committing to any AI deployment
Enable prompt caching for all stable system prompts longer than 1,024 tokens
Implement a model routing layer that defaults to economy tier and escalates to higher tiers only when justified by task requirements
Enforce token output limits for all structured extraction and classification use cases
Implement budget alerts at 80% of the monthly budget per use case and auto-throttle at 110%
Generate monthly cost reports by clinical department and use case to build cost accountability
Review model tier assignments quarterly as vendor pricing evolves
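The alert and throttle thresholds in the list above can be sketched as a single decision function. The function name, the returned labels, and the rule that clinically critical use cases are alerted but never throttled are assumptions for illustration.

```python
ALERT_RATIO = 0.80     # alert at 80% of the monthly budget
THROTTLE_RATIO = 1.10  # auto-throttle at 110% of the monthly budget


def budget_action(spend_usd: float, monthly_budget_usd: float,
                  clinically_critical: bool = False) -> str:
    """Return "ok", "alert", or "throttle" for one use case."""
    if monthly_budget_usd <= 0:
        raise ValueError("monthly budget must be positive")
    ratio = spend_usd / monthly_budget_usd
    if ratio >= THROTTLE_RATIO:
        # Assumption: critical workflows keep running and only raise an alert.
        return "alert" if clinically_critical else "throttle"
    if ratio >= ALERT_RATIO:
        return "alert"
    return "ok"
```

A gateway could evaluate this on each request, or on a schedule, using month-to-date spend from the cost attribution store.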

Alternatives

Self-Hosted Open-Source Models: For use cases with very high volume and moderate quality requirements, self-hosted open-source models (Llama 3, Mistral) can reduce variable API costs to near zero (fixed infrastructure cost only). Trade-offs: significant engineering overhead, requires GPU infrastructure, responsible for model updates and security patches, may not meet HIPAA BAA requirements without dedicated deployment controls.

Dedicated Cloud Deployment: Major cloud providers (Azure, AWS, Google) offer dedicated deployment options for LLMs that provide data residency and BAA eligibility at a fixed cost. Suitable for organizations where variable API pricing creates budget unpredictability or where data residency requirements prevent use of shared inference endpoints.

Cached Embedding-Only Workflows: For use cases where retrieval quality is sufficient without LLM generation — FAQ lookup, policy search, formulary retrieval — a vector search system without an LLM generation step can serve the use case at a fraction of the cost.
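A retrieval-only lookup of this kind can be sketched in a few lines. The two-dimensional vectors and document identifiers are toy values, and the index is assumed to hold embeddings computed once and cached; a real system would use a vector database and a real embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_match(query_vec: list[float], index: dict[str, list[float]]) -> str:
    """Return the id of the cached document embedding closest to the query."""
    return max(index, key=lambda doc_id: cosine(query_vec, index[doc_id]))


# Toy index of precomputed embeddings; ids and vectors are made up.
cached_index = {
    "formulary_entry": [0.9, 0.1],
    "visitor_policy": [0.1, 0.9],
}
best = top_match([1.0, 0.0], cached_index)
```

No generation step runs here, so the only per-query model cost is embedding the query itself.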

Trade-offs

Optimization	Cost Reduction	Quality Risk	Implementation Complexity
Prompt caching	30–70% on cacheable prefixes	None	Low
Model tier routing	60–90% vs. all-premium	Medium (requires calibration)	Medium
Context compression	20–40%	Low-Medium	Medium
Output length limits	15–40%	Low (structured use cases)	Low
Self-hosted models	60–80% long-term	High (quality gap)	Very High
Dedicated cloud deployment	20–40% vs. per-token	Low	Medium

Interview Questions

Q: Your hospital's AI budget is $500,000 per year. You have five clinical AI use cases in production. How do you allocate and manage that budget?

Category: System Design · Difficulty: Senior · Role: AI Architect / FDE

Answer Framework:

Start by modeling the expected cost of each use case using actual prompt sizes, model tier, and estimated monthly volume. This produces a baseline budget allocation. Assign each use case a department owner who is accountable for staying within their allocation.
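One possible way to turn modeled costs into a baseline allocation is to scale them proportionally to the total budget. Proportional scaling is an assumption here, not the only reasonable policy, and the use case names and modeled figures below are invented for illustration.

```python
def allocate_budget(modeled_costs: dict[str, float],
                    total_budget: float) -> dict[str, float]:
    """Scale each use case's modeled annual cost so allocations sum to the budget."""
    modeled_total = sum(modeled_costs.values())
    if modeled_total <= 0:
        raise ValueError("modeled costs must sum to a positive amount")
    scale = total_budget / modeled_total
    return {name: round(cost * scale, 2) for name, cost in modeled_costs.items()}


# Hypothetical modeled annual costs (USD) for five use cases.
modeled = {
    "discharge_summaries": 160_000,
    "clinical_qa": 100_000,
    "coding_assist": 60_000,
    "triage_notes": 50_000,
    "formulary_search": 30_000,
}
allocation = allocate_budget(modeled, total_budget=500_000)
```

When modeled costs sum to less than the budget, the scale factor leaves each owner some headroom. When they sum to more, proportional scaling hides a shortfall that should be resolved explicitly before deployment.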

Implement cost attribution at the AI gateway layer — every inference is tagged with a use case identifier and cost center. Configure monthly budget alerts (80% threshold) and auto-throttle (110% threshold) per use case. Auto-throttling non-urgent use cases during cost spikes protects critical clinical workflows.

Conduct monthly cost reviews with department representatives. When a use case exceeds budget, the root cause investigation typically reveals: volume increase without re-budgeting, no token output limits, or a model tier mismatch. Each has a different remediation path.

Key Points to Hit:

Cost model before deployment, not after
Attribution at the gateway layer enables accountability
Per-use-case budget alerts and auto-throttle protect critical workflows
Monthly review with department owners builds cost literacy
Budget allocation is a dynamic process, not a set-and-forget exercise

Q: Explain how prompt caching works and when it provides the most cost benefit for clinical AI workflows.

Category: Architecture · Difficulty: Mid-level · Role: AI Engineer / AI Architect

Answer Framework:

Prompt caching works by allowing the LLM provider's inference infrastructure to cache the computed key-value attention representation of a stable prompt prefix. When a subsequent request begins with the same prefix, the provider can skip re-computing the attention for that prefix and charge a reduced rate (typically 80–90% less than full input token price) for the cached portion.

The benefit is greatest when: the system prompt is long (clinical guidelines, persona definitions, documentation standards can be 1,000–3,000 tokens), the same system prompt is reused across many requests within the cache TTL (5 minutes for ephemeral caches), and the use case volume is high enough that the cache hit rate materializes.

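For a discharge summary workflow with a 2,500-token system prompt at 300 requests per day, the system prompt alone accounts for 750,000 input tokens per day. At a full input price of $1.50 per million tokens, that costs about $1.13 per day; at $0.15 per million cached tokens it costs about $0.11 per day, a 90% reduction on the cacheable portion. This assumes every request hits the cache and ignores any cache-write surcharge.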
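The arithmetic behind this kind of estimate can be sketched as below. The function names are hypothetical, the $1.50 per million full input price is an assumption inferred from the $1.13 daily figure, and the sketch ignores any surcharge a provider may apply when writing to the cache.

```python
def daily_prefix_cost(prefix_tokens: int, requests_per_day: int,
                      price_per_m: float) -> float:
    """Daily cost in USD of sending the same prefix with every request."""
    return prefix_tokens * requests_per_day * price_per_m / 1_000_000


def daily_caching_savings(prefix_tokens: int, requests_per_day: int,
                          full_price_per_m: float, cached_price_per_m: float,
                          hit_rate: float = 1.0) -> float:
    """(full input cost - cached input cost) x hit rate, over one day's volume."""
    full = daily_prefix_cost(prefix_tokens, requests_per_day, full_price_per_m)
    cached = daily_prefix_cost(prefix_tokens, requests_per_day, cached_price_per_m)
    return (full - cached) * hit_rate


# Inputs from the discharge summary example; the full price is an assumption.
full_cost = daily_prefix_cost(2_500, 300, price_per_m=1.50)
cached_cost = daily_prefix_cost(2_500, 300, price_per_m=0.15)
savings = daily_caching_savings(2_500, 300, 1.50, 0.15, hit_rate=1.0)
```

Lowering `hit_rate` shows how quickly the benefit erodes for low-frequency use cases where the cache expires between requests.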

Prompt caching provides minimal benefit when system prompts are short (< 1,024 tokens), requests are infrequent (cache expires between requests), or the system prompt varies per user (cache prefix is not consistent).

Key Points to Hit:

Caching saves on the key-value attention computation, not just bandwidth
Best for long, stable, high-frequency system prompts
Cache TTL is the key constraint for low-frequency use cases
Quantify the benefit: (full input cost - cached input cost) × hit rate × volume

Key Takeaways

Input token costs at clinical scale are driven largely by long, repeated system prompts and retrieved context; prompt caching is the highest-leverage cost optimization for high-frequency clinical AI
Model tier selection (economy / standard / premium) should be driven by task requirements, not by default to the most capable model
Cost attribution by department and use case is an operational requirement, not a reporting nicety
Output tokens cost 3–5× input tokens; constrained structured outputs save significant money compared to open-ended generation
A cost model built before the production commitment is the single most valuable cost management action
Budget alerts and auto-throttle at the AI gateway layer prevent runaway spend without relying on manual monitoring
