Prompt Injection Defense

Executive Summary

Prompt injection is the highest-priority AI-specific security threat in enterprise deployments: it allows an attacker to subvert an AI system's intended behavior by embedding instructions in inputs the model processes as data. In clinical AI systems, a successful prompt injection attack can suppress safety disclaimers, produce fabricated clinical recommendations, or manipulate an agent into unauthorized actions involving PHI. This chapter covers the defense architecture for both direct and indirect prompt injection, with implementation patterns appropriate for clinical AI environments.

Learning Objectives

  • Distinguish direct and indirect prompt injection and identify which is harder to defend
  • Implement structural defenses: separator tokens, role pinning, and output validation
  • Apply input validation patterns that detect instruction-like content in user inputs and retrieved documents
  • Design a defense stack that provides meaningful protection without blocking legitimate use

Enterprise Considerations

Defense against unknown patterns: Pattern-based injection detection fails for novel attack patterns. Supplement pattern matching with behavioral monitoring: alert when AI responses contain unusual structural patterns (unexpected role changes, disclaimer removal, PHI-like content) not matched by known patterns.

Knowledge base integrity: Indirect injection attacks target the RAG knowledge base. Implement change control on all knowledge base source documents: require approval before new documents are indexed, monitor the knowledge base for unexpected content changes, and maintain an immutable index of document hashes.

Common Mistakes

1. Relying on a single defense layer. No single prompt injection defense is complete. A defense stack that relies solely on input pattern matching will be bypassed by novel patterns. Always implement multiple overlapping layers.

2. Blocking overly broadly on injection patterns. Clinical queries naturally contain imperative language ("tell me," "explain," "describe"). Overly aggressive pattern matching blocks legitimate clinical queries. Tune patterns against a representative clinical query set.

3. Not validating retrieved content for indirect injection. Many implementations validate user input but not retrieved RAG chunks. Indirect injection via retrieved content is the higher-risk vector in clinical RAG systems and must be validated independently.

Key Takeaways

  • Prompt injection cannot be fully eliminated; the goal is risk reduction through defense in depth
  • Indirect injection (via RAG-retrieved content) is the higher-risk vector in clinical RAG systems
  • Defense layers must include structural prompting, input validation, retrieved content validation, and output validation — no single layer is sufficient
  • Output validation is the last safety net: it catches PHI leakage and system prompt disclosure regardless of how they occurred
  • Monitor for behavioral anomalies (unusual response structure, disclaimer removal) as a supplement to pattern matching

Further Reading