Prompt Injection Defense

Common Mistakes

1. Relying on a single defense layer. No single prompt injection defense is complete. A defense stack that relies solely on input pattern matching will be bypassed by novel patterns. Always implement multiple overlapping layers.

2. Blocking overly broadly on injection patterns. Clinical queries naturally contain imperative language ("tell me," "explain," "describe"). Overly aggressive pattern matching blocks legitimate clinical queries. Tune patterns against a representative clinical query set.

3. Not validating retrieved content for indirect injection. Many implementations validate user input but not retrieved RAG chunks. Indirect injection via retrieved content is the higher-risk vector in clinical RAG systems and must be validated independently.

Key Takeaways

  • Prompt injection cannot be fully eliminated; the goal is risk reduction through defense in depth
  • Indirect injection (via RAG-retrieved content) is the higher-risk vector in clinical RAG systems
  • Defense layers must include structural prompting, input validation, retrieved content validation, and output validation — no single layer is sufficient
  • Output validation is the last safety net: it catches PHI leakage and system prompt disclosure regardless of how they occurred
  • Monitor for behavioral anomalies (unusual response structure, disclaimer removal) as a supplement to pattern matching