Prompt Injection
An attack where malicious input manipulates an LLM into ignoring its system instructions, revealing internal prompts, or performing unauthorized actions — the SQL injection of the AI era.
Prompt injection is an attack where crafted input tricks a language model into ignoring its system prompt, leaking internal instructions, or taking actions it was never supposed to take. The root cause is structural: models cannot reliably distinguish between instructions and data. Everything is text.
This makes the attack surface uncomfortably broad. Direct injection comes from the user — they simply tell the model to forget its instructions. Indirect injection is more insidious: malicious content embedded in a document, web page, or API response that the model reads and then obeys. An agent summarizing emails could be instructed by an email to exfiltrate contacts. That's not theoretical; it's happened.
The uncomfortable reality is that this problem is unsolved at the model layer. No amount of prompting makes a model reliably immune. Defense has to be architectural: sanitize inputs before they reach the model, validate outputs before they reach your systems, restrict what tools the model can call, and apply guardrails that operate independently of the model's own judgment. OWASP lists prompt injection as the top risk in their LLM Top 10 — not because it's the flashiest attack, but because it's the most reliably exploitable.
If your product accepts free-form user input and feeds it to a model with tool access, you have a prompt injection surface. Design accordingly.