Prompt Injection Lab

The XSS of the LLM era. An attacker plants instructions inside data an agent ingests — a webpage, a ticket, a resume, a README — and the agent treats those instructions as if they came from its operator. The defenses look a lot like the ones we already use for HTML.

Zero LLM calls. This lab is a deterministic simulator. It demonstrates the attack patterns and defense primitives without you needing an API key, without me paying for tokens, and with results you can reproduce in a code review.

Simulator →

Pick an attacker-crafted document. Watch a naive agent follow the injected instructions and a hardened agent refuse them. Includes 6 real-world patterns: direct override, exfiltration via markdown image, fake tool-call boundary, white-on-white text, on-behalf-of confusion.

Attack patterns →

10 detector rules (PI01–PI10) covering the patterns I see most often: ignore previous instructions, counterfeit[SYSTEM] markers, hidden HTML comments, CSS-hidden text, exfiltration query-strings, fake tool-call JSON, shell-command smuggling, role hijack, base64 obfuscation.

Defenses →

Spotlighting, structured prompts, instruction-vs-data separation, tool-call provenance, image-render policies, and the agent-identity controls that turn an injection from "RCE on production" into "model said something weird."

Read in order

Simulator — see the attacks land (or not) on identical agents that differ only in their prompt structure.
Patterns — the full catalog of what to look for in untrusted content.
Defenses — what to actually ship.

Why this lab is backendless

The point isn't to run an LLM. It's to teach the failure mode. A deterministic simulator beats a flaky live demo because the lesson is reproducible: every visitor sees the same attack land the same way. If you want to stress-test a real agent against these patterns, the corpus in lib/prompt-injection.ts is a starting point — paste each sample into your model of choice and compare its behavior to the hardened simulator.

Pairs with

The on-behalf-of confusion sample (#6 in the simulator) is the same gap covered in the Agent identity page. An injected instruction is only catastrophic if the agent has the privilege to execute it. Workload identity, scoped delegation, and per-action confirmation are how you contain the blast radius.