MCP Colors: Systematically deal with prompt injection risk (via) Tim Kellogg proposes a neat way to think about prompt injection, especially with respect to MCP tools.
Classify every tool with a color: red if it exposes the agent to untrusted (potentially malicious) instructions, blue if it involves a “critical action” - something you would not want an attacker to be able to trigger.
This means you can configure your agent to actively avoid mixing the two colors at once:
The Chore: Go label every data input, and every tool (especially MCP tools). For MCP tools & resources, you can use the _meta object to keep track of the color. The agent can…
MCP Colors: Systematically deal with prompt injection risk (via) Tim Kellogg proposes a neat way to think about prompt injection, especially with respect to MCP tools.
Classify every tool with a color: red if it exposes the agent to untrusted (potentially malicious) instructions, blue if it involves a “critical action” - something you would not want an attacker to be able to trigger.
This means you can configure your agent to actively avoid mixing the two colors at once:
The Chore: Go label every data input, and every tool (especially MCP tools). For MCP tools & resources, you can use the _meta object to keep track of the color. The agent can decide at runtime (or earlier) if it’s gotten into an unsafe state.
Personally, I like to automate. I needed to label ~200 tools, so I put them in a spreadsheet and used an LLM to label them. That way, I could focus on being precise and clear about my criteria for what constitutes “red”, “blue” or “neither”. That way I ended up with an artifact that scales beyond my initial set of tools.