Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults

Discussed on Hacker News

LLM agents increasingly act after consuming ranked external information streams such as social feeds, search results, retrieval contexts, and email queues. Yet safety evaluations almost always test the model or the user prompt in isolation, not the upstream ranker that decides what the agent reads just before it acts. We introduce a controlled protocol that holds the model, persona, topic, and final decision prompt fixed and varies only the co...
