Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

Summary: Misaligned artificial agents might resist shutdown. One proposed solution is, roughly, to train agents to lack preferences between different-length trajectories. The DReST reward function does this by penalizing agents for repeatedly choosing same-length trajectories. It thus incentivizes agents to be (1) NEUTRAL about trajectory-lengths: choose stochastically between different trajectory-lengths; and (2) USEFUL: pursue goals effectively conditional on each trajectory-length. We use DReST to train deep RL agents...
