Early stage goal-directedness

Published on October 21, 2025 5:41 PM GMT

A fairly common question is “why should we expect powerful systems to be coherent agents with perfect game theory?”

There was a short comment exchange on "The title is reasonable" that I thought made a decent standalone post.

In the original post, I said:

Goal Directedness is pernicious. Corrigibility is anti-natural.

The way an AI would develop the ability to think extended, useful, creative research thoughts that you might fully outsource to is via becoming perniciously goal-directed. You can't do months or years of open-ended research without fractally noticing subproblems, figuring out new goals,…
