Hierarchical Reinforcement Learning (HRL) addresses one of the most persistent challenges in artificial intelligence: the “curse of horizon.” In standard Reinforcement Learning (RL), agents struggle to solve tasks requiring thousands of sequential decisions because the reward signal—the feedback indicating success—is often sparse and delayed. HRL mitigates this by decomposing complex, long-horizon problems into a hierarchy of manageable sub-problems, effectively shortening the decision horizon.
The main recent shift in the field seems to be moving from manually defining these hierarchies to discovering them autonomously. Earlier approaches relied on domain experts to define subgoals (e.g., “open door,” “pick up key”). Recent innovations focus on unsupervised skill discovery and latent space abstractions, where agents learn to identify useful sub-behaviours (skills) by maximizing information-theoretic objectives, such as Mutual Information, rather than relying solely on external rewards.
A pivotal development in this trajectory is the emergence of Internal Reinforcement Learning, exemplified by the late-2025 publication “Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning” by Kobayashi et al. (2025). This work represents a paradigm shift: rather than treating the hierarchy as a sequence of external actions, it embeds the hierarchy within the internal representations of large autoregressive models. By manipulating the model’s internal residual streams, a high-level controller can guide a low-level policy through “internal actions,” enabling efficient exploration in sparse-reward environments without explicit subgoal definitions. This post synthesizes these developments, providing a mathematical overview suitable for a mathematician, identifying key theoretical advances, and outlining the current limitations of the field.
Mathematical Formulation
Standard RL frames decision-making as a Markov Decision Process (MDP), defined as a tuple (\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle), where:
- (\mathcal{S}) is the state space.
- (\mathcal{A}) is the action space.
- (\mathcal{P}: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})) is the transition probability distribution.
- (\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}) is the reward function.
- (\gamma \in [0, 1)) is the discount factor.
The objective is to find a policy (\pi: \mathcal{S} \to \Delta(\mathcal{A})) that maximizes the expected discounted return: (J(\pi) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]).
As the time horizon (T) required to reach a reward grows, the probability that a random sequence of actions reaches that reward decays exponentially (on the order of (|\mathcal{A}|^{-T})). Furthermore, temporal-difference methods (like Q-learning) propagate reward information backwards only one bootstrapped update at a time, so the learning signal effectively vanishes over long horizons.
HRL formalizes temporal abstraction using Semi-Markov Decision Processes (SMDPs). Instead of atomic actions (a \in \mathcal{A}) that last one time step, the agent selects options or skills (\omega \in \Omega). An option (or skill) is a triple (\langle \mathcal{I}_\omega, \pi_\omega, \beta_\omega \rangle), where:
(\mathcal{I}_\omega \subseteq \mathcal{S}): the initiation set, i.e. the states in which option (\omega) can be initiated.
(\pi_\omega : \mathcal{S} \to \Delta(\mathcal{A})): the intra-option (low-level) policy. Given the current state (s), (\pi_\omega(\cdot \mid s)) defines a probability distribution over primitive actions (a \in \mathcal{A}) while option (\omega) is active.
(\beta_\omega : \mathcal{S} \to [0,1]): the termination function. For each state (s), (\beta_\omega(s)) gives the probability that option (\omega) terminates upon arrival in (s).
These definitions ensure that options induce a Semi-Markov Decision Process, since action selection within an option is Markovian in the state, and termination occurs stochastically as a function of state.
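As a data structure the option triple is tiny; here is a minimal sketch, where the `State`/`Action` placeholder types and the `can_initiate` helper are purely illustrative, and (\pi_\omega) is represented by a sampler rather than a full distribution:

```python
from dataclasses import dataclass
from typing import Callable, Set

# Placeholder types: in a real environment these would be whatever the MDP uses.
State = int
Action = int

@dataclass
class Option:
    initiation_set: Set[State]             # I_omega: states where the option may start
    policy: Callable[[State], Action]      # pi_omega: samples a primitive action given s
    termination: Callable[[State], float]  # beta_omega: probability of terminating in s

    def can_initiate(self, s: State) -> bool:
        return s in self.initiation_set
```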
The high-level policy (\Pi: \mathcal{S} \to \Omega) selects an option, which executes for (k) steps until termination. The Bellman equation for the high-level policy becomes: [ Q_\Omega(s, \omega) = \mathbb{E} \left[ r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{k-1} r_{t+k} + \gamma^k \max_{\omega'} Q_\Omega(s_{t+k}, \omega') \mid s_t=s, \omega_t=\omega \right] ] This effectively “jumps” (k) steps in time, reducing the effective horizon of the problem.
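A sampled version of this equation gives the SMDP Q-learning update. A minimal tabular sketch, with the table sizes and learning rate invented:

```python
import numpy as np

n_states, n_options = 100, 4          # invented sizes for a tabular example
gamma, alpha = 0.99, 0.1
Q = np.zeros((n_states, n_options))   # high-level option-value table Q_Omega

def smdp_q_update(s, omega, discounted_return, k, s_next):
    """SMDP Q-learning update after option `omega`, started in state `s`, ran for
    `k` primitive steps, accrued `discounted_return` = sum_i gamma^(i-1) * r_(t+i),
    and terminated in `s_next`."""
    target = discounted_return + gamma ** k * Q[s_next].max()
    Q[s, omega] += alpha * (target - Q[s, omega])
```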
Probabilistic and Latent Subgoal Representations
One paradigm in HRL is Goal-Conditioned HRL, where the high-level policy outputs a subgoal (g \in \mathcal{G}) rather than a discrete option index. The low-level policy (\pi_{low}(a|s, g)) attempts to reach (g).
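For intuition, here is a toy two-level loop for goal-conditioned HRL; every function, the subgoal horizon `c`, and the deterministic dynamics are invented placeholders, not any particular algorithm:

```python
import numpy as np

def high_level_policy(state):             # emits a subgoal g in G
    return state + np.array([1.0, 0.0])   # e.g. "move one unit in x"

def low_level_policy(state, subgoal):     # pi_low(a | s, g)
    return np.clip(subgoal - state, -0.1, 0.1)

c = 10                                    # the high level acts every c steps
state = np.zeros(2)
for t in range(50):
    if t % c == 0:
        g = high_level_policy(state)
    action = low_level_policy(state, g)
    state = state + action                         # toy deterministic dynamics
    intrinsic_r = -np.linalg.norm(state - g)       # low-level reward: get close to the subgoal
```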
Traditionally, the mapping from state space (\mathcal{S}) to subgoal space (\mathcal{G}) was deterministic and often manually engineered (e.g., (g) is a specific ((x,y)) coordinate). This fails in stochastic environments where a specific state might not be reachable or where the mapping (\phi: \mathcal{S} \to \mathcal{G}) is uncertain.
Wang et al. (2024) introduced Probabilistic Subgoal Representations using Gaussian Processes to model the transition from state to subgoal space. Instead of a deterministic function (z = \phi(s)), the system learns a posterior distribution over representation functions.
- Mechanism: The high-level policy samples a subgoal (g) from a latent space modeled by a GP. This allows the agent to quantify uncertainty in its high-level planning.
- Benefit: The GP prior exploits long-range correlations in the state space (via a learnable kernel), allowing the agent to “remember” and integrate information over longer horizons than standard recurrent networks. This approach, termed HLPS (Hierarchical Learning with Probabilistic Subgoals), has shown superior performance in stochastic environments by adapting the subgoal space dynamically (Wang et al. 2024).
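The following is not the HLPS algorithm, just a toy reminder of the ingredient it builds on: a GP posterior over the state-to-subgoal map (\phi), from which a high-level policy can read off both a mean subgoal and its uncertainty. The state features, targets, and kernel hyperparameters here are all invented.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=(50, 2))                            # toy 2-D state features
subgoals = np.sin(3 * states[:, 0]) + 0.1 * rng.standard_normal(50)  # toy 1-D latent subgoal targets

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-2)
gp.fit(states, subgoals)

# Posterior over phi(s): mean, uncertainty, and samples of candidate subgoals.
query = rng.uniform(-1, 1, size=(5, 2))
mean, std = gp.predict(query, return_std=True)
candidate_subgoals = gp.sample_y(query, n_samples=3)
```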
Unsupervised Skill Discovery via Mutual Information
Around 2020 to 2023, a lot of effort went into learning skills without any extrinsic reward, purely by interacting with the environment. The main flavour, as far as I can tell, is maximizing the Mutual Information between the state (S) and the latent skill variable (Z).
Here’s what that looks like in practice: The agent seeks to maximize (I(S; Z)), which can be decomposed as: [ I(S; Z) = H(Z) - H(Z|S) = H(S) - H(S|Z) ]
- Maximizing (H(Z)): Ensures diverse skills are used.
- Minimizing (H(Z|S)): Ensures skills are distinguishable (given a state transition, we can infer which skill caused it).
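In practice this objective is usually attacked via a variational lower bound (the recipe behind DIAYN and its successors, mentioned below): a learned discriminator (q_\phi(z \mid s)) tries to infer the skill from the visited state, and the agent receives (\log q_\phi(z \mid s) - \log p(z)) as an intrinsic reward. A minimal sketch, with the discriminator architecture and dimensions invented:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_skills, state_dim = 8, 16                       # invented sizes

# q_phi(z | s): discriminator that infers the active skill from the visited state.
discriminator = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_skills))

def intrinsic_reward(state: torch.Tensor, skill: int) -> torch.Tensor:
    """Variational lower bound on I(S; Z): log q_phi(z|s) - log p(z),
    with p(z) uniform over the n_skills skill codes."""
    log_q = F.log_softmax(discriminator(state), dim=-1)[skill]
    log_p = -torch.log(torch.tensor(float(n_skills)))
    return log_q - log_p
```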
There seem to be some notable flavours of this idea:
- Contrastive Intrinsic Control (CIC): Previous methods (like DIAYN) often learned static skills that barely move the agent, because static states are easy to distinguish. CIC introduced a contrastive learning objective that maximizes the entropy of state transitions induced by skills. It uses a particle-based entropy estimator to encourage skills that actively explore the state space (Laskin et al. 2022; Adeniji, Xie, and Abbeel 2023).
- Behavior Contrastive Learning (BeCL): This approach recognized that maximizing MI alone can lead to “simple” skills. BeCL employs a contrastive loss that forces the agent to produce similar behaviors for the same skill code (z) but diverse behaviors for different codes (z’ \neq z). It mathematically bounds the MI objective to ensure the agent covers the state space (maximizing state entropy) while maintaining skill distinguishability (Yang et al. 2023); a generic sketch of this contrastive family follows after this list.
- Disentangled Unsupervised Skill Discovery (DUSDi): This method enforces that specific dimensions of the latent skill vector (z) control specific factors of the environment (e.g., one dimension controls speed, another controls steering). It uses a modified MI objective to enforce independence between skill components, hopefully facilitating better compositionality for downstream tasks (Hu et al. 2024) (See paper site).
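As promised above, here is a generic InfoNCE-style loss in the spirit of the contrastive methods (CIC, BeCL), though it is not either paper’s exact objective: embeddings of two states produced by the same skill are pulled together, and the other rows in the batch act as negatives.

```python
import torch
import torch.nn.functional as F

def skill_contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                           temperature: float = 0.5) -> torch.Tensor:
    """InfoNCE: anchor[i] and positive[i] embed two states from the *same* skill;
    every other row in the batch serves as a negative (a different skill)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.T / temperature   # (B, B) cosine similarities
    labels = torch.arange(anchor.shape[0])       # the matching index is the positive
    return F.cross_entropy(logits, labels)
```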
Internal Reinforcement Learning
Internal Reinforcement Learning rethinks where the hierarchy resides. Traditional HRL separates policies temporally (the high level acts every (k) steps). Kobayashi et al. (2025) propose that hierarchy can emerge within the internal activations of a single large autoregressive model.
Large Language Models (LLMs) and autoregressive foundation models generate outputs token-by-token. In an RL setting, this “token-level” action space is incredibly dense and long-horizon. A simple task might require generating hundreds of tokens, making credit assignment nearly impossible for sparse rewards.
Instead of a high-level policy outputting a text command or a subgoal coordinate, the authors introduce a meta-controller that intervenes directly in the model’s residual stream.
Mathematical Abstraction: Let (h_t) be the internal hidden state (activation) of the base model at time (t). Normally, (h_{t+1} = f(h_t, x_t)). In Internal RL, a high-level controller generates a latent vector (z) (the “internal action”). The update becomes (h_{t+1} = f(h_t + z, x_t)).
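Schematically (and only schematically; this is not the authors’ architecture), the intervention amounts to adding a controller-produced vector to the residual-stream activation before the base model’s next step:

```python
import torch
import torch.nn as nn

d_model = 64                                   # invented model width

base_step = nn.Linear(d_model, d_model)        # stand-in for the frozen base model's step h -> f(h, x)
meta_controller = nn.Linear(d_model, d_model)  # produces the internal action z from the current activation

h = torch.randn(1, d_model)                    # current residual-stream activation h_t
z = meta_controller(h)                         # high-level "internal action"
h_next = torch.tanh(base_step(h + z))          # h_{t+1} = f(h_t + z, x_t), with the token input x_t elided
```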
Emergent Hierarchy: The meta-controller is trained to output these (z) vectors to steer the base model. The base model (pretrained on next-token prediction) already contains “temporally abstract” representations in its deeper layers. The meta-controller learns to tap into these.
- Compression: The meta-controller does not act at every token step. It acts sparsely, effectively compressing long sequences of token-generation into single “internal decisions.”
- Credit Assignment: Because the meta-controller makes fewer decisions (e.g., one decision per 50 tokens), the effective time horizon for the RL algorithm is drastically reduced.
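To see why this shortens the horizon, here is a toy loop (all modules and numbers invented, following the same schematic as above) in which the controller acts once every 50 low-level steps:

```python
import torch
import torch.nn as nn

d_model, k, T = 64, 50, 200                    # one internal decision per k = 50 token steps
base_step = nn.Linear(d_model, d_model)        # stand-in for the frozen base model
meta_controller = nn.Linear(d_model, d_model)

h, z = torch.randn(1, d_model), torch.zeros(1, d_model)
n_decisions = 0
for t in range(T):
    if t % k == 0:                             # the meta-controller acts sparsely
        z = meta_controller(h)
        n_decisions += 1
    h = torch.tanh(base_step(h + z))           # low-level token steps under the current internal action

print(n_decisions)                             # 4 high-level decisions instead of 200 low-level steps
```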
The authors claim that this “Internal RL” allows models to solve sparse-reward tasks (like complex grid worlds or MuJoCo control) that standard token-level RL cannot handle. The “actions” (z) are not human-interpretable subgoals but are mathematically effective directions in the high-dimensional activation space of the transformer (Kobayashi et al. 2025).
See also
- the video presentation of this work.
- Seijin Kobayashi summarizes Kobayashi et al. (2025): “Standard reinforcement learning in raw tokens is a disaster for sparse rewards! Here, we propose 𝗜𝗻𝘁𝗲𝗿𝗻𝗮𝗹 𝗥𝗟: acting on abstract actions emerging in the residual stream representation. A paradigm shift in using pretrained models to solve hard, long-horizon tasks!”
Integration with Planning
Recent work (Rens 2025) has also integrated HRL with explicit planning algorithms like Monte Carlo Tree Search (MCTS). The proposed framework, Hierarchical Goal-Conditioned Policy Planning (HGCPP), uses “High-Level Actions” (HLAs) within MCTS: instead of planning over primitive actions, the tree search plans over sequences of goal-conditioned policies.
A single “plan-tree” is maintained. Nodes represent states, and edges represent the execution of a policy (\pi(s, g)) until the goal (g) is reached. This combines the sample efficiency of HRL (reusing learned policies) with the strategic lookahead of MCTS, allowing agents to solve tasks requiring specific sequences of subtasks (e.g., “find key” (\to) then “open door”).
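A sketch of the plan-tree data structure this description implies (not code from the paper; `State`, `Goal`, and `run_policy_to_goal` are placeholders): nodes hold states, and each edge corresponds to executing a goal-conditioned policy until its goal is reached.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

State = Goal = int   # placeholders for whatever the environment uses

@dataclass
class PlanNode:
    state: State
    children: Dict[Goal, "PlanNode"] = field(default_factory=dict)  # edge = run pi(., g)
    visits: int = 0
    value: float = 0.0

def expand(node: PlanNode, goal: Goal,
           run_policy_to_goal: Callable[[State, Goal], State]) -> PlanNode:
    """One expansion: execute the goal-conditioned policy from node.state towards
    `goal`, and attach the state it terminates in as a child of this node."""
    if goal not in node.children:
        node.children[goal] = PlanNode(state=run_policy_to_goal(node.state, goal))
    return node.children[goal]
```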
Incoming
Hierarchical RL without hierarchies
- Zhou and Kao (2025) Flattening Hierarchies
References
Rens. 2025. “Proposing Hierarchical Goal-Conditioned Policy Planning in Multi-Goal Reinforcement Learning.” In Proceedings of the 17th International Conference on Agents and Artificial Intelligence.
Yang, Bai, Guo, et al. 2023. “Behavior Contrastive Learning for Unsupervised Skill Discovery.” In Proceedings of the 40th International Conference on Machine Learning.
Zhang, Yang, and Stadie. 2021. “World Model as a Graph: Learning Latent Landmarks for Planning.” In Proceedings of the 38th International Conference on Machine Learning.