Main
The reinforcement learning (RL) framework in computational cognitive neuroscience has been tremendously successful, largely because RL purportedly bridges behaviour and brain levels of analysis1,2. Model-free RL algorithms track the expected value of a state and update it in proportion to a reward prediction error3; this interpretable computation also accounts for important aspects of dopaminergic signalling and striatal activity4,5. Indeed, extensive research has supported the theory that cortico-striatal networks support RL-like computations for reward-based learning, and that disruption of this network causes predicted deficits in behaviour6,7. In parallel, similar model-free RL algorithms have been broadly and successfully used to explain and capture many aspects of reward-based learning behaviour across species, from simple classical conditioning8 to more complex multi-armed contextual bandit tasks9,10.
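For reference, the model-free update referred to here can be written, in generic notation rather than that of any specific study, as

$$\delta_t = r_t - Q_t(s_t, a_t), \qquad Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha\,\delta_t,$$

where $Q$ is the cached value estimate for state $s_t$ and action $a_t$, $r_t$ is the obtained outcome, $\delta_t$ is the reward prediction error and $\alpha$ is the learning rate.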
However, there is strong evidence that other cognitive processes, supported by separable brain networks, also contribute to reward-based learning11,12. Early research in rodents showed a double dissociation between so-called habits (thought to relate to the RL process) and more goal-directed processes, which are more sensitive to knowledge about the task environment and thus support more flexible behaviour13,14,15. Widely accepted dual-process theories of learning typically capture the slow/inflexible processes with model-free RL algorithms16. However, this apparent consensus hides broad ambiguity and disagreement about what the fast/flexible versus slow/inflexible processes are17. Indeed, recent literature has highlighted multiple processes that strongly contribute to learning. In more complex environments with navigation-like properties, this may entail the use of a map of the environment for forward planning16. Even in simple environments typically modelled with model-free RL, additional processes such as working memory (WM)11, episodic memory18,19 and choice perseveration strategies20 have been found to play an important role. In particular, instrumental learning tasks such as contextual multi-armed bandits rely mostly on WM, with contributions of a slow RL-like process when load exceeds WM resources21,22.
Because the RL family of models is highly flexible3, RL models have nonetheless successfully captured behaviour that is probably driven more by other processes such as WM. Indeed, in most simple laboratory tasks, non-RL processes make very similar predictions to RL ones: for example, perseveration strategies might be mistaken for a learning rate asymmetry in RL23, and WM contributions might be mistaken for high learning rates21. Non-RL processes become identifiable only in environments explicitly designed to disentangle them18,21. The contributions of non-RL processes to learning are thus often attributed to RL computations, and this misattribution may lead to confusion in the literature when findings that rely on RL modelling are mistakenly attributed to RL brain processes24,25.
Here I investigate how much of reward-based instrumental learning actually reflects a model-free RL process, as typically formulated in the literature. Because of the well-characterized and major contributions of WM to instrumental learning, I focus on a task context in which WM’s contribution can be adequately parsed out: the RLWM paradigm21. I parse out WM contributions to learning via WM’s key characteristic, a strong limitation in resources or capacity26, a feature that is not among the typical characteristics of RL processes. I reason that a key characteristic of model-free RL is that it integrates reward outcomes over time to build a cached value estimate that drives policy directly, or indirectly through policy updates (for example, in actor–critic architectures27). More specifically, a negative prediction error in model-free RL should make an agent less likely to repeat the corresponding choice. I thus focus here on how positive (correct, +1) and, more importantly, negative (incorrect, 0) outcomes affect later choices.
Behavioural analysis and computational modelling of seven datasets across two experimental paradigm versions (five previously published and one new for the deterministic version, RLWM; one previously published for the probabilistic version, RLWM-P) show that, when parsing out WM, we cannot detect evidence of RL in reward-based learning. Indeed, predictions including an RL process are falsified28. All behaviour can instead be explained by a mixture of a fast, flexible and capacity-limited process (WM) and a slower, broader process that tracks stimulus–action associations, irrespective of outcomes. Simulations show that neither process on its own can learn a reward-optimizing policy, and thus neither can be considered an RL process3; nonetheless, jointly as a mixture, the two non-RL processes do learn a good policy, supporting flexible human reward-based instrumental learning. These findings call for a reconsideration of how we interpret findings using the RL framework across levels of analysis.
Results
The RLWM task was designed to disentangle the contributions of WM-dependent learning from those of slower, iterative RL processes to reward-based learning by manipulating information load. Across independent blocks, participants learned stable stimulus–action associations between a novel set of stimuli (the set size (ns) ranged from two to six items within participants) and three actions. The correct action for each stimulus was deterministically signalled by correct (or +1) feedback, while the two incorrect actions were signalled with incorrect (or 0) feedback (Fig. 1a). Participants’ behaviour in low set sizes appeared close to optimal, but increasing set size led to increasingly incremental learning curves (Fig. 1b), a pattern replicated across multiple previous studies in diverse populations21,22,24,29,30,31,32,33,34,35,36,37. This pattern was uniquely captured by the RLWM model, a mixture model of two processes representing WM and RL. In this model, the RL process is a standard delta-rule learner, while the WM module has a learning rate of 1 to capture immediate perfect learning but also decays to capture WM’s short timescale of maintenance; the mixture reflects WM resource limitations, such that behaviour is mostly driven by fast and forgetful WM when the load is within WM resources, but is increasingly supplemented by RL as the load increases (Methods). This model included a bias weight, shared between WM and RL, parameterizing asymmetric updating after positive and negative feedback by modulating the learning rates for incorrect versus correct outcomes. Previous model fitting revealed that incorrect outcomes had a weaker impact on subsequent choices than correct outcomes34.
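As a concrete illustration, the following minimal Python sketch implements the core computations just described: a delta-rule RL module, a fast but decaying WM module with a shared bias against negative outcomes, and a capacity-dependent mixture of their policies. Parameter names, initial values and the capacity-to-weight mapping are illustrative only; the fitted models follow the exact parameterization given in the Methods.

```python
import numpy as np

def softmax(q, beta):
    """Convert preferences into a choice probability distribution."""
    e = np.exp(beta * (q - q.max()))
    return e / e.sum()

class RLWM:
    """Minimal, illustrative sketch of the RLWM mixture model."""

    def __init__(self, n_stimuli, n_actions=3, alpha=0.1, bias=0.5,
                 decay=0.1, capacity=3, rho=0.9, beta=8.0):
        self.alpha = alpha      # RL learning rate for correct (+1) outcomes
        self.bias = bias        # down-weighting of learning from incorrect (0) outcomes
        self.decay = decay      # WM decay towards its uniform prior
        self.beta = beta        # softmax inverse temperature
        # WM's policy weight shrinks when set size exceeds capacity
        self.w_wm = rho * min(1.0, capacity / n_stimuli)
        self.Q = np.full((n_stimuli, n_actions), 1.0 / n_actions)  # RL values
        self.W = np.full((n_stimuli, n_actions), 1.0 / n_actions)  # WM weights

    def choose(self, s, rng=np.random):
        """Mixture policy: weighted average of the WM and RL softmax policies."""
        p = (self.w_wm * softmax(self.W[s], self.beta)
             + (1 - self.w_wm) * softmax(self.Q[s], self.beta))
        return rng.choice(len(p), p=p)

    def update(self, s, a, r):
        """Delta-rule updates; WM learns in one shot (learning rate 1) but decays."""
        lr_rl = self.alpha if r == 1 else self.bias * self.alpha
        lr_wm = 1.0 if r == 1 else self.bias
        self.Q[s, a] += lr_rl * (r - self.Q[s, a])
        self.W[s, a] += lr_wm * (r - self.W[s, a])
        self.W += self.decay * (1.0 / self.W.shape[1] - self.W)
```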
Fig. 1: Protocol, behaviour and predictions.
a, RLWM experimental paradigm. Participants performed multiple independent blocks of an RL task, using deterministic binary feedback to identify which of three actions was correct (Cor.) for each of ns stimuli. Varying ns targets WM load and allows me to isolate its contribution21. b, Behaviour (plotted as mean ± standard error) across six datasets on the RLWM task: CF1221, SZ24, EEG31, fMRI30, Dev34 and GL (novel dataset). Top: learning curves showing the probability of a correct action choice as a function of stimulus iteration number, plotted per set size, illustrating a strong set-size effect that highlights WM contributions to behaviour. Bottom: error trial analysis showing the number of previous errors that are the same as the chosen error (purple) or the other possible error (unchosen; cyan) as a function of set size. The large gap in low set sizes indicates that participants avoid errors they made previously more often than other errors; the absence of a gap in high set sizes indicates that participants are unable to learn to avoid their past errors (black arrows). c, Qualitative predictions for the RL, WM and H modules, based on the trial example in a. Only the WM module predicts a set-size effect21. Only the H module predicts that participants are more likely to repeat a previous error (for example, selecting action A1 for the triangle) than to avoid it.
Value and reward integration
To better identify the non-WM, set-size-independent, slower and incremental component of learning (putatively RL) in this task, I first sought to understand how positive and negative outcomes were integrated to impact policy. Specifically, I reasoned that a process learning from reward prediction errors in an RL-like way should use negative feedback in error trials to make participants less likely to repeat mistakes, and more so the more often they had made the same mistake (Methods and Fig. 1c). I thus computed, within error trials, whether the specific error participants made (out of two possible errors for a given stimulus) was indeed the one that they had made less frequently than the other error.
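A schematic implementation of this error-history analysis, assuming trial-level arrays of stimulus identities, chosen actions and correct actions (variable and function names are hypothetical), is sketched below.

```python
import numpy as np

def error_history_counts(stimuli, choices, correct_actions, n_actions=3):
    """For each error trial, count how many times the chosen error and the
    alternative (unchosen) error were previously committed for that stimulus."""
    past_errors = {}                    # (stimulus, action) -> count of past errors
    chosen_counts, unchosen_counts = [], []
    for s, a, c in zip(stimuli, choices, correct_actions):
        if a != c:                      # error trial
            other = next(x for x in range(n_actions) if x not in (a, c))
            chosen_counts.append(past_errors.get((s, a), 0))
            unchosen_counts.append(past_errors.get((s, other), 0))
            past_errors[(s, a)] = past_errors.get((s, a), 0) + 1
    # If negative feedback shapes policy, the chosen error should on average
    # have a lower past-error count than the unchosen error.
    return np.array(chosen_counts), np.array(unchosen_counts)
```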
Across all six datasets in the RLWM task, the number of previous errors was overall lower for the chosen error than for the unchosen error (all t > 4, all P < 10−4; Supplementary Table 1), showing that participants did use negative feedback overall in the task. As expected if participants’ ability to use WM to guide choices decreased with set size, higher set sizes led to an increase in the number of previous errors for both chosen and unchosen errors. The difference between error type numbers, indicating participants’ ability to avoid previously unrewarded choices, decreased with set size, as expected if a slower learning process bore a greater share of responsibility at higher set sizes (all t > 2.28, P < 0.05; Supplementary Table 2). However, I observed in all datasets that the difference decreased strongly (see the blue versus purple curves in Fig. 1b, arrows at ns = 6), such that participants’ policy appeared to become insensitive to negative outcomes selectively at set size ns = 6 in four out of the five datasets that included that set size (Supplementary Table 1). The effect even appeared to reverse in late learning in two datasets (Dev and SZ), such that, in large set sizes, errors committed late in learning had previously been made more often than the other error (all t > 4.4, P < 10−4; Supplementary Table 3), showing error perseveration effects. I note that this pattern of errors cannot be explained simply by increased noise with set size; indeed, an increase in noise sufficient to capture the observed error pattern would lead to much worse learning accuracy.
I compared participants’ patterns of errors to the predictions of four variants of the RLWM model: one treating gains and losses equally in both the WM and RL modules; one with a shared bias34; and the two best-fitting RLWM variants, with no or weak bias against errors in WM and full bias in RL (that is, complete neglect of negative outcomes in the RL module). All models captured the set-size effect on performance, as seen in the qualitative pattern of the learning curves (Fig. 2a), the main effect of chosen versus unchosen errors and the increase in the number of previous errors for both chosen and unchosen errors. The models also predicted that the difference between error type numbers (indicating participants’ ability to avoid previously unrewarded choices) decreased with increasing set size. However, all models predicted that the difference should remain large even in large set sizes (see the blue versus purple curves in Fig. 2a; arrows at ns = 6), contrary to what I observed empirically. In all six datasets, the magnitude of the decrease in the difference between the past numbers of chosen and unchosen errors could not be accounted for by any RLWM model, particularly late in learning (Fig. 2a, bottom, grey curves). Multiple other variants within the family of mixture models with RL and WM modules, relaxing some model assumptions or including other mechanisms, were tested but did not improve fits (Methods and Supplementary Fig. 2).
Fig. 2: Mixture models with WM and H capture errors better than mixture models with WM and RL.
a, Varying the bias parameterization within the RLWM family of models improves fit compared with previous models by better capturing the spread in learning curves (top); however, the models cannot capture the pattern of errors (middle). The difference in past numbers of chosen and unchosen errors in error trials for early (iterations 1–5, black) versus late (iterations 6 and above) parts of learning is not captured by any model (bottom). The models are illustrated on dataset CF12; see Supplementary Information for the other datasets. The dashed lines show the empirical data; the solid lines show the model simulations. b, The winning model WM=H captures the patterns of behaviour better in all six datasets. The spread in learning curves across set sizes is better captured (top). The new model captures the qualitative pattern of errors, such that in large set sizes, participants’ errors do not depend on their history of negative outcomes (middle). The pattern of negative-feedback neglect differs between early (iterations 1–5) and late (iterations 6 and above) parts of learning; the WM=H model captures this dynamic (bottom). The models are indexed by their modules (WM, RL or H; Methods) and the bias term within each module (0 indicates α− = 0; 1 indicates α− = α+; no number indicates a free parameter; = indicates a shared free parameter). The data in all panels are plotted as mean ± standard error; the numbers of individual participants contributing to the plots for each dataset are indicated in Fig. 1.
The new WMH model explains behaviour
The behavioural and modelling results so far showed efficient integration of negative outcomes in low but not high set sizes, supporting the idea that WM uses negative outcomes to guide avoidance in policy, but that the slower, less resource-limited process that supports instrumental learning under higher loads does not. However, even with an RL negative learning rate α− = 0, RLWM models could not capture the pattern, because WM contributes to choices even in high set sizes where its contribution is diminished. Further variants of the RLWM model family, including those with policy-compression mechanisms, could not reproduce the qualitative pattern (Supplementary Fig. 6). I reasoned that the slow process should, to a degree, counteract WM’s ability to learn to avoid errors from negative outcomes. I thus explored a family of models in which the slow module’s association weights (Q values for RL) were updated with a subjective outcome r0 in place of the objective negative outcome r = 0. Surprisingly, the best-fitting model across the six datasets (Fig. 3) was a model with fixed r0 = 1, such that receiving incorrect feedback led to the same positive prediction error as correct feedback would. Negative learning rates still included a bias term shared across both modules. Note that this slow module can no longer be interpreted as an RL module, as the association weights track the relative frequency of stimulus–action choices, irrespective of outcomes, rather than an estimated value, and consequently the module cannot learn a good policy on its own. This module can be thought of as an associative ‘Hebbian’ or ‘habit-like’ module; I thus label it the H agent, and the resulting mixture model WMH. While it is similar to a choice perseveration kernel38, note that it is not purely motor but stimulus-dependent; indeed, all models also include a motor choice perseveration mechanism capturing participants’ tendency to repeat actions across trials.
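In the delta-rule notation used above, the H module’s update can be sketched as follows (the shared bias parameterization of the negative learning rate and any additional mechanisms follow the Methods):

$$H_{t+1}(s_t, a_t) = H_t(s_t, a_t) + \alpha_{r_t}\bigl(1 - H_t(s_t, a_t)\bigr), \qquad \alpha_{r=1} = \alpha, \quad \alpha_{r=0} = b\,\alpha,$$

so that both correct ($r = 1$) and incorrect ($r = 0$) outcomes push the weight of the chosen action towards 1, and the module tracks how often each action has been chosen for each stimulus rather than its expected value.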
Fig. 3: Quantitative model comparison confirms a better fit for the WM=H than the WMRL family of models.
The top row shows individual (dots) and group mean AIC (± standard error), baselined to the group mean best model; the bottom row shows the proportion of participants best fit by each model. Both measures show that the WM=H model fits best in all datasets. r0 indicates a free parameter for the 0 outcome in RL; C indicates the use of policy compression. Results from models that can be interpreted as WMH are highlighted in pink and RLWM in brown. The numbers of individual participants contributing to the plots for each dataset are indicated in Fig. 1.
The WMH model fit quantitatively better than models with RL and WM (Fig. 3; see also Methods and Supplementary Fig. 3 for further models considered). It was also successful at producing the qualitative pattern of errors observed in real participants, such that errors at high set sizes appeared to fully neglect negative outcomes, a pattern that RLWM models could not reproduce (Fig. 2b, bottom; see Supplementary Fig. 6 for full validation of all models in Fig. 2a in all datasets). I further verified that this pattern of errors changed dynamically over the course of learning in participants in a way that the model could capture (Fig. 2b, bottom).
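For reference, the quantitative comparison in Fig. 3 uses the Akaike information criterion, computed per participant as

$$\mathrm{AIC} = 2k - 2\ln \hat{L},$$

where $k$ is the number of free parameters and $\hat{L}$ is the maximized likelihood of the participant’s choices under the model; lower values indicate a better penalized fit.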
WMH also explains behaviour in a probabilistic reward learning task
While the RLWM task was useful for adequately factoring out WM contributions to reward-based learning, a downside is that the task does not necessitate the integration of reward in the same way probabilistic tasks do6. I thus sought to confirm whether my findings would hold in a probabilistic version of the task, RLWM-P; to that end, I reanalysed a previously published dataset (see ref. 22, experiment 3). As previously reported, behaviour in this task was sensitive to set size (F1,33 = 55.99, P < 0.001; Fig. 4b), indicating that WM contributes to learning even in probabilistic environments thought to be more suited to eliciting RL-like computations. Similar to the deterministic task, I modelled behaviour with a mixture of two processes: a process capturing WM characteristics of set-size dependence and fast forgetting, and a process capturing the slower, non-forgetful and non-capacity-limited features (Methods). As in the previous datasets, the WM process was modelled with RL-like equations; however, it is important to note that this process does not correspond to standard RL assumptions owing to its strong capacity limitation. I compared mixture models in which the slow process was either RL-like (that is, integrating negative outcomes differently from positive ones; RLWM) or association-like (that is, integrating negative outcomes similarly to positive outcomes). Supporting the previous results, the best model was a WMH model including a fast, WM-like process that integrated negative outcomes, as well as an outcome-insensitive, slower-learning component (Fig. 4a and Supplementary Fig. 10). This WMH model also fit better than the best single-process model and captured the qualitative pattern of learning curves (Fig. 4b, third panel from the left).
Fig. 4: Results replicate in a probabilistic learning task.
a, Model comparison showing the results from a family of models manipulating the subjective value, r0, of outcome 0 for RL, WM or both, with r0 a free parameter unless labelled with its fixed value. r0 = 0 corresponds to standard RL or WM computations; r0 = 1 corresponds to an H agent that handles both outcomes similarly. Highlighted in pink are agents that can be interpreted as WMH and in brown those that correspond to RL mixtures. The winning model, RLr0 = 1; WMr0 = 0, is thus a WMH agent, replicating the findings in the deterministic version of the task. I further verified that the winning model was better than the best single-process model, WMf (Methods). The data are plotted as individual (dots) and group mean AIC (± standard error), baselined to the group mean best model; the right plot shows the proportion of participants best fit by each model. b, A set-size effect was also observed in a probabilistic version of the task; the winning model (third from the left) captures the learning curve pattern better than the competing models. The error bars indicate the standard error of the mean across n = 34 individual participants (dots in a).
RL-like policy with a simpler H algorithm
My results show that behaviour that is typically modelled with RL algorithms appears instead to be generated by non-RL processes, including a fast, forgetful and capacity-limited process that integrates outcome valence, and a slow, resource-unlimited H process that encodes association strengths between stimuli and actions, irrespective of outcome valence. This leaves two questions open: what is the computational function of this slow process, and why is it mistaken for value-based RL, for example in previous RLWM modelling21,37? Indeed, on its own, the slow H process cannot learn a good policy but only tends to repeat previous actions, and thus seems functionally maladaptive. To investigate this question, I simulated both RLWM and WMH models in a standard probabilistic two-armed bandit task, varying the probability p of a reward for the correct choice (Fig. 5, left, and Methods). RL policies track this value and thus converge to a graded policy in which the agent is more likely to select the correct bandit at higher values of p (green curve in Fig. 5, right). By contrast, an H agent on its own performs at chance, regardless of p (blue curve in Fig. 5, right; mixture weight of the WM module ρWM = 0). However, when the agent’s choices involve a mixture of policies, including a WM policy that tracks outcomes, the policy learned by the H agent does resemble a standard RL policy (dark blue curves). Indeed, even with low WM weights (for example, ρWM = 0.5), WM’s contribution is enough to bootstrap choice selection of the good option, which leads the H agent to select this action more often and thus develop a good policy. This simulation shows that, in the absence of specific task features decorrelating the contributions of rewards from the contributions of errors to behaviour (such as the ability to consider multiple errors, something not feasible in most binary choice tasks), the contributions of an H agent might be mistaken for an RL policy. Furthermore, in this mixture context, which probably corresponds to most human learning, I observe that the H agent does implement an adaptive policy with a simpler learning rule than the RL process.
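A minimal sketch of such a simulation, with simplified update rules and illustrative parameter values rather than the exact fitted parameterization, is shown below.

```python
import numpy as np

def simulate_wmh_bandit(p_reward=0.8, rho_wm=0.5, beta=8.0, alpha=0.1,
                        decay=0.2, n_trials=200, seed=0):
    """Illustrative WMH mixture agent on a two-armed bandit in which arm 0
    is rewarded with probability p_reward and arm 1 with 1 - p_reward."""
    rng = np.random.default_rng(seed)
    H = np.zeros(2)   # outcome-insensitive association weights (H module)
    W = np.zeros(2)   # fast, forgetful, outcome-sensitive weights (WM-like module)
    correct = 0
    for _ in range(n_trials):
        pol_h = np.exp(beta * H) / np.exp(beta * H).sum()
        pol_w = np.exp(beta * W) / np.exp(beta * W).sum()
        p = rho_wm * pol_w + (1 - rho_wm) * pol_h   # mixture policy
        a = rng.choice(2, p=p)
        r = float(rng.random() < (p_reward if a == 0 else 1 - p_reward))
        H[a] += alpha * (1.0 - H[a])   # reinforces the chosen action regardless of outcome
        W[a] += r - W[a]               # one-shot outcome tracking (learning rate 1)
        W *= 1 - decay                 # fast forgetting
        correct += (a == 0)
    return correct / n_trials          # proportion of correct choices

# With rho_wm = 0 the H agent alone stays at chance; with rho_wm > 0 the
# mixture learns to prefer the better arm, and so does the H module.
```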
Fig. 5: H agents learn to mimic an RL policy when WM contributes to guiding choice.
Left: I simulated RLWM (top) or WMH (bottom) mixture agents on a simple probabilistic two-armed bandit task. Right: the policy learned by the H agent (bottom) resembles an RL policy (top) when there is enough WM contribution to choices. I varied the parameters ρ (indicating the contribution of the WM module) and β (indicating the noise in the softmax policy). The error bars indicate the standard error of the mean across n = 1,000 simulations.
Discussion
I analysed six previously published datasets and one new dataset to investigate how different processes contribute to reward-based learning in humans. Such learning had previously been explained with model-free RL algorithms, which use a cached value estimate integrating past reward outcomes for given stimuli and actions to guide decisions. Behavioural analyses provided strong evidence across six datasets that the integration of outcomes to guide future decisions is dependent on load and becomes weak or absent at higher set sizes. My findings were present not only in healthy young adults but also in children aged 8–18, in patients with schizophrenia and in matched healthy older adults, emphasizing the robustness of the findings across diverse populations. Computational modelling revealed that this pattern could be explained only by a mixture model with two very distinct processes. The first, a WM-like process that learns fast but is limited in both the amount of information it can hold and the duration over which it can hold it, appeared to successfully integrate reward outcomes into its policy. By contrast, a second, slower but less limited process appeared to fully neglect outcomes, updating in the same direction for wins and losses, and thus only tracked association strengths, in what could be likened to a Hebbian or habit-like process (H agent).
Although reward-based learning is, at first glance, well approximated by model-free RL algorithms, neither of these processes corresponds to what is typically thought of as an RL cognitive process. The fast (WM) process integrates outcome values into a policy as an RL algorithm should, but it has properties not typically associated with RL, such as capacity limitations and rapid forgetting. By contrast, the slow, unlimited H process is more in line with what is typically thought of as RL along those dimensions, but it does not rely on reward-prediction errors (and indeed does not approximate values) as is typically expected from model-free RL algorithms in the context of cognitive neuroscience3,39. These processes also cannot, individually, be thought of as RL agents, in the typical sense of an algorithm that attempts to derive a policy that optimizes future reward expectations: on its own, the WM process can learn such a policy only under very minimal loads, while the H agent cannot learn such a policy at all.
I showed with simulations that the H agent, despite a learning rule that is, on its own, unsuited to learning from rewards, is nonetheless able to develop appropriate policies within a mixture model context. Indeed, using WM to bootstrap adaptive choice selection leads the agent to more frequently select actions avoiding bad outcomes, which further enables it to select good actions and reinforce them. This agent is mathematically equivalent to a stimulus-dependent choice perseveration kernel, which has been found to improve fit in other learning models16,38,40, but is here considered an integral part of the learning mechanism rather than a low-level nuisance factor. In this way, my approach is reminiscent of the ‘habits without value’ model41,42, which showed similar properties of developing good policies without value tracking. Here, my model extends the same theoretical approach to a stimulus-dependent learning context, and I experimentally validated the usefulness of this approach across seven datasets. The H agent uses a simpler learning rule to learn a similar policy to an RL agent in a mixture context, which might be a more resource-rational way to achieve adaptive behaviour.
An important question concerns the generalizability of this finding to other learning tasks. Is it possible that the RLWM task, with deterministic feedback, incites participants to de-activate RL-like processes? While this is a possible explanation, I think it is unlikely. First, RL is not typically thought to be under explicit meta-control but rather to occur implicitly in the background43,44; thus, it is unclear why this would not be the case here. Second, computational modelling supports similar conclusions in the probabilistic version, RLWM-P, where integrating reward outcomes over multiple trials is useful, and H-like perseveration kernels have been found to improve fit in other probabilistic learning tasks16,40. Third, similar conclusions, using different methods, have very recently been drawn in different instrumental learning tasks in humans45. I limited my investigation here to the RLWM experimental framework because it offers a solid grounding for factoring out explicit WM processes and analysing what remains. However, an important future research direction is to find experimental and modelling approaches that will better allow us to parse out different processes, including WM, from learning behaviour, and to probe the generalizability of this finding to other instrumental tasks typically modelled with RL. A promising direction will be to systematically manipulate factors that decorrelate choice and reward history, allowing their separate contributions as well as their interactions to be investigated46,47.
Another important question concerns the interpretation of the concept of RL across behaviour, algorithmic models and the brain mechanisms underlying the processes identified through modelling of behaviour. RL is a broadly used term, and ambiguity in its use across researchers can lead to confusion25,48,49. A reason for the success of model-free RL frameworks is their ability to map onto brain mechanisms in striato-cortical loops with dopaminergic signalling, including, for example, RL reward-prediction errors in striatal neural signals50 (Supplementary Fig. 12). If learning from reward in humans appears RL-like to a first approximation but actually reflects two non-RL processes, how can we reconcile this with a wealth of RL-model-based neuroscience findings? I consider multiple possible explanations.
One possibility is that most human reward-based learning tasks tap into WM processes that are, at first approximation, well described by RL (as here in the RLWM-P dataset), such that the striatal circuits support a more cognitive, explicit version of RL than typically assumed; in parallel, the H agent might reflect Hebbian cortico-cortical associations6. Inde