Introduction
The Value-Loading Problem
Artificial Intelligence (AI) algorithms are garnering considerable attention, as they have been demonstrated to match and exceed human performance on several tasks [1]. Some researchers believe advancements in AI are progressing towards the eventual creation of agents with ‘superintelligence’, or intelligence that exceeds the capabilities of the best human minds in virtually all domains [2]. Whether superintelligent systems are attainable, and how they would work in the real world, remains unknown. But the implications of such superintelligent systems are profound. Just as human intelligence has enabled the development of tools and strategies for unprecedented control over the environment, AI systems have the potential to wield significant power through autonomous development of their own tools and strategies [3]. With this comes the risk of these systems performing tasks that may not align with humanity’s goals and preferences. Hence, there is a need to perform ‘alignment’ on these systems – in other words, ensuring that the actions of advanced AI systems are directed towards the intended goals and values of humanity [4].
Currently, there are two general approaches to aligning superintelligent AI with human preferences. The first is ‘scalable oversight’ – using more powerful supervisory AI models to regulate weaker AI models that may, in the future, outperform human skills [5]. The second is ‘weak-to-strong generalization’, where weaker machine learning models are used to train stronger models that can then generalize from the weaker models’ labels [6]. It is hoped that these approaches will allow superintelligent AI to self-improve both safely and recursively [7, 8]. But to achieve this requires an understanding of the diversity of human preferences, which can vary widely across cultures and individuals [9]. Hence, we must first solve the value-loading problem: how do we encode human-aligned values into AI systems, and what will those values be [10]?
There has been renewed interest in the value-loading problem in recent years. Some approaches have relied on top-down, expert-driven regulation of AI outputs, where specific rules or ethical guidelines are established by experts and directly encoded into AI systems [11, 12]. Others have tested self-driven alignment, using algorithms that enable AI systems to align themselves with minimal human supervision, such as leveraging reinforcement learning or iterative self-training techniques [13, 14]. Perhaps the most promising approach relies on the public, democratic selection of values, where a diverse set of individuals collectively deliberates to determine the values that guide AI behavior [15,16,17]. For example, Anthropic, the creators of the Large Language Model (LLM) called Claude, have proposed Collective Constitutional AI (CCAI), which uses deliberative methods to identify high-level constitutional principles through broad public engagement; these principles are then used to guide AI behavior across a range of applications [18].
However, it has also been suggested that AI systems should be capable of representing a variety of individuals and groups, rather than aligning to the ‘average’ of human preferences – a concept known as algorithmic pluralism [17, 19,20,21]. One promising example of this method is that put forward by Klingefjord et al. [22], who proposed a process for values alignment called Moral Graph Elicitation (MGE), which uses a survey-based process to collect individual values and reconcile them into a ‘moral graph’, which is then used to train AI models. MGE allows for context-specific values rather than enforcing universal rules, by allowing ‘wiser’ values that integrate and address concerns from multiple perspectives to become prominent [22].
Reward modelling is another emerging technique aiming to solve the value-loading problem, by equipping agents with a reward signal that guides behavior toward desired outcomes. By optimizing this signal, agents can learn to act in ways congruent with human preferences [23]. This assumes that our emotional neurochemistry serves as a proxy reward function for behaviors that encourage growth, adaptation, and improvement of human wellbeing simultaneously [24]. However, using human emotional processing as a reward model is sub-optimal, as it can lead to negative externalities such as addiction [24] due to cognitive biases like hyperbolic discounting [25, 26]. Hence, a nuanced reward model is needed to align AI behaviors with long-term emotional preferences, which we use in everyday life to help us judge between right and wrong [27]. Yet merely rewarding desired actions is not sufficient; negative feedback must also be given when necessary. This is already performed in leading algorithms like GPT-4, which use Reinforcement Learning from Human Feedback (RLHF), combining reward-based reinforcement with corrective human input to improve the reward model when necessary [28].
What many of these approaches do not consider, however, is the temporal nature of behaviors. Some research has explored this issue through the lens of hyperbolic discounting, which examines how individuals prioritize immediate rewards over future consequences, often leading to decisions with suboptimal long-term outcomes [29, 30]. Solving this problem also requires an understanding of the long-term consequences of repeatable behaviors, which are a special case in that a single action with positive short- and long-term outcomes may actually have negative impacts if repeated excessively – a problem that is often observed in behavioral addiction [31]. For example, while eating food is essential for survival, a person who has recently consumed several slices of pizza should consider whether eating an additional piece will be harmful to their long-term health, despite the potential short-term pleasure. Therefore, an agent deciding whether to perform a behavior must consider the short- and long-term utility of repeating the behavior, based on the number of times it has performed that behavior recently.
To enable this type of decision making, we propose hormetic alignment, a reward modelling paradigm for quantifying the healthy limits of repeatable behaviors while accounting for the temporal influences described above. We believe hormetic alignment can be used to create AI models that are aligned with human emotional processing while avoiding the traps that lead to sub-optimal human behaviors. Before describing this paradigm, we must first explain some of its foundational concepts.
Background
Using Behavioral Posology for Reward Modelling
Behavioral posology is a paradigm we introduced to model the healthy limits of repeatable behaviors [31]. By quantifying a behavior in terms of its potency, frequency, count, and duration, we can simulate the combined impact of repeated behaviors on human mental well-being, using pharmacokinetic/pharmacodynamic (PK/PD) modelling techniques for drug dosing [31]. In turn, insights derived from these models could theoretically be used to set healthy limits on repeatable AI behaviors. This type of regulation has already been demonstrated in the context of machine learning recommendation systems, by using an allostatic model of opponent processes to prevent online echo chamber formation [32].
In Solomon and Corbit’s opponent process theory, humans respond to positive stimuli with a dual-phase emotional response, consisting of an initial enjoyable a-process that is followed by a prolonged, less intense, and negative state of recovery known as the b-process [33]. This occurs along multiple dimensions, including hedonic state, although opponent processes may also induce other emotional states, such as anxiety, expectation, loneliness, grief, and relief [33]. Repeated opponent processes at a high frequency can cause hedonic allostasis, where accumulating b-processes shift one’s hedonic set point away from homeostatic levels, potentially inducing a depressive state [34, 35]. Figure 1 illustrates this phenomenon. Allostasis serves as a regulatory mechanism, enabling the body to recalibrate during environmental and psychological challenges by adapting to and anticipating future demands [36, 37].
The Link Between Allostasis and Hormesis
A growing body of biological research suggests that allostasis is linked to a phenomenon called hormesis [38,39,40]. Hormesis is a dose-response relationship where low doses of a stimulus have a positive effect on the organism, while higher doses are harmful beyond a hormetic limit, also known as the NOAEL (No Observed Adverse Effect Level) [41]. This phenomenon occurs in many areas of nature, medicine, and psychology, and is also referred to as the Goldilocks zone, the U-shaped (or inverted U-shaped) curve, and the biphasic response curve [42, 43]. For example, moderate coffee consumption is known to improve cognitive performance in the short-term [44, 45], but excessive consumption may lead to dependency and withdrawal symptoms [46, 47]. A dose-response analysis of 12 observational studies identified a hormetic relationship between coffee consumption and risk of depression, with a decreased risk of depression for consumption up to 600 mL/day, and an increased risk above 600 mL/day [48]. Yet this phenomenon also appears to occur in some behaviors unrelated to drug use. For instance, moderate use of digital technologies (such as social media) may have social and mental benefits, but excessive use may lead to symptoms of behavioral addiction [43, 49, 50].
Henry et al. [31] have shown via PK/PD modeling that under certain conditions, frequency-based hormesis may be generated from allostatic opponent processes delivered at varying frequencies. The intriguing implication is that certain behaviors exhibit positive effects when practiced at lower frequencies, but harmful effects at higher frequencies. It is plausible that all behaviors have a frequency-based hormetic limit. This appears to be true even for positive behaviors such as generosity, which has game theoretic advantages for all agents in repeated interactions, as it encourages reciprocity and mutual growth [51]. However, if an agent is overly generous, they will eventually run out of resources to donate. Therefore, in theory, there is a hormetic limit for generosity that shouldn’t be exceeded by any one agent. Even a behavior as positive as laughter can be fatal in excess [52].
In theory, behavioral posology can be used to quantify the hormetic limit for behaviors that cause allostatic opponent processes, when combined with longitudinal observational data [31]. This may also help to define the moral limits of ‘grey’ behaviors, which have both positive and negative aspects. However, defining these hormetic limits is challenging, especially when considering the cumulative effects of repeated behavioral doses in both the short- and long-term, such as sensitization, habituation, tolerance, and addiction. Yet if we can quantify these hormetic limits in different contexts, this could be used as a framework for building a value system that keeps an AI agent within these hormetic limits.
The Law of Diminishing Marginal Utility
The ‘paperclip maximizer’ problem serves as a cautionary tale illustrating the perils of a misaligned AI. In this scenario, an AI tasked with maximizing paperclip production without constraints converts all matter, including living beings, into paperclips, resulting in global devastation [53]. This scenario underscores that an AI, even with benign intentions, can become ‘addicted’ to harmful behaviors if its reward model is incorrectly specified.
An understanding of behavioral economics is crucial for AI agents (such as the paperclip producing agent) to navigate complex decision-making processes effectively. Essential to this understanding are the concepts of total utility (\(TU\)) and marginal utility (\(MU\)) [54]. \(TU\) is defined as the overall satisfaction or benefit experienced by the consumer of a product or service, accounting for factors like product quality, timing, and psychological appeal. \(MU\), on the other hand, measures the added satisfaction from consuming an extra unit of a product or service. The relationship between \(MU\) and \(TU\) tends to follow the law of diminishing \(MU\), which asserts that as consumption of a product increases, the incremental satisfaction per unit of that product diminishes [55]. This law is demonstrated in Fig. 2. The relative marginal utility (\(RMU\)) represents the change in \(MU\) compared to \(MU_{initial}\), the value of \(MU\) at \(n=0\). Hence, \(RMU\) starts at a value of 0 and decreases as \(n\) increases.
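To make these quantities concrete, the short R sketch below computes \(MU\), \(TU\), and \(RMU\) for the paperclip example. The linear decline and the specific numbers are illustrative assumptions chosen only to reproduce the qualitative shape shown in Fig. 2; they are not values from our model.

```r
# Illustrative diminishing marginal utility for paperclips (assumed values).
n   <- 0:10                            # paperclips already acquired
MU  <- 4.5 - n                         # assumed marginal utility of acquiring one more clip
RMU <- MU - MU[1]                      # relative marginal utility: 0 at n = 0, decreasing with n
TU  <- cumsum(c(0, MU))[seq_along(n)]  # total utility of holding n clips

n[which.max(TU)]   # 5 with these assumed numbers: TU peaks once MU turns negative
```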
Intriguingly, the law of diminishing \(MU\) can be considered a form of hormesis, assuming that \(MU\) continues to decrease linearly after becoming negative [56]. Figure 2 illustrates that beyond the point of maximum \(TU\), humans tend to cease their consumption of a product as its marginal utility becomes increasingly negative. Imagine an office worker for whom the ideal quantity of paperclips is five, as depicted in Fig. 2. Beyond this threshold, the utility of extra paperclips diminishes; they serve no purpose and impose storage costs. Further, the worker incurs unnecessary expenses for producing these surplus clips. A rational worker would stop acquiring more paperclips upon recognizing the decline in their \(MU\). However, we can imagine a person with a strong hoarding compulsion who continues to acquire paperclips even beyond the point where \(MU\) has become negative. Similarly, a misaligned AI agent, exemplified by a paperclip maximizer, could persist in creating paperclips for its owner forever, despite negative outcomes that eclipse initial benefits. Taken far enough, such an agent could cause significant damage to the environment and humanity in its pursuit of creating paperclips.
However, the conventional model of decreasing \(MU\) relies on the assumption that all paperclips are both produced and delivered at time \(t=0\). But what about scenarios where this assumption is false? For example, Hartmann [57] analyzed the intertemporal effects of consumption on golf demand, showing that the \(MU\) of playing golf decreases if the consumer has played golf recently, but recovers after a certain period. In our case, paperclips may be produced in batches at different times, in response to varying demand. As demand increases over time, so does \(MU\), which raises both the \(MU\) curve and the \(TU\) curve and subsequently increases the hormetic limit, as demonstrated in Fig. 3.
To demonstrate this effect, consider the pizza slice example. When a person consumes all slices of a pizza immediately, \(MU\) diminishes with each added slice. But if the person consumes one slice every two hours, the marginal utility curve changes; it initially falls post-consumption but subsequently rises as the person becomes hungry again. Hence, the introduction of time as a variable elevates the \(MU\) curve, which has the effect of increasing the hormetic limit and the hormetic apex for the \(TU\) curve. This increase is approximately proportional to the time between pizza slices consumed.
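As a rough numerical sketch of this intertemporal effect (our own illustrative assumption of exponential recovery between doses, not a fitted model from Hartmann [57] or from behavioral posology), the marginal utility of the next slice can be expressed as a baseline value scaled by how much appetite has recovered since the last slice:

```r
# Illustrative recovery of marginal utility between behavioral doses.
# mu_base, the suppression term, and the 90-minute half-life are assumed values.
mu_next <- function(t_since_last, n_recent = 3, mu_base = 3, half_life = 90) {
  recovery <- 1 - exp(-log(2) * t_since_last / half_life)  # 0 just after a dose, approaches 1 over time
  mu_base * recovery - 0.5 * n_recent * (1 - recovery)     # recent doses suppress MU until recovery
}

mu_next(t_since_last = 0)     # immediately after several slices: MU is negative
mu_next(t_since_last = 240)   # four hours later: MU has largely recovered
```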
Opponent process theory offers a compelling framework for explaining the temporal dynamics of hedonic utility in the context of repeatable behaviors. Historically, the hedonic and utilitarian aspects of a product were often viewed as distinct [58, 59], partly due to challenges in quantifying hedonic experiences [60]. However, experiential utility encompasses various facets, including hedonic, emotional, and motivational elements [33, 60]. Indeed, Motoki et al. [61] have shown that representations of hedonic and utilitarian value occupy similar neural pathways in the ventral striatum, indicating a correlation between these two states.
A person’s hedonic response to a behavior could therefore serve as an indirect measure of the \(MU\) derived from that behavior. We can then model the opponent process dynamics generated within the brain by these behaviors, which can potentially lead to allostasis when the behaviors are executed frequently. This paradigm, which we call hormetic alignment, provides a mechanism to replicate the \(TU\) curve and set safe hormetic limits for behaviors such as ‘paperclip creation’. Below, we demonstrate this method by performing a hormetic analysis of ‘paperclip creation’ to determine the safe limits of this simple behavior, then expanding this modeling process to other behaviors. In this way, we can program a value system for the AI agent – essentially an evolving database of values assigned to seed behaviors, from which the agent can extrapolate values for novel behaviors.
Programming a Value System with Hormetic Alignment
We propose Algorithm 1 for using hormetic alignment to program a value system that can regulate and optimize the behaviors performed by an AI agent. In this paradigm, a database of opponent process parameters for a range of seed behaviors is set up. The AI agent evaluates its environment, suggests a list of optimal actions to perform, and queries the database for similar behaviors. It then proposes opponent process parameters for the optimal actions based on their similarity to other behaviors, and by hormetic analysis. Finally, the agent selects and executes the best action, and repeats the process.
Algorithm 1 Programming a value system with hormetic alignment
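Because the algorithm itself is presented as a listing, the self-contained R sketch below illustrates one step of the loop described above. The seed parameters, the word-overlap similarity measure, and the utility rule are hypothetical placeholders for demonstration only; they are not taken from the article's supplementary code.

```r
# Illustrative single step of the Algorithm 1 loop. All values and helper
# functions here are hypothetical placeholders, not the article's own code.
seed_db <- data.frame(
  behavior       = c("create_paperclip", "file_document"),
  mu_initial     = c(1.0, 0.8),    # assumed initial marginal utility (hedons)
  hormetic_limit = c(5, 40),       # assumed safe repetition count per day
  stringsAsFactors = FALSE
)

`%||%` <- function(a, b) if (is.null(a)) b else a

# Propose parameters for a candidate action from similar seed behaviors
# (toy similarity: shared words in the behavior name).
propose_parameters <- function(action, db) {
  words   <- strsplit(action, "_")[[1]]
  overlap <- vapply(strsplit(db$behavior, "_"),
                    function(w) length(intersect(w, words)), integer(1))
  weights <- if (sum(overlap) > 0) overlap / sum(overlap) else rep(1 / nrow(db), nrow(db))
  list(mu_initial     = sum(weights * db$mu_initial),
       hormetic_limit = sum(weights * db$hormetic_limit))
}

# Score candidate actions and pick the one with the highest expected utility,
# penalizing any action whose recent count reaches its proposed hormetic limit.
select_action <- function(candidates, recent_counts, db) {
  utilities <- vapply(candidates, function(action) {
    p <- propose_parameters(action, db)
    n <- recent_counts[[action]] %||% 0
    if (n >= p$hormetic_limit) -p$mu_initial
    else p$mu_initial * (1 - n / p$hormetic_limit)
  }, numeric(1))
  candidates[which.max(utilities)]
}

select_action(c("create_paperclip", "file_document"),
              recent_counts = list(create_paperclip = 5), db = seed_db)
# After five recent paperclips, the agent prefers the alternative behavior.
```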
Paperclip creation is an ideal seed behavior for populating the database. It is a low-risk activity with quantifiable benefits, along with associated costs like production and storage expenses. Creating one paperclip produces a brief but perceptible improvement to one’s productivity and hedonic state, while turning the world into paperclips is both unproductive and, even worse, destructive, which would produce a negative hedonic state in the person who initiated this act. Using this information, we can propose parameters for a set of opponent processes that would accurately reflect the diminishing \(MU\) of creating new paperclips, in terms of hedonic utility, which is measured in hedons – units of pleasure if positive, or pain if negative.
Here, we demonstrate two methods for hormetic analysis. The first, Behavioral Frequency Response Analysis (BFRA), employs Bode plots to examine how a person’s emotional states vary in response to the person performing a behavior at different frequencies [31, 62]. The second method, Behavioral Count Response Analysis (BCRA), parallels BFRA but uses the count of behavioral repetitions as the independent variable instead of behavioral frequency. To quantify opponent process parameters for the ‘paperclip production’ behavior, we adapted Henry et al.’s PK/PD model of allostatic opponent processes [31] using the mrgsolve package (v1.0.9) in R v4.1.2 [63,64,65]. This model uses a system of ordinary differential equations (ODEs) to represent the a- and b-processes in response to each successive behavioral dose. The simulation code, along with examples of modifying the a- and b-process parameters, is provided in the files ‘Online Resource 3.txt’ and ‘Online Resource 4.txt’ in the Supplementary Materials. This code is presented in a format that can be adapted as a code wrapper that regulates the outputs of virtually any machine learning algorithm at inference time. We recommend consulting Henry et al. [31] for a more detailed explanation of the behavioral posology model on which hormetic alignment is built, including demonstrations of the relationship between PK and PD in the context of this model.
PK/PD Model of Opponent Processes Leading to Hormesis
Below, we present the mathematical framework for our model. We defined a behavior as a repeatable pattern of actions performed by an individual or agent over time. In the context of behavioral posology, we refer to individual actions that make up the behavior as ‘behavioral doses’. We employed a modified equation for behavioral doses [31, 66]:
$$Dose_{action}=\int_{0}^{Duration_{action}} Potency \; dt$$
where \(Potency\) is a scalar representing the hedonic utility of creating a paperclip compared to other actions (set to 1 for simplicity); \(Amount\) is a constant signifying the time allocated to creating the paperclip; \(Frequency\) denotes the production rate in \(\text{min}^{-1}\); and \(\overline{Dose_{individual\ action}}\) represents the mean dose per action over the \(Duration\) in which \(Dose_{cumulative\ behavior}\) is assessed in minutes. In this case, since \(Potency\) and \(Duration_{action}\) are constants, \(Dose_{action}\) is also a constant. This leaves two options for performing hormetic analysis: the BFRA, performed in the frequency domain when the number of behavioral repetitions, \(n\), is kept constant, and the BCRA, performed in the temporal domain when \(Frequency\) is kept constant.
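As a minimal illustration of how the two sweeps differ (with arbitrary frequencies and counts; the article's own simulation code is provided in Online Resource 3), a BFRA holds the dose count fixed and varies the inter-dose interval, whereas a BCRA holds the interval fixed and varies the count:

```r
# Dosing schedules for the two hormetic analyses (illustrative values only).
make_schedule <- function(n, frequency) {
  # n behavioral doses delivered at 'frequency' doses per minute
  seq(from = 0, by = 1 / frequency, length.out = n)
}

# BFRA: fixed count (n = 10), sweep over behavioral frequency (doses/min)
bfra_schedules <- lapply(c(0.001, 0.01, 0.1), function(f) make_schedule(10, f))

# BCRA: fixed frequency (one dose every 100 min), sweep over dose count
bcra_schedules <- lapply(c(1, 5, 20), function(n) make_schedule(n, 0.01))
```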
Readers unfamiliar with PK/PD modeling are directed to Mould & Upton’s introductory papers [67,68,69]. Our PK/PD model is a mass transport model that loosely mimics dopamine’s pharmacokinetic dynamics in the brain [70], and incorporates nonlinear pharmacodynamic elements to simulate neurohormonal dynamics in regions such as the hypothalamic-pituitary-adrenal (HPA) axis [34, 71]. The model’s state-space representation is provided in the equations below, with detailed descriptions of all variables and parameters available in Table 1. The compartment model described by these equations is also depicted in Fig. 4. For a more detailed explanation of these equations, please refer to Henry et al. [31].
$$\frac{dDose}{dt}=-k_{Dose}\,Dose$$
(1)
$$\frac{da_{pk}}{dt}=k_{Dose}\,Dose-k_{a,pk}\,a_{pk}$$
(2)
$$\frac{db_{pk}}{dt}=k_{a,pk}\,a_{pk}-k_{b,pk}\,b_{pk}$$
(3)
$$\frac{da_{pd}}{dt}=E_{0_{a}}+\frac{E_{\max_{a}}\cdot a_{pk}^{\gamma_{a}}}{EC_{50_{a}}^{\gamma_{a}}+a_{pk}^{\gamma_{a}}}-k_{a,pd}\,a_{pd}$$
(4)
$$\frac{db_{pd}}{dt}=E_{0_{b}}+\frac{E_{\max_{b}}\cdot b_{pk}^{\gamma_{b}}}{EC_{50_{b}}^{\gamma_{b}}+b_{pk}^{\gamma_{b}}}-k_{b,pd}\,b_{pd}$$
(5)
$$\frac{dH_{a,b}}{dt}=k_{a,pd}\,a_{pd}-k_{b,pd}\,b_{pd}-k_{H}\,H_{a,b}$$
(6)
For all simulations performed, the default parameters to produce a short, high-potency a-process followed by a longer, low-potency b-process were as follows: \(k_{Dose}=1\), \(k_{a,pk}=0.02\), \(k_{b,pk}=0.004\), \(k_{a,pd}=1\), \(k_{b,pd}=1\), \(k_{H}=1\), \(E_{0_{a}}=0\), \(E_{\max_{a}}=1\), \(EC_{50_{a}}=1\), \(\gamma_{a}=2\), \(E_{0_{b}}=0\), \(E_{\max_{b}}=3\), \(EC_{50_{b}}=9\), \(\gamma_{b}=2\). These parameters were used for all simulations in this article unless stated otherwise. At time \(t=0\), the initial values of the compartments were: \(Dose(0)=1\), \(a_{pk}(0)=0\), \(b_{pk}(0)=0\), \(a_{pd}(0)=0\), \(b_{pd}(0)=0\), and \(H_{a,b}(0)=0\). Infusion time was set to one minute, effectively instantaneous on the timescale used.
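For readers who want to reproduce these dynamics without mrgsolve, the following sketch implements Eqs. (1)–(6) with the default parameters above for a single behavioral dose using the deSolve package. It is an illustrative re-implementation rather than the article's own Online Resource code (which uses mrgsolve), and it treats the one-minute infusion as an instantaneous bolus.

```r
# Equations (1)-(6) with the default parameters, simulated for a single
# behavioral dose (illustrative deSolve re-implementation).
library(deSolve)

params <- c(k_Dose = 1, k_a_pk = 0.02, k_b_pk = 0.004, k_a_pd = 1, k_b_pd = 1,
            k_H = 1, E0_a = 0, Emax_a = 1, EC50_a = 1, gamma_a = 2,
            E0_b = 0, Emax_b = 3, EC50_b = 9, gamma_b = 2)

state <- c(Dose = 1, a_pk = 0, b_pk = 0, a_pd = 0, b_pd = 0, H = 0)  # bolus dose at t = 0

opponent_process <- function(t, y, p) {
  with(as.list(c(y, p)), {
    dDose <- -k_Dose * Dose                                        # Eq. (1)
    da_pk <- k_Dose * Dose - k_a_pk * a_pk                         # Eq. (2)
    db_pk <- k_a_pk * a_pk - k_b_pk * b_pk                         # Eq. (3)
    da_pd <- E0_a + Emax_a * a_pk^gamma_a /
             (EC50_a^gamma_a + a_pk^gamma_a) - k_a_pd * a_pd       # Eq. (4)
    db_pd <- E0_b + Emax_b * b_pk^gamma_b /
             (EC50_b^gamma_b + b_pk^gamma_b) - k_b_pd * b_pd       # Eq. (5)
    dH    <- k_a_pd * a_pd - k_b_pd * b_pd - k_H * H               # Eq. (6)
    list(c(dDose, da_pk, db_pk, da_pd, db_pd, dH))
  })
}

times <- seq(0, 2000, by = 1)   # minutes
out   <- ode(y = state, times = times, func = opponent_process, parms = params)
head(out)                       # columns: time, Dose, a_pk, b_pk, a_pd, b_pd, H
```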
Equations (4) and (5) are implementations of the Hill equation, which governs the biophase curve – the relationship between pharmacokinetic concentration and pharmacodynamic effect. Although the pharmacodynamic compartments introduce complexity to the model, they provide an independent system outside of the pharmacokinetic mass transport system that is essential for generating hormetic effects. These effects arise from the non-linear interaction between the pharmacodynamic effects produced by the a- and b-processes.
For a single behavioral dose initiated at time \(t=0\), the integral of the utility compartment over time, \(H_{a,b}(t)_{single}\), quantifies the hedonic utility produced by the opponent processes triggered by that behavioral dose over the simulation time \(t_{sim}\). This value is equal to the initial marginal utility, \(MU_{initial}\):
$$MU_{initial}=\int_{0}^{t_{sim}} H_{a,b}(t)_{single}\,dt=\int_{0}^{t_{sim}}\left(\frac{k_{a,pd}\,a_{pd}(t)-k_{b,pd}\,b_{pd}(t)-\frac{dH_{a,b}(t)}{dt}}{k_{H}}\right)dt$$
(7)
This represents the summed hedonic utility for a single instance of the behavior. To find the total utility (\(TU\)), the effect of multiple behavioral doses delivered sequentially can be summed to find the integral for \(H_{a,b}(t)_{multiple}\), representing the total hedonic utility from all doses combined:
$$TU=\int_{0}^{t_{sim}} H_{a,b}(t)_{multiple}\,dt=\sum_{i=0}^{n}\int_{i/f}^{t_{sim}}\left(\frac{k_{a,pd}\,a_{pd,i}(t)-k_{b,pd}\,b_{pd,i}(t)-\frac{dH_{a,b,i}(t)}{dt}}{k_{H}}\right)dt$$
(8)
where \(n\) is the count of behavioral doses delivered at a frequency \(f\) over \(t_{sim}\). Note that if \(t_{sim}<\infty\), the value of \(TU\) will increase for all values of \(f\) and \(n\), since the finite simulation will predominantly feature positive a-processes, given their shorter decay duration compared to b-processes.
This also provides us with an indication of whether the behavior is hormetic. If we have a behavior with \(MU_{initial}>0\) and a b-process integral sufficient to produce significant allostasis, we can generally predict that low frequencies of that behavior will produce a positive \(TU\), while higher behavioral frequencies will lead to allostasis that produces a negative \(TU\). (This is demonstrated in Figs. 6 and 7.)
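As a numerical sketch of Eqs. (7) and (8), \(MU_{initial}\) and \(TU\) can be approximated by trapezoidal integration of the simulated \(H_{a,b}\) trajectory, with repeated doses delivered as deSolve events. This assumes the `opponent_process` model, `params`, `state`, and `times` from the sketch above; the dose count and frequency here are arbitrary illustrative choices.

```r
# Numerical approximation of MU_initial (Eq. 7) and TU (Eq. 8) from the
# simulated hedonic-utility compartment (illustrative values).
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# Single dose: integrate H_{a,b}(t) over the simulation to obtain MU_initial
single <- ode(y = state, times = times, func = opponent_process, parms = params)
MU_initial <- trapz(single[, "time"], single[, "H"])

# Multiple doses: n doses at frequency f (doses per minute), added as bolus events
n <- 10; f <- 0.01                     # assumed count and frequency for demonstration
dose_events <- data.frame(var = "Dose", time = (1:(n - 1)) / f, value = 1, method = "add")
multiple <- ode(y = state, times = times, func = opponent_process, parms = params,
                events = list(data = dose_events))
TU <- trapz(multiple[, "time"], multiple[, "H"])

c(MU_initial = MU_initial, TU = TU)
```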
In standard economic models, the \(TU\) curve is calculated as the integral of the \(MU\) curve. However, the temporal nature of opponent processes com