Published on September 22, 2025 7:06 PM GMT
tl;dr: I outline my research agenda, post bounties for poking holes in it or for providing general relevant information, and am seeking to diversify my funding sources. This post will be followed by several others, providing deeper overviews of the agenda’s subproblems and my sketches of how to tackle them.
Back at the end of 2023, I wrote the following:
I’m fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)
On the inside view, I’m pretty satisfied with how that is turning out. I have a high-level plan of attack which approaches the problem from a novel route, and which hopefully lets us dodge a bunch of major alignment difficulties (chiefly the instability of value reflection, which I am MIRI-tier skeptical of tackling directly). I expect significant parts of this plan to change over time, as they turn out to be wrong/confused, but the overall picture should survive.
Conceit: We don’t seem on the track to solve the full AGI alignment problem. There’s too much non-parallelizable research to do, too few people competent to do it, and not enough time. So we… don’t try. Instead, we use adjacent theory to produce a different tool powerful enough to get us out of the current mess. Ideally, without having to directly deal with AGIs/agents at all.
More concretely, the ultimate aim is to figure out how to construct a sufficiently powerful, safe, easily interpretable, well-structured world-model.
- “Sufficiently powerful”: contains or can be used to generate knowledge sufficient to resolve our AGI-doom problem, such as recipes for mind uploading or adult intelligence enhancement, or robust solutions to alignment directly.
- “Safe”: not embedded in a superintelligent agent eager to eat our lightcone, and which also doesn’t spawn superintelligent simulacra eager to eat our lightcone, and doesn’t cooperate with acausal terrorists eager to eat our lightcone, and isn’t liable to Basilisk-hack its human operators into prompting it to generate a superintelligent agent eager to eat our lightcone, and so on down the list.
- “Easily interpretable”: written in some symbolic language, such that interpreting it is in the reference class of “understand a vast complex codebase” combined with “learn new physics from a textbook”, not “solve major philosophical/theoretical problems”.
- “Well-structured”: has an organized top-down hierarchical structure, learning which lets you quickly navigate to specific information in it.
Some elaborations:
Safety: The problem of making it safe is fairly nontrivial: a world-model powerful enough to be useful would need to be a strongly optimized construct, and strongly optimized things are inherently dangerous, agent-like or not. There’s also the problem of what had exerted this strong optimization pressure on it: we would need to ensure the process synthesizing the world-model isn’t itself the type of thing to develop an appetite for our lightcone.
But I’m cautiously optimistic this is achievable in this narrow case. Intuitively, it ought to be possible to generate just an “inert” world-model, without a value-laden policy (an agent) on top of it.
That said, this turning out to be harder than I expect is certainly one of the reasons I might end up curtailing this agenda.
Interpretability: There are two primary objections I expect here.
- “This is impossible, because advanced world-models are inherently messy”. I think this is confused/wrong, because there’s already an existence proof: a human’s world-model is symbolically interpretable by the human mind containing it. More on that later.
- “(Neuro)symbolic methods have consistently failed to do anything useful”. I’ll address that below too, but in short, neurosymbolic methods fail because they’re a bad way to learn: it’s hard to traverse the space of neurosymbolic representations in search of the right one. But I’m not suggesting a process that “learns by” symbolic methods; I’m suggesting a process that outputs a symbolic world-model.
Why Do You Consider This Agenda Promising?
On the inside view, this problem, and the subproblems it decomposes into, seems pretty tractable. Importantly, it seems tractable using a realistic amount of resources (a small group of researchers, then perhaps a larger-scale engineering effort for crossing the theory-practice gap), in a fairly short span of time (I optimistically think 3-5 years; under a decade definitely seems realistic).[1]
On the outside view, almost nobody has been working on this, and certainly not using modern tools. Meaning, there’s no long history of people failing to solve the relevant problems. (Indeed, on the contrary: one of its main challenges is something John Wentworth and David Lorell are working on, and they’ve been making very promising progress recently.)
On the strategic level, I view the problem of choosing the correct research agenda as the problem of navigating between two failure modes:
- Out-of-touch theorizing: If you pick a too-abstract starting point, you won’t be able to find your way to the practical implementation in time. (Opinionated example: several agent-foundations agendas.)
- Blind empirical tinkering: If you pick a too-concrete starting point, you won’t be able to generalize it to ASI in time. (Opinionated example: techniques for aligning frontier LLMs.)
I think most alignment research agendas, if taken far enough, do produce ASI-complete alignment schemes eventually. However, they significantly differ in how long it takes them, and how much data they need. Thus, you want to pick the starting point that gets you to ASI-complete alignment in as few steps as possible: with the least amount of concretization or generalization.
Most researchers disagree with most others regarding what that correct starting point is. Currently, this agenda is mine.
High-Level Outline
As I’d stated above, I expect significant parts of this to turn out confused, wrong, or incorrect in a technical-but-not-conceptual way. This is a picture painted with a fairly broad brush.
I am, however, confident in the overall approach. If some of its modules/subproblems turn out faulty, I expect it’d be possible to swap them for functional ones as we go.
Theoretical Justifications
1. Proof of concept. Note that human world-models appear to be “autosymbolic”: able to be parsed as symbolic structures by the human mind in which they’re embedded.[2] Given that the complexity of things humans can reason about is strongly limited by their working memory, how is this possible?
Human world-models rely on chunking. To understand a complex phenomenon, we break it down into parts, understand the parts individually, then understand the whole in terms of the parts. (The human biology in terms of cells/tissues/organs, the economy in terms of various actors and forces, a complex codebase in terms of individual functions and modules.)
Alternatively, we may run this process in reverse. To predict something about a specific low-level component, we could build a model of the high-level state, then propagate that information “downwards”, but only focusing on that component. (If we want to model a specific corporation, we should pay attention to the macroeconomic situation. But when translating that situation into its effects on the corporation, we don’t need to model the effects on all corporations that exist. We could then narrow things down further, to e. g. predict how a specific geopolitical event impacted an acquaintance holding a specific position at that corporation.)
Those tricks seem to work pretty well for us, both in daily life and in our scientific endeavors. It seems that the process of understanding and modeling the universe can be broken up into a sequence of “locally simple” steps: steps which are simple given all preceding steps. Simple enough to fit within a human’s working memory.
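To make “locally simple steps” concrete, here is a toy sketch (all names and numbers are made up for illustration): a chain of small conditional models that drills down from a macro-level state to a single low-level component, without ever modeling that component’s siblings.

```python
import random

# Toy sketch of top-down, "locally simple" prediction:
# macro state -> one corporation -> one employee, each step only
# conditioning on the level directly above it.

def sample_macro_state(rng):
    # High-level state: a single "economic climate" scalar in [-1, 1].
    return rng.uniform(-1.0, 1.0)

def corporation_revenue(macro, rng):
    # Locally simple step: this one corporation's revenue given only the
    # macro state; the thousands of other corporations are never modeled.
    return 100.0 * (1.0 + 0.3 * macro) + rng.gauss(0.0, 5.0)

def employee_workload(revenue, rng):
    # Another locally simple step: one employee's weekly workload given
    # only their corporation's revenue.
    return 40.0 + 0.05 * (revenue - 100.0) + rng.gauss(0.0, 1.0)

rng = random.Random(0)
macro = sample_macro_state(rng)
revenue = corporation_revenue(macro, rng)
workload = employee_workload(revenue, rng)
print(f"macro={macro:+.2f}, revenue={revenue:.1f}, workload={workload:.1f}h/week")
```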
To emphasize: the above implies that the world’s structure has this property at the ground-truth level. The ability to construct such representations is an objective fact about data originating from our universe; our universe is well-abstracting.
The Natural Abstractions research agenda is a formal attempt to model all of this. In its terms, the universe is structured such that low-level parts of the systems in it are independent given their high-level state. Flipping it around: the high-level state is defined by the information redundantly represented in all low-level parts.
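A minimal numerical sketch of that claim (toy model; the variables and coefficients are made up): two low-level variables that share a high-level summary are strongly correlated on their own, but become nearly independent once that summary is conditioned on.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# High-level state Z, redundantly represented in two low-level parts X1, X2.
z = rng.normal(size=n)
x1 = z + 0.5 * rng.normal(size=n)
x2 = z + 0.5 * rng.normal(size=n)

# Unconditionally, the low-level parts are strongly correlated...
print("corr(X1, X2)     =", round(np.corrcoef(x1, x2)[0, 1], 3))

# ...but conditioning on the high-level state removes (almost) all of it.
# "Conditioning" here is done by regressing Z out and correlating residuals.
beta1 = np.cov(x1, z)[0, 1] / np.var(z)
beta2 = np.cov(x2, z)[0, 1] / np.var(z)
r1, r2 = x1 - beta1 * z, x2 - beta2 * z
print("corr(X1, X2 | Z) ~", round(np.corrcoef(r1, r2)[0, 1], 3))
```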
That greatly simplifies the task. Instead of defining some subjective, human-mind-specific “interpretability” criterion, we simply need to extract this objectively privileged structure. How can we do so?
2. Compression. Conceptually, the task seems fairly easy. The kind of hierarchical structure we want to construct happens to also be the lowest-description-length way to losslessly represent the universe. Note how it would follow the “don’t repeat yourself” principle: at every level, higher-level variables would extract all information shared between the low-level variables, such that no bit of information is present in more than one variable. More concretely, if we wanted to losslessly transform the Pile into a representation that takes up the least possible amount of disk space, a sufficiently advanced compression algorithm would surely exploit various abstract regularities and correspondences in the data – and therefore, it’d discover them.
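A crude demonstration of the “compression discovers regularities” point, with an off-the-shelf compressor standing in for the hypothetical “sufficiently advanced” one: data built by reusing a single template compresses far better than unstructured data of the same size, because the shared structure only has to be represented once.

```python
import random
import zlib

rng = random.Random(0)

# Structured data: 200 "observations" that are all noisy copies of one
# template, i.e. most of the information is redundant / shared.
template = "the cat sat on the mat and looked at the "
structured = "".join(template + str(rng.randint(0, 9)) + "\n" for _ in range(200))

# Unstructured data of the same length: independent random characters.
alphabet = "abcdefghijklmnopqrstuvwxyz \n"
unstructured = "".join(rng.choice(alphabet) for _ in range(len(structured)))

for name, data in [("structured", structured), ("unstructured", unstructured)]:
    raw = len(data.encode())
    compressed = len(zlib.compress(data.encode(), level=9))
    print(f"{name:>12}: {raw} bytes -> {compressed} bytes")
```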
So: all we need is to set up a sufficiently powerful compression process, and point it at a sufficiently big and diverse dataset of natural data. The output would be isomorphic to a well-structured world-model.
… If we can interpret the symbolic language it’s written in.
The problem with neural networks is that we don’t have the “key” for deciphering them. There might be similar neat structures inside those black boxes, but we can’t get at them. How can we avoid this problem here?
By defining “complexity” as the description length in some symbolic-to-us language, such as Python.
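Concretely, the intended scoring criterion is an MDL-style two-part code, with the “model” part measured as source length in the symbolic-to-us language. A minimal sketch (zlib stands in for a proper codelength for the residuals; `description_length_bits` and the toy inputs are hypothetical):

```python
import zlib

def description_length_bits(model_source: str, residual_data: bytes) -> float:
    """Crude two-part (MDL-style) score for a candidate world-model.

    Part 1: the length of the model itself, written as Python source --
            this is what keeps the output in a symbolic-to-us language.
    Part 2: whatever the model fails to explain about the data,
            approximated here by compressing the residuals.
    """
    model_bits = 8 * len(model_source.encode())
    residual_bits = 8 * len(zlib.compress(residual_data, level=9))
    return model_bits + residual_bits

# Hypothetical usage: a tiny candidate model that explains the data almost
# perfectly, leaving near-empty residuals.
toy_model_source = "def predict(x):\n    return 2 * x + 1\n"
toy_residuals = bytes(1000)
print(description_length_bits(toy_model_source, toy_residuals), "bits")
```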
3. How does that handle ontology shifts? Suppose that this symbolic-to-us language is suboptimal for compactly representing the universe. The compression process would want to use some other, more “natural” language. It would spend some bits of complexity defining it, then write the world-model in it. That language may turn out to be as alien to us as the encodings NNs use.
The cheapest way to define that natural language, however, would be via the definitions that are the simplest in terms of the symbolic-to-us language used by our complexity-estimator. This rules out definitions which would look to us like opaque black boxes, such as neural networks. Although they’d technically still be symbolic (matrix multiplication plus activation functions), every parameter of the network would have to be specified independently, counting towards the definition’s total complexity. If the core idea regarding the universe’s “abstraction-friendly” structure is correct, this can’t be the cheapest way to define it. As such, the “bridge” between the symbolic-to-us language and the correct alien ontology would consist of locally simple steps.
Alternate frame: Suppose this “correct” natural language is theoretically understandable by us. That is, if we spent some years/decades working on the problem, we would manage to figure it out, define it formally, and translate it into code. If we then looked back at the path that led us to this insight, we would see a chain of mathematical abstractions leading from the concepts we knew at the start (e. g., in 2025) to this true framework, with every link in that chain being locally simple (since each link would need to be human-discoverable). Similarly, the compression process would define the natural language using the simplest possible chain like this, with every link in it locally easy to interpret.
Interpreting the whole thing, then, would amount to: picking a random part of it, iteratively following the terms in its definition backwards, arriving at some locally simple definition that only uses the terms in the initial symbolic-to-us language, then turning around and starting to “step forwards”, iteratively learning new terms and using them to comprehend more terms.
I. e.: the compression process would implement a natural “entry point” for us, a thread we’d be able to pull on to unravel the whole thing. The remaining task would still be challenging – “understand a complex codebase” multiplied by “learn new physics from a textbook” – but astronomically easier than “derive new scientific paradigms from scratch”, which is where we’re currently at.
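A sketch of that interpretation loop, with the world-model treated as a graph of definitions (the terms and their dependencies here are hypothetical): walk backwards from an arbitrary term until you bottom out in the base symbolic-to-us vocabulary, then learn the terms forwards in topological order, so that each definition is locally simple given what you already know.

```python
from graphlib import TopologicalSorter

# Hypothetical world-model: each term is defined using earlier, simpler terms.
# Terms with no dependencies form the "base" symbolic-to-us vocabulary.
definitions = {
    "vector": [],
    "tensor": ["vector"],
    "field":  ["tensor"],
    "flux":   ["field", "vector"],
}

def terms_needed(term: str) -> set[str]:
    """Step backwards: collect every term the given one depends on."""
    needed, stack = set(), [term]
    while stack:
        current = stack.pop()
        for dep in definitions[current]:
            if dep not in needed:
                needed.add(dep)
                stack.append(dep)
    return needed

def learning_order(term: str) -> list[str]:
    """Step forwards: an order in which each definition is 'locally simple',
    i.e. only uses terms you've already learned."""
    relevant = terms_needed(term) | {term}
    subgraph = {t: [d for d in definitions[t] if d in relevant] for t in relevant}
    return list(TopologicalSorter(subgraph).static_order())

print(learning_order("flux"))  # e.g. ['vector', 'tensor', 'field', 'flux']
```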
(To be clear, I still expect a fair amount of annoying messiness there, such as code-golfing. But this seems like the kind of problem that could be ameliorated by some practical tinkering and regularization, and other “schlep”.)
4. Computational tractability. But why would we think that this sort of compressed representation could be constructed compute-efficiently, such that the process finishes before the stars go out (forget “before the AGI doom”)?
First, as above, we have existence proofs. Human world-models seem to be structured this way, and they are generated at fairly reasonable compute costs. (Potentially at shockingly low compute costs.[3])
Second: Any two Turing-complete languages are mutually interpretable, at the flat complexity cost of the interpreter (which depends on the languages but not on the program). As a result, the additional computational cost of interpretability – of computing a translation to the hard-coded symbolic-to-us language – would be flat.
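This is the standard invariance bound from algorithmic information theory, stated here for description length (the analogous runtime claim is that the interpreter’s overhead does not depend on the particular world-model):

```latex
% For any two Turing-complete languages L1, L2 there is a constant
% c_{L1,L2} (the length of an interpreter for L2 written in L1) such that
K_{L_1}(x) \;\le\; K_{L_2}(x) + c_{L_1,L_2} \quad \text{for every object } x.
```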
5. How is this reconciled with the failures of previous symbolic learning systems? That is: if the universe has this neat symbolic structure that could be uncovered in compute-efficient ways, why didn’t pre-DL approaches work?
This essay does an excellent job explaining why. To summarize: even if the final correct output would be (isomorphic to) a symbolic structure, the compute-efficient path to getting there, the process of figuring that structure out, is not necessarily a sequence of ever-more-correct symbolic structures. On the contrary: if we start from sparse hierarchical graphs and add provisions for making it easy to traverse their space in search of the correct graph, we pretty quickly arrive at (more or less) neural networks.
However: I’m not suggesting that we use symbolic learning methods. The aim is to set up a process which would output a highly useful symbolic structure. How that process works, what path it takes there, how it constructs that structure, is up in the air.
Designing such a process is conceptually tricky. But as I argue above, theory and common sense say that it ought to be possible; and I do have ideas.
Subproblems
The compression task can be split into three subproblems. I will release several posts exploring each subproblem in more detail in the next few days (or you can access the content that’d go into them here).
Summaries:
1. “Abstraction-learning”. Given a set of random low-level variables which implement some higher-level abstraction, how can we learn that abstraction? What functions map from the molecules of a cell to that cell, from a human’s cells to that human, from the humans of a given nation to that nation; or from the time-series of some process to the laws governing it?
As mentioned above, this is the problem the natural-abstractions agenda is currently focused on.
My current guess is that, at the high level, this problem can be characterized as a “constructive” version of Partial Information Decomposition. It involves splitting (every subset of) the low-level variables into unique, redundant, and synergistic components.
Given correct formal definitions for unique/redundant/synergistic variables, it should be straightforwardly solvable via machine learning.
Current status: the theory is well-developed and it appears highly tractable.
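To make the intended decomposition concrete, here is a toy two-source sketch. It uses the simple “minimum mutual information” redundancy as a stand-in measure (the agenda would need a better-behaved definition) and plug-in estimates from samples; all function names are made up.

```python
import math
import random
from collections import Counter

def mutual_information(pairs):
    """Plug-in mutual-information estimate (in bits) from (a, b) samples."""
    n = len(pairs)
    p_ab = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2((c / n) / ((p_a[a] / n) * (p_b[b] / n)))
               for (a, b), c in p_ab.items())

def pid_two_sources(x1, x2, y):
    """Two-source decomposition, with redundancy := min(I(X1;Y), I(X2;Y))."""
    i1 = mutual_information(list(zip(x1, y)))
    i2 = mutual_information(list(zip(x2, y)))
    i_joint = mutual_information(list(zip(zip(x1, x2), y)))
    redundant = min(i1, i2)
    unique1, unique2 = i1 - redundant, i2 - redundant
    synergy = i_joint - unique1 - unique2 - redundant
    return {"redundant": redundant, "unique1": unique1,
            "unique2": unique2, "synergy": synergy}

rng = random.Random(0)
x1 = [rng.randint(0, 1) for _ in range(50_000)]
x2 = [rng.randint(0, 1) for _ in range(50_000)]

# Pure synergy: Y = XOR(X1, X2) -- neither source alone tells you anything.
xor = pid_two_sources(x1, x2, [a ^ b for a, b in zip(x1, x2)])
# Pure redundancy: both "sources" are the same variable, and Y copies it.
copy = pid_two_sources(x1, x1, list(x1))
print({k: round(v, 2) for k, v in xor.items()})   # synergy ~1 bit, rest ~0
print({k: round(v, 2) for k, v in copy.items()})  # redundant ~1 bit, rest ~0
```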
2. “Truesight”. When we’re facing a structure-learning problem, such as abstraction-learning, we assume that we get many samples from the same fixed structure. In practice, however, the probabilistic structures are themselves resampled.
Examples:
- The cone cells in your eyes connect to different abstract objects depending on what you’re looking at, or where your feet carry you.
- The text on the frontpage of an online newsletter is attached to different real-world structures on different days.
- The glider in Conway’s Game of Life “drifts across” cells in the grid, rather than being an abstraction over some fixed set of them.
- The same concept of a “selection pressure” can be arrived-at by abstracting from evolution or ML models or corporations or cultural norms.
- The same human mind can “jump substrates” from biological neurons to a digital representation (mind uploading), while still remaining “the same object”.
I. e.,
- The same high-level abstraction can “reattach” to different low-level variables.
- The same low-level variables can change which high-level abstraction they implement.
On a sample-to-sample basis, we can’t rely on any static abstraction functions to be valid. We need to search for appropriate ones “at test-time”: by trying various transformations of the data until we spot the “simple structure” in it.
Here, “simplicity” is defined relative to the library of stored abstractions. What we want, essentially, is to be able to recognize reoccurrences of known objects despite looking at them “from a different angle”. Thus, “truesight”.
Current status: I think I have a solid conceptual understanding of it, but it’s at the pre-formalization stage. There’s one obvious way to formalize it, but it seems best avoided, or only used as a stepping stone.
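To gesture at what “recognizing a known object from a different angle” could look like mechanically, a toy sketch (hypothetical setup): search over a small family of transformations, here just translations of a grid, for one under which the observation matches a template from the library of stored abstractions.

```python
# Toy "truesight": spot a known pattern (a Game of Life glider) in a grid,
# even though it can sit at any offset -- i.e., search over transformations
# (here, only translations) for one that reveals a stored abstraction.

GLIDER = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}  # one glider phase
LIBRARY = {"glider": GLIDER}

def find_known_objects(live_cells, grid_size):
    """Return (name, offset) pairs for every library pattern found in the grid."""
    found = []
    for name, pattern in LIBRARY.items():
        for dr in range(grid_size):
            for dc in range(grid_size):
                shifted = {(r + dr, c + dc) for r, c in pattern}
                if shifted <= live_cells:
                    found.append((name, (dr, dc)))
    return found

# An observed grid: a glider that has "drifted" to offset (4, 7), plus noise.
observation = {(r + 4, c + 7) for r, c in GLIDER} | {(0, 0), (9, 3)}
print(find_known_objects(observation, grid_size=16))  # -> [('glider', (4, 7))]
```

The real problem is of course vastly harder – the space of transformations isn’t a handful of grid translations – but the overall shape of “match against a library, under a search over transformations” is the same.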
3. Dataset-assembly. There’s a problem:
- Solving abstraction-learning requires truesight. We can’t learn abstractions if we don’t have many samples of the random variables over which they’re defined.
- Truesight requires already knowing what abstractions are around. Otherwise, the problem of finding simple transformations of the data that make them visible is computationally intractable. (We can’t recognize reoccurring objects if we don’t know any objects.)
Thus, subproblem 3: how to automatically spot ways to slice the data into datasets whose entries are isomorphic to samples from some fixed probabilistic structure, making them suitable for abstraction-learning.
Current status: basically greenfield. I don’t have a solid high-level model of this subproblem yet, only some preliminary ideas.
Bounties
1. Red-teaming. I’m interested in people trying to find important and overlooked-by-me issues with this approach, so I’m setting up a bounty: $5-$100 for spotting something wrong that makes me change my mind. The payout scales with impact.
Fair warnings:
- I expect most attempts to poke holes to yield a $0 reward. I’m well aware of many minor holes/“fill-in with something workable later” here, as well as the major ways for this whole endeavor to fail/turn out misguided.
- I don’t commit to engaging in-depth with every attempt. As above, I expect many of them to rehash things I already know of, so I may just point that out and move on.
A reasonable strategy here would be to write up a low-effort list of one-sentence summaries of potential problems you see; I’ll point out which seem novel and promising at a glance, and you can expand on those.
2. Blue-teaming. I am also interested in people bringing other kinds of agenda-relevant useful information to my attention: relevant research papers or original thoughts you may have. Likewise, a $5-$100 bounty on that, scaling with impact.[4]
I will provide pointers regarding the parts I’m most interested in as I post more detailed write-ups on the subproblems.
Both bounties will be drawn from a fixed pool of $500 I’ve set aside for this. I hope to scale up the pool and the rewards in the future. On that note…
Funding
I’m looking to diversify my funding sources. Speaking plainly, the AI Alignment funding landscape seems increasingly captured by LLMs; I pretty much expect only the LTFF would fund me. This is an uncomfortable situation to be in, since if some disaster were to befall the LTFF, or if the LTFF were to change priorities as well, I would be completely at sea.
As such:
- If you’re interested and would be able to provide significant funding (e. g., $10k+), or know anyone who’d be interested-and-willing, please do reach out.
- I accept donations, including smaller ones, through Manifund and at the crypto addresses listed at the end of this post.
Regarding target funding amounts: I currently reside in a country with low costs of living, and I don’t require much compute at this stage, so the raw resources needed are small; e. g., $40k would cover me for a year. That said, my not residing in the US increasingly seems like a bottleneck on collaborating with other researchers. As such, I’m currently aiming to build up a financial safety cushion, then immigrate there. Funding would be useful up to $200k.[5]
If you’re interested in funding my work, but want more information first, you can access a fuller write-up through this link.
If you want a reference, reach out to @johnswentworth.
Crypto
BTC: bc1q7d8qfz2u7dqwjdgp5wlqwtjphfhct28lcqev3v
ETH: 0x27e709b5272131A1F94733ddc274Da26d18b19A7
SOL: CK9KkZF1SKwGrZD6cFzzE7LurGPRV7hjMwdkMfpwvfga
TRON: THK58PFDVG9cf9Hfkc72x15tbMCN7QNopZ
Preference: Ethereum, USDC stablecoins.
[1] You may think a decade is too slow given LLM timelines. Caveat: “a decade” is the pessimistic estimate under my primary, bearish-on-LLMs, model. In worlds in which LLM progress goes as fast as some hope/fear, this agenda should likewise advance much faster, for one reason: it doesn’t seem that far from being fully formalized. Once it is, it would become possible to feed it to narrowly superintelligent math AIs (which are likely to appear first, before omnicide-capable general ASIs), and they’d cut years of math research down to ~zero.
I do not centrally rely on/expect that. I don’t think LLM progress would go this fast; and if LLMs do speed up towards superintelligence, I’m not convinced it would be in the predictable, on-trend way people expect.
That said, I do assign nontrivial weight to those worlds, and care about succeeding in them. I expect this agenda to fare pretty well there.
[2] It could be argued that they’re not “fully” symbolic – that parts of them are only accessible to our intuitions, that we can’t break the definitions of the symbols/modules in them down to the most basic functions/neuron activations. But I think they’re “symbolic enough”: if we could generate an external world-model that’s as understandable to us as our own world-models (and we are confident that this understanding is accurate), that should suffice for fulfilling the “interpretability” criterion.
That said, I don’t expect this caveat to come into play: I expect a world-model that would be ultimately understandable in totality.
[3] The numbers in that post feel somewhat low to me, but I think it’s directionally correct.
[4] Though you might want to reach out via private messages if the information seems exfohazardous. E. g., specific ideas about sufficiently powerful compression algorithms are obviously dual-use.
[5] Well, truthfully, I could probably find ways to usefully spend up to $1 million/year, just by hiring ten mathematicians and DL engineers to explore all easy-to-describe, high-reward, low-probability-of-panning-out research threads. So if you want to give me $1 million, I sure wouldn’t say no.