Published on November 9, 2025 7:08 PM GMT
Condensation: a theory of concepts is a model of concept-formation by Sam Eisenstat. Its goals and methods resemble John Wentworth’s natural abstractions/natural latents research.[1] Both theories seek to provide a clear picture of how to posit latent variables, such that once someone has understood the theory, they’ll say “yep, I see now, that’s how latent variables work!”.
The goal of this post is to popularize Sam’s theory and to give my own perspective on it; however, it will not be a full explanation of the math. For technical details, I suggest reading Sam’s paper.
Brief Summary
Shannon’s information theory focuses on the question of how to encode information when you have to encode everything. You get to design the coding scheme, but the information you’ll have to encode is unknown (and you have some subjective probability distribution over what it will be). Your objective is to minimize the total expected code-length.
Algorithmic information theory similarly focuses on minimizing the total code-length, but it uses a “more objective” distribution (a universal algorithmic distribution), and a fixed coding scheme (some programming language). This allows it to talk about the minimum code-length of specific data (talking about particulars rather than average expectations becomes more central).
Both versions of information theory study compression. Sam’s alternative is called condensation in contrast.
Condensation doesn’t concern itself with the total code-length. Instead, it concerns itself with how easy it is to use the encoding to answer questions about the data. This forces the encoding to be organized in a conceptually meaningful way, rather than compressed together into an uninterpretable blob.
Compression creates density. Condensation forms discrete droplets.
Condensing data creates a system of latent variables, which organize the data into meaningful parts. Sam has proven a theorem showing that different approximately-efficient condensations will posit approximately-isomorphic random variables. IE, two people who condense the data well will have similar “understandings” of the data and will be able to translate between their different conceptual schemes.
Like classical information theory, condensation has both a probabilistic and algorithmic variant. I’ll briefly review classical information theory for the sake of comparison.
Shannon’s Information Theory
Claude Shannon’s notion of information can be justified in the following way:
- The amount of information that can be stored in an information-storage device is, apparently, only a function of the number of distinguishable configurations of the device. If I want to communicate a single letter (26 options), and my communication channel is a knob with 26 settings, then I can come up with a code which allows me to communicate any letter by setting the knob to an appropriate setting. Reversing directions, I can also communicate knob-settings using letters.[2]
- When we take two information-storage devices together, we intuitively add quantities of information; two notebooks can store twice as much information as one.
- Since the number of configurations of two objects taken together is the product of their individual configurations,[3] we can infer that information must be measured as the logarithm of the number of configurations.
This analysis can be further refined by taking the above discussion to apply to the case where all configurations are equally probable. When the probabilities of configurations differ, the information contained in a configuration with probability $p$ is measured as if there were $\frac{1}{p}$ configurations (even if $\frac{1}{p}$ is not a whole number); ie, $\log\left(\frac{1}{p}\right) = -\log(p)$.[4]
Most commonly, we take log base 2, in which case we are measuring information in bits.
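To make the formula concrete (a worked example of my own, using the base-2 convention just mentioned):

$$I(x) \;=\; \log_2\frac{1}{p(x)} \;=\; -\log_2 p(x), \qquad \text{so for } p(x)=\tfrac{1}{8}:\;\; I(x)=\log_2 8 = 3 \text{ bits}.$$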
I find this derivation to be beautiful, but the immense impact of information theory cannot be justified from this alone. The practical power of this theory comes from the relationship between information and codes.
Imagine that Alice is keeping a diary of birds that visit her birdfeeder. She wants to conserve ink, paper, and her own time spent writing the diary; so, she wants her diary entries to be as brief as possible. However, she does want to accurately record the type of each bird, and the time and duration of its visit. Therefore, Alice wants to come up with an efficient encoding scheme for her diary.
If Alice has probabilistic beliefs, then it makes sense for Alice to leverage these beliefs in designing her encoding scheme by assigning shorter codes to more common events. If most birds that visit are starlings, Alice might decide to record only the time and duration of a starling visit. Bird type only needs to be specified for non-starlings, saving Alice time and ink.
Shannon proved that, no matter how clever Alice’s code, the expected code length must be at least the expected information.[5] This gives a lower bound on the compression of information Alice can expect to achieve with her code. Although Shannon did not give good upper bounds, the study of near-optimal codes has continued to progress; these days, we can use arithmetic coding to get extremely close to this lower bound.
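As a concrete sketch of that lower bound (the bird probabilities and the prefix code here are made-up illustrations, not anything from Alice's actual diary):

```python
import math

# Hypothetical distribution over bird types visiting Alice's feeder.
p = {"starling": 0.7, "sparrow": 0.2, "cardinal": 0.1}

# A hand-built prefix code: the most common bird gets the shortest codeword.
code = {"starling": "0", "sparrow": "10", "cardinal": "11"}

entropy = -sum(q * math.log2(q) for q in p.values())    # expected information
expected_length = sum(p[b] * len(code[b]) for b in p)   # expected code length

print(f"entropy:         {entropy:.3f} bits")          # ~1.157 bits
print(f"expected length: {expected_length:.3f} bits")  # 1.300 bits
# Shannon's bound: no code can push the expected length below the entropy.
assert expected_length >= entropy
```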
Thus, there is a tight relationship between probability distributions and efficient codes. In the case of codes written with the alphabet $\{0, 1\}$ (“binary codes”), the number of bits of information corresponds closely to the number of 0s and 1s. (Indeed, this correspondence is so close that the difference is commonly elided; 0s and 1s are bits, in common parlance.)
Given a binary code $c$, I can infer a probability distribution: by imagining the 0s and 1s are heads and tails from coin-flips, I get a probability distribution $p$ such that $p(x) = 2^{-\text{length}(c(x))}$. Given a probability distribution, I can infer a binary code: I can apply arithmetic encoding so that $\text{length}(c(x))$ is very close to $-\log_2 p(x)$.[6]
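A minimal sketch of that two-way correspondence, continuing the toy bird example from above (notation and numbers are mine):

```python
import math

# From a prefix code to its implied distribution: treat each codeword as a
# record of fair coin flips, so a codeword of length L has probability 2**-L.
code = {"starling": "0", "sparrow": "10", "cardinal": "11"}
implied_p = {x: 2 ** -len(w) for x, w in code.items()}
print(implied_p)  # {'starling': 0.5, 'sparrow': 0.25, 'cardinal': 0.25}

# From a distribution to the ideal code lengths an arithmetic coder approaches.
p = {"starling": 0.7, "sparrow": 0.2, "cardinal": 0.1}
ideal_lengths = {x: -math.log2(q) for x, q in p.items()}
print(ideal_lengths)  # fractional "lengths", e.g. starling ~0.515 bits
```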
This means information is a very good proxy for minimal compression, while being much nicer mathematically. Code-lengths can only ever be whole numbers, but information is a continuous function of probability, and is typically fractional (even when probabilities are rational numbers). The length of the optimal code for an event is a complicated combinatorial problem, which depends on the whole probability distribution; the information of an event is simply a function of that one event’s probability. Code-length is a very direct physical quantity, whereas measuring information requires subjective probabilities.
I emphasize this because condensation is going to be a similar sort of abstraction. Optimal codes are a complicated optimization problem; information is a simpler thing which acts like an optimal-or-slightly-better-than-possible code. Similarly, condensation is going to discuss schemes for organizing data, but it discusses them in terms of a bound that’s not always achievable.
Universal Codes
One more idea will be useful: universal codes (the subject of algorithmic information theory). The basic idea: any useful compression needs to be computable – meaning, I can write a computer program which decompresses the code to return the original events.[7] However, computer programs are themselves represented via a binary coding scheme (a programming language). Instead of coding a decoder ($k$ bits) and then coding the compressed data ($n$ bits) and then running the decoder on my compressed data to get the result, I could instead just directly write a computer program which, if run, produces the decompressed data.

This approach will take $n + k$ bits to compress the data. Hence, any programming language[8] is “almost as good as” whatever other code you might come up with: it is at most $k$ bits worse, where $k$ is the description length (in the chosen programming language) of a decompression algorithm for your better code. This constant $k$ is ubiquitous in algorithmic information theory. Since it is a constant, it becomes insignificant (in compression-ratio terms) as the data grows. A special-purpose code will always be better if you have a strong probabilistic model of the data you want to compress, but a programming language is a good general-purpose code if you don’t have a specific probability distribution to model your data.
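In symbols (my notation, matching the $n$ and $k$ above): if $K(x)$ is the length of the shortest program that outputs $x$, and some other computable code $C$ encodes $x$ in $\ell_C(x)$ bits using a decoder that takes $k$ bits to describe, then

$$K(x) \;\le\; \ell_C(x) + k,$$

up to a small additive overhead for combining the decoder with its input.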
In some circumstances, it is conceptually nicer to count the complexity of the coding scheme as well; that is, $n + k$ can be the more meaningful measure than $n$ alone. For example, the Hutter Prize is a competition for compressing Wikipedia. Since the objective is to compress a single file which is known ahead of time and does not change, from the probabilistic perspective, the optimal compression scheme should compress it down to 0 bits (an empty file). However, if we looked at the code for such a compression scheme, we would see that it “cheats” by memorizing the data. (It could be something like “print: “ followed by the Wikipedia text.) Algorithmic information theory penalizes this sort of cheating by counting the length of the decompression program plus the length of the compressed file, whereas probabilistic information theory doesn’t.
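A toy sketch of that accounting difference (deliberately simplified; the data and the programs are made up for illustration):

```python
import zlib

# Some repetitive "data" standing in for the file to be compressed.
data = "the quick brown fox jumps over the lazy dog " * 1000

# "Cheating" scheme: the compressed file is empty, but the decompressor
# simply memorizes the data verbatim.
cheat_program = "print(" + repr(data) + ", end='')"
cheat_score = len(cheat_program) + 0  # program length + compressed length

# Honest scheme: a short, generic decompressor plus a genuinely compressed file.
compressed = zlib.compress(data.encode())
honest_program = (
    "import sys, zlib; "
    "sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))"
)
honest_score = len(honest_program) + len(compressed)

# Counting only the compressed file, the cheat "wins" (0 bytes); counting
# program + file, the cheat pays for every memorized character.
print(cheat_score, honest_score)
```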
For our purposes here, the important thing I want to emphasize is how to translate back and forth between the probabilistic view and the algorithmic view.
- The probabilistic view deals in possibilities, whereas the algorithmic view deals in particulars.
- In Shannon’s information theory, we need to first be uncertain about the contents of a file (have some probability distribution over its contents) in order to then talk about the “information” of a specific string.
- In algorithmic information theory, we assume a universal distribution (ie, the probability distribution implied by a Turing-universal code), so we can talk about the “description length” of a file outright, without explicitly specifying a probability distribution.
- The probabilistic view deals in random variables, whereas the algorithmic view deals in binary strings (sequences of 1s and 0s).
- In the Shannon view, events to be encoded can have some arbitrary type (EG, colors). We then encode those things (usually in binary strings).
- In the algorithmic view, everything is already in binary; we encode things to compress them further, but we’re just transforming one sequence of 1s and 0s to another.
Condensation
Shannon & Weaver’s seminal book introducing information theory was titled The Mathematical Theory of Communication. However, they do call out some ways in which it falls short of a full theory of communication. In particular, they note that it is not a theory of semantics or meaning (concepts which are, informally, strongly associated with the terms “information” and “communication”).
I want to call out a slightly different way in which it falls short.
In my story about Alice keeping a bird-watching journal, Alice’s problem was to record everything faithfully, so that she could reconstruct the entire series of events later. Realistically, however, Alice might want to consult her journal to reconstruct specific events (to answer specific questions). Shannon-style optimal codes do not optimize for this.
When a file is compressed, it typically has to be decompressed before anything else can be done to it. Arithmetic coding, in particular, turns things into an incomprehensible mess of 1s and 0s. Shannon’s notion of optimal codes prioritizes compression over everything else.[9]
Similarly, I think many proponents of algorithmic information theory imagine the shortest program that outputs a target string of data to be something like a beautiful scientific theory of that data, but this need not be the case. It can instead be an incomprehensible mess. The only thing the math talks about is compression.
Sam’s condensation is, roughly, an attempt to put the right constraints on the encoding to force it to be a beautiful scientific theory instead of an unintelligible compressed mess.
Universal Data-Structure?
Shannon’s theory of communication prioritizes compression alone. I claim that natural languages (as well as most artificial languages, EG, programming languages[10]) prioritize other things as well. I said some informal things about comprehensibility. How can we formalize these ideas?
Some time ago, I had the idea that there should be a theory of universal data-structures. Algorithmic information theory has given rise to something like a theory of universal algorithms,[11] but (as taught in a typical undergrad degree) that’s only half of the picture of computer science; the other half is data-structures.
Sam’s condensation is the first thing I’ve seen that feels like progress towards this goal.
Condensation viewed as a universal data-structure:
- You’re given a blob of possibly-disorganized data.
- You have a prior over (sets of) queries (future computations that depend on this data); each query can be any computable function of the data.
- Storage cost of the data is negligible, but retrieval cost matters: given a set of queries, you want to retrieve the minimal amount of data needed to compute the answers.
- Retrieval works by organizing information into “bins” which get tagged with questions the data in the bin is relevant for. When a set of queries is received, all bins with any overlapping tags get retrieved.
- How should we organize the information to minimize retrieval costs?
Sam considers a best-case scenario, where the information can be organized very well (I’ll describe this in more detail later). Under this best-case assumption, the prior does not actually matter: well-organized information can focus on answering each set of questions efficiently, rather than prioritizing one over another.
The main point here is that we are optimizing for answering a variety of questions, rather than just optimizing for efficient reproduction of the whole blob of data, as in Shannon’s picture.
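Here is a toy rendering of that setup in code (my own simplification and made-up names, not Sam's formal definitions): bins of data are tagged with the queries they help answer, and the cost of a query set is the total size of every bin it touches.

```python
# Toy model of the retrieval game described above. Data is organized into
# "bins"; each bin is tagged with the queries it is relevant to. A set of
# queries retrieves every bin sharing a tag with it; the score is the
# total size retrieved.

bins = {
    # bin name: (size in bits, tags = queries this bin is relevant to)
    "shared_background": (300, {"q1", "q2"}),
    "q1_specifics":      (200, {"q1"}),
    "q2_specifics":      (150, {"q2"}),
    "q3_specifics":      (120, {"q3"}),
}

def retrieval_cost(queries: set[str]) -> int:
    """Total size of all bins whose tag set overlaps the query set."""
    return sum(size for size, tags in bins.values() if tags & queries)

print(retrieval_cost({"q1"}))        # 300 + 200 = 500
print(retrieval_cost({"q1", "q2"}))  # shared bin counted once: 650
print(retrieval_cost({"q3"}))        # 120
```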
More generally, I find the idea of optimizing representations for more than just compression highly appealing. It is extremely cognitively plausible that we need to optimize our representations for how they are used. Sam’s picture is one step in this direction, and shockingly, gives us an elegant theory rather than a big mess.
I find the “universal data-structure” picture highly technically motivating,[12] but for the purpose of explaining things more intuitively, I prefer to describe things in terms of organizing notebooks.
Well-Organized Notebooks
Suppose you’re in charge of managing an information-desk. Guests come to the desk with (one or more) questions. The full list of questions you’re obliged to answer is known; anything outside of this list, it is ok if the response is “I don’t know”. However, you can’t get all your clerks to memorize all the answers. Instead, you rely on a set of reference notebooks.
Each reference notebook is “tagged” with one or more questions. Guests ask all their questions at once, before the clerk answers. Once a guest has asked all their questions, the clerk is supposed to retrieve all the notebooks which are tagged with any of the questions.
The clerks can follow instructions in the notebooks perfectly (like a computer), and we don’t care about how much cognitive labor the clerk has to do.
The “score” – the only thing we care about – is retrieval cost. This is just the total amount of text in all the notebooks retrieved to answer the questions. Or you can think of it as the total weight of the pile of notebooks the clerk has to retrieve.
We’ve got a probability distribution over what a guest might ask when they approach the information desk, so we can compute an expected score for a set of notebooks based on how well they organize information. (However, as mentioned earlier, Sam’s theory doesn’t end up being very sensitive to this probability distribution.)
Why would we ever tag a notebook with multiple questions? The idea is that such a notebook holds the common information between those questions. Perhaps you have a notebook about fluid mechanics and a notebook about population dynamics, and both rely on differential equations. You give your differential-equations notebook a superset of the tags of the other two, so that a clerk retrieves it whenever they retrieve either the fluid-mechanics notebook or the population-dynamics notebook. You could instead have duplicated the differential-equations material in both of the other notebooks, but the duplicated information would worsen your score whenever you’re asked about both topics at once. (This is why it is so crucial to Sam’s theory that clerks might handle multiple questions at once, rather than dealing with one question at a time.)
So, the game we’re playing is similar to compressing the answers individually, except we’re incentivized to share information between answers, creating a “library” of useful abstractions.
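A quick worked comparison of those two libraries (the page counts are made up): one library duplicates the differential-equations material into both subject notebooks, the other factors it out into a shared notebook tagged with both topics.

```python
# Toy page counts for the example above (numbers made up).
DIFF_EQ, FLUIDS, POPDYN = 30, 20, 15

# Library A duplicates the differential-equations material into both
# subject notebooks; library B keeps one shared notebook tagged with both.
cost_A = {
    "fluids only": DIFF_EQ + FLUIDS,
    "popdyn only": DIFF_EQ + POPDYN,
    "both":        (DIFF_EQ + FLUIDS) + (DIFF_EQ + POPDYN),  # pays for the duplicate
}
cost_B = {
    "fluids only": DIFF_EQ + FLUIDS,
    "popdyn only": DIFF_EQ + POPDYN,
    "both":        DIFF_EQ + FLUIDS + POPDYN,  # shared notebook retrieved once
}

for question_set in cost_A:
    print(question_set, cost_A[question_set], cost_B[question_set])
# The two libraries tie on single-topic question sets; the shared notebook
# wins whenever both topics are asked about together.
```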
One of the notebooks can have “all the tags” (IE, clerks always retrieve that notebook). This is similar to the $k$-bit decompression program from algorithmic information theory; it tells the clerks what they need to know to interpret any other notebook in the library. (This includes information about how to use notebooks together, not just interpreting them individually.)
Random Variables
At the beginning, I said that condensation was a theory of latent variables, but I haven’t mentioned them so far. That’s because I’ve secretly been describing the algorithmic version of condensation, rather than the probabilistic version. The probabilistic version is what Sam actually describes in his paper. Here’s a translation key:
| Algorithmic | Probabilistic |
|---|---|
| answers we might need to compute / “given strings” | given variables |
| uncondensed data / data from which the answers are to be computed | underlying probability space in which the givens live |
| notebooks / “latent strings” | latent variables |
| tags | contribution function |
| notebook with all the tags / program which says how to interpret the rest of the latent strings in service of computing answers | “top” latent (latent which contributes to all givens) |
| notebooks with one tag | “bottom” latents (latents which contribute to just one given variable) |
| score (length of all strings retrieved) | conditioned-perfect score |
Givens
Since this is a theory of how to postulate latent variables, it might be tempting to interpret Sam’s “given v