In a seminal but underappreciated book titled Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability, Marcus Hutter attempted a mathematical formulation of universal artificial intelligence, shortened to AIXI. This article aims to make AIXI accessible to data scientists, technical enthusiasts and general audiences both conceptually and formally.
We begin with a brief overview of the axioms of probability theory. Subsequently we delve into conditional probability, whose calculation is governed by Bayes’s theorem. While Bayes’s theorem provides the framework for updating beliefs, it leaves open the question of how to assign priors. To address this, we turn to algorithmic information theory, which connects Kolmogorov complexity, defined as the length of the shortest program that outputs a string, with the assignment of Bayesian priors. The bridge between these two ideas is the Solomonoff prior, also known as the universal prior. The universal prior provides us with the necessary scaffolding to explore the AIXI formalism, which integrates sequential decision theory, the Solomonoff prior and Occam’s Razor. In the final section, we briefly discuss the limitations of AIXI and alternative approaches to formalizing a universal agent, while acknowledging that the term “universal agent” carries significant philosophical ambiguity. In particular, we discuss Active Inference as a philosophical alternative to AIXI, wherein the former models an embodied predictive agent, whereas the latter models a disembodied utility-maximization algorithm.
Probability Axioms
The Kolmogorov axioms define a probability space as a triple (Ω, 𝒜, 𝓟), where Ω is the total sample space, 𝒜 the collection of subsets of Ω representing the events of interest, and 𝓟 the function that assigns to each event a probability in the unit interval.
- If A ∈ 𝒜, then 𝓟(A) ≥ 0
- If A, B ∈ 𝒜 and A ∩ B = ∅, then 𝓟(A ∪ B) = 𝓟(A) + 𝓟(B)
- 𝓟(Ω) = 1

The first axiom, non-negativity, ensures that probabilities are meaningful as real-valued measures of belief or frequency. The second, additivity, formalizes the principle that the probability of disjoint outcomes is the sum of their individual probabilities. And the third, normalization, ensures that the total probability assigned to the entire sample space equals one.
However, while the probability axioms specify the structural rules of probability, they do not prescribe how probabilities should be updated in light of new evidence. In this sense, the Kolmogorov framework is analytic and a priori: it defines what a probability measure must satisfy, but not how such a measure is revised through evidence. To move from static probability assignments to dynamic inference, we need a way to relate new data to existing hypotheses, namely conditional probability. Frequentist statistics addresses this epistemic gap by interpreting conditional probabilities through long-run frequencies of repeated events, typically under the assumption that such events are independently and identically distributed (i.i.d.). Bayes’s theorem, by contrast, provides a normative rule for updating beliefs about hypotheses in light of new evidence, which is especially useful when observations arrive incrementally or the sample space is not well defined.
Bayesian Inference
First formalized by Thomas Bayes, an English Presbyterian minister, Bayes’ theorem follows directly from the definition of conditional probability. Once we understand how conditional probability is calculated, Bayes’ theorem can be derived through a few algebraic operations. Let’s recall how we compute conditional probability:
Conditional probability formula
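In symbols, the standard definition reads:

```latex
P(H \mid D) = \frac{P(H \cap D)}{P(D)}
```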
This states that the probability of the hypothesis H given the evidence D is computed as the joint probability of the hypothesis and the evidence divided by the probability of the evidence.
Why would we compute conditional probability this way? Let’s walk through it by way of an example. The probability that it rained (the hypothesis) given that the ground is wet (the data) assumes that these two events are dependent. If they were independent, the probability of their joint occurrence would simply be the product P(H) · P(D), because independence means P(H|D) = P(H) and P(D|H) = P(D): learning that the ground is wet would tell us nothing about whether it rained. Notice what we’ve just asserted: for independent events, P(H∩D) = P(H) · P(D). That is, the probability of the intersection of independent events is the product of their individual probabilities.
But how is the intersection P(H∩D) of dependent events computed? Set-theoretically, the joint probability is defined as the intersection of the two sets:
In order to understand the proportions of the sample spaces involved, we can visualize the conditional probability we seek to calculate as follows:
But in practice we almost never know the joint probability in advance. This is where a simple algebraic step helps us express the joint probability in terms of quantities we can estimate. We multiply both sides of the equation by the denominator to solve for the joint probability P(H∩D):
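Carrying out that step gives:

```latex
P(H \cap D) = P(H \mid D) \cdot P(D)
```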
Now conversely, if we wanted to compute the probability that the ground is wet given that it rained, our conditional probability formula would be the following:
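In symbols:

```latex
P(D \mid H) = \frac{P(D \cap H)}{P(H)}
```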
The same transformation gives us:
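```latex
P(D \cap H) = P(D \mid H) \cdot P(H)
```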
Notice that the joint probability of the two events is the term both conditional probabilities share. Since intersection is symmetric, P(H∩D) = P(D∩H), so the two equations express the same quantity, and we get the following crucial equality:
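Setting the two expressions for the joint probability equal to each other:

```latex
P(H \mid D) \cdot P(D) = P(D \mid H) \cdot P(H)
```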
Therefore, if we want to test the hypothesis that “it rained” given that the “ground is wet”, we can rearrange this equality to obtain Bayes’ formula:
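Dividing both sides by P(D) yields the familiar form:

```latex
P(H \mid D) = \frac{P(D \mid H) \cdot P(H)}{P(D)}
```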
In Bayesian terminology, we refer to P(H|D) as the posterior (namely, the probability we wish to ascertain), P(H) as the prior, P(D|H) as the likelihood, and P(D) as the marginal.
This conventional nomenclature is important when dealing with Bayesian statistics. The likelihood supplies the conditional probability of the data under the hypothesis (providing the values used to update our beliefs), whereas the marginal normalizes the posterior so that it sums to one over the sample space of the data.
Since we need values for all the terms in Bayes’ formula, a major hurdle in Bayesian statistics is how best to assign them. In particular, specifying the prior can be challenging since we don’t always have the requisite knowledge in advance. One strategy is to use a uniform prior, assigning the same probability to every possible outcome, in line with the Laplacian principle of indifference. Another is to use an informative prior, namely a prior that aims to approximate the actual probability distribution of the event. In our example, this could be a Poisson distribution over daily rainfall.
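To make the update concrete, here is a minimal Python sketch of Bayes’ theorem applied to the rain example; all the numbers are illustrative assumptions, not data from any source:

```python
# Hypothetical values, chosen only to illustrate the update.
p_rain = 0.3               # prior P(H): probability it rained
p_wet_given_rain = 0.9     # likelihood P(D|H)
p_wet_given_dry = 0.2      # likelihood P(D|not H)

# Marginal P(D) via the law of total probability
p_wet = p_wet_given_rain * p_rain + p_wet_given_dry * (1 - p_rain)

# Posterior P(H|D) via Bayes' theorem
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(f"P(rain | wet ground) = {p_rain_given_wet:.3f}")  # ≈ 0.659 with these numbers
```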
As we shift from viewing hypotheses as fixed values to treating them as random variables, the goal becomes to infer the full posterior distribution rather than a single estimate. Accordingly, we move from treating hypotheses as point estimates toward statistical inference over random variables with corresponding probability distributions. To do this, we model the prior P(H) and the likelihood P(D|H) as probability distributions and compute the marginal probability of the data P(D), obtained either as a sum (over a probability mass function for discrete variables) or as an integral (over a probability density for continuous variables). These components allow us to apply Bayes’ theorem to obtain the posterior distribution P(H|D).
The law of total probability (the marginal) expressed as a sum over the hypothesis space for discrete variables.
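For a discrete hypothesis space:

```latex
P(D) = \sum_{i} P(D \mid H_i)\, P(H_i)
```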
For a discrete variable, the probability mass function (PMF) assigns a probability to each value, and these probabilities sum to one; for a continuous variable, the probability density function (PDF) assigns densities whose integral over the whole range equals one. For continuous variables we integrate because the distribution ranges over uncountably many possible values. Below is the formula for the marginal for continuous variables, namely the law of total probability expressed with a probability density:
The law of total probability (the marginal) expressed as an integral over the hypothesis space for continuous variables.
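For a continuous hypothesis space, with p(H) now a density over hypotheses:

```latex
P(D) = \int P(D \mid H)\, p(H)\, dH
```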
Bayesian statistics forms an alternative framework to the more established frequentist approach in statistics. Even though its historical roots are as deep as the frequentist formulation, computational intractability limited its adoption for much of the twentieth century. With advances in computational power, Bayesian methods have undergone rapid development and increased application. Today, Bayesian statistics plays a central role in machine learning (ML), particularly probabilistic modeling, uncertainty estimation and decision-making under uncertainty.
Kolmogorov Complexity
We saw that the Kolmogorov axioms supplied an analytic framework for probability, one that defines the probability of a union of disjoint events as the sum of their probabilities. But the axioms didn’t tell us how to compute joint or conditional probabilities. For this we invoked Bayes’ theorem, which uses the intersection of events to derive a general formula for conditional probability.
However, in our explication of Bayesian probability we identified the assignment of priors as a problem for the framework: what information should we encode in the prior? Should we make it indifferent, as per the principle of indifference, or make it informative?
This is where the notion of Kolmogorov complexity comes in. While Kolmogorov complexity does not itself entail probability, through the Coding Theorem (which we will explicate below) it encodes an a posteriori meta-theoretic assumption as a mathematical selection bias. This meta-theoretic generalization states that simplicity encodes higher probability. If we are faced with the datum or outcome that the ground is wet, which hypothesis from the available store of all possible hypotheses should we select? Intuitively we want the hypothesis that assigns the highest probability to the observed outcome. But without additional information, how are we to know which hypothesis maximizes the probability of the outcome? Kolmogorov answered this within the bounds of algorithmic information theory: the simplest hypothesis is the one that encodes the least information, namely the one expressible by the shortest program.
In order to understand the motivation behind this, let’s first state the problem within algorithmic information theory, then circle back to its application within less abstract, real-world scenarios.
In algorithmic information theory, we encode symbolic sequences or strings in some base, typically base-2 binary strings. We define a Universal Turing Machine U as a partial function (partial because U(p) is undefined for programs that never halt) from a program p to an output x, i.e. U(p) = x. Think of this as loosely analogous to f(x) = y. The program p represents the hypothesis or theory, whereas the output x represents the data or evidence. This mapping is important for understanding the intuitive thrust of the theory.
The Kolmogorov complexity of an information object is defined as the length of the shortest program that outputs that object, where K(x) measures that length in bits:
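Formally:

```latex
K(x) = \min_{p \,:\, U(p) = x} |p|
```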
This expression tells us that out of all the programs p that produce x as output, we select the shortest one, namely the one with minimal |p|. Kolmogorov complexity is defined over finite binary strings x ∈ {0,1}*.
What we now need to do is connect Kolmogorov complexity to probability theory so that it can inform our Bayesian priors. To do this, we notice a connection, superficial at first, between Kolmogorov complexity and Shannon information entropy. Both quantify some measure of information content: K(x) is a length in bits, whereas the entropy H measures the average amount of information required to encode the outcomes of a random variable, where information is understood as uncertainty and quantified as the expected value of -log P(x) over all possible outcomes, in bits. The greater the uncertainty, the more bits are required on average to encode the event. Both K(x) and H(X) are measured in bits, so what’s the difference?
K(x) describes the length in bits of the shortest program that outputs the string x, whereas H(X) computes the number of bits needed on average to encode an outcome drawn from the probability distribution over the sample space of X. It seems like some deep connection must hold between these two measures. What, then, is the relationship between Kolmogorov complexity and Shannon entropy? We need a bridge from raw program lengths to probabilities.
If we isolate a single outcome from the Shannon distribution, we can define it as the self-information of x, the output of our program:
Self-information of a single probability outcome
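In bits:

```latex
I(x) = \log_2 \frac{1}{P(x)} = -\log_2 P(x)
```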
This says that the self-information (think of it as the entropy measure for a single outcome) is the log-inverse of the probability of x occurring. I(x) is the per-outcome quantity whose expectation over the full distribution defines Shannon entropy:
Shannon entropy defines the expected self-information for the entire distribution as average entropy.
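That is:

```latex
H(X) = \mathbb{E}\big[I(X)\big] = -\sum_{x} P(x)\, \log_2 P(x)
```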
Now, the Coding Theorem states that the Kolmogorov complexity of a string is approximately equal to the Shannon-style information content of that string under its algorithmic probability.
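Writing m(x) for the algorithmic probability of x (the universal prior introduced in the next section), a standard statement of the theorem is:

```latex
K(x) = -\log_2 m(x) + O(1)
```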
This states that the shortest program that outputs x is approximately as long as the log-inverse of the total probability of the programs that output x. In other words, under this distribution the most probable output is the one with the shortest program! We have now connected raw program length with probability theory: the more compressible an output is, the more likely it is to occur.
This is how we connect algorithmic compressibility, namely program length, to probability and information theory, enabling us to treat compressibility as a Bayesian prior weight. As a side note, the reason we don’t have an exact equality above is that the relationship holds only up to an additive constant c, which depends on the choice of Universal Turing Machine (UTM), making K(x) machine-dependent up to that constant:
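The invariance theorem expresses this bound: for any two universal machines U and V there is a constant c (independent of x) such that

```latex
\big| K_U(x) - K_V(x) \big| \;\le\; c_{U,V} \quad \text{for all } x
```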
Now you’re probably wondering, what kind of distribution enables us to assign probabilities to all possible program lengths? That distribution is the Solomonoff Universal Prior.
Solomonoff Induction
As we discussed, the choice of prior affects the posterior, especially when sample sizes are small. This raises the question: what if we had a prior that could be applied to every possible event in the sample space? This is what Solomonoff’s prior encodes. Precisely, the Solomonoff prior encodes the probability of observing an output sequence x as the total probability that a randomly chosen program outputs x when run on a universal Turing machine.
Now, let’s take a look at Solomonoff’s universal prior formula, which should click into place our earlier assertion that algorithmic probability is intimately connected with simplicity. Solomonoff defined the universal prior P(x) as the sum of the probabilities of all finite, prefix-free binary programs p that output x, where the probability of p is determined by its simplicity, 2^(-|p|).
The Universal Solomonoff Prior
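Using the article’s notation, with U a universal Turing machine, one standard rendering is:

```latex
P(x) = \sum_{p \,:\, U(p) = x} 2^{-|p|}
```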
Because we define the probability of a program as shrinking by half for every additional bit it contains, the more bits a program has, the smaller its weight in the distribution. Therefore, the total probability over all prefix-free binary programs will be dominated by the shortest programs.
We stated that the Solomonoff prior is defined over prefix-free, finite, binary strings. Let’s make sure we understand each of these qualifiers. We use binary sequences because a universal Turing machine can be defined in terms of binary inputs and outputs, and any information object can be represented with a binary encoding. We define the prior over finite sequences in order to meet the conditions of computability: no machine can finish reading an infinite program.
A set of strings is prefix free if no string in the set is a prefix of another:
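For instance (an illustrative example, not from the article):

```latex
\{\,0,\ 10,\ 110,\ 111\,\} \text{ is prefix-free, whereas } \{\,0,\ 01\,\} \text{ is not, since } 0 \text{ is a prefix of } 01.
```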
This yields sets of finite binary strings whose associated events are disjoint. In other words, we need disjoint events to compute the probability of their union: per the Kolmogorov axiom of additivity, the probability of the union of the members of the set can be expressed as the sum of their probabilities.
Disjointness ensures that the weights associated with the strings in a prefix-free set obey Kraft’s inequality, which states that their sum does not exceed one:
Kraft’s Inequality
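For a prefix-free set of binary strings C:

```latex
\sum_{c \in C} 2^{-|c|} \;\le\; 1
```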
This tells us that each string c in a prefix-free set C is assigned the weight 2 raised to the negative of its length. Because the corresponding events are disjoint, these weights sum to at most 1 (the sum can be strictly less than one, which makes the distribution a semi-measure). This enables us to treat code weights as probabilities and consequently to compute the probability mass of the entire distribution by summing over string weights.
Accordingly, Solomonoff’s prior is defined as the sum of the weights or probabilities of finite binary programs p:
The Universal Solomonoff Prior
Therefore, in order to compute the probability of getting some output x from a possible program, we condition that probability on the sum of the probabilities of all possible programs:
The Solomonoff marginal
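A plausible rendering of this marginal, treating the likelihood P(x | p) as 1 when U(p) = x and 0 otherwise (this reconstruction is an assumption based on the surrounding text):

```latex
P(x) = \sum_{p} P(x \mid p)\, 2^{-|p|} = \sum_{p \,:\, U(p) = x} 2^{-|p|}
```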
Because p is deterministic, the likelihoods and the posterior are defined as delta functions: either a program outputs x or it doesn’t.
Further, because the prior is defined over prefix-free binary strings, we can express the conditionalization in terms of bitwise strings:
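For sequence prediction, this conditionalization is commonly written as follows (a reconstruction; the notation x₁:ₙ for the first n bits is an assumption):

```latex
P(x_{n+1} \mid x_{1:n}) = \frac{P(x_{1:n}\, x_{n+1})}{P(x_{1:n})}
```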
Instead of a joint distribution over events, we have a weighted sum over bitstrings that syntactically generate x, standing in for all possible events. This reveals some of the limitations of the formalism: does formal compressibility suffice as an explanatory bias for preferring certain programs or theories over others? We will delve into these limitations later, such as the lack of structural bias and the problem of representational alignment.
Together with the Coding Theorem, the Solomonoff prior establishes a deep connection between induction and compressibility: generalization is revealed to be formally equivalent to information compression, such that the more compressible a dataset is, the more probable the program that produces it. In the real world, however, we know that the most “compressible” theories are not always the ones with the greatest explanatory or predictive power, though the best theoretical approximations do tend toward simplicity.
The formula below expresses our notions of algorithmic complexity, the universal prior, and information entropy as approximately equivalent to each other (under specific ranges of x):
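One hedged way to summarize the chain, reading P(x) as the universal prior and treating the relation as holding up to additive constants:

```latex
K(x) \;\approx\; -\log_2 P(x) \;=\; I(x)
```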
AIXI
As it stands, our theory of universal induction, which combines the Solomonoff prior with Bayesian updating, is a theory of prediction; it is not yet defined for an agent that must act. What if we combine Solomonoff induction with sequential decision theory?
This is where Marcus Hutter’s AIXI theory comes in: it integrates Solomonoff induction, decision theory, and reinforcement learning so that our universal prior can do work for an agent.
Transitioning from Solomonoff induction into the territory of decision theory and reinforcement learning requires expanding our ontology to actions, observations, and rewards. AIXI models a universal agent whose interaction with any computable environment enables it to choose the action that maximizes expected reward. In more formal terms, AIXI selects an action at each time step and in return receives an observation and a reward from the environment. How does AIXI select the optimal action? As we will see, because AIXI encodes an ideal Bayesian agent, it constitutes a model-based agent. However, unlike a typical Bellman-based agent (which solves the Bellman equations to determine optimality; see my earlier article on reinforcement learning for that), AIXI maintains uncertainty over all possible environments. It does so by computing the sum of products of the likelihoods, namely the probabilities of environmental feedback given the actions taken, and the Solomonoff weights assigned to every computable environment (or program); this sum is called the universal mixture.
Put succinctly, the universal mixture is the term within AIXI that defines the probabilistic prediction of the next observation and reward pair given the actions taken so far. It is computed as a weighted sum over every possible environment: each environment model μ assigns probabilities to observation-reward sequences given the action history, and each model’s contribution is weighted by its Solomonoff weight. The universal mixture is given by the formula below:
The universal mixture assigns probabilities to future observation and reward pairs given action history.
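A common rendering, reconstructed here in the spirit of Hutter’s notation (the weighting 2^{-K(μ)} by each environment’s complexity is the standard choice):

```latex
\xi(o_1 r_1 \ldots o_m r_m \mid a_1 \ldots a_m) \;=\; \sum_{\mu \in \mathcal{M}} 2^{-K(\mu)}\; \mu(o_1 r_1 \ldots o_m r_m \mid a_1 \ldots a_m)
```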
The universal mixture accumulates a predictive distribution over the environment as the agent acts. Think of ξ as assigning probabilities to every possible future trajectory of observations and rewards given a sequence of actions.
The universal mixture provides us with the probability of observation and reward pairs given the action history, but it doesn’t tell us which of these trajectories is the most valuable or reward-maximizing. For this we sum the rewards along each trajectory:
Sum of accumulated rewards where k is the time index and m is the farthest timestep.
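In symbols:

```latex
r_k + r_{k+1} + \cdots + r_m \;=\; \sum_{i=k}^{m} r_i
```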
In order to find out which trajectory to choose, we multiply the sum of rewards per trajectory by the probability assigned to that trajectory by the universal mixture:
Calculation of expected reward under environment prediction
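A hedged reconstruction of this expectation, summing over possible observation-reward continuations weighted by the mixture:

```latex
\sum_{o_k r_k \,\ldots\, o_m r_m} \big(r_k + \cdots + r_m\big)\; \xi(o_1 r_1 \ldots o_m r_m \mid a_1 \ldots a_m)
```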
As such, we compute the expectation by weighting each trajectory’s cumulative reward by the probability the mixture assigns to that trajectory.
After we compute expectations, the final step involves selecting the action that maximizes expected return given the weighting of rewards and environment probabilities. For this we employ the arg max function as follows:
Arg max selects the action that maximizes cumulative returns across all possible trajectories:
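Putting the pieces together, the full action-selection rule can be written, in a reconstruction that alternates maximization over future actions with expectation over future percepts, as:

```latex
a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \;\cdots\; \max_{a_m} \sum_{o_m r_m} \big(r_k + \cdots + r_m\big)\; \xi(o_1 r_1 \ldots o_m r_m \mid a_1 \ldots a_m)
```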
AIXI aims to formalize a universal agent equipped with the Solomonoff prior, which biases it toward environments of minimal Kolmogorov complexity. Beyond assuming that environments are computable and associating compressibility with higher probability, the AIXI agent acquires structural bias only from interaction with the environment. This is what should enable AIXI to navigate any environment (provided it is structured and computable to begin with).
While AIXI formalizes a universal agent, it does not sufficiently bias this agent so that it can efficiently navigate actual environments. That is to say, the procedures for optimal action that AIXI formalizes do not encode efficient structural biases, such as domain-specific heuristics or architectural constraints, that would accelerate learning. This, however, is a consequence of AIXI’s scope: it sets universal bounds on decision optimality across all possible environments. In principle, AIXI acquires efficient structural biases indirectly through Bayesian updating. With sufficient environmental sampling, AIXI asymptotically converges to the true environment, assuming the environment is computable and has non-zero prior weight. In practice, however, convergence can be slow because the posterior weight remains spread across many environments, leaving the agent’s actions suboptimal for an indefinite amount of time.
Noncomputability & Structural Bias
In its universal form AIXI is not algorithmically implementable because Kolmogorov complexity and the Solomonoff prior are both incomputable. The class of all programs that halt and produce valid (computable) environments is not recursively enumerable, since determining whether a program halts on all inputs is undecidable. Computing the universal prior requires simulating infinitely many environments, and computing future expectations requires unbounded foresight, all of which is mathematically intractable.
For this reason, computable approximations of AIXI exist, such as AIXItl. AIXItl introduces time and program-length bounds (which is what the tl stands for), restricting the environment space Mtl to programs of at most l bits running for at most t time steps. However, AIXItl is still inefficient since the number of candidate environments grows exponentially: O(2^l). Model-free methods such as DQN and gradient-based, model-learning methods such as Dreamer and World Models represent alternatives in the search for a universal agent. These later approaches use heuristics and sampling-based methods for exploration, such as Monte Carlo Tree Search, for decision making. Fundamentally, the contest lies between model-based and model-free methods, the latter of which derive their biases entirely from interaction with the environment.
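To make the idea of a bounded, enumerable approximation concrete, here is a toy Python sketch of an expectimax over a tiny, hand-weighted class of environment models. Every model, weight, and number in it is an illustrative assumption; it is not AIXItl or any published algorithm:

```python
# Toy sketch: expectimax over a tiny, hand-enumerated class of environment models,
# each weighted by 2^(-description length). Purely illustrative; not AIXItl.

ACTIONS = [0, 1]
PERCEPTS = [(0, 0.0), (1, 1.0)]  # (observation, reward) pairs

def env_reward_for_action_1(action, percept):
    """Deterministic model: action 1 yields observation 1 and reward 1, else nothing."""
    expected = (1, 1.0) if action == 1 else (0, 0.0)
    return 1.0 if percept == expected else 0.0

def env_random(action, percept):
    """Noise model: percepts are independent of the action."""
    return 0.5

# Hypothetical "program lengths" in bits stand in for Kolmogorov complexity.
ENVIRONMENTS = [(env_reward_for_action_1, 3), (env_random, 5)]

def mixture(action, percept):
    """Universal-mixture-style prediction: 2^-|p| weighted sum over models."""
    return sum(2.0 ** -length * model(action, percept)
               for model, length in ENVIRONMENTS)

def expected_return(action, depth):
    """Expectimax: expected cumulative reward of `action`, looking `depth` steps ahead."""
    total = 0.0
    for percept in PERCEPTS:
        _, reward = percept
        prob = mixture(action, percept)
        future = max(expected_return(a, depth - 1) for a in ACTIONS) if depth > 1 else 0.0
        total += prob * (reward + future)
    return total

best_action = max(ACTIONS, key=lambda a: expected_return(a, depth=3))
print("chosen action:", best_action)  # the deterministic model dominates, so action 1 wins
```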
Representational Alignment
As we already stated, AIXI treats the universe as a computable sequence represented through finite binary strings. The assumption that the environment is Turing computable is not entailed by the Church-Turing thesis and therefore constitutes an additional assumption of AIXI. Whether it holds in principle, that is, of the physical world rather than of any realizable machine, remains an open question, though there are reasons to think it false.
As we saw, AIXI treats observations as bitstrings, whereas real-world data require structured representations: causal structure, temporal relationships, and spatial dimensions, to name a few. In order to encode richer structure within AIXI we would need priors over structured representations such as graphs, tensors, or differential equations. Encoding such structural biases would make AIXI more efficient, but at the expense of its universality. The cost of encoding real-world representational structure within a Bayesian model is therefore specialization at the expense of environment-generality. Given that an agent realizing full AIXI is not possible in practice, we should look instead to agents that encode real-world representations, such as AIXItl, model-free agents, or deep-learning-based agents.
Active Inference
We saw that AIXI incorporates a maximally informative prior, but this prior is entirely unstructured and embodies no prior knowledge about the world except the meta-selection bias for short or compressible programs as being the most likely. We also saw that this makes both AIXI and the Solomonoff prior computationally intractable, which precludes its implementation in its complete form.
Another strand of agent modeling, more recently branded active inference, whose centerpiece is the free-energy minimization principle, aims to integrate the modeling lineages of utility maximization, reinforcement learning, Bayesian inference, predictive coding, statistical mechanics, and far-from-equilibrium thermodynamics into the unified model of a hierarchical generative Bayesian agent. Optimality in the active inference generative Bayesian model consists of minimizing free energy, where free energy is defined as the expected surprise about the joint occurrence of sensations and their inferred causes.
Put in colloquial terms, the generative model predicts future perceptions given actions through a forward model that estimates the causes of those sensations via the prior, and it conversely estimates the actions required to bring about preferred states through an inverse model. The agent dynamically navigates the environment through loops of forward and inverse models or, more simply, through loops of perception and action. Free energy results from the mismatch between predictions and environmental feedback and is minimized through hierarchical Bayesian updating of the model’s priors.
The formula below formally expresses the computation of variational free energy as the divergence between the recognition density (the approximate posterior) and the conditional density (the true posterior), where ỹ stands for observed input, 𝜗 for latent causes, p( ỹ, 𝜗) defines the generative model as the joint probability density of perceptions and latent causes, whereas q(𝜗) defines the approximate posterior:
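One standard rendering of the quantity described here (the symbols follow the article’s; the decomposition is the usual variational one):

```latex
F \;=\; D_{\mathrm{KL}}\!\big[\,q(\vartheta)\;\|\;p(\vartheta \mid \tilde{y})\,\big] \;-\; \log p(\tilde{y})
\;=\; \int q(\vartheta)\,\log\frac{q(\vartheta)}{p(\tilde{y},\vartheta)}\,d\vartheta
```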
The first expression in the formula defines the Kullback-Leibler divergence between the approximate posterior q(𝜗) and the true posterior p(𝜗|ỹ), minus the log model evidence log p(ỹ). In general, the Kullback-Leibler divergence quantifies the information-theoretic divergence between a model distribution Q and the actual distribution P. Free energy is thus the divergence between the approximate and true posterior, offset by the log model evidence. We compute variational free energy by integrating, over the latent causes, the log ratio between the approximate posterior and the true joint distribution. The second expression renders this same quantity as the cross-entropy between the approximate posterior and the generative model p(ỹ, 𝜗) minus the entropy of the approximate posterior q(𝜗). Minimizing free energy amounts to reducing the divergence between the recognition model and the conditional density.
Both AIXI and Active Inference supply optimal Bayesian agents in different ways. But whereas AIXI is formally incomputable in its unbounded form, Active Inference enables tractable approximations via variational Bayesian models. Optimization in AIXI consists of maximizing rewards, whereas in Active Inference it consists of minimizing free energy. In the former, model accuracy results implicitly from maximizing rewards, whereas in the latter, maximizing rewards results implicitly from minimizing expected surprise or free energy. In this regard, Active Inference constitutes a structured generative model that hierarchically estimates latent causes, guiding action selection through inference rather than through AIXI’s enumeration of all possible environments in search of the reward-maximizing action. Yet Active Inference remains an incomplete framework, since it glosses over many details about concrete agents, such as goal-setting, model learning (where it is extremely vague), and a viable description of agent boundaries (the Markov blanket formulation is insufficient, unable to distinguish biological agents from physical systems that do not amount to actual agents).
References
Friston, K., Kilner, J., & Harrison, L. (2006). A free energy principle for the brain. Journal of Physiology, Paris, 100(1–3), 70–87. https://doi.org/10.1016/j.jphysparis.2006.10.001
Hutter, M. (2005). Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability (1st ed.). Springer Berlin Heidelberg. https://doi.org/10.1007/b138233
Sommaruga, G. (Ed.). (2009). Formal Theories of Information: From Shannon to Semantic Information Theory and General Concepts of Information (Lecture Notes in Computer Science, Vol. 5363). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-00659-3
Zabell, S. (2009). Philosophy of inductive logic: The Bayesian perspective. In L. Haaparanta (Ed.), The development of modern logic (pp. 725–774). Oxford University Press. https://doi.org/10.1093/acprof:oso/9780195137316.003.0044