Published on November 2, 2025 2:45 AM GMT
I will be discussing weak-to-strong generalization with Sahil on Monday, November 3rd, 2025, 11am Pacific Daylight Time. You can join the discussion with this link.
Weak-to-strong generalization is an approach to alignment (and capabilities) which seeks to address the scarcity of human feedback by using a weak model to teach a strong model. This is similar to Paul Christiano’s iterated distillation and amplification (IDA), but without the “amplification” step: the strong model is trained directly on labels generated by the weak model, not some “amplified” version of the weak model. I think of this as “reverse distillation”.[1]
Why would this work at all? From a naive Bayesian perspective, it is tempting to imagine the “strong model” containing the “weak model” within its larger hypothesis space. Given enough data, the strong model should simply learn to imitate the weak model. This is not what’s desired – the strong model is supposed to improve upon the performance of the weak model.
Theoretical analysis shows that weak-to-strong generalization works “when the strong model is unable to fit the mistakes of the weak teacher without incurring additional error”. This is surprising from a naive Bayesian perspective: usually, Bayesian methods are at their strongest when there is a hypothesis which models the data well, and degrade when this assumption is violated.
Still, this mechanism should fail in the limit of a very strong student and a very weak teacher: at some point, the strong model will learn the errors of the weak model.
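To see both points in miniature, here is a toy numerical sketch (my own construction with made-up numbers, not taken from the weak-to-strong literature): the ground truth is linear, the weak teacher labels it with coarse systematic errors, a capacity-limited strong student trained on those labels lands closer to the truth than its teacher, and a student flexible enough to memorize the labels simply inherits the teacher's mistakes.

```python
# Toy weak-to-strong setup: ground truth is linear, the weak teacher produces
# a coarse stair-step approximation, and two "strong students" are trained on
# the teacher's labels.  All numbers are illustrative.
import numpy as np

x = np.linspace(0, 1, 200)
true_y = 2.0 * x                       # ground truth the teacher tries to label

# Weak teacher: rounds the truth onto a coarse grid, introducing systematic errors.
teacher_y = np.round(true_y * 2) / 2   # step size 0.5

# Constrained strong student: a linear fit.  It cannot reproduce the steps, so
# fitting the teacher's labels pulls it back toward the underlying line.
slope, intercept = np.polyfit(x, teacher_y, 1)
linear_student = slope * x + intercept

# Unconstrained student: memorizes the teacher's labels exactly (a lookup table),
# so it inherits every teacher mistake; this is the failure mode in the limit.
memorizing_student = teacher_y.copy()

def mae(pred):
    """Mean absolute error against the ground truth."""
    return float(np.mean(np.abs(pred - true_y)))

print("teacher error:           ", mae(teacher_y))
print("linear student error:    ", mae(linear_student))      # below the teacher's
print("memorizing student error:", mae(memorizing_student))  # equals the teacher's
```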
My aim here is to provide a Bayesian analysis that does not fall apart in the limit, and hence, a variant of weak-to-strong generalization that can serve as a mathematically robust target of training rather than only being a convenient empirical phenomenon. (This is not to be confused with “solution to alignment” or “safe” – I’m only aiming for a clear mathematical picture of what’s being optimized.)
Why does it work?
The phenomenon of weak-to-strong generalization is similar to a student learning correctly from a textbook filled with typos. We can imagine that the student only considers hypotheses which are grammatically correct, while the typos are usually ungrammatical. The student has no choice but to accept the “error” inherent in being unable to predict the typos, learning as if they’d read a version of the textbook with most of the typos corrected.
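As a small Bayesian cartoon of this picture (the sentences, noise model, and numbers below are all invented for illustration): the learner's hypothesis space contains only "grammatical" sentences, the data is the true sentence with a few random character typos, and the posterior concentrates on the intended sentence anyway, since no hypothesis can fit the typos.

```python
# A learner whose hypotheses are all "grammatical" cannot fit random typos,
# so it treats them as noise and recovers the intended text.
import numpy as np

rng = np.random.default_rng(1)
alphabet = "abcdefghijklmnopqrstuvwxyz "

hypotheses = [                         # the learner's "grammatical" hypothesis space
    "the cat sat on the mat",
    "the dog ran in the park",
    "a bird flew over the sea",
    "my code compiled first try",
]
true_text = hypotheses[0]

# Corrupt the true text with a few random character "typos".
chars = list(true_text)
for i in rng.choice(len(chars), size=3, replace=False):
    chars[i] = rng.choice(list(alphabet))
observed = "".join(chars)

def log_likelihood(hyp, obs, p_match=0.9):
    """Per character: match with prob p_match, otherwise uniform over the alphabet."""
    ll = 0.0
    for i, o in enumerate(obs):
        h = hyp[i] if i < len(hyp) else None
        ll += np.log(p_match if h == o else (1 - p_match) / len(alphabet))
    return ll

# Uniform prior over hypotheses; posterior computed via Bayes' rule in log space.
log_post = np.array([log_likelihood(h, observed) for h in hypotheses])
post = np.exp(log_post - log_post.max())
post /= post.sum()

print("observed:", observed)
for h, p in zip(hypotheses, post):
    print(f"{p:.3f}  {h}")             # nearly all mass on the intended sentence
```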
Why don’t strong learners imitate weak teachers?
To elaborate on the “naive Bayesian perspective” mentioned earlier: I’ll formalize the weak model as a probability distribution $P_w$, the strong pre-trained model as another probability distribution $P_s$. The event algebras ($\sigma$-algebras) of these two probability distributions share a sub-algebra over tokens (observations/data). I’ll write token-events with $x$ to distinguish them from events in general. For events in general, I’ll write $e$, with $e_w$ for events in the weak model, and with $e_s$ for events in the strong model.
A naive way to formalize the idea that the weak model is weaker than the strong model is to assume that the strong model has strictly more events. That is: for every event $e_w$ in the weak model, there exists a corresponding event $e_s$ in the strong model, such that the conditional probabilities over tokens match:

$$P_w(x \mid e_w) = P_s(x \mid e_s) \quad \text{for all token-events } x.$$

For a given weak-model event $e_w$, I’ll use the function $f$ to get the corresponding strong-model event: $e_s = f(e_w)$.
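As a concrete finite check of this correspondence (toy numbers of my own choosing): each weak event, represented here as a set of weak hypotheses, maps under $f$ to a strong event with the same conditional distribution over tokens. Note that $f(e_w)$ need not be a single strong hypothesis; below, two strong hypotheses together play the role of one weak hypothesis.

```python
# Finite toy models: joint distributions over (hypothesis, token) pairs.
tokens = ["a", "b"]

# Weak model P_w: two hypotheses with mixture weights 0.6 and 0.4.
P_w = {
    ("h1", "a"): 0.6 * 0.7, ("h1", "b"): 0.6 * 0.3,
    ("h2", "a"): 0.4 * 0.2, ("h2", "b"): 0.4 * 0.8,
}

# Strong model P_s: three hypotheses; g1a and g1b jointly mimic h1.
P_s = {
    ("g1a", "a"): 0.3 * 0.8, ("g1a", "b"): 0.3 * 0.2,
    ("g1b", "a"): 0.3 * 0.6, ("g1b", "b"): 0.3 * 0.4,
    ("g2",  "a"): 0.4 * 0.2, ("g2",  "b"): 0.4 * 0.8,
}

# f maps each weak event (a set of weak hypotheses) to a strong event.
f = {
    frozenset({"h1"}): frozenset({"g1a", "g1b"}),
    frozenset({"h2"}): frozenset({"g2"}),
}

def cond_token_dist(joint, event):
    """P(token | event), where the event is a set of hypotheses."""
    mass = {t: sum(p for (h, tok), p in joint.items() if h in event and tok == t)
            for t in tokens}
    total = sum(mass.values())
    return {t: m / total for t, m in mass.items()}

for e_w, e_s in f.items():
    # The two printed distributions agree for each pair of corresponding events.
    print(sorted(e_w), cond_token_dist(P_w, e_w))
    print(sorted(e_s), cond_token_dist(P_s, e_s))
```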
This isn’t enough to prove that the strong model will learn to exactly imitate the weak model, however. The weak pre-trained model will have learned some mixture over its hypotheses. There isn’t necessarily a single event $e_w$ such that $P_w(x \mid e_w) = P_w(x)$ for all token-events $x$.
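To illustrate with the same toy numbers as above: the weak model’s overall predictive distribution is a mixture of its hypotheses’ predictions and coincides with neither hypothesis on its own, so the per-event correspondence above does not by itself give the strong model an event matching the weak model’s overall predictions.

```python
# The weak model's predictive distribution over tokens (same numbers as above)
# is a 0.6 / 0.4 mixture of its two hypotheses' predictions.
weights = {"h1": 0.6, "h2": 0.4}
token_given_hyp = {"h1": {"a": 0.7, "b": 0.3},
                   "h2": {"a": 0.2, "b": 0.8}}

predictive = {t: sum(weights[h] * token_given_hyp[h][t] for h in weights)
              for t in ["a", "b"]}

print(predictive)                # roughly {'a': 0.5, 'b': 0.5}
print(token_given_hyp["h1"])     # {'a': 0.7, 'b': 0.3}: does not match the mixture
print(token_given_hyp["h2"])     # {'a': 0.2, 'b': 0.8}: does not match the mixture
```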