So it’s an interesting time for computers-doing-mathematics. A couple of things happened in the last few days which have inspired me to write about the question more broadly.
First there is the question of whether computers will ever prove interesting research-level mathematical theorems autonomously at all, which is really hard to answer right now, although clearly a lot of people are betting on “yes”. I personally am unsure: I’ve seen what can currently be done, and can’t quite extrapolate that far once you cut through the hype. I am, however, far more optimistic about AI tools becoming useful helpers to mathematics researchers, and this is one of the things which motivates my current work (trying to teach Lean a lot of research-level definitions, theorem statements and proofs).
But whether they beat us or just help us, one can still ask what an autonomous theorem-proving tool will look like. Let’s initially take a gigantic step down from research-level mathematics and talk instead about problems which can be solved by smart schoolchildren, namely those on the International Mathematical Olympiad (IMO). I wrote recently about the attempts by AI on the 2025 IMO, but all you need to know here is that there were four attempts which were made public: two of them were using language models to write proofs in English (which were then graded by humans); let’s call this the “informal approach”. The other two were using a language model/Lean hybrid set-up; let’s call this the “formal approach”. Lean is a computer proof checker or interactive theorem prover; it checks mathematical arguments at a superhuman level. In this hybrid setup, the language model writes proofs in Lean’s language which are then checked automatically by Lean; once Lean says the proofs are correct, humans only need to check that the statements have been translated correctly into Lean, although that is an important part of the process.
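To make that division of labour concrete, here is a minimal toy example of my own (it has nothing to do with the actual IMO entries). Lean will happily certify both proofs below; what it cannot do is notice that the second statement is a mistranslation of the intended claim. That check is the human’s job.

-- Intended claim: every natural number n satisfies n + 0 = n.
-- Lean checks the proof automatically.
theorem intended (n : Nat) : n + 0 = n := rfl

-- A mistranslation which Lean also accepts without complaint: it only
-- asserts the special case n = 0, which is much weaker than intended.
theorem mistranslated : (0 : Nat) + 0 = 0 := rfl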
Worries about the informal approach.
What worries me most about the informal approach are the horror stories I am hearing from journal editors. Mathematics journals are being spammed with LLM-generated papers, which are becoming more and more plausible-looking, and sometimes the use of LLMs is not even revealed in the paper. I want to flag some differences between mathematics papers generated by LLMs and those generated by humans:
1. LLMs are sycophantic. You want to hear that the LLM-generated argument correctly proves the theorem? Sure you do, and that’s why the LLM will assure you that this is the case, even if the argument contains a hallucination (and remember, just one hallucination is enough to break a mathematical argument). In my experience, human mathematics researchers doing mathematics in the traditional way are vanishingly unlikely to send a paper to a journal claiming that they have proved something which they actually think they have not proved. Being stuck is a standard place for a mathematics researcher to be in. An LLM is far more likely to write plausible-looking rubbish to cover its tracks; in some sense, then, it is trained to mislead. One thus has to read an LLM-generated paper with far more suspicion than a human-generated paper.
2. LLMs won’t say “I don’t know”. We expect them to perform very well on in-distribution data (i.e., what they have been trained on, which certainly includes all the techniques needed to solve an IMO problem). But what happens on out-of-distribution data (e.g. theorems that no humans know how to prove)? Well, the correct answer to many intermediate questions which show up when doing research is “I don’t know — indeed, nobody knows”. Whereas the LLM’s answer will be “I will confidently have a go”. Remember: one hallucination breaks an argument.
In short: I do not yet trust mathematics written by an LLM. Several mathematicians I have talked to have backed me up on this; they have tried typing hard research problems into language models and got nonsense output which they could instantly see through. Will LLMs become 100% reliable when out of distribution (i.e., when trying to do original research)? Nobody knows yet, and anyone who thinks they know is just guessing, or is stoking the hype. The disgraceful story from earlier this week about OpenAI and the ten Erdos problems is a case in point. People seem quick to believe that this is already happening; nothing close to it is happening right now.
One thing which is becoming more popular with the informal approach is testing systems not on questions whose answer is a long proof (indeed, most breakthroughs in mathematics consist of proofs which are 30 pages long or much longer) but instead on hard problems for which the answer is a number. An example of this is Epoch AI’s Frontier Math Tier 4 dataset. I was amused this week to be sent data on what happens if you ask lots of agents to try to solve these problems and mark a question as solved if at least one agent gets the answer correct at least once. If that’s the mark scheme now, I would like to submit my own AI-based approach to the Tier 4 dataset: it’s written in Python.
i = 0
while True:
    print(i)
    i = i + 1
My point of course is that a system which gives out lots of answers, one of which is correct but you don’t know which one, is rather less useful than it sounds.
Worries about the formal approach.
The amazing Erdos 707 story from earlier today (tl;dr: two researchers found a solution to an Erdos problem and vibe-coded the entire proof in Lean using ChatGPT 5 despite not having a clue about how to write Lean code; here is the paper by Alexeev and Mixon, with both Lean and ChatGPT listed as co-authors) might make people start to think that Lean is now becoming viable for research-level mathematics. And certainly this story shows that it can be used for some research-level mathematics. But all of it? Not so fast.
My worry about the formal approach is that currently even the best formal mathematics libraries do not contain the definitions needed to understand most of what is happening in modern research mathematics. Because of my Renaissance Philanthropy AI for Math grant, I have been ploughing through recent issues of the Annals and attempting to see which definitions need to be formalised before we can even formally state the main results in these papers. Examples of what I saw: Tate-Shafarevich groups, Calabi-Yau varieties, algebraic stacks, automorphic representations, compactly supported étale cohomology, Jacobians, Maass forms. These are everyday objects used by graduate students in mathematics departments across the world; no theorem prover knows what these things are. If Lean and the other provers cannot even understand the statements of most modern theorems at research level, how can they possibly help mathematicians to prove them?
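To make the bottleneck concrete, here is a purely schematic Lean sketch; every name in it is a placeholder of my own, not a declaration from mathlib or any other library. The point is simply that the statement of a modern theorem depends on definitions which somebody has to write first.

-- Placeholder names only (not real library declarations): until the
-- definitions exist, the statement cannot even be written down.
axiom EllipticCurveOverQ : Type
axiom TateShafarevichGroup : EllipticCurveOverQ → Type
axiom IsFinite : Type → Prop

-- Only once those definitions are in place can a conjecture like
-- "Sha is always finite" be stated in Lean, never mind proved.
theorem sha_finite : ∀ E : EllipticCurveOverQ, IsFinite (TateShafarevichGroup E) :=
  sorry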
Whose job is it to fix this?
As is probably clear, I am currently skeptical about whether a next-token-prediction system is ever going to get to the point where it can be 100% accurate when generating 50 or more pages of material lying far outside its training data (which is probably the kind of thing which is needed for a typical new breakthrough in research mathematics). An LLM/Lean hybrid seems to me far more plausible (and we’ve just seen it work to great effect in the Erdos 707 story above, even though there it was the humans who had the ideas, not the machine). Teaching Lean many modern definitions seems to me to be the bottleneck; it is by no means an impossible problem to solve, but it will take time, and solving it seems to me to be fundamental. But who is in a position to solve this problem? Let’s look at some of the likely candidates.
1. The mathematics researcher. These are the people who have taught Lean pretty much all of the nontrivial mathematical definitions in its mathematics library so far. For example, I taught mathlib what elliptic curves were, and Chris Birkbeck taught it what modular forms were, so we can state the Shimura-Taniyama-Weil conjecture relating elliptic curves to modular forms; this was the conjecture partially resolved by Wiles and Taylor–Wiles, and which implied Fermat’s Last Theorem. The big problem with scaling this up is that the mathematics researcher lives in a “publish or perish” environment, and work of this nature does not fit into the traditional publishing model; the current set-up rewards new proofs, not formalisations of old material, even if that material has never been formalised before.
2. The LLM. This is an approach being pushed by Christian Szegedy and his colleagues at math.inc. My worry about this approach is that right now LLMs write really unidiomatic and poor code, and poor definitions might ultimately result in technical debt when things start scaling. Another worry about LLM-generated definitions is that, unlike proofs, Lean cannot check that a Lean definition aligns with the human concept it is supposed to be capturing; if an LLM forgets or mangles an axiom which is part of a definition then the code is likely to compile, it just won’t mean what it’s supposed to mean (see the toy example after this list).
3. The Mathlib Initiative. This is a new institution whose aim is to support Lean’s mathematics library. They have a lot on their plate already, as you can see from their roadmap, but the problem I raise above of teaching Lean how to understand modern definitions is not explicitly mentioned. Perhaps a simple solution would be to expand in this direction, but of course this will mean a financial commitment from somewhere.
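Here is the toy example promised above, illustrating the second worry about LLM-generated definitions; it is my own and has nothing to do with math.inc’s actual output. Below is a “group” whose definition forgets the inverse axiom. Lean compiles it without complaint, because a definition, unlike a proof, has nothing it can be checked against; it simply no longer captures what a mathematician means by a group.

-- A mangled definition: the inverse axiom has been dropped.  Lean
-- accepts it happily, since there is nothing to check it against.
structure BadGroup (G : Type) where
  mul : G → G → G
  one : G
  mul_assoc : ∀ a b c : G, mul (mul a b) c = mul a (mul b c)
  one_mul : ∀ a : G, mul one a = a
  -- missing: ∀ a, ∃ b, mul a b = one

-- The natural numbers under addition now satisfy the "definition",
-- even though (ℕ, +) is certainly not a group.
def natBadGroup : BadGroup Nat where
  mul := (· + ·)
  one := 0
  mul_assoc := Nat.add_assoc
  one_mul := Nat.zero_add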
Until we find a better solution, it will be my group at Imperial College London who will work on this problem, manually. We gratefully acknowledge support from Renaissance Philanthropy and the EPSRC to work on what I believe is a fundamental question in AI for mathematics.