LLMs: Predictive Text, Now with Actual Predictions.
October 14, 2025 7:27 PM Subscribe
How Well Can LLMs Predict the Future? “Here are our key findings: Superforecasters still outperform leading LLMs, but the gap is modest. The best-performing model in our sample is GPT-4.5, which achieves a Brier score of 0.101 versus superforecasters’ 0.081 (lower is better). State-of-the-art LLMs show steady improvement, with projected LLM-superforecaster parity in late 2026 (95% CI: December 2025 – January 2028). Across all questions in our sample, LLM performance improves by around 0.016 Brier points per year. Linear extrapolation suggests LLMs could match expert human performance on ForecastBench in around a year if current trends continue.”
“Many factors could complicate this timeline. Linear extrapolation may break down as systems approach frontier performance—the last mile might prove hardest. Superforecasters may improve their accuracy, including by use of LLMs. Finally, our benchmark captures only one slice of forecasting ability: binary predictions on specific question types. Superforecasters may still maintain their edge on other, more complex forecasting questions.”
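For anyone who wants the arithmetic behind those figures, here is a rough sketch, not the study's actual method. It assumes the plain textbook Brier score for binary questions (mean squared difference between the forecast probability and the 0/1 outcome); the benchmark's own scoring may include adjustments not shown here. The function names `brier_score` and `years_to_parity` are made up for illustration, and the parity estimate is just the quoted gap divided by the quoted yearly improvement, which is exactly the kind of linear extrapolation the authors warn may break down.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecaster, and always answering
    0.5 scores 0.25 on binary questions.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equal-length, non-empty forecasts and outcomes")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def years_to_parity(llm_score, human_score, improvement_per_year):
    """Naive linear extrapolation: score gap divided by yearly improvement.

    Assumes the human score stays fixed and the LLM trend stays linear,
    both of which are simplifications.
    """
    return (llm_score - human_score) / improvement_per_year


# Figures quoted in the post: best LLM 0.101, superforecasters 0.081,
# improvement of about 0.016 Brier points per year.
gap_years = years_to_parity(0.101, 0.081, 0.016)
print(f"Naive years to parity: {gap_years:.2f}")
```

With the quoted numbers the gap is 0.020 Brier points, so the naive estimate comes out to about 1.25 years, in line with the "around a year" in the excerpt. The study's late-2026 projection and confidence interval come from its own trend fit, which this sketch does not reproduce.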
For now: “LLMs outperform non-expert public participants. A year ago, the median public forecast ranked #2 on our leaderboard, right behind superforecasters and ahead of all LLMs. Today it sits at #22. This achievement represents a significant milestone in AI forecasting capability.”
Limitation: “ForecastBench currently only includes binary ‘yes/no’ questions. This excludes point predictions for continuous variables (‘What will the GDP growth rate be?’), multiple-choice outcomes (‘Which party will win the election in Germany?’), quantile predictions (‘What is the 95th percentile for the 7-day change in Apple’s stock price?’), and full probability distribution (‘Provide a probability density function for next quarter’s inflation rate’) elicitation. As a result, ForecastBench is limited in the scope of forecasting capabilities that it can evaluate.”