Updated LLM Benchmark (Gemini 3 Flash)

Covered by LessWrong. Discussed on Hacker News and Lobsters.

I evaluate LLMs by how well they play text adventures. My last update was when Haiku 4.5 was released. Now that Google has released a preview of Gemini 3 Flash, I had to run the benchmark again. Having seen how well Gemini 2.5 Flash performed in earlier benchmarks, I expected Gemini 3 Flash to blow the other models out of the water, which it did … almost.

