ReasonScape Evaluation: AI21 Jamba Reasoning vs Qwen3 4B vs Qwen3 4B 2507
reddit.com·19h·
Discuss: r/LocalLLaMA

It’s an open secret that LLM benchmarks are bullshit. I built ReasonScape to be different, so let’s see what it tells us about how AI21’s latest drop compares to the high-quality 4B we know and love.

My usual disclaimer is that these are all information-processing tasks, so I make no claims about performance on summarization, creative writing, or similar tasks. This evaluation is a counting-letters, tracking-objects, doing-math, following-instructions kinda thing.

The second disclaimer is that I am sharing data from my development branch that has not yet been published to the leaderboard or explorer apps - working on it, aiming…
