benchmarking LLMs, evals, MMLU, model assessment, capability testing