The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

Discussed on Hacker News

Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the same question in open generation. We ask what pruning changes: does it erase the correct answer, or does it make the answer harder to produce as the top output? We study this question with multilingual question answering, tracking the same questions before and after p...
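To make the distinction between the two evaluation modes concrete, here is a minimal sketch on a toy next-token distribution. It is not the paper's method or data: the logits, the token set, and the helper names (`multiple_choice_pick`, `open_generation_pick`, `rank_of`) are all invented for illustration. It assumes multiple choice is scored by comparing model scores over the listed options only, while open generation takes the single top token over the whole vocabulary.

```python
import math


def log_softmax(logits):
    """Convert a {token: logit} dict into {token: log-probability}."""
    m = max(logits.values())
    lse = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - lse for tok, v in logits.items()}


def multiple_choice_pick(logits, options):
    """Multiple-choice style: compare only the listed answer options."""
    log_probs = log_softmax(logits)
    return max(options, key=lambda tok: log_probs[tok])


def open_generation_pick(logits):
    """Open-generation style (greedy): take the top token over the full vocabulary."""
    return max(logits, key=logits.get)


def rank_of(logits, token):
    """1-based rank of `token` when the vocabulary is sorted by logit."""
    ordered = sorted(logits, key=logits.get, reverse=True)
    return ordered.index(token) + 1


# Hypothetical logits for the first answer token of "What is the capital of France?"
# These numbers are made up; they are not measurements from any model.
options = ["Paris", "London", "Berlin", "Rome"]
dense = {"Paris": 5.0, "London": 2.0, "Berlin": 1.5, "Rome": 1.0, "the": 3.0, "a": 2.5}
pruned = {"Paris": 3.0, "London": 2.0, "Berlin": 1.5, "Rome": 1.0, "the": 3.6, "a": 3.2}

for name, logits in [("dense", dense), ("pruned", pruned)]:
    print(
        name,
        "| multiple choice:", multiple_choice_pick(logits, options),
        "| open generation:", open_generation_pick(logits),
        "| rank of 'Paris':", rank_of(logits, "Paris"),
    )
```

In this toy case the pruned distribution still prefers the correct option over the other three, so the multiple-choice pick is unchanged, yet the correct token is no longer the top output over the full vocabulary. The answer has not been erased, only demoted, which is the kind of gap between the two evaluation modes the abstract describes.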
