Ok, I will say it out loud first to get it out of the way. LLMs keep developing, benchmarks suck and become useless, and we're standing in place when it comes to USEFUL benchmarking. Benchmarks mean basically nothing to the user at this point - it's not like typical benchmarking of software or hardware anymore. Benchmarking LLMs stopped working somewhere around spring/summer 2024, in my opinion. It can be debated, like anything, and there are caveats, sure, but that's the position I'm coming from, just to make it clear.
However, when enough time passes, a generalized consensus forms within the community, and you can usually trust it. It goes something like: this scores high but sucks at actual coding, this one is underestimated, this one is unstable, this one is stable but needs hand-holding through prompting, this one is less stable but does the job on its own, this one treats instructions too literally and tries to follow all of them at once all the time, this one treats them too loosely and picks one to follow at random, etc.
Those are generalized opinions about the models, so it's not a skill issue. When I actually follow them and - huhuhu, the irony - use AI to filter and summarize them, I rarely find them to be wrong after trying the models myself.
Now, there are some human-curated tests I am aware of, where people ask different LLMs to do the same things and compare the results, and some even try to be representative with multiple runs, etc. - but it's all very use-case oriented, so it's hard to compare models in general. Some dudes test coding in Python, others test captioning, others test summarizing internet articles or videos, yet others test roleplaying with anime girlfriends or solving math problems from actual exams.
That's all fine and, actually, more useful than standard benchmarks these days - but a question arises:
Is anyone aware of a good-quality, comparative repository of standardized, human-curated tests like that? Does anything standardized across the board exist that I'm just not aware of? I know of the OpenRouter and Hugging Face user reviews/usage charts, which I use myself - but is there anything big that is considered the current SOTA for human-curated tests? A database that pits just the actually useful models against each other in human-controlled tests covering multiple use cases, standardized across the board, instead of one very particular use case with one particular methodology?
Thx in advance and cheers.