Evaluating LLM Output Quality In Production
In March 2023, GPT-4 could tell you whether a number was prime with 97.6% accuracy. By June of the same year, the same model name answered those same questions correctly 2.4% of the time. Nobody pushed a bad commit. No prompt changed in your repo. The thing behind the API just... moved. Those figures come from a Stanford and Berkeley study, "How is ChatGPT's behavior changing over time?", and it's the cleanest illustration I know of the core problem with LLMs in production: the model is not a ...