I’m sure this is not my idea, so I’m not claiming it to be. I’ve been wanting to do a sort of continuous AI eval in production for a while, but the situation never presented itself at work. It was a mixture of having the data to do the eval offline and wanting to avoid the risks of doing it in prod. But now I’m going to do it for a side project.
I don’t want to reveal what my side project is yet, so I’ll keep it vague. I’m very excited about this part, so I wanted to share it early. And I’m hoping that the Internet will tell me, as it usually does, if this is a bad idea.
I have a task that will be done by an AI, and I can measure how successfully it was done, but only 2 to 7 days after the task was completed and its result has been out there, in the world. I will gather some successful examples to use as part of the prompt, but I don’t have a good way to measure the AI’s output other than my personal vibes, which are not good enough.
My plan is to use OpenRouter and run many models in parallel, each doing a portion of the tasks (there are a lot of instances of this task). So if I go with 10 models, each model would be doing 10% of the tasks.
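To make that concrete, here’s a minimal sketch of the first phase in Python, assuming the OpenRouter chat completions endpoint and a made-up model list; the even split comes from simply picking a model at random for each task instance.

```python
import os
import random

import requests

# Hypothetical model pool; the real list would be whatever ~10 models I pick.
MODELS = [
    "openai/gpt-4o",
    "anthropic/claude-3.5-sonnet",
    "google/gemini-flash-1.5",
    # ...
]

def run_task(task_prompt: str) -> dict:
    """Send one task instance to a model chosen uniformly at random."""
    model = random.choice(MODELS)  # even split while every model has equal weight
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": task_prompt}]},
        timeout=60,
    )
    response.raise_for_status()
    content = response.json()["choices"][0]["message"]["content"]
    return {"model": model, "output": content}
```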
After a while, I’m going to calculate the score of each model and then assign the proportion of tasks according to that score, so the better-scoring models take most of the tasks. I’m going to let the system operate like that for a period of time and then recalculate the scores.
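A sketch of that reweighting step, assuming each model’s score is something like a 0-to-1 success rate (the numbers below are made up):

```python
import random

def weights_from_scores(scores: dict[str, float]) -> dict[str, float]:
    """Turn per-model scores into task-allocation proportions."""
    total = sum(scores.values())
    return {model: score / total for model, score in scores.items()}

def pick_model(weights: dict[str, float]) -> str:
    """Sample a model for the next task, proportionally to its weight."""
    models = list(weights)
    return random.choices(models, weights=[weights[m] for m in models])[0]

# Scores measured 2 to 7 days after the tasks went out (made-up numbers).
scores = {
    "openai/gpt-4o": 0.8,
    "anthropic/claude-3.5-sonnet": 0.9,
    "google/gemini-flash-1.5": 0.4,
}
weights = weights_from_scores(scores)  # sonnet gets ~43% of tasks, flash ~19%
```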
After I see it become stable, I’m going to make it continuous, so that day by day (hour by hour?), the models are selected according to their performance.
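One way to make the recalculation continuous rather than in discrete rounds is an exponentially weighted update, so recent results count more and old ones fade out; the smoothing factor here is an arbitrary choice on my part.

```python
def update_score(previous: float, latest: float, alpha: float = 0.3) -> float:
    """Blend the latest batch result into the running score.

    alpha controls how quickly old results are forgotten: higher means
    the score reacts faster to the most recent tasks.
    """
    return alpha * latest + (1 - alpha) * previous
```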
Why not just select the winning model? The task I’m performing benefits from diversity, so if two or more models are maxing it out, I want to distribute the tasks among them.
But also, I want to add new models, maybe even automatically, as they are released. I don’t want to have to come back and redo an eval. The continuous eval should keep me on top of new releases. This will mean reserving a fixed percentage of tasks for models with no wins yet.
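A sketch of what that fixed percentage could look like: every model, scored or not, gets a small reserved slice, and the remainder is split by score. MIN_SHARE is an arbitrary number I picked for illustration; a brand-new model starts there and earns more as it accumulates wins.

```python
MIN_SHARE = 0.02  # arbitrary reserved slice per model, scored or not

def weights_with_floor(scores: dict[str, float], all_models: list[str]) -> dict[str, float]:
    """Reserve MIN_SHARE for every model, then split the remainder by score."""
    # Assumes the pool is small enough that the reserved slices stay well below 100%.
    remaining = 1.0 - MIN_SHARE * len(all_models)
    score_total = sum(scores.get(m, 0.0) for m in all_models) or 1.0
    return {m: MIN_SHARE + remaining * scores.get(m, 0.0) / score_total for m in all_models}
```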
What about prompts? I will do the same with prompts. Having a diversity of prompts is also useful, but having high-performing prompts is the priority. This will allow me to throw prompts into the arena and see how they perform. My ideal would be all prompts on all models. I think here I will have to watch out for the number of combinations making it take too long to get statistically significant data about each combination’s score.
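A rough back-of-the-envelope check on that combinatorial worry, with made-up numbers for task volume and the samples I’d want per combination:

```python
from itertools import product

models = [f"model-{i}" for i in range(10)]   # placeholder IDs
prompts = [f"prompt-{j}" for j in range(5)]  # placeholder prompt variants

arms = list(product(models, prompts))        # 50 (model, prompt) combinations

TASKS_PER_DAY = 200     # made-up task volume
SAMPLES_PER_ARM = 30    # rough rule of thumb for a score I'd trust
days_to_cover = len(arms) * SAMPLES_PER_ARM / TASKS_PER_DAY  # 7.5 days with these numbers
```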
What about cost? Good question! I’m still not sure whether cost should affect the score, as a sort of multiplier, or whether there’s a cut-off cost and any model that exceeds it just gets disqualified. At the moment, since I’m in the can-AI-even-do-this phase, I’m going to ignore cost.