The latest big headline in AI isn’t model size or multimodality — it’s the capacity crunch. At VentureBeat’s latest AI Impact stop in NYC, Val Bercovici, chief AI officer at WEKA, joined Matt Marshall, VentureBeat CEO, to discuss what it really takes to scale AI amid rising latency, cloud lock-in, and runaway costs.
Those forces, Bercovici argued, are pushing AI toward its own version of surge pricing. Uber famously introduced surge pricing, bringing real-time market rates to ridesharing for the first time. Now AI is headed toward the same economic reckoning, especially for inference, once the focus turns to profitability.
"We don't have real market rates today. We have subsidized rates. That’s been necessary to enable a lot of the innovation that’s been happening, but sooner or later — considering the trillions of dollars of capex we’re talking about right now, and the finite energy opex — real market rates are going to appear; perhaps next year, certainly by 2027," he said. "When they do, it will fundamentally change this industry and drive an even deeper, keener focus on efficiency."
The economics of the token explosion
"The first rule is that this is an industry where more is more. More tokens equal exponentially more business value," Bercovici said.
But so far, no one's figured out how to make that sustainable. The classic business triad of cost, quality, and speed translates in AI to latency, cost, and accuracy, with accuracy largely a function of output-token spend. And accuracy is non-negotiable. That holds not only for consumer interactions with agents like ChatGPT, but for high-stakes use cases such as drug discovery and business workflows in heavily regulated industries like financial services and healthcare.
"That’s non-negotiable," Bercovici said. "You have to have a high amount of tokens for high inference accuracy, especially when you add security into the mix, guardrail models, and quality models. Then you’re trading off latency and cost. That’s where you have some flexibility. If you can tolerate high latency, and sometimes you can for consumer use cases, then you can have lower cost, with free tiers and low cost-plus tiers."
However, latency is a critical bottleneck for AI agents. “These agents now don't operate in any singular sense. You either have an agent swarm or no agentic activity at all,” Bercovici noted.
In a swarm, groups of agents work in parallel to complete a larger objective. An orchestrator agent — the smartest model — sits at the center, determining subtasks and key requirements: architecture choices, cloud vs. on-prem execution, performance constraints, and security considerations. The swarm then executes all subtasks, effectively spinning up numerous concurrent inference users in parallel sessions. Finally, evaluator models judge whether the overall task was successfully completed.
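To make that shape concrete, here is a minimal sketch of the swarm pattern in Python. Everything in it (the function names, the four-way subtask split, the evaluator check) is an illustrative assumption, not WEKA's or any vendor's actual API.

```python
import asyncio

# Hypothetical model-call stubs; in practice each would be an LLM API call.
async def orchestrate(task: str) -> list[str]:
    """Orchestrator agent: the smartest model decomposes the task into subtasks."""
    return [f"{task} :: subtask {i}" for i in range(4)]

async def worker(subtask: str) -> str:
    """One inference session in the swarm; runs concurrently with its peers."""
    await asyncio.sleep(0.1)  # stands in for model latency
    return f"result of {subtask}"

async def evaluate(results: list[str]) -> bool:
    """Evaluator model: judges whether the overall task succeeded."""
    return all(r.startswith("result") for r in results)

async def run_swarm(task: str) -> list[str]:
    subtasks = await orchestrate(task)
    # The swarm fans out: each subtask is effectively its own concurrent inference user.
    results = list(await asyncio.gather(*(worker(s) for s in subtasks)))
    if not await evaluate(results):
        raise RuntimeError("evaluator rejected the swarm's output")
    return results

if __name__ == "__main__":
    print(asyncio.run(run_swarm("ship a feature")))
```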
“These swarms go through what's called multiple turns, hundreds if not thousands of prompts and responses until the swarm converges on an answer,” Bercovici said.
“And if you have a compound delay in those thousand turns, it becomes untenable. So latency is really, really important. And that means typically having to pay a high price today that's subsidized, and that's what's going to have to come down over time.”
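That compounding is easy to quantify. A back-of-the-envelope sketch, with per-turn latencies that are purely illustrative:

```python
# Illustrative numbers: how per-turn latency compounds across a swarm's turns.
turns = 1_000          # prompts/responses before the swarm converges
fast_latency_s = 0.5   # seconds per turn on a premium, low-latency tier
slow_latency_s = 3.0   # seconds per turn on a cheaper, slower tier

print(f"premium tier: {turns * fast_latency_s / 60:.1f} minutes end to end")
print(f"cheap tier:   {turns * slow_latency_s / 60:.1f} minutes end to end")
# premium tier: 8.3 minutes end to end
# cheap tier:   50.0 minutes end to end
```

Even half-second turns add up to minutes across a thousand of them, which is why agentic workloads pay a premium for low latency today.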
Reinforcement learning as the new paradigm
Until around May of this year, agents weren't that performant, Bercovici explained. Then context windows became large enough, and GPUs available enough, to support agents that could complete advanced tasks, like writing reliable software. It's now estimated that in some cases, 90% of software is generated by coding agents. Now that agents have essentially come of age, Bercovici noted, reinforcement learning is the new conversation among data scientists at leading labs like OpenAI, Anthropic, and Google DeepMind, who view it as a critical path forward in AI innovation.
"The current AI season is reinforcement learning. It blends many of the elements of training and inference into one unified workflow,” Bercovici said. “It’s the latest and greatest scaling law to this mythical milestone we’re all trying to reach called AGI — artificial general intelligence,” he added. "What’s fascinating to me is that you have to apply all the best practices of how you train models, plus all the best practices of how you infer models, to be able to iterate these thousands of reinforcement learning loops and advance the whole field."
The path to AI profitability
There’s no one answer when it comes to building an infrastructure foundation that makes AI profitable, Bercovici said, since it's still an emerging field; there’s no cookie-cutter approach. Going all on-prem may be the right choice for some, especially frontier model builders, while being cloud-native or running in a hybrid environment may be a better path for organizations that need to innovate with agility and responsiveness. Whichever path they choose initially, organizations will need to adapt their AI infrastructure strategy as their business needs evolve.
"Unit economics are what fundamentally matter here," said Bercovici. "We are definitely in a boom, or even in a bubble, you could say, in some cases, since the underlying AI economics are being subsidized. But that doesn’t mean that if tokens get more expensive, you’ll stop using them. You’ll just get very fine-grained in terms of how you use them."
The pivotal question enterprises and AI companies should be asking, Bercovici said, is “What is the real cost for my unit economics?” Leaders should focus less on individual token pricing and more on transaction-level economics, where efficiency and impact become visible.
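One way to act on that advice is to price the full transaction a swarm completes rather than its individual tokens. A hedged sketch, with invented rates and counts:

```python
# Illustrative unit economics: every rate and count below is made up.
PRICE_PER_1K_INPUT = 0.003    # $ per 1K input tokens
PRICE_PER_1K_OUTPUT = 0.015   # $ per 1K output tokens

def transaction_cost(turns: int, in_tokens: int, out_tokens: int) -> float:
    """Total model spend for one business transaction (e.g., one resolved ticket)."""
    per_turn = (in_tokens / 1000 * PRICE_PER_1K_INPUT
                + out_tokens / 1000 * PRICE_PER_1K_OUTPUT)
    return turns * per_turn

cost = transaction_cost(turns=400, in_tokens=2_000, out_tokens=800)
value = 5.00  # assumed $ the business gains per completed transaction
print(f"cost ${cost:.2f} vs value ${value:.2f} -> margin ${value - cost:.2f}")
# cost $7.20 vs value $5.00 -> margin $-2.20
```

With these made-up numbers the transaction loses money, which is exactly the visibility that per-token pricing alone does not provide.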
Viewed through that lens, the path forward isn’t about doing less with AI — it’s about doing it smarter and more efficiently at scale.