We’re years into the generative AI boom, but much of what we see on X or LinkedIn is still mostly consumer-grade demoware. Sure, you can quickly spin up cool stuff with a ChatGPT prompt, but that’s not where the real magic (or money) will happen. It’s time to get serious about AI. For enterprises trying to put their data to work, the true revolution isn’t model creation at all. It’s inference: the application of those models to real, governed enterprise data to solve actual business problems. Inference will determine who wins in enterprise AI.
Put bluntly: You don’t have a model problem; you have a data-and-deployment problem. Not that you’re alone in this. Follow the money and it’s clear this is a systemic issue.
In the past few months, OpenAI and others have inked massive contracts with cloud vendors, not to build new models, but rather to ensure capacity to run existing ones. OpenAI’s reported $300 billion commitment to Oracle (disclosure: I work for Oracle), along with massive deals across other providers, isn’t about funding science projects. It’s about locking in predictable, industrial-scale capacity to run models. “Scaling frontier AI requires massive, reliable compute,” says OpenAI CEO Sam Altman. The frontier labs can chase ever-larger models; the rest of us have to make them pay off.
So, the real question is how to use enterprise data to make that inference useful and run it at scale without breaking the budget.
Inference overtakes training
IDC forecasts that global investment in AI inference infrastructure will surpass spending on training infrastructure by the end of 2025. That’s a dramatic tipping point. It means enterprises and cloud providers are now spending more on the plumbing to deploy models than on creating the models themselves. The reason is simple: You train a model once in a while, but you run it every hour, all day. This isn’t to say model training is irrelevant. On the contrary, foundational research and the creation of smaller, domain-specific models remain critical. But for the vast majority of enterprises, the race to build a better ChatGPT is a distraction. The real, immediate value comes from figuring out how to make AI work for your data in production.
Lots and lots of production.
IDC also found that 65% of organizations are expected to be running more than 50 generative AI use cases in production by 2025, and more than 25% will exceed 100 use cases. Each of those use cases represents potentially thousands or millions of inference calls. This explosion of real-world usage is driving a corresponding explosion in infrastructure needs. When Amazon CEO Andy Jassy says Bedrock “could ultimately be as big a business as EC2,” he’s really saying the quiet part out loud: Managed inference is the new rent check.
AWS is hardly alone; every major cloud is racing to turn inference into a first-class utility. Nvidia’s latest GPUs are optimized not just for training huge models, but for churning out AI-driven answers efficiently. Startups are building chips and AI accelerators purely for fast, low-cost inference at the edge and in data centers. The reason is simple: If every application in the enterprise is going to embed some AI-driven smarts, the number of inferences running per day will explode, and so will demand for efficient infrastructure to serve them.
Feeding models the right context
Why this shift toward inference? Enterprises are starting to realize that the biggest, fanciest model means nothing without business-specific context and data. As Oracle cofounder Larry Ellison argues, the next frontier of AI isn’t model creation at all—it’s data contextualization: “The companies building massive language models may make headlines, but the real value lies in connecting those models to the right data: private, high-value, business-critical enterprise information.” Nor is he alone in this argument. Chaoyu Yang, founder and CEO of the open source platform BentoML, suggests that “inference quality is product quality” because inference “determines how fast your product feels, how accurate its responses are, and how much it costs to run every single day.”
Much of that data lives in an enterprise’s databases, which are becoming the “memory layer” for AI, a place to store the knowledge that models draw on, as I’ve argued. Today’s generative AI systems are “the ultimate amnesiacs,” processing each query in isolation. They don’t know your company’s data, nor do they retain interactions from last week, unless we connect them to a data source. Hence the rise of retrieval-augmented generation (RAG) and the explosion of vector databases. By storing facts and context in a database that an LLM can query on the fly, we give the model an external brain and a semblance of long-term memory. This dramatically improves relevance and reduces hallucinations because the model no longer needs to guess. Your data can tell it the answer.
A bigger model without better context just hallucinates at scale. No one needs that.
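To make the retrieval pattern concrete, here’s a minimal sketch of the flow described above. It assumes a local sentence-transformers model for embeddings; the `call_llm()` function is a hypothetical stand-in for whatever hosted model you actually use, and a real deployment would swap the in-memory list for a proper vector database.

```python
# Minimal RAG sketch: retrieve relevant documents, then ground the prompt in them.
# Assumptions: sentence-transformers is installed; call_llm() is a hypothetical
# placeholder for your hosted model endpoint.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# In production this would be a vector database, not an in-memory list.
documents = [
    "Refunds are processed within five business days of approval.",
    "Enterprise customers are assigned a dedicated support engineer.",
    "Records older than seven years are moved to cold-storage archives.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for your provider's SDK call.
    return "..."

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q_vec  # cosine similarity; vectors are normalized
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def answer(question: str) -> str:
    """Ground the model in retrieved context instead of letting it guess."""
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using only the context below. If the answer is not in the "
        f"context, say you don't know.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```

The instruction to answer only from the supplied context is what turns the database into the model’s memory: relevance goes up and hallucination goes down because the model isn’t guessing.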
Accelerating into the inference era
We’ve known for a while that enterprises won’t fully embrace AI until the “boring” issues are solved, things like integration, security, compliance, and cost management. As Spring framework creator Rod Johnson puts it, “Startups can risk building houses of straw. Banks can’t.” Those issues come to the forefront when AI goes into production. The good news is that mainstream adoption brings discipline: unit tests for AI agents, monitoring for model outputs, explicit rules about what data an AI can or can’t use, etc. We’re now seeing that discipline being applied, which suggests AI is growing up and getting ready for real-world usage.
For enterprises anxious to move faster on putting their data (and inference) to work, and hoping that high-stakes AI projects will succeed at least as often as other projects, the place to start is inventorying your highest-value data (customer interactions, supply chain logs, knowledge bases, etc.) and thinking about how AI could unlock it. The goal is to bring the model to the data. This might mean using a cloud service that lets you fine-tune a foundation model on your data, or using retrieval techniques so the model can reference your data on the fly. Either way, your proprietary data is your AI edge. Focus on that more than on tweaking model architectures. It doesn’t matter how sophisticated your model is if you’re feeding it lame data. If you don’t know where this data sits, you’re not really ready for AI. Sorry.
Also, if you’re a developer, think about where you can be most valuable. For starters, that means:
- Thinking beyond model training
- Mastering RAG pipelines
- Understanding vector database query optimization
- Writing secure, low-latency APIs to serve the model
- Creating prompts that are tightly coupled to data schemas
- Figuring out cost management and monitoring per API call (see the sketch after this list)
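To illustrate that last point, here’s a minimal sketch of per-call latency and cost tracking. The `call_llm()` function and the per-token rates are hypothetical placeholders; a real deployment would pull token counts from your provider’s usage metadata and ship the metrics to whatever observability stack you already run.

```python
# Sketch of per-call latency and cost tracking for inference requests.
# The rates and call_llm() below are hypothetical placeholders.
import functools
import time

PRICE_PER_1K_INPUT_TOKENS = 0.0005   # hypothetical rate, USD
PRICE_PER_1K_OUTPUT_TOKENS = 0.0015  # hypothetical rate, USD

def tracked(fn):
    """Wrap an inference call with latency and cost logging."""
    @functools.wraps(fn)
    def wrapper(prompt: str) -> str:
        start = time.perf_counter()
        response = fn(prompt)
        latency_ms = (time.perf_counter() - start) * 1000
        # Crude whitespace token estimate; prefer the provider's usage metadata.
        tokens_in, tokens_out = len(prompt.split()), len(response.split())
        cost = (tokens_in * PRICE_PER_1K_INPUT_TOKENS
                + tokens_out * PRICE_PER_1K_OUTPUT_TOKENS) / 1000
        print(f"latency={latency_ms:.0f}ms tokens_in={tokens_in} "
              f"tokens_out={tokens_out} est_cost=${cost:.6f}")
        return response
    return wrapper

@tracked
def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for your hosted model endpoint.
    return "placeholder response"
```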
Second, you don’t need 50 AI use cases from the start. Instead, begin with a smaller list of high-impact use cases. This is about moving beyond slideware and into real production. Where can inference on your data move the needle? Maybe it’s generating custom product recommendations from customer histories, or automating answers to employee HR questions from policy documents. Use early wins to build momentum. Over time, you’ll expand the portfolio of AI-infused applications, but ensuring you have a solid foundation first pays dividends.
Third, optimize for cost-efficient inference, which is both a matter of choosing the right infrastructure and the right model size for the job. (Don’t use a 175-billion-parameter behemoth if a 3-billion-parameter model fine-tuned on your data performs almost as well.) The four big cloud providers are investing heavily to make this a reality.
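A back-of-envelope calculation shows why sizing matters at inference volume. Every number below is a hypothetical assumption for illustration, not a quote from any provider:

```python
# Hypothetical inference cost comparison; all figures are illustrative only.
CALLS_PER_DAY = 1_000_000       # assumed daily inference volume
TOKENS_PER_CALL = 1_000         # assumed prompt + completion tokens

FRONTIER_PRICE_PER_1M_TOKENS = 10.00  # hypothetical: large general-purpose model
SMALL_PRICE_PER_1M_TOKENS = 0.30      # hypothetical: small model tuned on your data

daily_tokens = CALLS_PER_DAY * TOKENS_PER_CALL
frontier_daily = daily_tokens / 1_000_000 * FRONTIER_PRICE_PER_1M_TOKENS
small_daily = daily_tokens / 1_000_000 * SMALL_PRICE_PER_1M_TOKENS

print(f"Frontier model:    ${frontier_daily:,.0f}/day  (${frontier_daily * 365:,.0f}/year)")
print(f"Small tuned model: ${small_daily:,.0f}/day     (${small_daily * 365:,.0f}/year)")
# With these assumed numbers: $10,000/day vs. $300/day, before any accuracy
# trade-off is even measured.
```

If the smaller model clears your quality bar on your own data, that gap is pure margin.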
Fourth, as exciting as it may be to really get humming with AI, don’t forget governance and guardrails. If anything, inference makes these concerns more urgent because AI is now touching live data and customer-facing processes. Put in place the “boring” stuff: data access controls (Which parts of your database can the model see?), prompt filtering and output monitoring (to catch mistakes or inappropriate responses), and policies on human oversight.
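As a sketch of what that “boring” stuff can look like in code, here’s a minimal example of a data access allowlist and an output filter. The table names, patterns, and `run_inference()` function are hypothetical placeholders for your own policies and model endpoint.

```python
# Sketch of two simple guardrails: a data access allowlist enforced before
# retrieval, and an output filter applied before anything reaches a user.
# All names and rules here are hypothetical placeholders.
import re

ALLOWED_TABLES = {"orders", "support_tickets", "product_catalog"}  # no HR, no payroll
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # SSN-like strings
    re.compile(r"internal use only", re.IGNORECASE),
]

def check_data_access(requested_tables: set[str]) -> None:
    """Refuse to build context from tables the model isn't cleared to read."""
    forbidden = requested_tables - ALLOWED_TABLES
    if forbidden:
        raise PermissionError(f"Model may not read: {sorted(forbidden)}")

def filter_output(response: str) -> str:
    """Hold responses that match blocked patterns for human review."""
    if any(p.search(response) for p in BLOCKED_PATTERNS):
        return "[Response withheld pending human review]"
    return response

def run_inference(prompt: str) -> str:
    # Hypothetical placeholder for your governed model endpoint.
    return "The customer has two open orders, both shipping this week."

# Usage: enforce both checks around the call.
check_data_access({"orders", "support_tickets"})
print(filter_output(run_inference("Summarize this customer's open orders")))
```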
A healthy dose of AI pragmatism
The signals are clear: When budget plans, cloud road maps, and C-suite conversations all point toward inference, it’s time to align your business strategy. In practice, that means treating AI not as magic pixie dust or a moonshot R&D experiment, but as a powerful tool in the enterprise toolbox, one that needs to be deployed, optimized, governed, and scaled like any other mission-critical capability.
The first cloud era was won by whoever made compute cheap and easy. The AI era will be won by whoever makes intelligence on top of governed data cheap, easy, and safe. In practice, that means making inference ubiquitous, efficient, and enterprise-friendly. Chasing the biggest model is optional. Making inference work on your own data is not.