The proliferation of edge AI will require fundamental changes in language models and chip architectures to make inferencing and learning outside of AI data centers a viable option.
The initial goal for small language models (SLMs) — roughly 10 billion parameters or less, compared to more than a trillion parameters in the biggest LLMs — was to leverage them exclusively for inferencing. Increasingly, however, they also include some learning capability. And because they are purpose-built for narrowly defined tasks, SLMs can generate results in a fraction of the time it takes to send a query, directive, or sensor data to an AI data center and receive a response.
SLMs are not new. EDA companies have been playing around with optimized computational software for years, and scientists have applied smaller models to solving mathematical and scientific problems. But the rollout of ChatGPT in November 2022 radically changed the world’s perception of AI, and the massive investments that have poured into the industry are enabling commercially available language models to run faster locally using far less energy.
Adopting AI at the edge requires more refinement of language models. But the expectation is that these ultimately will deliver capabilities at the edge that were supposed to be part of the Internet of Things. And while forecasts for total available market (TAM) growth for edge AI are speculative, there is widespread agreement that they are trending sharply upward. Fortune Business Insights estimates the market will reach $267 billion by 2032, up from just $27 billion in 2024. That, in turn, is expected to create a whole new market opportunity for a variety of chips, chiplets, and tools to optimize these designs.
“It’s an area of active research that we’re experimenting with now,” said Billy Rutledge, director of Edge AI Research at Google. “How can we make the models smaller? How can we create the right SLMs that can do the routing and intelligent cascading based on what something is capable of handling, or send it off to another tier? An engine that’s capable of running these models is the starting point. Then we can begin to do more software and ML experience.”
A recent Deloitte survey found that companies that invested in edge computing in 2024 were more upbeat about the return on investment from edge AI than in the past. Deloitte noted that AI embedded into some devices is a potential game-changer because it doesn’t require an internet connection. This has impacts across a spectrum of applications, from industrial and automotive to consumer devices, such as a security camera.
“Instead of getting a message like, ‘Your Ring camera detected motion,’ it might say, ‘Somebody wearing a brown shirt and black shoes picked up a package from your porch and left with it,’” said Jayson Lawley, director of product marketing for AI IP at Cadence. “And you don’t have to send all of those video frames up to the data center to be processed. It’s a huge savings if you can do that at the edge.”
In automotive, SLMs will allow for greater functionality in the vehicle and richer vehicle-to-infrastructure communication. In chip manufacturing, they will provide real-time analytics. And in customer service, they will reduce the frustration of automated answering service menus. Moreover, they will drive new chip architectures, from multi-die assemblies with customized processors and more distributed controllers, to chiplets with pre-loaded SLMs.
The challenge is shrinking these SLMs to workable sizes, and developing hardware architectures that can accelerate the algorithms within a small power budget, but with enough accuracy for the domains in which they are used. Large AI companies have reported order-of-magnitude reductions through quantization, reducing high-precision FP32 (32-bit floating point) to as little as FP4. The reason this works is that not every query, directive, or analysis requires sifting through massive data sets. If lower volumes of highly relevant data can be stored locally, or at least close to the end device, then a simple 4-bit response may be sufficient. Alternatively, with faster processing elements customized for specific data types and more targeted memory architectures, an SLM may be able to use FP16 without significant slowdown or battery drain.
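As an illustration of how quantization trades precision for size, the sketch below implements simple per-tensor symmetric quantization in NumPy. It is a minimal example of the general technique, not any vendor's toolchain, and the function names are invented for this illustration.

```python
import numpy as np

def quantize_symmetric(weights: np.ndarray, bits: int = 4):
    """Map FP32 weights onto a signed integer grid (illustrative sketch)."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
    scale = float(np.max(np.abs(weights))) / qmax    # per-tensor scale factor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original weights at load time."""
    return q.astype(np.float32) * scale

# Quantize a small weight matrix to 4 bits and measure the error it introduces.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_symmetric(w, bits=4)
error = np.abs(w - dequantize(q, scale)).mean()
print(f"mean absolute error after 4-bit round trip: {error:.4f}")
```

In practice, lower bit widths are tolerable only when the task and data are narrow enough that the lost precision does not show up as lost accuracy, which is exactly the bet SLMs make.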
“A lot of people are thinking about these small language models,” said William Wang, CEO of ChipAgents. “Customers want high performance for their task, but they want to make sure they make the right tradeoffs. Maybe you can get a slightly lower-performing model with a faster response rate. For example, Cursor just released its Composer model, which is not as good as the frontier model, but it’s very fast. You want to push the Pareto curve, but you also need to reach the basic level of accuracy for your task.”
This is a very different approach for AI. “Large language models are essentially a brute force way to take all the data we have and compress it into all these different connections with all of the different vectorizations that happen,” said Cadence’s Lawley. “But if you can get that smaller and smaller, and then compress it, you can really start being able to push things to the edge much more effectively.”
For example, data stored in an edge device can be limited to what is especially relevant to the functioning of a particular chip or chiplet, rather than trying to add a global context.
“A lot of these products know what their usage will be,” said Steve Tateosian, senior vice president of IoT, consumer, and industrial MCUs at Infineon. “You’re not going to ask your thermostat why your Wi-Fi dropped off or to create a thesis about the U.S. Constitution. You’re going to ask it about domain-specific content. But we can go beyond the language model of a wake word to include natural language processing of that question, and then into the language model that generates the response. We’re calling it an edge language model, or ELM, but where we’ll see this go is from generative to generic AI, so the model can be used for different things. You may have multiple ELMs running, and you can train one language model to ask about the context, and another that’s trained on vision, and so on. And then, on top of all your models, you may have an agent that is using that input to inform the user about something of interest, like the location of your car, because it actually recognizes your car.”
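As a rough sketch of the arrangement Tateosian describes, the code below shows how several domain-trained ELMs could sit beneath a coordinating agent. Every name and interface here is hypothetical; it only illustrates the shape of the idea, not Infineon's software.

```python
from typing import Callable, Optional

# A domain-specific edge model is modeled here as a callable that turns raw
# sensor input into a structured observation (hypothetical interface).
EdgeLanguageModel = Callable[[bytes], dict]

class EdgeAgent:
    """Sits on top of several domain-trained ELMs and decides when to notify the user."""

    def __init__(self, models: dict):
        self.models = models  # e.g. {"audio": audio_elm, "vision": vision_elm}

    def observe(self, domain: str, payload: bytes) -> Optional[str]:
        result = self.models[domain](payload)
        # Only surface an event when the domain model is confident,
        # e.g. the vision ELM recognizing a known vehicle.
        if result.get("event") and result.get("confidence", 0.0) > 0.8:
            return f"{result['event']} (via {domain} model)"
        return None
```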
Fig. 1: Energy used by an LLM in a data center compared with an ELM. Source: Infineon
Targeting workloads
More generic SLMs make sense in the short term because they can take advantage of a broad array of processing elements. Language models are in a state of almost constant change, while hardware takes 18 to 24 months to design, verify, and manufacture. By that point, a chip co-designed for a specific SLM is already obsolete.
“You want to distill some basic knowledge from the bigger models and inject that into smaller models,” said ChipAgents’ Wang. “But you also want to be able to prune the weights so that instead of 16 bits you use 8 bits. Everything gets compressed. There are well-known algorithms to compress the weights and get to a certain level of accuracy. But language models and AI move so fast that it’s difficult to co-design the hardware. A year ago, people were co-designing hardware for Llama 3. But nobody is using Llama 3 anymore, and a chip built for Llama 3 may not support Llama 4, so nobody will buy it.”
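The distillation Wang refers to is typically implemented as a loss that blends the larger model's softened predictions with the ground-truth labels. The PyTorch-style sketch below is a generic version of that idea, not ChipAgents' implementation; the temperature and weighting values are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the teacher's softened predictions with the hard-label loss (illustrative)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),   # student's softened distribution
        F.softmax(teacher_logits / T, dim=-1),       # teacher's softened distribution
        reduction="batchmean",
    ) * (T * T)                                      # rescale for the temperature
    hard = F.cross_entropy(student_logits, labels)   # ordinary supervised loss
    return alpha * soft + (1 - alpha) * hard

# Example: a batch of 8 samples over a 1,000-token vocabulary.
student = torch.randn(8, 1000, requires_grad=True)
teacher = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
distillation_loss(student, teacher, labels).backward()
```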
Alternative approaches include adding some programmability into chips, or using more general-purpose chips in some customized configuration that provides the best tradeoffs with one or more narrowly focused SLMs.
“As you get closer to the device, you have more monetizable services,” said Nandan Nayampally, chief commercial officer at Baya Systems. “You will have SLMs for noise cancellation, for visual recognition, and not just standard visualization. It will be different contexts. We’re seeing demand for more specific, more tailored models across several of our clients. An LLM is really kind of general knowledge, and a lot of the SLMs that are developed from the LLMs are much more associated with inference than training. That training is not going away, and if anything, in the short- to medium-term it will increase because there will be more models that are baselines for SLMs. But the inference point is moving down from the cloud to the network edge and potentially to the end device, and that transition gets quite interesting.”
Others agree. “Last year we talked a lot about what happened to IoT, which has been around since 2013 or 2014,” said Thomas Rosteck, president of Infineon’s Connected Secure Systems Division. “What’s changing is that in the past, the IoT was more like an interface to the cloud. Now it’s really becoming the internet of things. Things are talking to each other. For example, I have about 100 IoT devices at home. I have a smoke detector, and if the smoke detector is tested from time to time, it sets off an alarm that all the other smoke detectors will repeat, all the lights will go on, and all the shades will go up. It’s a practical example of things based on the guidelines we’ve given them. Edge AI adds a capability to an IoT device by providing more intelligence, and then providing a new feature set. So will there be a change? Yes, because the devices at the edge are becoming way more powerful. And the work split between the edge and the cloud has to change, because the cloud server farms consume so much energy that we have to at least get it to where it makes sense from a data transportation standpoint, but also from a task standpoint.”
That doesn’t mean the cloud is no longer useful. Models still need to be trained, and massive contextual searches and analysis are too big for edge devices. But moving more processing to the edge does reduce the cost of every AI transaction, in terms of the energy needed to move data, process it in the cloud, and return results in a form that’s usable at the edge.
“One of the ways you become more efficient is you reduce the energy that it takes to move all of this data,” said Charlie Janac, chairman and CEO of Arteris. “Another way is to improve the way the LLMs process the data. So there are a lot of innovations here to be done, and this innovation is necessary, because right now, if you look at all of the data centers that are being built for AI training and inference, they are, in the aggregate, supposed to use three times more energy than the world has produced so far. So there’s a big market for small nuclear reactors, but one of the answers is that this whole thing becomes more efficient, and instead of focusing on just processing power, we’re going to have to be focused on energy efficiency and energy utilization.”
That efficiency comes from improving the efficiency of systems running LLMs, but also processing more data at the edge with SLMs, and limiting the amount of data that needs to be sent to the cloud. “The key here is to minimize the data transfer back and forth,” said Venkat Kodavati, senior vice president and general manager of Synaptics’ Wireless Division. “But when you have to do it, you also want to do that in an efficient way and save power. We’ve seen a lot of small language models, and with a few hundred million parameters we can support that transfer on our edge devices. But models eventually will be able to run on edge devices more efficiently. You can do a lot of inferencing at the edge, as well as some training. And you can do customized training at the edge and then update the models in the cloud. All of these things will happen very soon.”
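As a rough sketch of the loop Kodavati describes, the code below shows customized training at the edge followed by an update sent to the cloud as a compact weight delta. The steps and names are generic assumptions for illustration, not Synaptics' stack.

```python
import numpy as np

def local_update(weights: np.ndarray, grads: np.ndarray, lr: float = 1e-3) -> np.ndarray:
    """One illustrative on-device fine-tuning step from locally gathered data."""
    return weights - lr * grads

def delta_for_cloud(base: np.ndarray, tuned: np.ndarray) -> np.ndarray:
    """Only the compact weight delta travels upstream, not the raw sensor data."""
    return tuned - base

# Sketch of the round trip: tune locally, send the small delta to the cloud,
# and pull down the refreshed shared model on the cloud's schedule.
base = np.random.randn(1_000_000).astype(np.float32)   # stand-in for SLM weights
grads = np.random.randn(1_000_000).astype(np.float32)  # stand-in for local gradients
tuned = local_update(base, grads)
delta = delta_for_cloud(base, tuned)
```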
Fig. 2: Use cases at the intelligent edge. Source: Synaptics
And in many cases, much of this will be hidden from users. “Where AI really starts to impact people is going to be when they don’t know that it impacts them,” said Lawley. “It’s going to be invisible to them, like the removal of background noise when we’re talking. It just sort of integrates into everyday life, almost like a cell phone is now. You’ll see that with edge applications. I predict it will be even harder to discern that you’re using technology. It will just be how you live your life.”
Local when possible, global when necessary
Hybrid models that utilize both the cloud and the edge will be the norm in most cases, at least in the short term. Local processing will produce faster results, but devices still need to communicate with some large data center for things like maintenance and software updates, and for queries to large data sets that cannot be stored locally, such as in semiconductor manufacturing. In fact, SLMs increase the amount of data that needs to be processed during multiple test insertions.
“We’re going to ride the large language models for a while,” said Ira Leventhal, vice president of applied research and technology at Advantest. “Small language models will be focused on some niche applications where it makes sense to use them. But from a test standpoint, the advantage you get if things do go to small language models is they’re very purpose-focused. So you can cut that down to a smaller set of use cases that you have to wring out during test, with less variability. It would simplify things. But if you’ve got a bunch of small language models, you also have to worry about testing all of those, and you’ve got to test them in parallel.”
This requires keeping track of all the interactions and dependencies involving multiple small language models. SLMs need to be carefully integrated into complex processes, such as semiconductor test or inspection, or they can cause problems.
“We are leveraging the know-how of large language model capabilities, but customers want it very specific to our systems, and then they want it very specific to their data and localized to them,” said John Kibarian, CEO of PDF Solutions. “They want something that’s purpose-built exactly, but which can get smarter about their environment and which is always updated, based on whatever capability is available. And they would like to see AI as an augmentation so that knowledge can be captured and transferred to the next generation of engineers. That will bring our industry to places it hasn’t been before, while not forgetting the knowledge that was captured in the past. They are looking for this kind of small, locally trained capability, effectively encapsulating tribal knowledge at some level by learning what’s gone on in all their past production, all their past analytics, test programs, the way they looked at data in the past, so they can more rapidly spread that capability across the organization.”
More features, new challenges
SLM is a broad label that ultimately will be broken down into subsets. For example, there are multi-modal models and video SLMs, and there will be others as more features are added into edge devices. What’s not clear at this point is how they might interact, how to structure those interactions in useful ways, or how to minimize them when that’s not possible. On top of that, some type of oversight will be needed to ensure these devices remain reliable if they are allowed to learn.
“On edge devices, we’re beginning to look at how we start operating in a different domain, like how do we operate in the token space,” said Kai Yick, engineering director at Google. “And how do you do sensor fusion sorts of things in that tokenization space, and then on the edge devices? Once you get everything tokenized, then you can make decisions. Should it take an action? Should it then cascade that decision to something else? For example, it can move to my phone, which could be more capable of actually running a little LLM in that circumstance. Then, should that LLM respond to me based on a query? Or if it’s an action, should it take a more sophisticated action based on that intent? And what happens if it exceeds the capability of something? Does it then cascade to the data center? This cascade architecture is what we’re looking at.”
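A minimal sketch of the cascade Yick outlines might look like the following, where a request is handled at the lowest tier that is confident enough and escalated otherwise. The tiers, thresholds, and Decision structure are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    action: str          # what the model proposes to do or answer
    confidence: float    # how sure the model is, on a 0-1 scale

Tier = Callable[[str], Decision]

def cascade(request: str, device_slm: Tier, phone_model: Tier, cloud_llm: Tier,
            device_threshold: float = 0.85, phone_threshold: float = 0.70) -> str:
    """Route a tokenized request up the tiers only when the lower tier is unsure."""
    d = device_slm(request)
    if d.confidence >= device_threshold:
        return d.action                      # handled entirely on the device
    d = phone_model(request)
    if d.confidence >= phone_threshold:
        return d.action                      # handled on a nearby, more capable tier
    return cloud_llm(request).action         # escalate to the data center
```

The interesting engineering questions sit in the thresholds themselves, since they determine how often data leaves the device and how much latency and energy each escalation costs.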
Conclusion
Tradeoffs between accuracy and performance will continue to dominate edge devices and the SLMs developed for them for the near future. But companies that deliver edge AI will have the advantage of leveraging what they have learned in the cloud with LLMs to speed the rollout of SLMs. The less distance data needs to travel, and the less data that needs to be sent to the cloud, the faster the response. And the tighter the specs on what an SLM can do, the greater the speed at which all of that can be optimized.
SLMs are coming fast, and they are pushing boundaries in all directions. In some cases, they will be multi-modal. In others, they will be targeted at a specific mode, such as vision or natural language audio. Regardless, they will define and redefine how we interact with machines, and how machines interact with each other, and all of this will happen much closer to the source of data, and to tools and machines that people use at work and in their daily lives.