The Transformer that underpins large language models certainly has range. And that range is proving to be one of the biggest confounding factors when it comes to making these models run efficiently.
Many hardware vendors are looking to on-device language models built using the Transformer. The shift to devices should ease the minds of users worried about servers in the cloud eavesdropping on regular conversations and then using them as training data.
“We see a lot of smart-home appliances where you will be able to use voice or natural language,” says Augustine Nebu Philips, senior director of strategy and business development at Synaptics.
Models like Whisper, Moonshine and Parakeet offer reasonably accurate speech-to-text conversion. They feed data to a relatively small language model that attempts to parse the transcriptions into usable commands. The other big target Philips sees for Synaptics is industrial control through the use of vision-language models.
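As a rough illustration of that kind of pipeline, the sketch below transcribes an audio clip with the open-source whisper package and hands the text to a stand-in intent parser. The parse_intent function and the audio filename are hypothetical placeholders for whatever small language model a device actually runs.

```python
# Minimal sketch of a speech-to-command pipeline of the kind described above.
# The whisper package is the open-source release of OpenAI's model; the
# parse_intent() step stands in for a small on-device language model.
import whisper

def transcribe(audio_path: str) -> str:
    model = whisper.load_model("tiny")          # smallest Whisper checkpoint
    result = model.transcribe(audio_path)
    return result["text"]

def parse_intent(utterance: str) -> dict:
    # Hypothetical stand-in for a small language model that maps
    # free-form text to a structured command.
    text = utterance.lower()
    if "light" in text and "off" in text:
        return {"device": "light", "action": "off"}
    return {"device": None, "action": None}

command = parse_intent(transcribe("kitchen_request.wav"))
```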
Vision-language models compete more directly with the convolutional neural networks used by security cameras and industrial imaging pipelines. Transformer-based models take advantage of the same type of vector mapping used for words to encode image segments. In principle, it gives the model better mechanisms for identifying events and reacting to them.
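The sketch below shows that idea in its simplest form, assuming a ViT-style approach: the image is cut into fixed-size patches and each patch is projected into an embedding vector, giving the Transformer a sequence of tokens analogous to words. The patch size, dimensions and random projection are illustrative only.

```python
# Illustrative patch embedding: an image becomes a sequence of vectors,
# analogous to word embeddings, before entering a Transformer.
import numpy as np

def patch_embed(image: np.ndarray, patch: int = 16, dim: int = 256) -> np.ndarray:
    h, w, c = image.shape
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((patch * patch * c, dim)) * 0.02  # learned in practice
    patches = []
    for y in range(0, h - h % patch, patch):
        for x in range(0, w - w % patch, patch):
            patches.append(image[y:y + patch, x:x + patch].reshape(-1))
    return np.stack(patches) @ proj          # (num_patches, dim) token sequence

tokens = patch_embed(np.zeros((224, 224, 3)))
print(tokens.shape)                          # (196, 256)
```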
Inevitably, there is a cost that comes with this additional predictive power. A Transformer-based model is heavy not just on matrix arithmetic but also on memory. Even the relatively simple Whisper needs 1Gbyte of memory for its smallest version. There are ways to trim the overhead.
One method is to prune neural networks to remove connections that have little effect on the output. But because today’s hardware is usually optimised for dense matrix and tensor multiplications, pruning has a limited payoff. Accelerator designers like Tenstorrent have introduced methods to cut the overhead of handling sparse networks. But high performance often relies on reorganising the network to make the sparsity less irregular. That typically proves difficult to achieve.
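A minimal sketch of magnitude pruning shows why the payoff is limited on dense hardware: the zeroed weights still take part in an ordinary matrix product unless the accelerator can skip them. The shapes and sparsity level below are arbitrary.

```python
# Magnitude pruning: zero out the smallest-magnitude weights. The dense matmul
# below still performs the same number of multiplies, which is why unstructured
# sparsity alone rarely speeds up hardware built for dense tensors.
import numpy as np

def prune_by_magnitude(weights: np.ndarray, sparsity: float) -> np.ndarray:
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

rng = np.random.default_rng(1)
w = rng.standard_normal((512, 512))
w_sparse = prune_by_magnitude(w, sparsity=0.9)   # 90% of weights become zero
x = rng.standard_normal(512)
y = x @ w_sparse   # same cost as the dense product unless hardware skips zeros
```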
Converting neural weights from the long floating-point numbers used during training to integers just a few bits wide has so far proven far more successful. This does not just save space. This quantisation or microscaling also works well with the highly parallel execution pipelines used in many neural accelerators for embedded systems.
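A minimal sketch of symmetric int8 quantisation is shown below; microscaling formats differ in detail, typically sharing a scale across small blocks of values, but they follow the same scale-and-round principle. The sizes are arbitrary.

```python
# Symmetric int8 quantisation: store weights as 8-bit integers plus one
# floating-point scale, then dequantise (or compute directly in integers).
import numpy as np

def quantise_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(2).standard_normal((256, 256)).astype(np.float32)
q, s = quantise_int8(w)
error = np.abs(w - dequantise(q, s)).max()   # small, while storage drops 4x
```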
But these techniques are to some extent just chipping away at the edges. Are there ways to cut the Transformer architecture itself down to size? Because increasing parameter counts has delivered better benchmark results, this may seem unlikely. But things are changing.
Transformer architecture
“We must strive for better,” said IBM Research chief scientist Ruchir Puri at a conference on AI acceleration organised by the computer company and the IEEE in November. He expects almost all language models to move to local and edge processing by 2030.
Efforts on small language models such as TinyLlama, Alibaba’s Qwen, Microsoft’s Phi and Hugging Face’s Smol series have shown there is scope for stripping out as much wasted computation as possible, helped by a much stronger focus on the quality rather than the quantity of training data.
“It is becoming clear that small language models are improving dramatically. There are a lot of tasks in the world that can be done with smaller models. We went from the era of massive, single systems to the era of personal computers. We are at the same juncture with AI,” Puri claims.
A group led by Professor John Hennessy at Stanford University showed that small language models can handle almost 90 percent of simple chat and reasoning queries. Two years of changes saw more than a threefold improvement in “intelligence per watt”. Underlying hardware optimisations provided another 1.7x reduction in power.
Even with these reductions, the Transformer’s self-attention concept demands a heavy price. Attention for every token needs to be calculated against each of the others in the sequence fed into the model. Simpler models like Whisper can use sliding windows to reduce the context window to a bare minimum. But more complex models intended for reasoning and image or video analysis place huge demands on the computing hardware. Even caching can be counterproductive. Many models use a key-value (KV) cache to avoid recalculating the keys and values of earlier tokens every time a new one is processed. But there is a trade-off between how long it takes to build that cache for each new prompt and how often those values will be reused.
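The sketch below illustrates both ideas for a single attention head: a key-value cache that grows by one entry per generated token, with an optional sliding window that discards the oldest entries. It is illustrative only; the KVCache class and the dimensions are not drawn from any particular model.

```python
# Sketch of single-head decoding with a key-value cache and an optional
# sliding window. Each new token only computes attention against the cached
# keys/values instead of re-running the whole sequence.
import numpy as np

def attend(q, keys, values):
    scores = keys @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

class KVCache:
    def __init__(self, window=None):
        self.keys, self.values, self.window = [], [], window

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        if self.window and len(self.keys) > self.window:   # sliding window
            self.keys.pop(0)
            self.values.pop(0)
        return attend(q, np.stack(self.keys), np.stack(self.values))

cache = KVCache(window=64)
rng = np.random.default_rng(3)
for _ in range(10):                        # one decode step per new token
    q, k, v = rng.standard_normal((3, 128))
    out = cache.step(q, k, v)
```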
AI researchers are looking to another form of pruning for Transformers. Much of the time, tokens are pulled into the self-attention calculations even though they have practically no effect on the output. Working out which ones are which ahead of time is more or less impossible. But if you skip some of them according to a predetermined plan, you can save a lot of effort, hopefully without losing too much accuracy.
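As an illustration of such a predetermined skip plan, and not any specific published scheme, the sketch below builds an attention mask in which each token only attends to a local neighbourhood plus every k-th position.

```python
# Skipping attention targets by a predetermined pattern: each query attends
# only to every k-th token plus a local neighbourhood, cutting the work well
# below the full n*n pairs at some cost in accuracy.
import numpy as np

def skip_pattern(seq_len: int, stride: int = 4, local: int = 2) -> np.ndarray:
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, ::stride] = True                          # strided global tokens
        lo, hi = max(0, i - local), min(seq_len, i + local + 1)
        mask[i, lo:hi] = True                             # local window
    return mask

mask = skip_pattern(16)
print(mask.sum(), "of", mask.size, "pairs computed")      # fraction of full attention
```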
A recent proposal called Nexus, developed at Huawei and Nanyang Technological University, reworks self-attention into a hierarchy. The researchers applied some mathematical tricks to prevent the restructuring from needing even more computation. There is still a trade-off between performance and the number of layers, but they argue the technique helps capture long-range dependencies between tokens more effectively. That should let smaller models achieve better accuracy while also pointing the way to smarter skip patterns.
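The published details go beyond what can be shown here, so the sketch below is not the Nexus formulation itself, just a toy two-level hierarchy: tokens attend within their own chunk plus to one pooled summary per chunk, so long-range information flows through the summaries rather than through every token pair.

```python
# A toy two-level attention hierarchy (not the Nexus method itself): tokens
# attend within their chunk, plus to one mean-pooled summary vector per chunk.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hierarchical_attention(x: np.ndarray, chunk: int = 8) -> np.ndarray:
    n, d = x.shape
    summaries = x.reshape(n // chunk, chunk, d).mean(axis=1)   # one per chunk
    out = np.empty_like(x)
    for c in range(n // chunk):
        block = x[c * chunk:(c + 1) * chunk]
        context = np.concatenate([block, summaries])           # local + global
        scores = block @ context.T / np.sqrt(d)
        out[c * chunk:(c + 1) * chunk] = softmax(scores) @ context
    return out

y = hierarchical_attention(np.random.default_rng(4).standard_normal((64, 32)))
```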
A change in the way developers describe their models and algorithms may make work to exploit pruning and sparsity more viable. The startup Furiosa AI promotes the idea of representing the operations as high-level tensor contractions rather than as combinations of individual matrix-multiplication instructions. But the company is far from alone in the field. At the IBM-IEEE conference, Nvidia researcher Joel Emer described how representing matrix operations using abstractions like Einstein summations (einsums) can prevent software from being locked to a particular target architecture. Tools like Google’s Jax can translate the einsum representations into the specific low-level matrix and vector instructions an accelerator implements.
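As a flavour of that style, the attention contractions can be written as single einsum expressions rather than hand-scheduled matrix multiplies. The NumPy version below uses the same einsum strings that jax.numpy.einsum accepts and lowers to a given backend; the tensor sizes are arbitrary.

```python
# Attention written as einsum contractions rather than explicit batched
# matrix multiplies; the contraction strings describe what to compute and
# leave the scheduling to the compiler or runtime.
import numpy as np

batch, heads, q_len, k_len, dim = 2, 4, 16, 16, 32
rng = np.random.default_rng(5)
q = rng.standard_normal((batch, heads, q_len, dim))
k = rng.standard_normal((batch, heads, k_len, dim))
v = rng.standard_normal((batch, heads, k_len, dim))

scores = np.einsum("bhqd,bhkd->bhqk", q, k) / np.sqrt(dim)     # contract over d
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = np.einsum("bhqk,bhkd->bhqd", weights, v)                 # weighted sum over k
```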

[Diagram: the fibertree representation and how it can feed data into the Onyx accelerator developed at Stanford University]
Fibertrees
The next step proposed by Emer and others may be to move to fibertrees, which represent the elements in these huge matrices and tensors as branches within tree-like data structures. The advantage of this representation is that it lets the developer express sparsity in a simpler way. It also lets architects model different accelerator designs more easily and see how well they might run a new generation of more efficient AI models.
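A rough sketch of the idea, using plain Python dictionaries rather than the richer abstraction described in Emer's work: each level of the tree is a "fiber" mapping coordinates to payloads, so only non-zero elements are stored and the sparsity is explicit in the layout.

```python
# A sparse matrix as a fibertree-style nested structure: the top-level fiber
# maps row coordinates to child fibers, which map column coordinates to values.
dense = [
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0],
    [5.0, 0.0, 1.0],
]

fibertree = {
    r: {c: val for c, val in enumerate(row) if val != 0.0}
    for r, row in enumerate(dense)
    if any(v != 0.0 for v in row)
}
# {0: {1: 2.0}, 2: {0: 5.0, 2: 1.0}}

def spmv(tree, x):
    # Sparse matrix-vector product that only visits stored coordinates.
    return {r: sum(val * x[c] for c, val in fiber.items())
            for r, fiber in tree.items()}

print(spmv(fibertree, [1.0, 1.0, 1.0]))   # {0: 2.0, 2: 6.0}
```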
Such architectures might open the door to smaller models that finally take advantage of a discovery made almost a decade ago: lurking inside big neural networks are much smaller sparse networks that do most of the valuable work.
Two major obstacles to exploiting this are the difficulty of finding these “winning lottery tickets” and then of converting what is found into an efficient model. But there are powerful incentives to pursue this line of investigation, even for big cloud-based models. Experiments have shown that fine-tuning approaches that find the winning lottery tickets within the larger network and then train them independently are less prone to problems like catastrophic forgetting when turning these models into useful AI assistants.
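For orientation, the sketch below follows the general lottery-ticket recipe of train, prune, rewind and retrain. The train_step function is a placeholder gradient update rather than a real training loop, and the shapes and hyperparameters are arbitrary.

```python
# Sketch of the lottery-ticket procedure: train, prune the smallest weights,
# rewind the survivors to their initial values, and retrain only that subnetwork.
import numpy as np

def train_step(w, mask):
    grad = np.random.default_rng().standard_normal(w.shape)   # stand-in gradient
    return (w - 0.01 * grad) * mask                           # masked update

def find_winning_ticket(init_w, rounds=3, prune_frac=0.5):
    w, mask = init_w.copy(), np.ones_like(init_w, dtype=bool)
    for _ in range(rounds):
        for _ in range(100):
            w = train_step(w, mask)
        threshold = np.quantile(np.abs(w[mask]), prune_frac)  # prune smallest
        mask &= np.abs(w) >= threshold
        w = init_w * mask                                     # rewind survivors
    return mask, w

mask, ticket = find_winning_ticket(
    np.random.default_rng(6).standard_normal((128, 128)))
```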
The problem facing anyone trying to build hardware acceleration for these models is that, at this stage, the architecture they would accelerate keeps moving. Making a firm bet on hardwiring much of the acceleration will probably be a losing move for some time to come.