Experts At The Table: Semiconductor Engineering gathered a group of experts to discuss how some AI workloads are better suited for on-device processing to achieve consistent performance, avoid network connectivity issues, reduce cloud computing costs, and ensure privacy. The panel included Frank Ferro, group director in the Silicon Solutions Group at Cadence; Eduardo Montanez, vice president and head of PSOC Edge Microcontrollers & Edge AI Solutions, IoT, Wireless and Compute Business at Infineon; Alexander Petr, senior director at Keysight; Raj Uppala, senior director of marketing and partnerships for Silicon IP at Rambus; Niranjan Sitapure, central AI product manager at Siemens EDA; and Gordon Cooper, principal product manager at Synopsys. What follows are excerpts of that discussion.
**L-R: Cadence’s Ferro, Infineon’s Montanez, Keysight’s Petr, Rambus’ Uppala, Siemens’ Sitapure, and Synopsys’ Cooper.**
SE: As the industry increasingly discusses and plans moving AI applications that have been residing in the cloud to the edge, it’s important to understand why this is happening. What are the key drivers?
Ferro: The main reason we’re seeing a lot of interest in how to support these AI applications at the edge is that training has been the hot topic for the last four or five years, and a lot of these models, as they mature, are getting pushed out to the edge and the endpoints of the network. That means we’ve seen a lot more interest in AI inference. I’ve even seen market reports saying the AI inference market is going to start to grow. The compute requirements for inference are lower, but as you move out to the edge and the endpoints of the network, the constraints on cost, power, and everything else intensify. You’ve got to be much lower power, much lower cost, and much more efficient in how you implement those systems. That’s where I’ve been spending a lot of time, meeting with customers and talking about how they can implement these LLMs at the edge. Even compared with the requirements from less than a year ago, these models need more capacity and bandwidth. So as LLMs grow, AI inference is becoming more compute-intensive.
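Ferro’s point about capacity and bandwidth can be made concrete with a rough, back-of-the-envelope sketch (our illustration, not a figure from the panel). Assuming a simple decoder-only LLM in which each generated token streams every weight from memory roughly once, the weight footprint and sustained bandwidth scale directly with model size and quantization:

```python
# Rough sizing sketch, not from the panel. Assumes each generated token reads
# all weights once; real engines (KV caching, batching, sparsity, speculative
# decoding) will change these numbers.

def edge_llm_estimate(params_billion: float, bits_per_weight: int, tokens_per_sec: float):
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    bandwidth_bytes_per_sec = weight_bytes * tokens_per_sec
    return weight_bytes / 1e9, bandwidth_bytes_per_sec / 1e9  # GB, GB/s

# Hypothetical 3B-parameter model, INT4-quantized, generating 20 tokens/s:
gb, gbps = edge_llm_estimate(3, 4, 20)
print(f"~{gb:.1f} GB of weights, ~{gbps:.0f} GB/s sustained bandwidth")
# -> ~1.5 GB of weights, ~30 GB/s, which is why quantization and memory
#    bandwidth dominate the cost of LLM inference at the edge.
```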
Montanez: The cloud presents a lot of limitations that we are addressing, specifically around the wireless connectivity infrastructure. Not everything needs to be connected, so edge AI has a good opportunity to deliver a local user experience. There also are limitations in data privacy. A lot of us have products around the home with cameras or microphones, and who knows where your data is going? Edge AI offers the ability to give the user a different type of experience without having their data go everywhere. As Frank just noted, there’s also the capability to create new experiences with battery-powered products. There’s definitely a limitation in the heavy reliance on data centers and their growing footprint, which consumes a ton of power. Some of these LLMs can run at the edge at a fraction of the energy.
Petr: It makes sense to distinguish between training and inferencing. LLMs were mentioned, but there aren’t just LLMs. There also are neural networks. So when we talk about AI, we have to be clear about what we are talking about and what we want to run. We are also nowadays talking about GPUs, TPUs, NPUs, for example. Neural network processing units and tensor processing units are similar, but not the same. It really depends on what you’re doing. What you encounter now in the industry is basically the question of where to train and where the inference is needed. It also depends a lot on the size and the capabilities of those LLMs. If we stick with the LLMs, there is a definition of east-west traffic and north-south traffic. The requirements for training are significantly different than the requirements for inferencing. For training, the important thing is that you need a massive amount of compute, parallelized to a great extent, and you need to be able to shovel data from one unit to the other.
SE: Moving data has a lot of overhead, right?
Petr: If you look at the data centers, capacity is purchased in gigawatts. NVIDIA makes a deal with a data center provider, saying, ‘I need 4 gigawatts or more than that.’ For capacity, they don’t talk about how many CPUs or GPUs, or what bandwidth or memory they need. They start talking about energy. The hyper-virtualization and the parallelization, the communication in training, are different. We also heard from the other panelists that where the memory sits and how it’s connected, those latencies are crucial. On the inferencing side, I see a hard distinction between our semiconductor industry and, say, consumers. If you go into GPT, most of the inferencing happens in the cloud. But if you go to your phone, we now have TPUs and NPUs on our phones, so this is already an edge device, and we’re seeing different compute technologies and different sizes of LLMs being deployed. One of the most important things I see, working with customers on AI solutions, is that it’s really all about security. There’s a clear distinction between LLMs built on data that’s widely available, meaning scraped from the internet, and fine-tuned models or user-specific AI solutions, which are unique to each company, built, trained, and refined on their IP. As soon as you go to companies that have that security requirement and don’t want to expose their IP to the internet at all, we’re talking about air-gapped solutions. That’s where you see more and more edge requirements. That’s a big driver of why we’re seeing more and more data centers moving to private premises, and edge devices being deployed on the training side as well as the inference side. Then you also have mobile devices. Devices that run on batteries have different requirements.
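Petr’s point that capacity is quoted in energy rather than chip counts follows from simple arithmetic. A minimal sketch, assuming roughly 1.2 kW of all-in power per accelerator (including cooling and overhead), a figure that varies widely in practice:

```python
# Back-of-the-envelope only; the per-device power figure is an assumption.
site_power_w = 4e9           # "4 gigawatts" from the example above
per_device_w = 1200          # assumed all-in watts per accelerator, incl. overhead
devices = site_power_w / per_device_w
print(f"~{devices / 1e6:.1f} million accelerators")  # -> ~3.3 million
# At that scale, energy, not the number of GPUs, is the practical budget.
```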
Uppala: The challenge here is that when you look at applications, you have to look at the constraints. Some of our colleagues here pointed out the challenges of different compute requirements and different bandwidth and latency requirements. From an application perspective, take security cameras. A camera can have some intelligence built into it, but it’s limited in the amount of processing it can do. Let’s say we’re talking about electricity infrastructure in remote locations where fire hazards have been a big concern. In those cases, you don’t expect a lot of connectivity, and the cameras only have a certain amount of compute capability. You can put in some analytics so the camera can detect fires and send some metadata to a security operations center or the like. That is very bandwidth-constrained, and you’re looking for a specific kind of event. From a similar application perspective, if you look at analytics for safety and security, a break-in is very latency-critical. You need to make sure you have enough bandwidth to send alerts, and that’s a case where every second or millisecond really matters, given the recent example of what happened at the Louvre in Paris. The sooner you get these alerts, the faster you can address the situation. Autonomous vehicles are another example where safety is one of the key concerns, and you cannot rely on data being moved to the cloud and back. The inference that happens on the vehicle has to be extremely fast. I would take more of an application perspective and look at which applications need low latency and what kind of compute they require. Sometimes you’ll even have hybrid situations. If you put these cameras in retail locations, for example, detecting shoplifting doesn’t require a whole lot of analytics. You could do it at the edge or the endpoints, on the camera itself. But if you need more analytics, like foot traffic and heat maps, that’s not latency-critical. You can push that data to the cloud and do those analytics there. It really boils down to the application, its capabilities, and its connectivity. A lot of use cases are emerging that can leverage AI, but we still have limitations in terms of bandwidth and compute capabilities.
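The application-driven split Uppala describes can be sketched as a simple routing rule, with hypothetical event names: latency-critical detections are decided on the camera and only alerts or metadata leave the device, while heavier, non-urgent analytics are deferred to the cloud when connectivity allows.

```python
# Illustrative sketch only; event names and policy are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Event:
    kind: str                      # e.g. "fire", "intrusion", "foot_traffic"
    metadata: dict = field(default_factory=dict)

LATENCY_CRITICAL = {"fire", "intrusion"}   # must be decided on-device

def route(event: Event, cloud_reachable: bool) -> str:
    if event.kind in LATENCY_CRITICAL:
        return "edge_alert"        # run the local model, send only an alert + metadata
    if cloud_reachable:
        return "cloud_analytics"   # heat maps, foot-traffic trends, batch jobs
    return "buffer_locally"        # hold the data and upload when the link returns

print(route(Event("intrusion"), cloud_reachable=False))    # -> edge_alert
print(route(Event("foot_traffic"), cloud_reachable=True))  # -> cloud_analytics
```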
Cooper: We see a big push for inference versus training. In the cloud, there are a few big players, and it’s hard to compete. One of the reasons things are moving to the edge is that people have this technology they want to advance, and they say, ‘Oh, that’s a crowded space. Let me go look over here.’ There’s connectivity, there’s privacy, there’s latency, and there’s safety. There are security issues you could have with the cloud that can be addressed by moving to the edge. An automotive application is an example of where latency is key. If the car sees a pedestrian and you want it to talk to you using an LLM, you’re not going to have time to go to the cloud to say, ‘Oh, look out for the pedestrian.’ Also very relevant is the comment that there are already TPUs and NPUs in cell phones. There’s a whole range of devices out there that give people a target for testing their algorithms, and from there they can move to smart glasses or cars or whatever. That helps, because some hardware is already in place. Further, it’s not an either/or. Another automotive example is that maybe I’m connected most of the time, but when I’m not connected I switch to the local model, and then I go back to the cloud. It could be a hybrid mode where you go back and forth. So there are a lot of business and technical reasons for moving AI applications to the edge.
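Cooper’s hybrid mode amounts to a connectivity-aware fallback. A minimal sketch, assuming hypothetical `cloud_model` and `edge_model` callables rather than any particular runtime:

```python
# Hedged illustration of a cloud-first, edge-fallback policy; the model
# callables and timeout are placeholders, not a specific vendor API.
def infer(prompt, connected, cloud_model, edge_model, timeout_s=0.2):
    if connected:
        try:
            # Prefer the larger cloud model when the link is up and fast enough.
            return cloud_model(prompt, timeout=timeout_s)
        except TimeoutError:
            pass                   # degrade gracefully instead of failing
    # Smaller on-device model: always available, bounded latency.
    return edge_model(prompt)
```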
Sitapure: I was at Jensen Huang’s keynote [last week], and one of the big things he spoke about was physical AI, which is robotics. That’s a combination of VLMs (vision language models), specific ML and RL (reinforcement learning) models for grasping things, and so on. It’s a $10 trillion market, so there is a lot of robotics activity. [Last week], 1X launched NEO, Figure has the 03, and all this cool stuff is coming. Robotics, just in that space, has to be edge. There is no way you can do it in the cloud. What about Teslas and Waymos? That’s all edge compute. Another example is wearables, which are more intelligent now. When the AWS outage happened, people could not open their doors or run their coffee machines, because those devices all rely on the cloud. And you should not have that with a pacemaker that’s doing some analysis, where the heart stops beating because the Wi-Fi is down. Best to keep it simple.