Local LLMs are incredibly powerful tools, but it can be hard to put smaller models to good use in certain contexts. With fewer parameters, they often know less, though you can improve their capabilities with a search engine that’s accessible over MCP. As it turns out, though, you can host a 120B parameter model on a GPU with just 24GB of VRAM, paired with 64GB of regular system RAM, and it’s fast enough to be usable for voice assistants, smart home automation, and more. For reference, on 24GB of VRAM, the most practical dense model you’ll typically be able to fit will be a quantized 27 billion parameter model, accounting for the memory needed to hold the context window, too.
Specifically, the model we can use is gpt-oss-120b, the largest open-weight model from OpenAI. It’s a Mixture of Experts model with 117B total parameters, of which about 5.1B are active at a time. Paired with Whisper for quick voice-to-text transcription, we can transcribe speech, ship the transcription to our local LLM, and get a response back. With gpt-oss-120b, I manage to get about 20 tokens per second of output, which is more than good enough for a voice assistant. I’m running all of this on the 45HomeLab HL15 Beast, but any similarly specced machine will be able to do the same.
Setting up our Proxmox LXCs for llama.cpp and Whisper
Creating a base image
Assuming you’re using Proxmox, installing Whisper and llama.cpp in their own LXCs is pretty simple. Our HL15 Beast has an Nvidia RTX 4090, but you can achieve a similar result on an AMD Radeon-based machine by applying the same concepts. Make sure that you have Nvidia’s drivers installed on the host system, and disable the open-source Nouveau driver if it’s installed.
Once the host system is ready, we’ll set up and configure one "base" LXC that we’ll use for llama.cpp, then clone it for use with Whisper. First, create a basic unprivileged LXC using an Ubuntu template, giving it 30GB of storage and 16GB of RAM. Before starting it, assuming you’re using an Nvidia GPU, you’ll need to give the LXC access to your hardware. On the Proxmox host, go to /etc/pve/lxc and edit the configuration file that corresponds to your container. For example, my Whisper container’s configuration is in /etc/pve/lxc/102.conf. Add the following lines to your config:
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-caps/nvidia-cap1 dev/nvidia-caps/nvidia-cap1 none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-caps/nvidia-cap2 dev/nvidia-caps/nvidia-cap2 none bind,optional,create=file
The first two lines give the LXC read, write, and mknod (device node) permissions for the devices with major number 226 and minor numbers 0 and 128. You can verify that these numbers match your hardware by running "ls -la" against the relevant device nodes on the host. The remaining lines bind-mount the Nvidia device paths from the host into the LXC, surfacing the paths that the Nvidia drivers inside the container will expect.
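For instance, on the host you can list the device nodes and read the major and minor numbers straight out of the output (the exact devices, groups, and numbers will differ from system to system):
# The two comma-separated numbers before the date are the major and minor numbers
ls -la /dev/dri
# crw-rw---- 1 root video  226,   0 ... card0
# crw-rw---- 1 root render 226, 128 ... renderD128
ls -la /dev/nvidia*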
You can now save the file and exit it.
After starting the container, download the Nvidia drivers that match the version used on the host. In my case, the driver I downloaded was "NVIDIA-Linux-x86_64-580.105.08.run," though check for the latest version first.
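To check which driver version the host is running and grab the matching installer inside the LXC, something like the following works; the download URL follows Nvidia's usual pattern, and the version shown is just the one I used, so adjust it to match your host:
# On the Proxmox host: print the installed driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Inside the LXC: download the matching .run installer (version is an example)
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/580.105.08/NVIDIA-Linux-x86_64-580.105.08.run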
Now, run the following commands:
chmod +x NVIDIA-Linux-x86_64-580.105.08.run
./NVIDIA-Linux-x86_64-580.105.08.run --no-kernel-module
apt install nvidia-cuda-toolkit
The above commands flag the Nvidia driver installer as executable, install it without its kernel module, then install the Nvidia CUDA toolkit. LXCs share the host’s kernel and can’t load kernel modules of their own, which is why we installed the drivers on the host: installing the driver inside the LXC this way gives the container the user-space libraries it needs, while actual hardware access goes through the kernel module already loaded on the host.
The device access we granted in the LXC configuration and the drivers we’ve just installed come together to give the LXC full use of the GPU, without handing it exclusive control the way GPU passthrough to a VM typically would. If the "nvidia-smi" command succeeds in the LXC, you’re done. You can now stop the LXC, clone it with a different name, and move on to configuring both llama.cpp and Whisper.
You can also optionally create a third clone to use as a base LXC with working Nvidia GPU access, so that you can spend less time setting up hardware access if you want to deploy other software in the future.
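If you prefer doing the cloning from the Proxmox host shell rather than the web UI, it looks roughly like this (the container IDs and hostnames here are examples; substitute your own):
# Stop the base container, then make full clones for Whisper and an optional spare base
pct stop 101
pct clone 101 102 --hostname whisper --full
pct clone 101 103 --hostname gpu-base --full
Because cloning copies the container configuration, the clones inherit the GPU device entries we added earlier and get working GPU access out of the box.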
Configuring llama.cpp and Whisper
Setting up our LXCs
Firstly, remember how we only gave our LXC 30GB of storage and 16GB of RAM? That’s fine for Whisper, but we’ll want to increase those limits for our llama.cpp container. These models are huge, with gpt-oss-120b taking up approximately 60GB of storage. Given that you’ll be offloading much of it to system RAM as well, you’ll need an ample amount to assign to your LXC. Ideally, the entire model should fit between your VRAM and system RAM, as otherwise, the model will need to frequently access your storage, which is much, much slower than even system RAM.
I’ve given my llama.cpp LXC 62.5GB of RAM, but you could theoretically assign 40GB and still fit the entire model across your GPU and system RAM. That’s because the different variants are all around 60GB to 65GB, and roughly 18GB to 24GB of that (depending on context and cache settings) sits in GPU VRAM, assuming we leave room for both a small Whisper model and the llama.cpp context window.
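If you’d rather resize the llama.cpp container from the host shell instead of the web UI, something along these lines works (the container ID and sizes are examples; memory is specified in megabytes):
# Bump the llama.cpp container to roughly 62.5GB of RAM and grow its root disk
pct set 101 --memory 64000
pct resize 101 rootfs +70G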
Installing llama.cpp is pretty easy, and you just need to follow the official instructions to get it up and running. For an Nvidia GPU, the build boils down to cloning the repository and compiling with CUDA enabled, roughly as sketched below; defer to the official instructions if the steps have changed.
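# Clone and build llama.cpp with CUDA support (sketch; build flags may change upstream)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
Once the build finishes, navigate to the folder the binaries were output to (typically build/bin), and run the following command: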
./llama-server -hf unsloth/gpt-oss-120b-GGUF:Q4_K_S -ot ".ffn_.*_exps.=CPU" --n-gpu-layers 999 --host 0.0.0.0 -c 0
Alternatively, run this command to keep only the expert tensors of a set number of layers in system RAM (the --n-cpu-moe flag controls how many layers’ experts stay on the CPU):
./llama-server -hf unsloth/gpt-oss-120b-GGUF:Q4_K_S --n-cpu-moe 36 --n-gpu-layers 999 --host 0.0.0.0 -c 0
This downloads the Unsloth Q4_K_S quantized model from Hugging Face, keeps all expert layers on the CPU in system RAM, loads the non-expert layers into VRAM, and starts llama.cpp’s server on 0.0.0.0 so that it’s available on all interfaces. Setting "-c 0" uses the model’s default context length, which is 131,072 for gpt-oss-120b. That context length is far too large for a GPU with 24GB of VRAM, especially with a larger model whose KV cache takes more memory per token, but you won’t run into problems or oddities until you start to have a long conversation.
If you find that the context window needs to be reduced for the model to be usable, you can set the "-c" value to 16,384 (16K) or 32,768 (32K), and if you pair that with KV cache quantization, you can get good results with minimal quality loss, though I wouldn’t recommend it for outputs that require precision. To use a 4-bit quantized KV cache, add "--cache-type-k q4_0 --cache-type-v q4_0" to the above command.
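Putting that together, a 32K context with a 4-bit quantized KV cache, keeping the rest of the flags from the first command, would look like this:
./llama-server -hf unsloth/gpt-oss-120b-GGUF:Q4_K_S -ot ".ffn_.*_exps.=CPU" --n-gpu-layers 999 --host 0.0.0.0 -c 32768 --cache-type-k q4_0 --cache-type-v q4_0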
Once you’ve decided on your model parameters, type "ip a" in the LXC terminal to print the container’s IP address. Then start your model, and once it’s loaded, navigate to that IP address at port 8080. In my case, that’s 192.168.1.198:8080. You should see llama.cpp’s built-in web UI, and you can chat with the model in your browser to test it out.
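You can also test the server from any other machine on your network with a quick request to its OpenAI-compatible endpoint (the IP address is mine, so substitute your own):
curl http://192.168.1.198:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "gpt-oss-120b", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'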
Whisper is a lot simpler: just follow the whisper.cpp instructions on GitHub to get it set up in your other LXC. Any of the models should work fine for most home uses, but if your home is often noisy, or you don’t want to use a voice assistant in English, you may need one of the larger transcription models. Finally, run it from the build/bin folder with the following command:
./whisper-server --model /opt/whisper.cpp/models/ggml-large-v3.bin --host 0.0.0.0 --port 5005 --inference-path "/v1/audio/transcriptions" --convert --print-progress
Make sure you replace the model parameter with your downloaded model.
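To confirm the endpoint works before wiring it into Home Assistant, you can post a short audio clip to it with curl (the IP, port, and file path here are examples and should match your own setup):
# Send a test clip to the transcription endpoint; the response contains the transcribed text
curl http://192.168.1.204:5005/v1/audio/transcriptions -F file="@/path/to/test-clip.wav"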
Connecting it all to Home Assistant
The final step
This assumes that you have the Home Assistant Community Store, known as HACS, installed on your Home Assistant instance. You’ll need to install the "Local LLMs" community integration, as it will allow us to connect to our llama.cpp server and use it in a voice assistant pipeline. The llama.cpp server hosts an OpenAI-compatible API endpoint, which the Local LLMs integration can use. Just enter your server IP here, and it should be connected.
Next, we’ll need to add Whisper. This requires another HACS integration, "OpenAI Whisper Cloud," which, despite the name, also supports adding a custom Whisper instance with your own endpoint. These are the values I configured for each parameter:
- Name: Any name
- Url: http://192.168.1.204:5005/v1/audio/transcriptions
- Model: /opt/whisper.cpp/models/ggml-large-v3.bin
Keep in mind that these values will likely be different for your installation.
Finally, we need to build the voice pipeline to join both of these services. In Home Assistant, go to Settings, Voice assistants, and then click Add assistant. Choose your local LLM as the conversation agent, your custom Whisper model as your speech-to-text agent, and optionally, select a custom text-to-speech agent as well.
With that, you’re done! Save it, click the overflow menu, and click Start conversation. If you have HTTPS access configured, you can speak to your voice model out loud by clicking the microphone button. Otherwise, just type your prompt.
If you’re wondering why this is so fast, the explanation is pretty interesting, and is thanks to a somewhat unique property of gpt-oss-120b.
How is a 120B model so fast?
It’s all thanks to a unique trait of the gpt-oss models (and Mixtral)
At first glance, those numbers don’t seem like they should add up. A 117B parameter model sounds firmly out of reach for consumer hardware, and in a dense configuration it absolutely would be. The reason gpt-oss-120b works so well here is that it’s not behaving like a 120B model at inference time.
The most important detail to remember is that it’s a Mixture of Experts model in the strictest sense. While the full model contains 117B parameters, only around 5.1B of them are active for any given token. Each token is routed to a small subset of experts, and only those experts do any real work. The rest of the model may as well not exist for that step. From a raw compute perspective, that puts gpt-oss-120b much closer to a mid-sized dense model than anything in the 70B to 120B class, even though its knowledge and training footprint are far larger.
That sparsity alone isn’t enough to explain the performance, though. The real trick is how those experts are deployed in llama.cpp. In this configuration, the expert layers are explicitly kept on the CPU, resident in system RAM, while the GPU handles attention, the always-on dense layers, and the KV cache. When a token is routed to a particular expert, the GPU sends a relatively small activation tensor to the CPU, the expert computation is performed there using weights already in memory, and the resulting activation is sent back to the GPU to continue the forward pass. Because the number of active experts per token is small and predictable, the amount of data exchanged between CPU and GPU per step remains tightly bounded. Instead of shuffling large expert weight matrices across the PCIe bus, the system only exchanges compact intermediate activations. This keeps latency low and avoids turning system RAM access into the bottleneck, even at interactive token generation speeds.
This is where precision also plays an important role. In this setup, the expert weights are heavily quantized, which dramatically reduces their size. Smaller weights mean less pressure on memory bandwidth and faster transfers when those experts are needed. It doesn’t change how much computation the model performs, but it takes a lot of bandwidth-heavy traffic off the data path, helping the GPU stay busy instead of waiting on memory.
Another reason gpt-oss-120b feels unusually fast is that its MoE design is built to be practical rather than to maximize performance in a more "ideal" hardware configuration. Not every layer is sparse, and routing isn’t happening on every tiny operation. Attention remains both dense and predictable, which matters because attention and KV cache access are often the real bottlenecks once token generation speeds climb into the double digits. By keeping the latency-critical parts of the model simple and pushing sparsity into the feed-forward layers (implemented as Gated Linear Units, or SwiGLU/GEGLU), the model avoids many of the overheads that slow down other MoE architectures.
Put all of that together and the result is a model that looks far too large to run on paper, but behaves very differently in practice. At the moment, only a handful of MoE models actually benefit from this CPU and GPU split in llama.cpp. Mixtral 8x7B and the gpt-oss family are standouts in this regard, because their active parameter counts stay low enough that system RAM execution of experts doesn’t become a latency bottleneck. Many larger MoE models technically work, but lose the latency advantages that make this approach practical for real-time use.
In other words, gpt-oss-120b works because you’re not asking a 120B dense model to do work on every token. You’re asking a small, carefully chosen slice of a much larger model to respond, while the rest stays out of the way. That’s why, on a 24GB VRAM GPU with enough system RAM, gpt-oss-120b can hit token speeds that would normally be reserved for models a fraction of its advertised size... and why it suddenly makes sense as a backbone for local voice assistants, automation, and other real-time workloads.