Right now I’m experimenting quite a bit with generative AI models, and it’s quickly getting expensive. I use the pay-as-you-go model a lot, and the credits are costing me a small fortune.
My dream would be to host small LLMs at home, but the hardware is generally not ready yet. I imagine that in a few years TPUs will be everywhere. So in the meantime, I decided to experiment with my gaming PC: a tower running Windows with an AMD RX 9070 XT graphics card and an AMD Ryzen 7 9700X processor. With this configuration, we should be able to do some stuff 👀
## First attempts
For the drivers, no issues to report: I simply used the latest Windows drivers provided by AMD via its AMD Adrenalin software. Just in case, I also installed AMD’s HIP SDK to take advantage of the ROCm platform. It doesn’t seem to have been of much use in the end, but I’m keeping it around.

I started my experiments with Ollama. The software is simple, and recent versions even ship with a nice little GUI. However, Ollama doesn’t seem to support my graphics card, so models run on the CPU only. In their 2024 article they mention preliminary support for AMD GPUs on Windows and Linux, but the list of supported cards remains ridiculously short…
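Out of curiosity, you can ask the local Ollama daemon where a loaded model actually lives. Here is a minimal sketch, assuming Ollama’s default API on port 11434 and its `/api/ps` endpoint; relying on the `size_vram` field to tell CPU from GPU offload is my assumption:

```python
import json
from urllib.request import urlopen

# Ask the local Ollama daemon which models are currently loaded.
# Assumes Ollama's default listen address (http://localhost:11434).
with urlopen("http://localhost:11434/api/ps") as resp:
    data = json.load(resp)

for model in data.get("models", []):
    size = model.get("size", 0)            # total memory used by the model
    size_vram = model.get("size_vram", 0)  # portion resident in GPU memory
    if size_vram == 0:
        where = "CPU only"
    elif size_vram >= size:
        where = "fully on GPU"
    else:
        where = f"partially offloaded ({size_vram}/{size} bytes in VRAM)"
    print(f"{model.get('name')}: {where}")
```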

## Jan enters the scene
Disappointed by my inconclusive experiments with Ollama, I went looking for alternatives. I had heard a lot about LM Studio: the application works well, but the fact that it is proprietary quickly put me off.
So I went with Jan, its open source counterpart.
Unlike Ollama, which uses its own backend to run models, Jan is based on llama.cpp. By default the application installs the latest CPU build. The trick is therefore to make it install the build compiled with Vulkan API support.
Since the llama.cpp project publishes new releases every few hours, I downloaded the latest one available, b7356.
Be careful to download the Windows x64 (Vulkan) build.
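If you prefer to script that download, here is a small sketch against the GitHub releases API; the repository path and the way I match the Windows Vulkan asset name are assumptions to double-check against the actual release page:

```python
import json
import urllib.request

# Fetch metadata for the latest llama.cpp release from the GitHub API.
api_url = "https://api.github.com/repos/ggml-org/llama.cpp/releases/latest"
with urllib.request.urlopen(api_url) as resp:
    release = json.load(resp)

print(f"Latest release: {release['tag_name']}")

# Look for the Windows x64 Vulkan archive among the release assets.
# Matching on "win" + "vulkan" is an assumption; adjust if the naming changes.
for asset in release["assets"]:
    name = asset["name"]
    if "vulkan" in name and "win" in name and name.endswith(".zip"):
        print(f"Downloading {name}…")
        urllib.request.urlretrieve(asset["browser_download_url"], name)
        break
else:
    print("No Windows Vulkan asset found, check the release page manually.")
```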
Then we import the backend by selecting “Install Backend from File”. No need to decompress the archive, Jan takes care of everything.

And finally, we select the right backend under Vulkan.
## Server mode
Running a model on a single machine is fun for two minutes, but I’d like to be able to access it remotely, and maybe even open it up to people close to me.
To guarantee encrypted communications that pass through firewalls without hassle, I’m very fond of mesh VPN solutions. I went with my favorite, NetBird. One advantage for my use case is that I already have the NetBird client installed on all my devices to access them remotely. The only small change left was the filtering between my nodes: I created a policy authorizing some of my devices to reach my Jan instance on port 1337.

Finally, just launch Jan in server mode from the application settings.
Be careful to change the listening address to 0.0.0.0 and provide a bearer token.
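To verify that the server is reachable and that the token is actually enforced, a quick probe of the OpenAI-compatible /v1/models endpoint does the job. A minimal sketch, assuming the default port 1337 and a placeholder MY_TOKEN for the bearer token:

```python
import json
import urllib.request
from urllib.error import HTTPError

BASE_URL = "http://localhost:1337/v1"  # or the machine's NetBird address
TOKEN = "MY_TOKEN"                     # placeholder: the bearer token set in Jan

def list_models(with_token: bool) -> None:
    req = urllib.request.Request(f"{BASE_URL}/models")
    if with_token:
        req.add_header("Authorization", f"Bearer {TOKEN}")
    try:
        with urllib.request.urlopen(req) as resp:
            models = json.load(resp)
            print([m["id"] for m in models.get("data", [])])
    except HTTPError as err:
        print(f"Request rejected: HTTP {err.code}")

list_models(with_token=False)  # should be rejected
list_models(with_token=True)   # should list the models available in Jan
```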

## Remote access
Everything is now in place. To test it, I use another instance of Jan running on a laptop: I add a provider compatible with the OpenAI API and point it at my server.

And voilà, we now have our own server exposing an OpenAI-compatible API on our network 🥳
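And since the API is OpenAI-compatible, any OpenAI client can talk to it, not just Jan. A small sketch with the official openai Python package; the NetBird address, the token and the model name are placeholders to adapt:

```python
from openai import OpenAI

# Point the official OpenAI client at the Jan server instead of api.openai.com.
# 100.x.y.z is a placeholder for the machine's NetBird address.
client = OpenAI(
    base_url="http://100.x.y.z:1337/v1",
    api_key="MY_TOKEN",  # the bearer token configured in Jan
)

response = client.chat.completions.create(
    model="my-local-model",  # placeholder: use a model loaded in Jan
    messages=[{"role": "user", "content": "Hello from the other side of the mesh!"}],
)
print(response.choices[0].message.content)
```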

With this setup I get about 50 tokens per second on average with the latest Ministral 3 14B model 🪄
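For the curious, that figure is easy to reproduce with a streamed request: count the chunks received and divide by the elapsed time. A rough sketch (counting one token per chunk is an approximation, and the address, token and model name are placeholders):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://100.x.y.z:1337/v1", api_key="MY_TOKEN")

start = time.monotonic()
chunks = 0

# Stream the answer and count chunks as a rough proxy for generated tokens.
stream = client.chat.completions.create(
    model="my-local-model",  # placeholder: use a model loaded in Jan
    messages=[{"role": "user", "content": "Write a short poem about GPUs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1

elapsed = time.monotonic() - start
print(f"~{chunks / elapsed:.1f} tokens/s over {elapsed:.1f}s")
```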