Right now I’m experimenting quite a bit with generative AI models, and it’s quickly getting expensive. I use the pay-as-you-go model a lot, and the credits are costing me a small fortune.
My dream would be to host small LLMs at home, but the hardware is generally not ready yet. I imagine that in a few years TPUs will be everywhere. So in the meantime, I decided to experiment with my gaming PC: a tower running Windows with an AMD RX 9070 XT graphics card and an AMD Ryzen 7 9700X processor. With this configuration, we should be able to do some stuff 👀
## First attempts
For the drivers, no issues to report: I simply used the latest Windows drivers provided by AMD via its AMD Adrenalin software. Just in case, I also installed AMD’s HIP SDK to take advantage of the ROCm platform. It doesn’t seem to have been of much use in the end, but I’m keeping it around.

I started my experiments with Ollama. The software is simple, and recent versions even ship with a nice little GUI. However, Ollama doesn’t seem to support my graphics card, so models run on the CPU only. In their 2024 article they mention preliminary support for AMD GPUs on Windows and Linux, but the list of supported cards remains ridiculously short…
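Out of curiosity, you can ask the local Ollama daemon where a loaded model actually lives. Here is a minimal sketch, assuming Ollama’s default API on port 11434 and its `/api/ps` endpoint; relying on the `size_vram` field to tell CPU from GPU offload is my assumption:

```python
import json
from urllib.request import urlopen

# Ask the local Ollama daemon which models are currently loaded.
# Assumes Ollama's default listen address (http://localhost:11434).
with urlopen("http://localhost:11434/api/ps") as resp:
    data = json.load(resp)

for model in data.get("models", []):
    size = model.get("size", 0)            # total memory used by the model
    size_vram = model.get("size_vram", 0)  # portion resident in GPU memory
    if size_vram == 0:
        where = "CPU only"
    elif size_vram >= size:
        where = "fully on GPU"
    else:
        where = f"partially offloaded ({size_vram}/{size} bytes in VRAM)"
    print(f"{model.get('name')}: {where}")
```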

## Jan enters the scene
Disappointed by my inconclusive experiments with Ollama, I went looking for alternatives. I had heard a lot about LM Studio: the application works well, but the fact that it is proprietary quickly put me off.
So I went with Jan, its open source counterpart.
Unlike Ollama, which uses its own backend to run models, Jan is based on llama.cpp. By default the application installs the latest CPU build. The trick is therefore to make it install the build compiled with Vulkan API support.
Since the llama.cpp project publishes new releases every few hours, I downloaded the latest one available, b7356.
Be careful to download the Windows x64 (Vulkan) build.
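If you prefer to script that download, here is a small sketch against the GitHub releases API; the repository path and the way I match the Windows Vulkan asset name are assumptions to double-check against the actual release page:

```python
import json
import urllib.request

# Fetch metadata for the latest llama.cpp release from the GitHub API.
api_url = "https://api.github.com/repos/ggml-org/llama.cpp/releases/latest"
with urllib.request.urlopen(api_url) as resp:
    release = json.load(resp)

print(f"Latest release: {release['tag_name']}")

# Look for the Windows x64 Vulkan archive among the release assets.
# Matching on "win" + "vulkan" is an assumption; adjust if the naming changes.
for asset in release["assets"]:
    name = asset["name"]
    if "vulkan" in name and "win" in name and name.endswith(".zip"):
        print(f"Downloading {name}…")
        urllib.request.urlretrieve(asset["browser_download_url"], name)
        break
else:
    print("No Windows Vulkan asset found, check the release page manually.")
```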
Then we import the backend by selecting “Install Backend from File”. No need to decompress the archive, Jan takes care of everything.

And finally, we select the right backend under Vulkan.
## Server mode
Running a model on a single machine is fun for two minutes, but I’d like to be able to access it remotely, and maybe even open it up to people close to me.
To guarantee encrypted communications that pass through firewalls without hassle, I’m very fond of mesh VPN solutions. I went with my favorite, NetBird. One advantage for my use case is that I already have the NetBird client installed on all my devices to access them remotely. The only small change left was the filtering between my nodes: I created a policy authorizing some of my devices to reach my Jan instance on port 1337.

Finally, just launch Jan in server mode from the application settings.
Be careful to change the listening address to 0.0.0.0 and provide a bearer token.
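To verify that the server is reachable and that the token is actually enforced, a quick probe of the OpenAI-compatible /v1/models endpoint does the job. A minimal sketch, assuming the default port 1337 and a placeholder MY_TOKEN for the bearer token:

```python
import json
import urllib.request
from urllib.error import HTTPError

BASE_URL = "http://localhost:1337/v1"  # or the machine's NetBird address
TOKEN = "MY_TOKEN"                     # placeholder: the bearer token set in Jan

def list_models(with_token: bool) -> None:
    req = urllib.request.Request(f"{BASE_URL}/models")
    if with_token:
        req.add_header("Authorization", f"Bearer {TOKEN}")
    try:
        with urllib.request.urlopen(req) as resp:
            models = json.load(resp)
            print([m["id"] for m in models.get("data", [])])
    except HTTPError as err:
        print(f"Request rejected: HTTP {err.code}")

list_models(with_token=False)  # should be rejected
list_models(with_token=True)   # should list the models available in Jan
```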

## Remote access
Everything is now in place. To test it, I use another instance of Jan running on a laptop: I add a provider compatible with the OpenAI API and point it at my server.

And voilà, we now have our own server exposing an OpenAI-compatible API on our network 🥳
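And since the API is OpenAI-compatible, any OpenAI client can talk to it, not just Jan. A small sketch with the official openai Python package; the NetBird address, the token and the model name are placeholders to adapt:

```python
from openai import OpenAI

# Point the official OpenAI client at the Jan server instead of api.openai.com.
# 100.x.y.z is a placeholder for the machine's NetBird address.
client = OpenAI(
    base_url="http://100.x.y.z:1337/v1",
    api_key="MY_TOKEN",  # the bearer token configured in Jan
)

response = client.chat.completions.create(
    model="my-local-model",  # placeholder: use a model loaded in Jan
    messages=[{"role": "user", "content": "Hello from the other side of the mesh!"}],
)
print(response.choices[0].message.content)
```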

With this setup I get about 50 tokens per second on average with the latest Ministral 3 14B model 🪄
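For the curious, that figure is easy to reproduce with a streamed request: count the chunks received and divide by the elapsed time. A rough sketch (counting one token per chunk is an approximation, and the address, token and model name are placeholders):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://100.x.y.z:1337/v1", api_key="MY_TOKEN")

start = time.monotonic()
chunks = 0

# Stream the answer and count chunks as a rough proxy for generated tokens.
stream = client.chat.completions.create(
    model="my-local-model",  # placeholder: use a model loaded in Jan
    messages=[{"role": "user", "content": "Write a short poem about GPUs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1

elapsed = time.monotonic() - start
print(f"~{chunks / elapsed:.1f} tokens/s over {elapsed:.1f}s")
```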