With all the local LLMs available by now, you might get curious about the best you can run locally and how it compares against what you can get from free-tier inference providers. And the first question you’ll have is: which model do I use?
I set out to answer those questions for myself. Here is what I learned from this journey.
Goals and hardware
My use case is agentic coding. Specifically, KiloCode. That’s pretty important because, broadly speaking, there are two main use cases for LLMs, and their requirements pull in opposite directions:
Creative writing/roleplay: you want the model to be creative - to be able to tell an interesting and unexpected story, rather than sticking to what you say.
Agentic coding (or other agentic use): you want the model to do exactly what you say - the less "creativity", the better.
Of course, you can control this with temperature, but generally some models are better suited for creative writing and others for following instructions.
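As a rough illustration of what that knob looks like in practice (a minimal sketch; the endpoint URL and model alias are placeholders for whatever your local server exposes, not anything KiloCode actually sends), the same OpenAI-compatible request can be tuned either way:

```python
# Minimal sketch: one OpenAI-compatible endpoint, two temperature profiles.
# URL and model alias are placeholders for a local llama.cpp / llama-swap setup.
import requests

LLAMA_SERVER = "http://localhost:8080/v1/chat/completions"  # assumption: local server here

def ask(prompt: str, creative: bool) -> str:
    payload = {
        "model": "qwen3-coder-30b",          # placeholder model alias
        "messages": [{"role": "user", "content": prompt}],
        # Creative writing benefits from a higher temperature; agentic/tool use
        # is usually run close to greedy decoding so the model sticks to instructions.
        "temperature": 0.9 if creative else 0.1,
    }
    response = requests.post(LLAMA_SERVER, json=payload, timeout=600)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```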
I’m running those models on 64 GB of RAM and two RX 6800s with 16 GB of VRAM each, for 32 GB of VRAM in total. It’s not as fast as the latest NVIDIA graphics cards, but on the other hand, I can fit quite a lot into VRAM, and that makes a much more noticeable difference than raw GPU speed.
So, what are the most popular models for agentic coding, and what experience did I have with them?
The Journey
At first, I tried Qwen3-Coder-30B-A3B-Instruct - the obvious choice when you look for “the latest model” and “for coding”. First of all, it was pretty slow. The first request from KiloCode is around 10k tokens, and processing that much input takes about a minute. The prompt then gets cached, so subsequent requests are faster, but if you switch models - or if your model crashes and you have to restart it, which does happen pretty often - you’ll have to wait a minute again. And the more context you’ve already filled, the slower it gets, so you’re looking at similarly long waits after each request. What’s more, it very often couldn’t even use tools correctly, so working with it was mostly fighting the model rather than the model helping me. I tried running it at Q6 and Q8, but that didn’t make it much better than Q4_K_XL.
Then I tried something similar: Qwen3-30B-A3B-Instruct-2507. It’s about as fast as the previous one, but this time it could use tools more consistently. It did pretty well at creating a basic project skeleton to begin iterating on, but it was pretty bad at finding and fixing bugs. I asked it to add a few features, and it succeeded after a few attempts, but then the program could no longer run due to a (pretty obvious) null pointer dereferencing bug, and the model could not figure it out, no matter how many times I tried. Instead, it tried to fix nonexistent issues and messed up the code more and more.
This bug was pretty obvious and easy to fix, yet it gave the models a lot of trouble, so I decided to use it as an exam: is there any model that can fix this? And there was: gpt-oss-120b, which I managed to run now that I had gained enough experience. And it was FAST! But it is famous for being somewhat “derpy”, occasionally failing to use tools, and thinking a lot. It did figure out the bug, but instead of fixing it properly, it slapped a proxy function in between to substitute for the null pointer. And as my project progressed further, most tasks became too complex even for gpt-oss-120b to handle. And it is still much slower than what you can get from a free-tier cloud inference provider.
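To give a flavor of the pattern (a made-up Python analogue, not the actual project code - every name and value here is invented): the proper fix addresses the root cause before the dereference, while the model’s “fix” wraps the access in a proxy that quietly substitutes something whenever it sees a null.

```python
# Hypothetical illustration only - the real project and its bug looked different.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Config:
    timeout: int = 30

def load_config() -> Config:
    return Config(timeout=60)   # stand-in for actually loading the missing config

# The proper fix: handle the root cause (the config was never loaded).
def get_timeout_fixed(config: Optional[Config]) -> int:
    if config is None:
        config = load_config()
    return config.timeout

# The model's approach: a proxy function that substitutes a default whenever the
# pointer is null, hiding the root cause instead of fixing it.
def get_timeout_proxy(config: Optional[Config]) -> int:
    return config.timeout if config is not None else Config().timeout
```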
So I had hit the ceiling in terms of smartness, but there was still room for improvement in terms of speed. Then I suddenly realized that ROCm v7 is WAY WAY faster than ROCm v6, and even faster than Vulkan, especially at prompt processing - exactly what I needed most! So I migrated from Vulkan to ROCm v7. Also, MXFP4 models really are (a little) faster, even on AMD graphics cards.
It was at this point that the first REAP models appeared. And this made a lot of difference - but obviously only in speed, not “smartness”. Qwen3 and Qwen3-Coder got much faster, even though they already fit completely in my VRAM before. gpt-oss (now 58b) got very, very fast - at least compared to everything I’d seen before. With all that, the 10k prompt that initially took about a minute now took only about 10 seconds!
REAP also made it possible to try GLM-4.5-Air, which wouldn’t even fit in my RAM before. At first, it did not work at all - it was only outputting “???????”. The solution was to disable its thinking. But either way, it is SLOW, even with all those improvements - as in, 5-10 minutes to ingest that initial 10k prompt slow. It’s obviously completely unusable.
Tools
I’ve learned a lot while doing this research. I tried ollama, then llama.cpp, then llama-swap. When I saw that some models had trouble with tool calls, I noticed there is no way in KiloCode (and, I assume, most other similar software) to see the raw model output, so I wanted a tool for that. I also wanted a tool to benchmark the models’ performance in a consistent way that would be easy to compare. Apparently no established benchmark exists; everyone simply writes their own. I’ve seen multiple benchmarks like this but wanted something more feature-complete. I also looked at llama-bench, but it turns out it’s not very representative or consistent, since it uses literally random tokens as input.
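To make that concrete, what I wanted is roughly the following (a toy Python sketch under assumptions, not the actual tool described below; the URL, model alias, and input file are placeholders): send a realistic, fixed prompt to the local OpenAI-compatible endpoint and split the wall-clock time into prompt processing (time to first streamed token) and generation speed.

```python
# Toy sketch: time a realistic prompt against a local OpenAI-compatible server and
# report prompt-processing time and generation speed separately.
import time
import requests

URL = "http://localhost:8080/v1/chat/completions"   # assumption: llama.cpp / llama-swap here

def bench(model: str, prompt: str) -> dict:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "temperature": 0.0,
    }
    start = time.monotonic()
    first_token_at = None
    chunks = 0
    with requests.post(URL, json=payload, stream=True, timeout=3600) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            if first_token_at is None:
                first_token_at = time.monotonic()   # prompt processing ends here
            chunks += 1                             # roughly one token per streamed chunk
    end = time.monotonic()
    return {
        "model": model,
        "prompt_processing_s": round(first_token_at - start, 1),
        "generation_tok_s": round(chunks / (end - first_token_at), 1),
    }

if __name__ == "__main__":
    # A realistic prompt: an actual source file, not random tokens.
    text = open("example_input.py").read()          # placeholder input file
    print(bench("gpt-oss-120b", "Summarize this file:\n" + text))
```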
At the same time, I needed something to write - some simple project to use as an example, simple enough for local models to handle. And then it turned a little recursive: why not write that toolkit myself? And that’s what I did. It has two tools. One is a benchmark that can test multiple models in multiple configurations at once, with up to 100k context, and output the results in a table you can easily save as a plain text file. The other is a proxy that dumps raw model output in real time.
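For illustration, the core of the raw-output idea can be sketched in a few lines (a simplified stand-in, not the actual tool; the ports and upstream URL are placeholders): accept the request the coding agent sends, forward it to the real server, and print every streamed byte to the terminal before passing it back unchanged.

```python
# Toy sketch of a "dump raw model output" proxy: point the client at this port,
# forward everything to the real server, and print the streamed bytes as they arrive.
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
import requests

UPSTREAM = "http://localhost:8080"   # assumption: the real llama.cpp / llama-swap server

class DumpProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        upstream = requests.post(UPSTREAM + self.path, data=body,
                                 headers={"Content-Type": "application/json"},
                                 stream=True, timeout=3600)
        self.send_response(upstream.status_code)
        self.send_header("Content-Type",
                         upstream.headers.get("Content-Type", "application/json"))
        self.end_headers()
        for chunk in upstream.iter_content(chunk_size=None):
            # Show the raw model output in real time, then pass it through untouched.
            print(chunk.decode("utf-8", errors="replace"), end="", flush=True)
            self.wfile.write(chunk)
            self.wfile.flush()

if __name__ == "__main__":
    ThreadingHTTPServer(("localhost", 8081), DumpProxy).serve_forever()
```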
Thanks to the KiloCode Discord community for helping me a lot every step of the way!
Conclusion
- "Small" models - ones you can fit completely in 32 GB VRAM at Q4 - are basically completely impotent.
- "Big guns" models - ones you can run on 64GB RAM (but very little VRAM as it’s only used for the KV cache!) - can only help with pretty basic things, and are still way slower than free-tier inference providers.
- Full-sized models running in the cloud - Qwen3-235b-A22b-Instruct-2507, Qwen3-Coder-480b, GLM-4.5-Air - can do way better. And way faster. They are actually usable and helpful. And when they are not enough, asking ChatGPT the old-fashioned way seems to be the best you can get.
With MoE and REAP, the state of local LLMs has advanced profoundly. Let’s hope that in the near future, we’ll get more technologies like that. Once we can run something comparable to today’s full-size models locally at a reasonable speed, I’d call that the "day of the LLM desktop".