Hello good people of r/LocalLLaMA
I’m building an agent that can call my app’s APIs (exposed as tools) and run automated test cases. Running everything on a CPU-only machine (8GB RAM) with LM Studio hosting Qwen 3 4B Instruct (Q4_K_M / Q8). I talk to it from a C# client using the OpenAI API format.
Performance is slow (1–2 tok/sec) but fine for tool calling, I’m surprised it even works :)
But I noticed something: after the first turn, the LLM responds noticeably faster.
Did some reading, found out this is probably the KV cache, which from what little I understand:
Is a processed prefix (system prompt + tool schemas + history) that the model keeps, so it doesn’t re-do all the attention work every turn.
BUT it only works if we stay in one continuous chat thread.
If I start a new chat, change the system prompt, change tool definitions, or otherwise rebuild the prefix, the KV cache gets wiped and the model has to re-ingest everything again.
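If I’ve got that right, the cache-friendly pattern on my C# side would be something like this: keep the system prompt + tool schemas byte-identical every turn and only ever append messages. Rough, untested sketch; the endpoint URL and model id are just what LM Studio uses on my machine, so treat those as assumptions:

```csharp
// Sketch: keep the expensive prefix (system prompt + tool schemas) byte-identical
// across turns so the server's prompt cache can reuse it. Only ever append messages.
// Assumes LM Studio's OpenAI-compatible server at http://localhost:1234/v1 and an
// assumed model id "qwen3-4b-instruct" -- adjust for your setup.
using System.Net.Http;
using System.Text;
using System.Text.Json.Nodes;
using System.Threading.Tasks;

class StablePrefixChat
{
    static readonly HttpClient Http = new();

    // Built once at startup and never regenerated, so the serialized prefix never changes.
    readonly JsonArray _toolSchemas;
    readonly JsonArray _messages;

    public StablePrefixChat(string systemPrompt, JsonArray toolSchemas)
    {
        _toolSchemas = toolSchemas;
        _messages = new JsonArray
        {
            new JsonObject { ["role"] = "system", ["content"] = systemPrompt }
        };
    }

    public async Task<string> SendAsync(string userMessage)
    {
        // Append only; never reorder or rewrite earlier messages, or the cached prefix is lost.
        _messages.Add(new JsonObject { ["role"] = "user", ["content"] = userMessage });

        var body = new JsonObject
        {
            ["model"] = "qwen3-4b-instruct",                          // assumed model id
            ["messages"] = JsonNode.Parse(_messages.ToJsonString()),  // clone: a node can only have one parent
            ["tools"] = JsonNode.Parse(_toolSchemas.ToJsonString()),
            ["temperature"] = 0.2
        };

        var resp = await Http.PostAsync(
            "http://localhost:1234/v1/chat/completions",
            new StringContent(body.ToJsonString(), Encoding.UTF8, "application/json"));
        resp.EnsureSuccessStatusCode();

        var json = JsonNode.Parse(await resp.Content.ReadAsStringAsync())!;
        var reply = json["choices"]![0]!["message"]!;
        _messages.Add(JsonNode.Parse(reply.ToJsonString())!); // keep the assistant turn in history
        return reply["content"]?.GetValue<string>() ?? "";
    }
}
```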
Here’s why I’m confused
In my current agent design flow I:
Often clone the main chat whenever needed and run quick “side” prompts (asking the model to validate something, check a condition, break a request into steps, etc.). I assumed keeping those separate would be faster (see the first sketch after this list for the branching pattern I’m now considering instead).
I also do tool routing by asking the LLM to pick a subset of tools, then rebuild the tool schema each time accordingly (the second sketch after this list is the alternative I’m wondering about).
Now I’m starting to think all of this is destroying my KV cache constantly, which might be making performance worse instead of better.
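For the side prompts, the alternative I’m considering is to branch off the same prefix instead of starting a fresh chat with a different system prompt: copy the current message list, append the side question, send it, and throw the branch away. The shared prefix (system prompt + tools + history) should then still match what’s cached; only the appended side question is new. Rough sketch, assuming it lives in the same class as the one above:

```csharp
// Sketch: run a throwaway "side" prompt as a branch of the SAME prefix instead of a
// fresh chat with a different system prompt. Assumes the StablePrefixChat fields above.
public async Task<string> AskSideQuestionAsync(string sideQuestion)
{
    // Clone the current history so the side turn never pollutes the main thread.
    var branch = (JsonArray)JsonNode.Parse(_messages.ToJsonString())!;
    branch.Add(new JsonObject { ["role"] = "user", ["content"] = sideQuestion });

    var body = new JsonObject
    {
        ["model"] = "qwen3-4b-instruct",   // assumed model id
        ["messages"] = branch,
        ["temperature"] = 0.0              // deterministic yes/no style checks
    };

    var resp = await Http.PostAsync(
        "http://localhost:1234/v1/chat/completions",
        new StringContent(body.ToJsonString(), Encoding.UTF8, "application/json"));
    resp.EnsureSuccessStatusCode();

    var json = JsonNode.Parse(await resp.Content.ReadAsStringAsync())!;
    return json["choices"]![0]!["message"]!["content"]?.GetValue<string>() ?? "";
}
```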
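And for tool routing, instead of rebuilding the tool schema per turn (which changes the prefix), I’m wondering about always sending the full tool list byte-identical and putting the routing hint in the new user turn, then enforcing the subset client-side before executing anything. Again just a sketch on top of the class above, not something I’ve benchmarked:

```csharp
// Sketch: route tools without touching the serialized tool array. The full schema is
// sent every turn (byte-identical, so it stays in the cached prefix); the routing hint
// goes in the new user message, and out-of-scope calls are rejected client-side.
// Assumes the StablePrefixChat.SendAsync sketch above.
using System.Collections.Generic;

public Task<string> SendWithToolHintAsync(string userMessage, IEnumerable<string> allowedToolNames)
{
    // The hint lives in the (new) user turn, not in the (cached) tools/system prefix.
    var hint = "For this step, only use these tools: " + string.Join(", ", allowedToolNames) + ".";
    return SendAsync(hint + "\n\n" + userMessage);
}

bool IsAllowed(string calledToolName, HashSet<string> allowedToolNames)
    => allowedToolNames.Contains(calledToolName); // enforce the routing before executing a tool call
```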
Just want to know what people actually do in practice. If there are smarter patterns for running LLMs on resource-constrained hardware, where every little bit of performance matters, I’d like to hear your thoughts...