**Authored by:** Venkatesh Wagh
“CPU is high in production.”
Every engineer has heard this at least once.
And almost every time, it triggers the same reaction:
open top, see 300% or 400% CPU, panic a little, maybe restart the service, and hope the problem goes away.
Sometimes it does. Most times, it comes back — louder, harder to explain, and more uncomfortable to debug.
High CPU is one of those problems that feels like a black box:
- No exceptions
- No stack traces
- No obvious failures
Memory looks stable. GC looks quiet. Requests are still flowing… just slower.
You’ll hear advice like:
- “Take a thread dump.”
- “Analyze hexadecimal thread IDs.”
- “Enable a profiler.”
- “Install an APM agent.”
All of that advice is technically correct — and still not very helpful if you don’t know which signal to trust first.
Before dashboards, flame graphs, or fancy tooling, it’s worth stepping back for a moment. Because the CPU is not a single thing. And not everything that looks expensive actually is.
A Simple Program That Eats an Entire CPU Core
Let’s start with something intentionally dumb.
// Program that consumes one CPU core with a single thread
@GetMapping("/basic-compute")
public void compute() {
    while (true) {
        int a = 2;
        int b = a * 5;
    }
}
No I/O. No allocations. No blocking calls.
Yet the moment this endpoint is hit, CPU spikes to 100%.
Why?
Because a single runnable thread is executing instructions as fast as the CPU allows. That’s it. No mystery.
This example is important because it exposes a common misconception:
High CPU does not require complexity. It only requires runnable threads.
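To make that concrete, here is a small standalone sketch (not part of the original demo; the class name and thread count are illustrative) that keeps several cores busy simply by putting more threads into the RUNNABLE state:

// BurnCores.java: illustrative sketch only.
// Each started thread sits in a tight loop, so top should show roughly
// one fully busy core per thread (about 400% process CPU for 4 threads).
public class BurnCores {
    public static void main(String[] args) {
        int threads = 4; // number of cores to keep busy; adjust as needed
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                long x = 0;
                while (true) {
                    x += 1; // pure computation: the thread stays RUNNABLE forever
                }
            }, "burner-" + i);
            t.start();
        }
    }
}

Run it and watch top: CPU climbs in step with the number of spinning threads, with no I/O, no allocations, and nothing else interesting going on.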
What “300% CPU” Actually Means
CPU numbers are often misread — especially on multi-core machines.
For application developers, CPU = cores available to your process.
If your VM has 8 cores:
- 100% CPU = one core fully busy
- 400% CPU = four cores fully busy
- 12.5% CPU overall can still mean one hot core
So when you see:
“The app is using 300% CPU”
It does not mean the system is melting down. It means three cores are doing work.
(On Linux, run top and press 1 to see per-core utilization. Press Shift+I to toggle Irix mode, which switches the %CPU column between per-core and whole-machine accounting.)
This distinction matters — because capacity problems and bugs look very similar in CPU graphs.
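The translation between the two views is plain arithmetic. A tiny sketch with illustrative numbers:

// CpuMath.java: the arithmetic above, spelled out (values are illustrative).
public class CpuMath {
    public static void main(String[] args) {
        double processCpuPercent = 300.0; // as reported by top in per-core (Irix) accounting
        int cores = 8;                    // cores available to the process

        double coresBusy = processCpuPercent / 100.0;       // 3.0 cores fully busy
        double machinePercent = processCpuPercent / cores;  // 37.5% of the whole machine

        System.out.printf("%.1f cores busy, %.1f%% of the machine%n", coresBusy, machinePercent);
    }
}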
Case Study: Synthetic High CPU Usage
I created a small Spring Boot application with multiple REST APIs that intentionally burn CPU.
Endpoints include:
- /basic-compute → ~100% CPU (single core)
- /heavy-creation → ~200–300% CPU
The /heavy-creation API generates ~25MB of JSON in memory. The CPU spike comes from:
- Object creation
- Serialization
- Parsing
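For reference, an endpoint of this shape might look roughly like the sketch below. This is not the exact code from the repo, just an illustration assuming Jackson is on the classpath:

// HeavyCreationSketch.java: illustrative only; the real endpoint lives in the linked repo.
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HeavyCreationSketch {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        List<Map<String, Object>> payload = new ArrayList<>();
        // Object creation: build a few hundred thousand small maps (tens of MB of JSON)
        for (int i = 0; i < 300_000; i++) {
            payload.add(Map.of("id", i, "name", "item-" + i, "value", Math.random()));
        }
        // Serialization: turn the object graph into a large JSON string
        String json = mapper.writeValueAsString(payload);
        // Parsing: read it straight back, burning more CPU
        Object parsed = mapper.readValue(json, List.class);
        System.out.println("JSON size ~" + json.length() / (1024 * 1024)
                + " MB, items: " + ((List<?>) parsed).size());
    }
}

All of the CPU here is ordinary, single-request work; nothing blocks and nothing leaks.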
You can observe this locally using:
- top / htop
- JConsole
- Basic JVM metrics
Project is available here: https://github.com/venkyintech-afk/cpu-spike-synthetic.git
First Observation: Thread Count Doesn’t Change
JConsole capture of the app after executing the /heavy-creation API
After triggering /heavy-creation, CPU jumps — but:
- Thread count remains constant (e.g., 31 threads)
- No new threads are spawned
This immediately tells us something important:
The CPU spike is not caused by thread creation. Existing threads are simply doing more work.
This is your first trustworthy signal.
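If you want the same signal programmatically rather than through JConsole, ThreadMXBean exposes the counts. A minimal sketch you could run inside the application (for example behind a debug endpoint):

// ThreadCountCheck.java: illustrative sketch; JConsole shows the same numbers graphically.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCountCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("Live threads:   " + threads.getThreadCount());
        System.out.println("Peak threads:   " + threads.getPeakThreadCount());
        System.out.println("Daemon threads: " + threads.getDaemonThreadCount());
        // If live and peak counts stay flat while CPU rises, the spike is coming
        // from existing threads doing more work, not from new threads.
    }
}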
Reading top Without Lying to Yourself
Let’s look at what top is actually telling us.
top output for the host; the 201% figure for the process is measured relative to a single core (Irix accounting)
Host-level view
CPU usage shows aggregate utilization across all cores
You might see:
- 36% user CPU
- 54% idle CPU
This already tells you:
The machine is not saturated.
Process-level view
Process-level view on macOS (htop -p <pid>, with all 8 cores shown)
On Linux:
top -p <pid>
On macOS:
top -pid <pid>
If you see 200% CPU for your process on an 8-core machine:
- That’s roughly 2 cores fully utilized
- Across all cores, it’s only ~25%
This is not a crisis. It’s math.
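You can sanity-check that math from inside the JVM as well. A sketch using the com.sun.management extension of OperatingSystemMXBean, which is available on HotSpot-based JVMs:

// ProcessCpuCheck.java: illustrative sketch for cross-checking what top reports.
import java.lang.management.ManagementFactory;

public class ProcessCpuCheck {
    public static void main(String[] args) throws InterruptedException {
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        int cores = Runtime.getRuntime().availableProcessors();
        Thread.sleep(1000); // let the bean accumulate a meaningful sample; early calls may return 0
        double processLoad = os.getProcessCpuLoad(); // 0.0..1.0, fraction of the whole machine
        System.out.printf("Cores: %d, process load: %.1f%% of machine (~%.1f cores busy)%n",
                cores, processLoad * 100, processLoad * cores);
    }
}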
Thread-level view (Linux)
On Linux, this is where things get interesting:
top -H -p <pid>
This shows:
- Which threads are consuming CPU
- Whether the work is concentrated or spread out
On macOS, thread-level CPU visibility is limited. You’ll rely more on:
- jstack
- Sampling multiple thread dumps
- JVM-level tools
Different OS, different constraints.
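One portable way to approximate thread-level attribution on either operating system is ThreadMXBean's per-thread CPU time, sampled from inside the application. A minimal sketch, assuming the JVM supports thread CPU time measurement:

// HotThreads.java: illustrative sketch of thread-level CPU attribution from inside the JVM.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HotThreads {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (bean.isThreadCpuTimeSupported() && !bean.isThreadCpuTimeEnabled()) {
            bean.setThreadCpuTimeEnabled(true);
        }
        for (long id : bean.getAllThreadIds()) {
            long cpuNanos = bean.getThreadCpuTime(id); // -1 if the thread died or measurement is off
            ThreadInfo info = bean.getThreadInfo(id);
            if (cpuNanos > 0 && info != null) {
                System.out.printf("%-40s %8d ms %s%n",
                        info.getThreadName(), cpuNanos / 1_000_000, info.getThreadState());
            }
        }
        // Sample this twice a few seconds apart and diff the values:
        // the threads whose CPU time grows fastest are your hot threads.
    }
}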
The Most Important Question to Ask First
Is CPU being consumed by more threads — or by hotter threads?
Thread count increasing? Look for:
- Executor leaks
- Unbounded async pipelines
- Thread-per-request patterns
Thread count stable, CPU rising? Look for:
- Tight loops
- Serialization/deserialization
- Metrics collection
- Polling logic
- Inefficient batching
This single question narrows the problem space dramatically.
Use tools to improve your visibility and get more facts.
Tooling: Use It With Intent
Toolkits exist for a reason — but they work best after you understand the shape of the problem.
APM tools (AppDynamics, Dynatrace, New Relic)
Great for:
- Thread-level CPU attribution
- Allocation hot paths
- Time-based correlation
Thread dumps (**jstack**)
Ideal when:
- CPU is high but unexplained
- You want to identify RUNNABLE threads
- You need ground truth
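On Linux, the classic jstack workflow pairs top -H -p <pid> with a dump: take the decimal TID of the hot thread, convert it to hex, and search the dump for the matching nid=0x… entry. A tiny sketch of that conversion (the TID value is hypothetical):

// TidToNid.java: matching a hot native thread from top -H to a Java thread in a jstack dump.
public class TidToNid {
    public static void main(String[] args) {
        long tidFromTop = 12345L; // hypothetical TID copied from the top -H output
        String nid = "0x" + Long.toHexString(tidFromTop);
        System.out.println("Search the jstack output for: nid=" + nid);
    }
}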
JFR / async profilers
Powerful, but only useful once you know what you're hunting.
Tools don’t replace thinking. They amplify it.
Closing Thoughts
High CPU is not mysterious. It’s just often misunderstood.
Most production CPU incidents come down to:
- Runnable threads doing exactly what they were coded to do
- Misinterpreted metrics
- Incorrect mental models about cores and percentages
Once you understand what CPU numbers really represent, debugging becomes far less emotional — and far more mechanical.
Until the next article, Cheers 🥂
Authored by Venkatesh Wagh, Technology Tinkerer and enthusiast.