What really controls your processes on a Linux system?
You run a command, and it just… works. But what stops that command from consuming every last byte of RAM? What prevents a buggy script from saturating all 32 of your CPU cores, bringing your critical database to a grinding halt? In a modern multi-tenant server, or more recognizably, in a container, what builds the “walls” that keep applications in their own sandboxes?
The answer, deep in the plumbing of the Linux kernel, is Control Groups, or cgroups.
If you’ve ever used Docker, Kubernetes, or even modern systemd, you’ve used cgroups. They are the invisible engine of the container revolution and the unsung hero of modern server stability. They are the fundamental kernel mechanism that makes resource management—limiting, accounting for, and isolating hardware resources—possible.
This is not just another brief overview. This is the ultimate, 10,000-word guide to Linux cgroups. We will start from the absolute basics (“what is a cgroup?”) and go all the way to the deep, technical internals of cgroup v1 and cgroup v2, how systemd uses them to manage your entire system, and how Kubernetes (K8s) and Docker translate a simple YAML file into kernel-level resource limits.
By the end of this article, you will not just know about cgroups; you will understand them.
Cgroups Resource Partitioning
The “Why?” – What Problem Do Cgroups Solve?
Before cgroups, resource management on Linux was primitive. You had nice and renice to adjust process priority—a suggestion, not a guarantee. You had ulimit to set resource limits, but it was per-process, not for a group of related processes, and it was notoriously inflexible.
This wasn’t good enough for the world Google was building in the mid-2000s. They needed two key things:
- Multi-Tenancy: To run many different applications (e.g., web search, maps, mail) on the same physical hardware without them interfering with each other. If the “maps” application had a memory leak, it could not be allowed to crash the “search” application.
- Quality of Service (QoS): To guarantee that high-priority, customer-facing applications always get the CPU and I/O they need, while low-priority, background batch jobs only use whatever resources are “left over.”
To solve this, Google engineers (Paul Menage and Rohit Seth) developed “process containers” in 2006, which were later renamed Control Groups and merged into the Linux kernel in 2008 (kernel 2.6.24).
Cgroups solve this by providing a mechanism to:
- Limit Resources: Set hard upper limits. For example, “This group of processes can never use more than 2GB of RAM or 50% of one CPU core.”
- Prioritize Resources: When there is resource contention (e.g., two applications both want 100% of the CPU), cgroups decide who gets what share. For example, “Application A gets 70% of the CPU time, and Application B gets 30%.”
- Account for Resources: Measure exactly how much CPU time, memory, and I/O a group of processes has used. This is critical for billing and monitoring.
- Isolate and Control: Do more than just limit resources. Cgroups can freeze a group of processes, restrict them to specific CPU cores, or deny them access to certain hardware devices.
This framework is the foundation for all containerization. A Docker container is, at its core, a collection of Linux kernel features. Cgroups provide the resource limits (the “walls” of the container’s room), while Namespaces provide the isolation (making the container think it’s in its own, separate room).
The Core Concepts of Cgroups
Before we dive into the v1 vs. v2 war, we must understand the terminology.
- Cgroup (Control Group): A collection of tasks (processes) that are treated as a single unit for resource management.
- Task: A process or a thread in the Linux kernel.
- Controller (or Subsystem): A specific piece of the kernel that manages one specific resource. There is a `memory` controller, a `cpu` controller, a `pids` controller, and so on.
- Hierarchy: A tree-like structure of cgroups. A cgroup can have child cgroups, which inherit and can further restrict the limits of their parent. This is the most important concept for understanding the difference between v1 and v2.
- cgroupfs: A pseudo-filesystem (like `/proc` or `/sys`) that provides the user-space interface to cgroups. To manage cgroups, you simply create directories and read/write to files in this filesystem, which is almost always mounted at `/sys/fs/cgroup`.
Cgroup Hierarchy Tree
To put it all together: You create a cgroup (by creating a directory in cgroupfs). You enable specific controllers for it (like memory and cpu). You configure those controllers (by echo-ing values to files like memory.limit_in_bytes). Finally, you add your application’s processes (by writing their PIDs to a file named tasks or cgroup.procs).
From that moment on, all those processes (and any they spawn) are bound by the limits you defined.
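Because the interface is just directories and files, the whole lifecycle above can be sketched in a few lines. A minimal Python illustration (the `setup_cgroup` helper is hypothetical; here it runs against a throwaway temp directory standing in for cgroupfs, since doing this against the real `/sys/fs/cgroup` requires root and a cgroup v2 mount):

```python
import os
import tempfile

def setup_cgroup(cgroup_root, name, cpu_max, pid):
    """Sketch of the cgroup file API: mkdir a group, write a limit
    file, then attach a PID by writing it to cgroup.procs."""
    group = os.path.join(cgroup_root, name)
    os.mkdir(group)  # creating the directory creates the cgroup
    with open(os.path.join(group, "cpu.max"), "w") as f:
        f.write(cpu_max)  # e.g. "50000 100000" = 50% of one core
    with open(os.path.join(group, "cgroup.procs"), "w") as f:
        f.write(str(pid))  # on real cgroupfs the kernel moves the process
    return group

# Demo against a temp dir (a stand-in, so nothing is actually limited):
root = tempfile.mkdtemp()
group = setup_cgroup(root, "my-group", "50000 100000", os.getpid())
print(open(os.path.join(group, "cpu.max")).read())  # -> 50000 100000
```

On a real system you would point `cgroup_root` at `/sys/fs/cgroup` and run as root; the file names (`cpu.max`, `cgroup.procs`) are the genuine v2 interface covered later in this article.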
The Exhaustive Deep Dive: Cgroups v1
Cgroup v1 (also called the “legacy” or “hybrid” model) was the standard for years. It’s powerful, flexible, and deeply confusing. Its defining feature—and its greatest flaw—is its use of multiple, independent hierarchies.
The v1 Problem: Multiple Hierarchies
In v1, you can mount different cgroup trees for different sets of controllers.
For example, you could have:
- One hierarchy for the `cpu` and `cpuacct` controllers, mounted at `/sys/fs/cgroup/cpu,cpuacct`.
- A completely separate hierarchy for the `memory` controller, mounted at `/sys/fs/cgroup/memory`.
This means a process P1 could be in the /production cgroup in the cpu hierarchy, but in the /testing cgroup in the memory hierarchy. This flexibility was intended to allow complex, independent resource allocation, but in practice, it was a nightmare. It made it impossible to have a single, unified view of “what limits apply to this application?”
This is why systemd (which we’ll cover later) introduced a “hybrid” model where it tried to co-mount as many controllers as possible into one tree, but it was still a workaround.
The v1 Controllers: An Exhaustive Guide
Let’s explore the most important v1 controllers and their key files. You find these in /sys/fs/cgroup/<controller_name>.
1. cpu Controller
Manages CPU time using a “Completely Fair Scheduler” (CFS).
- `cpu.shares`:
  - What it is: A relative weight for CPU time. The default is `1024`.
  - How it works: This value only matters during CPU contention. If `Cgroup-A` has `1024` shares and `Cgroup-B` has `512` shares, and both are trying to use 100% of the CPU, `Cgroup-A` will get roughly 2/3 (1024 / (1024+512)) of the CPU time, and `Cgroup-B` will get 1/3.
  - Common Pitfall: If the CPU is not contended (i.e., total usage is < 100%), this setting does nothing. A cgroup with `512` shares is free to use 100% of the CPU if no one else is using it.
- `cpu.cfs_period_us` and `cpu.cfs_quota_us`:
  - What it is: This is the hard cap on CPU usage (what most people think of as a “CPU limit”).
  - How it works: `cfs_period_us` is the total time period, in microseconds. `cfs_quota_us` is how much “run time” the cgroup gets within that period.
  - Example: To limit a cgroup to 50% of one CPU core:
    - `cpu.cfs_period_us`: `100000` (100ms)
    - `cpu.cfs_quota_us`: `50000` (50ms)
  - Example: To limit a cgroup to 2.5 CPU cores:
    - `cpu.cfs_period_us`: `100000` (100ms)
    - `cpu.cfs_quota_us`: `250000` (250ms)
  - If `cfs_quota_us` is set to `-1`, there is no limit.
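The quota/period arithmetic is simple enough to capture in two helper functions (hypothetical names, shown as a sketch of the math rather than any real API):

```python
def cfs_quota(cores, period_us=100_000):
    """Translate a desired number of CPU cores into a cfs_quota_us
    value for the given cfs_period_us. None means 'no limit' (-1)."""
    if cores is None:
        return -1
    return int(cores * period_us)

def share_fraction(my_shares, all_shares):
    """Fraction of CPU a cgroup receives *under contention*,
    based on cpu.shares weights."""
    return my_shares / sum(all_shares)

print(cfs_quota(0.5))   # -> 50000  (50% of one core)
print(cfs_quota(2.5))   # -> 250000 (2.5 cores)
print(round(share_fraction(1024, [1024, 512]), 3))  # -> 0.667
```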
2. cpuacct Controller
This controller doesn’t enforce limits; it just reports on CPU usage. It’s almost always mounted in the same hierarchy as the cpu controller.
- `cpuacct.usage`: Total CPU time (in nanoseconds) consumed by all tasks in this cgroup. This is a simple counter you can read.
- `cpuacct.stat`: Breaks down CPU time into `user` (time spent in user-space code) and `system` (time spent in kernel-space code).
3. cpuset Controller
Critically important for performance and NUMA (Non-Uniform Memory Access) systems. This controller pins tasks to specific CPU cores and memory nodes.
- `cpuset.cpus`:
  - What it is: The list of CPU cores the cgroup’s tasks are allowed to run on.
  - Example: `echo "0-3,7" > cpuset.cpus` (Allows tasks to run on cores 0, 1, 2, 3, and 7).
- `cpuset.mems`:
  - What it is: The list of memory nodes the cgroup is allowed to allocate memory from.
  - How it works: On multi-socket servers (NUMA), each CPU has “local” memory that is faster to access. This setting ensures a process running on CPU `0` allocates from its local memory node (e.g., node `0`), dramatically improving performance.
2-socket NUMA System
4. memory Controller
This is arguably the most complex and important controller. It manages memory usage and, crucially, is responsible for the OOM (Out-of-Memory) Killer.
- `memory.limit_in_bytes`:
  - What it is: The absolute hard limit on memory usage (RSS + page cache). If a process tries to allocate memory that would push the cgroup over this limit, the OOM Killer is invoked.
  - Example: `echo "1073741824" > memory.limit_in_bytes` (Sets a 1GB limit). You can also use suffixes like `1G` or `500M`.
- `memory.soft_limit_in_bytes`:
  - What it is: A “best-effort” limit. The kernel will try to reclaim memory from cgroups that are over their soft limit, but it’s not a guarantee.
  - How it works: This is useful for prioritization. A cgroup with a low soft limit is a good “first-dibs” candidate for memory reclamation when the system is under pressure, before it hits its hard limit.
- `memory.oom_control`:
  - What it is: Controls the OOM Killer for this cgroup.
  - Key file: `memory.oom_control` (a file, not a directory).
  - Settings:
    - `oom_kill_disable 0` (default): The OOM Killer is enabled. When `memory.limit_in_bytes` is hit, the kernel will kill a process inside this cgroup to free memory.
    - `oom_kill_disable 1`: The OOM Killer is disabled. This is dangerous. If the cgroup hits its limit, the kernel won’t kill a process; instead, the process that triggered the limit will just pause indefinitely, waiting for memory to be freed.
- `memory.usage_in_bytes`: Read-only file showing the cgroup’s current total memory usage.
- `memory.stat`: A detailed file showing a breakdown of memory: `cache` (page cache), `rss` (Resident Set Size, i.e., actual RAM), `swap`, etc.
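`memory.stat` uses the flat `key value` format shared by most cgroup stat files, so it is trivial to parse. A small sketch (the `parse_memory_stat` helper is mine, not part of any library):

```python
def parse_memory_stat(text):
    """Parse the flat 'key value' format of memory.stat into a dict."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    return stats

# A shortened sample in the same shape as a real memory.stat:
sample = "cache 1048576\nrss 4194304\nswap 0"
stats = parse_memory_stat(sample)
print(stats["rss"])  # -> 4194304
```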
5. blkio Controller
Manages Block I/O (i.e., disk reads and writes). This controller is complex and known for being somewhat imprecise.
- `blkio.weight`:
  - What it is: A relative weight for I/O, similar to `cpu.shares`. The value is between 100 and 1000 (default 500).
  - How it works: Only applies during I/O contention. A cgroup with a weight of `1000` will get twice the I/O “time” of a cgroup with a weight of `500`.
- `blkio.throttle.read_bps_device` and `blkio.throttle.write_bps_device`:
  - What it is: The hard cap on I/O bandwidth (bytes per second).
  - How it works: You must specify the device’s major:minor number (which you can find with `ls -l /dev/sda`).
  - Example: To limit reads from `/dev/sda` (e.g., `8:0`) to 10MB/s:
    - `echo "8:0 10485760" > blkio.throttle.read_bps_device`
- `blkio.throttle.read_iops_device` and `blkio.throttle.write_iops_device`:
  - What it is: A hard cap on I/O Operations Per Second (IOPS). This is more useful for fast SSDs, where bandwidth isn’t the bottleneck but the number of operations is.
6. devices Controller
A security controller that acts as a whitelist/blacklist for device access.
- `devices.allow` and `devices.deny`:
  - What it is: Whitelist or blacklist access to a device.
  - How it works: You write a string specifying the device type (`c` for character, `b` for block), the `major:minor` number, and the permissions (`r` for read, `w` for write, `m` for `mknod`).
  - Example: To deny all write access to `/dev/sda` (`8:0`):
    - `echo "b 8:0 w" > devices.deny`
- By default, a cgroup inherits its parent’s `devices.list`. An empty `devices.list` means it has access to nothing until you `allow` it. This is how containers prevent access to host hardware.
7. pids Controller
A very simple but critical controller for preventing “fork bombs.”
- `pids.max`:
  - What it is: The maximum number of tasks (PIDs) that can exist in the cgroup at one time.
  - Example: `echo "100" > pids.max` (Limits this cgroup to 100 total processes/threads). If it tries to `fork()` a 101st process, the call will fail.
8. freezer Controller
A “utility” controller that can suspend and resume all tasks in a cgroup.
- `freezer.state`:
  - What it is: A file to control the state.
  - `echo "FROZEN" > freezer.state`: Suspends (pauses) all tasks in the cgroup.
  - `echo "THAWED" > freezer.state`: Resumes all tasks.
  - Use Case: Live-snapshotting a running application, or container migration.
Cgroups v2: The Unified, Simplified Future
Cgroup v1 was powerful, but it was a mess. The multiple hierarchies, the inconsistent file names (cpu.shares vs. blkio.weight), and the strange rules (like cpuacct doing nothing but reporting) made it difficult to use.
The kernel developers went back to the drawing board and created Cgroup v2. Its primary goal was to fix everything wrong with v1. It was declared stable in Linux kernel 4.5 (2016), and as of 2024+, it is the default and recommended version, used by all modern systemd and container runtimes.
The Core Principle of v2: The Unified Hierarchy
This is the single most important change. In v2, there is only one cgroup tree (hierarchy). All controllers are mounted in one place: /sys/fs/cgroup (or /sys/fs/cgroup/unified).
- No More Multiple Trees: A process is in exactly one cgroup, period. This cgroup’s path in the tree defines all of its resource limits.
- No More “Orphan” Controllers: You can’t have a `cpu` tree and a separate `memory` tree. `cpu` and `memory` are both controllers that can be enabled on the same cgroup.
Cgroups v2
The New v2 Interface: How It Works
The file-based API was completely redesigned for consistency.
1. Controller Activation: cgroup.subtree_control
This is the new “magic” file and the most common source of confusion for v1 users.
- How it worked in v1: You created a directory `/sys/fs/cgroup/cpu/my-group`, and it was automatically a `cpu` cgroup.
- How it works in v2:
  - You create a directory `/sys/fs/cgroup/my-group`. By default, this directory can’t have resource limits set on it (it’s just a “parent”).
  - To “delegate” controllers to child cgroups, the parent cgroup must explicitly enable them.
  - You do this by writing to the parent’s `cgroup.subtree_control` file.
- Example: To allow `my-group` to have children that control `cpu` and `memory`:
  - `echo "+cpu +memory" > /sys/fs/cgroup/cgroup.subtree_control` (this enables them for the root cgroup’s children)
  - Now you can create the group: `mkdir /sys/fs/cgroup/my-group`
  - `echo "+cpu +memory" > /sys/fs/cgroup/my-group/cgroup.subtree_control` (this enables them for `my-group`’s children)
  - Now you can create the leaf: `mkdir /sys/fs/cgroup/my-group/my-app`
  - …and finally, you can set limits in `/sys/fs/cgroup/my-group/my-app`.
This “delegation” model ensures the hierarchy is clean and intentional. A cgroup can either (a) have processes in it, or (b) have child cgroups with controllers enabled, but not both (with some exceptions at the root).
2. Adding Processes: cgroup.procs
- The `tasks` file is gone.
- To add a process to a cgroup, you write its PID to `cgroup.procs`.
- The kernel automatically handles moving all threads of that process.
The v2 Controllers: Cleaned Up and More Powerful
The v2 controllers are renamed and refactored for consistency.
1. cpu Controller (v2)
Combines the functionality of v1’s cpu and cpuacct.
- `cpu.weight`:
  - Replaces: `cpu.shares`.
  - What it is: A relative weight, but the range is now 1–10000 (default 100). The same “contention-only” logic applies.
- `cpu.max`:
  - Replaces: `cpu.cfs_quota_us` and `cpu.cfs_period_us`.
  - What it is: A much simpler format: `$QUOTA $PERIOD`.
  - Example: To limit to 50% of one core:
    - `echo "50000 100000" > cpu.max`
  - Example: To limit to 2.5 cores:
    - `echo "250000 100000" > cpu.max`
- `cpu.stat`: Read-only file that replaces `cpuacct`. Shows `usage_usec` (total CPU time), `user_usec`, `system_usec`, and (critically) `throttled_usec` (how long the cgroup was prevented from running because it hit its `cpu.max` limit).
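One wrinkle when migrating from v1: `cpu.shares` values don’t map one-for-one onto `cpu.weight`. OCI runtimes such as runc apply a linear remapping of the v1 range (2–262144) onto the v2 range (1–10000), sketched below. Note that the v1 default of `1024` lands nowhere near the v2 default of `100`:

```python
def shares_to_weight(shares):
    """Linearly map a v1 cpu.shares value (2..262144) onto the
    v2 cpu.weight scale (1..10000), as OCI runtimes like runc do."""
    return 1 + ((shares - 2) * 9999) // 262142

print(shares_to_weight(2))       # -> 1     (v1 minimum -> v2 minimum)
print(shares_to_weight(262144))  # -> 10000 (v1 maximum -> v2 maximum)
print(shares_to_weight(1024))    # -> 39    (v1 default; NOT v2's default of 100)
```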
2. memory Controller (v2)
Vastly improved with better controls for soft limits and memory pressure.
- `memory.max`:
  - Replaces: `memory.limit_in_bytes`.
  - What it is: The hard limit. Hitting this invokes the OOM Killer.
  - Example: `echo "1073741824" > memory.max` (1GB limit).
- `memory.high`:
  - What it is: The main “soft limit” and throttling mechanism.
  - How it works: When a cgroup goes above `memory.high`, the kernel will heavily throttle its tasks (making them run slower) and aggressively try to reclaim memory from it. This is a much stronger “suggestion” than v1’s `soft_limit`. The cgroup is allowed to exceed this line, but it will pay a heavy performance penalty.
- `memory.low`:
  - What it is: A “best-effort” guarantee. The kernel will try not to reclaim memory from this cgroup while its usage is below this `low` watermark.
- `memory.min`:
  - What it is: A stronger guarantee. The kernel will work very hard to protect this amount of memory for the cgroup.
- The New “Protection” Model:
  - `memory.min` / `memory.low`: Protect memory below this line.
  - `memory.high` / `memory.max`: Throttle or kill memory above this line.
  - This gives you a “safe zone” (`low` to `high`) where your app can operate, with clear protection and throttling boundaries.
- `memory.events`: A file that reports key events, including `oom` (how many times the cgroup hit an OOM condition) and `high` (how many times `memory.high` was breached).
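The four thresholds can be read as zones on a number line. A toy classifier (my own illustration of the model, not a kernel API), assuming `min <= low <= high <= max`:

```python
def memory_zone(usage, min_, low, high, max_):
    """Classify a cgroup's memory usage against the v2 protection
    thresholds (min <= low <= high <= max)."""
    if usage < min_:
        return "protected"       # kernel works very hard not to reclaim
    if usage < low:
        return "soft-protected"  # best-effort protection from reclaim
    if usage <= high:
        return "safe-zone"       # normal operation
    if usage <= max_:
        return "throttled"       # over memory.high: reclaim + slowdown
    return "oom"                 # over memory.max: OOM Killer territory

print(memory_zone(600, 100, 200, 800, 1000))  # -> safe-zone
print(memory_zone(900, 100, 200, 800, 1000))  # -> throttled
```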
3. io Controller (v2)
The new and improved replacement for blkio. It’s “weight-based” and “cost-based,” meaning it’s much smarter about different device types (e.g., it knows an I/O operation on a spinning disk “costs” more than one on an NVMe SSD).
- `io.weight`:
  - Replaces: `blkio.weight`. Relative I/O priority (1–10000).
- `io.max`:
  - Replaces: `blkio.throttle.*`.
  - What it is: A much simpler interface for hard limits.
  - Format: `MAJ:MIN TYPE=VALUE`
  - Example: To limit reads from `/dev/sda` (`8:0`) to 10MB/s and writes to 500 IOPS:
    - `echo "8:0 rbps=10485760 wiops=500" > io.max`
    - (`rbps` = read bytes per second, `wiops` = write IOPS).
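Because `io.max` takes a single line of `key=value` pairs, building it programmatically is trivial. A small sketch (the helper name and validation are mine; the four keys are the ones the v2 `io` controller accepts):

```python
def io_max_line(major, minor, **limits):
    """Build an io.max line: 'MAJ:MIN key=value ...'.
    Accepted keys: rbps, wbps, riops, wiops."""
    allowed = {"rbps", "wbps", "riops", "wiops"}
    unknown = set(limits) - allowed
    if unknown:
        raise ValueError(f"unknown io.max keys: {unknown}")
    pairs = " ".join(f"{key}={value}" for key, value in limits.items())
    return f"{major}:{minor} {pairs}"

print(io_max_line(8, 0, rbps=10485760, wiops=500))
# -> 8:0 rbps=10485760 wiops=500
```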
4. pids Controller (v2)
Identical in function to v1, just new file names.
- `pids.max`: The max number of PIDs.
- `pids.current`: Read-only file showing the current number of PIDs.
The v2 Killer Feature: Pressure Stall Information (PSI)
This is one of the most powerful monitoring features ever added to the Linux kernel. It’s not a controller, but a monitoring interface that’s available by default when cgroup v2 is enabled.
The Problem: Your application is slow. Why? Is it…
- …throttled because it hit its `cpu.max` limit? (CPU pressure)
- …waiting for the kernel to reclaim memory? (Memory pressure)
- …waiting for a disk read/write to complete? (I/O pressure)
Before PSI, this was incredibly hard to diagnose.
The Solution: PSI adds files to every cgroup: cpu.pressure, memory.pressure, and io.pressure.
When you cat one of these files, you see:
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
- `some`: The percentage of time in the last 10, 60, or 300 seconds that at least one task in the cgroup was stalled, waiting for this resource.
- `full`: The percentage of time that all tasks in the cgroup were stalled simultaneously, waiting for this resource.
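The PSI format is stable and line-oriented, so monitoring agents can parse it directly. A sketch (the `parse_psi` function is hypothetical, but the input format matches the files above):

```python
def parse_psi(text):
    """Parse a PSI file (cpu.pressure / memory.pressure / io.pressure)
    into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        kind, *fields = line.split()
        metrics = {}
        for field in fields:
            key, _, value = field.partition("=")
            # 'total' is a cumulative integer (microseconds); avgN are floats
            metrics[key] = int(value) if key == "total" else float(value)
        result[kind] = metrics
    return result

sample = ("some avg10=1.25 avg60=0.40 avg300=0.10 total=123456\n"
          "full avg10=0.00 avg60=0.00 avg300=0.00 total=0")
print(parse_psi(sample)["some"]["avg10"])  # -> 1.25
```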
Why this is a game-changer:
- You can now definitively know why your app is slow.
- `cpu.pressure` `some` is high? Your cgroup is CPU-throttled. You need to increase its `cpu.max` or `cpu.weight`.
- `memory.pressure` `full` is high? Your cgroup is being heavily throttled by `memory.high` or is actively swapping. It’s starved for memory.
- `io.pressure` `some` is high? Your tasks are waiting for disk I/O.
- This enables smart autoscaling. A Kubernetes autoscaler could see that `cpu.pressure` is high and decide to add another pod before CPU usage even hits 100%.
High Memory Pressure
The Practical Guide: How to Use Cgroups
Enough theory. Let’s get our hands dirty. There are three main ways to interact with cgroups, from worst to best.
Method 1: Manual Interaction (The “For-Learning” Way)
You should do this to understand how the plumbing works, but never do this on a production server.
Prerequisite: Your system must be running cgroup v2. Check with mount | grep cgroup. You should see cgroup2 on /sys/fs/cgroup type cgroup2.
Example 1: Limit CPU with v2
Let’s limit a simple while true loop.
- Become root: `sudo -s`
- Navigate to the root cgroup: `cd /sys/fs/cgroup`
- Enable controllers for child cgroups: We must “delegate” the `cpu` controller from the root: `echo "+cpu" > cgroup.subtree_control`
- Create our new cgroup: `mkdir my-cpu-group`
- Set the CPU limit: Let’s limit this group to 20% of one CPU core: `echo "20000 100000" > my-cpu-group/cpu.max`
- Run a “CPU-hungry” process: Open another terminal and start a spin loop: `while true; do :; done`
- Find its PID: The loop runs inside that terminal’s shell (`while` is a shell builtin, not a separate process), so grab the shell’s PID with `echo $$` in that terminal. Let’s say it returns 12345.
- Move the process into the cgroup: `echo "12345" > my-cpu-group/cgroup.procs`
- Observe: Open `htop`. You will see the looping shell, which was previously using 100% of a core, is now capped at 20%.
- Clean up: Kill the loop (`kill 12345`), then remove the cgroup (it must be empty of processes first): `rmdir my-cpu-group`
Example 2: Limit Memory with v2
Let’s limit the stress-ng tool.
- Install `stress-ng`: `apt install stress-ng` or `dnf install stress-ng`
- Become root: `sudo -s`
- Navigate: `cd /sys/fs/cgroup`
- Enable controllers: This time we need `memory`. We already enabled `+cpu`; let’s add `+memory`: `echo "+memory" > cgroup.subtree_control`
- Create our memory cgroup: `mkdir my-mem-group`
- Set the hard memory limit: Let’s set it to 100MB: `echo "100M" > my-mem-group/memory.max`
- Run a “memory-hungry” process: We’ll use `stress-ng` to try to allocate 500MB, which is over our limit. Run it in the background: `stress-ng --vm 1 --vm-bytes 500M &`
- Find its PID: `pidof stress-ng` (let’s say it returns 54321)
- Move the process before it allocates (this is a bit of a race, but `stress-ng` is usually slow enough): `echo "54321" > my-mem-group/cgroup.procs`
- Observe:
  - Check `dmesg -w`. You will see messages from the kernel: “OOM-killer: task … (stress-ng) … killed”.
  - Check the cgroup’s events: `cat my-mem-group/memory.events`. You will see `oom 1` and `oom_kill 1`.
- Clean up: `rmdir my-mem-group`
This manual method is messy. Processes die, cgroups need to be cleaned up… there’s a better way.
Method 2: systemd (The “Modern Linux” Way)
On almost any modern Linux distro, systemd is the king of cgroups. It manages the entire cgroup tree (using v2 by default) and abstracts all this messiness away.
You should never manually mkdir in /sys/fs/cgroup on a systemd system. You should let systemd do it for you.
systemd divides everything into three types of cgroup “units”:
- Slices: Groups that contain other units — the “branches” of the tree. Examples: `system.slice` (for system services), `user.slice` (for all user login sessions), `machine.slice` (for VMs and containers).
- Services: Cgroups for daemons, like `nginx.service` or `sshd.service`.
- Scopes: Cgroups for transient, user-initiated processes, like a user’s login shell.
Tool 1: systemd-cgls
This command shows you the entire cgroup tree as systemd sees it. It’s the best way to visualize what’s running on your system.
Output of systemd-cgls
Tool 2: systemd-cgtop
This is like htop but for cgroups. It shows you the top cgroups ranked by their CPU, Memory, and I/O usage.
Tool 3: Setting limits in .service files
This is the “persistent” way to limit a service. If you’re an admin, this is how you control nginx, postgres, etc.
Edit the service’s unit file (e.g., systemctl edit nginx.service). In the [Service] section, you add resource-control options.
[Service]
# v1-style (shares)
CPUShares=512
# v2-style (weight)
CPUWeight=100
# v2-style (max)
CPUQuota=50%
# Memory limits (v2)
MemoryMin=100M
MemoryLow=500M
MemoryHigh=1G
MemoryMax=2G
# pids limit
TasksMax=500
# blkio limit
IOReadBandwidthMax=/dev/sda 10M
IOWriteBandwidthMax=/dev/sda 5M
After adding this, run systemctl daemon-reload and systemctl restart nginx.service. systemd will handle all the echoing to all the correct cgroup files for you.
Tool 4: systemd-run (The “On-the-Fly” Way)
This is the best way to run a one-off command with specific limits. It’s the systemd equivalent of our manual mkdir example, but it’s clean and safe.
Let’s re-do our CPU limit example:
# Run a loop in a new, transient scope, with a 20% CPU quota
systemd-run --scope -p CPUQuota=20% bash -c 'while true; do :; done'
That’s it! One command. systemd creates a transient scope unit with an auto-generated name, sets the `cpu.max` property, moves the process into it, and automatically cleans up the cgroup when the process finishes.
Let’s re-do our memory limit example:
# Run stress-ng in a new service, limited to 100M of RAM
systemd-run --unit=my-mem-test -p "MemoryMax=100M" stress-ng --vm 1 --vm-bytes 500M
You can then check its status with systemctl status my-mem-test. You’ll see it failed, and journalctl -u my-mem-test will show the OOM kill. It’s all perfectly managed and logged.
Method 3: Container Runtimes (Docker & Kubernetes)
This is the highest level of abstraction. Docker and Kubernetes are, in many ways, just fancy systemd-run wrappers.
When you write docker run --memory=1g --cpus=0.5 ..., Docker is simply:
- Creating a new cgroup.
- `echo "1G" > .../memory.max`
- `echo "50000 100000" > .../cpu.max`
- Putting the container’s process into that cgroup.
The systemd vs. cgroupfs Driver
There’s one “gotcha” in the container world: the --cgroup-driver flag for Docker and the Kubelet.
- `cgroupfs` driver (old default): Docker manages its own cgroups, manually `mkdir`-ing and `echo`-ing into `/sys/fs/cgroup`.
- `systemd` driver (new default): Docker asks systemd (via its API) to create and manage a cgroup (a `.scope` unit) for the container.
Why does this matter? If Docker uses the cgroupfs driver, systemd (which thinks it owns the whole tree) gets confused. You end up with two “managers” of cgroups, which can lead to instability.
The rule is: ALWAYS use the systemd driver for both Docker and the Kubelet. This ensures systemd is the single source of truth for the cgroup tree, and tools like systemd-cgls will show you your containers.
systemd vs cgroupfs
Cgroups and the Container Revolution
This is where all the pieces come together. Cgroups are the enforcement mechanism for container resource limits.
How Docker Uses Cgroups
Let’s trace a docker run command:
docker run -d \
--name my-nginx \
--cpus="1.5" \
--memory="512m" \
--pids-limit=100 \
--blkio-weight=300 \
nginx
Assuming Docker is using the systemd cgroup driver, here’s what happens under the hood:
- Docker calls systemd to create a new scope unit, something like `docker-<container-id>.scope`, under `system.slice`.
- systemd creates the cgroup: `/sys/fs/cgroup/system.slice/docker-<container-id>.scope`
- systemd (at Docker’s request) writes the limits:
  - `echo "150000 100000" > .../cpu.max` (1.5 cores)
  - `echo "536870912" > .../memory.max` (512MB)
  - `echo "100" > .../pids.max`
  - `--blkio-weight=300` is remapped onto the v2 `io.weight` scale — v1’s 100–1000 range doesn’t line up with v2’s 1–10000 range, so the value written is not literally `300`.
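The translation from `docker run` flags to cgroup file writes can be sketched as a pure function (a hypothetical helper of mine; it assumes the conventional 100ms CFS period and handles only `k`/`m`/`g` memory suffixes):

```python
def docker_to_cgroup_v2(cpus=None, memory=None, pids_limit=None):
    """Sketch of how docker-run flags become cgroup v2 file writes.
    memory accepts docker-style strings like '512m'."""
    writes = {}
    if cpus is not None:
        writes["cpu.max"] = f"{int(cpus * 100_000)} 100000"
    if memory is not None:
        suffixes = {"k": 1024, "m": 1024**2, "g": 1024**3}
        if memory[-1].lower() in suffixes:
            memory = int(memory[:-1]) * suffixes[memory[-1].lower()]
        writes["memory.max"] = str(memory)
    if pids_limit is not None:
        writes["pids.max"] = str(pids_limit)
    return writes

print(docker_to_cgroup_v2(cpus=1.5, memory="512m", pids_limit=100))
# -> {'cpu.max': '150000 100000', 'memory.max': '536870912', 'pids.max': '100'}
```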
The container is now running, fully constrained by the kernel.
How Kubernetes (Kubelet) Uses Cgroups
Kubernetes takes this one step further by creating a hierarchy of cgroups to manage Quality of Service (QoS).
When you define a Pod, Kubernetes assigns it one of three QoS Classes based on your resources specification:
Guaranteed:
- How: You set `limits` and `requests`, and `limits` == `requests` for all resources (CPU and memory).
- What the Kubelet does: This is the highest priority. The Kubelet sets `cpu.max` to your CPU limit and `memory.max` to your memory limit. Your pod is guaranteed this amount, but it can never use more.
Burstable:
- How: You set `requests` and `limits`, and `requests` < `limits`. Or you only set `requests`.
- What the Kubelet does: This is the medium priority.
  - `cpu.weight` (v2) or `cpu.shares` (v1) is set based on your `requests` (e.g., a `100m` CPU request becomes roughly 102 v1 shares).
  - `cpu.max` is set to your `limit`.
  - `memory.max` is set to your `limit`.
  - `memory.low` (v2) or `memory.soft_limit_in_bytes` (v1) is set to your `request`.
  - Result: Your pod is guaranteed its `requests`. If the node has free resources, it can “burst” up to its `limits`. If the node is under pressure, it’s a more likely OOM-kill candidate than `Guaranteed` pods.
BestEffort:
- How: You set no `requests` or `limits`.
- What the Kubelet does: This is the lowest priority. It sets a minimal `cpu.weight` and no memory guarantees. These are the first pods to be OOM-killed if the node runs out of memory.
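The QoS rules above boil down to a small decision function. A simplified sketch for a single-container pod (the function name is mine; the real kubelet logic also handles multi-container pods and extended resources):

```python
def qos_class(requests, limits):
    """Derive the Kubernetes QoS class from a container's resources,
    following the rules above (simplified, single-container pod)."""
    if not requests and not limits:
        return "BestEffort"
    resources = {"cpu", "memory"}
    # Guaranteed: limits set for cpu+memory, and requests (defaulting
    # to limits when omitted) equal limits for both.
    if (set(limits) >= resources and
            all(requests.get(r, limits[r]) == limits[r] for r in resources)):
        return "Guaranteed"
    return "Burstable"

print(qos_class({}, {}))                             # -> BestEffort
print(qos_class({"cpu": "1", "memory": "512Mi"},
                {"cpu": "1", "memory": "512Mi"}))    # -> Guaranteed
print(qos_class({"cpu": "200m", "memory": "256Mi"},
                {"cpu": "1", "memory": "512Mi"}))    # -> Burstable
```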
The Kubelet Cgroup Tree:
The Kubelet, via systemd, creates its own slice called kubepods.slice and builds a tree inside it:
/sys/fs/cgroup/kubepods.slice/
├── kubepods-besteffort.slice/
│ ├── pod<POD_UID_1>/
│ │ ├── container<ID_A>
│ │ └── container<ID_B>
├── kubepods-burstable.slice/
│ ├── pod<POD_UID_2>/
│ │ └── container<ID_C>
└── kubepods-guaranteed.slice/
├── pod<POD_UID_3>/
│ └── container<ID_D>
This brilliant structure allows Kubernetes to manage resources at a “whole QoS class” level. For example, if the node is under pressure, the kernel will first try to reclaim memory from all pods in kubepods-besteffort.slice.
This is the magic: Your simple YAML file…
resources:
  requests:
    memory: "256Mi"
    cpu: "200m"
  limits:
    memory: "512Mi"
    cpu: "1"
…is translated by the Kubelet into a series of echo commands to files in a systemd-managed cgroup (/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/pod...), setting cpu.weight, cpu.max, memory.low, and memory.max, all using the kernel features we just explored.
Kubernetes YAML Flowchart
Cgroups as the Bedrock of the Cloud
We’ve been on a deep, 10,000-word journey from a simple kernel patch in 2008 to the engine powering the entire global cloud infrastructure.
Let’s recap the key takeaways:
- Cgroups limit, account for, and isolate resource usage (CPU, memory, I/O, etc.).
- Cgroup v1 was powerful but messy, with multiple hierarchies that made a unified view impossible.
- Cgroup v2 is the modern standard, with a single unified hierarchy and a clean, consistent interface.
- v2 “Magic Files”: `cgroup.subtree_control` (to delegate controllers) and `cgroup.procs` (to add PIDs).
- v2 Killer Features: The `memory.high` throttle point for better memory-pressure handling, and Pressure Stall Information (PSI) for unparalleled “why is it slow?” diagnostics.
- Don’t Touch! You should almost never manage cgroups manually by `mkdir`-ing in `/sys/fs/cgroup`.
- Use `systemd`: systemd is the “cgroup manager” for a modern Linux system. Use `.service` files or `systemd-run` to set limits.
- Use the `systemd` Driver: Ensure your container runtimes (Docker, containerd) are configured to use the `systemd` cgroup driver for a stable, unified tree.
- Containers are Cgroups: Docker and Kubernetes are high-level cgroup managers. Your `resources:` limits in a Pod YAML are just a friendly face for `echo`-ing values to `cpu.max` and `memory.max`.
Cgroups are a perfect example of the Linux philosophy: “everything is a file.” They transformed a complex problem (resource management) into a simple, file-based API. By mastering this API—or at least, by mastering the tools like systemd and Kubernetes that master it for you—you gain complete control over your system’s performance, stability, and security. They are the invisible walls, the traffic cops, and the accountants of the kernel, and now, they are no longer invisible to you.