What really controls your processes on a Linux system?
You run a command, and it just… works. But what stops that command from consuming every last byte of RAM? What prevents a buggy script from saturating all 32 of your CPU cores, bringing your critical database to a grinding halt? In a modern multi-tenant server, or more recognizably, in a container, what builds the “walls” that keep applications in their own sandboxes?
The answer, deep in the plumbing of the Linux kernel, is Control Groups, or cgroups.
If you’ve ever used Docker, Kubernetes, or even modern systemd, you’ve used cgroups. They are the invisible engine of the container revolution and the unsung hero of modern server stability. They are the fundamental kernel mechanism that makes resource management—limiting, accounting for, and isolating hardware resources—possible.
This is not just another brief overview. This is the ultimate, 10,000-word guide to Linux cgroups. We will start from the absolute basics (“what is a cgroup?”) and go all the way to the deep, technical internals of cgroup v1 and cgroup v2, how systemd uses them to manage your entire system, and how Kubernetes (K8s) and Docker translate a simple YAML file into kernel-level resource limits.
By the end of this article, you will not just know about cgroups; you will understand them.
Cgroups Resource Partitioning
The “Why?” – What Problem Do Cgroups Solve?
Before cgroups, resource management on Linux was primitive. You had nice and renice to adjust process priority—a suggestion, not a guarantee. You had ulimit to set resource limits, but it was per-process, not for a group of related processes, and it was notoriously inflexible.
This wasn’t good enough for the world Google was building in the mid-2000s. They needed two key things:
- Multi-Tenancy: To run many different applications (e.g., web search, maps, mail) on the same physical hardware without them interfering with each other. If the “maps” application had a memory leak, it could not be allowed to crash the “search” application.
- Quality of Service (QoS): To guarantee that high-priority, customer-facing applications always get the CPU and I/O they need, while low-priority, background batch jobs only use whatever resources are “left over.”
To solve this, Google engineers (Paul Menage and Rohit Seth) developed “process containers” in 2006, which were later renamed Control Groups and merged into the Linux kernel in 2008 (kernel 2.6.24).
Cgroups solve this by providing a mechanism to:
- Limit Resources: Set hard upper limits. For example, “This group of processes can never use more than 2GB of RAM or 50% of one CPU core.”
- Prioritize Resources: When there is resource contention (e.g., two applications both want 100% of the CPU), cgroups decide who gets what share. For example, “Application A gets 70% of the CPU time, and Application B gets 30%.”
- Account for Resources: Measure exactly how much CPU time, memory, and I/O a group of processes has used. This is critical for billing and monitoring.
- Isolate and Control: Do more than just limit resources. Cgroups can freeze a group of processes, restrict them to specific CPU cores, or deny them access to certain hardware devices.
This framework is the foundation for all containerization. A Docker container is, at its core, a collection of Linux kernel features. Cgroups provide the resource limits (the “walls” of the container’s room), while Namespaces provide the isolation (making the container think it’s in its own, separate room).
The Core Concepts of Cgroups
Before we dive into the v1 vs. v2 war, we must understand the terminology.
- Cgroup (Control Group): A collection of tasks (processes) that are treated as a single unit for resource management.
- Task: A process or a thread in the Linux kernel.
- Controller (or Subsystem): A specific piece of the kernel that manages one specific resource. There is a `memory` controller, a `cpu` controller, a `pids` controller, and so on.
- Hierarchy: A tree-like structure of cgroups. A cgroup can have child cgroups, which inherit and can further restrict the limits of their parent. This is the most important concept for understanding the difference between v1 and v2.
- cgroupfs: A pseudo-filesystem (like `/proc` or `/sys`) that provides the user-space interface to cgroups. To manage cgroups, you simply create directories and read/write to files in this filesystem, which is almost always mounted at `/sys/fs/cgroup`.
Cgroup Hierarchy Tree
To put it all together: You create a cgroup (by creating a directory in cgroupfs). You enable specific controllers for it (like memory and cpu). You configure those controllers (by echo-ing values to files like memory.limit_in_bytes). Finally, you add your application’s processes (by writing their PIDs to a file named tasks or cgroup.procs).
From that moment on, all those processes (and any they spawn) are bound by the limits you defined.
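Because the interface is just directories and files, the whole lifecycle above can be sketched in a few lines. A minimal Python illustration (the `setup_cgroup` helper is hypothetical; here it runs against a throwaway temp directory standing in for cgroupfs, since doing this against the real `/sys/fs/cgroup` requires root and a cgroup v2 mount):

```python
import os
import tempfile

def setup_cgroup(cgroup_root, name, cpu_max, pid):
    """Sketch of the cgroup file API: mkdir a group, write a limit
    file, then attach a PID by writing it to cgroup.procs."""
    group = os.path.join(cgroup_root, name)
    os.mkdir(group)  # creating the directory creates the cgroup
    with open(os.path.join(group, "cpu.max"), "w") as f:
        f.write(cpu_max)  # e.g. "50000 100000" = 50% of one core
    with open(os.path.join(group, "cgroup.procs"), "w") as f:
        f.write(str(pid))  # on real cgroupfs the kernel moves the process
    return group

# Demo against a temp dir (a stand-in, so nothing is actually limited):
root = tempfile.mkdtemp()
group = setup_cgroup(root, "my-group", "50000 100000", os.getpid())
print(open(os.path.join(group, "cpu.max")).read())  # -> 50000 100000
```

On a real system you would point `cgroup_root` at `/sys/fs/cgroup` and run as root; the file names (`cpu.max`, `cgroup.procs`) are the genuine v2 interface covered later in this article.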
The Exhaustive Deep Dive: Cgroups v1
Cgroup v1 (also called the “legacy” or “hybrid” model) was the standard for years. It’s powerful, flexible, and deeply confusing. Its defining feature—and its greatest flaw—is its use of multiple, independent hierarchies.
The v1 Problem: Multiple Hierarchies
In v1, you can mount different cgroup trees for different sets of controllers.
For example, you could have:
- One hierarchy for the `cpu` and `cpuacct` controllers, mounted at `/sys/fs/cgroup/cpu,cpuacct`.
- A completely separate hierarchy for the `memory` controller, mounted at `/sys/fs/cgroup/memory`.
This means a process P1 could be in the /production cgroup in the cpu hierarchy, but in the /testing cgroup in the memory hierarchy. This flexibility was intended to allow complex, independent resource allocation, but in practice, it was a nightmare. It made it impossible to have a single, unified view of “what limits apply to this application?”
This is why systemd (which we’ll cover later) introduced a “hybrid” model where it tried to co-mount as many controllers as possible into one tree, but it was still a workaround.
The v1 Controllers: An Exhaustive Guide
Let’s explore the most important v1 controllers and their key files. You find these in /sys/fs/cgroup/<controller_name>.
1. cpu Controller
Manages CPU time using a “Completely Fair Scheduler” (CFS).
- `cpu.shares`:
  - What it is: A relative weight for CPU time. The default is `1024`.
  - How it works: This value only matters during CPU contention. If `Cgroup-A` has `1024` shares and `Cgroup-B` has `512` shares, and both are trying to use 100% of the CPU, `Cgroup-A` will get roughly 2/3 (1024 / (1024+512)) of the CPU time, and `Cgroup-B` will get 1/3.
  - Common Pitfall: If the CPU is not contended (i.e., total usage is < 100%), this setting does nothing. A cgroup with `512` shares is free to use 100% of the CPU if no one else is using it.
- `cpu.cfs_period_us` and `cpu.cfs_quota_us`:
  - What it is: This is the hard cap on CPU usage (what most people think of as a “CPU limit”).
  - How it works: `cfs_period_us` is the total time period, in microseconds. `cfs_quota_us` is how much “run time” the cgroup gets within that period.
  - Example: To limit a cgroup to 50% of one CPU core:
    - `cpu.cfs_period_us`: `100000` (100ms)
    - `cpu.cfs_quota_us`: `50000` (50ms)
  - Example: To limit a cgroup to 2.5 CPU cores:
    - `cpu.cfs_period_us`: `100000` (100ms)
    - `cpu.cfs_quota_us`: `250000` (250ms)
  - If `cfs_quota_us` is set to `-1`, there is no limit.
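The quota/period arithmetic is simple enough to capture in two helper functions (hypothetical names, shown as a sketch of the math rather than any real API):

```python
def cfs_quota(cores, period_us=100_000):
    """Translate a desired number of CPU cores into a cfs_quota_us
    value for the given cfs_period_us. None means 'no limit' (-1)."""
    if cores is None:
        return -1
    return int(cores * period_us)

def share_fraction(my_shares, all_shares):
    """Fraction of CPU a cgroup receives *under contention*,
    based on cpu.shares weights."""
    return my_shares / sum(all_shares)

print(cfs_quota(0.5))   # -> 50000  (50% of one core)
print(cfs_quota(2.5))   # -> 250000 (2.5 cores)
print(round(share_fraction(1024, [1024, 512]), 3))  # -> 0.667
```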
2. cpuacct Controller
This controller doesn’t enforce limits; it just reports on CPU usage. It’s almost always mounted in the same hierarchy as the cpu controller.
- `cpuacct.usage`: Total CPU time (in nanoseconds) consumed by all tasks in this cgroup. This is a simple counter you can read.
- `cpuacct.stat`: Breaks down CPU time into `user` (time spent in user-space code) and `system` (time spent in kernel-space code).
3. cpuset Controller
Critically important for performance and NUMA (Non-Uniform Memory Access) systems. This controller pins tasks to specific CPU cores and memory nodes.
- `cpuset.cpus`:
  - What it is: The list of CPU cores the cgroup’s tasks are allowed to run on.
  - Example: `echo "0-3,7" > cpuset.cpus` (Allows tasks to run on cores 0, 1, 2, 3, and 7).
- `cpuset.mems`:
  - What it is: The list of memory nodes the cgroup is allowed to allocate memory from.
  - How it works: On multi-socket servers (NUMA), each CPU has “local” memory that is faster to access. This setting ensures a process running on CPU `0` allocates from its local memory node (e.g., node `0`), dramatically improving performance.
2-socket NUMA System
4. memory Controller
This is arguably the most complex and important controller. It manages memory usage and, crucially, is responsible for the OOM (Out-of-Memory) Killer.
- `memory.limit_in_bytes`:
  - What it is: The absolute hard limit on memory usage (RSS + page cache). If a process tries to allocate memory that would push the cgroup over this limit, the OOM Killer is invoked.
  - Example: `echo "1073741824" > memory.limit_in_bytes` (Sets a 1GB limit). You can also use suffixes like `1G` or `500M`.
- `memory.soft_limit_in_bytes`:
  - What it is: A “best-effort” limit. The kernel will try to reclaim memory from cgroups that are over their soft limit, but it’s not a guarantee.
  - How it works: This is useful for prioritization. A cgroup with a low soft limit is a good “first-dibs” candidate for memory reclamation when the system is under pressure, before it hits its hard limit.
- `memory.oom_control`:
  - What it is: Controls the OOM Killer for this cgroup.
  - Key file: `memory.oom_control` (a file, not a directory).
  - Settings:
    - `oom_kill_disable 0` (default): The OOM Killer is enabled. When `memory.limit_in_bytes` is hit, the kernel will kill a process inside this cgroup to free memory.
    - `oom_kill_disable 1`: The OOM Killer is disabled. This is dangerous. If the cgroup hits its limit, the kernel won’t kill a process; instead, the process that triggered the limit will just pause indefinitely, waiting for memory to be freed.
- `memory.usage_in_bytes`: Read-only file showing the cgroup’s current total memory usage.
- `memory.stat`: A detailed file showing a breakdown of memory: `cache` (page cache), `rss` (Resident Set Size, i.e., actual RAM), `swap`, etc.
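`memory.stat` uses the flat `key value` format shared by most cgroup stat files, so it is trivial to parse. A small sketch (the `parse_memory_stat` helper is mine, not part of any library):

```python
def parse_memory_stat(text):
    """Parse the flat 'key value' format of memory.stat into a dict."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    return stats

# A shortened sample in the same shape as a real memory.stat:
sample = "cache 1048576\nrss 4194304\nswap 0"
stats = parse_memory_stat(sample)
print(stats["rss"])  # -> 4194304
```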
5. blkio Controller
Manages Block I/O (i.e., disk reads and writes). This controller is complex and known for being somewhat imprecise.
- `blkio.weight`:
  - What it is: A relative weight for I/O, similar to `cpu.shares`. The value is between 100 and 1000 (default 500).
  - How it works: Only applies during I/O contention. A cgroup with a weight of `1000` will get twice the I/O “time” of a cgroup with a weight of `500`.
- `blkio.throttle.read_bps_device` and `blkio.throttle.write_bps_device`:
  - What it is: The hard cap on I/O bandwidth (bytes per second).
  - How it works: You must specify the device’s major:minor number (which you can find with `ls -l /dev/sda`).
  - Example: To limit reads from `/dev/sda` (e.g., `8:0`) to 10MB/s:
    - `echo "8:0 10485760" > blkio.throttle.read_bps_device`
- `blkio.throttle.read_iops_device` and `blkio.throttle.write_iops_device`:
  - What it is: A hard cap on I/O Operations Per Second (IOPS). This is more useful for fast SSDs, where bandwidth isn’t the bottleneck but the number of operations is.
6. devices Controller
A security controller that acts as a whitelist/blacklist for device access.
- `devices.allow` and `devices.deny`:
  - What it is: Whitelist or blacklist access to a device.
  - How it works: You write a string specifying the device type (`c` for character, `b` for block), the `major:minor` number, and the permissions (`r` for read, `w` for write, `m` for `mknod`).
  - Example: To deny all write access to `/dev/sda` (`8:0`):
    - `echo "b 8:0 w" > devices.deny`
- By default, a cgroup inherits its parent’s `devices.list`. An empty `devices.list` means it has access to nothing until you `allow` it. This is how containers prevent access to host hardware.
7. pids Controller
A very simple but critical controller for preventing “fork bombs.”
- `pids.max`:
  - What it is: The maximum number of tasks (PIDs) that can exist in the cgroup at one time.
  - Example: `echo "100" > pids.max` (Limits this cgroup to 100 total processes/threads). If it tries to `fork()` a 101st process, the call will fail.
8. freezer Controller
A “utility” controller that can suspend and resume all tasks in a cgroup.
- `freezer.state`:
  - What it is: A file to control the state.
  - `echo "FROZEN" > freezer.state`: Suspends (pauses) all tasks in the cgroup.
  - `echo "THAWED" > freezer.state`: Resumes all tasks.
  - Use Case: Live-snapshotting a running application, or container migration.
Cgroups v2: The Unified, Simplified Future
Cgroup v1 was powerful, but it was a mess. The multiple hierarchies, the inconsistent file names (cpu.shares vs. blkio.weight), and the strange rules (like cpuacct doing nothing but reporting) made it difficult to use.
The kernel developers went back to the drawing board and created Cgroup v2. Its primary goal was to fix everything wrong with v1. It was declared stable in Linux kernel 4.5 (2016), and as of 2024+, it is the default and recommended version, used by all modern systemd and container runtimes.
The Core Principle of v2: The Unified Hierarchy
This is the single most important change. In v2, there is only one cgroup tree (hierarchy). All controllers are mounted in one place: /sys/fs/cgroup (or /sys/fs/cgroup/unified).
- No More Multiple Trees: A process is in exactly one cgroup, period. This cgroup’s path in the tree defines all of its resource limits.
- No More “Orphan” Controllers: You can’t have a `cpu` tree and a separate `memory` tree. `cpu` and `memory` are both controllers that can be enabled on the same cgroup.
Cgroups v2
The New v2 Interface: How It Works
The file-based API was completely redesigned for consistency.
1. Controller Activation: cgroup.subtree_control
This is the new “magic” file and the most common source of confusion for v1 users.
- How it worked in v1: You created a directory `/sys/fs/cgroup/cpu/my-group`, and it was automatically a `cpu` cgroup.
- How it works in v2:
  - You create a directory `/sys/fs/cgroup/my-group`. By default, this directory can’t have resource limits set on it (it’s just a “parent”).
  - To “delegate” controllers to child cgroups, the parent cgroup must explicitly enable them.
  - You do this by writing to the parent’s `cgroup.subtree_control` file.
- Example: To allow `my-group` to have children that control `cpu` and `memory`:
  - `echo "+cpu +memory" > /sys/fs/cgroup/cgroup.subtree_control` (this enables them for the root cgroup’s children)
  - Now you can create the group: `mkdir /sys/fs/cgroup/my-group`
  - `echo "+cpu +memory" > /sys/fs/cgroup/my-group/cgroup.subtree_control` (this enables them for `my-group`’s children)
  - Now you can create the leaf: `mkdir /sys/fs/cgroup/my-group/my-app`
  - …and finally, you can set limits in `/sys/fs/cgroup/my-group/my-app`.
This “delegation” model ensures the hierarchy is clean and intentional. A cgroup can either (a) have processes in it, or (b) have child cgroups with controllers enabled, but not both (with some exceptions at the root).
2. Adding Processes: cgroup.procs
- The `tasks` file is gone.
- To add a process to a cgroup, you write its PID to `cgroup.procs`.
- The kernel automatically handles moving all threads of that process.
The v2 Controllers: Cleaned Up and More Powerful
The v2 controllers are renamed and refactored for consistency.
1. cpu Controller (v2)
Combines the functionality of v1’s cpu and cpuacct.
- `cpu.weight`:
  - Replaces: `cpu.shares`.
  - What it is: A relative weight, but the range is now 1–10000 (default 100). The same “contention-only” logic applies.
- `cpu.max`:
  - Replaces: `cpu.cfs_quota_us` and `cpu.cfs_period_us`.
  - What it is: A much simpler format: `$QUOTA $PERIOD`.
  - Example: To limit to 50% of one core:
    - `echo "50000 100000" > cpu.max`
  - Example: To limit to 2.5 cores:
    - `echo "250000 100000" > cpu.max`
- `cpu.stat`: Read-only file that replaces `cpuacct`. Shows `usage_usec` (total CPU time), `user_usec`, `system_usec`, and (critically) `throttled_usec` (how long the cgroup was prevented from running because it hit its `cpu.max` limit).
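One wrinkle when migrating from v1: `cpu.shares` values don’t map one-for-one onto `cpu.weight`. OCI runtimes such as runc apply a linear remapping of the v1 range (2–262144) onto the v2 range (1–10000), sketched below. Note that the v1 default of `1024` lands nowhere near the v2 default of `100`:

```python
def shares_to_weight(shares):
    """Linearly map a v1 cpu.shares value (2..262144) onto the
    v2 cpu.weight scale (1..10000), as OCI runtimes like runc do."""
    return 1 + ((shares - 2) * 9999) // 262142

print(shares_to_weight(2))       # -> 1     (v1 minimum -> v2 minimum)
print(shares_to_weight(262144))  # -> 10000 (v1 maximum -> v2 maximum)
print(shares_to_weight(1024))    # -> 39    (v1 default; NOT v2's default of 100)
```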
2. memory Controller (v2)
Vastly improved with better controls for soft limits and memory pressure.
- `memory.max`:
  - Replaces: `memory.limit_in_bytes`.
  - What it is: The hard limit. Hitting this invokes the OOM Killer.
  - Example: `echo "1073741824" > memory.max` (1GB limit).
- `memory.high`:
  - What it is: The main “soft limit” and throttling mechanism.
  - How it works: When a cgroup goes above `memory.high`, the kernel will heavily throttle its tasks (making them run slower) and aggressively try to reclaim memory from it. This is a much stronger “suggestion” than v1’s `soft_limit`. The cgroup is allowed to exceed this line, but it will pay a heavy performance penalty.
- `memory.low`:
  - What it is: A “best-effort” guarantee. The kernel will try not to reclaim memory from this cgroup while its usage is below this `low` watermark.
- `memory.min`:
  - What it is: A stronger guarantee. The kernel will work very hard to protect this amount of memory for the cgroup.
- The New “Protection” Model:
  - `memory.min` / `memory.low`: Protect memory below this line.
  - `memory.high` / `memory.max`: Throttle or kill memory above this line.
  - This gives you a “safe zone” (`low` to `high`) where your app can operate, with clear protection and throttling boundaries.
- `memory.events`: A file that reports key events, including `oom` (how many times the cgroup hit an OOM condition) and `high` (how many times `memory.high` was breached).
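The four thresholds can be read as zones on a number line. A toy classifier (my own illustration of the model, not a kernel API), assuming `min <= low <= high <= max`:

```python
def memory_zone(usage, min_, low, high, max_):
    """Classify a cgroup's memory usage against the v2 protection
    thresholds (min <= low <= high <= max)."""
    if usage < min_:
        return "protected"       # kernel works very hard not to reclaim
    if usage < low:
        return "soft-protected"  # best-effort protection from reclaim
    if usage <= high:
        return "safe-zone"       # normal operation
    if usage <= max_:
        return "throttled"       # over memory.high: reclaim + slowdown
    return "oom"                 # over memory.max: OOM Killer territory

print(memory_zone(600, 100, 200, 800, 1000))  # -> safe-zone
print(memory_zone(900, 100, 200, 800, 1000))  # -> throttled
```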
3. io Controller (v2)
The new and improved replacement for blkio. It’s “weight-based” and “cost-based,” meaning it’s much smarter about different device types (e.g., it knows an I/O operation on a spinning disk “costs” more than one on an NVMe SSD).
- `io.weight`:
  - Replaces: `blkio.weight`. Relative I/O priority (1–10000).
- `io.max`:
  - Replaces: `blkio.throttle.*`.
  - What it is: A much simpler interface for hard limits.
  - Format: `MAJ:MIN TYPE=VALUE`
  - Example: To limit reads from `/dev/sda` (`8:0`) to 10MB/s and writes to 500 IOPS:
    - `echo "8:0 rbps=10485760 wiops=500" > io.max`
    - (`rbps` = read bytes per second, `wiops` = write IOPS).
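Because `io.max` takes a single line of `key=value` pairs, building it programmatically is trivial. A small sketch (the helper name and validation are mine; the four keys are the ones the v2 `io` controller accepts):

```python
def io_max_line(major, minor, **limits):
    """Build an io.max line: 'MAJ:MIN key=value ...'.
    Accepted keys: rbps, wbps, riops, wiops."""
    allowed = {"rbps", "wbps", "riops", "wiops"}
    unknown = set(limits) - allowed
    if unknown:
        raise ValueError(f"unknown io.max keys: {unknown}")
    pairs = " ".join(f"{key}={value}" for key, value in limits.items())
    return f"{major}:{minor} {pairs}"

print(io_max_line(8, 0, rbps=10485760, wiops=500))
# -> 8:0 rbps=10485760 wiops=500
```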
4. pids Controller (v2)
Identical in function to v1, just new file names.
- `pids.max`: The max number of PIDs.
- `pids.current`: Read-only file showing the current number of PIDs.
The v2 Killer Feature: Pressure Stall Information (PSI)
This is one of the most powerful monitoring features ever added to the Linux kernel. It’s not a controller, but a monitoring interface that’s available by default when cgroup v2 is enabled.
The Problem: Your application is slow. Why? Is it…
- …throttled because it hit its `cpu.max` limit? (CPU pressure)
- …waiting for the kernel to reclaim memory? (Memory pressure)
- …waiting for a disk read/write to complete? (I/O pressure)
Before PSI, this was incredibly hard to diagnose.
The Solution: PSI adds files to every cgroup: cpu.pressure, memory.pressure, and io.pressure.
When you cat one of these files, you see:
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
- `some`: The percentage of time in the last 10, 60, or 300 seconds that at least one task in the cgroup was stalled, waiting for this resource.
- `full`: The percentage of time that all tasks in the cgroup were stalled simultaneously, waiting for this resource.
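The PSI format is stable and line-oriented, so monitoring agents can parse it directly. A sketch (the `parse_psi` function is hypothetical, but the input format matches the files above):

```python
def parse_psi(text):
    """Parse a PSI file (cpu.pressure / memory.pressure / io.pressure)
    into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        kind, *fields = line.split()
        metrics = {}
        for field in fields:
            key, _, value = field.partition("=")
            # 'total' is a cumulative integer (microseconds); avgN are floats
            metrics[key] = int(value) if key == "total" else float(value)
        result[kind] = metrics
    return result

sample = ("some avg10=1.25 avg60=0.40 avg300=0.10 total=123456\n"
          "full avg10=0.00 avg60=0.00 avg300=0.00 total=0")
print(parse_psi(sample)["some"]["avg10"])  # -> 1.25
```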
Why this is a game-changer:
- You can now definitively know why your app is slow.
- `cpu.pressure` `some` is high? Your cgroup is CPU-throttled. You need to increase its `cpu.max` or `cpu.weight`.
- `memory.pressure` `full` is high? Your cgroup is being heavily throttled by `memory.high` or is actively swapping. It’s starved for memory.
- `io.pressure` `some` is high? Your tasks are waiting for disk I/O.
- This enables smart autoscaling. A Kubernetes autoscaler could see that `cpu.pressure` is high and decide to add another pod before CPU usage even hits 100%.
High Memory Pressure
The Practical Guide: How to Use Cgroups
Enough theory. Let’s get our hands dirty. There are three main ways to interact with cgroups, from worst to best.
Method 1: Manual Interaction (The “For-Learning” Way)
You should do this to understand how the plumbing works, but never do this on a production server.
Prerequisite: Your system must be running cgroup v2. Check with mount | grep cgroup. You should see cgroup2 on /sys/fs/cgroup type cgroup2.
Example 1: Limit CPU with v2
Let’s limit a simple while true loop.
- Become root: `sudo -s`
- Navigate to the root cgroup: `cd /sys/fs/cgroup`
- Enable controllers for child cgroups: We must “delegate” the `cpu` controller from the root: `echo "+cpu" > cgroup.subtree_control`
- Create our new cgroup: `mkdir my-cpu-group`
- Set the CPU limit: Let’s limit this group to 20% of one CPU core: `echo "20000 100000" > my-cpu-group/cpu.max`
- Run a “CPU-hungry” process: Open another terminal and start a spin loop: `while true; do :; done`
- Find its PID: The loop runs inside that terminal’s shell (`while` is a shell builtin, not a separate process), so grab the shell’s PID with `echo $$` in that terminal. Let’s say it returns 12345.
- Move the process into the cgroup: `echo "12345" > my-cpu-group/cgroup.procs`
- Observe: Open `htop`. You will see the looping shell, which was previously using 100% of a core, is now capped at 20%.
- Clean up: Kill the loop (`kill 12345`), then remove the cgroup (it must be empty of processes first): `rmdir my-cpu-group`
Example 2: Limit Memory with v2
Let’s limit the stress-ng tool.
- Install `stress-ng`: `apt install stress-ng` or `dnf install stress-ng`
- Become root: `sudo -s`
- Navigate: `cd /sys/fs/cgroup`
- Enable controllers: This time we need `memory`. We already enabled `+cpu`; let’s add `+memory`: `echo "+memory" > cgroup.subtree_control`
- Create our memory cgroup: `mkdir my-mem-group`
- Set the hard memory limit: Let’s set it to 100MB: `echo "100M" > my-mem-group/memory.max`
- Run a “memory-hungry” process: We’ll use `stress-ng` to try to allocate 500MB, which is over our limit. Run it in the background: `stress-ng --vm 1 --vm-bytes 500M &`
- Find its PID: `pidof stress-ng` (let’s say it returns 54321)
- Move the process before it allocates (this is a bit of a race, but `stress-ng` is usually slow enough): `echo "54321" > my-mem-group/cgroup.procs`
- Observe:
  - Check `dmesg -w`. You will see messages from the kernel: “OOM-killer: task … (stress-ng) … killed”.
  - Check the cgroup’s events: `cat my-mem-group/memory.events`. You will see `oom 1` and `oom_kill 1`.
- Clean up: `rmdir my-mem-group`
This manual method is messy. Processes die, cgroups need to be cleaned up… there’s a better way.
Method 2: systemd (The “Modern Linux” Way)
On almost any modern Linux distro, systemd is the king of cgroups. It manages the entire cgroup tree (using v2 by default) and abstracts all this messiness away.
You should never manually mkdir in /sys/fs/cgroup on a systemd system. You should let systemd do it for you.
systemd divides everything into three types of cgroup “units”:
- Slices: Groups that contain other units — the “branches” of the tree. Examples: `system.slice` (for system services), `user.slice` (for all user login sessions), `machine.slice` (for VMs and containers).
- Services: Cgroups for daemons, like `nginx.service` or `sshd.service`.
- Scopes: Cgroups for transient, user-initiated processes, like a user’s login shell.
Tool 1: systemd-cgls
This command shows you the entire cgroup tree as systemd sees it. It’s the best way to visualize what’s running on your system.
Output of systemd-cgls
Tool 2: systemd-cgtop
This is like htop but for cgroups. It shows you the top cgroups ranked by their CPU, Memory, and I/O usage.
Tool 3: Setting limits in .service files
This is the “persistent” way to limit a service. If you’re an admin, this is how you control nginx, postgres, etc.
Edit the service’s unit file (e.g., systemctl edit nginx.service). In the [Service] section, you add resource-control options.
[Service]
# v1-style (shares)
CPUShares=512
# v2-style (weight)
CPUWeight=100
# v2-style (max)
CPUQuota=50%
# Memory limits (v2)
MemoryMin=100M
MemoryLow=500M
MemoryHigh=1G
MemoryMax=2G
# pids limit
TasksMax=500
# blkio limit
IOReadBandwidthMax=/dev/sda 10M
IOWriteBandwidthMax=/dev/sda 5M
After adding this, run systemctl daemon-reload and systemctl restart nginx.service. systemd will handle all the echoing to all the correct cgroup files for you.
Tool 4: systemd-run (The “On-the-Fly” Way)
This is the best way to run a one-off command with specific limits. It’s the systemd equivalent of our manual mkdir example, but it’s clean and safe.
Let’s re-do our CPU limit example:
# Run a loop in a new, transient scope, with a 20% CPU quota
systemd-run --scope -p CPUQuota=20% bash -c 'while true; do :; done'
That’s it! One command. systemd creates a transient scope unit with an auto-generated name, sets the `cpu.max` property, moves the process into it, and automatically cleans up the cgroup when the process finishes.
Let’s re-do our memory limit example:
# Run stress-ng in a new service, limited to 100M of RAM
systemd-run --unit=my-mem-test -p "MemoryMax=100M" stress-ng --vm 1 --vm-bytes 500M
You can then check its status with systemctl status my-mem-test. You’ll see it failed, and journalctl -u my-mem-test will show the OOM kill. It’s all perfectly managed and logged.
Method 3: Container Runtimes (Docker & Kubernetes)
This is the highest level of abstraction. Docker and Kubernetes are, in many ways, just fancy systemd-run wrappers.
When you write docker run --memory=1g --cpus=0.5 ..., Docker is simply:
- Creating a new cgroup.
- `echo "1G" > .../memory.max`
- `echo "50000 100000" > .../cpu.max`
- Putting the container’s process into that cgroup.
The systemd vs. cgroupfs Driver
There’s one “gotcha” in the container world: the --cgroup-driver flag for Docker and the Kubelet.
- `cgroupfs` driver (old default): Docker manages its own cgroups, manually `mkdir`-ing and `echo`-ing into `/sys/fs/cgroup`.
- `systemd` driver (new default): Docker asks systemd (via its API) to create and manage a cgroup (a `.scope` unit) for the container.
Why does this matter? If Docker uses the cgroupfs driver, systemd (which thinks it owns the whole tree) gets confused. You end up with two “managers” of cgroups, which can lead to instability.
The rule is: ALWAYS use the systemd driver for both Docker and the Kubelet. This ensures systemd is the single source of truth for the cgroup tree, and tools like systemd-cgls will show you your containers.
systemd vs cgroupfs
Cgroups and the Container Revolution
This is where all the pieces come together. Cgroups are the enforcement mechanism for container resource limits.
How Docker Uses Cgroups
Let’s trace a docker run command:
docker run -d \
--name my-nginx \
--cpus="1.5" \
--memory="512m" \
--pids-limit=100 \
--blkio-weight=300 \
nginx
Assuming Docker is using the systemd cgroup driver, here’s what happens under the hood:
- Docker calls systemd to create a new scope unit, something like `docker-<container-id>.scope`, under `system.slice`.
- systemd creates the cgroup: `/sys/fs/cgroup/system.slice/docker-<container-id>.scope`
- systemd (at Docker’s request) writes the limits:
  - `echo "150000 100000" > .../cpu.max` (1.5 cores)
  - `echo "536870912" > .../memory.max` (512MB)
  - `echo "100" > .../pids.max`
  - `--blkio-weight=300` is remapped onto the v2 `io.weight` scale — v1’s 100–1000 range doesn’t line up with v2’s 1–10000 range, so the value written is not literally `300`.
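The translation from `docker run` flags to cgroup file writes can be sketched as a pure function (a hypothetical helper of mine; it assumes the conventional 100ms CFS period and handles only `k`/`m`/`g` memory suffixes):

```python
def docker_to_cgroup_v2(cpus=None, memory=None, pids_limit=None):
    """Sketch of how docker-run flags become cgroup v2 file writes.
    memory accepts docker-style strings like '512m'."""
    writes = {}
    if cpus is not None:
        writes["cpu.max"] = f"{int(cpus * 100_000)} 100000"
    if memory is not None:
        suffixes = {"k": 1024, "m": 1024**2, "g": 1024**3}
        if memory[-1].lower() in suffixes:
            memory = int(memory[:-1]) * suffixes[memory[-1].lower()]
        writes["memory.max"] = str(memory)
    if pids_limit is not None:
        writes["pids.max"] = str(pids_limit)
    return writes

print(docker_to_cgroup_v2(cpus=1.5, memory="512m", pids_limit=100))
# -> {'cpu.max': '150000 100000', 'memory.max': '536870912', 'pids.max': '100'}
```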
The container is now running, fully constrained by the kernel.
How Kubernetes (Kubelet) Uses Cgroups
Kubernetes takes this one step further by creating a hierarchy of cgroups to manage Quality of Service (QoS).
When you define a Pod, Kubernetes assigns it one of three QoS Classes based on your resources specification:
Guaranteed:
- How: You set `limits` and `requests`, and `limits` == `requests` for all resources (CPU and memory).
- What the Kubelet does: This is the highest priority. The Kubelet sets `cpu.max` to your CPU limit and `memory.max` to your memory limit. Your pod is guaranteed this amount, but it can never use more.
Burstable:
- How: You set `requests` and `limits`, and `requests` < `limits`. Or you only set `requests`.
- What the Kubelet does: This is the medium priority.
  - `cpu.weight` (v2) or `cpu.shares` (v1) is set based on your `requests` (e.g., a `100m` CPU request becomes roughly 102 v1 shares).
  - `cpu.max` is set to your `limit`.
  - `memory.max` is set to your `limit`.
  - `memory.low` (v2) or `memory.soft_limit_in_bytes` (v1) is set to your `request`.
  - Result: Your pod is guaranteed its `requests`. If the node has free resources, it can “burst” up to its `limits`. If the node is under pressure, it’s a more likely OOM-kill candidate than `Guaranteed` pods.
BestEffort:
- How: You set no `requests` or `limits`.
- What the Kubelet does: This is the lowest priority. It sets a minimal `cpu.weight` and no memory guarantees. These are the first pods to be OOM-killed if the node runs out of memory.
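The QoS rules above boil down to a small decision function. A simplified sketch for a single-container pod (the function name is mine; the real kubelet logic also handles multi-container pods and extended resources):

```python
def qos_class(requests, limits):
    """Derive the Kubernetes QoS class from a container's resources,
    following the rules above (simplified, single-container pod)."""
    if not requests and not limits:
        return "BestEffort"
    resources = {"cpu", "memory"}
    # Guaranteed: limits set for cpu+memory, and requests (defaulting
    # to limits when omitted) equal limits for both.
    if (set(limits) >= resources and
            all(requests.get(r, limits[r]) == limits[r] for r in resources)):
        return "Guaranteed"
    return "Burstable"

print(qos_class({}, {}))                             # -> BestEffort
print(qos_class({"cpu": "1", "memory": "512Mi"},
                {"cpu": "1", "memory": "512Mi"}))    # -> Guaranteed
print(qos_class({"cpu": "200m", "memory": "256Mi"},
                {"cpu": "1", "memory": "512Mi"}))    # -> Burstable
```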
The Kubelet Cgroup Tree:
The Kubelet, via systemd, creates its own slice called kubepods.slice and builds a tree inside it:
/sys/fs/cgroup/kubepods.slice/
├── kubepods-besteffort.slice/
│ ├── pod<POD_UID_1>/
│ │ ├── container<ID_A>
│ │ └── container<ID_B>
├── kubepods-burstable.slice/
│ ├── pod<POD_UID_2>/
│ │ └── container<ID_C>
└── kubepods-guaranteed.slice/
├── pod<POD_UID_3>/
│ └── container<ID_D>
This brilliant structure allows Kubernetes to manage resources at a “whole QoS class” level. For example, if the node is under pressure, the kernel will first try to reclaim memory from all pods in kubepods-besteffort.slice.
This is the magic: Your simple YAML file…
resources:
  requests:
    memory: "256Mi"
    cpu: "200m"
  limits:
    memory: "512Mi"
    cpu: "1"
…is translated by the Kubelet into a series of echo commands to files in a systemd-managed cgroup (/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/pod...), setting cpu.weight, cpu.max, memory.low, and memory.max, all using the kernel features we just explored.
Kubernetes YAML Flowchart
Cgroups as the Bedrock of the Cloud
We’ve been on a deep, 10,000-word journey from a simple kernel patch in 2008 to the engine powering the entire global cloud infrastructure.
Let’s recap the key takeaways:
- Cgroups limit, account for, and isolate resource usage (CPU, memory, I/O, etc.).
- Cgroup v1 was powerful but messy, with multiple hierarchies that made a unified view impossible.
- Cgroup v2 is the modern standard, with a single unified hierarchy and a clean, consistent interface.
- v2 “Magic Files”: `cgroup.subtree_control` (to delegate controllers) and `cgroup.procs` (to add PIDs).
- v2 Killer Features: The `memory.high` throttle point for better memory-pressure handling, and Pressure Stall Information (PSI) for unparalleled “why is it slow?” diagnostics.
- Don’t Touch! You should almost never manage cgroups manually by `mkdir`-ing in `/sys/fs/cgroup`.
- Use `systemd`: systemd is the “cgroup manager” for a modern Linux system. Use `.service` files or `systemd-run` to set limits.
- Use the `systemd` Driver: Ensure your container runtimes (Docker, containerd) are configured to use the `systemd` cgroup driver for a stable, unified tree.
- Containers are Cgroups: Docker and Kubernetes are high-level cgroup managers. Your `resources:` limits in a Pod YAML are just a friendly face for `echo`-ing values to `cpu.max` and `memory.max`.
Cgroups are a perfect example of the Linux philosophy: “everything is a file.” They transformed a complex problem (resource management) into a simple, file-based API. By mastering this API—or at least, by mastering the tools like systemd and Kubernetes that master it for you—you gain complete control over your system’s performance, stability, and security. They are the invisible walls, the traffic cops, and the accountants of the kernel, and now, they are no longer invisible to you.