Table Of Contents
- Kata Containers Storage
- Terminology
- Kata Containers Storage Architecture
- Containerd and CRI-O Storage Architecture
- Problem Statement For Kata Containers
- Proposed Solutions
Kata Containers I/O performance, workload compatibility and overall stability are tightly bound to its storage backend. Both container images and volumes potentially use host local storage, making it a critical piece of the overall Kata Containers architecture.
Here we will evaluate the current Kata Containers storage architecture, and see how it can be improved.
Terminology
See https://github.com/containers/storage and https://blog.mobyproject.org/where-are-containerds-graph-drivers-145fc9b7255
- A layer is a copy-on-write filesystem
- An image is a reference to a particular layer (its top layer)
- A container is a read-write layer which is a child of an image’s top layer
- Graph drivers: A Docker originated terminology, describing the fact that layers dependencies form a graph. A graph driver is a Docker pluggable container storage backend implementation.
- Snapshotters: a containerd interface that replaces graph drivers. Unlike graph drivers, snapshotters are decoupled from the container lifecycle and only provide mountpoints for the snapshotter callers to manage. For example, snapshotters do not handle the container layers mount and unmount cycles; containerd does.
From https://blog.mobyproject.org/where-are-containerds-graph-drivers-145fc9b7255: In the container world we use two types of filesystems: overlays and snapshotting filesystems. AUFS and OverlayFS are overlay filesystems which have multiple directories with file diffs for each “layer” in an image. Snapshotting filesystems include devicemapper, btrfs, and ZFS which handle file diffs at the block level. Overlays usually work on common filesystem types such as EXT4 and XFS whereas snapshotting filesystems only run on volumes formatted for them.
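The overlay model quoted above can be made concrete with a small sketch: an overlay mount is described entirely by its `lowerdir`/`upperdir`/`workdir` options, where the lower directories are the read-only image layers and the upper directory is the container's read-write layer. The helper and paths below are hypothetical illustrations; the actual `mount(2)` call, which requires privileges, is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// overlayOptions builds the option string for an overlay filesystem mount,
// equivalent to:
//   mount -t overlay overlay -o lowerdir=...,upperdir=...,workdir=... <target>
// lowers are the read-only image layers (topmost first), upper is the
// container's read-write layer, and work is overlayfs's scratch directory.
func overlayOptions(lowers []string, upper, work string) string {
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s",
		strings.Join(lowers, ":"), upper, work)
}

func main() {
	// Two image layers plus a container read-write layer (hypothetical paths).
	opts := overlayOptions(
		[]string{"/var/lib/layers/l2", "/var/lib/layers/l1"},
		"/var/lib/layers/c1", "/var/lib/layers/c1-work")
	fmt.Println(opts)
}
```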
Kata Containers storage architecture
Container images and volumes can be stored on the host through regular, overlay or snapshotting filesystems. Kata Containers then needs to mount them into the guest VM for the container workload to be started or the volume to be accessed by the container workload. If the container running in the guest modifies the container image or the volume, the corresponding changes must appear on the host as well.
Depending on the type of filesystem used to store container images, the architecture to meet these requirements is different:
With overlays (overlayfs) and regular filesystems, Kata Containers uses a paravirtualized filesystem sharing mechanism to share the container images and volumes stored on the host with the guest. This typically is handled through the infamous virtio-9p driver. In the longer term, this may be replaced with the recently announced virtio-fs implementation.
With snapshotting filesystems, container images and volumes are handled at the block level and can be themselves seen as block devices. In this case, Kata Containers can use the virtio-scsi or virtio-block drivers to expose those layers as a block device into the guest.
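One practical consequence of this split is that a snapshotting backend ultimately hands the runtime a block device rather than a plain directory. The sketch below (a hypothetical helper, not kata-runtime code, whose real detection logic also inspects the mount table) shows how a simple stat check can distinguish the two cases when choosing between filesystem sharing and device hot-plug.

```go
package main

import (
	"fmt"
	"os"
)

// isBlockDevice reports whether path is a block device node. A runtime could
// use a check like this to decide between sharing a directory into the guest
// (virtio-9p/virtio-fs) and hot-plugging the backing device
// (virtio-block/virtio-scsi).
func isBlockDevice(path string) (bool, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	mode := fi.Mode()
	// Block devices have the device bit set but not the char-device bit.
	return mode&os.ModeDevice != 0 && mode&os.ModeCharDevice == 0, nil
}

func main() {
	for _, p := range []string{"/dev/sda", "/tmp"} {
		if ok, err := isBlockDevice(p); err == nil {
			fmt.Printf("%s is a block device: %v\n", p, ok)
		}
	}
}
```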
Pros and Cons
When exposing the host container image through virtio-9p, the guest sees what is supposed to be a POSIX compliant filesystem. However, benchmarks show that 9pfs and its virtio-9p driver are (very) slow, not fully POSIX compliant, and unstable. Several Open Source Vendors (OSVs) have stopped maintaining it.
The btrfs and ZFS snapshotting filesystems are not viable options for Kata Containers (for stability, maintenance and legal reasons), so we are left with devicemapper. On one hand, devicemapper gives much more performant and stable results than virtio-9p, but on the other hand it needs its own, dedicated devicemapper formatted partition on the host (as with other snapshotting filesystems). Moreover, devicemapper is also seen by some as an obsolete technology, and e.g. containerd no longer provides a devicemapper snapshotter.
Dependencies
Since Kata Containers is just a container runtime, it does not get to choose which container storage or volume backend it will use. It is up to the layer on top of Kata Containers to make that decision, and kata-runtime will need to detect how container images and volumes are stored on the host in order to select the right hypervisor backend (virtio-9p for overlay filesystems, virtio-scsi or virtio-block for snapshotting ones) when preparing the guest VM.
For the sake of this discussion, we will limit ourselves to containerd and CRI-O as the supported layers calling into kata-runtime.
containerd and CRI-O storage architecture
containerd and CRI-O implement different storage architectures. In order to understand how Kata Containers could leverage them in an optimal way, we need to describe them a little further.
Images and volumes
The containerd and CRI-O runtimes each implement their own storage stack for storing, mounting and managing container images only. Volumes, on the other hand, are simply host paths that will be mounted in the container mount namespace.
CRI-O
CRI-O relies on the external containers storage project to manage container image storage. Image management (pulling, listing, etc) is managed by the containers image package.
The storage package entry point is the store API, that attempts to initialize, create and manage a container image store backed by a storage driver. Adding a storage backend means implementing the driver interfaces:
```go
// Driver is the interface for layered/snapshot file system drivers.
type Driver interface {
	ProtoDriver
	DiffDriver
	LayerIDMapUpdater
}

type ProtoDriver interface {
	// String returns a string representation of this driver.
	String() string
	// CreateReadWrite creates a new, empty filesystem layer that is ready
	// to be used as the storage for a container. Additional options can
	// be passed in opts. parent may be "" and opts may be nil.
	CreateReadWrite(id, parent string, opts *CreateOpts) error
	// Create creates a new, empty, filesystem layer with the
	// specified id and parent and options passed in opts. Parent
	// may be "" and opts may be nil.
	Create(id, parent string, opts *CreateOpts) error
	// CreateFromTemplate creates a new filesystem layer with the specified id
	// and parent, with contents identical to the specified template layer.
	CreateFromTemplate(id, template string, templateIDMappings *idtools.IDMappings, parent string, parentIDMappings *idtools.IDMappings, opts *CreateOpts, readWrite bool) error
	// Remove attempts to remove the filesystem layer with this id.
	Remove(id string) error
	// Get returns the mountpoint for the layered filesystem referred
	// to by this id. You can optionally specify a mountLabel or "".
	// Optionally it gets the mappings used to create the layer.
	// Returns the absolute path to the mounted layered filesystem.
	Get(id string, options MountOpts) (dir string, err error)
	// Put releases the system resources for the specified id,
	// e.g., unmounting layered filesystem.
	Put(id string) error
	// Exists returns whether a filesystem layer with the specified
	// ID exists on this driver.
	Exists(id string) bool
	// Status returns a set of key-value pairs which give low
	// level diagnostic status about this driver.
	Status() [][2]string
	// Metadata returns a set of key-value pairs which give low level
	// information about the image/container the driver is managing.
	Metadata(id string) (map[string]string, error)
	// Cleanup performs necessary tasks to release resources
	// held by the driver, e.g., unmounting all layered filesystems
	// known to this driver.
	Cleanup() error
	// AdditionalImageStores returns additional image stores supported by the driver
	AdditionalImageStores() []string
}

type DiffDriver interface {
	// Diff produces an archive of the changes between the specified
	// layer and its parent layer which may be "".
	Diff(id string, idMappings *idtools.IDMappings, parent string, parentIDMappings *idtools.IDMappings, mountLabel string) (io.ReadCloser, error)
	// Changes produces a list of changes between the specified layer
	// and its parent layer. If parent is "", then all changes will be ADD changes.
	Changes(id string, idMappings *idtools.IDMappings, parent string, parentIDMappings *idtools.IDMappings, mountLabel string) ([]archive.Change, error)
	// ApplyDiff extracts the changeset from the given diff into the
	// layer with the specified id and parent, returning the size of the
	// new layer in bytes.
	// The io.Reader must be an uncompressed stream.
	ApplyDiff(id string, idMappings *idtools.IDMappings, parent string, mountLabel string, diff io.Reader) (size int64, err error)
	// DiffSize calculates the changes between the specified id
	// and its parent and returns the size in bytes of the changes
	// relative to its base filesystem directory.
	DiffSize(id string, idMappings *idtools.IDMappings, parent string, parentIDMappings *idtools.IDMappings, mountLabel string) (size int64, err error)
}

type LayerIDMapUpdater interface {
	// UpdateLayerIDMap walks the layer's filesystem tree, changing the ownership
	// information using the toContainer and toHost mappings, using them to replace
	// on-disk owner UIDs and GIDs which are "host" values in the first map with
	// UIDs and GIDs for "host" values from the second map which correspond to the
	// same "container" IDs. This method should only be called after a layer is
	// first created and populated, and before it is mounted, as other changes made
	// relative to a parent layer, but before this method is called, may be discarded
	// by Diff().
	UpdateLayerIDMap(id string, toContainer, toHost *idtools.IDMappings, mountLabel string) error
	// SupportsShifting tells whether the driver supports shifting of the UIDs/GIDs
	// in an image and it is not required to Chown the files when running in a user
	// namespace.
	SupportsShifting() bool
}
```
containerd
With containerd, the storage handling is split into two parts:
- A common part that is backend agnostic and actually manages the container image lifecycle. It depends on a storage backend provided by the chosen snapshotter implementation.
- A pluggable part that is implemented through a golang interface: snapshotters
A snapshotter implementation provides methods for creating, snapshotting and mounting filesystems. It essentially is a very low level API that creates and provides Mount objects for the common part of containerd and the container runtimes themselves to consume.
containerd snapshotters are external processes available through an internally generated gRPC interface.
By default, containerd provides BTRFS, overlay and native snapshotter implementations. External snapshotters can be used, and projects like Firecracker provide their own when interfacing with containerd.
The containerd snapshotter interface is the main and single entry point for adding support for a new storage backend:
```go
type Snapshotter interface {
	// Stat returns the info for an active or committed snapshot by name or
	// key.
	//
	// Should be used for parent resolution, existence checks and to discern
	// the kind of snapshot.
	Stat(ctx context.Context, key string) (Info, error)
	// Update updates the info for a snapshot.
	//
	// Only mutable properties of a snapshot may be updated.
	Update(ctx context.Context, info Info, fieldpaths ...string) (Info, error)
	// Usage returns the resource usage of an active or committed snapshot
	// excluding the usage of parent snapshots.
	//
	// The running time of this call for active snapshots is dependent on
	// implementation, but may be proportional to the size of the resource.
	// Callers should take this into consideration. Implementations should
	// attempt to honor context cancellation and avoid taking locks when making
	// the calculation.
	Usage(ctx context.Context, key string) (Usage, error)
	// Mounts returns the mounts for the active snapshot transaction identified
	// by key. Can be called on a read-write or readonly transaction. This is
	// available only for active snapshots.
	//
	// This can be used to recover mounts after calling View or Prepare.
	Mounts(ctx context.Context, key string) ([]mount.Mount, error)
	// Prepare creates an active snapshot identified by key descending from the
	// provided parent. The returned mounts can be used to mount the snapshot
	// to capture changes.
	//
	// If a parent is provided, after performing the mounts, the destination
	// will start with the content of the parent. The parent must be a
	// committed snapshot. Changes to the mounted destination will be captured
	// in relation to the parent. The default parent, "", is an empty
	// directory.
	//
	// The changes may be saved to a committed snapshot by calling Commit. When
	// one is done with the transaction, Remove should be called on the key.
	//
	// Multiple calls to Prepare or View with the same key should fail.
	Prepare(ctx context.Context, key, parent string, opts ...Opt) ([]mount.Mount, error)
	// View behaves identically to Prepare except the result may not be
	// committed back to the snapshotter. View returns a readonly view on
	// the parent, with the active snapshot being tracked by the given key.
	//
	// This method operates identically to Prepare, except that Mounts returned
	// may have the readonly flag set. Any modifications to the underlying
	// filesystem will be ignored. Implementations may perform this in a more
	// efficient manner that differs from what would be attempted with
	// `Prepare`.
	//
	// Commit may not be called on the provided key and will return an error.
	// To collect the resources associated with key, Remove must be called with
	// key as the argument.
	View(ctx context.Context, key, parent string, opts ...Opt) ([]mount.Mount, error)
	// Commit captures the changes between key and its parent into a snapshot
	// identified by name. The name can then be used with the snapshotter's other
	// methods to create subsequent snapshots.
	//
	// A committed snapshot will be created under name with the parent of the
	// active snapshot.
	//
	// After commit, the snapshot identified by key is removed.
	Commit(ctx context.Context, name, key string, opts ...Opt) error
	// Remove the committed or active snapshot by the provided key.
	//
	// All resources associated with the key will be removed.
	//
	// If the snapshot is a parent of another snapshot, its children must be
	// removed before proceeding.
	Remove(ctx context.Context, key string) error
	// Walk all snapshots in the snapshotter. For each snapshot in the
	// snapshotter, the function will be called.
	Walk(ctx context.Context, fn func(context.Context, Info) error) error
	// Close releases the internal resources.
	//
	// Close is expected to be called on the end of the lifecycle of the snapshotter,
	// but not mandatory.
	//
	// Close returns nil when it is already closed.
	Close() error
}
```
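The key/name lifecycle described above (Prepare an active snapshot, Commit it under a name, then Prepare children on top of committed snapshots) can be sketched with a toy in-memory snapshotter that tracks only the snapshot graph, not any filesystem state. This is illustrative only, under hypothetical names, and not a containerd implementation.

```go
package main

import (
	"errors"
	"fmt"
)

type snapKind int

const (
	active snapKind = iota
	committed
)

type snap struct {
	kind   snapKind
	parent string
}

// memSnapshotter models just the Prepare/Commit contract: active snapshots
// are addressed by key, committed snapshots by name, and a parent must be
// committed before children can be prepared on top of it.
type memSnapshotter struct {
	snaps map[string]snap
}

func (s *memSnapshotter) Prepare(key, parent string) error {
	if _, ok := s.snaps[key]; ok {
		return errors.New("key already exists") // repeated Prepare must fail
	}
	if parent != "" {
		p, ok := s.snaps[parent]
		if !ok || p.kind != committed {
			return errors.New("parent must be a committed snapshot")
		}
	}
	s.snaps[key] = snap{kind: active, parent: parent}
	return nil
}

func (s *memSnapshotter) Commit(name, key string) error {
	a, ok := s.snaps[key]
	if !ok || a.kind != active {
		return errors.New("no active snapshot for key")
	}
	delete(s.snaps, key) // after Commit, the active key is removed
	s.snaps[name] = snap{kind: committed, parent: a.parent}
	return nil
}

func main() {
	s := &memSnapshotter{snaps: map[string]snap{}}
	s.Prepare("extract-layer0", "")          // active snapshot, empty parent
	s.Commit("layer0", "extract-layer0")     // becomes a committed snapshot
	err := s.Prepare("container1", "layer0") // container rw layer on top
	fmt.Println(err == nil)
}
```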
Problem statement for Kata Containers
- Kata Containers optimized storage setup needs a snapshotting filesystem as a host container image and volume storage backend. virtio-9p is not a production ready backend for Kata Containers, and virtio-fs is a long term replacement option.
- The devicemapper storage backend is not supported by upstream containerd. The Firecracker team does have an open source implementation for it, though. devicemapper as a container storage technology is stable and performant. However, with containerd not shipping a corresponding implementation, fewer and fewer distros will support it.
- The CRI-O and containerd projects have zero storage implementation overlap. Adding snapshotting filesystem backend support to containerd will not share any code with CRI-O.
Proposed solutions
- Add a snapshotting filesystem snapshotter implementation to containerd: lvm2 and/or raw block? What are the benefits of raw block compared with lvm2?
- Can we use the containers/storage driver API to implement such a snapshotter and at least start sharing some code with CRI-O, reducing maintenance?