Gernot Heiser, Ivan Velickovic, Peter Chubb, Alwin Joshy, Anuraag Ganesh, Bill Nguyen, Cheng Li, Courtney Darville, Guangtao Zhu, James Archer, Jingyao Zhou, Krishnan Winter, Lucy Parker, Szymon Duchniewicz, Tianyi Bai
UNSW Sydney, Australia
{gernot, peter.chubb, i.velickovic}@unsw.edu.au
Abstract
We present LionsOS, an operating system for security- and safety-critical embedded systems. LionsOS is based on the formally verified seL4 microkernel and designed with verification in mind. It uses a static architecture and features a highly modular design driven by strict separation of concerns and a focus on simplicity. We demonstrate that LionsOS achieves excellent performance on system-call intensive workloads.
1 Introduction
Safety- and security-critical systems, such as aircraft, autonomous cars, medical devices, industrial control or defence systems, require a highly dependable operating system (OS). The complexity (and code size) of these embedded/cyber-physical systems keeps growing, and it is unavoidable that highly critical (and, presumably, highly assured) functionality co-exists with less critical (and less trustworthy) functionality. These systems are therefore mixed-criticality systems (MCS), where the correct operation of a critical component must not depend on the correctness of a less critical component [Barhorst et al., 2009; Burns and Davis, 2017].
Traditional MCS are concerned with (spatial and temporal) integrity and availability. In modern systems the mixed-criticality requirement must also extend to confidentiality: Many cyber-physical systems, such as drones or medical devices, process sensitive data that must be protected.
The core requirement such critical systems impose on the OS is therefore strong temporal and spatial isolation. As such, the seL4 microkernel [seL4 Foundation, 2021b] seems a perfect foundation: seL4 has undergone extensive formal verification, including proofs of confidentiality and integrity enforcement, proof of implementation correctness and proofs that the binary has the same semantics as the verified C code, taking the compiler out of the trusted computing base (TCB) [Klein et al., 2014]. It is the only formally verified kernel that uses fine-grained access control through capabilities [Dennis and Van Horn, 1966]. There exists an MCS version of seL4 that provides the temporal isolation properties required by real-time systems [Lyons et al., 2018]; verification of the MCS variant is currently in progress [seL4 Foundation, 2024].
These features have resulted in some real-world deployments, including autonomous military aircraft [Cofer et al., 2018] and, more recently, commercial electric cars [Qu, 2024]. Yet more than a decade after seL4’s verification was completed, it is not widely deployed.
The core reason behind this slow uptake seems to be the low-level nature of seL4, even lower than for most other microkernels. For example, seL4 makes management of all of a system’s physical memory (including the kernel’s) a user-level responsibility [Elkaduwe et al., 2008]; while this is a core enabler of reasoning about isolation, managing memory requires data structures (e.g. page tables) which require memory – a foot gun for developers.
seL4 can be said to be the “assembly language of operating systems”. To build functional, performant systems on seL4 requires deep expertise – the kernel is largely unapproachable by industrial developers of critical systems, resulting in a number of seL4-based deployment projects apparently abandoned a few years after they started.
In short, to benefit from seL4’s provable isolation enforcement, developers of critical systems need an actual operating system, providing appropriate abstractions, such as processes, files and network connections, while retaining (and extending) the isolation guarantees provided by seL4. We furthermore posit that such a system must be highly modular, in order to make best use of seL4-provided isolation for minimising the impact of faults, simplify identification and elimination of bugs, and enable end-to-end formal verification (eventually – for now we leave verification out of scope but aim for a verification-friendly design).
Fine-grained modularity results in many context switches and IPC operations; this overhead has traditionally been considered the Achilles heel of microkernels [Bershad, 1992]. Even very recent microkernel-based work co-locates services in larger modules and migrates functionality back into the kernel in order to achieve performance competitive with the monolithic Linux system [Chen et al., 2024].
While such performance challenges may be real for a general-purpose OS (e.g. for smartphones), we aim to achieve a performant yet highly modular OS for the embedded/cyber-physical space.
Our contribution is to demonstrate:
- LionsOS, the first highly modular microkernel-based OS that achieves performance on par with or better than traditional monolithic designs;
- that this performance is enabled by a simplicity-oriented design employing use-case-specific policies.
We present the principles of our approach (Section 3), discuss the resulting design (Section 4) and its implementation focussed on simplicity (Section 5). We evaluate LionsOS performance against Linux and microkernel-based OSes and find that on context-switch intensive loads, which can be expected to show the highest modularity-imposed overheads, LionsOS outperforms all systems we compare against (Section 6).
2 Background and Related Work
2.1 Verification and scalability
The formal verification of seL4 demonstrated that it is possible to prove the implementation correctness of real-world systems of considerable complexity. However, the cost was high: about 12 person years of non-recurring engineering for 8,500 source lines of code (SLOC), and an estimated cost of $350/SLOC [Klein et al., 2009]. While potentially justified for a stable, foundational piece of infrastructure, this cost is too high for most systems, especially a complete OS that will be significantly larger than seL4.
The seL4 project delivered another key insight: Verification effort scales with the square of the specification size [Matichuk et al., 2015]. This implies that there could be a large scalability benefit from keeping things small and simple, i.e. a highly modular design, where each component has a narrow interface and a simple specification, and where module boundaries are enforced by seL4, making it possible to verify modules independently of each other.
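As a rough illustration of the potential benefit, assume (purely hypothetically) that a system with total specification size $S$ can be split into $n$ independently verifiable modules of equal specification size $S/n$, that verifying a specification of size $s$ costs $c\,s^2$ for some constant $c$, and that the cost of reasoning about the composition of modules is ignored:

```latex
\[
E_{\text{monolithic}} = c\,S^{2},
\qquad
E_{\text{modular}} = n \cdot c \left(\frac{S}{n}\right)^{2} = \frac{c\,S^{2}}{n}.
\]
```

Under these idealised assumptions the effort shrinks by a factor of $n$. A real system will see less benefit, since interface specifications add to each module’s size and the composition itself must be reasoned about.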
While seL4 used labour-intensive interactive theorem proving, recent work increasingly uses automated theorem proving techniques [Sigurbjarnarson et al., 2016; Zaostrovnykh et al., 2017; Nelson et al., 2017, 2019; Zaostrovnykh et al., 2019; Narayanan et al., 2020; Chen et al., 2023; Paturel et al., 2023; Cebeci et al., 2024]. These automated techniques are in essence based on (symbolic) state-space exploration with the help of heuristics to deal with combinatorial explosion. Yet they have severe limitations in the complexity of the code they can tackle, and will generally work best on simple and small modules.
2.2 Modularity in operating systems
The idea of a modular OS with hardware-enforced module boundaries is old, going back to the original microkernel (before the term was coined), Brinch Hansen’s Nucleus [Brinch Hansen, 1970]. The approach was popularised by Mach [Rashid et al., 1989] and taken up by other microkernel systems of the time, such as Chorus [Rozier et al., 1988] and QNX [Hildebrand, 1992].
These systems were plagued by poor performance, and functionality moved back into the kernel [Welch, 1991]. This did not prevent expensive debacles, such as IBM’s ill-fated, Mach-based Workplace OS [Fleisch et al., 1998].
Almost all microkernel-based OSes exhibited coarse-grained modularity, typically at the level of major subsystems such as file service, networking and process management [Rawson, 1997; Whitaker et al., 2002; Härtig et al., 2005; Herder et al., 2006; Qubes, 2010], making them too large for verification. Nevertheless, even the most recent work argues that the cost of crossing module boundaries is too high, resulting in a drastically expanded kernel of 90 kSLOC (vs. seL4’s 10 kSLOC) and co-locating services into even more coarse-grained isolation domains [Chen et al., 2024].
The Flux OSKit [Ford et al., 1997] was an early system featuring a more fine-grained design. Performance comparison to Linux and FreeBSD showed a 13% degradation in networking throughput and a 45% increase in latency. SawMill [Gefflaut et al., 2000] was an ambitious project aiming to break up Linux into components isolated by a microkernel; file-system benchmarks showed a throughput degradation of about 18%. No CPU load values are reported for any of these systems, but the degradation in achieved throughput indicates a significant increase in per-packet processing cost.
Genode (formerly Bastei) [Feske and Helmuth, 2007; Feske, 2015] features a modular design, explicitly prioritising assurance over performance (we could not find any published performance data, although we evaluate against it in Section 6.3.1). Its implementation in C++ will prevent a complete formal verification for the foreseeable future. THINK [Fassino et al., 2002] is a component system for building kernels rather than an OS with isolated components. Systems like TinyOS [Levis et al., 2005], Tock [Levy et al., 2017] and Tinkertoy [Wang and Seltzer, 2022] are for microcontrollers without memory protection and as such unsuitable for MCS.
An alternative to enforcing modularity by address-space isolation is enforcing it through the programming language, as pioneered by SPIN [Bershad et al., 1995], and later adopted by Singularity [Fähndrich et al., 2006] and RedLeaf [Narayanan et al., 2020]. These systems generally exhibit lower performance than mainstream OSes. Furthermore, as their security relies on type-safety enforcement by the programming language, they require the whole OS to be implemented in that language. Inevitably this requires unsafe escapes for dealing with hardware. More importantly, this rules out re-using code from mainstream OSes.
Writing all device drivers from scratch is generally infeasible. We therefore ignore language-based isolation approaches and instead focus on modularity enforced by address-space isolation.
What is common to these earlier systems is their complexity, driven among other things by the desire for code or API re-use, which led to design compromises. Our core take-away is that to make modular OSes work, a clean, principled from-scratch approach is needed. Re-use, while desirable, should be subject to clean design principles, rather than compromising them.
3 LionsOS Design Principles
Given this experience, what makes us think we can meet our performance aim from Section 1 with a modular system?
We observe that a commonality of these earlier systems is a significant complexity in design and implementation. We posit that the key to meeting our aim is a strict application of the time-honoured KISS Principle [Wikipedia, 2001]. Following this high-level principle, we aim for a highly modular design which incorporates the following secondary principles:
Strict separation of concerns:
Each module has one and only one purpose (as far as feasible). Furthermore, a particular concern (e.g. the network traffic-shaping policy) should be fully contained in a single module.
Least privilege:
Each module only has the access rights it needs, not more. While not a consequence of KISS, this time-honoured security principle of Saltzer and Schroeder [1975] simplifies reasoning about security and safety.
Design for verification:
Module interfaces are narrow and module implementations simple, to make verification scalable (Section 2.1).
Use-case specific policies:
Likely the most controversial, this principle calls for tailoring each (resource) policy to the system’s particular use case, in order to simplify the policy implementation.
It seems clear that adhering to these principles, which we summarise as “radical simplicity”, will give us the best chance to produce a highly dependable system and maximise the chances of formally proving its correctness and security.
Simplicity is aided by restricting our target domain to embedded systems. While aiming for generality within that domain – which specifically includes cyber-physical systems such as autonomous aircraft and cars, some of which are quite complex and demanding – we do not (yet) aim to support more general-purpose systems, such as cloud servers, smartphones and certainly not laptops.
The common thread of the embedded domain is that it can be served with a static system architecture, i.e. a set of components that is known at configuration time. This does not mean that the system is fully static – it can support late loading of components, dynamic component updates, and place-holders for components that are loaded with programs not known at build time. Dependable embedded systems generally cannot over-commit resources, which is what makes the static architecture work. We are yet to see a realistic use case in the embedded space that cannot be addressed with a static architecture. Note that the static system architecture does not prevent a subsystem, such as a virtual machine (VM) from managing a subset of resources dynamically.
The inherent constraints of embedded systems are also in line with the principle of use-case specific policies. A computer system generally has two classes of policies: security and resource policies. In the embedded space, the security policy is generally defined by the use case, and will only change in the context of a significant re-configuration of the system (accompanied with a major software upgrade).
The embedded system’s defined set of resources implies that, at least for the critical systems we are targeting, the designer has (or should have) a clear idea of how they should be managed. The more tailored the policy is to the use case, the simpler its implementation, and the easier it is to assure (formally or informally) that it matches requirements.
This principle of use-case specific policies is arguably the clearest departure from conventional approaches. OSes tend to be designed to adapt to changing application scenarios with no or minimal code changes. This naturally leads to a (conscious or not) desire to provide universal policies. Of course, no policy is truly universal, and sooner or later it will encounter a use case where the existing policy behaves pathologically, resulting in attempts to generalise it. This approach is a massive driver of complexity: For example, the Linux scheduler contains five scheduling classes, each of which has one or more per-thread tuning parameters; the scheduler has grown from around 11 kSLOC in 2011 (version 3.0) to over 30 kSLOC today (version 6.12). Furthermore, the approach frequently leads to optimising particular “hot” use cases at the expense of overall performance [Ren et al., 2019].
Use-case specific policies represent the opposite approach: Each policy is highly specialised for the use case, and the system achieves use-case diversity not by generalising policies, but by re-writing them as needed. Of course, this can only work if the policies are simple enough to implement.
Our underlying argument is that by taking a radical approach to simplicity and use-case specificity, policies do become simple. Moreover, for most resources there is a small to moderate set of policies that can be pre-supplied, letting the system designer choose from an existing set (or trivially adapt an existing one to the use case).
An illustrative example is network traffic shaping: If multiple clients of a network interface overload the interface, there are only a small number of obvious policies to choose from: Clients may be given a priority, their bandwidth may be limited, they may be served round-robin, or certain protocols might be prioritised. In an embedded system, it is usually obvious which one is appropriate, and this is unlikely to change for the lifetime of the system.
Taken together, our principles lead to an OS becoming something akin to a Lego® set: The OS is built from different kinds of components (brick shapes). Each component comes in multiple, functionally compatible versions (brick colours), and the choice of version (colour) can be made largely independent of the rest of the system.
An obvious concern is how the fine-grained modularity affects our performance aim, as modularity necessarily leads to high context-switching rates. Given there is a cost to every context switch (depending on the architecture, 400–600 cycles for seL4 [seL4 Foundation, 2021a]), this has the potential to make the system slow [Chen et al., 2024]. We will examine the performance impact in Section 6.
4 LionsOS Design
LionsOS is based on seL4, and uses its address-space, thread and IPC abstractions. Like most OSes, it consists mostly of I/O subsystems (device drivers, protocol stacks, file systems) and resource management. The latter part is particularly small in LionsOS because of its static architecture, which reduces resource management to some simple policy modules that are not only use-case specific (as per our principle) but also local in their nature (e.g. shaping traffic of a network interface).
Some global resource management is required, such as core management (off- and on-lining processor cores based on overall CPU load). This is presently work in progress.
4.1 Devices
We demonstrate how our principle of strict separation of concerns applies to I/O, using networking, specifically Ethernet, as a case study. We briefly discuss other device classes in Section 4.1.5.
4.1.1 Device drivers
Applying separation of concerns, we reduce the purpose of a device driver to translating between a hardware-specific device interface and a hardware-independent device class interface: an Ethernet driver does no more than abstract the specific NIC as a generic Ethernet device.
Unsurprisingly, the Ethernet device-class interface (i.e. the driver’s OS interface) looks similar to that of an actual Ethernet network interface controller (NIC), with some differences that help simplify its use. NICs typically use ring buffers in DMA memory to pass references to DMA buffers from and to the driver; each ring buffer entry contains a pointer to a buffer in the DMA region, together with some meta-data (indicating whether a buffer contains valid data). A NIC usually references two such ring buffers, one for transmit (Tx) and one for receive (Rx) data.
The driver’s OS interface uses buffer queues similar to these hardware-specified ring buffers, but to simplify use we use separate queues for buffers containing valid data and those that do not. This means that the software side of the driver has four queues:
transmit available (TxA):
references buffers with valid data provided by the OS for transmission;
transmit free (TxF):
references buffers returned by the NIC to the OS for re-use;
receive available (RxA):
references buffers filled with data by the NIC to be consumed by the OS;
receive free (RxF):
references buffers provided by the OS to the NIC to receive data from the network.
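The following Python model illustrates how buffers cycle through these four queues. It is a sketch of the logic only, not the LionsOS implementation: the names `BufferRef`, `EthQueues` and the four helper functions are hypothetical, and a real driver would operate on shared-memory queues and hardware ring buffers rather than on Python objects.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BufferRef:
    """Reference to a DMA buffer: offset into the data region plus valid length."""
    offset: int
    length: int = 0


@dataclass
class EthQueues:
    """The four software queues of the Ethernet driver's OS interface."""
    tx_available: deque = field(default_factory=deque)  # TxA: OS -> driver, full
    tx_free: deque = field(default_factory=deque)       # TxF: driver -> OS, empty
    rx_available: deque = field(default_factory=deque)  # RxA: driver -> OS, full
    rx_free: deque = field(default_factory=deque)       # RxF: OS -> driver, empty


def os_send(q: EthQueues, length: int) -> bool:
    """OS side: take a free Tx buffer, mark it as holding `length` bytes, queue it."""
    if not q.tx_free:
        return False  # no buffer available; the caller must retry later
    buf = q.tx_free.popleft()
    buf.length = length
    q.tx_available.append(buf)
    return True


def driver_tx_complete(q: EthQueues) -> None:
    """Driver side: the NIC has sent every queued packet; hand the buffers back."""
    while q.tx_available:
        buf = q.tx_available.popleft()
        buf.length = 0
        q.tx_free.append(buf)


def driver_rx(q: EthQueues, length: int) -> bool:
    """Driver side: the NIC filled a free buffer with `length` bytes of input."""
    if not q.rx_free:
        return False  # the device is starved of buffers; the packet is dropped
    buf = q.rx_free.popleft()
    buf.length = length
    q.rx_available.append(buf)
    return True


def os_receive(q: EthQueues):
    """OS side: consume one received packet and recycle its buffer."""
    if not q.rx_available:
        return None
    buf = q.rx_available.popleft()
    length = buf.length
    buf.length = 0
    q.rx_free.append(buf)
    return length
```

In this model every buffer is always in exactly one queue, so ownership of a buffer is determined by which queue holds its reference.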
These queues are allocated in driver metadata regions. For Ethernet there is one region for Tx data (containing the TxA and TxF queues) and a separate one for Rx data (RxA and RxF queues). This is “normal” memory, invisible to the device (i.e. not accessed by DMA).
Note that the driver only handles pointers to DMA buffers; it has no need to access the actual data. As shown in the right half of Figure 1, we make this explicit by separating the data region, which contains the buffers to be filled/emptied by the device, from the device metadata region, which contains the HW-defined ring buffers pointing to the data buffers. The data region is not mapped into the driver’s address space, in line with the principle of least privilege.
In addition there is the device control region, which maps the device registers for memory-mapped I/O and is mapped into the driver uncached. Data and device metadata regions are memory regions accessed by the device via DMA.
Figure 1: Memory regions for Ethernet: device control (Dev Ctl), device metadata (Dev MD), transmit and receive data (Tx Data, Rx Data), driver metadata (Tx D MD, Rx D MD) and virtualiser metadata (Tx V MD, Rx V MD). Arrows indicate access to regions by components, thick black arrows indicate DMA by the device, the thick coloured arrow indicates uncached access by the driver. The TxVirt only maps the Tx Data region if needed for cache management (shown as dashed). We only show a single client of the virtualiser.
4.1.2 Virtualisers
Drivers abstract the device hardware; in line with separation of concerns they do not deal with sharing the device between multiple clients, and the address translations required for this. This is the responsibility of a separate virtualiser (Virt) component which:
- shares a physical device between multiple clients;
- translates references to DMA buffers from client addresses to device addresses (physical addresses or IOMMU-translated I/O-space addresses);
- performs cache management (flushing/invalidating) where needed (not required on the x86 architecture, which keeps caches coherent with DMA).
For the Ethernet device class, we have independent virtualisers for the Tx and Rx paths (TxVirt and RxVirt).
The Virt replicates its driver interface at the client side, meaning each of the Virt’s client-side interfaces looks exactly like its device interface. Specifically it has a metadata region (Tx V MD, Rx V MD) that structurally replicates the respective device metadata region. The key difference is that while the device MD regions use I/O addresses for referring to data buffers, the Virt MD regions refer to data buffers by offsets from the beginning of the respective data region, thus making the Virt’s address-translation task independent of client virtual address-space layout.
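The offset-based translation can be sketched as follows for the case of a per-client data region. This is an illustrative model under our own assumptions: `ClientRegion`, `offset_to_io` and `io_to_offset` are hypothetical names, and the bounds checks reflect the general need to validate values supplied by a client rather than any specific LionsOS code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientRegion:
    """A client's data region as seen by the device (all values hypothetical)."""
    io_base: int   # I/O address of the start of the client's data region
    size: int      # size of the region in bytes


def offset_to_io(region: ClientRegion, offset: int, length: int) -> int:
    """Translate a client-supplied buffer offset into a device (I/O) address.

    The offset comes from a client, so it is bounds-checked before use.
    """
    if offset < 0 or length < 0 or offset + length > region.size:
        raise ValueError("buffer lies outside the client's data region")
    return region.io_base + offset


def io_to_offset(region: ClientRegion, io_addr: int) -> int:
    """Reverse translation, used when a buffer is handed back to the client."""
    offset = io_addr - region.io_base
    if not 0 <= offset < region.size:
        raise ValueError("I/O address does not belong to this client")
    return offset
```

Because only offsets cross the client interface, the translation needs one base address per client and never has to consult the client’s virtual address-space layout.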
The RxVirt needs to inspect the headers of incoming packets so it has the Rx Data region mapped (R/O). The TxVirt does not need to access the data region directly, but may need to have it mapped to perform cache management. For example, the Arm architecture performs cache operations on virtual address ranges, so the TxVirt needs the data region to be mapped into its address space. Arm does not have cache-coherent DMA, so memory that will be transferred to a device using DMA has to be cache-cleaned before DMA occurs. Likewise, the RxVirt on Arm needs to invalidate caches after DMA into its buffers. These mappings are indicated in the left half of Figure 1.
The TxVirt must implement a traffic shaping policy if its clients generate load that exceeds the NIC’s Tx capacity. In line with our principle of use-case specific policies, the Tx policy is as simple as the specific use case allows, e.g. round-robin, priority-based or bandwidth limiting.
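To indicate how small such a policy can be, the following Python sketch models a round-robin Tx policy. It is an assumption-laden illustration, not the LionsOS code: `round_robin_tx` and its parameters are hypothetical, and the queues are modelled as `deque` objects rather than shared-memory queues.

```python
from collections import deque


def round_robin_tx(client_queues, driver_txa, driver_capacity):
    """Move buffers from per-client TxA queues to the driver's TxA queue.

    Takes at most one buffer per client per pass, so no client can starve
    the others. Stops when the driver queue is full or all clients are empty.
    Returns the number of buffers forwarded.
    """
    forwarded = 0
    progress = True
    while progress and len(driver_txa) < driver_capacity:
        progress = False
        for q in client_queues:
            if len(driver_txa) >= driver_capacity:
                break
            if q:
                driver_txa.append(q.popleft())
                forwarded += 1
                progress = True
    return forwarded


# Usage: two clients competing for three driver slots.
a = deque(["a1", "a2", "a3"])
b = deque(["b1"])
drv = deque()
round_robin_tx([a, b], drv, driver_capacity=3)
```

This sketch always starts each invocation at the first client; a fairer variant would remember where the previous pass stopped. Replacing the policy means replacing this one function, which is the point of keeping policies use-case specific.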
The RxVirt at most requires a simple policy: what to do when data arrives for a client whose RxA queue is full. Possible choices are to block, or (more likely) discard the packet and return the buffer to the driver’s RxF queue. We generally avoid this case by ensuring that all client queues are large enough to hold all available Rx buffers, starving the device of buffers if the clients fail to process input fast enough – this leads to the NIC dropping packets under overload without wasting CPU cycles, and leaves the RxVirt policy-free.
4.1.3 Data regions and copiers
The Tx data region is seen as a single region by the driver. However, each client has its own sub-region, which is mapped into the client address space (and the Virt’s where required).
While each client data region is contiguous in physical memory, there is no need for contiguity of the overall data region. Obviously, the whole region must be mapped to the device by the IOMMU.
The same approach does not work for Rx data, as the device will deposit input in any free buffer, and only the Virt determines the target address space. There are three possible approaches to making Rx data available to the correct client:
1. have only a single, global Rx data region, with the Virt mapping each buffer into the client’s address space when inserting it into the client’s RxA queue, and unmapping it when retrieving the buffer from the client’s RxF queue. This needs additional privilege in the Virt, but the Virt must be trusted anyway;
2. have only a single, global Rx data region, which is mapped R/O in all client address spaces. This implies that clients can read other clients’ input data;
3. have an explicit copier (Copy) component between the RxVirt and each client, which copies the data from the global data region into a per-client data region.
The actual choice comes down to performance (including a trade-off between the cost of copying and the cost of mapping operations) and the system’s security policy. For example, option (2) is suitable if there is no concern about one client seeing another’s input data (e.g. when all network traffic is encrypted) and clients can be trusted to return buffers. The Copy component can be inserted transparently: the difference between options (2) and (3) above does not affect the implementation of either the Virt or the client/copier.
4.1.4 Broadcast
Broadcast packets such as for the Address Resolution Protocol (ARP) need special handling. We offer two schemes.
The first approach uses a separate ARP client, whose only job is to respond to those requests. This requires that incoming traffic is routed to clients based on MAC address, and that the ARP client has (R/O) access to the MAC-address allocation table. Other broadcast packets are dropped.
The alternative approach handles a broadcast packet in the RxVirt by enqueuing a copy of the packet in each client’s queue, and reference counting the driver’s buffer containing the packet. We decrement the reference count when the buffer is returned to the client’s RxF queue; and return the buffer to the driver’s RxF queue once all clients have finished (i.e. the reference count has dropped to zero). Each client is then responsible for handling any broadcast traffic it receives.
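The reference-counting scheme can be modelled in a few lines of Python. This is a sketch under simplifying assumptions: the class `RxVirtBroadcast` and its methods are hypothetical, buffers are represented by plain identifiers, and the model enqueues a reference to the shared buffer for each client instead of moving any packet data.

```python
from collections import deque


class RxVirtBroadcast:
    """Toy model of reference-counted broadcast delivery in the RxVirt."""

    def __init__(self, num_clients):
        self.client_rxa = [deque() for _ in range(num_clients)]
        self.driver_rxf = deque()   # buffers handed back to the driver
        self.refcount = {}          # buffer id -> outstanding client references

    def deliver_broadcast(self, buf):
        """Enqueue the buffer for every client and record one reference each."""
        self.refcount[buf] = len(self.client_rxa)
        for q in self.client_rxa:
            q.append(buf)

    def client_returns(self, buf):
        """Called when a client places the buffer on its RxF queue."""
        self.refcount[buf] -= 1
        if self.refcount[buf] == 0:
            del self.refcount[buf]
            self.driver_rxf.append(buf)  # last reference gone: recycle
```

The buffer only becomes available to the device again once the slowest client has returned it, so a client that never returns broadcast buffers reduces the pool available for reception.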
4.1.5 Other device classes
Some other device classes look similar to Ethernet at a high level, and result in a similar design. This includes most serial devices (serial ports, SPI, I2C) with differences in the details for the protocol. Some have no separation between data and metadata (the queues directly contain the data).
Others, especially storage devices, do not have the clear separation of Tx and Rx traffic of Ethernet, and instead react to explicit requests. This results in a slightly different design:
- there is a single driver metadata region;
- there are only two queues, the request (Rq) and the response (Rs) queue;
- there is a single Virt, which presents an Rq and Rs queue to each client in a per-client metadata region;
- there is one data region per client.
For storage, in addition to read and write requests, the Rq may also contain barrier requests, across which the device is not allowed to reorder other requests. Other protocol details support batching of requests. The storage Virt statically partitions the devices between its clients.
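Static partitioning reduces the storage Virt’s translation task to an offset and a bounds check per request. The following sketch illustrates this under our own assumptions: `Partition` and `translate_request` are hypothetical names, and block-granular partitions are assumed for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """A client's fixed slice of the device, in blocks (layout is hypothetical)."""
    first_block: int
    num_blocks: int


def translate_request(part: Partition, block: int, count: int) -> int:
    """Map a client-relative block range to a device block number.

    Rejects any request that would reach outside the client's partition.
    """
    if block < 0 or count <= 0 or block + count > part.num_blocks:
        raise ValueError("request outside client partition")
    return part.first_block + block
```

Since the partition table is fixed at configuration time, the check involves no dynamic state and each client sees a device that starts at block zero.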
The storage driver exports an information page of device properties, which is appropriately virtualised by the Virt.
4.2 OS services
For best performance, LionsOS presents a native API that is asynchronous and modelled largely on the device interfaces. For developer convenience and to ease porting of legacy applications, we use a coroutine library that implements a POSIX-like blocking API that is layered over the native one.
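The layering of a blocking interface over an asynchronous one can be illustrated with Python generators standing in for coroutines. This is a conceptual sketch only: `AsyncDevice`, `blocking_call`, `app` and `run` are hypothetical names, the device simply upper-cases its input, and the actual LionsOS coroutine library is not shown here.

```python
from collections import deque


class AsyncDevice:
    """Stand-in for an asynchronous native interface: requests go into one
    queue and responses appear later in another."""

    def __init__(self):
        self.requests = deque()
        self.responses = deque()

    def submit(self, req_id, payload):
        self.requests.append((req_id, payload))

    def process_one(self):
        """Simulate the service completing the oldest outstanding request."""
        req_id, payload = self.requests.popleft()
        self.responses.append((req_id, payload.upper()))


def blocking_call(dev, req_id, payload):
    """Looks blocking to its caller: submits, then suspends until the reply."""
    dev.submit(req_id, payload)
    reply = yield req_id          # suspend; the event loop sends the reply in
    return reply


def app(dev, log):
    """Application code written in a sequential, blocking style."""
    first = yield from blocking_call(dev, 1, "hello")
    second = yield from blocking_call(dev, 2, first + " again")
    log.append(second)


def run(dev, coroutine):
    """Minimal event loop: resume the coroutine whenever its reply is ready."""
    waiting_for = next(coroutine)
    while True:
        dev.process_one()                      # an event arrives
        req_id, reply = dev.responses.popleft()
        assert req_id == waiting_for
        try:
            waiting_for = coroutine.send(reply)
        except StopIteration:
            return
```

The application reads as straight-line code, while the component as a whole remains single-threaded and event-driven: control returns to the event loop at every point where a conventional blocking call would wait.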
As network traffic is explicitly (de)multiplexed by the virtualisers, there is no need for a global IP stack; it becomes a library linked directly into the client. This takes the complex (and probably buggy) protocol stack out of the system’s TCB.
We use the same approach for storage, by providing a per-client file-system library that directly operates on the virtual storage device provided by the Virt (with an optional copier in between). Alternatively, a single file system could be used for all clients, which then is the sole client of the storage Virt – this shared file system would have to be trusted. We currently see no need for this in our target domain.
Sharing across per-client file systems is enabled by an explicit multiplexer component that connects to multiple clients. In our space, this is generally used for read-only storage.
5 LionsOS Implementation
5.1 The starting point: seL4 Microkit
We base the design of LionsOS on the seL4 Microkit [seL4 Foundation, 2023] (formerly “Core Platform”). The Microkit simplifies seL4 usage by imposing a static system architecture and an event-driven programming model. It presents an abstraction of the seL4 API that is partially verified using SMT solvers [Paturel et al., 2023].
The Microkit provides a process abstraction called protection domain (PD). PDs are single-threaded and combine the seL4 abstractions of virtual address space, capability space, thread and scheduling context. Multi-threaded processes can be implemented through multiple PDs that share an address space. While useful for applications, we do not use this for LionsOS itself – all LionsOS components are strictly sequential.
PDs communicate via shared memory and semaphores (seL4 notifications). Server-type PDs can be invoked synchronously via protected procedure calls (PPCs), which map onto seL4 synchronous IPC – such a server executes on the caller’s core.
PDs are structured as event handlers. Signalling a PD’s semaphore will cause it (eventually) to execute the notified function, identifying the sender PD. A server has another handler function, protected, to receive PPCs. Each PD also has an init handler for initialisation.
The system architecture of PDs and their communication channels (semaphores and shared memory regions) is defined in a system description file (SDF). It specifies the ELF files to be loaded into each PD and a PD’s meta-data, including scheduling parameters, access rights to memory regions and caching attributes, and access to interrupts (which appear as semaphores). A PD can monitor a virtual machine, in which case it acts as a private virtual-machine monitor (VMM) which handles virtualisation events from that VM.
Microkit tooling generates from the SDF the seL4 system calls that set up the PDs, channels and memory regions and invokes each PD’s init function. The tooling hides the complexities of seL4’s capability system from the developer.
5.2 Queues and state
The design using explicit virtualisers (for separation of concerns) enables another important simplification: All shared-memory communication is single producer, single consumer (SPSC), enabling the use of simple, lock-free queue implementations. Specifically, the TxF and RxA queues hold data that is provided by the driver (original producer) and consumed by the client (ultimate consumer); packets flow from right to left in Figure 1. For the TxA and RxF queues the flow is in the opposite direction.
The queues are also inherently bounded, leading to a simple, array-based implementation, where references to particular queue entries are array indices. We also require all queues to be a power of two in size, further simplifying implementation and sanitation.
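A bounded, array-based SPSC queue of this kind might look as follows. This is a minimal sketch under stated assumptions, not the LionsOS queue library: the names `spsc_queue_t`, `spsc_enqueue`, `spsc_dequeue` and the capacity of 8 are hypothetical, and entries are plain 64-bit values standing in for buffer references.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_CAPACITY 8u   /* hypothetical size; must be a power of two */

static_assert((QUEUE_CAPACITY & (QUEUE_CAPACITY - 1)) == 0,
              "queue capacity must be a power of two");

typedef struct {
    _Atomic uint32_t tail;              /* written only by the producer */
    _Atomic uint32_t head;              /* written only by the consumer */
    uint64_t entries[QUEUE_CAPACITY];   /* e.g. references to data buffers */
} spsc_queue_t;

/* Producer side. Returns false if the queue is full. */
bool spsc_enqueue(spsc_queue_t *q, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head >= QUEUE_CAPACITY)
        return false;
    /* Masking keeps the index in bounds whatever the counter value is. */
    q->entries[tail & (QUEUE_CAPACITY - 1)] = value;
    /* Release: the entry becomes visible before the new tail does. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the queue is empty. */
bool spsc_dequeue(spsc_queue_t *q, uint64_t *value)
{
    uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (tail == head)
        return false;
    *value = q->entries[head & (QUEUE_CAPACITY - 1)];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

Because the head and tail counters run freely and are masked on every access, an index read from shared memory can never address outside the array, and the fill level is simply `tail - head` in unsigned arithmetic. No lock or compare-and-swap is needed, since each counter has exactly one writer.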
An important property of this design is that all policy-independent state is held in shared memory. This makes it easy to restart a failed component without affecting the rest of the system (other than by a transient latency glitch). This even enables switching policies on the fly, by reloading the code of a component. We demonstrate this in Section 6.3.1.
5.3 Location transparency
In standard producer-consumer fashion, the lock-free SPSC queues are synchronised by semaphores (signalling a Microkit channel). A producer component signals the consumer if new buffers have been enqueued in a previously empty queue, and the consumer has set a flag that requests signalling. Similarly, producers can request signalling on a queue becoming non-full.
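The signalling rule for the non-empty case can be sketched as below. This is a single-threaded model under stated assumptions: `sig_queue_t`, `produce`, `consume` and `signal_consumer` are hypothetical names, the signal is modelled by a counter rather than a Microkit channel, and the memory barriers and the consumer's re-check after setting the flag, which a concurrent implementation would need, are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CAP 4u   /* hypothetical queue size (power of two) */

typedef struct {
    uint32_t head, tail;     /* free-running consumer/producer indices */
    bool consumer_signal;    /* set by the consumer: "signal me on non-empty" */
    uint32_t buf[CAP];
} sig_queue_t;

/* Stand-in for signalling the channel to the consumer; it only counts. */
unsigned signals_sent;
void signal_consumer(void) { signals_sent++; }

/* Producer: enqueue as much of the batch as fits, then signal at most once,
 * and only if the queue was empty before and the consumer asked for it. */
unsigned produce(sig_queue_t *q, const uint32_t *items, unsigned n)
{
    bool was_empty = (q->head == q->tail);
    unsigned done = 0;
    while (done < n && q->tail - q->head < CAP) {
        q->buf[q->tail & (CAP - 1)] = items[done++];
        q->tail++;
    }
    if (done > 0 && was_empty && q->consumer_signal)
        signal_consumer();
    return done;
}

/* Consumer: drain the queue, then request a signal for the next arrival. */
unsigned consume(sig_queue_t *q, uint32_t *out)
{
    unsigned n = 0;
    while (q->head != q->tail) {
        out[n++] = q->buf[q->head & (CAP - 1)];
        q->head++;
    }
    q->consumer_signal = true;
    return n;
}
```

A batch of enqueues costs at most one signal, and a producer filling an already non-empty queue sends none, because the consumer is known to have work pending.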
This approach is completely location transparent: A particular component is not aware whether the component with which it shares a queue is running on the same or a different core. This location transparency of components makes up for the strictly sequential nature of LionsOS components: Instead of requiring error-prone, multi-threaded implementation of components to make use of multicore hardware, LionsOS utilises multicore processors by distributing components across cores.
The result is that concurrency is tamed: almost all code is freed from concurrency control. The only requirements are correct use of semaphores and flags, and the correct implementation of the enqueue/dequeue library functions (which are straightforward due to the SPSC nature of the queues).
Location transparency will also simplify core management (the implementation of which is in progress): If a core needs to be off-lined, components running on it can be transparently migrated to other cores, without affecting the system’s operation (other than some temporary latency increases).
5.4 Legacy driver reuse
The LionsOS design vastly simplifies drivers compared to other OSes (see Section 6.2); implementing drivers from scratch is usually easy and will result in the best performance.
However, it is unrealistic to expect adopters to write all drivers from scratch, especially since in practice few devices are performance critical enough to justify such an effort. It is also frequently impractical, as many devices are poorly (or un-)documented. For such cases, LionsOS allows reusing a driver from Linux by encapsulating it in a virtual machine (VM). Unlike the Dom0 driver VM of Xen [Barham et al., 2003] or the Driver Container of HongMeng [Chen et al., 2024], we follow the approach of LeVasseur et al. [2004] and support (but do not force) wrapping each driver in its own VM.
Figure 2 shows the architecture. The driver VM runs the legacy driver as part of a (minimally configured) Linux guest. The guest runs a single, statically-linked usermode program, the UIO driver (which replaces init). The program uses normal Linux system calls to interact with the device, and the Linux user I/O (UIO) framework to interact with the LionsOS driver queues.
Specifically we use UIO to map guest physical memory (to access the queues) and receive virtual interrupts. seL4’s virtual machine architecture re-directs virtualisation exceptions to a per-VM virtual-machine monitor. We use this to inject semaphore signals from the Virt as IRQs into the VM, to be received by the UIO driver.
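The guest side of this mechanism relies on the standard Linux UIO file interface: `mmap` exposes a device memory region, a blocking 4-byte `read` returns the interrupt count, and a 4-byte `write` of 1 re-enables the interrupt for UIO drivers that support it. The sketch below shows that pattern only; the type `uio_dev_t`, the function names and the choice of map index 0 are assumptions for illustration, not the LionsOS UIO driver.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct {
    int fd;          /* UIO device file descriptor, -1 if not open */
    void *queues;    /* mapping of the shared queue region, or NULL */
    size_t size;     /* length of the mapping in bytes */
} uio_dev_t;

/* Open a UIO device (e.g. a path such as /dev/uio0) and map region 0. */
int uio_open(uio_dev_t *dev, const char *path, size_t size)
{
    dev->queues = NULL;
    dev->size = size;
    dev->fd = open(path, O_RDWR);
    if (dev->fd < 0)
        return -1;
    /* UIO selects map N through the offset N * page size; N = 0 here. */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, dev->fd, 0);
    if (p == MAP_FAILED) {
        close(dev->fd);
        dev->fd = -1;
        return -1;
    }
    dev->queues = p;
    return 0;
}

/* Block until the next interrupt; returns the total event count or -1. */
int64_t uio_wait_irq(const uio_dev_t *dev)
{
    uint32_t count;
    if (read(dev->fd, &count, sizeof count) != (ssize_t)sizeof count)
        return -1;
    return (int64_t)count;
}

/* Re-enable the interrupt after the queues have been processed. */
int uio_ack_irq(const uio_dev_t *dev)
{
    uint32_t enable = 1;
    if (write(dev->fd, &enable, sizeof enable) != (ssize_t)sizeof enable)
        return -1;
    return 0;
}
```

A driver program built on these helpers would loop over `uio_wait_irq`, process the queues found in the mapped region, and then call `uio_ack_irq` before waiting again.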
We supply the driver VM’s complete userspace as a CPIO archive loaded at boot time from a RAM disk.
Figure 2: Driver-VM architecture
5.5 Implementation status
5.5.1 Device drivers
Most of our development happens on the HardKernel Odroid-C4 (Amlogic S905X3 SoC) and the Avnet MaaXBoard (i.MX8MQ SoC) platforms, so this is where we currently have the largest set of native drivers:
- Serial for all supported platforms.
- PinMux and clock for MaaXBoard and Odroid-C4.
- Ethernet for Odroid-C4, MaaXBoard and the i.MX8 series FEC.
- Block: SDHC drivers for the MaaXBoard and Odroid-C4 – the latter written in Rust.
- VirtIO drivers (for running on top of QEMU) for serial, block, network and graphics (2D).
- An I2C host driver for the Odroid-C4.
- I2C drivers (using the I2C host driver) for a PN532 NFC card reader and a DS3231 real-time clock.
Most drivers are written in C, but the system does not prescribe an implementation language, as demonstrated by the Odroid-C4 Rust-implemented SDHC driver.
We have Linux driver VMs for the following device classes:
- GPU via exported framebuffer for Odroid-C4.
- Ethernet for Odroid-C4.
- Block (SDHC) for the Odroid-C4.
- Sound using the ALSA framework for Odroid-C4.
5.5.2 Services
We have full networking functionality as described above, using lwIP [Dunkels, 2001] as a client library: lwIP cannot break isolation and is thus not part of the system’s TCB. Our default setup uses an RxCopy component (see Section 4.1.3). Both the native (asynchronous) and the blocking API are supported; the latter is layered over the former using a coroutine library. We have an NFS client (using an open-source NFS library) which uses the blocking API.
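Layering a blocking call over an asynchronous one with coroutines can be sketched as follows. This is an illustration under stated assumptions only: it uses the (obsolescent but glibc-supported) POSIX `ucontext` functions in place of the LionsOS coroutine library, and the toy asynchronous API (`async_read_start`, `event_loop_poll`) and all other names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <ucontext.h>

/* Toy asynchronous API: a request completes on a later event-loop poll. */
typedef struct { bool done; int result; } async_req_t;
static async_req_t *pending;               /* at most one request in flight */

void async_read_start(async_req_t *r) { r->done = false; pending = r; }

void event_loop_poll(void)                 /* models an arriving completion */
{
    if (pending) {
        pending->result = 42;
        pending->done = true;
        pending = NULL;
    }
}

/* Coroutine plumbing: one client coroutine plus the event-loop context. */
static ucontext_t loop_ctx, co_ctx;
static char co_stack[64 * 1024];
static bool client_done;
int client_result;

static void co_yield(void) { swapcontext(&co_ctx, &loop_ctx); }

/* Blocking API: looks synchronous to its caller, yields underneath. */
int blocking_read(void)
{
    async_req_t r;
    async_read_start(&r);
    while (!r.done)
        co_yield();                        /* let the event loop run */
    return r.result;
}

static void client(void)
{
    client_result = blocking_read();
    client_done = true;
}

/* Run the client to completion; returns how often the loop had to poll. */
int run_client(void)
{
    int polls = 0;
    client_done = false;
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof co_stack;
    co_ctx.uc_link = &loop_ctx;            /* resume the loop when client ends */
    makecontext(&co_ctx, client, 0);
    swapcontext(&loop_ctx, &co_ctx);       /* start the client coroutine */
    while (!client_done) {
        event_loop_poll();
        polls++;
        swapcontext(&loop_ctx, &co_ctx);   /* resume the client */
    }
    return polls;
}
```

The client code is written as straight-line blocking logic, while the component as a whole remains a single-threaded event handler: control only ever changes hands at the explicit yield inside `blocking_read`.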
We have an asynchronous filesystem API that uses either a native FAT file system, the NFS client for network storage, or any Linux file system hosted in a Linux VM accessed via the standard VirtIO block interface. Again, there is a blocking API layered on top.
LionsOS is mature enough to run several complete systems in daily use. One of them is a reference design for a point-of-sales terminal. It uses a driver VM to re-use the Linux GPU driver, and either a native Ethernet driver or another driver VM for re-using a Linux driver (to demonstrate multiple driver VMs).
Figure 3: Architecture of the LionsOS-based web server.
Another deployed system is a web server that hosts the sel4.systems web site. The web server has the business logic implemented in Python, supported by a port of MicroPython [MicroPython Developers, 2014] to LionsOS. Figure 3 shows the architecture.
5.5.3 Resource management
Implementing dynamic resource management, such as core off-/on-lining (cf. Section 5.3), is in progress.
6 Evaluation
We evaluate multiple aspects of LionsOS, covering development and debugging effort, legacy driver re-use, and performance.
6.1 Platforms
LionsOS supports the Arm AArch64, Intel x86_64 and RISC-V RV64 architectures. Development happens primarily on Arm platforms and is then ported to the other architectures. We evaluate on Arm and x86.
The Arm platform is an Avnet MaaXBoard with an NXP i.MX8MQ SoC, having four Cortex A53 cores running at a maximum of 1.5 GHz and sharing an L2 cache. We run all measurements at a clock rate of 1 GHz to prevent overheating. The board has 2 GiB of RAM, an on-chip 1 Gb/s NIC, and an on-chip SDHC controller. We perform Linux measurements on this board with a small Buildroot [2016] system using kernel version 6.1.0. Some measurements are made on a HardKernel Odroid-C4, which has an Amlogic S905X3 quad-core Cortex A55 running at 1.2 GHz, 4 GiB of DDR4 RAM, and on-chip NIC and SDHC devices.
The x86 platform is an Intel Xeon® W-1250 six-core CPU running at 3.3 GHz, with private L2 caches and a shared L3. It has an Intel IXGBE X550 10 Gb/s copper NIC. We disable hyperthreading and turbo-boost. For Linux measurements we use a Debian Bullseye userspace and use the “performance” CPU frequency controller to keep the CPU frequency at its maximum value. The kernel is the Debian Linux kernel 6.6.15-2 (2024-02-04) running the standard in-kernel IP stack.
6.2 Complexity and development effort
Table 1: SLOC of LionsOS NIC drivers compared with Linux.
6.2.1 Code size
Our subjective experience is that the LionsOS model dramatically simplifies development of core OS components. A striking example is the i.MX8 network driver, which was the first device driver written to the LionsOS driver model of Section 4.1.1. It was implemented by a second-year undergraduate student, less than 18 months after she wrote her first program. Table 1 compares the code sizes of our Ethernet drivers with those of Linux. We use sloccount [Wheeler, 2001] for all code-size measurements.
The size gives an indication of the complexity of the task. The student found that she spent very little time debugging the driver logic, unlike in normal driver development.
| Component | LoC | Library | LoC |
|---|---|---|---|
| Serial Driver | 249 | Microkit | 303 |
| Serial TxVirt | 175 | Serial queue | 219 |
| Serial RxVirt | 126 | I2C queue | 101 |
| I2C Driver | 514 | Eth queue | 140 |
| I2C Virt | 154 | Filesys queue & protocol | 268 |
| Timer Driver | 136 | | |
| Eth Driver | 397 | Coroutines* | 848 |
| Eth TxVirt | 122 | lwIP* | 16,280 |
| Eth RxVirt | 160 | NFS* | 45,707 |
| Eth Copier | 79 | VMM* | 3,098 |
| Total | 2,112 | | 1,031 + 65,933 |

Table 2: Code-size breakdown of the point-of-sale terminal demonstrator. Components marked with * are not part of the LionsOS TCB. The lwIP and NFS libraries are ported from other systems; the rest is written from scratch.
Our experience with other LionsOS components is similar: components are small and simple. Table 2 gives a breakdown of the Odroid-C4-based point-of-sales terminal (see Section 5.5.2). LionsOS, as configured for this application, consists of about 3.1 kSLOC of trusted code, plus 66 kSLOC of untrusted library code that cannot bre