Testing and verifying concurrent systems is hard due to their non-deterministic nature — verifying behavior that changes with each execution is difficult. Race conditions thrive in the non-deterministic world of thread scheduling. Even more challenging is verifying timeliness constraints, i.e. ensuring that operations complete within specified deadlines or that service guarantees are maintained under load. Traditional testing approaches struggle with concurrency, and mocking strategies often fail to capture the subtle interactions between threads, time, and shared state that cause real production failures.
The io-sim Haskell library, developed by Well-Typed in partnership with engineers from IOG and Quviq, offers a compelling solution to…
Testing and verifying concurrent systems is hard due to their non-deterministic nature — verifying behavior that changes with each execution is difficult. Race conditions thrive in the non-deterministic world of thread scheduling. Even more challenging is verifying timeliness constraints, i.e. ensuring that operations complete within specified deadlines or that service guarantees are maintained under load. Traditional testing approaches struggle with concurrency, and mocking strategies often fail to capture the subtle interactions between threads, time, and shared state that cause real production failures.
The io-sim Haskell library, developed by Well-Typed in partnership with engineers from IOG and Quviq, offers a compelling solution to this problem. The library provides a pure simulation environment for IO computations, enabling deterministic execution of concurrent code with accurate time simulation and detailed execution traces. Unlike other testing approaches, with io-sim one is able to write highly concurrent, real-time systems and verify their timeliness constraints in a deterministic manner, by accurately simulating GHC’s runtime system (e.g. asynchronous exceptions, timeouts & delays, etc.).
This blog post introduces and explores io-sim through a practical example: debugging an elevator controller that violates its response time requirements.
There’s also this great blog post announcing
io-simand it goes a bit more into detail about its features!
The Problem
Consider a simple elevator located in a three-floor building (ground, first , second). It takes roughly 1 second for the elevator to go up and down between each floor. The service requirement is: no passenger should wait more than 4 seconds from pressing the call button until the elevator doors open at their floor. It should be possible to test and verify this requirement when writing our elevator controller.
This ensures a reasonable quality of service and prevents frustration. Given the short distance between floors, 4 seconds is sensible. In the worst case, the elevator must travel from ground to second floor and back again.
Here’s a first attempt at modelling the system. Let’s start with the core data structures:
data Direction = Up | Down | None
deriving (Eq, Show)
data Floor = Ground | First | Second
deriving (Eq, Ord, Show, Enum)
data ElevatorState = ElevatorState
{ currentFloor :: Floor
, moving :: Direction
, requests :: [Floor]
} deriving (Show)
The elevator’s state tracks three things: where it currently is, which direction it’s moving (if any), and a queue of floor requests.
The system has two main components that run concurrently:
- An elevator controller that continuously processes the request queue
- Button press handler that adds new floor requests
Let’s look at the controller first:
-- | Initialize an empty elevator state.
--
-- The elevator starts on the ground floor
--
initElevator :: IO (TVar ElevatorState)
initElevator = newTVarIO $ ElevatorState Ground None []
-- | Elevator controller logic.
--
-- 1. Read the current 'ElevatorState'
-- 2. Check if there are any requested floors
-- 3.
-- 3.1. Block waiting for new requests if there aren't any
-- 3.2. If there any requests, move to the floor at the top of the queue.
--
-- Straightforward FIFO elevator algorithm.
--
elevatorController :: TVar ElevatorState -> IO ()
elevatorController elevatorVar = forever $ do
-- Atomically get the next floor from the queue
(nextFloor, dir) <- atomically $ do
state <- readTVar elevatorVar
case requests state of
[] -> retry -- Block until a request arrives
(targetFloor:rs) -> do
-- Remove the floor from queue and start moving
let direction = getDirection (currentFloor state) targetFloor
writeTVar elevatorVar $ state
{ moving = direction, requests = rs }
return (targetFloor, direction)
putStrLn ("Going " ++ show dir ++ " to " ++ show nextFloor)
moveToFloor elevatorVar nextFloor
The moveToFloor function simulates the physical movement of the elevator.
moveToFloor :: TVar ElevatorState -> Floor -> IO ()
moveToFloor elevatorVar targetFloor = do
elevatorState <- readTVarIO elevatorVar
when (currentFloor elevatorState /= targetFloor) $ do
-- Takes 1 second to move between floors
threadDelay (1000000 * numberOfFloorsToMove)
atomically $
modifyTVar elevatorVar (\elevatorState' ->
elevatorState' { currentFloor = targetFloor
, moving = Idle
}
)
putStrLn ("Arrived at " ++ show targetFloor)
The buttonPress function handles both external calls (someone waiting for the elevator) and internal requests (someone inside selecting a destination):
-- | Whenever a button is pressed this function is called.
--
-- There are two scenarios when a button is pressed:
--
-- 1. When a person is calling the elevator to a floor in order to enter it.
-- 2. When a person is inside the elevator and wants to instruct the elevator
-- to go to a particular floor.
--
buttonPress :: TVar ElevatorState -> Floor -> IO ()
buttonPress elevatorVar floor = do
putStrLn ("Pressing button to " ++ show floor)
atomically $
modifyTVar elevatorVar $ \state -> do
case requests state of
rs@(nextFloor:_)
| let mostRecentRequestedFloor = last rs
, nextFloor /= floor
|| mostRecentRequestedFloor /= floor ->
state { requests = rs ++ [floor] }
| otherwise -> state
[] -> state { requests = [floor] }
Consider the following example scenario and timeline:
- The elevator starts on the ground floor.
- Person A is on the first floor and presses the button to call the elevator to the first floor.
- While the elevator is going up, Person B arrives on the ground floor calls it to the ground floor.
- Elevator arrives at the first floor.
- Person A enters and presses the button to go to the second floor.
- Elevator goes to the ground floor to pick up Person B.
- Person B enters and presses the button to go to the first floor.
- Elevator goes to the second floor.
- Elevator goes to the first floor.
-- | This example mimicks the scenario above, pressing buttons in the right
-- order.
elevatorExample :: [Floor] -> IO ()
elevatorExample floors = do
elevator <- initElevator
withAsync (elevatorController elevator)
$ \controllerAsync -> do
-- Simulate multiple people pressing buttons simultaneously
forConcurrently_ floors (buttonPress elevator)
threadDelay (10 * 1000000)
cancel controllerAsync
elevatorExample [First, Ground, Second, First]
This function spawns the elevator controller and then simulates multiple button presses happening concurrently. Let’s trace through our example:
Pressing button to First
Going Up to First
Pressing button to Ground
Pressing button to Second
Pressing button to First
Arrived at First
Going Down to Ground
Arrived at Ground
Going Up to Second
Arrived at Second
Going Down to First
Arrived at First
Does such a simple implementation adhere to the specified time constraints? The answer is no, a FIFO elevator algorithm is easy to implement but can be inefficient if the requests are spread out across floors, leading to more travel time.
How would one go about to test/verify this? Testing timeliness constraints in concurrent IO is tricky, due to its non-deterministic nature and limited observability.
io-sim: Deterministic IO Simulator
io-sim closes the gap between the code that’s actually run in production and the code that runs in tests. Combined with property based testing techniques it is possible to simulate execution of a program for years worth of simulated time and find reproducible, rare edge-case bugs.
io-sim achieves this by taking advantage of the io-classes set of packages, which offers a class-based API compatible with most of the core Haskell packages, including mtl. In general the APIs follow the base or async
io-sim is a time based, discrete event simulator. Which means, it provides a granular execution trace that can be used from inspecting the commit order of STM transactions to validating a high level, temporal logic property over some abstract trace. The best part is that code requires minimal changes to use io-sim, just polymorphic type signatures that work with both IO and IOSim monads. Here’s the elevator controller code refactored for testing with io-sim:
initElevator :: MonadSTM m => m (TVar m ElevatorState)
initElevator = ...
elevatorController
:: ( MonadSTM m
, MonadDelay m
, MonadSay m
)
=> TVar m ElevatorState -> m ()
elevatorController elevatorVar =
...
say ("Going " ++ show dir ++ " to " ++ show nextFloor)
...
moveToFloor
:: ( MonadSTM m
, MonadDelay m
, MonadSay m
)
=> TVar m ElevatorState -> Floor -> m ()
moveToFloor elevatorVar targetFloor = do
...
say ("Arrived at " ++ show targetFloor)
getDirection :: Floor -> Floor -> Direction
getDirection from to = ...
buttonPress
:: ( MonadSTM m
, MonadSay m
)
=> TVar m ElevatorState -> Floor -> m ()
buttonPress elevatorVar floor = do
say ("Pressing button to " ++ show floor)
...
elevatorExample
:: ( MonadSTM m
, MonadAsync m
, MonadDelay m
, MonadSay m
)
=> [Floor]
-> m ()
elevatorExample floors = ...
Notice that only type signatures and IO operations needed changes. The core business logic remains identical. When instantiated to IO, say becomes putStrLn, but in the IOSim monad it produces traceable events.
main :: IO ()
main = do
let simpleExample :: [Floor]
simpleExample = [First, Ground, Second, First]
-- Runs the 'elevatorExample' in IO. This outputs exactly the same output
-- as before
elevatorExample simpleExample
-- Runs the 'elevatorExample' in IOSim.
putStrLn . intercalate "\n"
. map show
. selectTraceEventsSayWithTime
-- ^ Extracts only the 'say' events from the 'SimTrace' and
-- attaches the timestamp for each event
--
-- selectTraceEventsSayWithTime :: SimTrace a -> [(Time, String)]
--
-- This function takes a 'SimTrace' and filters all 'EventSay'
-- traces. It also captures the time of the trace event.
$ runSimTrace (elevatorExample simpleExample)
-- ^ Runs example in 'IOSim'
--
-- runSimTrace :: (forall s. IOSim s a) -> SimTrace a
--
-- This function runs a IOSim program, yielding an execution trace.
Running the program above, the first noticeable thing is that when the program runs in IO, it actually takes 10 real seconds to run due to the threadDelay calls. However, when the program runs in IOSim the output is instantaneous. This is because io-sim operates on simulated time rather than wall-clock time, i.e. only the internal clock advances when threads execute time-dependent operations like threadDelay or timeouts. Between these operations, the simulation executes as if it had infinite CPU speed, i.e. all computations at a given timestamp complete instantly, yet remain sequentially ordered and deterministic.
(Time 0s,"Pressing button to First")
(Time 0s,"Going Up to First")
(Time 0s,"Pressing button to Ground")
(Time 0s,"Pressing button to Second")
(Time 0s,"Pressing button to First")
(Time 1s,"Arrived at First")
(Time 1s,"Going Down to Ground")
(Time 2s,"Arrived at Ground")
(Time 2s,"Going Up to Second")
(Time 4s,"Arrived at Second")
(Time 4s,"Going Down to First")
(Time 5s,"Arrived at First")
This particular scenario doesn’t violate the constraint. To find violations, property-based testing can explore the space of possible request patterns. The only problem is that our say traces are strings which is not a very functional way of tracing things.
contra-tracer: Structured Tracing
While say provides basic, string-based tracing, real systems need structured tracing of domain-specific events. String-based logging quickly becomes inadequate when trying to verify complex properties or analyze system behavior programmatically. Tracing strongly-typed events that can be filtered, analyzed, and used in property tests is much better. The contra-tracer library provides a contravariant tracing abstraction that integrates seamlessly with io-sim.
The key advantages of structured tracing:
- Type Safety: Events are strongly typed, preventing typos and logging errors.
- Composability: Tracers can be filtered, transformed, and combined.
- Testability: Events can be programmatically analyzed in tests.
All one needs to do is to have a custom trace type:
data ElevatorTrace = ButtonPress Floor
| Going Direction Floor
| ArrivedAt Floor
deriving (Eq, Show, Typeable)
And substitute all calls to say for traceWith tracer (ButtonPress floor), for example.
With structured tracing in place, extracting and analyzing traces becomes type-safe and straightforward:
-- | Extract typed elevator events with timestamps
extractElevatorEvents :: SimTrace a -> [(Time, ElevatorTrace)]
extractElevatorEvents =
selectTraceEventsDynamicWithTime
Property-Based Testing: Verifying Timing Constraints
The elevator system began with a clear requirement: no passenger should wait more than 4 seconds. The FIFO implementation seemed reasonable, but the elevator can end up travelling between the bottom and top floors whilst someone in the middle waits their turn.
With typed traces from contra-tracer and deterministic simulation from io-sim, QuickCheck can systematically explore the space of possible request patterns and verify this property.
To verify our timing constraint, we need to:
- Generate random sequences of floor requests
- Run each sequence through the elevator simulation
- Check that every passenger gets service within 4 seconds
Let’s start with the test data generation:
-- | 'Floor' Arbitrary instance.
--
-- Randomly generate floors. The shrink instance is the most important here
-- since it will be responsible for generating a simpler counterexample.
--
instance Arbitrary Floor where
arbitrary = elements [Ground, First, Second]
shrink Second = [Ground, First]
shrink First = [Ground]
shrink Ground = []
The shrinking strategy is important because when QuickCheck finds a failing case with many floors, it will try simpler combinations to find the minimal reproduction of the original input.
To verify the property that no passenger waits more than 4 seconds for the elevator to arrive to its floor, one needs to track the button presses and measure how long until the elevator arrives.
The property works by maintaining a map of pending requests. Each ButtonPress adds an entry (keeping the earliest if multiple people request the same floor), and each ArrivedAt checks if that floor was requested and whether the wait exceeded 4 seconds:
-- Traverse the event trace and check if there is any gap longer than 4s
-- between requests and the elevator arriving at the request's floor.
--
violatesFourSecondRule :: [(Time, ElevatorTrace)] -> Property
violatesFourSecondRule events = counterexample (intercalate "\n" $ map show events)
$ checkViolations events Map.empty
where
checkViolations :: [(Time, ElevatorTrace)] -> Map Floor DiffTime -> Property
-- Fail if there are pending requests
checkViolations [] pending =
counterexample ("Elevator never arrived at: " ++ show pending)
(Map.null pending)
checkViolations ((Time t, event):rest) pending =
case event of
-- Add request to the pending requests map. Note that if there's
-- already a request for a particular floor, overwriting the
-- timestamp is not the right thing to do because there's an older
-- request that shouldn't be forgotten.