Emulators in Rust with async/await
A few months ago I implemented a(nother) Game Boy emulator in Rust, where I repurposed Rust's async/await into coroutines for fun and no profit.
The Game Boy’s hardware consists of several independent components – the CPU, Pixel Processing Unit (PPU), Audio Processing Unit (APU), etc. Each component advances in discrete time steps called M-cycles.
How do you keep the flow of code simple and direct inside each component while still being able to interleave their operations? Here’s what I mean:

On the left are different components advancing in parallel as they would on real hardware. On the right are the same components being advanced in series. The exact order of operations within an M-cycle doesn’t matter but all operations for cycle N must complete before any for cycle N + 1 begin. We want to implement the model on the right.
This is exactly the type of problem where coroutine ergonomics shine. Unfortunately, stable Rust doesn't have coroutines. But Rust's async/await compiles down to coroutine-like state machines, which I decided to exploit!
Here’s a toy example of what I envisioned:
async fn cpu() {
    sleep(3).await;
    println!("CPU: 1");
    sleep(3).await;
    println!("CPU: 2");
    sleep(2).await;
    println!("CPU: 3");
}

async fn ppu() {
    sleep(4).await;
    println!("PPU: 1");
    sleep(1).await;
    println!("PPU: 2");
    sleep(1).await;
    println!("PPU: 3");
}

async fn apu() {
    sleep(3).await;
    println!("APU: 1");
    sleep(2).await;
    println!("APU: 2");
    sleep(4).await;
    println!("APU: 3");
}

fn main() {
    let mut driver = Driver::new();
    driver.spawn(cpu());
    driver.spawn(ppu());
    driver.spawn(apu());
    // Run till completion.
    driver.run();
}
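Before getting into the implementation, here is the interleaving this toy example should produce with the driver described below (the order within a single cycle is arbitrary and doesn't matter):

CPU: 1    // cycle 3
APU: 1    // cycle 3
PPU: 1    // cycle 4
APU: 2    // cycle 5
PPU: 2    // cycle 5
CPU: 2    // cycle 6
PPU: 3    // cycle 6
CPU: 3    // cycle 8
APU: 3    // cycle 9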
So my goal was to implement a single-threaded async driver that could schedule the emulator's components and let them talk back to it, without wrestling with intrusive piping or backpointers, which are excruciatingly difficult patterns to implement in Rust anyway.
After a bit of googling I learned of thread_local, which finally gave me hope that all this was doable. And since I was after single-threaded execution, it seemed perfectly cromulent.
I broke the solution into three parts:
- The Driver.
- The Sleep Futures.
- Shared state that the futures and the driver use to pass messages to each other.
Let’s traverse them in reverse.
I managed shared state with the thread_local! macro, which essentially creates per-thread globals. In it I stored the current clock cycle, which, as you'll see, the Sleep futures use to calculate their wake cycles, and the next wake cycle, which the Sleeps set and the Driver reads.
pub struct Sleep {
    cycles: usize,
    initialized: bool,
}
On the Sleep struct, I implemented the Future trait. When the future is first polled, it reads the current clock cycle from the shared thread-local state and uses its cycles field to calculate the cycle at which it should be resumed. The driver then polls it again exactly at that resume cycle.
I also added a helper function to keep things pretty.
pub fn sleep(cycles: usize) -> Sleep {
    Sleep {
        cycles,
        initialized: false,
    }
}
The Driver needs to poll pending futures at precisely their requested clock cycles.
pub struct Driver {
    futures: FuturesVec, // imagine a plain Vec of pending futures for now
}
You could use a for loop to check every pending future for readiness on every cycle, but that would be needlessly inefficient, so I used a priority queue instead.
// clock cycle → futures to resume at that clock cycle
type FutureQueue = BTreeMap<usize, Vec<Sleep>>;
The driver jumps directly to the next clock cycle where something happens and resumes the relevant futures. This triggers the component (CPU, PPU, etc.) that awaited the future to resume execution until the next sleep(n).await is hit somewhere within its entrails.
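A minimal sketch of that jump, using the simplified FutureQueue above (the full driver code follows later); the helper name next_batch is mine, not part of the real implementation:

use std::collections::BTreeMap;

// Sketch only: the smallest BTreeMap key is the next cycle with work to do.
fn next_batch(queue: &mut FutureQueue, clock: &mut usize) -> Option<Vec<Sleep>> {
    let next_cycle = *queue.keys().next()?;
    *clock = next_cycle; // jump the clock straight to that cycle, skipping idle ones
    queue.remove(&next_cycle)
}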
To truly appreciate (or recoil from) the gratuitous defilement of Rust described in this section, you'll need some prior understanding of async/await in Rust.
First, the thread_local! state:
use std::cell::Cell;

thread_local! {
    static CURRENT_CYCLE: Cell<usize> = const { Cell::new(0) };
    static NEXT_WAKE_CYCLE: Cell<Option<usize>> = const { Cell::new(None) };
    static PENDING_EVENT: Cell<Option<Event>> = const { Cell::new(None) };
}
I’m storing three things here. We’ll focus on CURRENT_CYCLE and NEXT_WAKE_CYCLE for now and I’ll come back to PENDING_EVENT near the end.
And the Sleep future:
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

#[derive(Debug)]
pub struct Sleep {
    cycles: usize,
    initialized: bool,
}

impl Future for Sleep {
    type Output = ();

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
        let this = self.get_mut();
        // Get the current cycle.
        let current_cycle = CURRENT_CYCLE.with(|cell| cell.get());
        if !this.initialized {
            // First poll: write the target cycle to the thread_local variable for the driver to read.
            NEXT_WAKE_CYCLE.with(|cell| {
                cell.set(Some(current_cycle + this.cycles));
            });
            this.initialized = true;
            Poll::Pending
        } else {
            // Any later poll happens at the target cycle, so we're done.
            Poll::Ready(())
        }
    }
}

pub fn sleep(cycles: usize) -> Sleep {
    Sleep {
        cycles,
        initialized: false,
    }
}
The same song and dance described earlier, but in code: Get the current cycle, calculate the target cycle, share target cycle with the Driver.
Finally, the Driver.
use futures::task::Waker;
use std::collections::{BTreeMap, VecDeque};

// Type alias for the complex future queue type.
type FutureQueue = BTreeMap<usize, Vec<Pin<Box<dyn Future<Output = ()>>>>>;

pub struct Driver {
    clock: usize,
    futures_queue: FutureQueue,
    events_queue: VecDeque<Event>,
}

impl Driver {
    pub fn new() -> Self {
        Self {
            clock: 0,
            futures_queue: BTreeMap::new(),
            events_queue: VecDeque::with_capacity(100),
        }
    }

    pub fn spawn<F>(&mut self, future: F)
    where
        F: Future<Output = ()> + 'static,
    {
        self.futures_queue
            .entry(self.clock)
            .or_insert_with(|| Vec::with_capacity(4))
            .push(Box::pin(future));
    }

    pub fn run_for(&mut self, max_cycles: usize) -> ExecutionResult {
        // If any events are yet to be handled, do that first.
        if let Some(event) = self.events_queue.pop_front() {
            // Exit early due to unhandled event.
            return ExecutionResult {
                event,
                cycles_executed: 0,
            };
        }
        let start_cycle = self.clock;
        let target_cycle = start_cycle + max_cycles;
        while !self.futures_queue.is_empty() && self.clock < target_cycle {
            let next_cycle = *self.futures_queue.keys().next().unwrap();
            // Don't exceed our cycle limit.
            if next_cycle >= target_cycle {
                break;
            }
            // Advance the clock directly to this cycle.
            self.clock = next_cycle;
            // Reflect the change in the shared current cycle.
            CURRENT_CYCLE.with(|cell| {
                cell.set(self.clock);
            });
            // Get the futures to poll during this cycle.
            let futures = self.futures_queue.remove(&next_cycle).unwrap();
            let mut cx = Context::from_waker(Waker::noop());
            for mut future in futures {
                // Reset the next wake cycle.
                NEXT_WAKE_CYCLE.with(|cell| {
                    cell.set(None);
                });
                // If not done, re-queue at the requested wake cycle.
                if future.as_mut().poll(&mut cx).is_pending() {
                    let wake_cycle = NEXT_WAKE_CYCLE
                        .with(|cell| cell.get())
                        .unwrap_or(self.clock + 1);
                    self.futures_queue
                        .entry(wake_cycle)
                        .or_default()
                        .push(future);
                }
                // Check for events after each future poll.
                let pending_event = PENDING_EVENT.with(|cell| cell.get());
                if let Some(event) = pending_event {
                    // Clear the event and add it to the queue.
                    PENDING_EVENT.with(|cell| {
                        cell.set(None);
                    });
                    self.events_queue.push_back(event);
                }
            }
            if let Some(event) = self.events_queue.pop_front() {
                // Exit early due to unhandled event.
                return ExecutionResult {
                    event,
                    cycles_executed: self.clock - start_cycle,
                };
            }
        }
        ExecutionResult {
            event: Event::MaxCycles,
            cycles_executed: self.clock - start_cycle,
        }
    }
}
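One piece the snippet above never shows is ExecutionResult. Reconstructing its shape from the fields used in run_for, it is just the event that stopped the driver plus how many cycles actually ran:

// Assumed shape, reconstructed from the fields used in run_for above.
pub struct ExecutionResult {
    pub event: Event,
    pub cycles_executed: usize,
}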
The actual FutureQueue is a type alias to BTreeMap<usize, Vec<Pin<Box<dyn Future<Output = ()>>>>>. This might look intimidating, and it is. So I just pretend:

type FutureQueue = BTreeMap<usize, Vec<Sleep>>;
In my actual implementation, I also defined a run_for method instead of the run from the toy example at the start. You might've noticed that the clock field on Driver is redundant given CURRENT_CYCLE; I just keep it for convenience.
Future::poll requires a Context argument, which I had no use for, so I used a no-op waker. The rest of the loop just puts the previously described ideas into code.
Remember PENDING_EVENT from earlier? As I quickly realized, in practice your components also need to emit events like “Update Screen Pixels” or “Play Audio”. When this happens the driver interrupts the run_for method and lets outside code handle the event. Here’s my implementation of events for my Game Boy emulator:
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    VBlank,          // Frame complete, display framebuffer to screen
    AudioBufferFull, // Audio buffer full, drain to audio system
    MaxCycles,       // Reached cycle limit
}

pub fn emit_event(event: Event) {
    PENDING_EVENT.with(|cell| {
        cell.set(Some(event));
    });
}
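For instance, the PPU can announce a finished frame at the end of its per-frame loop. A hypothetical sketch (the real PPU is far more involved, and ppu_frame is a made-up name):

// Hypothetical sketch of a component emitting an event.
async fn ppu_frame() {
    for _line in 0..154 {
        // ... render one scanline ...
        sleep(114).await; // a Game Boy scanline lasts 114 M-cycles
    }
    // Tell the outside world that a full frame is ready to display.
    emit_event(Event::VBlank);
}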
This is why run_for returns the number of cycles that were executed.
ExecutionResult {
    event: Event::MaxCycles,
    cycles_executed: self.clock - start_cycle,
}
When there’s an early exit, the outside code can call run_for again for just the remaining number of cycles.
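Putting the outside half together, a frontend loop might look roughly like this. This is a sketch under my own assumptions: CYCLES_PER_FRAME, present_frame, and drain_audio are placeholder names for whatever your frontend provides.

// Hypothetical frontend loop driving the emulator one frame at a time.
const CYCLES_PER_FRAME: usize = 17_556; // M-cycles per Game Boy frame

fn run_frame(driver: &mut Driver) {
    let mut remaining = CYCLES_PER_FRAME;
    while remaining > 0 {
        let result = driver.run_for(remaining);
        remaining -= result.cycles_executed;
        match result.event {
            Event::VBlank => present_frame(),        // blit the framebuffer
            Event::AudioBufferFull => drain_audio(), // hand samples to the audio backend
            Event::MaxCycles => break,               // budget for this frame is used up
        }
    }
}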
I used this framework to implement a fully working cycle-accurate Game Boy emulator where I could write code like this:
async fn execute_ld_a_nn() {
    // M2: Read LSB of address
    let nn_lsb = fetch().await;
    // M3: Read MSB of address
    let nn_msb = fetch().await;
    // Construct 16-bit address
    let nn = ((nn_msb as u16) << 8) | (nn_lsb as u16);
    // M4: Read from memory at address nn
    let value = memory_read(nn).await;
    with_state_mut(|state| {
        state.cpu.a = value;
    });
}
instead of the usual orgy of states, sub-states, and match constructs. With it I also fulfilled my five-year-long dream of making an emulator with coroutines in my favorite language (I was waiting for coroutines in stable Rust then, and I'm still waiting now).
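The fetch and memory_read calls above are tiny async helpers built on the same sleep primitive. As a rough, hypothetical sketch (the field names and the value-returning behavior of with_state_mut are my assumptions), fetch is just a one-M-cycle memory read at the program counter:

// Hypothetical sketch: read the byte at PC, advance PC, and burn one M-cycle.
async fn fetch() -> u8 {
    sleep(1).await; // every memory access costs one M-cycle
    with_state_mut(|state| {
        let pc = state.cpu.pc;
        state.cpu.pc = pc.wrapping_add(1);
        state.memory.read(pc)
    })
}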
This approach is by no means perfect. Because of async, the emulator's state also needed to be thread-local:
use std::cell::RefCell;

thread_local! {
    static STATE: RefCell<EmulatorState> = const { RefCell::new(EmulatorState::new()) };
}
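This is also where the with_state_mut helper used in execute_ld_a_nn lives. My guess at its shape is a thin wrapper over the RefCell:

// Assumed shape of the helper: mutably borrow the thread-local state for the
// duration of the closure. A nested call would panic on the RefCell borrow.
pub fn with_state_mut<R>(f: impl FnOnce(&mut EmulatorState) -> R) -> R {
    STATE.with(|state| f(&mut *state.borrow_mut()))
}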
On top of that, repeated access of thread-locals is considerably slower than plain field access. Serializing the state is also not possible, which means no save states: you cannot resume an async function from somewhere in the middle the next time you run the process. These are serious shortcomings, so was it still worth it? Absolutely! This was a great learning experience for me and I'm glad I finally managed to do it.