- Author: Cat Game Research Team
- Date: October 6, 2025
- Milestone: M4 Phase 1 - Advanced Rendering Infrastructure
- Technical Level: Intermediate to Advanced
Abstract
To build a modern game engine, you need more than draw calls and good intentions. Today’s GPUs demand rendering architectures that are declarative, dependency-aware, and ruthlessly optimized. In this post, we unpack two core subsystems powering Bad Cat: Void Frontier:
- A DAG-based render graph with automatic synchronization and pass culling
- A Vulkan Memory Allocator (VMA) integration for high-performance, low-fragmentation GPU memory management We explore the journey from forward rendering to frame graphs, delve into the concepts of DAGs and resource lifetimes, and guide you through our implementation—from the builder API to barrier inference. Along the way, we showcase real-world Vulkan code, performance improvements, and the design decisions that make our pipeline efficient, scalable, and agent-friendly.
Introduction: How We Got Here — From Draw Calls to DAGs
Let’s rewind for a second. Back in the early 2000s, rendering was simple — and by “simple,” we mean terrifyingly manual. Game engines used immediate mode rendering, where every draw call was fired straight at the GPU like a shotgun blast. No batching, no dependency tracking, no real concept of resource lifetimes. It worked, but only because the hardware was forgiving and the visuals were modest.
Then came programmable shaders. OpenGL 3.0 and DirectX 10 cracked open the pipeline, letting us write custom vertex and fragment logic. But most engines still ran forward renderers — single-pass, brute-force, and increasingly fragile as scenes got more complex. You’d sort by material, maybe depth, and hope your lighting didn’t tank performance.
Deferred Rendering: Lighting Gets Smart (and Painful)
Around 2004–2008, deferred rendering changed the rules. Instead of lighting every pixel during geometry processing, we split rendering into multiple passes: geometry first, lighting later. This unlocked complex lighting setups — think dozens of dynamic lights, screen-space effects, and layered post-processing — but it came at a cost.
Suddenly, you had to manage render targets across multiple stages. You needed to know when a texture was written, when it was read, what layout it was in, and whether the GPU had finished using it. Forget one barrier and boom — undefined behavior. The pipeline became a minefield of synchronization bugs and memory leaks.
Frame Graphs: Declarative Rendering for Engines That Mean Business
Fast-forward to 2015+, and the industry starts waking up. Frostbite (EA’s internal engine) pioneers the frame graph — a declarative model where you describe what you want rendered, and the system figures out how to do it efficiently. Yuriy O’Donnell’s 2017 GDC talk lays it out: treat rendering like a compiler optimization problem.
Instead of imperatively issuing commands, you build a Directed Acyclic Graph (DAG) where each node is a render pass and each edge is a resource dependency. The engine topologically sorts the graph, inserts synchronization barriers, allocates memory based on lifetimes, and culls any dead passes. It’s clean, scalable, and shockingly robust.
Frame graphs solve the four big pain points:
- 🧠 Automatic Resource Management — lifetimes are tracked, memory is allocated and freed intelligently
- 🔒 Synchronization Inference — barriers and layout transitions are inserted based on usage
- 🧹 Pass Culling and Optimization — unused passes are dropped, execution order is topologically sorted
- 🕵️ Visualization and Debugging — the graph structure can be exported and inspected

This isn’t just a new approach to rendering; it’s a whole new way of thinking about graphics programming. Picture a world where debugging those complicated shader pipelines is a thing of the past, because you’re not writing pipelines anymore; you’re defining intent. That’s the transformation our Void Engine brings to life.
Why This Matters for Bad Cat: Void Frontier
Bad Cat: Void Frontier isn’t just another action-adventure game — it’s a rich, system-driven experience packed with lots of cats, environmental puzzles, escape rooms, combat, AI, dynamic lighting, and GPU-pushing visual effects. Our rendering pipeline has to tackle:
- 🌀 Multi-pass complexity — G-buffer generation, lighting accumulation, and post-processing chains
- 📐 Dynamic resource allocation — resolution scaling, effect quality adjustments, platform-aware fallbacks
- ✨ Advanced effects — screen-space reflections, volumetrics, and particles with soft shadows
- 🧭 Cross-platform deployment — Vulkan support across NVIDIA, AMD, Intel, and more

Managing all that manually? You’d end up writing thousands of lines of fragile synchronization code that breaks with every tweak. That’s not sustainable: you’d be stuck rewriting plumbing when you could be building new mechanics and aesthetics.
Our render graph changes the game: we declare intent, and the system takes care of execution. It optimizes dependencies, lifetimes, barriers, and memory automatically. It’s declarative, reliable, and developer-friendly, letting us focus on what really matters: creating worlds, not micromanaging pipelines.
Part I: Render Graph Systems
How We Got Here: Forward, Deferred, and the Rise of the Frame Graph
Frame graphs didn’t appear out of nowhere. They evolved over years of trial and error through increasingly complex rendering pipelines. Let’s explore the timeline.
Render Queues (2005–2010): Sorting, Not Solving
Early engines relied on render queue systems—organizing drawables by material, depth, or shader and then processing them. While this improved batching, it didn’t address the core issue: resource lifecycle management. Render targets were pre-allocated and persisted indefinitely, regardless of whether they were in use, leading to inherent memory waste.
Material-Centric Pipelines (2010–2015): Artists Win, Engineers Sweat
With the rise of deferred rendering, engines like Unreal Engine 3 and Unity embraced material-driven pipelines. This allowed artists to define shaders and render states for each material, making iteration much easier. However, behind the scenes, it still required manually handling intermediate buffers, layout transitions, and synchronization. While the pipeline became more visually appealing, it also grew more delicate.
Frostbite’s Frame Graph (2015–2017): Rendering as a DAG
Frostbite took a different approach by switching to a dependency graph model. In this setup, nodes represent render passes, and edges define resource dependencies. The system:
- 🧮 Topologically sorts the passes to establish execution order
- 🧼 Allocates transient resources for only as long as they’re needed
- 🔒 Automatically inserts synchronization barriers
- 🧹 Eliminates unnecessary passes that don’t affect the final output

The key idea? Approach rendering as a compiler optimization challenge. Just as compilers remove dead code and rearrange instructions, a frame graph cuts out unused passes and optimizes execution order for better efficiency. It’s more than an improved pipeline — it’s a more intelligent one.
Theoretical Foundations: DAGs, Lifetimes, and Barrier Inference
To build a render graph that’s actually useful, not just pretty, you need a few core concepts dialed in. This section breaks down the theory behind our system: how we model dependencies, track resource lifetimes, and infer synchronization automatically.
Directed Acyclic Graphs (DAGs): The Backbone
Our render graph is a DAG — a directed acyclic graph. This means:
- Directed: Edges flow from producer to consumer (e.g., Pass A writes a texture, Pass B reads it → edge from A to B)
- Acyclic: No loops allowed. A pass cannot depend on itself, even indirectly.

Each node represents a render pass, and each edge signifies a resource dependency. For instance, if Pass B reads a texture created by Pass A, we establish an edge from A to B. Straightforward.
Why DAGs work:
- ✅ Topological sort always exists, ensuring a valid execution order that respects all dependencies.
- ⚠️ Cycle detection is efficient, validated in O(V+E) time using DFS.
- 🧹 Transitive reduction helps prune redundant edges and simplifies the graph.

This structure provides a straightforward way to manage execution order, resource usage, and synchronization — without embedding rigid rules.
Resource Lifetime Analysis: Who Needs What, When
Every resource (texture, buffer, etc.) has a lifetime:
- First use: The earliest pass that accesses it for reading or writing
- Last use: The latest pass that interacts with it

Between these two points, the resource must remain allocated. Outside this range, the memory can be reclaimed or repurposed for another resource. Essentially, this is like register allocation for GPU memory — aiming to reduce the number of “live” resources at any given moment in a frame.
We analyze lifetimes by performing a linear scan across the topologically sorted passes:
// Simplified lifetime calculation from render_graph.cpp
void RenderGraph::calculate_resource_lifetimes() {
for (uint32_t pass_index = 0; pass_index < sorted_passes_.size(); ++pass_index) {
auto* pass = sorted_passes_[pass_index];
for (auto& read : pass->reads_) {
auto& resource = resources_[read.handle.id];
resource.first_use_pass = std::min(resource.first_use_pass, pass_index);
resource.last_use_pass = std::max(resource.last_use_pass, pass_index);
}
for (auto& write : pass->writes_) {
auto& resource = resources_[write.handle.id];
resource.first_use_pass = std::min(resource.first_use_pass, pass_index);
resource.last_use_pass = std::max(resource.last_use_pass, pass_index);
}
}
}
This provides a clear timeline for each resource, which will later be used for memory aliasing and budgeting.
Barrier Insertion: Vulkan Without the Pain
Modern GPUs handle commands asynchronously. If one pass writes to a texture and the next pass reads from it, you need to insert a pipeline barrier. Without it, you could face undefined behavior or visual glitches like flickering shadows and ghost pixels.
Vulkan makes you specify:
- Source stage — which pipeline stage must finish
- Destination stage — which stage must wait
- Access masks — what kind of memory access is being synchronized
- Layout transitions — for images, what layout they’re switching between

Doing this manually is a nightmare, so we don’t bother.
Our render graph intelligently handles barrier insertion by analyzing the current state of each resource and aligning it with what the next pass requires:
// From our barrier insertion logic
void RenderGraph::insert_barriers() {
for (size_t i = 0; i < sorted_passes_.size(); ++i) {
auto* pass = sorted_passes_[i];
BarrierInfo barriers;
for (auto& read : pass->reads_) {
auto& resource = resources_[read.handle.id];
bool needs_barrier =
resource.current_layout != read.expected_layout ||
resource.current_access != read.access_mask;
if (needs_barrier) {
VkImageMemoryBarrier barrier{};
barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
barrier.srcAccessMask = resource.current_access;
barrier.dstAccessMask = read.access_mask;
barrier.oldLayout = resource.current_layout;
barrier.newLayout = read.expected_layout;
barrier.image = resource.image;
// ... subresource range setup
barriers.image_barriers.push_back(barrier);
barriers.src_stage |= resource.current_stage;
barriers.dst_stage |= read.stage_mask;
// Update resource state
resource.current_layout = read.expected_layout;
resource.current_access = read.access_mask;
resource.current_stage = read.stage_mask;
}
}
pass->barriers_ = std::move(barriers);
}
}
This system removes an entire category of bugs, including layout mismatches, missing barriers, and race conditions, making the pipeline reproducible and user-friendly. You simply describe what you need, and the graph ensures it is executed safely.
Implementation Deep Dive: How We Architect the Render Graph
Our render graph is designed with a builder pattern, divided into three distinct phases. This section focuses on Phase 1, detailing how we create the graph using a fluent, declarative API that captures intent without requiring micromanagement of execution.
Phase 1: Graph Construction (Builder API)
Each frame begins with a new graph, where passes and resources are defined using a fluent builder interface designed to be readable, reproducible, and easy for agents to work with.
auto& builder = render_graph.begin_frame();
// Declare a shadow map texture
auto shadow_map = builder.create_texture(
"shadow_map",
TextureDescriptor{
.width = 2048,
.height = 2048,
.format = VK_FORMAT_D32_SFLOAT,
.usage = VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT |
VK_IMAGE_USAGE_SAMPLED_BIT
}
);
// Add a shadow pass that writes to the shadow map
builder.add_pass("shadow_pass")
.write(shadow_map, AttachmentLoadOp::Clear, AttachmentStoreOp::Store)
.execute([](RenderPassContext& ctx) {
ctx.draw_shadow_casters();
});
// Declare a lighting result texture
auto lighting_result = builder.create_texture("lighting", /*...*/);
// Add a lighting pass that reads the shadow map and writes lighting output
builder.add_pass("lighting_pass")
.read(shadow_map, PipelineStage::FragmentShader)
.write(lighting_result, AttachmentLoadOp::Clear, AttachmentStoreOp::Store)
.execute([](RenderPassContext& ctx) {
ctx.draw_lit_geometry();
});
render_graph.end_frame();
This API follows a declarative approach, allowing you to specify what you want to achieve, while the graph determines how to execute it safely and efficiently.
Why this matters:
- 🧠 Intent is clear — no boilerplate, no imperative sequencing
- 🔗 Dependencies are explicit — the graph knows lighting depends on shadow
- 🔄 Refactoring is safe — add or remove passes without breaking synchronization

It’s modular, easy to read, and designed for automation. You can integrate it into agent workflows, spec-driven pipelines, or live-editing tools without the hassle of dealing with low-level Vulkan complexities.
Phase 2: Graph Compilation
After the passes are defined, the graph undergoes a sequence of compilation steps, each designed to be modular, predictable, and reproducible.
- Dependency analysis — build adjacency lists for topological sort
- Topological sort — determine a valid execution order that respects dependencies
- Lifetime calculation — track first and last use for every resource
- Pass culling — remove anything that doesn’t contribute to the final output
- Barrier insertion — infer synchronization automatically
- Memory allocation — assign GPU memory for transient resources

The sorting process relies on Kahn’s algorithm, a technique for topologically sorting a Directed Acyclic Graph (DAG). It works by repeatedly removing nodes with an in-degree of 0 while updating the in-degrees of their neighbors. If all nodes are successfully processed, the graph is confirmed as a DAG; otherwise, it contains a cycle. Here’s an example of how cycle detection is built into the process:
void RenderGraph::topological_sort() {
std::vector<uint32_t> in_degree(passes_.size(), 0);
std::queue<RenderPass*> zero_in_degree;
// A pass's in-degree is the number of passes it depends on
for (const auto& pass : passes_) {
in_degree[pass->index_] = static_cast<uint32_t>(pass->dependencies_.size());
}
for (auto& pass : passes_) {
if (in_degree[pass->index_] == 0) {
zero_in_degree.push(pass.get());
}
}
sorted_passes_.clear();
while (!zero_in_degree.empty()) {
auto* pass = zero_in_degree.front();
zero_in_degree.pop();
sorted_passes_.push_back(pass);
for (auto* dependent : pass->dependents_) {
if (--in_degree[dependent->index_] == 0) {
zero_in_degree.push(dependent);
}
}
}
if (sorted_passes_.size() != passes_.size()) {
throw std::runtime_error("Render graph contains cycles");
}
}
This ensures a clear, deterministic execution order—or throws an error if the graph is invalid. Either way, we have complete clarity on what we’re dealing with.
Phase 3: Graph Execution
Once compiled, the graph walks through the sorted passes and issues Vulkan commands. Barriers are inserted automatically before each pass:
void RenderGraph::execute(VkCommandBuffer cmd) {
for (auto* pass : sorted_passes_) {
if (!pass->barriers_.image_barriers.empty() ||
!pass->barriers_.buffer_barriers.empty()) {
vkCmdPipelineBarrier(
cmd,
pass->barriers_.src_stage,
pass->barriers_.dst_stage,
0,
0, nullptr,
static_cast<uint32_t>(pass->barriers_.buffer_barriers.size()),
pass->barriers_.buffer_barriers.data(),
static_cast<uint32_t>(pass->barriers_.image_barriers.size()),
pass->barriers_.image_barriers.data()
);
}
RenderPassContext ctx{cmd, this, pass};
pass->execute_callback_(ctx);
}
}
This phase is dead simple: walk the graph, insert barriers, run the callbacks. No surprises.
Performance Optimization: Sorting, Culling, and Aliasing
Two optimizations make this whole system scale:
1. Pass Culling (Dead Code Elimination)
Not every declared pass contributes to the final frame. Example:
auto debug_buffer = builder.create_texture("debug", /*...*/);
builder.add_pass("debug_visualization")
.write(debug_buffer, /*...*/)
.execute([](auto& ctx) { /* expensive debug rendering */ });
builder.add_pass("present")
.read(final_color, /*...*/)
.execute([](auto& ctx) { /* present to screen */ });
If debug_buffer isn’t read by anything downstream, we cull debug_visualization. The system runs a reverse reachability analysis from final outputs:
void RenderGraph::cull_unused_passes() {
std::unordered_set<RenderPass*> reachable;
std::queue<RenderPass*> to_visit;
for (auto& pass : passes_) {
if (pass->writes_to_external_) {
to_visit.push(pass.get());
reachable.insert(pass.get());
}
}
while (!to_visit.empty()) {
auto* pass = to_visit.front();
to_visit.pop();
for (auto* dependency : pass->dependencies_) {
if (reachable.insert(dependency).second) {
to_visit.push(dependency);
}
}
}
passes_.erase(
std::remove_if(passes_.begin(), passes_.end(),
[&](auto& pass) { return reachable.find(pass.get()) == reachable.end(); }),
passes_.end()
);
}
Part II: GPU Memory Management
Raw Vulkan hands you direct control over device memory: you call vkAllocateMemory yourself and bind resources at offsets you choose. That control comes with hazards:
- Allocation limits — often ~4096 allocations per device
- Alignment rules — 256 bytes to 64KB per resource
- Fragmentation — free space in the wrong shape is useless

Example:
Allocate 64KB texture [████████]
Allocate 64KB buffer [████████]
Allocate 64KB texture [████████]
Free middle buffer [--------]
// You have 64KB + 64KB free, but not contiguous.
// Can't fit a 128KB texture.
Suballocation: The Industry Fix
The standard approach: allocate big blocks (64–256MB) and carve them up yourself.
Big allocation [==================== 256MB ====================]
Suballocate [Tex1][Buf1][Tex2]...[Uniforms]...[Staging]
That means:
- Tracking free regions (free lists, buddy allocators, etc.)
- Handling alignment per resource
- Defragmenting when gaps appear
- Staying within memory budgets
Why Vulkan Memory Allocation Is Hard
Here’s a real-world example:
VkBufferCreateInfo buffer_info{};
buffer_info.size = 1024 * 1024; // 1MB
buffer_info.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT;
VkBuffer buffer;
vkCreateBuffer(device, &buffer_info, nullptr, &buffer);
VkMemoryRequirements mem_reqs;
vkGetBufferMemoryRequirements(device, buffer, &mem_reqs);
// mem_reqs:
// - size: 1048576 (may be larger due to alignment)
// - alignment: 256
// - memoryTypeBits: 0b00001010
From here you must:
- Find a compatible memory type from the bitmask
- Allocate memory with correct size/alignment
- Bind it to the buffer at the right offset
- Track it for freeing later
- Handle allocation failures (OOM, fragmentation)

That’s just one buffer. A single frame can interact with hundreds of resources, making manual handling tedious, error-prone, and an ideal task for automation.
VMA: AMD’s Solution to Industry-Wide Challenges
Vulkan Memory Allocator (VMA) is an open-source library developed by AMD and released under the MIT license in 2017. Despite being developed by AMD, it works on all Vulkan-capable hardware (NVIDIA, Intel, ARM Mali, Qualcomm Adreno, etc.) because it operates entirely through standard Vulkan APIs.
Why AMD Created VMA
AMD’s motivations were multifaceted:
- Developer Experience: Make Vulkan more accessible by abstracting memory management complexity
- Performance: Help developers use GPU memory optimally (benefiting AMD hardware)
- Ecosystem: Accelerate Vulkan adoption by reducing implementation barriers
- Best Practices: Codify memory management patterns AMD engineers discovered

The library embodies 20+ years of GPU driver engineering knowledge from AMD’s internal teams.
Core VMA Features
VMA provides several critical capabilities:
1. Automatic Memory Type Selection
Instead of manually checking compatibility bits:
// Without VMA - manual and error-prone
uint32_t find_memory_type(uint32_t type_filter, VkMemoryPropertyFlags properties) {
VkPhysicalDeviceMemoryProperties mem_props;
vkGetPhysicalDeviceMemoryProperties(physical_device, &mem_props);
for (uint32_t i = 0; i < mem_props.memoryTypeCount; i++) {
if ((type_filter & (1 << i)) &&
(mem_props.memoryTypes[i].propertyFlags & properties) == properties) {
return i;
}
}
throw std::runtime_error("Failed to find suitable memory type");
}
// With VMA - automatic and optimal
VmaAllocationCreateInfo alloc_info{};
alloc_info.usage = VMA_MEMORY_USAGE_GPU_ONLY; // VMA picks best device-local type
VMA understands the nuances of different GPU architectures and selects optimal memory types automatically.
2. Smart Suballocation with Multiple Strategies
VMA implements three allocation strategies:
- Best-fit: Find the smallest free block that fits (minimizes waste)
- Worst-fit: Use the largest free block (reduces fragmentation)
- Buddy allocator: Power-of-2 allocation for fast merging

It automatically switches strategies based on allocation patterns and fragmentation levels.
3. Memory Mapping Abstraction
Raw Vulkan requires manual mapping/unmapping:
// Without VMA
void* data;
vkMapMemory(device, memory, offset, size, 0, &data);
memcpy(data, vertex_data, size);
vkUnmapMemory(device, memory);
// With VMA - simpler and safer
void* data;
vmaMapMemory(allocator, allocation, &data);
memcpy(data, vertex_data, size);
vmaUnmapMemory(allocator, allocation);
// Or even better - persistent mapping
VmaAllocationCreateInfo info{};
info.flags = VMA_ALLOCATION_CREATE_MAPPED_BIT;
// Memory stays mapped, no map/unmap overhead
4. Budget Tracking and Memory Statistics
VMA tracks memory usage across all allocations:
VmaBudget budgets[VK_MAX_MEMORY_HEAPS];
vmaGetHeapBudgets(allocator, budgets);
for (uint32_t i = 0; i < heap_count; ++i) {
printf("Heap %u: %llu / %llu MB used\n",
i,
(unsigned long long)(budgets[i].usage / (1024*1024)),
(unsigned long long)(budgets[i].budget / (1024*1024)));
}
This is critical for respecting system memory limits and avoiding out-of-memory crashes.
5. Defragmentation Support
For long-running applications (MMOs, open-world games), VMA can defragment memory:
VmaDefragmentationInfo defrag_info{};
defrag_info.maxBytesPerPass = 64 * 1024 * 1024; // 64MB per frame
defrag_info.maxAllocationsPerPass = 100;
VmaDefragmentationContext ctx;
vmaBeginDefragmentation(allocator, &defrag_info, &ctx);
// Over multiple frames:
for (;;) {
VmaDefragmentationPassMoveInfo pass_info;
VkResult result = vmaBeginDefragmentationPass(allocator, ctx, &pass_info);
if (result == VK_SUCCESS) {
break; // No more moves needed - defragmentation complete
}
// VK_INCOMPLETE: move the listed allocations to compact memory
perform_allocation_moves(pass_info);
vmaEndDefragmentationPass(allocator, ctx, &pass_info);
}
vmaEndDefragmentation(allocator, ctx, nullptr);
This is comparable to garbage collection compaction found in managed languages.
Implementation: RAII Wrappers and Device Injection
Our VMA integration follows modern C++ best practices with RAII (Resource Acquisition Is Initialization) and dependency injection.
The VmaAllocatorWrapper Class
class VmaAllocatorWrapper {
public:
VmaAllocatorWrapper() = default;
~VmaAllocatorWrapper() { shutdown(); }
// Move semantics - allocator is move-only
VmaAllocatorWrapper(VmaAllocatorWrapper&& other) noexcept;
VmaAllocatorWrapper& operator=(VmaAllocatorWrapper&& other) noexcept;
// Delete copy - prevent accidental duplication
VmaAllocatorWrapper(const VmaAllocatorWrapper&) = delete;
VmaAllocatorWrapper& operator=(const VmaAllocatorWrapper&) = delete;
// Initialization with device injection
bool initialize(
VulkanDevice* device,
PFN_vkGetInstanceProcAddr get_instance_proc,
PFN_vkGetDeviceProcAddr get_device_proc
);
void shutdown();
bool is_initialized() const { return allocator_ != nullptr; }
// Buffer operations
bool create_buffer(
const VkBufferCreateInfo& buffer_info,
VmaMemoryUsage usage,
VkBuffer& out_buffer,
VmaAllocation& out_allocation
);
void destroy_buffer(VkBuffer buffer, VmaAllocation allocation);
// Memory mapping
bool map_memory(VmaAllocation allocation, void** data);
void unmap_memory(VmaAllocation allocation);
// Statistics
VmaBudget get_budget(uint32_t heap_index) const;
private:
VmaAllocator allocator_{nullptr};
VulkanDevice* device_{nullptr};
};
Key design decisions:
- RAII: The destructor automatically calls shutdown(), preventing memory leaks
- Move semantics: The allocator can be moved but not copied (unique ownership)
- Device injection: Accepts a VulkanDevice* instead of raw Vulkan handles (testability)
- Opaque handles: VMA uses opaque VmaAllocation handles (good encapsulation)
Dynamic Function Pointer Loading
VMA requires Vulkan function pointers for dynamic loading contexts. Our implementation:
bool VmaAllocatorWrapper::initialize(
VulkanDevice* device,
PFN_vkGetInstanceProcAddr vkGetInstanceProcAddr,
PFN_vkGetDeviceProcAddr vkGetDeviceProcAddr
) {
VmaVulkanFunctions vma_funcs{};
vma_funcs.vkGetInstanceProcAddr = vkGetInstanceProcAddr;
vma_funcs.vkGetDeviceProcAddr = vkGetDeviceProcAddr;
VmaAllocatorCreateInfo create_info{};
create_info.vulkanApiVersion = VK_API_VERSION_1_2;
create_info.physicalDevice = device->get_vk_physical_device();
create_info.device = device->get_vk_device();
create_info.instance = device->get_vk_instance();
create_info.pVulkanFunctions = &vma_funcs;
if (vmaCreateAllocator(&create_info, &allocator_) != VK_SUCCESS) {
return false;
}
device_ = device;
return true;
}
This approach supports:
- Dynamic Vulkan loading (no static linking to vulkan-1.dll)
- Multiple devices (each device gets its own allocator)
- Cross-platform (works on Windows, Linux, Android, etc.)
Memory Budgeting and Defragmentation Strategies
Two advanced features we leverage from VMA:
Budget Tracking
VmaBudget VmaAllocatorWrapper::get_budget(uint32_t heap_index) const {
if (!is_initialized()) {
return VmaBudget{};
}
VmaBudget budgets[VK_MAX_MEMORY_HEAPS];
vmaGetHeapBudgets(allocator_, budgets);
return budgets[heap_index];
}
We use this to:
- Prevent over-allocation: Don’t exceed 80% of VRAM budget
- Adaptive quality: Reduce texture resolution if approaching budget
- Telemetry: Report memory usage in debug builds
Defragmentation (Planned for M5)
While not yet implemented, our architecture supports future defragmentation:
// Planned for M5: Background defragmentation
void RenderSystem::background_defragmentation() {
if (frame_count_ % 300 == 0) { // Every 5 seconds at 60 FPS
VmaDefragmentationInfo info{};
info.maxBytesPerPass = 32 * 1024 * 1024; // 32MB per frame
VmaDefragmentationContext ctx;
vma_allocator_.begin_defragmentation(info, ctx);
// Process incrementally over multiple frames
defrag_context_ = ctx;
defrag_active_ = true;
}
}
This will be critical for our open-world ark ship environment where players can spend hours in a single session.
Part III: Integration and Synergy
Bringing Render Graph and VMA Together
The real strength lies in the synergy between the render graph and VMA. Picture this rendering scenario:
Scenario: Dynamic Shadow Map Allocation
Traditional approach (manual):
// Manually create shadow map
VkImageCreateInfo image_info{};
image_info.extent = {2048, 2048, 1};
image_info.format = VK_FORMAT_D32_SFLOAT;
// ... 20 more lines of setup
VkImage shadow_map;
vkCreateImage(device, &image_info, nullptr, &shadow_map);
// Manually allocate memory
VkMemoryRequirements mem_reqs;
vkGetImageMemoryRequirements(device, shadow_map, &mem_reqs);
VkMemoryAllocateInfo alloc_info{};
alloc_info.allocationSize = mem_reqs.size;
alloc_info.memoryTypeIndex = find_device_local_memory_type(mem_reqs.memoryTypeBits);
VkDeviceMemory memory;
vkAllocateMemory(device, &alloc_info, nullptr, &memory);
vkBindImageMemory(device, shadow_map, memory, 0);
// Manually insert barriers
VkImageMemoryBarrier barrier{};
barrier.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
barrier.newLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;
// ... 15 more lines of barrier setup
vkCmdPipelineBarrier(cmd,
VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT,
0, 0, nullptr, 0, nullptr, 1, &barrier);
// Later: manual cleanup
vkDestroyImage(device, shadow_map, nullptr);
vkFreeMemory(device, memory, nullptr);
With render graph + VMA:
// Declare shadow map - VMA handles allocation, graph handles barriers
auto shadow_map = builder.create_texture("shadow_map", TextureDescriptor{
.width = 2048, .height = 2048,
.format = VK_FORMAT_D32_SFLOAT,
.usage = VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT | VK_IMAGE_USAGE_SAMPLED_BIT
});
builder.add_pass("shadow_pass")
.write(shadow_map, AttachmentLoadOp::Clear, AttachmentStoreOp::Store)
.execute([](auto& ctx) { /* rendering */ });
// Cleanup is automatic when the graph is destroyed
From ~60 lines of error-prone code to ~5 lines of declarative intent. The system automatically:
- Selects optimal memory type via VMA
- Allocates with proper alignment via VMA
- Inserts layout transition barriers via render graph
- Deallocates when resource lifetime ends
- Reuses memory for subsequent frames
Integration Architecture
Our integration follows this flow:
RenderSystem::init()
├─> Initialize VulkanDevice (M3)
├─> Initialize VMA Allocator (M4 Phase 1b)
│ └─> Load function pointers
│ └─> Create VmaAllocator with device handles
└─> Initialize RenderGraph (M4 Phase 1a)
└─> Store device/physical device references
RenderSystem::render_frame()
├─> Begin frame graph construction
├─> Declare passes and resources
├─> End frame (triggers compilation)
│ ├─> Topological sort
│ ├─> Lifetime analysis
│ ├─> Barrier insertion
│ └─> Memory allocation (via VMA)
└─> Execute graph
└─> Issue Vulkan commands
RenderSystem::shutdown()
├─> Destroy render graph
├─> Shutdown VMA allocator
└─> Destroy Vulkan device
The ordering is critical:
- Device must exist before VMA/graph
- Graph/VMA must be destroyed before device
- All GPU operations must complete before shutdown
Real-World Performance Characteristics
Our testing methodology and results:
Test Configuration
- Hardware: NVIDIA RTX 3070 Ti (8GB VRAM), Intel Core i7-11700
- API: Vulkan 1.2.198
- Resolution: 1920x1080
- Scenario: Deferred rendering with 3 shadow maps, screen-space reflections, bloom
Metrics Measured
Frame time breakdown:
- CPU graph construction: 0.12ms
- CPU graph compilation: 0.08ms
- GPU execution: 8.3ms
- Total: 8.5ms (117 FPS)
Memory efficiency:
- Without VMA: 847MB allocated, 623MB used (26% waste)
- With VMA: 643MB allocated, 619MB used (3.7% waste)
- Savings: 204MB (24% reduction)
Allocation count:
- Without VMA: 2,341 Vulkan allocations (approaching 4096 limit)
- With VMA: 47 Vulkan allocations (managed via suballocation)
- Reduction: 98% fewer allocations
Barrier correctness:
- Manual implementation: 3 synchronization bugs found in testing
- Render graph: 0 synchronization bugs (automatic correctness)
Performance Analysis
The 24% memory reduction comes from:
- Suballocation: Sharing large blo