Introduction
This year has been a period of relentless performance optimization and systematic consolidation of both our deep learning framework, Burn, and our high-performance compute language, CubeCL. The original goal of Burn was to offer optimal performance and portability without compromising on flexibility, a vision that led us to create CubeCL in 2024. Since then, we have evolved the initial prototype into the robust, high-performance system that now serves as the foundation of our entire stack.
Revisiting Our Goals
Last year, we set out with a clear objective: bringing Burn to any hardware, from embedded devices to large GPU clusters. We then started our work on supporting multiple GPUs, seeking a general solution that could function across all CubeCL-based backends while maintaining optimal performance.
While we didn’t implement efficient multi-cluster training, we drastically improved our deferred execution paradigm, enabling seamless data transitions between multiple GPUs on a single node. By leveraging lazy execution, we managed to push GPU synchronization to the lowest level, abstracting away GPU event queues and complex multi-stream management. You can refer to the Burn 0.19.0 Release for more details on the implementation.
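For illustration, here is roughly what this looks like from the user's side: a tensor is created and transformed on one GPU, then moved to another to continue the computation, with synchronization handled entirely by the backend. This is a minimal sketch that assumes a machine with two discrete GPUs and uses the wgpu backend; the device indices are illustrative.

```rust
use burn::backend::{wgpu::WgpuDevice, Wgpu};
use burn::tensor::{Distribution, Tensor};

fn main() {
    // Two GPUs on the same node (indices are illustrative).
    let gpu0 = WgpuDevice::DiscreteGpu(0);
    let gpu1 = WgpuDevice::DiscreteGpu(1);

    // Build and transform a tensor on the first GPU.
    let x = Tensor::<Wgpu, 2>::random([2048, 2048], Distribution::Default, &gpu0);
    let y = x.clone().matmul(x);

    // Move the result to the second GPU and keep computing there.
    // With deferred execution, the required synchronization happens at the
    // backend level (event queues, streams), not in user code.
    let z = y.to_device(&gpu1).mul_scalar(2.0);
    println!("{:?}", z.dims());
}
```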
On top of that, we set out to support quantization on any hardware. A major priority was the user experience, ensuring that any model could be quantized with minimal code changes, thereby simplifying the transition to production with a significantly reduced memory footprint and improved performance. You can refer to the Burn 0.19.0 Release for more details on these new APIs. While we successfully achieved this goal, the implementation revealed a new challenge: a combinatorial explosion of supported types. This drastically increased the complexity of our systems, requiring significant changes to our architecture.
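For context, weight quantization is meant to stay a few lines of user code. The sketch below follows the shape of Burn's earlier quantization documentation; the `Quantizer`, `Calibration`, and `QuantizationScheme` names are assumptions here and may not match the new 0.19.0 APIs exactly.

```rust
use burn::module::{Module, Quantizer};
use burn::tensor::backend::Backend;
use burn::tensor::quantization::{Calibration, QuantizationScheme, QuantizationType};

/// Quantize a trained model's weights with a minimal code change.
/// NOTE: type and method names are assumptions based on earlier Burn docs;
/// see the Burn 0.19.0 release notes for the exact API.
fn quantize_model<B: Backend, M: Module<B>>(model: M) -> M {
    let mut quantizer = Quantizer {
        calibration: Calibration::MinMax,
        scheme: QuantizationScheme::PerTensorSymmetric(QuantizationType::QInt8),
    };
    model.quantize_weights(&mut quantizer)
}
```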
Reliability by Design
At the beginning of last year, we struggled with performance regressions, platform-specific bugs, and keeping our CI alive. Creating releases took weeks of stabilization to ensure the quality met our standards. This process was both inefficient and frustrating, and we needed a better way to ensure that the continuous integration of new features and performance improvements didn’t compromise existing workflows.
We upgraded our CI to run on a variety of physical GPUs rather than relying solely on wgpu virtual runners. This was necessary since many of our optimizations leverage tensor cores on recent GPUs. However, this was only part of the problem. The real challenge lies in selecting the right kernel for the right problem on the right hardware. This selection is far from trivial, and our solution is based on autotuning (running micro-benchmarks) at runtime. To keep autotuning times manageable, we developed a highly hardware-dependent priority system, now thoroughly tested with our burn-bench utilities.
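To make the selection process concrete, here is the principle in plain Rust: candidates are ordered by a hardware-dependent priority, only the most promising ones are micro-benchmarked, and the winner is cached per problem key. This is a sketch of the idea, not CubeCL's actual autotune code.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// A candidate kernel with a hardware-dependent priority (lower = try first).
struct Candidate {
    name: &'static str,
    priority: u8,
    run: fn(&[f32]) -> f32,
}

/// Benchmark the highest-priority candidates once per problem key and cache the winner.
fn autotune(
    key: String,
    candidates: &mut [Candidate],
    cache: &mut HashMap<String, &'static str>,
    input: &[f32],
) -> &'static str {
    if let Some(&winner) = cache.get(&key) {
        return winner; // Already tuned for this problem/hardware combination.
    }
    // Hardware-dependent priorities keep the number of micro-benchmarks small.
    candidates.sort_by_key(|c| c.priority);
    let winner = candidates
        .iter()
        .take(3) // Tuning budget: only the most promising candidates are timed.
        .min_by_key(|c| {
            let start = Instant::now();
            std::hint::black_box((c.run)(input));
            start.elapsed()
        })
        .map(|c| c.name)
        .expect("at least one candidate kernel");
    cache.insert(key, winner);
    winner
}
```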
To improve both reliability and user experience, we have switched to a time-based release cycle. Every week, we ship a pre-release for all our projects, running our full suite of correctness and performance regression tests. This ensures multi-platform bugs are found and fixed quickly with minimal pain. These reliability requirements are what ultimately led us to heavily refactor CubeCL to improve how we program and select optimal kernels.
CubeCL Architecture
Over the past year, we have significantly expanded our hardware support with a new MLIR/CPU compiler coupled with a high-performance Rust runtime. We also added a Metal compiler to our existing wgpu runtime, completing platform support for all major GPU vendors and CPU architectures.
The prevailing consensus in the industry is that it is impossible to abstract over GPU and CPU programming without sacrificing performance. Through intentional design, however, we have proven otherwise.
We found a maintainable way to write high-performance compute kernels in CubeCL that perform optimally on both GPUs and CPUs. The key was leveraging comptime to specialize kernels to their plane size (known as a subgroup in Vulkan or a warp in CUDA) and their line size (the width of the SIMD vector).
A critical design decision this year was setting the plane size to one for our CPU runtime, rather than simulating GPU-style execution. This approach, combined with a specialized Cube launching configuration, allows us to maximize memory coalescing on GPUs while optimizing for cache line alignment on CPUs. As a small team, we cannot afford to rewrite every algorithm for every backend, but these architectural choices ensure most existing kernels excel on CPUs with minimal modification.
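To make the idea concrete without reproducing CubeCL syntax, the plain-Rust sketch below shows the same pattern with a const-generic line size: the kernel body is written once, and the compiler emits a specialized version per SIMD width, while a plane size of one on CPU simply means no cross-lane cooperation is required. This illustrates the principle only; it is not actual CubeCL code.

```rust
/// A kernel body written once, specialized at compile time for each line size.
/// On GPU the line size maps to vectorized loads that keep accesses coalesced;
/// on CPU it maps to SIMD-friendly, cache-line-aligned chunks.
fn scale_kernel<const LINE_SIZE: usize>(input: &[f32], output: &mut [f32], factor: f32) {
    for (inp, out) in input
        .chunks_exact(LINE_SIZE)
        .zip(output.chunks_exact_mut(LINE_SIZE))
    {
        // With a plane size of one (CPU), each unit owns its whole line,
        // so no cross-lane synchronization is needed here.
        for i in 0..LINE_SIZE {
            out[i] = inp[i] * factor;
        }
    }
}

fn main() {
    let input = vec![1.0_f32; 1024];
    let mut output = vec![0.0_f32; 1024];
    // The line size would normally be chosen per target (e.g. 4 on GPU, 8 with AVX2).
    scale_kernel::<8>(&input, &mut output, 2.0);
    assert!(output.iter().all(|&v| v == 2.0));
}
```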
Benchmark: Max Pool 2D
| Backend | Feature | Shape | Median Time |
|---|---|---|---|
| CubeCL (CPU) | cpu | (2, 128, 512, 512) | 5.734ms |
| LibTorch (CPU) | tch-cpu | (2, 128, 512, 512) | 18.505ms |
| ndarray | ndarray-simd | (2, 128, 512, 512) | 80.298ms |
| ndarray | ndarray | (2, 128, 512, 512) | 1.085s |
| - | - | - | - |
| CubeCL (CPU) | cpu | (2, 32, 512, 512) | 4.657ms |
| LibTorch (CPU) | tch-cpu | (2, 32, 512, 512) | 16.961ms |
| ndarray | ndarray-simd | (2, 32, 512, 512) | 26.039ms |
| ndarray | ndarray | (2, 32, 512, 512) | 851.380ms |
Median execution time for max_pool2d operations across different Burn backends and features.
While many operations still require refactoring to reach this level of efficiency, we are incredibly confident in this new approach and excited about future performance gains.
Burn Generalization
This year, we have also been working hard to ensure Burn’s training utilities support any learning paradigm, as flexibility is core to our value proposition. We’ve redesigned our abstractions to allow for endless customization in training logic while minimizing infrastructure overhead.
To achieve this, we are introducing a new concept: the TrainingParadigm. Our goal is to support everything from supervised learning to reinforcement learning, all sharing the same robust infrastructure backbone. Despite these internal changes, the migration for users will be straightforward:
```rust
let training = SupervisedTraining::new(ARTIFACT_DIR, dataloader_train, dataloader_valid)
    .metrics((AccuracyMetric::new(), LossMetric::new()))
    .with_file_checkpointer(CompactRecorder::new())
    .early_stopping(early_stopping)
    .num_epochs(config.num_epochs)
    .summary()
    .with_learning_strategy(TrainingStrategy::SingleDevice(device));

// For custom supervised training:
// .with_training_strategy(TrainingStrategy::CustomSingleDevice(Arc::new(
//     MyCustomLearningStrategy::new(device),
// )));

let result = training.run(Learner::new(
    model,
    config.optimizer.init(),
    lr_scheduler.init().unwrap(),
));
```
By decoupling the Learner (the model being trained) from the Feedback Providers (e.g., dataloaders for supervised learning or environments for RL), we’ve created a much more flexible architecture. The API will continue to evolve over the next few releases, but we plan to stabilize it as quickly as possible.
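One way to picture this decoupling is a small trait boundary between the thing being optimized and the thing producing feedback. The trait and method names below are hypothetical and only convey the shape of the design, not Burn's actual training API.

```rust
/// Hypothetical sketch: the model being trained.
trait Learner {
    type Feedback;
    /// Consume one unit of feedback (a batch, an episode, ...) and update the model.
    fn step(&mut self, feedback: Self::Feedback);
}

/// Hypothetical sketch: where feedback comes from, e.g. a dataloader
/// in supervised learning or an environment in reinforcement learning.
trait FeedbackProvider<L: Learner> {
    fn next(&mut self, learner: &L) -> Option<L::Feedback>;
}

/// A training paradigm only wires the two together; the infrastructure
/// (checkpointing, metrics, devices) stays shared across paradigms.
fn train<L: Learner, P: FeedbackProvider<L>>(mut learner: L, mut provider: P) -> L {
    while let Some(feedback) = provider.next(&learner) {
        learner.step(feedback);
    }
    learner
}
```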
To ensure the design is battle-tested, we are integrating new paradigms directly into Burn, including Online RL, Offline RL, and potentially evolutionary algorithms like Evolution Strategies at the Hyperscale[1].
Announcing Burn Central
As some of you may know, Burn is backed by a company I founded (Tracel AI) to give the framework the resources it needs to reach its full potential. I’ve always envisioned a business model based on adding value through complementarity, rather than restricting features behind a paywall. Today, I’m excited to share our latest project: Burn Central, a new cloud platform that simplifies how Burn users train and deploy their models, no matter where they run.
We will release the Burn Central Alpha early next year and are now opening up a waiting list to provide early access to the platform. Our goal is to gather feedback and continuously evolve the product to meet the community’s needs.
Burn Central is designed to support many use cases. To that end, we are introducing three ways to use the platform:
- Free Plan: Access almost all platform features while using your own local computer for training and inference.
- Pay-as-you-Go: For those who want to scale instantly, this plan makes it trivial to provision cloud GPUs. We handle the orchestration, and you only pay for the compute you use.
- Pro Plan: Truly embracing the multi-platform nature of Burn, this plan allows you to "plug in" your own remote compute clusters and private storage.
Burn Central - Alpha Preview
Conclusion
We envision a future where the Rust AI and HPC ecosystem matures to the point where it challenges the long-standing dominance of C++ and Python. The potential is there, and we’re working incredibly hard to make it a reality. We also envision a future where AI models are accessible to everyone, where anyone can train and deploy models on their private data using the hardware they already have, without restriction.
And on that note, happy new year 2026!