

Walk into any modern AI lab, data center, or autonomous vehicle development environment, and you'll hear engineers talk endlessly about FLOPS, TOPS, sparsity, quantization, and model scaling laws. Those metrics dominate headlines and product datasheets. But spend time with the people actually building or optimizing these systems, and a different truth emerges: raw arithmetic capability is not what governs real-world performance.
What matters most is how efficiently data moves. And for most of today's AI accelerators, data movement is tangled up with something rarely discussed outside compiler and hardware circles: memory swizzling.
Memory swizzling is one of the biggest unseen taxes paid by modern AI systems. It doesn't enhance algorithmic processing efficiency. It doesn't improve accuracy. It doesn't lower energy consumption. It doesn't produce any new insight. Rather, it exists solely to compensate for architectural limitations inherited from decades-old design choices. And as AI models grow larger and more irregular, the cost of this tax is growing.
This article looks at why swizzling exists, how we got here, what it costs us, and how a fundamentally different architectural philosophy, specifically, a register-centric model, removes the need for swizzling entirely.
The problem nobody talks about: Data isn't stored the way hardware needs it
In any AI tutorial, tensors are presented as ordered mathematical objects that sit neatly in memory in perfect layouts. These layouts are intuitive for programmers, and they fit nicely into high-level frameworks like PyTorch or TensorFlow.
The hardware doesn't see the world this way.
Modern accelerators, whether GPUs, TPUs, or NPUs, are built around parallel compute units that expect data in specific shapes: tiles of fixed size, strict alignment boundaries, predictable stride patterns, and arrangements that map onto memory banks without conflicts.
Unfortunately, real-world tensors never arrive in those formats. Before processing even begins, data must be reshaped, re-tiled, re-ordered, or re-packed into the format the hardware expects. That reshaping is called memory swizzling.
You may think of it this way: the algorithm thinks in terms of matrices and tensors; the computing hardware thinks in terms of tiles, lanes, and banks. Swizzling is the translation layer, and that translation costs time and energy.
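To make the idea concrete, here is a minimal sketch in Python/NumPy of what a typical swizzle looks like: a row-major matrix is repacked into the tile-major layout a matrix engine expects. The matrix size and tile size are illustrative assumptions, not the parameters of any particular chip. No arithmetic happens; the data is only copied into a different order.

```python
# Minimal sketch: repack a row-major matrix into a tile-major layout.
# Matrix size and tile size are illustrative assumptions, not hardware specs.
import numpy as np

TILE = 4                                                    # assumed tile edge
A = np.arange(16 * 16, dtype=np.float32).reshape(16, 16)    # logical, row-major layout

# View the matrix as a grid of TILE x TILE tiles, then lay each tile out
# contiguously: (tile_row, row_in_tile, tile_col, col_in_tile) ->
# (tile_row, tile_col, row_in_tile, col_in_tile).
tiles = A.reshape(16 // TILE, TILE, 16 // TILE, TILE)
A_tile_major = np.ascontiguousarray(tiles.transpose(0, 2, 1, 3))

# Same values, same byte count -- but every element was read once and
# written once just to change its position in memory.
print(A.nbytes, A_tile_major.nbytes)
```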
Why hierarchical memory forces us to swizzle
Virtually every accelerator today uses a hierarchical memory stack. From top to bottom, its layers are registers; shared or scratchpad memory; L1, L2, and sometimes L3 caches; high-bandwidth memory (HBM); and, at the bottom of the stack, external dynamic random-access memory (DRAM).
Each level has a different size, latency, bandwidth, and access energy, and, rather importantly, its own alignment constraints. This is a legacy of CPU-style architecture, where caches hide memory latency. See Figure 1 and Table 1.

Figure 1 Capacity and bandwidth attributes of a typical hierarchical memory stack, as found in all current hardware processors. Source: VSORA

Table 1 Capacity, latency, bandwidth, and access energy of a typical hierarchical memory stack, as found in all current hardware processors. Source: VSORA
GPUs inherited this model, then added single-instruction, multiple-thread (SIMT) execution on top. That makes them phenomenally powerful, but also extremely sensitive to how data is laid out. If neighboring threads in a warp don't access neighboring memory locations, performance drops dramatically. If tile boundaries don't line up, tensor cores stall. If shared memory bank conflicts occur, everything waits.
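The bank-conflict point is worth illustrating. The sketch below (plain Python, using the 32-bank, 4-byte-word organization of current Nvidia shared memory and an assumed 32-wide tile) models why shared-memory tiles are commonly XOR-swizzled: without the swizzle, a warp reading one column of a tile hits the same bank 32 times and serializes.

```python
# Minimal sketch: why GPU shared-memory tiles are often XOR-swizzled.
# Shared memory is split into 32 four-byte banks; a warp stalls when
# several of its 32 threads hit the same bank in the same access.

TILE = 32          # 32 x 32 tile of 4-byte elements (assumed tile size)
NUM_BANKS = 32

def bank(row, col):                 # naive row-major placement
    return (row * TILE + col) % NUM_BANKS

def bank_swizzled(row, col):        # XOR swizzle: permute columns per row
    return (row * TILE + (col ^ row)) % NUM_BANKS

# One warp (32 threads) reads one *column* of the tile:
naive    = {bank(r, 0)          for r in range(TILE)}
swizzled = {bank_swizzled(r, 0) for r in range(TILE)}

print(len(naive))     # 1  -> all 32 threads collide on a single bank
print(len(swizzled))  # 32 -> accesses spread across all banks, no conflict
```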
TPUs suffer from similar constraints, just with different mechanics. Their systolic arrays operate like tightly choreographed conveyor belts. Data must arrive in the right order and at the right time. If weights are not arranged in block-major format, the systolic fabric can't operate efficiently.
NPU-based accelerators, from smartphone chips to automotive systems, face the same issues: multi-bank SRAMs, fixed vector widths, and 2D locality requirements for vision workloads. Without swizzling, data arrives "misaligned" for the compute engine, and performance nosedives.
In all these cases, swizzling is not an optimization; it's a survival mechanism.
The hidden costs of swizzling
Swizzling takes time, sometimes a lot
In real workloads, swizzling often consumes 20% to 60% of the total runtime. That's not a typo. In a convolutional neural network, half the time may be spent on NHWC-to-NCHW conversions, that is, translating between two different ways of laying out 4D tensors in memory. In a transformer, vast amounts of time are wasted reshaping Q/K/V tensors, splitting heads, repacking tiles for GEMMs, and reorganizing outputs.
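The two examples below (Python/NumPy, with hypothetical shapes) show what those conversions look like in practice; each transpose followed by a contiguous copy is pure data movement, with no useful arithmetic attached.

```python
# Minimal sketch: two swizzles that appear constantly in real networks,
# an NCHW -> NHWC layout change and a transformer Q/K/V head split.
# Shapes are hypothetical; neither operation computes anything.
import numpy as np

# Convolution input: NCHW (framework layout) -> NHWC (what the engine wants)
x = np.zeros((8, 64, 56, 56), dtype=np.float32)              # N, C, H, W
x_nhwc = np.ascontiguousarray(x.transpose(0, 2, 3, 1))        # full copy of the tensor

# Transformer activations: split the hidden dimension into heads before attention
B, S, H, D = 8, 1024, 16, 64                                   # batch, seq, heads, head_dim
q = np.zeros((B, S, H * D), dtype=np.float32)
q_heads = np.ascontiguousarray(
    q.reshape(B, S, H, D).transpose(0, 2, 1, 3))               # (B, H, S, D), another full copy
```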
Swizzling burns energy and energy is the real limiter
A single MAC consumes roughly a quarter of a picojoule, while fetching a value from DRAM can cost around 500 picojoules. In other words, moving data from DRAM dissipates roughly three orders of magnitude more energy than performing a basic multiply-accumulate operation.
Swizzling requires reading large blocks of data, rearranging them, and writing them back. And this often happens multiple times per layer. When 80% of your energy budget goes to moving data rather than computing on it, swizzling becomes impossible to ignore.
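A back-of-envelope calculation using the figures above makes the point. The energy numbers are the article's illustrative ones, not measurements, and the arithmetic intensity is an assumed value.

```python
# Back-of-envelope sketch using the figures quoted above (0.25 pJ per MAC,
# ~500 pJ per value fetched from DRAM). Values are illustrative, not measured.
E_MAC  = 0.25e-12   # joules per multiply-accumulate
E_DRAM = 500e-12    # joules per value moved to/from DRAM

print(E_DRAM / E_MAC)        # ~2000x: one DRAM access costs thousands of MACs

# If a swizzle forces each value through one extra DRAM round trip (read + write):
macs_per_value = 100         # assumed arithmetic intensity
compute  = macs_per_value * E_MAC
movement = 2 * E_DRAM
print(movement / (movement + compute))   # ~0.98: data movement dominates the budget
```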
Swizzling inflates memory usage
Most swizzling requires temporary buffers: packed tiles, staging buffers, and reshaped tensors. These extra memory footprints can push models over the limits of L2, L3, or even HBM, forcing even more data movement.
Swizzling makes software harder and less portable
Ask a CUDA engineer what keeps them up at night. Ask a TPU compiler engineer why XLA is thousands of pages deep in layout-inference code. Ask anyone who writes NPU kernels for mobile why they dread channel permutations.
It's swizzling. The software must carry enormous complexity because the hardware demands very specific layouts. And every new model architecture, whether CNNs, LSTMs, transformers, or diffusion models, adds new layout patterns that must be supported.
The result is an ecosystem glued together by layout heuristics, tensor transformations, and performance-sensitive memory choreography.
How major architectures became dependent on swizzling
- Nvidia GPUs
Tensor cores require specific tile-major layouts. Shared memory is banked, so avoiding conflicts requires swizzling. Warps must coalesce memory accesses; otherwise, efficiency tanks. Even cuBLAS and cuDNN, the most optimized GPU libraries on Earth, are filled with internal swizzling kernels.
- Google TPUs
TPUs rely on systolic arrays. The flow of data through these arrays must be perfectly ordered. Weights and activations are constantly rearranged to align with the systolic fabric. Much of XLA exists simply to manage data layout.
- AMD CDNA, ARM Ethos, Apple ANE, and Qualcomm AI engine
Every one of these architectures performs swizzling in one form or another: Morton tiling (sketched below), interleaving, channel stacking, and so on. It's a universal pattern. Every architecture that uses hierarchical memory inherits the need for swizzling.
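As one concrete example from that list, Morton (Z-order) tiling builds each element's linear index by interleaving the bits of its row and column coordinates, which keeps 2D neighbors close together in linear memory. A minimal sketch follows (plain Python, with coordinates assumed to fit in 16 bits).

```python
# Minimal sketch: Morton (Z-order) indexing, one common swizzle pattern.
# The linear index is formed by interleaving the bits of row and column.
def morton_index(row: int, col: int) -> int:
    index = 0
    for bit in range(16):                       # assume coordinates fit in 16 bits
        index |= ((row >> bit) & 1) << (2 * bit + 1)
        index |= ((col >> bit) & 1) << (2 * bit)
    return index

# The four elements of a 2x2 block land next to each other in memory:
print([morton_index(r, c) for r in range(2) for c in range(2)])   # [0, 1, 2, 3]
```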
A different philosophy: Eliminating swizzling at the root
Now imagine stepping back and rethinking AI hardware from first principles. Instead of accepting today's complex memory hierarchies as unavoidable, with their layers of caches, shared-memory blocks, banked SRAMs, and alignment rules, imagine an architecture built on a far simpler premise.
What if there were no memory hierarchy at all? What if, instead, the entire system revolved around a vast, flat expanse of registers? What if the compiler, not the hardware, orchestrated every data movement with deterministic precision? And what if all the usual anxieties (alignment, bank conflicts, tiling strategies, and coalescing rules) simply disappeared because they no longer mattered?
This is the philosophy behind a register-centric architecture. Rather than being pushed up and down a ladder of memory levels, data simply resides in the registers where computation occurs. The architecture is organized not around the movement of data, but around its availability.
That means:
- No caches to warm up or miss
- No warps to schedule
- No bank conflicts to avoid
- No tile sizes to match
- No tensor layouts to respect
- No sensitivity to shapes or strides, and therefore no swizzling at all
In such a system, the compiler always knows exactly where each value lives, and exactly where it needs to be next. It doesn't speculate, prefetch, tile, or rely on heuristics. It doesn't cross its fingers hoping the hardware behaves. Instead, data placement becomes a solvable, predictable problem.
The result is a machine where throughput remains stable, latency becomes predictable, and energy consumption collapses because unnecessary data motion has been engineered out of the loop. It's a system where performance is no longer dominated by memory gymnastics, and where computing, the actual math, finally takes center stage.
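To illustrate the idea, and only the idea (this is a toy, not a description of any particular product's toolchain), the sketch below shows what compile-time data placement might look like: every intermediate value is assigned a register and a cycle statically, so there is nothing left for the hardware to guess at runtime. The register count and scheduling policy are arbitrary assumptions.

```python
# Conceptual sketch (entirely hypothetical): static, compiler-owned placement.
# Instead of runtime caches and layout heuristics, placement is a table
# computed once at compile time.
from dataclasses import dataclass

@dataclass(frozen=True)
class Placement:
    register: int        # which register holds the value (assumed flat register file)
    cycle: int           # cycle at which the value must be available

def plan(dataflow):
    """Assign each intermediate value a register and a cycle, statically."""
    schedule = {}
    for step, (name, _op) in enumerate(dataflow):
        schedule[name] = Placement(register=step % 4096, cycle=step)   # 4096 is arbitrary
    return schedule

# Toy dataflow: placement is fully resolved before anything runs.
print(plan([("a", "load"), ("b", "mul"), ("c", "add")]))
```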
The future of AI: Why a register-centric architecture matters
As AI systems evolve, the tidy world of uniform tensors and perfectly rectangular compute tiles is steadily falling away. Modern models are no longer predictable stacks of dense layers marching in lockstep. Instead, they expand in every direction: they ingest multimodal inputs, incorporate sparse and irregular structures, reason adaptively, and operate across ever-longer sequences. They must also respond in real time for safety-critical applications, and they must do so within tight energy budgets, from cars to edge devices.
In other words, the assumptions that shaped GPU and TPU architectures, namely the expectation of regularity, dense grids, and neat tiling, are eroding. Future workloads are simply not shaped the way the hardware wants them to be.
A register-centric architecture offers a fundamentally different path. Because it operates directly on data where it lives, rather than forcing that data into tile-friendly formats, it sidesteps the entire machinery of memory swizzling. It does not depend on fixed tensor shapes.
It doesn't stumble when access patterns become irregular or dynamic. It avoids the costly dance of rearranging data just to satisfy the compute units. And as models grow more heterogeneous and more sophisticated, such an architecture scales with their complexity instead of fighting against it.
This is more than an incremental improvement. It represents a shift in how we think about AI compute. By eliminating unnecessary data movement, the single largest bottleneck and energy sink in modern accelerators, a register-centric approach aligns hardware with the messy, evolving reality of AI itself.
Memory swizzling is the quiet tax that every hierarchical-memory accelerator pays. It is fundamental to how GPUs, TPUs, NPUs, and nearly all AI chips operate. It's also a growing liability. Swizzling introduces latency, burns energy, bloats memory usage, and complicates software, all while contributing nothing to the actual math.
A register-centric architecture eliminates swizzling at the root by removing the hierarchy that makes it necessary. It replaces guesswork and heuristics with deterministic dataflow. It prioritizes locality without requiring rearrangement. It lets the algorithm drive the hardware, not vice versa.
As AI workloads become more irregular, dynamic, and power-sensitive, architectures that keep data stationary and predictable, rather than endlessly reshuffling it, will define the next generation of compute.
Swizzling was a necessary patch for the last era of hardware. It should not define the next one.
Lauro Rizzatti is a business advisor to VSORA, a technology company offering silicon semiconductor solutions that redefine performance. He is a noted chip design verification consultant and industry expert on hardware emulation.