

Walk into any modern AI lab, data center, or autonomous vehicle development environment, and you'll hear engineers talk endlessly about FLOPS, TOPS, sparsity, quantization, and model scaling laws. Those metrics dominate headlines and product datasheets. But spend time with the people actually building or optimizing these systems, and a different truth emerges: raw arithmetic capability is not what governs real-world performance.
What matters most is how efficiently data moves. And for most of today's AI accelerators, data movement is tangled up with something rarely discussed outside compiler and hardware circles: memory swizzling.
Memory swizzling is one of the biggest unseen taxes paid by modern AI systems. It doesn't enhance algorithmic processing efficiency. It doesn't improve accuracy. It doesn't lower energy consumption. It doesn't produce any new insight. Rather, it exists solely to compensate for architectural limitations inherited from decades-old design choices. And as AI models grow larger and more irregular, the cost of this tax is growing.
This article looks at why swizzling exists, how we got here, what it costs us, and how a fundamentally different architectural philosophy, specifically, a register-centric model, removes the need for swizzling entirely.
The problem nobody talks about: Data isn't stored the way hardware needs it
In any AI tutorial, tensors are presented as ordered mathematical objects that sit neatly in memory in perfect layouts. These layouts are intuitive for programmers, and they fit nicely into high-level frameworks like PyTorch or TensorFlow.
The hardware doesn't see the world this way.
Modern accelerators, whether GPUs, TPUs, or NPUs, are built around parallel compute units that expect data in specific shapes: tiles of fixed size, strict alignment boundaries, predictable stride patterns, and arrangements that map onto memory banks without conflicts.
Unfortunately, real-world tensors never arrive in those formats. Before processing even begins, data must be reshaped, re-tiled, re-ordered, or re-packed into the format the hardware expects. That reshaping is called memory swizzling.
You may think of it this way: the algorithm thinks in terms of matrices and tensors; the computing hardware thinks in terms of tiles, lanes, and banks. Swizzling is the translation layer, and that translation costs time and energy.
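To make the idea concrete, here is a minimal sketch in Python/NumPy of what a typical swizzle looks like: a row-major matrix is repacked into the tile-major layout a matrix engine expects. The matrix size and tile size are illustrative assumptions, not the parameters of any particular chip. No arithmetic happens; the data is only copied into a different order.

```python
# Minimal sketch: repack a row-major matrix into a tile-major layout.
# Matrix size and tile size are illustrative assumptions, not hardware specs.
import numpy as np

TILE = 4                                                    # assumed tile edge
A = np.arange(16 * 16, dtype=np.float32).reshape(16, 16)    # logical, row-major layout

# View the matrix as a grid of TILE x TILE tiles, then lay each tile out
# contiguously: (tile_row, row_in_tile, tile_col, col_in_tile) ->
# (tile_row, tile_col, row_in_tile, col_in_tile).
tiles = A.reshape(16 // TILE, TILE, 16 // TILE, TILE)
A_tile_major = np.ascontiguousarray(tiles.transpose(0, 2, 1, 3))

# Same values, same byte count -- but every element was read once and
# written once just to change its position in memory.
print(A.nbytes, A_tile_major.nbytes)
```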
Why hierarchical memory forces us to swizzle
Virtually every accelerator today uses a hierarchical memory stack. From top to bottom, its layers are registers; shared or scratchpad memory; L1, L2, and sometimes L3 caches; high-bandwidth memory (HBM); and, at the bottom of the stack, external dynamic random-access memory (DRAM).
Each level has a different size, latency, bandwidth, and access energy, and, rather importantly, its own alignment constraints. This is a legacy of CPU-style architecture, where caches hide memory latency. See Figure 1 and Table 1.

Figure 1 Capacity and bandwidth attributes of a typical hierarchical memory stack, as found in all current hardware processors. Source: VSORA

Table 1 Capacity, latency, bandwidth, and access energy of a typical hierarchical memory stack, as found in all current hardware processors. Source: VSORA
GPUs inherited this model, then added single-instruction, multiple-thread (SIMT) execution on top. That makes them phenomenally powerful, but also extremely sensitive to how data is laid out. If neighboring threads in a warp don't access neighboring memory locations, performance drops dramatically. If tile boundaries don't line up, tensor cores stall. If shared memory bank conflicts occur, everything waits.
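The bank-conflict point is worth illustrating. The sketch below (plain Python, using the 32-bank, 4-byte-word organization of current Nvidia shared memory and an assumed 32-wide tile) models why shared-memory tiles are commonly XOR-swizzled: without the swizzle, a warp reading one column of a tile hits the same bank 32 times and serializes.

```python
# Minimal sketch: why GPU shared-memory tiles are often XOR-swizzled.
# Shared memory is split into 32 four-byte banks; a warp stalls when
# several of its 32 threads hit the same bank in the same access.

TILE = 32          # 32 x 32 tile of 4-byte elements (assumed tile size)
NUM_BANKS = 32

def bank(row, col):                 # naive row-major placement
    return (row * TILE + col) % NUM_BANKS

def bank_swizzled(row, col):        # XOR swizzle: permute columns per row
    return (row * TILE + (col ^ row)) % NUM_BANKS

# One warp (32 threads) reads one *column* of the tile:
naive    = {bank(r, 0)          for r in range(TILE)}
swizzled = {bank_swizzled(r, 0) for r in range(TILE)}

print(len(naive))     # 1  -> all 32 threads collide on a single bank
print(len(swizzled))  # 32 -> accesses spread across all banks, no conflict
```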
TPUs suffer from similar constraints, just with different mechanics. Their systolic arrays operate like tightly choreographed conveyor belts. Data must arrive in the right order and at the right time. If weights are not arranged in block-major format, the systolic fabric can't operate efficiently.
NPU-based accelerators, from smartphone chips to automotive systems, face the same issues: multi-bank SRAMs, fixed vector widths, and 2D locality requirements for vision workloads. Without swizzling, data arrives "misaligned" for the compute engine, and performance nosedives.
In all these cases, swizzling is not an optimization; it's a survival mechanism.
The hidden costs of swizzling
Swizzling takes time, sometimes a lot
In real workloads, swizzling often consumes 20% to 60% of the total runtime. That's not a typo. In a convolutional neural network, half the time may be spent on NHWC-to-NCHW conversions, that is, translating between two different ways of laying out 4D tensors in memory. In a transformer, vast amounts of time are wasted reshaping Q/K/V tensors, splitting heads, repacking tiles for GEMMs, and reorganizing outputs.
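The two examples below (Python/NumPy, with hypothetical shapes) show what those conversions look like in practice; each transpose followed by a contiguous copy is pure data movement, with no useful arithmetic attached.

```python
# Minimal sketch: two swizzles that appear constantly in real networks,
# an NCHW -> NHWC layout change and a transformer Q/K/V head split.
# Shapes are hypothetical; neither operation computes anything.
import numpy as np

# Convolution input: NCHW (framework layout) -> NHWC (what the engine wants)
x = np.zeros((8, 64, 56, 56), dtype=np.float32)              # N, C, H, W
x_nhwc = np.ascontiguousarray(x.transpose(0, 2, 3, 1))        # full copy of the tensor

# Transformer activations: split the hidden dimension into heads before attention
B, S, H, D = 8, 1024, 16, 64                                   # batch, seq, heads, head_dim
q = np.zeros((B, S, H * D), dtype=np.float32)
q_heads = np.ascontiguousarray(
    q.reshape(B, S, H, D).transpose(0, 2, 1, 3))               # (B, H, S, D), another full copy
```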
Swizzling burns energy and energy is the real limiter
A single MAC consumes roughly a quarter of a picojoule, while fetching a value from DRAM can cost around 500 picojoules. In other words, moving data from DRAM dissipates roughly three orders of magnitude more energy than performing a basic multiply-accumulate operation.
Swizzling requires reading large blocks of data, rearranging them, and writing them back. And this often happens multiple times per layer. When 80% of your energy budget goes to moving data rather than computing on it, swizzling becomes impossible to ignore.
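A back-of-envelope calculation using the figures above makes the point. The energy numbers are the article's illustrative ones, not measurements, and the arithmetic intensity is an assumed value.

```python
# Back-of-envelope sketch using the figures quoted above (0.25 pJ per MAC,
# ~500 pJ per value fetched from DRAM). Values are illustrative, not measured.
E_MAC  = 0.25e-12   # joules per multiply-accumulate
E_DRAM = 500e-12    # joules per value moved to/from DRAM

print(E_DRAM / E_MAC)        # ~2000x: one DRAM access costs thousands of MACs

# If a swizzle forces each value through one extra DRAM round trip (read + write):
macs_per_value = 100         # assumed arithmetic intensity
compute  = macs_per_value * E_MAC
movement = 2 * E_DRAM
print(movement / (movement + compute))   # ~0.98: data movement dominates the budget
```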
Swizzling inflates memory usage
Most swizzling requires temporary buffers: packed tiles, staging buffers, and reshaped tensors. These extra memory footprints can push models over the limits of L2, L3, or even HBM, forcing even more data movement.
Swizzling makes software harder and less portable
Ask a CUDA engineer what keeps them up at night. Ask a TPU compiler engineer why XLA is thousands of pages deep in layout-inference code. Ask anyone who writes NPU kernels for mobile why they dread channel permutations.
It's swizzling. The software must carry enormous complexity because the hardware demands very specific layouts. And every new model architecture, whether CNNs, LSTMs, transformers, or diffusion models, adds new layout patterns that must be supported.
The result is an ecosystem glued together by layout heuristics, tensor transformations, and performance-sensitive memory choreography.
How major architectures became dependent on swizzling
- Nvidia GPUs
Tensor cores require specific tile-major layouts. Shared memory is banked, so avoiding conflicts requires swizzling. Warps must coalesce memory accesses; otherwise, efficiency tanks. Even cuBLAS and cuDNN, the most optimized GPU libraries on Earth, are filled with internal swizzling kernels.
- Google TPUs
TPUs rely on systolic arrays. The flow of data through these arrays must be perfectly ordered. Weights and activations are constantly rearranged to align with the systolic fabric. Much of XLA exists simply to manage data layout.
- AMD CDNA, ARM Ethos, Apple ANE, and Qualcomm AI engine
Every one of these architectures performs swizzling in one form or another: Morton tiling (sketched below), interleaving, channel stacking, and so on. It's a universal pattern. Every architecture that uses hierarchical memory inherits the need for swizzling.
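As one concrete example from that list, Morton (Z-order) tiling builds each element's linear index by interleaving the bits of its row and column coordinates, which keeps 2D neighbors close together in linear memory. A minimal sketch follows (plain Python, with coordinates assumed to fit in 16 bits).

```python
# Minimal sketch: Morton (Z-order) indexing, one common swizzle pattern.
# The linear index is formed by interleaving the bits of row and column.
def morton_index(row: int, col: int) -> int:
    index = 0
    for bit in range(16):                       # assume coordinates fit in 16 bits
        index |= ((row >> bit) & 1) << (2 * bit + 1)
        index |= ((col >> bit) & 1) << (2 * bit)
    return index

# The four elements of a 2x2 block land next to each other in memory:
print([morton_index(r, c) for r in range(2) for c in range(2)])   # [0, 1, 2, 3]
```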
A different philosophy: Eliminating swizzling at the root
Now imagine stepping back and rethinking AI hardware from first principles. Instead of accepting today's complex memory hierarchies as unavoidable, with their layers of caches, shared-memory blocks, banked SRAMs, and alignment rules, imagine an architecture built on a far simpler premise.
What if there were no memory hierarchy at all? What if, instead, the entire system revolved around a vast, flat expanse of registers? What if the compiler, not the hardware, orchestrated every data movement with deterministic precision? And what if all the usual anxieties (alignment, bank conflicts, tiling strategies, and coalescing rules) simply disappeared because they no longer mattered?
This is the philosophy behind a register-centric architecture. Rather than being pushed up and down a ladder of memory levels, data simply resides in the registers where computation occurs. The architecture is organized not around the movement of data, but around its availability.
That means:
- No caches to warm up or miss
- No warps to schedule
- No bank conflicts to avoid
- No tile sizes to match
- No tensor layouts to respect
- No sensitivity to shapes or strides, and therefore no swizzling at all
In such a system, the compiler always knows exactly where each value lives, and exactly where it needs to be next. It doesn't speculate, prefetch, tile, or rely on heuristics. It doesn't cross its fingers hoping the hardware behaves. Instead, data placement becomes a solvable, predictable problem.
The result is a machine where throughput remains stable, latency becomes predictable, and energy consumption collapses because unnecessary data motion has been engineered out of the loop. It's a system where performance is no longer dominated by memory gymnastics, and where computing, the actual math, finally takes center stage.
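To illustrate the idea, and only the idea (this is a toy, not a description of any particular product's toolchain), the sketch below shows what compile-time data placement might look like: every intermediate value is assigned a register and a cycle statically, so there is nothing left for the hardware to guess at runtime. The register count and scheduling policy are arbitrary assumptions.

```python
# Conceptual sketch (entirely hypothetical): static, compiler-owned placement.
# Instead of runtime caches and layout heuristics, placement is a table
# computed once at compile time.
from dataclasses import dataclass

@dataclass(frozen=True)
class Placement:
    register: int        # which register holds the value (assumed flat register file)
    cycle: int           # cycle at which the value must be available

def plan(dataflow):
    """Assign each intermediate value a register and a cycle, statically."""
    schedule = {}
    for step, (name, _op) in enumerate(dataflow):
        schedule[name] = Placement(register=step % 4096, cycle=step)   # 4096 is arbitrary
    return schedule

# Toy dataflow: placement is fully resolved before anything runs.
print(plan([("a", "load"), ("b", "mul"), ("c", "add")]))
```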
The future of AI: Why a register-centric architecture matters
As AI systems evolve, the tidy world of uniform tensors and perfectly rectangular compute tiles is steadily falling away. Modern models are no longer predictable stacks of dense layers marching in lockstep. Instead, they expand in every direction: they ingest multimodal inputs, incorporate sparse and irregular structures, reason adaptively, and operate across ever-longer sequences. They must also respond in real time for safety-critical applications, and they must do so within tight energy budgets, from cars to edge devices.
In other words, the assumptions that shaped GPU and TPU architectures, namely the expectation of regularity, dense grids, and neat tiling, are eroding. Future workloads are simply not shaped the way the hardware wants them to be.
A register-centric architecture offers a fundamentally different path. Because it operates directly on data where it lives, rather than forcing that data into tile-friendly formats, it sidesteps the entire machinery of memory swizzling. It does not depend on fixed tensor shapes.
It doesn't stumble when access patterns become irregular or dynamic. It avoids the costly dance of rearranging data just to satisfy the compute units. And as models grow more heterogeneous and more sophisticated, such an architecture scales with their complexity instead of fighting against it.
This is more than an incremental improvement. It represents a shift in how we think about AI compute. By eliminating unnecessary data movement, the single largest bottleneck and energy sink in modern accelerators, a register-centric approach aligns hardware with the messy, evolving reality of AI itself.
Memory swizzling is the quiet tax that every hierarchical-memory accelerator pays. It is fundamental to how GPUs, TPUs, NPUs, and nearly all AI chips operate. It's also a growing liability. Swizzling introduces latency, burns energy, bloats memory usage, and complicates software, all while contributing nothing to the actual math.
A register-centric architecture eliminates swizzling at the root by removing the hierarchy that makes it necessary. It replaces guesswork and heuristics with deterministic dataflow. It prioritizes locality without requiring rearrangement. It lets the algorithm drive the hardware, not vice versa.
As AI workloads become more irregular, dynamic, and power-sensitive, architectures that keep data stationary and predictable, rather than endlessly reshuffling it, will define the next generation of compute.
Swizzling was a necessary patch for the last era of hardware. It should not define the next one.
Lauro Rizzatti is a business advisor to VSORA, a technology company offering silicon semiconductor solutions that redefine performance. He is a noted chip design verification consultant and industry expert on hardware emulation.