Google’s Tensor Processing Units power the majority of cutting-edge AI models you interact with daily, yet most engineers remain surprisingly unfamiliar with their architecture. While NVIDIA GPUs dominate developer mindshare, TPUs quietly train and serve Gemini 2.0, Claude, and dozens of other frontier models at scales that would bankrupt most organizations using conventional GPU infrastructure. Anthropic recently committed to deploying over one million TPU chips—representing more than a gigawatt of compute capacity—to train future Claude models.¹ Google’s latest Ironwood generation delivers 42.5 exaflops of FP8 compute across 9,216-chip superpods, a scale that redefines what production AI infrastructure means.²
The technical sophistication behind TPUs extends far beyond simple performance metrics. These processors embody a fundamentally different design philosophy than GPUs, trading general-purpose flexibility for extreme specialization in matrix multiplication and tensor operations. Engineers who understand TPU architecture can exploit 256×256 systolic arrays that process 65,536 multiply-accumulate operations per cycle, leverage third-generation SparseCore accelerators for embedding-intensive workloads, and program optical circuit switches that reconfigure multi-petabit datacenter topologies in under 10 nanoseconds.³ The architecture spans everything from transistor-level design decisions to building-scale supercomputer orchestration.
The technical content ahead demands careful attention. We examine seven generations of TPU evolution, dissect systolic array mathematics and dataflow patterns, explore memory hierarchies from SRAM tiles to HBM3e channels, analyze XLA compiler optimizations at the intermediate representation level, and investigate why collective operations execute 10× faster than equivalent Ethernet-based GPU clusters.⁴ You’ll encounter register-level specifications, cycle-accurate performance modeling, and the architectural tradeoffs that make TPUs simultaneously more powerful and more constrained than GPUs. The depth here serves engineers building the next generation of AI infrastructure and researchers pushing the boundaries of what current accelerators can achieve.
The Evolution: Seven Generations of Architectural Innovation
TPU v1: Inference-Only Specialization (2015)
Google deployed the first Tensor Processing Unit in 2015 to address a critical problem: neural network inference workloads threatened to double the company’s datacenter footprint.⁵ Engineers designed TPU v1 exclusively for inference, removing training capabilities entirely to maximize performance and power efficiency for deployed models. The chip featured a 256×256 systolic array of 8-bit integer multiply-accumulate units, delivering 92 teraops per second at just 28-40 watts thermal design power.⁶
The architecture embodied radical minimalism. A single Matrix Multiply Unit processed INT8 operations through weight-stationary dataflow, where weights remained fixed in the systolic array while activations streamed horizontally across the grid. Partial sums propagated vertically, eliminating intermediate memory writes for the entire matrix multiplication. The chip, connected to host systems via PCIe, relied on DDR3 DRAM for external memory and operated at 700 MHz—deliberately conservative for power efficiency.⁷
Performance gains astonished even Google’s engineers. TPU v1 achieved 30× to 80× improvements in operations per watt compared to contemporary CPUs and GPUs for production inference workloads.⁸ The chip handled Google Search ranking, translation services processing 1 billion daily requests, and YouTube recommendations for 2 billion users. The success validated a core architectural insight: purpose-built accelerators optimized for narrow workloads could deliver order-of-magnitude improvements over general-purpose processors.
TPU v2: Enabling Training at Scale (2017)
The second generation transformed TPUs from inference-only accelerators into complete training platforms. Google redesigned the entire architecture around floating-point operations, replacing the 256×256 INT8 array with 128×128 bfloat16 multiply-accumulate arrays, one per TensorCore.⁹ Each chip contained two TensorCores, each paired with 8GB of High Bandwidth Memory, a massive upgrade from DDR3 that provided the bandwidth neural network training demanded.
Bfloat16 precision proved critical for TPU v2’s success. The format maintains the same 8-bit exponent range as FP32 while reducing the mantissa to 7 bits, preserving dynamic range for training while halving memory bandwidth requirements.¹⁰ Engineers observed that the reduced mantissa precision actually improved generalization in many models by acting as a form of regularization, while the full FP32 exponent range prevented the underflow and overflow issues that plagued FP16 training.
The architectural innovation that truly differentiated TPU v2 was the Inter-Chip Interconnect (ICI). Previous accelerators required Ethernet or InfiniBand for multi-chip communication, introducing latency and bandwidth bottlenecks. Google designed custom high-speed bidirectional links that connected each TPU directly to four neighbors in a 2D torus topology.¹¹ The interconnect enabled TPU v2 "pods" of up to 256 chips to function as a single logical accelerator, with collective operations like all-reduce executing far faster than network-based alternatives.
TPU v3: Water-Cooled Performance Scaling (2018)
Google pushed clock speeds and core counts aggressively in TPU v3, delivering 420 teraflops per four-chip board—more than doubling v2’s performance.¹² The increased power density forced a dramatic architectural change: liquid cooling. Each TPU v3 pod required water cooling infrastructure, a departure from the air-cooled designs of previous generations and most datacenter accelerators.¹³
The chip maintained the dual 128×128 MXU architecture but increased the total number of cores and improved memory bandwidth. Each TPU v3 board carried four chips with two cores each, and each chip provided 32GB of HBM shared between its cores.¹⁴ The vector processing units received enhancements for activation functions, normalization operations, and gradient computations that frequently bottlenecked training on the matrix units alone.
Deployments scaled to 2,048-chip pods using the same 2D torus ICI topology as v2 but with increased per-link bandwidth. Google trained increasingly large models on v3 pods, discovering that the torus topology’s reduced network diameter (the wrap-around links halve the maximum hop count along each dimension compared with a simple mesh) minimized communication overhead for both data-parallel and model-parallel training strategies.¹⁵
TPU v4: Optical Circuit Switching Breakthrough (2021)
The fourth generation represented Google’s most significant architectural leap since the original TPU. Engineers increased pod scale to 4,096 chips while introducing optical circuit switching (OCS) for interconnect, a technology borrowed from telecommunications that revolutionized datacenter-scale ML infrastructure.¹⁶
TPU v4’s core architecture featured four 128×128 MXUs per TensorCore alongside enhanced vector and scalar units. Each TensorCore pair shared 128MB of Common Memory in addition to per-core Vector Memory, enabling more sophisticated data staging and reuse patterns.¹⁷ The chip topology evolved from 2D to 3D torus, connecting each TPU to six neighbors rather than four, further reducing network diameter and improving bisection bandwidth.
The optical circuit switching system changed everything about large-scale deployment. Rather than fixed cabling between TPUs, Google deployed programmable optical switches that could dynamically reconfigure which chips connected to which. MEMS (microelectromechanical systems) mirrors physically redirect light beams to patch arbitrary TPU pairs together, introducing essentially zero latency beyond the optical fiber transmission time.¹⁸ The switches reconfigure in sub-10-nanosecond windows, faster than most network protocol handshakes.
The OCS architecture enabled capabilities previously impossible. Google could provision "slices" of any size, from four chips to the full 4,096-chip pod, by programming the optical switches appropriately. Failed chips could be seamlessly routed around without taking down entire racks. Most remarkably, physically distant TPUs in different datacenter locations could be logically adjacent in the network topology, decoupling physical and logical layout entirely.¹⁹
TPU v4 also introduced SparseCore, a specialized processor for the embedding operations that pervade recommendation systems, ranking models, and large language models with massive vocabulary embeddings. The SparseCore featured four dedicated processors per chip, each with 2.5MB of scratchpad memory and optimized dataflow for sparse memory access patterns.²⁰ Models with ultra-large embeddings achieved 5-7× speedups using just 5% of total chip die area and power budget.
TPU v5p and v5e: Specialization and Scale (2022-2023)
Google split the fifth generation into two distinct products targeting different use cases. TPU v5p prioritized maximum performance for large-scale training, while v5e optimized for cost-effective inference and smaller training jobs.²¹
TPU v5p achieved approximately 4.45 exaflops across 8,960-chip pods, more than doubling v4’s maximum pod size.²² The interconnect bandwidth reached 4,800 Gbps per chip, and the 3D torus topology connected chips in massive 16×20×28 superpods. The optical circuit switching fabric managed 13,824 optical ports across 48 OCS units to wire a complete v5p superpod, representing one of the largest production optical switching deployments in computing history.²³
TPU v5e took a different approach, reducing core count and clock speed to hit aggressive power and cost targets. Inference-optimized chips contained only one TPU core per chip rather than two, and returned to the 2D torus topology, which was sufficient for smaller pod sizes.²⁴ The architectural simplification enabled Google to price v5e competitively for workloads where absolute performance mattered less than performance per dollar.
TPU v6e Trillium: Quadrupling Matrix Performance (2024)
Trillium marked another architectural inflection point by expanding the Matrix Multiply Unit from 128×128 to 256×256 multiply-accumulators.²⁵ The larger array quadrupled FLOPs per cycle at the same clock speed, delivering 4.7× the peak compute performance of TPU v5e through a combination of the expanded MXU and increased clock frequencies.
The memory subsystem received equally dramatic upgrades. HBM capacity doubled to 32GB per chip, and next-generation HBM channels doubled bandwidth as well.²⁶ The Interchip Interconnect bandwidth similarly doubled, enabling pods of 256 Trillium chips to sustain higher throughput for models that stressed both compute and communication.²⁷
Trillium featured the third-generation SparseCore accelerator, with enhanced capabilities for ultra-large embeddings in ranking and recommendation workloads. The updated design improved memory access patterns and increased the available bandwidth between SparseCores and HBM for models dominated by embedding lookups rather than matrix multiplications.²⁸
Energy efficiency improved by 67% over v5e despite substantial performance gains.²⁹ Google achieved the efficiency gains through advanced process nodes, architectural optimizations that reduced wasted work, and careful power gating of unused units during operations that didn’t stress all parts of the chip simultaneously.
TPU v7 Ironwood: The FP8 Era (2025)
Google’s seventh-generation TPU, codenamed Ironwood, represents the first TPU designed with native FP8 support and optimized specifically for the "age of inference" while maintaining state-of-the-art training performance.³⁰ Each Ironwood chip delivers 4.6 petaFLOPS of dense FP8 compute—slightly exceeding NVIDIA’s competing B200 at 4.5 petaFLOPS—while pulling 600W thermal design power.³¹
The memory system expanded to 192GB of HBM3e memory per chip, six times Trillium’s capacity, with bandwidth reaching 7.4TB/s.³² The dramatic memory increase enables serving ultra-large models with key-value caches that previously required complex tensor parallelism across multiple accelerators. Google specifically designed the memory capacity to support emerging multi-modal models and long-context applications approaching million-token windows.
Ironwood’s interconnect provides 9.6 Tbps of aggregate bidirectional bandwidth through four ICI links, translating to 1.2 TB/s of peak per-chip bandwidth.³³ The architecture scales from 256-chip pods for smaller deployments to massive 9,216-chip superpods delivering 42.5 FP8 exaflops of compute power.³⁴ Google’s Jupiter datacenter network technology could theoretically support up to 43 Ironwood superpods in a single cluster—roughly 400,000 accelerators representing an almost incomprehensible scale of compute.³⁵
The FP8 support represents a fundamental shift in precision strategy. Prior TPU generations emulated 8-bit operations using software techniques, which introduced overhead. Ironwood implements native FP8 multiply-accumulate units supporting both E4M3 (4-bit exponent, 3-bit mantissa) and E5M2 (5-bit exponent, 2-bit mantissa) formats.³⁶ The dual format support enables mixing E4M3 for forward passes, where fine-grained precision matters more than range, with E5M2 for backward passes, where the wider exponent range preserves gradient magnitudes and prevents training instability.
Anthropic’s commitment to deploy over one million Ironwood chips beginning in 2026 demonstrates the architecture’s production readiness. The company plans to leverage well over a gigawatt of TPU capacity—enough to power a small city—exclusively for training and serving Claude models.³⁷ The scale dwarfs even the largest known GPU deployments and represents a fundamental bet on TPU architecture for frontier model development.
Current-Generation Quick Reference
The following tables provide scannable specifications for the three current-generation TPUs most relevant to production deployments in 2025:
Table 1: Core Compute Specifications
Table 2: Memory and Bandwidth
Table 3: Interconnect and Scaling
Table 4: Power and Efficiency
Table 5: Recommended Use Cases
Hardware Architecture: Inside the Silicon
Systolic Array Mathematics and Dataflow
The Matrix Multiply Unit forms the heart of TPU architecture, and understanding systolic arrays requires grasping their fundamentally different approach to parallelism compared to GPU SIMD lanes. A systolic array chains multiply-accumulate units in a grid where data flows rhythmically through the structure—hence "systolic," evoking the rhythmic pumping of blood through the heart.³⁸
Consider TPU v6e’s 256×256 systolic array performing the matrix multiplication C = A × B. Engineers preload the weights of matrix B into the 65,536 individual multiply-accumulate units arranged in a grid. Matrix A’s activation values enter from the left edge and flow horizontally across the array. Each MAC unit multiplies its stored weight by the incoming activation, adds the result to a partial sum arriving from above, and passes both the activation (horizontally) and updated partial sum (vertically) to neighboring units.³⁹
The dataflow pattern means each activation value gets reused 256 times as it traverses the horizontal dimension, and each partial sum accumulates contributions from 256 multiplications as it flows vertically. Critically, all intermediate results pass directly between adjacent MAC units via short wires rather than round-tripping to memory. The architecture performs 65,536 multiply-accumulate operations every clock cycle, and during the entire matrix multiplication involving potentially millions of operations, zero intermediate values touch DRAM or even on-chip SRAM.⁴⁰
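The cycle-by-cycle behavior is easier to internalize with a toy model. The sketch below is a plain Python/NumPy simulation of a weight-stationary systolic array, scaled down to a handful of cells for readability; the function name and the exact skewing schedule are illustrative rather than a description of Google’s hardware.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy weight-stationary systolic array computing C = A @ B.

    B (the weights) is preloaded into a K x N grid of MAC cells. Rows of
    A (the activations) enter the left edge on a skewed schedule and move
    one cell to the right per cycle; partial sums move one cell down per
    cycle and drain from the bottom edge.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2

    act = np.zeros((K, N))    # activation register in each MAC cell
    psum = np.zeros((K, N))   # partial-sum register in each MAC cell
    C = np.zeros((M, N))

    for t in range(M + K + N - 2):           # enough cycles to drain the pipe
        new_act = np.zeros_like(act)
        new_psum = np.zeros_like(psum)
        for k in range(K):
            for n in range(N):
                if n == 0:
                    m = t - k                # skewed injection from the left edge
                    a_in = A[m, k] if 0 <= m < M else 0.0
                else:
                    a_in = act[k, n - 1]     # activation from the left neighbor
                p_in = psum[k - 1, n] if k > 0 else 0.0   # partial sum from above
                new_act[k, n] = a_in
                new_psum[k, n] = p_in + a_in * B[k, n]    # the multiply-accumulate
        act, psum = new_act, new_psum

        # Completed results leave the bottom row, one output diagonal per cycle.
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                C[m, n] = psum[K - 1, n]
    return C

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

Each cell touches only its own registers and its immediate neighbors, which is exactly why the full-size array needs no memory traffic for intermediate values.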
The weight-stationary dataflow pattern optimizes for the most common case in neural network inference and training: repeatedly multiplying many different activation matrices by the same weight matrix. Engineers load weights once, then stream unbounded activation batches through the array without reloading. The pattern works exceptionally well for convolutional layers, fully connected layers, and the Q·K^T and attention·V operations that dominate transformer models.⁴¹
Energy efficiency stems from data reuse and spatial locality. Reading a value from DRAM consumes roughly 200× as much energy as a single multiply-accumulate operation.⁴² By reusing each weight 256 times and each activation 256 times without memory accesses, the systolic array achieves operations-per-watt ratios impossible for architectures that shuttle data back and forth between compute units and memory hierarchies.
The systolic array’s weakness emerges with dynamic or irregular computation patterns. Because data flows through the grid on a fixed schedule, the architecture struggles with conditional execution, sparse matrices (unless using SparseCore), and operations that require random access patterns. The inflexibility trades generality for extreme efficiency on its target workload: dense matrix multiplication with predictable access patterns.
TensorCore Internal Architecture
Each TPU chip contains one or more TensorCores—the complete processing unit comprising the Matrix Multiply Unit, Vector Processing Unit, and Scalar Unit working in concert.⁴³ The TensorCore represents the fundamental building block that software targets, and understanding the interaction between its three components explains both TPU performance characteristics and programming patterns.
A 128×128 Matrix Multiply Unit executes 16,384 multiply-accumulate operations per cycle on bfloat16 or FP8 inputs with FP32 accumulation.⁴⁴ The mixed-precision approach preserves numerical accuracy in the accumulator while reducing memory bandwidth for inputs. Engineers observed that maintaining complete FP32 precision during accumulation prevents catastrophic cancellation errors when summing hundreds or thousands of intermediate products, while reduced-precision inputs rarely affect final model quality.
The Vector Processing Unit handles operations poorly suited to the MXU’s rigid structure. Activation functions (ReLU, GELU, SiLU), normalization layers (batch norm, layer norm), softmax, pooling, dropout, and element-wise operations execute on the VPU’s 128-lane SIMD architecture.⁴⁵ The VPU operates on FP32 and INT32 datatypes, providing the precision required for numerically sensitive operations like softmax, where exponentials and divisions can create large dynamic ranges.
The Scalar Unit orchestrates the entire TensorCore. The single-threaded processor executes control flow, calculates memory addresses for complex indexing patterns, and initiates DMA transfers from High Bandwidth Memory to Vector Memory.⁴⁶ Because the scalar unit runs single-threaded, each TensorCore can create only one DMA request per cycle—a bottleneck for memory-intensive operations that don’t saturate the MXU or VPU compute throughput.
The memory hierarchy feeding the TensorCore determines achievable performance as much as raw compute capability. Vector Memory (VMEM) acts as a software-managed scratchpad SRAM exclusive to each TensorCore, typically sized at tens of megabytes. The XLA compiler explicitly schedules data movement between HBM and VMEM, deciding what to stage into the fast local memory and when to write results back.⁴⁷
Common Memory (CMEM), present in TPU v4 and later generations, provides a larger shared pool accessible to all TensorCores on a chip. The TPU v4 architecture allocated 128MB of CMEM shared between two TensorCores, enabling more sophisticated producer-consumer patterns in which one core’s outputs feed another core’s inputs without round-tripping to HBM.⁴⁸
The programming model implications matter enormously. Because the scalar unit is single-threaded and vector memory requires explicit management, TPU programming resembles 1990s-era embedded systems development more than modern GPU programming. CUDA abstracts memory movement with unified memory and hardware-managed caches; TPU code (whether generated by XLA or written by hand in Pallas) must explicitly orchestrate every data transfer. Manual control enables expert optimization but raises the bar for achieving competitive performance.
High Bandwidth Memory Architecture
Modern TPUs use High Bandwidth Memory (HBM, currently HBM3e), a radically different memory technology from the DDR SDRAM found in CPUs and the GDDR used in many GPUs. HBM stacks multiple DRAM dies vertically using through-silicon vias (TSVs), then places the stack directly adjacent to the processor die on a silicon interposer.⁴⁹ The short electrical path and wide interface enable dramatically higher bandwidth than conventional memory technologies.
TPU v7 Ironwood implements 192GB of HBM3e with a total bandwidth of 7.4 TB/s.⁵⁰ The memory system is divided into multiple channels, each providing independent access to a separate portion of the total capacity. The XLA compiler and runtime must carefully partition tensors across HBM channels to maximize parallel access and avoid hotspots where one channel saturates while others sit idle.
The memory interface width dwarfs conventional DRAM. Where a DDR5 channel might provide 64 bits of width, an HBM channel typically spans 1,024 bits.⁵¹ The extreme width enables high bandwidth at relatively modest clock speeds, reducing power consumption and signal integrity challenges compared to pushing narrow interfaces to multi-gigahertz frequencies.
Latency characteristics differ substantially from GPU memory systems. TPUs lack hardware-managed caches beyond small local buffers, so the architecture relies on software explicitly staging data into VMEM well before compute units need it. The lack of caches means memory latency directly impacts performance unless the compiler successfully hides latency through prefetching and double-buffering.⁵²
Memory capacity limits dominate many workloads more than compute throughput. A 175-billion parameter model with bfloat16 weights requires 350GB to store parameters—already exceeding Ironwood’s 192GB HBM even before accounting for activations, optimizer states, or gradient buffers. Training such models demands sophisticated techniques like gradient checkpointing, optimizer state sharding across multiple chips, and careful scheduling of parameter updates to minimize memory footprint.⁵³
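The arithmetic behind that claim is worth spelling out. The sketch below uses rough, illustrative byte counts (2 bytes per bfloat16 value, FP32 Adam moments and an FP32 master copy); real budgets vary with the optimizer and sharding strategy.

```python
params = 175e9                       # 175B-parameter model

bf16_weights = params * 2            # 2 bytes per bfloat16 weight
bf16_grads   = params * 2            # gradients in the same dtype
adam_moments = params * 4 * 2        # two FP32 Adam moment tensors
fp32_master  = params * 4            # FP32 master copy of the weights

total = bf16_weights + bf16_grads + adam_moments + fp32_master
print(f"weights alone:       {bf16_weights / 1e9:6.0f} GB")
print(f"full training state: {total / 1e9:6.0f} GB")
print(f"Ironwood chips (192GB HBM) for state alone: {total / 192e9:.1f}")
```

Even before activations enter the picture, the training state alone spans well over a dozen chips, which is why the sharding and checkpointing techniques mentioned above are mandatory at this scale.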
The TPU runtime enforces specific tensor layout requirements to maximize MXU efficiency. Because the systolic array processes data in 128×8 tiles, tensors should align to these dimensions to avoid padding waste.⁵⁴ Poorly sized matrices force the hardware to process partial tiles with MACs sitting idle, directly reducing FLOPS utilization. The compiler attempts to pad and reshape tensors automatically, but conscious layout choices in the model architecture can substantially improve performance.
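The padding penalty is easy to estimate. A small sketch, assuming the 128×8 tiling described above (actual tile shapes vary by dtype and generation):

```python
import math

def tile_utilization(rows, cols, tile_rows=8, tile_cols=128):
    """Fraction of MAC work spent on real data after padding a
    (rows, cols) operand up to whole hardware tiles."""
    padded_rows = math.ceil(rows / tile_rows) * tile_rows
    padded_cols = math.ceil(cols / tile_cols) * tile_cols
    return (rows * cols) / (padded_rows * padded_cols)

print(tile_utilization(512, 1024))   # aligned: 100% of MACs do useful work
print(tile_utilization(512, 1000))   # 1000 pads to 1024: ~98% useful
print(tile_utilization(512, 129))    # 129 pads to 256: ~50% useful
```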
SparseCore: Specialized Embedding Acceleration
While the Matrix Multiply Unit excels at dense matrix operations, embedding-intensive workloads exhibit radically different characteristics. Recommendation models, ranking systems, and large language models frequently access massive embedding tables (often hundreds of gigabytes) through irregular, data-dependent indices. The MXU’s structured dataflow provides no advantage for these sparse memory access patterns, motivating SparseCore’s specialized architecture.⁵⁵
SparseCore implements a tiled dataflow processor fundamentally different from the MXU’s systolic array. TPU v4 featured four SparseCores per chip, each containing 16 compute tiles.⁵⁶ Each tile operates as an independent dataflow unit with local scratchpad memory (SPMEM) and processing elements. The tiles execute in parallel, processing disjoint subsets of embedding operations simultaneously.
The memory hierarchy places hot data in small, fast SPMEM while keeping the full embedding tables in HBM. The XLA compiler analyzes embedding access patterns to determine which embedding vectors merit caching in SPMEM versus fetching on demand from HBM.⁵⁷ The strategy resembles traditional CPU cache hierarchies, but with software rather than hardware making placement decisions.
SparseCores connect directly to HBM channels, bypassing the TensorCore’s memory path entirely. The dedicated connection prevents embedding operations from competing with dense matrix operations for memory bandwidth, enabling both to proceed in parallel.⁵⁸ The partitioning works exceptionally well for models like Deep Learning Recommendation Models (DLRMs) that interleave dense neural network layers with large embedding lookups.
The mod-sharding strategy distributes embeddings across SparseCores by computing target_sc_id = col_id % num_total_sparse_cores.⁵⁹ The simple sharding function ensures load balancing when embedding IDs are distributed uniformly, but can create hotspots for skewed access patterns. Engineers working with real-world data often need to analyze embedding frequency distributions and manually rebalance sharding to avoid bottlenecks.
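A toy version of the sharding rule shows how skew turns into hotspots (illustrative only; the real runtime also balances work within each lookup batch):

```python
from collections import Counter

NUM_SPARSE_CORES = 16

def target_sc_id(col_id: int) -> int:
    # The mod-sharding rule: embedding column id -> SparseCore id.
    return col_id % NUM_SPARSE_CORES

# Uniformly distributed embedding IDs balance almost perfectly.
uniform = Counter(target_sc_id(i) for i in range(100_000))
print(max(uniform.values()) / min(uniform.values()))   # ~1.0

# A skewed workload where a few hot rows share the same residue
# piles nearly all the work onto a single SparseCore.
hot = [0, 16, 32, 48] * 25_000        # every hot ID maps to core 0
tail = list(range(1, 10_000))
skewed = Counter(target_sc_id(i) for i in hot + tail)
print(skewed.most_common(1))          # core 0 handles the vast majority
```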
Performance gains from SparseCore reach 5-7× compared to implementing identical operations on the MXU and VPU, while consuming only 5% of chip die area and power.⁶⁰ The dramatic efficiency advantage stems from purpose-building the dataflow for sparse operations rather than forcing them through dense matrix infrastructure. The specialization principle applies recursively within TPU architecture: just as TPUs specialize beyond GPUs’ general-purpose design, SparseCores specialize beyond TPUs’ matrix-oriented design.
Trillium’s third-generation SparseCore introduced variable SIMD width (8 elements for FP32, 16 for bfloat16) and improved memory access patterns, reducing wasted bandwidth from misaligned reads.⁶¹ The architectural evolution demonstrates Google’s continued investment in embedding acceleration as large language models trend toward larger vocabularies and more sophisticated retrieval-augmented generation patterns.
Interconnect Technology: Wiring the Supercomputer
Inter-Chip Interconnect (ICI) Architecture
The Inter-Chip Interconnect is the critical technology that enables TPUs to function as unified supercomputers rather than isolated accelerators. Unlike GPUs that communicate through Ethernet or InfiniBand networks, ICI implements custom high-speed serial links directly connecting neighboring TPUs with microsecond-scale latency and terabit-per-second bandwidth.⁶²
Topology evolution across TPU generations reflects changing requirements for pod scaling. TPU v2, v3, v5e, and v6e implement 2D torus topologies in which each chip connects to its four nearest neighbors (north, south, east, and west).⁶³ The links wrap around at boundaries, creating a donut-shaped logical topology that eliminates edge chips with fewer connections. A 16×16 grid of 256 TPUs thus provides uniform bandwidth and latency characteristics regardless of which two chips communicate.
TPU v4 and v5p upgraded to 3D torus topologies with each chip connecting to six neighbors.⁶⁴ The additional dimension reduces network diameter—the maximum hop count between any two chips—from roughly 2√N to 3∛N. For a 4,096-chip pod, the maximum hops drop from approximately 128 to 48, substantially reducing worst-case communication latency for globally synchronizing operations such as all-reduce.
The toroidal structure delivers another critical advantage: equal bisection bandwidth regardless of how workloads partition across chips. Any cut that divides the torus in half crosses the same number of links, preventing pathological cases where poor job placement creates network bottlenecks.⁶⁵ The uniform bisection bandwidth simplifies scheduling and enables the optical circuit switch reconfigurability discussed below.
Bandwidth specifications scale impressively across generations. TPU v6e provides 13 TB/s of ICI bandwidth per chip.⁶⁶ TPU v5p reached 4,800 Gbps per chip across six 3D torus links.⁶⁷ Ironwood implements four ICI links with a 9.6 Tbps aggregate bidirectional bandwidth, translating to 1.2 TB/s per chip.⁶⁸ For comparison, a top-tier 400GbE network interface provides 50GB/s bidirectional bandwidth—an order of magnitude less than modern TPU ICI.
Link technology within racks uses direct-attached copper (DAC) cables for short distances between chips in the same 4×4×4 cube.⁶⁹ The copper connections minimize cost and power while providing the required bandwidth for tightly coupled chips executing synchronized operations. Inter-cube and pod-scale links transition to optical transceivers, trading higher cost and power for the distance and bandwidth needed to span datacenter racks.
Collective operations exploit ICI’s unique properties. All-reduce, all-gather, and reduce-scatter operations frequently synchronize activations and gradients across chips during training. On Ethernet-based GPU clusters, these collectives traverse a hierarchical network with switches, cables, and network interface cards, introducing latency at each hop. TPU ICI implements optimized collective algorithms directly in hardware, executing all-reduce operations 10× faster than equivalent Ethernet-based GPU implementations.⁷⁰
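From the programmer’s point of view these collectives surface as one-line primitives. A minimal data-parallel sketch in JAX (it runs on whatever devices are visible, CPU included; on a TPU slice the pmean lowers to the ICI’s hardware all-reduce):

```python
import functools
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

@functools.partial(jax.pmap, axis_name="batch")
def train_step(w, x, y):
    grads = jax.grad(loss_fn)(w, x, y)
    # Cross-device gradient all-reduce (mean); hardware-accelerated on TPU.
    grads = jax.lax.pmean(grads, axis_name="batch")
    return w - 0.01 * grads

n = jax.local_device_count()
ws = jnp.zeros((n, 4, 1))     # one replicated weight copy per device
xs = jnp.ones((n, 8, 4))      # per-device data shards
ys = jnp.ones((n, 8, 1))
ws = train_step(ws, xs, ys)
```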
Optical Circuit Switching: Dynamic Topology Reconfiguration
Google’s deployment of optical circuit switching (OCS) with TPU v4 represented one of the most significant innovations in datacenter networking in decades. Traditional packet-switched networks—whether Ethernet or InfiniBand—establish logical connections by routing packets hop-by-hop through switches that examine headers and forward to appropriate output ports. OCS instead uses programmable optical elements to create direct physical light paths between endpoints, eliminating switching latency entirely.⁷¹
The core technology relies on MEMS (microelectromechanical systems) mirrors that physically rotate to redirect light beams. A transmitter on TPU A sends light into the OCS. Tiny mirrors inside the OCS rotate to reflect that light beam to a receiver on TPU B. The connection becomes a direct optical path from A to B with essentially zero added latency beyond light propagation through the fiber.⁷²
Reconfiguration speed determines the practicality of OCS in production systems. Google’s deployment achieves sub-10-nanosecond switching times—faster than typical network protocol round-trip times.⁷³ The reconfiguration speed enables dynamic topology changes matching workload requirements without disrupting running jobs or requiring carefully coordinated traffic engineering.
TPU v5p demonstrated OCS at a massive scale. The architecture uses optical circuit switches that deliver four petabits per second of aggregate bandwidth across the switching fabric.⁷⁴ A single v5p superpod requires 48 OCS units managing 13,824 optical ports to wire 8,960 chips in the 16×20×28 3D torus configuration.⁷⁵ The switching system represents one of the largest optical networking deployments in any computing environment.
OCS provides capabilities impossible with traditional networks. Physical topology and logical topology fully decouple—two TPUs in opposite corners of the datacenter appear as adjacent neighbors if the OCS creates direct optical paths. Failed chips or links get routed around by reprogramming mirrors to exclude faulty components and maintain the logical torus structure. New jobs receive "slices" of any size by programming the OCS to create appropriate pod configurations without physically re-cabling racks.⁷⁶
The architecture integrates with Google’s Jupiter data center network to scale beyond a single pod. Jupiter delivers multi-petabit-per-second bisection bandwidth across entire datacenters using Google’s custom silicon switches and control plane.⁷⁷ Multiple TPU superpods connect via Jupiter fabric, theoretically supporting clusters of up to 400,000 accelerators if network capacity permits.⁷⁸
Power consumption and reliability characteristics favor optical circuit switching for TPU-scale deployments. Traditional packet switches consume substantial power processing and forwarding packets at terabit-per-second rates. OCS switches consume power only to operate MEMS mirrors during reconfiguration events, then sit idle, passing light with minimal loss while connections remain stable.⁷⁹ The architecture’s simplicity improves reliability by eliminating complex packet processing and buffering logic prone to bugs and performance anomalies.
Pod Architecture and Scaling Characteristics
TPU pods represent the largest single unit of TPUs connected through ICI, forming a unified accelerator. The physical structure builds hierarchically from individual chips to trays to cubes to racks to complete pods.⁸⁰ Understanding the hierarchy matters for reasoning about memory capacity, communication bandwidth, and fault tolerance at different scales.
The fundamental building block consists of four chips on a single tray connected to a host CPU via PCIe.⁸¹ The PCIe connection handles control plane operations, initial program loading, and infeed/outfeed for training data and inference results. The actual inter-chip communication for distributed training flows through ICI rather than PCIe, avoiding PCIe bandwidth bottlenecks.
Sixteen trays (64 chips) form a single 4×4×4 cube—the basic unit for pod construction. Within a cube, all ICI connections use direct-attached copper cables since chips reside in the same rack with short physical distances.⁸² The cube implements a complete 3D torus with wrap-around connections, creating a self-contained 64-chip unit that could theoretically operate independently.
TPU v4 pods scale to 64 cubes totaling 4,096 chips.⁸³ The inter-cube connections transition to optical links managed by the optical circuit switching fabric. The OCS can provision these 4,096 chips as a single enormous pod, multiple smaller independent pods, or dynamically reconfigure mid-job if required. The flexibility enables datacenter operators to balance utilization across different job sizes and priorities.
TPU v5p pushed pod scale to 8,960 chips in a 16×20×28 3D torus.⁸⁴ The specific dimensions reflect careful bandwidth and diameter optimization—prime factorizations matter for network topology! The pod delivers 4.45 exaflops of compute and represents one of the largest single-pod configurations deployed in production.
Ironwood supports both 256-chip pods for smaller deployments and 9,216-chip superpods for massive frontier model training.⁸⁵ The 9,216-chip configuration delivers 42.5 FP8 exaflops—more compute than the entire Top500 list of supercomputers contained just five years earlier.⁸⁶ The scale redefines what organizations can accomplish with synchronous training rather than pipelined or asynchronous approaches.
Scaling efficiency determines whether larger pods actually help. Communication overhead increases with pod size as chips spend more time synchronizing rather than computing. Google Research published results demonstrating 95% scaling efficiency at 32,768 TPUs for specific workloads, meaning 32,768 TPUs delivered 95% of the performance that perfect linear scaling would predict.⁸⁷ The efficiency stems from hardware-accelerated collectives, optimized compiler transformations, and clever algorithmic approaches to reduce gradient synchronization frequency.
Fault tolerance at the pod scale requires sophisticated handling. Statistical probability guarantees component failures in any system with thousands of chips running continuously. The optical circuit switch enables graceful degradation by reconfiguring around failed components. Training checkpointing occurs at regular intervals (typically every few minutes), so job failure requires restarting only from the last checkpoint rather than from scratch.⁸⁸
Software Stack: Compilers, Frameworks, and Programming Models
XLA Compiler: Optimizing Computation Graphs
XLA (Accelerated Linear Algebra) forms the foundation of TPU’s software stack, compiling high-level framework operations into optimized machine code for execution on the TPU.⁸⁹ The compiler implements aggressive optimizations impossible in general-purpose compilers because it exploits domain knowledge about machine learning workloads and TPU architecture characteristics.
Fusion represents XLA’s most impactful optimization. The compiler analyzes computation graphs to identify sequences of operations that can execute without materializing intermediate tensors. A simple example: a chain like relu(batch_norm(conv(x))) normally requires writing the convolution output to memory, reading it back for batch normalization, writing that result to memory, and reading it again for ReLU. XLA fuses these operations into a single kernel that produces the final ReLU output without the intermediate memory traffic.⁹⁰
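A minimal JAX illustration of the same idea, with a dense layer standing in for the convolution (shapes and names here are arbitrary):

```python
import jax
import jax.numpy as jnp

def layer(x, w, scale, bias):
    """dense -> normalize -> ReLU, written as three separate operations."""
    y = x @ w                                                # MXU work
    y = (y - y.mean(axis=-1, keepdims=True)) * scale + bias  # VPU work
    return jnp.maximum(y, 0.0)                               # VPU work

fused = jax.jit(layer)   # XLA fuses the normalization and ReLU so the
                         # intermediates never round-trip through HBM

x = jnp.ones((128, 256))
w = jnp.ones((256, 512))
out = fused(x, w, 1.0, 0.0)

# Peek at the compiler's view of the program, if curious:
print(jax.jit(layer).lower(x, w, 1.0, 0.0).as_text()[:300])
```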
Fusion’s impact scales with TPU’s architecture. Memory bandwidth constrains many workloads more than compute throughput—the MXU can perform matrix multiplications faster than the memory system can feed it data. Eliminating intermediate memory writes and reads through fusion directly translates to performance improvements, often delivering 2× or more speedup for activation-function-heavy networks.⁹¹
Memory layout transformations optimize tensor storage for hardware requirements. Neural networks often represent tensors in the NHWC format (batch, height, width, channels) for intuitive indexing, but TPU MXUs perform best with layouts that align with 128×8 tiles.⁹² XLA automatically transposes, reshapes, and pads tensors to match hardware preferences, inserting layout transformations only where necessary and sometimes propagating preferred layouts backward through the graph to minimize total transformation overhead.
The compiler implements sophisticated constant folding and dead code elimination. ML graphs frequently contain subgraphs whose outputs depend only on constants—batch normalization parameters, inference dropout rates, and shape calculations that can be executed once rather than per batch. XLA evaluates these subgraphs at compile time and replaces them with constant tensors, reducing runtime work.⁹³
Cross-replica optimization exploits knowledge about distributed execution. When training across multiple TPU cores, certain operations (like batch normalization statistics) require aggregation across all replicas. XLA identifies these patterns and generates optimized collective operations that exploit ICI’s hardware-accelerated all-reduce rather than implementing aggregation through explicit message passing.⁹⁴
The compiler targets an intermediate representation, Mosaic, specifically for TPUs. Mosaic operates at a higher abstraction level than assembly language but lower than the input computation graph. The language exposes TPU architectural features, such as systolic arrays, vector memory, and VMEM staging, while hiding low-level details, such as instruction scheduling and register allocation.⁹⁵
Auto-tuning capabilities select optimal tile sizes and operation parameters through empirical search. The XLA Auto-Tuning (XTAT) system tries different fusion strategies, memory layouts, and tile dimensions, profiles each variant’s performance, and selects the fastest configuration.⁹⁶ The search can require substantial compile time for complex models, but produces dramatic runtime speedups by discovering counter-intuitive optimizations humans rarely identify manually.
JAX: Composable Transformations and SPMD
JAX provides a NumPy-compatible interface for numerical computation with automatic differentiation, JIT compilation to XLA, and first-class support for program transformation.⁹⁷ The framework’s functional programming paradigm and composable transformation model align naturally with TPU execution models and distributed parallelism patterns.
The core JAX abstraction applies mathematical transformations to functions: grad(f) computes f’s gradient, jit(f) JIT-compiles f to XLA, and vmap(f) vectorizes f over a new dimension. Critically, transformations compose: jit(grad(vmap(f))) works exactly as expected, compiling a vectorized gradient function.⁹⁸ The compositional model enables building complex distributed training loops from simple, testable components.
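A compact example of the composition, assuming nothing beyond a standard JAX install:

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    return jnp.mean((jnp.dot(x, w) - y) ** 2)

# Transformations compose: vectorize over examples, differentiate,
# then JIT-compile the whole pipeline into a single XLA program.
per_example_loss = jax.vmap(loss, in_axes=(None, 0, 0))
batched_grad = jax.jit(
    jax.grad(lambda w, xs, ys: jnp.mean(per_example_loss(w, xs, ys))))

w = jnp.zeros((4,))
xs = jnp.ones((8, 4))
ys = jnp.ones((8,))
print(batched_grad(w, xs, ys).shape)   # (4,)
```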
SPMD (Single Program, Multiple Data) represents JAX’s distributed execution model. Programmers write code as if targeting a single device, then add sharding annotations indicating how to partition tensors across multiple TPU cores. The XLA compiler and GSPMD (General SPMD) subsystem automatically insert communication operations to maintain program semantics while executing across distributed devices.⁹⁹
Sharding annotations use PartitionSpec to declare distribution strategies. PartitionSpec('batch', None) shards a tensor’s first dimension across the 'batch' axis of the device mesh while replicating the second dimension. PartitionSpec(None, 'model') implements tensor parallelism by partitioning the second dimension. The annotations compose across arbitrary tensor ranks and device mesh dimensions.¹⁰⁰
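In current JAX releases the same specs attach to arrays through a device mesh. A sketch (the mesh axis names and sizes are arbitrary, and on a single-device backend the mesh collapses to one device):

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 2D logical mesh over whatever devices are available;
# on a TPU slice the axes would typically be larger, e.g. (8, 4).
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("batch", "model"))

x = jnp.ones((32, 1024))
w = jnp.ones((1024, 4096))

# Data parallelism: shard activations along the 'batch' mesh axis.
x = jax.device_put(x, NamedSharding(mesh, P("batch", None)))
# Tensor parallelism: shard the weight's output dimension along 'model'.
w = jax.device_put(w, NamedSharding(mesh, P(None, "model")))

# The compiler derives the output sharding and any needed collectives.
y = jax.jit(lambda a, b: a @ b)(x, w)
print(y.sharding)
```

The jit-compiled matmul contains no explicit communication code; the compiler derives it from the operand shardings.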
GSPMD’s automatic parallelization eliminates vast amounts of boilerplate code. Traditional distributed training requires manually inserting an all-gather before operations that need full tensors, a reduce-scatter after computing distributed gradients, and an all-reduce for global reductions. GSPMD analyzes sharding specifications and automatically inserts appropriate collectives, freeing programmers to focus on the algorithm rather than communication engineering.¹⁰¹
The compiler propagates sharding decisions through the computation graph using constraint solving. If operation A outputs a sharded tensor consumed by operation B, GSPMD infers B’s optimal sharding based on how the output gets used, potentially inserting resharding operations only where mathematically necessary.¹⁰² The automated inference prevents the "sharding spaghetti" that plagues hand-written distributed code.
JAX provides fine-grained control when automation falls short. with_sharding_constraint forces specific sharding at graph locations, overriding automatic inference. Custom PJIT (parallel JIT) annotations specify exact device placement and sharding strategies for performance-critical code paths. The layered model enables rapid prototyping with automatic sharding while supporting expert optimization where required.¹⁰³
Shardy emerged as GSPMD’s successor in 2025, implementing improved constraint propagation algorithms and better handling of dynamic shapes.¹⁰⁴ The new system exposes additional optimization opportunities by reasoning about sharding choices jointly across larger graph regions rather than operation-by-operation.
PyTorch/XLA: Bringing PyTorch to TPUs
PyTorch/XLA enables running PyTorch models on TPUs with minimal code changes, bridging the gap between PyTorch’s imperative programming model and XLA’s graph-based compilation.¹⁰⁵ The integration balances preserving PyTorch’s developer experience with exposing TPU-specific optimizations.
The fundamental challenge stems from PyTorch’s eager execution philosophy. PyTorch executes operations immediately as Python statements execute, enabling debugging with standard tools and natural control flow. XLA requires capturing complete computation graphs before compilation, creating tension between eager execution and the performance benefits of graph compilation.¹⁰⁶
PyTorch/XLA 2.4 introduced eager mode support, addressing the impedance mismatch. The implementation dynamically traces PyTorch operations into XLA graphs, allowing developers to write standard PyTorch code while still benefiting from XLA compilation.¹⁰⁷ The mode trades some compilation optimization opportunities for development velocity and debugging simplicity.
Graph mode remains the primary path for production deployments. Developers explicitly mark functions for XLA compilation using decorators or compilation APIs. The explicit annotations enable aggressive optimization but require understanding which operations should be fused into a single XLA graph versus executed independently.¹⁰⁸
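The basic flow looks like ordinary PyTorch plus an explicit step boundary. A minimal sketch, assuming a working torch_xla installation (the exact API surface varies somewhat across releases):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()              # the TPU (or XLA fallback) device

model = torch.nn.Linear(128, 64).to(device)
x = torch.randn(32, 128, device=device)

# Operations are traced lazily into an XLA graph...
y = torch.relu(model(x))

# ...and the accumulated graph is compiled and executed at the step boundary.
xm.mark_step()
print(y.shape)
```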
Pallas integration brings custom kernel development to PyTorch/XLA. Pallas provides a low-level language for writing TPU kernels when XLA’s automatic fusion falls short or specialized operations require hand-optimization.¹⁰⁹ The language exposes TPU memory hierarchy (VMEM, CMEM, HBM) and compute units (MXU, VPU) while remaining higher-level than raw assembly.
Built-in Pallas kernels implement performance-critical operations like FlashAttention and PagedAttention. FlashAttention’s tiled attention computation reduces memory bandwidth requirements from O(n²) to O(n) for sequence length n, enabling models to process much longer sequences within fixed memory budgets.¹¹⁰ PagedAttention optimizes key-value cache management for serving, achieving 5× speedup compared to padded implementations.¹¹¹
The PyTorch/XLA bridge proved critical for vLLM TPU—a high-performance serving framework originally designed for GPUs. The implementation uses JAX as an intermediate lowering path even for PyTorch models, exploiting JAX’s superior parallelism support while maintaining PyTorch frontend compatibility.¹¹² The architecture achieved 2-5× performance improvements over the course of 2025 compared to initial prototypes.
Model compatibility challenges persist despite improvements. Some PyTorch operations lack XLA equivalents, forcing a fallback to CPU execution that degrades performance. Dynamic control flow is poorly supported by graph compilation, often necessitating architectural changes to replace dynamic behavior with static, compilable alternatives. The PyTorch/XLA repository documents compatibility and provides migration guides for common problematic patterns.¹¹³
Precision Formats: BFloat16, FP8, and Quantization
TPU’s support for reduced-precision arithmetic enables dramatic performance and memory improvements while maintaining acceptable model quality. Understanding the numerical properties of different formats and when to apply each proves critical for achieving optimal performance.¹¹⁴
BFloat16 represents Google’s early bet on reduced-precision training, first appearing in TPU v2. The format maintains FP32’s 8-bit exponent while truncating the mantissa to 7 bits (plus sign bit).¹¹⁵ The full exponent range prevents the underflow and overflow that plagued early FP16 training, where gradients frequently escaped FP16’s representable range.
The reduced mantissa introduces quantization error but rarely impacts final model quality. Engineers observed that models trained in bfloat16 typically match FP32-trained baselines within statistical noise, likely because the quantization acts as a form of regularization, preventing overfitting to tiny numerical details.¹¹⁶ The format halves memory bandwidth and capacity requirements compared to FP32, directly translating to performance gains on memory-bound workloads.
FP8 takes reduced precision further, compressing weights and activations to 8 bits. Two standard encodings exist: E4M3 (4-bit exponent, 3-bit mantissa) prioritizes precision for forward passes, while E5M2 (5-bit exponent, 2-bit mantissa) prioritizes range for backward passes where gradient magnitudes vary widely.¹¹⁷ Ironwood implements native FP8 support for both formats, whereas earlier TPUs emulated FP8 through software transformations.¹¹⁸
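The numeric envelopes of these formats can be inspected directly through the ml_dtypes package that JAX depends on (a sketch; dtype names as exposed by that package):

```python
import ml_dtypes

for name, dt in [("bfloat16", ml_dtypes.bfloat16),
                 ("fp8 E4M3", ml_dtypes.float8_e4m3fn),
                 ("fp8 E5M2", ml_dtypes.float8_e5m2)]:
    info = ml_dtypes.finfo(dt)
    print(f"{name:10s} max ~ {float(info.max):>12.4g}   "
          f"machine eps ~ {float(info.eps):.4g}")
```

The split is visible immediately: E4M3 gives up range (a maximum around 448) for a finer step size, while E5M2 keeps a much wider range at coarser precision, matching the forward/backward division of labor described above.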
Quantization awareness during training enables FP8’s numerical success. Models trained from scratch with FP8 or fine-tuned with FP8-aware techniques learn weight distributions that tolerate the format’s limited precision. Post-training quantization (converting FP32 models to FP8 after training) often degrades quality without careful calibration.¹¹⁹
INT8 quantization delivers even greater memory savings and inference speedups. Google’s Accurate Quantized Training (AQT) enables INT8 training on TPUs with minimal quality loss compared to bfloat16 baselines.¹²⁰ The technique applies quantization-aware training from scratch, allowing models to adapt to INT8’s constraints during learning rather than through post-training approximation.
Mixed-precision strategies combine formats strategically. Forward passes might use FP8 for activations and weights, backward passes use FP8 E5M2 or bfloat16 for gradients, and optimizer states remain in FP32 for numerical stability during weight updates.¹²¹ The mixed approach balances speed, memory, and accuracy, often achieving 90%+ of FP32 quality while running 4× faster.
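A schematic version of that split in JAX (hand-rolled for clarity; production setups usually delegate the casting policy to a library, and the model here is a toy linear layer):

```python
import jax
import jax.numpy as jnp

def train_step(fp32_params, x, y, lr=1e-3):
    # Forward and backward passes in bfloat16 to save bandwidth...
    bf16_params = jax.tree_util.tree_map(
        lambda p: p.astype(jnp.bfloat16), fp32_params)

    def loss(p):
        pred = x.astype(jnp.bfloat16) @ p["w"] + p["b"]
        return jnp.mean((pred.astype(jnp.float32) - y) ** 2)

    grads = jax.grad(loss)(bf16_params)
    # ...but apply updates to FP32 master weights for numerical stability.
    return jax.tree_util.tree_map(
        lambda p, g: p - lr * g.astype(jnp.float32), fp32_params, grads)

params = {"w": jnp.zeros((4, 1), jnp.float32), "b": jnp.zeros((1,), jnp.float32)}
params = jax.jit(train_step)(params, jnp.ones((8, 4)), jnp.ones((8, 1)))
```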
Precision tradeoffs extend beyond speed and memory to include numerical stability considerations. Batch normalization, layer normalization, and softmax require careful numerical handling in reduced precision. Large exponentials in softmax can overflow FP8 or bfloat16 ranges; subtracting the maximum logit before exponentiation keeps every intermediate value within the representable range without changing the result.
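The standard max-subtraction trick in code (a generic sketch, not TPU-specific):

```python
import jax.numpy as jnp

def stable_softmax(logits):
    # Subtract the running maximum so the largest exponent is exp(0) = 1,
    # keeping every intermediate inside the representable range.
    shifted = logits - jnp.max(logits, axis=-1, keepdims=True)
    exps = jnp.exp(shifted)
    return exps / jnp.sum(exps, axis=-1, keepdims=True)

logits = jnp.array([1000.0, 1001.0, 1002.0])       # exp(1002) overflows even FP32
print(stable_softmax(logits))                      # finite, sums to 1
print(jnp.exp(logits) / jnp.sum(jnp.exp(logits)))  # naive version: inf/inf -> nan
```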