Microarchitecture simulators have been conceived and implemented as valuable tools for the design of computing chips of all types (SimpleScalar, gem5, SMTSIM, Sniper, Qflex, Scarab, GPGPU-sim, Accel-Sim, Multi2Sim, NaviSim, SCALE-sim, gem5-Salam, TAO, PyTorchSim – the list is neither historically complete nor up to date). In essence, microarchitecture simulators have an “impossible” objective: to model and measure the properties of a chip (and of a complete system built on it) both quickly and accurately; a contradiction in terms. They try to approach this goal by operating at an abstraction layer that allows fast exploration of a large space of options while still remaining relatively close to the actual hardware they model, at least at the cycle level. The objective is the same across different computing units: CPUs, GPUs, DSAs (Domain-Specific Accelerators), including, of course, AIAs (AI accelerators).
Trace-based simulation, event-driven simulation, and, more recently, machine-learning-based simulation approaches are employed for the same purpose: to explore and rank-order the designers’ ideas about the microarchitecture in terms of important system properties – primarily performance, but also power and energy consumption, reliability, and security. Faced with hundreds or thousands of different microarchitecture and software combinations, architects and designers are suffocating. Microarchitecture simulation narrows this space down to a handful of configurations that can subsequently be analyzed at lower (finer) abstraction layers.
This short blog post aims to quantify, in admittedly simplistic terms, the challenges of the microarchitecture simulator’s role in today’s landscape.
The Simulation Throughput Bet
Assuming that simulation at any abstraction layer can be parallelized similarly, we compare single simulation runs at each level. A single workload run on a simulated or real CPU, GPU, or AIA has a throughput of approximately (IPS = instructions/operations per second):
- < 1 Kilo-IPS simulated at the gate level
- 10 – 50 Kilo-IPS simulated at the register-transfer level (boosted to 5 – 10 Mega-IPS when FPGA-accelerated simulation is employed)
- 0.3 – 1 Mega-IPS simulated at the microarchitecture level
- 1 – 3 Giga-IPS running on real silicon
(Absolute numbers may vary depending on the host system running simulations, but relative differences are close to the above.)
The closest representation of the real physical system in the list is the gate level (if one wants to descend further down, the transistor level). Using the above approximate numbers, a short workload with a 10-second runtime on final silicon needs about 1 year of gate-level simulation. The same run at the microarchitecture level takes less than a week!
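As a rough back-of-the-envelope check, the short Python sketch below reproduces these orders of magnitude; the throughput figures are simply picked from the ends of the ranges listed above, not measured.

```python
# Back-of-the-envelope: how long does a 10-second silicon workload take to simulate?
# Throughput assumptions taken from the ranges listed above (instructions per second).
SILICON_IPS = 3e9          # upper end of 1 - 3 Giga-IPS on real silicon
GATE_LEVEL_IPS = 1e3       # < 1 Kilo-IPS at the gate level
UARCH_IPS = 0.3e6          # lower end of 0.3 - 1 Mega-IPS at the microarchitecture level

workload_instructions = 10 * SILICON_IPS   # a 10-second run on final silicon

gate_days = workload_instructions / GATE_LEVEL_IPS / 86_400
uarch_days = workload_instructions / UARCH_IPS / 86_400

print(f"Gate level:        ~{gate_days / 365:.1f} years")   # about a year
print(f"Microarchitecture: ~{uarch_days:.1f} days")          # well under a week
```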
Let’s now assume a simple design space exploration case, which only involves:
- 10 workloads, each of roughly the short duration above
- 20 different microarchitecture points (counts, sizes, and organizations of registers, buffers, queues, caches, arithmetic units)
- 5 compilation options
Exploring this design space using gate-level simulation (bravely assuming the entire workloads can run at this level) would require 1000 years of simulation time. If 1000 servers were available for simulation, this time would shrink to only (!) 1 year. At the microarchitecture level, it is a matter of only a few days!
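The same kind of sketch, reusing the per-run times from the previous calculation (assumed, not measured), illustrates the gap for the whole exploration:

```python
# Rough cost of the toy design-space exploration above:
# 10 workloads x 20 microarchitecture points x 5 compilation options = 1000 runs.
runs = 10 * 20 * 5

gate_years_per_run = 1.0   # ~1 year per run at the gate level (from the calculation above)
uarch_days_per_run = 3.0   # a few days (less than a week) per run at the microarchitecture level
servers = 1000             # runs are independent: one run per server

print(f"Gate level, 1 server:       ~{runs * gate_years_per_run:.0f} years")
print(f"Gate level, {servers} servers:   ~{runs * gate_years_per_run / servers:.0f} year(s)")
print(f"Microarch., {servers} servers:   ~{runs * uarch_days_per_run / servers:.0f} day(s)")
```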
The space that architects and designers need to explore does not consist of only 10 workloads, 20 microarchitecture points, and 5 compilation options. Microarchitecture-level simulation reduces years of simulation studies for design exploration down to days.
Do Accelerator-rich Designs Make a Microarchitecture Simulator’s Life Easier?
Domain-specific accelerators (DSAs), and in particular artificial intelligence accelerators (AIAs), are very fast and energy-efficient at the few tasks they are dedicated to serve. Does the small number of tasks mean that the design space to explore is smaller, or at least more manageable, than that of CPUs or GPUs? Most probably not. The AI accelerator design space is very large because of the rapidly evolving and expanding set of ML algorithms. As a result, accelerator chip designs often undergo major changes. For example, in systolic-array-based AIAs, at least the following design parameters must be evaluated in simulation at design time (a sketch of such a configuration space follows the list):
- Dimensions of the systolic array
- Dataflow of the systolic array (input, output, or weight stationary)
- Data type of processing elements (short or long integers or floating-point numbers)
- Level of the memory hierarchy the AIA is connected to
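To make the enumeration concrete, here is a minimal, hypothetical sketch of such a configuration space; all class fields, names, and value ranges are illustrative and not taken from any particular simulator or product.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical systolic-array AIA configuration; names and values are illustrative only.
@dataclass(frozen=True)
class AIAConfig:
    rows: int            # systolic array dimensions
    cols: int
    dataflow: str        # input-, output-, or weight-stationary
    pe_dtype: str        # data type of the processing elements
    memory_level: str    # level of the memory hierarchy the AIA attaches to

DIMS       = [(16, 16), (32, 32), (128, 128)]
DATAFLOWS  = ["input_stationary", "output_stationary", "weight_stationary"]
PE_DTYPES  = ["int8", "int16", "fp16", "fp32"]
MEM_LEVELS = ["L2", "LLC", "DRAM"]

design_space = [
    AIAConfig(r, c, df, dt, mem)
    for (r, c), df, dt, mem in product(DIMS, DATAFLOWS, PE_DTYPES, MEM_LEVELS)
]
print(len(design_space), "configurations to simulate")   # 3 * 3 * 4 * 3 = 108
```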
Because of the rapid changes in the ML/AI algorithms used by AIAs for training and inference, their “useful” life span is, and will likely remain, much shorter than that of established general-purpose CPU and GPU architectures, where the microarchitecture knobs are also plentiful but well understood. Therefore, the effort put into an AIA design may see a much shorter useful production time than that of a CPU or a GPU. A new design space exploration phase will need to be performed for each new generation of AIAs designed for the new generation of ML/AI algorithms.
Beyond Performance Exploration – Resilience at Scale
Microarchitecture simulators have been employed for resilience analysis of CPUs and GPUs for almost a decade now. Recently, they have proven very effective in silent data corruption (SDC) research: to automatically generate functional test programs for CPUs, to demystify the actual rate of SDCs generated by CPUs at datacenter scale, to accurately analyze SDCs for AIAs, and to contribute to decision making for fault protection of AIA-based systems.
Resilience (and SDC) analysis for large-scale AI systems (recently openly recognized by the OCP group of companies and the research community as a critical problem) adds extra dimensions to the design space exploration arena (these dimensions exist for CPU and GPU resilience analysis too):
- Type of silicon defects to analyze (fault models)
- Methods to enhance the resilience of the systems (protection)
- Scale of AI systems
For modeling defects and faults, the main challenge is enhancing microarchitecture simulators for AIAs with mechanisms for the accurate representation of silicon defect root causes and behaviors (such as transient, permanent, or delay faults) and the physical conditions that excite them (temperature, voltage droops, etc.).
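As a flavor of what such fault modeling involves, below is a minimal, simulator-agnostic sketch of transient and permanent (stuck-at) bit-flip injection into a stored word; a real microarchitecture simulator would tie the injection to specific hardware structures, cycles, and the physical conditions mentioned above.

```python
import random

WORD_BITS = 32

def inject_transient(value: int, bit: int) -> int:
    """Single-event upset: flip one bit of a stored value once."""
    return value ^ (1 << bit)

def inject_stuck_at(value: int, bit: int, stuck_to: int) -> int:
    """Permanent fault: the bit reads as 0 or 1 regardless of the written value."""
    if stuck_to:
        return value | (1 << bit)
    return value & ~(1 << bit)

# Example: corrupt a word read from a simulated register or buffer of the AIA.
golden = 0x3F80_0000                   # e.g., the bit pattern of float 1.0
bit = random.randrange(WORD_BITS)      # uniformly chosen fault site
faulty = inject_transient(golden, bit)
print(f"bit {bit}: {golden:#010x} -> {faulty:#010x}")
```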
Protection schemes represent yet another design knob. When a redundancy technique (in space, time, or information) is employed to detect and correct hardware faults, it alters the design of the microarchitecture, the software, or both, adding overheads in terms of area, power, and performance. Applying the simple calculations presented above to this context points to a further expansion of the design space that the simulator is called to explore.
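A crude way to see this expansion is to multiply the earlier run count by the protection options under evaluation; the schemes and overhead figures below are illustrative assumptions only.

```python
# Hypothetical protection schemes and assumed overheads (illustrative numbers only).
protection_schemes = {
    "none":         {"area": 1.00, "perf": 1.00},
    "parity/ECC":   {"area": 1.10, "perf": 1.02},   # information redundancy
    "dual_modular": {"area": 2.00, "perf": 1.05},   # space redundancy
    "re-execution": {"area": 1.05, "perf": 1.80},   # time redundancy
}

baseline_runs = 10 * 20 * 5                               # from the earlier example
protected_runs = baseline_runs * len(protection_schemes)  # each scheme must be re-evaluated
print(f"{baseline_runs} runs grow to {protected_runs} once protection is a design knob")
```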
The increasing scale of datacenters and HPC clusters (datacenters for “usual” or ML/AI workloads, HPC systems for scientific computing or ML/AI workloads) exacerbates the problem of device variability – chips that are designed to be identical but operate with variations. The scale of deployment of AIA chips, CPUs, and GPUs, coupled with the diversity of workloads, increases the number of failure mechanisms and scenarios that can happen in the field but were never imagined at chip design or manufacturing time. Yet another dimension to analyze.
Microarchitecture modeling and simulation for joint performance and resilience exploration is now an integral part of brave development activities around the globe; for one, the DARE project, Europe’s most ambitious endeavor to date in HPC and AI computing, heavily relies on microarchitectural simulation for all three computing engines it builds (a high-performance general-purpose microprocessor, an aggressive vector processor, and an AI inference engine) to make design decisions for performance, power, and resilience. The expected FIT (failures-in-time) and SDC rates when the designed chips are deployed at scale will be estimated using microarchitecture simulation, and protection schemes will be diligently implemented.
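As an illustration of what “at scale” means here: FIT is defined as failures per 10^9 device-hours, so a per-chip FIT estimate obtained from simulation translates into fleet-level expectations roughly as follows (the fleet size and FIT value are made-up numbers).

```python
# FIT = failures per 1e9 device-hours. Translate a per-chip estimate to fleet scale.
chip_fit = 50          # hypothetical per-chip FIT estimated via microarchitecture simulation
fleet_size = 100_000   # hypothetical number of deployed accelerator chips
hours = 24 * 365       # one year of continuous operation

expected_failures = chip_fit * fleet_size * hours / 1e9
print(f"~{expected_failures:.0f} expected failures per year across the fleet")   # ~44
```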
The Need for Validation
Like any abstraction, microarchitecture-level modeling is constantly questioned on the validity of its findings versus simulation at the more detailed layers it abstracts away (there are examples of gem5 validation for performance measurements and for reliability measurements against physical systems and experiments). Validation of simulators is among the most important and useful results expected by the computer architecture and systems research community as our computing systems increase in complexity and scale everywhere. Simulators are extremely important, but we need to continuously tune them to model, as accurately as possible, the properties of physical computing systems and the behaviors of modern software stacks.
**About the Author:** Dimitris Gizopoulos is Professor of Computer Architecture at the University of Athens. His research team (Computer Architecture Lab) focuses on modeling, evaluating, and improving the performance, dependability, and energy efficiency of computing systems based on CPUs, GPUs, and AIAs.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.