TinyTinyTPU
A minimal 2×2 systolic-array TPU-style matrix-multiply unit, implemented in SystemVerilog and deployed on FPGA.
This project implements a complete TPU architecture including:
- 2×2 systolic array (4 processing elements)
- Full post-MAC pipeline (accumulator, activation, normalization, quantization)
- UART-based host interface
- Multi-layer MLP inference capability
- FPGA deployment on Basys3 (Xilinx Artix-7)
Resource Usage (Basys3 XC7A35T):
- LUTs: ~1,000 (5% utilization)
- Flip-Flops: ~1,000 (3% utilization)
- DSP48E1: 8 slices
- BRAM: ~10-15 blocks
- Estimated Gate Count: ~25,000 gates
Table of Contents
- Project Overview
- Quick Start
- Simulation & Testing
- FPGA Build & Deployment
- Running Inference
- Project Structure
- Architecture Details
- Open Source Tooling (Yosys/nextpnr)
Project Overview
TinyTinyTPU is an educational implementation of Google's TPU architecture, scaled down to a 2×2 systolic array. It demonstrates:
- Systolic Array Architecture: Data flows horizontally (activations) and vertically (partial sums)
- Diagonal Wavefront Weight Loading: Staggered weight capture for proper systolic timing
- Full MLP Pipeline: Weight FIFO → MMU → Accumulator → Activation → Normalization → Quantization
- Multi-Layer Inference: Supports sequential layer processing with double-buffered activations
Design Philosophy
This is a minimal, educational-scale TPU designed for:
- Learning TPU architecture principles
- Understanding systolic array dataflow
- FPGA prototyping and experimentation
- Small-scale ML inference (2×2 matrices)
For production workloads, scale up the array size (e.g., 256×256 like Google TPU v1).
Quick Start
Prerequisites
For Simulation:
- Verilator 5.022 or later
- Python 3.8+
- cocotb
- GTKWave or Surfer (for waveform viewing)
For FPGA Build:
- Xilinx Vivado 2020.1 or later (for Basys3)
- OR Yosys + nextpnr (open source alternative, see Open Source Tooling)
For Running Inference:
- Basys3 FPGA board
- USB cable for programming
- Python 3.8+ with pyserial
Installation
# Clone the repository
git clone <repository-url>
cd tinytinyTPU-co
# Set up simulation environment
cd sim
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
Simulation & Testing
Running Tests
All simulation commands must be run from the sim/ directory:
cd sim
# Run all tests
make test
# Run all tests with waveform generation
make test WAVES=1
# Run specific module tests
make test_pe
make test_mmu
make test_mlp
make test_uart
make test_tpu_system
# Run with waveforms
make test_pe WAVES=1
Test Coverage
| Test File | Module | Coverage |
|---|---|---|
| test_pe.py | Processing Element | Reset, MAC operations, weight capture |
| test_mmu.py | 2×2 Systolic Array | Weight loading, matrix multiply |
| test_weight_fifo.py | Weight FIFO | Push/pop, wraparound |
| test_dual_weight_fifo.py | Dual Weight FIFO | Column independence, skew timing |
| test_accumulator.py | Accumulator | Alignment, buffering, accumulate/overwrite modes |
| test_activation_func.py | Activation Function | ReLU positive/negative/zero cases |
| test_normalizer.py | Normalizer | Gain, bias, shift operations |
| test_activation_pipeline.py | Activation Pipeline | Full pipeline, saturation handling |
| test_mlp_integration.py | MLP Top | Multi-layer MLP inference |
| test_uart_controller.py | UART Controller | Command parsing, response generation |
| test_tpu_system.py | TPU Top | End-to-end system integration |
Viewing Waveforms
# List available waveforms
make waves
# Open specific waveform
make waves MODULE=pe
make waves MODULE=mmu
make waves MODULE=mlp_top
FPGA Build & Deployment
Building with Vivado
The project includes a TCL script for automated Vivado builds:
cd fpga
# Build bitstream (synthesis + implementation + bitgen)
vivado -mode batch -source build_vivado.tcl
# Expected build time: 5-10 minutes
# Output: basys3_top.bit
Build Script Details:
- Creates Vivado project: vivado_project/tinytinyTPU_basys3
- Synthesizes all RTL files from ../rtl/
- Implements design with timing constraints
- Generates bitstream: basys3_top.bit
- Creates reports: utilization, timing, DRC
Resource Utilization (Post-Implementation):
- Check vivado_project/tinytinyTPU_basys3.runs/impl_1/utilization_post_impl.rpt
- Check vivado_project/tinytinyTPU_basys3.runs/impl_1/timing_summary_post_impl.rpt
Programming the FPGA
Via Vivado Hardware Manager (GUI):
- Connect Basys3 board via USB
- Open Vivado
- Open Hardware Manager
- Auto-connect to target
- Program with basys3_top.bit
Via Command Line:
vivado -mode tcl
open_hw_manager
connect_hw_server
open_hw_target
set_property PROGRAM.FILE {basys3_top.bit} [get_hw_devices xc7a35t_0]
program_hw_devices [get_hw_devices xc7a35t_0]
Via OpenOCD (Alternative):
# If using OpenOCD with Digilent cable
openocd -f interface/ftdi/digilent_jtag_hs3.cfg -f target/xc7a35t.cfg
# Then use GDB or other tools to program
Hardware Connections
Basys3 Pinout:
- UART RX (B18): Receives commands from PC
- UART TX (A18): Sends responses to PC
- Clock: 100 MHz (onboard oscillator)
- Reset: Center button (BTNC, U18)
- LEDs: Status display (see fpga/README.md for LED modes)
UART Settings:
- Baud Rate: 115200
- Data Bits: 8
- Parity: None
- Stop Bits: 1
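As a rough guide to what these settings imply, the sketch below works out the wire time per byte and the clock divider a UART would need at the board's 100 MHz clock. It assumes standard 8N1 framing (1 start + 8 data + 1 stop = 10 bits per byte); the helper names are hypothetical, and the divider actually used in the RTL is not stated here.

```python
CLOCK_HZ = 100_000_000   # Basys3 onboard oscillator
BAUD_RATE = 115200       # from the UART settings above
START_BITS = 1           # every asynchronous UART frame has one start bit
DATA_BITS = 8
PARITY_BITS = 0          # parity: none
STOP_BITS = 1


def clocks_per_bit(clock_hz: int = CLOCK_HZ, baud: int = BAUD_RATE) -> int:
    """Nearest integer number of system clocks per UART bit."""
    return round(clock_hz / baud)


def transfer_time_us(num_bytes: int, baud: int = BAUD_RATE) -> float:
    """Wire time in microseconds to send num_bytes back to back."""
    bits_per_byte = START_BITS + DATA_BITS + PARITY_BITS + STOP_BITS
    return num_bytes * bits_per_byte * 1_000_000 / baud


if __name__ == "__main__":
    print(f"clocks per bit: {clocks_per_bit()}")
    print(f"1 byte : {transfer_time_us(1):.1f} us")
    # 5 bytes is an illustrative frame size (one opcode plus four payload bytes)
    print(f"5 bytes: {transfer_time_us(5):.1f} us")
```

At these settings a single byte occupies the line for roughly 87 microseconds, so for a 2×2 workload the UART link, not the array, is likely to dominate end-to-end latency.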
Running Inference
Python Host Interface
The project includes a Python driver for communicating with the FPGA:
cd host
# Basic inference demo
python3 inference_demo.py
# Gesture recognition demo (requires trained model)
python3 gesture_demo.py
# Interactive test
python3 test_tpu_driver.py
Inference Demo
The inference_demo.py script demonstrates:
- Loading weights into the TPU
- Loading input activations
- Executing inference
- Reading results
Example Usage:
from tpu_driver import TPUDriver
# Connect to FPGA (adjust port as needed)
tpu = TPUDriver('/dev/ttyUSB0') # Linux
# tpu = TPUDriver('COM3') # Windows
# Load 2×2 weight matrix
weights = [[1, 2], [3, 4]]
tpu.write_weights(weights)
# Load 2×2 activation matrix
activations = [[5, 6], [7, 8]]
tpu.write_activations(activations)
# Execute inference
tpu.execute()
# Read results
result = tpu.read_result()
print(f"Result: {result}")
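To sanity-check what comes back from the board, it can help to compute the expected product on the host. The sketch below is plain Python and assumes the TPU computes activations × weights, with each row of the activation matrix treated as one input vector. The operand order and the shape of the value returned by read_result() are not stated here, so treat both as assumptions. matmul_2x2 is a hypothetical helper, not part of tpu_driver.

```python
def matmul_2x2(a, b):
    """Return the 2x2 product a @ b using plain integer arithmetic."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


if __name__ == "__main__":
    weights = [[1, 2], [3, 4]]
    activations = [[5, 6], [7, 8]]
    # Assumed operand order: activations @ weights
    print(matmul_2x2(activations, weights))  # [[23, 34], [31, 46]]
```

If the hardware result does not match, trying the opposite operand order (weights @ activations) is a cheap first check before reaching for waveforms.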
Gesture Recognition Demo
The gesture_demo.py script implements a simple gesture classifier:
- Trains a 2-layer MLP on mouse movement data
- Classifies gestures as "Horizontal" or "Vertical"
- Real-time inference on FPGA
Running the Demo:
cd host
python3 gesture_demo.py
Model Training:
cd model
python3 train.py
# Generates: gesture_model.json
UART Protocol
The TPU uses a simple byte-based UART protocol:
Commands:
- 0x01: Write Weight (4 bytes: W00, W01, W10, W11)
- 0x02: Write Activation (4 bytes: A00, A01, A10, A11)
- 0x03: Execute (start inference)
- 0x04: Read Result (returns 4 bytes: acc0[31:0])
- 0x05: Read Result Column 1 (returns 4 bytes: acc1[31:0])
- 0x06: Read Status (returns 1 byte: state[3:0] | cycle_cnt[3:0])
See host/tpu_driver.py for full protocol implementation.
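The command list above can be sketched as byte-level helpers. This is an illustration, not the code in host/tpu_driver.py: the function names are hypothetical, and three details are assumptions the list does not settle, namely that values are signed int8 sent as two's-complement bytes, that the 4-byte result is little-endian, and that state occupies the high nibble of the status byte.

```python
import struct

# Opcodes as listed in the command table
CMD_WRITE_WEIGHT = 0x01
CMD_WRITE_ACT = 0x02
CMD_EXECUTE = 0x03
CMD_READ_RESULT = 0x04
CMD_READ_RESULT_COL1 = 0x05
CMD_READ_STATUS = 0x06


def _to_byte(value: int) -> int:
    """Encode a signed int8 as its two's-complement byte (assumed encoding)."""
    if not -128 <= value <= 127:
        raise ValueError(f"{value} does not fit in int8")
    return value & 0xFF


def build_write_frame(opcode: int, matrix) -> bytes:
    """Opcode followed by four payload bytes in row-major order (M00, M01, M10, M11)."""
    payload = [_to_byte(v) for row in matrix for v in row]
    if len(payload) != 4:
        raise ValueError("expected a 2x2 matrix")
    return bytes([opcode] + payload)


def decode_result(raw: bytes) -> int:
    """Interpret 4 result bytes as a signed 32-bit value (little-endian is an assumption)."""
    return struct.unpack("<i", raw)[0]


def decode_status(raw: int):
    """Split the status byte into (state, cycle_cnt); state in the high nibble is an assumption."""
    return (raw >> 4) & 0xF, raw & 0xF


if __name__ == "__main__":
    print(build_write_frame(CMD_WRITE_WEIGHT, [[1, 2], [3, 4]]).hex())  # 0101020304
```

Keeping frame construction separate from the serial port makes the protocol easy to unit-test without a board attached.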
Project Structure
tinytinyTPU-co/
├── rtl/ # SystemVerilog RTL source files
│   ├── pe.sv # Processing Element (MAC unit)
│   ├── mmu.sv # 2×2 Matrix Multiply Unit (systolic array)
│   ├── weight_fifo.sv # Single-column weight FIFO
│   ├── dual_weight_fifo.sv # Dual-column weight FIFO with skew
│   ├── accumulator.sv # Top-level accumulator
│   ├── accumulator_align.sv # Column alignment logic
│   ├── accumulator_mem.sv # Double-buffered accumulator memory
│   ├── activation_func.sv # ReLU/ReLU6 activation
│   ├── normalizer.sv # Gain/bias/shift normalization
│   ├── loss_block.sv # L1 loss computation
│   ├── activation_pipeline.sv # Full post-accumulator pipeline
│   ├── unified_buffer.sv # Ready/valid output FIFO
│   ├── mlp_top.sv # Top-level MLP integration
│   ├── tpu_bridge.sv # UART-to-MLP bridge
│   ├── uart_controller.sv # UART command processor
│   ├── uart_rx.sv # UART receiver
│   ├── uart_tx.sv # UART transmitter
│   └── tpu_top.sv # Complete TPU system
│
├── sim/ # Simulation environment
│   ├── Makefile # Build and test automation
│   ├── requirements.txt # Python dependencies
│   ├── tests/ # cocotb Python testbenches
│   │   ├── test_pe.py
│   │   ├── test_mmu.py
│   │   ├── test_weight_fifo.py
│   │   ├── test_dual_weight_fifo.py
│   │   ├── test_accumulator.py
│   │   ├── test_activation_func.py
│   │   ├── test_normalizer.py
│   │   ├── test_activation_pipeline.py
│   │   ├── test_mlp_integration.py
│   │   ├── test_uart_controller.py
│   │   └── test_tpu_system.py
│   └── waves/ # Generated VCD waveforms
│
├── fpga/ # FPGA deployment files
│   ├── basys3_top.sv # Top-level FPGA wrapper
│   ├── basys3.xdc # Pin constraints
│   ├── build_vivado.tcl # Automated build script
│   ├── basys3_top.bit # Generated bitstream
│   └── README.md # FPGA-specific documentation
│
├── host/ # Python host interface
│   ├── tpu_driver.py # TPU communication driver
│   ├── tpu_compiler.py # Model compilation utilities
│   ├── inference_demo.py # Basic inference demo
│   ├── gesture_demo.py # Gesture recognition demo
│   └── test_tpu_driver.py # Driver unit tests
│
├── model/ # ML model training
│   ├── train.py # Model training script
│   └── gesture_model.json # Trained model (JSON format)
│
└── README.md # This file
Architecture Details
Systolic Array Dataflow
PE00 -> PE01 Activations flow horizontally (right)
| |
PE10 -> PE11 Partial sums flow vertically (down)
| |
acc0 acc1 Outputs to accumulator
Weight Loading (Diagonal Wavefront):
- Cycle 0: W10 → col0, no capture
- Cycle 1: W00 → col0 (capture), W11 → col1 (no capture)
- Cycle 2: W01 → col1 (capture)
Activation Flow:
- Row 0: A00 → PE00 → PE01
- Row 1: A10 → PE10 → PE11 (with 1-cycle skew)
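The dataflow described above can be explored with a small cycle-level model. The sketch below is a behavioral approximation in Python, not a translation of pe.sv or mmu.sv: it assumes weights are already loaded and stationary, that each PE registers both its activation output and its partial-sum output once per clock, and that row 1's activation arrives one cycle after row 0's. The function name is hypothetical.

```python
def simulate_systolic_2x2(weights, act_vector, cycles=4):
    """Cycle-level behavioral model of a weight-stationary 2x2 array.

    weights[r][c] is held in PE(r, c); act_vector = (a0, a1).
    Returns a list of (cycle, acc0, acc1) tuples sampled after each clock edge,
    where acc0/acc1 are the partial sums leaving the bottom row.
    """
    def row_input(r, t):
        # Row r receives its activation at cycle r (the 1-cycle skew), else 0.
        return act_vector[r] if t == r else 0

    act = [[0, 0], [0, 0]]    # registered activation leaving each PE (to the right)
    psum = [[0, 0], [0, 0]]   # registered partial sum leaving each PE (downward)
    trace = []
    for t in range(cycles):
        new_act = [[0, 0], [0, 0]]
        new_psum = [[0, 0], [0, 0]]
        for r in range(2):
            for c in range(2):
                act_in = row_input(r, t) if c == 0 else act[r][c - 1]
                psum_in = 0 if r == 0 else psum[r - 1][c]
                new_act[r][c] = act_in
                new_psum[r][c] = psum_in + act_in * weights[r][c]
        act, psum = new_act, new_psum  # all registers update together
        trace.append((t, psum[1][0], psum[1][1]))
    return trace


if __name__ == "__main__":
    for cycle, acc0, acc1 in simulate_systolic_2x2([[1, 2], [3, 4]], (5, 6)):
        print(f"cycle {cycle}: acc0={acc0} acc1={acc1}")
```

In this model the column 0 result (5·1 + 6·3 = 23) appears one cycle before the column 1 result (5·2 + 6·4 = 34), which illustrates why a column-alignment stage is needed downstream of the array.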
Pipeline Stages
- Weight FIFO: Stores weights, outputs with column skew
- MMU (Systolic Array): Matrix multiply-accumulate
- Accumulator: Aligns columns, double-buffered storage
- Activation Pipeline:
- Activation function (ReLU/ReLU6)
- Normalization (gain × bias + shift)
- Quantization (int8 with saturation)
- Unified Buffer: Output FIFO with ready/valid handshaking
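The post-accumulator stages can be modeled in a few lines of Python for use as a golden reference. This is a sketch under assumptions, not the behavior of normalizer.sv or activation_pipeline.sv: the normalization is taken to be a multiply by gain, an added bias, then an arithmetic right shift, and the ReLU6 ceiling is taken as the literal integer 6, whereas a fixed-point design would scale it. All function names and default parameters are hypothetical.

```python
def relu(x: int) -> int:
    """Clamp negative values to zero."""
    return max(0, x)


def relu6(x: int, ceiling: int = 6) -> int:
    """ReLU with an upper clamp (ceiling value is an assumption)."""
    return min(max(0, x), ceiling)


def normalize(x: int, gain: int, bias: int, shift: int) -> int:
    """Assumed form: scale, add bias, then arithmetic right shift."""
    return (x * gain + bias) >> shift


def quantize_int8(x: int) -> int:
    """Saturate to the signed 8-bit range [-128, 127]."""
    return max(-128, min(127, x))


def post_mac(acc: int, gain: int = 1, bias: int = 0, shift: int = 0,
             use_relu6: bool = False) -> int:
    """Activation, then normalization, then int8 quantization."""
    activated = relu6(acc) if use_relu6 else relu(acc)
    return quantize_int8(normalize(activated, gain, bias, shift))


if __name__ == "__main__":
    print(post_mac(23))              # 23
    print(post_mac(-5))              # 0   (ReLU clamps negatives)
    print(post_mac(1000))            # 127 (saturation)
    print(post_mac(1000, shift=4))   # 62  (1000 >> 4)
```

A model like this is mainly useful for generating expected values in cocotb tests, provided its formula is first reconciled with the RTL.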
Multi-Layer MLP
The MLP controller manages sequential layer processing:
State Machine:
IDLE → LOAD_WEIGHT → LOAD_ACT → COMPUTE → DRAIN → TRANSFER → NEXT_LAYER → WAIT_WEIGHTS → ...
- Double Buffering: Activations ping-pong between buffers for layer-to-layer transfer
- Weight Loading: Weights loaded per layer via UART
- Pipeline Overlap: While layer N drains, layer N+1 weights can be loaded
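The ping-pong idea behind the double buffering can be illustrated in software. The sketch below is a conceptual Python model only: it keeps two activation buffers and swaps which one is read and which is written after each layer. The layer function (a dense layer followed by ReLU) and all names are hypothetical stand-ins, and the model ignores the FSM states and the weight-loading overlap.

```python
def dense_relu(weights, acts):
    """out[j] = relu(sum_i acts[i] * weights[i][j])"""
    n_out = len(weights[0])
    return [
        max(0, sum(a * row[j] for a, row in zip(acts, weights)))
        for j in range(n_out)
    ]


def run_layers(layers, input_acts, layer_fn=dense_relu):
    """Run layers sequentially with two activation buffers that swap roles."""
    buffers = [list(input_acts), [0] * len(input_acts)]
    read_idx = 0
    for layer_weights in layers:
        write_idx = 1 - read_idx
        buffers[write_idx] = layer_fn(layer_weights, buffers[read_idx])
        read_idx = write_idx  # the freshly written buffer feeds the next layer
    return buffers[read_idx]


if __name__ == "__main__":
    layers = [
        [[1, 2], [3, 4]],    # layer 0 weights
        [[1, -1], [0, 1]],   # layer 1 weights
    ]
    print(run_layers(layers, [5, 6]))  # [23, 11]
```

Because each layer only reads one buffer and only writes the other, a layer's output never overwrites inputs that are still being consumed, which is the property the hardware double buffer provides.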
Open Source Tooling (Yosys/nextpnr)
Overview
While Vivado is the standard toolchain for Xilinx FPGAs, open-source alternatives exist:
- Yosys: Synthesis (RTL → netlist)
- nextpnr: Place & Route (netlist → bitstream)
Setup
Installation (Ubuntu/Debian):
# Install Yosys
sudo apt-get install yosys
# Install nextpnr (for Xilinx 7-series)
# Requires building from source - see nextpnr documentation
git clone https://github.com/YosysHQ/nextpnr.git
cd nextpnr
cmake . -DARCH=xilinx
make -j$(nproc)
sudo make install
Installation (macOS):
brew install yosys
# nextpnr requires manual build
Building with Yosys/nextpnr
Step 1: Synthesis (Yosys)
cd fpga
# Create synthesis script
cat > synth.ys << 'EOF'
# Read RTL files
read_verilog -sv ../rtl/pe.sv
read_verilog -sv ../rtl/mmu.sv
read_verilog -sv ../rtl/weight_fifo.sv
read_verilog -sv ../rtl/dual_weight_fifo.sv
read_verilog -sv ../rtl/accumulator_align.sv
read_verilog -sv ../rtl/accumulator_mem.sv
read_verilog -sv ../rtl/accumulator.sv
read_verilog -sv ../rtl/activation_func.sv
read_verilog -sv ../rtl/normalizer.sv
read_verilog -sv ../rtl/loss_block.sv
read_verilog -sv ../rtl/activation_pipeline.sv
read_verilog -sv ../rtl/unified_buffer.sv
read_verilog -sv ../rtl/mlp_top.sv
read_verilog -sv ../rtl/uart_rx.sv
read_verilog -sv ../rtl/uart_tx.sv
read_verilog -sv ../rtl/uart_controller.sv
read_verilog -sv ../rtl/tpu_bridge.sv
read_verilog -sv ../rtl/tpu_top.sv
read_verilog -sv basys3_top.sv
# Set top module
hierarchy -top basys3_top
# Synthesize
synth_xilinx -top basys3_top -family xc7
# Write netlist
write_verilog basys3_top_synth.v
write_json basys3_top.json
EOF
# Run synthesis
yosys synth.ys
Step 2: Place & Route (nextpnr)
# Place and route (produces routed JSON and FASM output)
nextpnr-xilinx \
--xdc basys3.xdc \
--json basys3_top.json \
--write basys3_top_routed.json \
--fasm basys3_top.fasm
# Generate bitstream from the FASM output
# Note: this fasm2bit conversion may require Xilinx tools or open-source alternatives
Limitations & Considerations
Current Status:
- Yosys synthesis works well for most SystemVerilog constructs
- nextpnr supports Xilinx 7-series but may have timing/routing challenges
- Bitstream generation (fasm2bit) may require Xilinx tools or open-source alternatives
Recommendations:
- For development: Use Vivado for reliable builds
- For open-source exploration: Use Yosys for synthesis, verify with Vivado
- For production: Stick with Vivado until open-source toolchain matures
Future Work:
- Create automated Yosys/nextpnr build script
- Document fasm2bit conversion process
- Benchmark open-source vs. Vivado results
Troubleshooting
Simulation Issues
Verilator Errors:
- Ensure Verilator 5.022+ is installed
- Check SystemVerilog syntax (use make lint)
Test Failures:
- Run with WAVES=1 to generate waveforms for debugging
- Check sim/test_output.log for detailed error messages
FPGA Build Issues
Synthesis Errors:
- Check that RTL files are in the rtl/ directory
- Verify SystemVerilog syntax (Vivado may be stricter than Verilator)
Timing Violations:
- Check timing_summary_post_impl.rpt
- May need to add pipeline stages or reduce clock frequency
Place & Route Failures:
- Check utilization reports
- Verify constraints in basys3.xdc
Hardware Issues
UART Not Working:
- Verify the serial port: ls /dev/ttyUSB* (Linux) or Device Manager (Windows)
- Check baud rate: 115200
- Verify TX/RX pins in constraints file
LEDs Not Responding:
- Check bitstream programmed correctly
- Verify reset button (center button)
- Check switch settings for LED modes (see fpga/README.md)
Contributing
Contributions welcome! Areas for improvement:
- Additional test coverage
- Performance optimizations
- Documentation improvements
- Open-source toolchain support
- Larger array sizes
Acknowledgments
- Inspired by Google's TPU architecture (thank you Cliff and Richard for your time!)
- The boys from the TinyTPU team!!
- Edmund and the Yosys / Symbiotic EDA crew
- Stanford FAF for the support, funding, and community!
- Princeton ECE Dept for the Basys 3 to play around with :)