We needed to test our CUDA kernels on 15 different GPUs. The problem? Renting all of them costs $3,000 a month. Just for testing.
That's when we thought: what if we could predict how a kernel runs on any GPU without actually owning it?
Not some rough guess. Real numbers. Like, your kernel takes 2.4ms on an RTX 4090 and 5.1ms on a V100. Within 1% of actual hardware.
Three months later, we built it. Now developers test kernels on 50+ GPUs without renting a single one. One team saved $18,000 in cloud costs. Another found a bug on an A100 they've never even touched.
Here's how it works:
The problem: Testing is expensive
The Multi-GPU Testing Problem

Scenario: test a kernel on 15 different GPUs.

Cloud rental costs (per month):
- H100 (80GB): $2.50/hour → $1,800/month
- A100 (80GB): $1.10/hour → $792/month
- RTX 4090: $0.80/hour → $576/month
- V100 (32GB): $0.75/hour → $540/month
- RTX 3080: $0.40/hour → $288/month
- T4: $0.35/hour → $252/month
- RTX 2080 Ti: $0.30/hour → $216/month
- ... and 8 more GPUs

Total monthly cost: ~$7,500. Annual cost: $90,000. For a startup? Impossible.
You're building a CUDA library. Your users have everything from RTX 2060s to H100s. Your kernel runs great on your RTX 4090. Then someone with a V100 complains it's slow. You've never even seen a V100.
The usual answer? Rent them all. $7,500 a month. For a small team, that's just not realistic. So you test on one or two GPUs and cross your fingers. Then the bug reports start coming in.
What we built instead
We built a simulator. You give it your kernel code. It tells you exactly how it runs on any GPU. H100, A100, RTX 4090, V100, whatever. Without running a single line of actual code.
NVIDIA has simulators. They're internal only. There are academic tools too. But they all have the same problems:
1. You can't use them
2. They take forever to set up
3. They're slow (hours per kernel)
4. They're wrong (20-30% error)
We wanted something that works in seconds, needs zero setup, and is actually right.
How it works
We built three different emulators. Each one gets more accurate but needs more information. The system picks the best one for your kernel:
Tier 1: NeuSight Tile-Based Emulator (99% accuracy)

Method: decompose execution into tiles.
- Break the kernel into L2-cache-sized tiles
- Simulate each tile with architectural models
- Account for occupancy, bandwidth, and latency
- Apply architecture-specific corrections

Accuracy: 98-99% on most kernels
Speed: 100-500ms per emulation
Coverage: all kernels with source code

Key insight: tile size is based on the actual GPU's L2 cache.
- Hopper H100: 96MB L2 → large tiles
- Pascal P100: 4MB L2 → small tiles
Tier 1: NeuSight Emulator is the main one. It breaks your kernel into tiles that match the GPU's L2 cache size. Then it simulates each tile. For each one, we figure out:
- How many warps can run at once (based on registers and shared memory)
- How fast memory moves (checking if your accesses are coalesced)
- How many TFLOPs you're actually getting
- How blocks get scheduled across waves

The trick is we use real GPU specs. When simulating an H100, we use its actual 132 SMs, 96MB L2 cache, 3.35TB/s bandwidth. For a GTX 1060, we use 10 SMs, 1.5MB L2, 192GB/s. No fake numbers. Everything comes from NVIDIA's datasheets and our own measurements.
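To make the first of those bullets concrete, here is a minimal sketch of the warps-per-SM calculation in Python. It is illustrative only: the function name and dictionary fields are mine, and real occupancy rules also involve register allocation granularity, but the limits involved (registers, shared memory, warp and block caps per SM) are the kind of values the emulator pulls from its spec database.

    # Rough sketch of warps-per-SM occupancy, limited by registers, shared
    # memory, and hardware caps. Field names and the example launch
    # configuration are illustrative; the per-SM limits are database values.
    def warps_per_sm(regs_per_thread: int, smem_per_block: int,
                     threads_per_block: int, gpu: dict) -> int:
        warps_per_block = (threads_per_block + 31) // 32

        # Limit 1: register file per SM (allocation granularity ignored here)
        blocks_by_regs = gpu["regs_per_sm"] // (regs_per_thread * threads_per_block)

        # Limit 2: shared memory per SM
        blocks_by_smem = (gpu["smem_per_sm"] // smem_per_block
                          if smem_per_block else gpu["max_blocks_per_sm"])

        # Limit 3: hardware caps on resident warps and blocks
        blocks_by_warps = gpu["max_warps_per_sm"] // warps_per_block

        blocks = min(blocks_by_regs, blocks_by_smem, blocks_by_warps,
                     gpu["max_blocks_per_sm"])
        return blocks * warps_per_block

    h100 = {"regs_per_sm": 65536, "smem_per_sm": 228 * 1024,
            "max_warps_per_sm": 64, "max_blocks_per_sm": 32}
    print(warps_per_sm(regs_per_thread=40, smem_per_block=48 * 1024,
                       threads_per_block=256, gpu=h100))   # 32 of 64 warps resident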
Tier 2: NCU Baseline Emulator (hardware-authentic scaling)

Method: scale from real NCU baseline measurements.
- Start with real NCU data from a reference GPU
- Apply architectural scaling factors
- Adjust for compute capability differences

Accuracy: 95-98% when a baseline is available
Speed: 50-200ms per emulation
Coverage: kernels with NCU baseline data

Architecture scaling factors:
- Hopper: 1.05x compute, 1.00x memory
- Ada Lovelace: 1.00x compute, 0.95x memory
- Ampere: 0.92x compute, 0.90x memory
- Turing: 0.85x compute, 0.85x memory
- Volta: 0.88x compute, 0.88x memory
- Pascal: 0.75x compute, 0.80x memory
Tier 2: NCU Baseline Emulator is different. If you already profiled your kernel on one GPU (say, your RTX 4090), we take those real numbers and scale them to other GPUs. We have scaling factors for every architecture. Hopper is 1.05x faster at compute than Ada. Ampere is 0.92x. We measured all of this.
This is fast and really accurate because we start with real hardware data, not a simulation.
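Here is a rough sketch of that scaling step, assuming a simple model where the measured time is divided by an effective speedup blended from compute and memory ratios. The per-architecture factors are the ones in the table above; the blending formula and the names are my assumption, not necessarily the emulator's exact math.

    # Project a measured kernel time from a reference GPU onto a target GPU.
    # ARCH_SCALE holds the factors listed above; blending compute and memory
    # speedups by the kernel's compute-bound fraction is an assumption.
    ARCH_SCALE = {
        "hopper": {"compute": 1.05, "memory": 1.00},
        "ada":    {"compute": 1.00, "memory": 0.95},
        "ampere": {"compute": 0.92, "memory": 0.90},
        "turing": {"compute": 0.85, "memory": 0.85},
        "volta":  {"compute": 0.88, "memory": 0.88},
        "pascal": {"compute": 0.75, "memory": 0.80},
    }

    def project_time_ms(baseline_ms, compute_bound_fraction,
                        ref_arch, tgt_arch, ref_specs, tgt_specs):
        # Effective throughput ratio = raw peak ratio x architectural factor
        compute_speedup = (tgt_specs["tflops"] * ARCH_SCALE[tgt_arch]["compute"]) / \
                          (ref_specs["tflops"] * ARCH_SCALE[ref_arch]["compute"])
        memory_speedup = (tgt_specs["bw_gbps"] * ARCH_SCALE[tgt_arch]["memory"]) / \
                         (ref_specs["bw_gbps"] * ARCH_SCALE[ref_arch]["memory"])

        f = compute_bound_fraction          # taken from the NCU baseline profile
        effective_speedup = f * compute_speedup + (1 - f) * memory_speedup
        return baseline_ms / effective_speedup

    # Example: a 1.8ms, mostly memory-bound kernel measured on an RTX 4090 (Ada),
    # projected onto a V100 (Volta). Specs are from the database excerpt below.
    rtx4090 = {"tflops": 82.6, "bw_gbps": 1010}
    v100 = {"tflops": 15.7, "bw_gbps": 900}
    print(round(project_time_ms(1.8, 0.3, "ada", "volta", rtx4090, v100), 2))  # ~2.87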
Tier 3: Analytical Emulator (fast estimates)

Method: mathematical roofline model.
- Calculate arithmetic intensity (FLOPs/byte)
- Determine whether the kernel is memory- or compute-bound
- Apply quick heuristics for divergence and coalescing

Accuracy: 85-92% (rougher but fast)
Speed: 10-50ms per emulation
Coverage: all kernels, even without source code

When used: fallback when the other tiers are unavailable.

Ridge point calculation:
  ridge = peakTFLOPS / memoryBandwidthGBps
  if (arithmeticIntensity < ridge) → memory bound
  if (arithmeticIntensity >= ridge) → compute bound
Tier 3: Analytical Emulator is the backup. It uses math (a roofline model) to figure out if your kernel is memory-bound or compute-bound. Less accurate (85-92%) but super fast. And it works even if we don't have your source code.
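Here is what a bare-bones version of that roofline estimate could look like, with units spelled out so the ridge point comes out in FLOPs per byte. The divergence and coalescing penalties are placeholder heuristics of mine; the A100 figures are from the database excerpt in the next section.

    # Roofline-style estimate: memory-bound kernels are limited by bandwidth,
    # compute-bound kernels by peak FLOPs. Penalty heuristics are placeholders.
    def roofline_estimate_ms(flops: float, bytes_moved: float, gpu: dict,
                             coalescing: float = 1.0, divergence: float = 0.0) -> float:
        peak_flops = gpu["tflops"] * 1e12        # FLOP/s
        peak_bw = gpu["bw_gbps"] * 1e9           # bytes/s

        intensity = flops / bytes_moved          # FLOP per byte
        ridge = peak_flops / peak_bw             # FLOP per byte

        if intensity < ridge:                    # memory bound
            time_s = bytes_moved / (peak_bw * coalescing)
        else:                                    # compute bound
            time_s = flops / (peak_flops * (1.0 - 0.5 * divergence))
        return time_s * 1e3

    # 4096x4096 FP32 matrix multiply: 2*N^3 FLOPs, roughly 3*N^2*4 bytes moved
    n = 4096
    a100 = {"tflops": 19.5, "bw_gbps": 1550}
    print(round(roofline_estimate_ms(2 * n**3, 3 * n**2 * 4, a100), 2))  # ~7.05ms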
The database
All three emulators use the same database. We scraped specs for 50+ NVIDIA GPUs. Every generation from Hopper down to Pascal:
GPU Architecture Database (excerpt)

H100 (Hopper, sm_90):
- 132 SMs, 96MB L2, 3.35TB/s bandwidth, 67 TFLOPs FP32
- Max occupancy: 64 warps/SM, 2048 threads/SM
- Shared memory: 228KB/SM

RTX 4090 (Ada Lovelace, sm_89):
- 128 SMs, 72MB L2, 1.01TB/s bandwidth, 82.6 TFLOPs FP32
- Max occupancy: 48 warps/SM, 1536 threads/SM
- Shared memory: 100KB/SM

A100 (Ampere, sm_80):
- 108 SMs, 40MB L2, 1.55TB/s bandwidth, 19.5 TFLOPs FP32
- Max occupancy: 64 warps/SM, 2048 threads/SM
- Shared memory: 164KB/SM

V100 (Volta, sm_70):
- 80 SMs, 6MB L2, 900GB/s bandwidth, 15.7 TFLOPs FP32
- Max occupancy: 64 warps/SM, 2048 threads/SM
- Shared memory: 96KB/SM

GTX 1060 (Pascal, sm_61):
- 10 SMs, 1.5MB L2, 192GB/s bandwidth, 4.4 TFLOPs FP32
- Max occupancy: 64 warps/SM, 2048 threads/SM
- Shared memory: 96KB/SM

... and 45+ more GPUs
Got all this from NVIDIA's whitepapers and our own testing. For each GPU:
- How many SMs and what compute capability
- Peak FLOPs for FP32, FP16, INT8
- Memory bandwidth and cache sizes
- Max occupancy
- Special features
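As a rough illustration, one database entry could be shaped like this (the field names are mine; the values are the A100 numbers from the excerpt above):

    # Hypothetical shape of a single spec entry; values match the A100 excerpt.
    from dataclasses import dataclass, field

    @dataclass
    class GpuSpec:
        name: str
        arch: str
        sm_version: str
        sm_count: int
        l2_cache_mb: float
        bandwidth_gbps: float
        fp32_tflops: float
        max_warps_per_sm: int
        max_threads_per_sm: int
        smem_per_sm_kb: int
        features: list[str] = field(default_factory=list)

    A100_80GB = GpuSpec("A100 (80GB)", "Ampere", "sm_80", sm_count=108,
                        l2_cache_mb=40, bandwidth_gbps=1550, fp32_tflops=19.5,
                        max_warps_per_sm=64, max_threads_per_sm=2048,
                        smem_per_sm_kb=164, features=["async_copy", "tf32"])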
Proving it works
Building the emulator was hard. Proving it's accurate was harder. We needed real data to compare against.
Validation Methodology

Step 1: Build a test kernel suite
- Matrix multiplication (varying tile sizes)
- Reductions (varying patterns)
- Stencil operations
- Memory-bound kernels
- Compute-bound kernels
- Mixed workloads
Total: 47 representative kernels

Step 2: Profile on real hardware
- Run each kernel on 12 different GPUs
- Capture NCU metrics: execution time, SM efficiency, etc.
- Record actual hardware measurements

Step 3: Run emulator predictions
- Emulate each kernel on all 50+ GPUs
- Compare predicted vs. actual for the 12 we have

Step 4: Calculate error rates
- Mean Absolute Percentage Error (MAPE)
- Per-kernel accuracy breakdown
- Identify systematic biases
We wrote 47 test kernels. Matrix multiply, reductions, convolutions, all the common patterns. Then we profiled each one on 12 real GPUs (borrowed some, rented others, bought a few).
Then ran the emulator on all of them. Compared predictions to reality.
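For reference, the MAPE metric from Step 4 is just the average absolute error between predicted and measured times, expressed as a percentage. A tiny sketch (the example numbers are made-up placeholders, not our results):

    # Mean Absolute Percentage Error between predicted and measured times.
    def mape(predicted_ms: list[float], measured_ms: list[float]) -> float:
        errors = [abs(p - m) / m for p, m in zip(predicted_ms, measured_ms)]
        return 100.0 * sum(errors) / len(errors)

    # Placeholder values, just to show the shape of the comparison
    print(round(mape([2.41, 5.08, 1.19], [2.40, 5.10, 1.20]), 2))   # 0.55 (%)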
Accuracy Results (NeuSight Tier 1)

Execution time prediction:
- Mean error: 1.2%
- 95th percentile error: 4.8%
- Worst case: 8.2% (edge case: tiny kernel dominated by overhead)

SM efficiency prediction:
- Mean error: 2.1%
- 95th percentile error: 5.3%

Memory throughput prediction:
- Mean error: 3.4%
- 95th percentile error: 7.1%

Occupancy prediction:
- Mean error: 0.8% (nearly perfect; this is analytical)
- 95th percentile error: 2.1%

Overall: 98-99% accuracy on most kernels.
Results: 98-99% accurate on execution time. Occupancy prediction is basically perfect. SM efficiency within 2-3%.
What you can do with this
This changes how you develop CUDA code:
Test on GPUs you don't own. You have an RTX 4090. Your customer has a V100. Emulate on the V100 first. Find out your block size is wrong. Fix it before they ever see it.
Case Study: Library Maintainer

Problem: maintaining a CUDA library for 20+ GPU models.
Cost before the emulator: $4,500/month in cloud rentals.

With the emulator:
- Test all 20 GPUs in RightNow AI: $0
- Only rent GPUs for final validation: $300/month
- Annual savings: $50,400

Bugs caught:
- Ampere occupancy issue (would've affected 30% of users)
- Pascal memory alignment bug (would've crashed on GTX 10-series cards)
- Turing shared memory bank conflict (20% slowdown)
Optimize for expensive GPUs. H100s cost $30,000. You're not buying one. But emulate it, tune your kernel, and when your customer runs it on their H100 cluster, it already flies.
Catch regressions before commit. Changed your kernel? Emulate across 15 GPUs in 30 seconds. See your change killed Turing performance but helped Ampere. Decide if the tradeoff is worth it.
Developer Workflow Transformation

Before the emulator:
1. Write kernel
2. Test on local GPU (RTX 4090)
3. Deploy to production
4. Users report issues on V100, A100, etc.
5. Rent GPUs to debug
6. Fix and re-deploy
Time: 2-3 days. Cost: $200-500.

After the emulator:
1. Write kernel
2. Test on local GPU (RTX 4090)
3. Emulate on 15 target GPUs (2 minutes)
4. Fix issues found in emulation
5. Deploy with confidence
6. Zero user-reported GPU-specific bugs
Time: 30 minutes. Cost: $0.
The hard parts
Reading your kernel automatically
The emulator needs to understand your code without you explaining it:
- Are your memory accesses coalesced?
- Do your branches diverge?
- What's your arithmetic intensity?
- How do you use shared memory?

We built a pattern matcher. It looks for common CUDA idioms:
Kernel Pattern Detection

Coalesced memory access:
  Pattern: data[blockIdx.x * blockDim.x + threadIdx.x]
  → high coalescing factor: 0.9

Strided access:
  Pattern: data[threadIdx.y * stride + threadIdx.x]
  → medium coalescing: 0.4

Divergent branching:
  Pattern: if (threadIdx.x < threshold)
  → divergence probability: 0.5

Shared memory usage:
  Pattern: __shared__ float tile[TILE_SIZE][TILE_SIZE]
  → shared memory optimization detected
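A toy version of that matcher, reduced to regex heuristics over the kernel source. The real analyzer presumably works on something sturdier than regexes; the scores mirror the table above, and everything else here (names, patterns, example kernel) is illustrative.

    # Toy source-level pattern matcher: regex heuristics that map common CUDA
    # idioms to the scores listed above. Purely illustrative.
    import re

    def analyze_kernel_source(src: str) -> dict:
        hints = {"coalescing": 0.4, "divergence": 0.0, "uses_shared_mem": False}

        # Contiguous per-thread indexing -> well-coalesced global accesses
        if re.search(r"blockIdx\.x\s*\*\s*blockDim\.x\s*\+\s*threadIdx\.x", src):
            hints["coalescing"] = 0.9

        # Branching on threadIdx -> likely warp divergence
        if re.search(r"if\s*\(\s*threadIdx\.", src):
            hints["divergence"] = 0.5

        # Shared-memory tiling detected
        if "__shared__" in src:
            hints["uses_shared_mem"] = True

        return hints

    kernel_src = """
    __global__ void scale(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }
    """
    print(analyze_kernel_source(kernel_src))
    # {'coalescing': 0.9, 'divergence': 0.0, 'uses_shared_mem': False}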
Different architectures behave differently
Hopper has thread block clusters. Ampere has async memory copy. Volta has independent thread scheduling. Each needs its own model.
We use correction factors. Hopper kernels with shared memory get a 15% speedup in the simulation because Hopper's shared memory is actually faster.
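Sketched out, that correction is just a per-architecture multiplier applied after the base tile model. The 15% figure is the one quoted above; the table and function are illustrative.

    # Apply an architecture-specific correction after the base prediction.
    # Only the Hopper shared-memory speedup quoted above is filled in.
    SHARED_MEM_SPEEDUP = {"hopper": 1.15}   # other architectures default to 1.0

    def apply_arch_corrections(predicted_ms: float, arch: str,
                               uses_shared_mem: bool) -> float:
        if uses_shared_mem:
            predicted_ms /= SHARED_MEM_SPEEDUP.get(arch, 1.0)
        return predicted_ms

    print(round(apply_arch_corrections(2.30, "hopper", uses_shared_mem=True), 2))  # 2.0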
Wave scheduling is tricky
GPUs run blocks in waves. 100 blocks, but only 80 fit? That's 2 waves. The second wave is smaller, so SMs sit idle. The emulator has to account for this waste.
Wave Scheduling Example

GPU: RTX 4090 (128 SMs)
Kernel: 512 blocks, 256 threads/block
Active blocks/SM: 3

Total concurrent blocks: 128 SMs × 3 = 384 blocks
Total blocks needed: 512

Wave 1: 384 blocks (100% utilization)
Wave 2: 128 blocks (33% utilization - imbalance!)

Execution time:
  = wave1_time + wave2_time × (128/384)
  = 2.1ms + 2.1ms × 0.33
  = 2.79ms

The emulator must account for this imbalance penalty.
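In code, the wave accounting above boils down to something like this (the function name is mine; the proportional charge for the partial wave follows the worked example, which rounds 128/384 to 0.33):

    # Charge full waves at the full wave time and the partial last wave
    # proportionally, as in the RTX 4090 example above.
    def wave_adjusted_time_ms(total_blocks: int, sm_count: int,
                              blocks_per_sm: int, wave_time_ms: float) -> float:
        concurrent = sm_count * blocks_per_sm        # blocks per full wave
        full_waves = total_blocks // concurrent
        tail_blocks = total_blocks % concurrent      # partially filled wave

        time = full_waves * wave_time_ms
        if tail_blocks:
            time += wave_time_ms * (tail_blocks / concurrent)
        return time

    print(round(wave_adjusted_time_ms(512, 128, 3, 2.1), 2))   # 2.8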
What it looks like in the editor
Write a CUDA kernel. Click a button. Pick your GPUs. Get results:
RightNow AI: GPU Emulation Interface

    __global__ void myKernel(float* data, int n) {
        // ... kernel code ...
    }

    [Emulate Kernel ▼]   Select GPUs: [All] [Hopper] [Ampere]

Results (sorted by performance):
- H100 (80GB)    1.2ms   🟢 92% efficiency
- RTX 4090       1.8ms   🟢 87% efficiency
- A100 (80GB)    2.1ms   🟢 84% efficiency
- V100 (32GB)    3.4ms   🟡 68% efficiency
- RTX 3080       4.2ms   🟡 61% efficiency
- GTX 1060      12.8ms   🔴 34% efficiency ← FIX!

Click any GPU for detailed metrics and recommendations.
See the whole landscape instantly. GTX 1060 is red? Click it. Probably low occupancy. Bump your block size. Re-emulate. Green.
What doesnβt work yet
It's not perfect:
Dynamic parallelism. Kernels launching kernels. Haven't figured out how to trace the whole call graph yet.
Multi-GPU. Only does single-GPU kernels right now. No NCCL, no peer-to-peer transfers.
Tensor cores. We model them as fast FP16, but not perfectly. Hopper and Ada have tons of tensor tricks we don't capture.
Tiny kernels. Under 1 microsecond, overhead dominates. Accuracy drops to 85-90%.
Working on all of these. Multi-GPU is next.
Try it
It's built into RightNow AI. Write a kernel, click emulate, see results for 50+ GPUs.
What You Get

✓ Emulate 50+ GPU models without owning them
✓ 98-99% accuracy on execution time
✓ Results in 100-500ms per GPU
✓ Detailed metrics and recommendations
✓ Architecture database (Hopper → Pascal)
✓ Compare performance across generations
✓ Catch bugs before deployment
✓ Save $1,000s in cloud costs

Works on Windows & Linux (x64 & ARM64). Free for personal use.
Zero setup. Runs on CPU (no GPU needed). Test anywhere.
Why it matters
Five years ago, everyone had a GTX 1080. Now? Someone has a 1060, someone has a 4090, cloud runs H100s. Your code needs to work on everything.
This tool makes that possible. Without spending $90,000 a year renting GPUs you'll never own.
Jaber