learning gpu programming basics till flash attention probably as first milestone

24 Mar, 2025 *

log 31 march (i fell sick last 3-4 days. continuing from yesterday.)

next_power_of_two - use this model

stride - how many steps to skip for program. can do this instead of using contigous and flattening.

mask - is useful to avoid reeading 1023 extra tensors lets say we end up with one extra block due to ceiling division so we avoid reading and writing those.

vector addition

Vector addition _ Triton GPU Kernels 101 Lesson #4 21-25 screenshot

When you launch

n_elements = output.numel()
grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']))

How many different kernels are gonna get launched i…

24 Mar, 2025 *

log 31 march (i fell sick last 3-4 days. continuing from yesterday.)

next_power_of_two - use this model

stride - how many steps to skip for program. can do this instead of using contigous and flattening.

mask - is useful to avoid reeading 1023 extra tensors lets say we end up with one extra block due to ceiling division so we avoid reading and writing those.

vector addition

Vector addition _ Triton GPU Kernels 101 Lesson #4 21-25 screenshot

When you launch

n_elements = output.numel()
grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']))

How many different kernels are gonna get launched in parallel

images from the video - gpu compute basics

GPU Architecture Basics _ Triton GPU Kernels 101 Lesson #2 2-47 screenshot

dram ~ vram hbm = high bandwidth memory

calculations happen on sram. point to note -> minimize transfers between sram and dram

DRAM is global SRAM is not - you have pools of SM core = smallest unit of compute -> don’t even touch in triton warp = 32 threads -> can access in triton

i am currently on a small employment break so trying to pick up some things. basic gpu programming / low level memory stuff is something where i always feel bad when my friends talk about it. i am still intimidated by not knowing how flash attention works. so i would essentially ramp up my to that.

the real competition starts from 1st wee

GPU mode server folks are hodling practice mode on their discord. I started learning some triton + gpu basic programming / compute structure to compete for this.

i initially started writing my first kernel totally with help of claude + o1. the discord bot gives error traces so i was able to provide feedback back to the editor.

>be me >doesn’t know shit about gpu kernels >gpu mode offering free gpu kernel benchmarking >vibe codes to 19th position on 2nd problem on A100 category >2 hours back and forth in docs >stuck on 5th on a100 and 19th on h100 pic.twitter.com/7bbkTfkAYZ

— sankalp (@dejavucoder) March 22, 2025

after sometime, i started hitting up roadblocks (5th position on A100 leaderboard) so i started to read up the docs and a couple of blogs i know. i asked gpt4o/ o1 / claude sonnet 3.7 to explain the code from first principles. what could be wrong. what’s the bottleneck etc.

openai models seem to be better at triton than sonnet...

i followed triton official documentation

went through a few lectures by cuda mode channel

Ch 1-3 PPMP Book Basics

Practitioner’s Guide to Triton

https://github.com/gpu-mode/lectures/blob/main/lecture_014/A_Practitioners_Guide_to_Triton.ipynb

Planning to go through category

Compute and Memory Basics

Going to skim through (very detailed)

https://alexzhang13.github.io/blog/2024/efficient-dl/

References

https://openai.com/index/triton/