🟩 CUDA - emulbasaka

💻OS Blog

blog.adafruit.com·

WarpGuard: Protected-Site Control-Flow Integrity for CUDA SASS Binaries

💻OS Academic

arxiv.org·

Less-relevant results

First Steps Toward Automated AI Research

💻OS

recursive.com··Hacker News

RightNow-AI/AutoMegaKernel: An agent harness that compiles a model into one provably-correct, self-retargeting CUDA megakernel and self-tunes it past cuBLAS at batch-1 LLM decode.

💻OS Code

github.com··Hacker News

Exploiting GPU Tensor Cores from Java using Babylon [Juan Fumero]

🎮GPU Architecture

openjdk.org··Lobsters, r/java

Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

💻OS Blog

huggingface.co··Hacker News

Making FlashAttention-4 faster for inference

💻OS Blog

modal.com·

SoC FPGA advances wideband RF processing

🎮GPU Architecture

edn.com·

Vortex expands open RISC-V graphics

🎮GPU Architecture

jonpeddie.com·

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++

⚙MLSys Code

github.com··Hacker News

AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis

💻OS Academic

arxiv.org··Hacker News

Google unveils DiffusionGemma, delivering up to 4x faster inference on dedicated GPUs

💡FlashAttention

alternativeto.net·

Vortex 3.0 Released As Full-Stack, Open-Source RISC-V GPU Now With 3D Pipeline

💻OS

phoronix.com·

NVIDIA at Computex 2026: RTX Spark Gaming Hands-On, DLSS 4.5, and More

💻OS

techpowerup.com·

Huawei-led team claims it post-trained DeepSeek's 1.6-trillion-parameter model — 1,000 Ascend 910C chips used in training

📦TVM News

tomshardware.com··Hacker News

DiffusionGemma is Google’s fastest AI yet, but it comes with a big trade-off

💡FlashAttention

androidauthority.com·

Big Banks Eye New AI Compute Trading Market

📦TVM

pymnts.com·

Google's new open-weights model brings image-generation tricks to AI text generation

⚙MLSys News

theregister.com·

Google’s DiffusionGemma is 4x faster than its other Gemma models

💡FlashAttention

thenewstack.io·

CUDA-Oxide 0.2 Brings Early Improvements To Pure Rust CUDA Kernels

GPUsnek is Python on nVidia’s CUDA
