🧠 CUDA Memory Management - miterion · Scour

frankkk96/FlashQwen: From-scratch C++/CUDA inference engine for Qwen3-8B, with zero external libraries

📊CUDA Graphs Code

Training Cycle Halved: LoongForge End-to-End Optimization for GR00T N1.6 Delivers 2.3× Throughput

📊CUDA Graphs

baidu-baige.github.io··Hacker News

Less-relevant results

RATrain: A Resource-Aware Training Runtime for Large Language Models on Bandwidth-Constrained Heterogeneous Supercomputing Platforms

🌐Distributed Computing Academic

Making FlashAttention-4 faster for inference

🎯Tensor Cores Blog

modal.com··Hacker News

Bring-up and testing of systems with CXL Type 3 memory expanders

⏱️CUDA Events

Linux Kernel 7.1 Released with Rewritten NTFS Support

⚙️Systems Programming Release

massimo92/spark: CLI tool for serving LLMs with vLLM on NVIDIA DGX Spark. One file, zero friction.

🛠Ml-eng Code

github.com··Hacker News

Show HN: Flashback Booth, A tactile retro photo booth in the browser

🖥️Terminal Multiplexers Discussion Tutorial

flashbackbooth.me··Hacker News

The Parallel Revolution: A Comprehensive Guide to GPU Computing

🔥PyTorch Blog

fitservers.com

Mojo Nightly

📈Occupancy Optimization Blog

mojolang.org··Hacker News

Introducing Piper: A Programmable Distributed Training System

🌊CUDA Streams Academic Blog

syfi.cs.washington.edu··Hacker News

Release ensu-v0.1.17 · ente-io/ente

🤖Automation Code

Local models in mid-2026: the engineering that closed the gap

👁️Attention Optimization

coles.codes··Hacker News, r/LocalLLaMA

Can't format my 2TB

vita.hacks.guide··r/VitaPiracy

8th June – Threat Intelligence Report

⚙️Systems Programming

sgl-project/sglang-omni: SGLang Omni: High-Performance Multi-Stage Pipeline Framework for Omni Models

📈Occupancy Optimization Code

github.com··Cited by 1 article

Coupling Complementary Simulations for Combined Performance and Energy Optimization

🌐Distributed Computing Academic

Homebrew, Again

🔄ONNX Blog

jerryz.bearblog.dev

NetX-lab/Frontier: Frontier: A Discrete-Event Simulator for Modern LLM Serving

🔥PyTorch Code

github.com··Hacker News

KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.

👁️Attention Optimization Code

github.com··Hacker News
