⚡ High Performance Computing - ahmedelgabri

Concepts in Practice: C++ MPI Bindings for the HPC Ecosystem. From a Standardizable Core to a Composable Interface

Release TorchCodec 0.14: HDR Video Decoding for CPU & CUDA, and Fast Wav Decoder · meta-pytorch/torchcodec

🤖AI Code

github.com··Hacker News

Introducing Piper: A Programmable Distributed Training System

🤖AI Academic Blog

syfi.cs.washington.edu··Hacker News

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

🤖AI News

newsletter.semianalysis.com··Hacker News

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++

🤖AI Code

github.com··Hacker News

Huawei-led team claims it post-trained DeepSeek's 1.6-trillion-parameter model — 1,000 Ascend 910C chips used in training

🤖AI News

tomshardware.com··Hacker News

From Fork-Join to Asynchronous Tasks: Parallelizing Tiled Cholesky Decomposition with OpenMP and HPX

⚙️Systems Programming Academic

arxiv.org

Less-relevant results

D-Wave Riding The Dual-Rail For Its Gate-Model Quantum Ambitions

🏠Home Lab News

nextplatform.com··Hacker News

Why Compiler Engineers Rarely Use Strassen's Algorithm for Fast Matrix Multiplications

⚙️Systems Programming News Blog

leetarxiv.substack.com··Substack, r/programming

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

🤖AI Code

github.com··r/LocalLLaMA

Unpacking AI: The Hardware Behind AI

🤖AI News

pathtostaff.com··Hacker News

WarpGuard: Protected-Site Control-Flow Integrity for CUDA SASS Binaries

⚙️Systems Programming Academic

arxiv.org

NVIDIA and LG Group Build an AI Factory to Advance Physical AI, Mobility and AI Infrastructure

🤖AI Blog

blogs.nvidia.com··Hacker News

I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3)

🤖AI Code

github.com··r/LocalLLaMA

A system programmer’s guide to LLM inference

🤖AI Blog

blog.xiangpeng.systems··Hacker News

Gerrymandering the Warp: Non-Control-Data Attacks on CUDA Collective Decision

⚙️Systems Programming Academic

arxiv.org

Symbolica 2.0: programmable symbols, JIT evaluators, and type-erased callbacks in Rust

💻Programming

symbolica.io··Lobsters, Hacker News, r/rust

NetX-lab/Frontier: Frontier: A Discrete-Event Simulator for Modern LLM Serving

🤖AI Code

github.com··Hacker News

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?
