High Performance Computing


Concepts in Practice: C++ MPI Bindings for the HPC Ecosystem. From a Standardizable Core to a Composable Interface

 ⚙️Systems Programming  Content type: Academic
arxiv.org

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

 🤖AI
smolhub.com · r/LocalLLaMA

CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?

 🤖AI

Release TorchCodec 0.14: HDR Video Decoding for CPU & CUDA, and Fast Wav Decoder · meta-pytorch/torchcodec

 🤖AI  Content type: Code
github.com · Hacker News

Introducing Piper: A Programmable Distributed Training System

 🤖AI  Content type: Academic  Content type: Blog

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

 🤖AI  Content type: News

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++

 🤖AI  Content type: Code
github.com · Hacker News

Huawei-led team claims it post-trained DeepSeek's 1.6-trillion-parameter model — 1,000 Ascend 910C chips used in training

 🤖AI  Content type: News

From Fork-Join to Asynchronous Tasks: Parallelizing Tiled Cholesky Decomposition with OpenMP and HPX

 ⚙️Systems Programming  Content type: Academic
arxiv.org

Less-relevant results

D-Wave Riding The Dual-Rail For Its Gate-Model Quantum Ambitions

 🏠Home Lab  Content type: News

Why Compiler Engineers Rarely Use Strassen's Algorithm for Fast Matrix Multiplications

 ⚙️Systems Programming  Content type: News  Content type: Blog

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

 🤖AI  Content type: Code
github.com · r/LocalLLaMA

Unpacking AI: The Hardware Behind AI

 🤖AI  Content type: News

WarpGuard: Protected-Site Control-Flow Integrity for CUDA SASS Binaries

 ⚙️Systems Programming  Content type: Academic
arxiv.org

NVIDIA and LG Group Build an AI Factory to Advance Physical AI, Mobility and AI Infrastructure

 🤖AI  Content type: Blog

I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3)

 🤖AI  Content type: Code
github.com · r/LocalLLaMA

A system programmer’s guide to LLM inference

 🤖AI  Content type: Blog

Gerrymandering the Warp: Non-Control-Data Attacks on CUDA Collective Decision

 ⚙️Systems Programming  Content type: Academic
arxiv.org

Symbolica 2.0: programmable symbols, JIT evaluators, and type-erased callbacks in Rust

 💻Programming

NetX-lab/Frontier: Frontier: A Discrete-Event Simulator for Modern LLM Serving

 🤖AI  Content type: Code
github.com · Hacker News
