🎮 GPU Programming - CWhiting · Scour

Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems 🏗️LLM Infrastructure

arxiv.org·6d·Hacker News

Atlas: An LLM inference engine written from scratch in Rust and CUDA 🏗️LLM Infrastructure

atlasinference.io·1d·Hacker News

CUDA Proves Nvidia Is a Software Company 🟩Nvidia

hardware.slashdot.org·2d

The cuda-oxide Book 🎮SIMT Execution

nvlabs.github.io·6d·Lobsters, Hacker News, r/rust

Show HN: I built a small repertoire of different computing systems 🖥️Hardware Architecture

computers.tugdual.fr·1d·Hacker News

Distributed Training in MLOps Break GPU Vendor Lock-In: Distributed MLOps across mixed AMD and NVIDIA Clusters 🟩Nvidia

mlops.community·1d

NVIDIA Releases CUDA-Oxide 0.1 For Experimental Rust-To-CUDA Compiler 🟩Nvidia

Efficient and Portable Support for Overdecomposition on Distributed Memory GPGPU Platforms 🎮SIMT Execution

CUDA Proves Nvidia Is a Software Company 🟩Nvidia

wired.com·3d·Hacker News, r/programming

CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs 🏗️LLM Infrastructure

TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments 🎮GPU Microarchitecture

arxiv.org·2d·Hacker News

CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging 🏗️LLM Infrastructure

Stencil Computations on Cerebras Wafer-Scale Engine ⚡WebGPU Compute

ShardTensor: Domain Parallelism for Scientific Machine Learning 🏗️LLM Infrastructure

Data Path Fusion in GPU for Analytical Query Processing 💎Materialized Views

EULER-ADAS: Energy-Efficient & SIMD-Unified Logarithmic-Posit Engine for Precision-Reconfigurable Approximate ADAS Acceleration 🖥️Hardware Architecture

CCD-Level and Load-Aware Thread Orchestration for In-Memory Vector ANNS on Multi-Core CPUs ⚡WebGPU Compute

DICE: Enabling Efficient General-Purpose SIMT Execution with Statically Scheduled Coarse-Grained Reconfigurable Arrays 🎮SIMT Execution

Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP 🏗️LLM Infrastructure

TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines 🎯Emulation Accuracy
