⚡ Hardware Acceleration - jhcha.oyo · Scour

Recent LLVM hash table improvements

🏗️LLVM Blog

maskray.me · Hacker News, r/cpp

🥇Top AI Papers of the Week

🎮Reinforcement Learning News

nlp.elvissaravia.com

Symbolica 2.0: programmable symbols, JIT evaluators, and type-erased callbacks in Rust

🖥️GPU Programming

symbolica.io · Lobsters, Hacker News, r/rust

From Human Guidance to Autonomy: Agent Skill System for End-to-End LLM Deployment on Spatial NPUs

🐧Open Source Academic

Less-relevant results

Capabilities using Plain Traits

🎯AI Agents Blog

nadrieril.github.io

maziyarpanahi/openmed: open-source healthcare AI

🤖AI Code

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

🖥️GPU Programming Academic

AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis

🤖AI Academic

arxiv.org · Hacker News

Advanced Vector Extensions 512 Acceleration of LSH and LEA-GCM

🔐Cryptography

eprint.iacr.org

CFRNet: Cycle-Consistent Fixed-Point Training for Real-Time Blind Face Restoration on Consumer Embedded NPUs

👁️Computer Vision Academic

Rayforce

⚙️Algorithms Code

github.com · Lobsters, Hacker News

Density Field State Space Models: 1-Bit Distillation, Efficient Inference, and Knowledge Organization in Mamba-2

🖥️GPU Programming Academic

Modeling, Optimizing and Exploring Multi-Die FPGA Routing Architectures

💾Computer Architecture Academic

Communication Strategy Selection for Multi-GPU 3D FDTD with Convolutional Perfectly Matched Boundary Layers

🖥️GPU Programming Academic

MailoHLS: Multi-Adapter Structure-Aware Learning for Pareto-Driven HLS Pragma Optimization

🎯Fine-Tuning Academic

Does anyone know what PCIe mode was used for these benchmarks?

💬LLMs Code

github.com · r/LocalLLaMA

Coset Ensemble Decoder for Quantum Error Correction with Algorithm-Hardware Co-Design

⚛️Quantum Computing Academic

GoodQ02/goodq4all: Local-first multimodal epistemic memory for scene-level video, audio, and text intelligence.

🔍Information Retrieval Code

github.com · Hacker News

LLM-Based Porting of Optimized C++ to CUDA Through Deoptimization and Reoptimization

🖥️GPU Programming Academic

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

💬LLMs Code

github.com · r/LocalLLaMA
