👁️ Attention Optimization - miterion · Scour

Benchmarking llama.cpp's brand-new MTP support on Strix Halo 🔧PTX

calebcoffie.com·2d·Hacker News

Luce DFlash + PFlash on 7900XTX: Qwen3.6-27B at 2.24x decode and 3.05x prefill vs llama.cpp HIP ⏱️Benchmarking

lucebox.com·3d·r/LocalLLaMA

Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention 📊Gradient Accumulation

Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism ⏱️CUDA Events

mlsys.wuklab.io·2d·Hacker News

froggeric/Qwen3.6-27B-MTP-GGUF 📊Profiling Tools

huggingface.co·3d·DEV

Starchild-1: The First Real-Time Multimodal World Model 🏎️TensorRT

odyssey.ml·2d·Hacker News

DeepSeek V4 Flash: Bringing Frontier AI to the Home 🔍Nsight

blog.jonathanpage.com·2d·Hacker News

https://www.together.ai/blog/coding-agent-benchmarks ⚡Flash Attention

together.ai·6d

If you have the budget, this £2,649 Cyrus 40 ST music streamer is a must-buy ⚡Flash Attention

·1d

KV Cache Is Becoming the Memory Hierarchy of Inference 🧠CPU Architecture

touchdown-labs.com·3d

NVlabs/LongLive: Infra for Long Video Generation 🏎️TensorRT

HF downloader utility tampermonkey 🏎️TensorRT

greasyfork.org·3d·r/LocalLLaMA

QClaw: A Fully Local Agentic Assistant on the Arduino Uno Q 📜TorchScript

hackster.io·1d

Storage for the AI Factory Era: A Discussion ⚙️Systems Programming

servethehome.com·6d

How I Shipped an Autonomous Agentic System on a 2026 Serverless-GPU Stack 🔧PTX

·2d

The Inference Bottleneck: Architecting Kubernetes Autoscaling for Production LLMs 🚀MLOps

cloudnativenow.com·6d

The Developer’s Guide to OpenCode on Google Cloud 🤖AI Coding Tools

·2d

How Do I Run AI Workloads on Kubernetes Without Wasting GPUs? 🚀MLOps

fairwinds.com·22h

Runtime-Certified Bounded-Error Quantized Attention 🧩Attention Kernels

Olympiad Gold Medal Packaged into a Two-Step Recipe 📜TorchScript

ai-brief.liziran.com·4d
