⚡ FlashAttention - nayyara.airlangga · Scour

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

💰Inference Cost News Blog

blog.google··Hacker News

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

🧠Inference Engineering Academic

google/gemma-4-12B-it-qat-q4_0-gguf

🧠Inference Engineering

huggingface.co·

harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.

💾KV Cache Code

github.com··Hacker News

OpenCV 5.0 Computer Vision Library Released with Rewritten DNN Engine

🧠Inference Engineering

Efficient and Training-Free Single-Image Diffusion Models

⚗️Kernel Fusion

haojunqiu.github.io··Hacker News

How the UK Is Turning Sovereign AI Ambition Into Action With NVIDIA Technologies

🧠Inference Engineering Blog

blogs.nvidia.com·

Express Language Modeling

🧠Inference Engineering Academic

Issue #390 - The ML Engineer 🤖

💰Inference Cost News Blog

machinelearning.substack.com··Substack

OpenCV 5 release - New DNN engine with enhanced ONNX and LLM/VLM support, Intel, Arm, and RISC-V hardware optimizations - CNX Software

🧠Inference Engineering News

cnx-software.com·

princezuda/-RequiemGPT-: Fully open source and open weights built and trained by fable five with one prompt. An experience in how AI actually works

🧠Inference Engineering Code

github.com··Hacker News

End-to-End Context Compression at Scale

🧠Inference Engineering Academic

See, Act, Correct: three levers for working with a code agent

🧠Inference Engineering Blog

blog.owulveryck.info··Hacker News

Where to Host Your Open-Source Model (Under 10B Parameters)

🧠Inference Engineering

digitalocean.com·

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

🧠Inference Engineering Academic

arxiv.org··Hacker News

DeepSeek V4, LeCun's Bet Against LLMs, and Lovable's Self-Improving Agent - The Tokenizer Edition #30

🔢FP8 Training

newsletter.artofsaience.com·

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

🧠Inference Engineering Code

github.com··r/LocalLLaMA

Full Context on a Vulkan-Only Strix Halo: The Decode-Drop Reproduces, but the Sweet Spot Moves

⏱️Prefill Decoding

thefrontierlab.ai··Hacker News

Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels

⚗️Kernel Fusion Academic

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++

💾KV Cache Code

github.com··Hacker News
