🖥️ Inference Engineering - fungtion

Less-relevant results

Introducing Piper: A Programmable Distributed Training System

💰Compute Costs Academic Blog

syfi.cs.washington.edu··Hacker News

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

🗄️KV Cache News

newsletter.semianalysis.com

··Hacker News

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

🗄️KV Cache

local-llm.utop.workers.dev··Hacker News

Google’s DiffusionGemma is 4x faster than its other Gemma models

🗄️KV Cache

thenewstack.io·

Anatomy of a high-performance EP kernel

🗄️KV Cache Blog

fergusfinn.com··Hacker News

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

🗄️KV Cache

huggingface.co··r/LocalLLaMA

KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.

🗄️KV Cache Code

github.com··Hacker News

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

🗄️KV Cache Blog

dnhkng.github.io·

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

🗄️KV Cache News Blog

blog.google··Hacker News

WEKA software speeds long context AI inferencing on Oracle’s public cloud

🗄️KV Cache News

blocksandfiles.com·

Google open-sources speedy DiffusionGemma text diffusion model

🔍RAG

siliconangle.com·

Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax

💰Compute Costs Blog

lucebox.com··Hacker News

The Inference Alpha: Maximizing Frontier Models on AMD

🗄️KV Cache Blog

digitalocean.com·

DiffusionGemma is Google’s fastest AI yet, but it comes with a big trade-off

🗄️KV Cache

androidauthority.com·

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

🗄️KV Cache Academic

arxiv.org·

AMD Radeon RX 9070 GRE vs. Radeon RX 9070

💰Compute Costs

club386.com·

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

🎯Fine-tuning News Blog

kaitchup.substack.com··r/LocalLLaMA

Google's latest DiffusionGemma open AI model comes with a 4x speed boost

🗄️KV Cache News

arstechnica.com·

DiffusionGemma: The Developer Guide

Running LLM Inference on Kubernetes: What It Actually Takes

Introducing Piper: A Programmable Distributed Training System

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

Google’s DiffusionGemma is 4x faster than its other Gemma models

Anatomy of a high-performance EP kernel

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

WEKA software speeds long context AI inferencing on Oracle’s public cloud

Google open-sources speedy DiffusionGemma text diffusion model

Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax

The Inference Alpha: Maximizing Frontier Models on AMD

DiffusionGemma is Google’s fastest AI yet, but it comes with a big trade-off

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

AMD Radeon RX 9070 GRE vs. Radeon RX 9070

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

Google's latest DiffusionGemma open AI model comes with a 4x speed boost