⚡ LLM Inference - mgjain

🧠KV Cache Academic

arxiv.org·

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

📦Parquet Blog

ziraph.com··Hacker News

Less-relevant results

🇳🇱 Go/Golang job: Senior Backend Engineer (Go) | Studio AI at Creative Fabrica (Amsterdam, Netherlands)

🕸️Distributed Systems

golangprojects.com·

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

🧠KV Cache

smolhub.com··r/LocalLLaMA

NVIDIA releases Nemotron 3 Ultra, claiming five times the speed and 30 percent lower costs than prior models. The model delivers 300 tokens per second on benchmar...

⚡vLLM

digg.com·

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

⚡vLLM

deemwar-products.github.io··Hacker News

Vadzo Imaging Introduces HDR MIPI CSI-2 Embedded Cameras Recommended for Drone and UAV Applications

🌊Stream Processing News

einpresswire.com·

Nemotron 3 Ultra now available on AI Gateway

⚡vLLM

vercel.com·

Google open-sources speedy DiffusionGemma text diffusion model

⚡vLLM

siliconangle.com·

Mobile AI Compute Engine (MACE) inference framework — Vision SDK

🧠KV Cache Blog

mapbox.com·

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

🧠KV Cache

sleepingrobots.com·

No Token Left Behind: Demystifying Token-in-Token-Out in Miles

🌊Stream Processing Blog

lmsys.org··Hacker News

Google’s DiffusionGemma is 4x faster than its other Gemma models

🌲LSM Trees

thenewstack.io·

Making LLMs faster and more efficient across multiple languages

⚡vLLM

techxplore.com·

Which is faster: Gemini 3.5 Flash or Kimi K2.6 on Cerebras

🌊Stream Processing Blog

cerebras.ai·

146th airhacks tv: Rust, Java 25, AI Agents, BCE, Web Components, zunit, zb

🌊Stream Processing Blog

adambien.blog·

Why I care so much about energy per token

🧠KV Cache Blog

ziraph.com··Hacker News

libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.

🧠KV Cache Code

github.com·

3x Faster Search: Parallel Test-Time Scaling with Instructed-Retriever-1

🧠KV Cache Blog

databricks.com·

MLPerf and the rise of latency-aware LLM benchmarking

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference
