🧠 LLM Inference - linbolin1230 · Scour

All sorts of famous Attention Layers

💬LLMs Blog

harsh-ps-2003.bearblog.dev

Less-relevant results

Deploying NVIDIA Nemotron-3 Ultra 550B, with B200 GPUs, vLLM on Google Kubernetes Engine — Football…

⚡KV Cache Blog

ammettw.medium.com

Google's DiffusionGemma generates 256 tokens in parallel and self-corrects as it goes

venturebeat.com · Covers: DiffusionGemma: 4x Faster Text Generation

Is anyone else not finding the Web UI on latest (b9680) of llama.cpp?

💬LLMs Discussion Code

github.com · r/LocalLLaMA

How Public AI delivers sovereign LLM inference on AWS and Intel

⚡KV Cache Blog

aws.amazon.com · Covers: Hugging Face – Fun chat with your own Artificial Intelligence, vLLM +1 more

How to Setup a Local Coding Agent on macOS

🔧MLOps Blog 3

ikyle.me · Hacker News · Cited by 3 articles · Covers 6 stories

DiffusionGemma: Discrete diffusion in a large language model

idlemachines.co.uk · Hacker News

zai-org/GLM-5.2 is here!

huggingface.co · Hacker News, Hacker News, r/LocalLLaMA · Cited by 9 articles · Covers 7 stories

Friday Five — June 12, 2026

[AINews] Satya on Loopcraft: Building Frontier Ecosystems

💬LLMs News

New comment by Greenpants in "Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?"

💬LLMs Discussion

news.ycombinator.com · Hacker News · Cited by 1 article · Covers: I Improved 15 LLMs at Coding in One Afternoon. Only the Harness Changed.

SwiftCache: Efficient LLM Serving for Multi-turn Conversations with Heterogeneous KV Cache Sharing

⚡KV Cache Academic

Speculative Decoding: How to Get Free Tokens

💬LLMs Blog

Rust port of transformers (1M lines of code)

💬LLMs Code

github.com · Hacker News

Built Uber aggregator that tracks top AI researchers and leaders

brightray.ai · Hacker News

12B Gemma 4 QAT Deployment with NVIDIA L4, Cloud Run, MCP, and Antigravity CLI

⚡KV Cache Blog

Running local models is good now

🤖AI Agents 8

vickiboykis.com · Lobsters, Hacker News, Hacker News · Cited by 8 articles · Covers 9 stories

How to fit Qwen 3.6 35B A3B into 16GB of VRAM, & run it with Llama.cpp on an RTX 3080

🗄️Storage Engines

autodidacts.io · Covers: Can your machine run AI models?

Coordinated Scheduling for MoE LLM Serving

⚡KV Cache Academic

I restarted a 10 year old Xeon 174 times to delete twelve flags and gain four tokens a second

🗄️Storage Engines Blog

point.free · Hacker News
