⚡ KV Cache - linbolin1230 · Scour

GLM-5.2: Z.ai Ships 1M-Token Coding Model With Zero Benchmarks

💻Software Engineering Blog

wowhow.cloud··DEV·Covers: DEV Community

12B Gemma 4 QAT Deployment with NVIDIA L4, Cloud Run, MCP, and Antigravity CLI

🧠LLM Inference Blog

Mlx-optiq: per-layer mixed-precision LLM quantization for Apple Silicon

💬LLMs Video Discussion Tutorial

mlx-optiq.com··Hacker News·Cited by 2 articles

PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

🔢Vector DBs Academic

Show HN: Quant Picker – which GGUF file fits your model and machine

vettedconsumer.com··Hacker News

Rebellions bets on memory-centric AI inference

🧠LLM Inference

jonpeddie.com·

zai-org/GLM-5.2 is here!

🧠LLM Inference 9

huggingface.co··Hacker News, r/LocalLLaMA·Cited by 9 articles·Covers 7 stories

Inference cost at scale with napkin math (13 minute read)

🧠LLM Inference Blog

injuly.in··Cited by 1 article·Covers: Fermi Problem

Native Inference Engine for macOS 14 or newer

🧠LLM Inference Code

github.com··Hacker News

Inside the LLM KV Cache: The Hidden System Behind Fast AI Inference

🧠LLM Inference Blog

fardinkai.medium.com·

I gave my gaming PC and phone the same local LLM tasks, and only one of them is still in my daily rotation

🧠LLM Inference

xda-developers.com·

vLLM Transformers Backend: Bridging Hugging Face Compatibility and High-Performance Inference

🧠LLM Inference Blog

odsc.medium.com·

SMEPilot: Characterizing and Optimizing LLM Inference with Scalable Matrix Extensions

🧠LLM Inference Academic

Running Local LLMs With Ollama For Private Development

🧠LLM Inference Tutorial

nazarboyko.com··DEV

Google OpenRL Tames AI Model Tuning, Kubernetes-Style

cloudnativenow.com··Covers: Best place for learning Kubernetes?, sgl-project/sglang +3 more

All sorts of famous Attention Layers

🧠LLM Inference Blog

harsh-ps-2003.bearblog.dev·

Lemonade SDK Adds Nvidia CUDA Support

🧠LLM Inference

i-programmer.info··Covers: Show HN: Lemonade: Run LLMs Locally with GPU and NPU Acceleration

Modular: Day Zero: MiniMax M3 Open Weights on Modular Cloud

🔧MLOps Blog

modular.com··Covers: MiniMax M3: Frontier Coding, 1M Context, Native Multimodality — All in One Model, Coding & Agentic Frontier, 1M Context, Multimodal +1 more

Parallelize speculative decoding with P-EAGLE on Amazon SageMaker AI

🧠LLM Inference Blog

aws.amazon.com·

Please Use My Free Software

🗄️Storage Engines Blog

artlu.bearblog.dev·
