Scour
🧠 LLM Inference · Quantization, Attention Mechanisms, Batch Processing, KV Caching
186562
posts in
60.5
ms
DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 3d

The Inference Economy: Token Use
💰 Tokenomics · frontierai.substack.com · 11h · Substack

Adaptive Thinking: Large Language Models Know When to Think in Latent Space
🏗️ LLM Infrastructure · machinelearning.apple.com · 2d

AmSach/kvquant: Drop-in KV cache compressor for local LLM inference - Run 70B models on 8GB RAM
🏗️ LLM Infrastructure · github.com · 17h · DEV

DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles
🧠 Inference Serving · lmsys.org · 5d · Hacker News

Speculative Decoding vs MoE: 3.2x Cost Gap on Llama 3
📊 Model Serving Economics · tildalice.io · 3d

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity
🏗️ LLM Infrastructure · arxiv.org · 2d · Hacker News

shreyansh26/Speculative-Decoding: Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch
📊 Model Serving Economics · github.com · 4d · r/LLM, r/LocalLLaMA

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 2d

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 1d

Efficient, VRAM-Constrained xLM Inference on Clients
🏗️ LLM Infrastructure · arxiv.org · 1d

QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention
🔬 RaBitQ · arxiv.org · 2d

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 1d

Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
📱 Edge AI Optimization · arxiv.org · 3d

Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study
🏗️ LLM Infrastructure · arxiv.org · 2d

Select to Think: Unlocking SLM Potential with Local Sufficiency
🏗️ LLM Infrastructure · arxiv.org · 1d

Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference
🏗️ LLM Infrastructure · arxiv.org · 3d

Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel
🏗️ LLM Infrastructure · arxiv.org · 1d

Anchored Variational Inference for Personalized Sequential Latent-State Models
🏗️ LLM Infrastructure · arxiv.org · 3d

One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement
🏗️ LLM Infrastructure · arxiv.org · 2d