🤖 LLM Inference - unclamproot · Scour

OpenCV 5 release - New DNN engine with enhanced ONNX and LLM/VLM support, Intel, Arm, and RISC-V hardware optimizations - CNX Software

🤖LLM News

cnx-software.com

Running LLM Inference on Kubernetes: What It Actually Takes

🤖LLM Blog

fairwinds.com

China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)

🤖LLM News

NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

🤖LLM Blog

blogs.nvidia.com

Making LLMs faster and more efficient across multiple languages

techxplore.com

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

deemwar-products.github.io · Hacker News

Token4Token — pay-per-token inference on Gnosis + Swarm

t4t.eth.link · Hacker News

libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.

⚡Vllm Code

Youssof Altoukhi (@Youssofal_)

xcancel.com · r/LocalLLaMA

AI Serving Platform That Adapts to Your Model

🤖LLM Blog

databricks.com

Unsloth Gemma 4 QAT

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

⚡Vllm Academic

WEKA software speeds long context AI inferencing on Oracle’s public cloud

🤖Agents News

blocksandfiles.com

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

sleepingrobots.com

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

🤖LLM Blog

ziraph.com · Hacker News

Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!

Machinic Psychopharmacology: Do LLMs Self-Medicate?

lesswrong.com · Hacker News

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

🤖LLM Blog

Latest technical articles & videos.

certdepot.net

Ollama 0.30 delivers faster NVIDIA GPU performance and wider hardware support

alternativeto.net
