🤖 Inference - kelvinyu1117

🤖AI Blog Discussion

tildalice.io

Renew

🔀Concurrency

flathub.org

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

🧠LLMs Academic

arxiv.org

China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)

☁️Cloud News

decrypt.co

Gemma 4 12B: A unified, encoder-free multimodal model

🧠LLMs Discussion

news.ycombinator.com · Hacker News

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

🎮GPUs Blog

dnhkng.github.io

Where to Host Your Open-Source Model (Under 10B Parameters)

☁️Cloud

digitalocean.com

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

🎮GPUs

sleepingrobots.com

Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!

🔧Hardware

gizchina.com

Youssof Altoukhi (@Youssofal_)

🧠LLMs

xcancel.com · r/LocalLLaMA

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

🧠LLMs Code

github.com · Hacker News

Optimal Post-Training Quantization Scales and Where to Find Them

🧠LLMs Academic

arxiv.org

Ideogram4 GGUF is out!

🔤PLT

huggingface.co · r/StableDiffusion

TGI(SG)F.

🧠LLMs News

theverge.com

How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops

🧠LLMs Video

youtube.com

Google releases Gemma 4 QAT models for local AI on enterprise laptops

🏗️MLSys

4sysops.com

Latest technical articles & videos.

🧠LLMs

certdepot.net

Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization

🧠LLMs Academic

arxiv.org

Token4Token — pay-per-token inference on Gnosis + Swarm

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

Pruned YOLOv8 ONNX INT8 Fails: 3 Fixes That Work
