ML Inference

inference engine, model serving, inference optimization, runtime

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

 🖥️Systems ML  Content type: Blog
medium.com·

AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support

 🎮GPU Programming
phoronix.com··r/artificial

No Token Left Behind: Demystifying Token-in-Token-Out in Miles

 🧠Deep Learning  Content type: Blog
lmsys.org··Hacker News

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

 🔗Distributed Training  Content type: Blog
dnhkng.github.io·

magenta/magenta-realtime: Magenta RealTime 2: An Open-Weights Live Music Model

 🧠Deep Learning  Content type: Code
github.com·

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

 🧠Deep Learning

Vadzo Imaging Introduces HDR MIPI CSI-2 Embedded Cameras Recommended for Drone and UAV Applications

 🔄MLOps  Content type: News
einpresswire.com·

A system programmer’s guide to LLM inference

 🧠Deep Learning  Content type: Blog

DiffusionGemma: The Developer Guide - Google Developers Blog

 🎮GPU Programming  Content type: Blog

Build a Medical Report Analyzer on Dedicated Inference with Python

 🧠Deep Learning
digitalocean.com·

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

 🗜️Quantization  Content type: Academic
arxiv.org·

For Robotaxis, Safety Must Be Built In, Not Bolted On

 🎮GPU Programming  Content type: Blog
blogs.nvidia.com·

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

 🗜️Quantization  Content type: News  Content type: Blog
blog.google··Hacker News

China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)

 🗜️Quantization  Content type: News
decrypt.co··Hacker News

Google's new open model DiffusionGemma generates text from noise instead of word by word

 🧠Deep Learning
the-decoder.com·

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

 🗜️Quantization

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

 🎮GPU Programming  Content type: Academic
arxiv.org·

OpenCV 5 Debuts with Improved ONNX Support and Native AI Upgrades

 🧠Deep Learning  Content type: News
hackster.io·

The Death of the Four Golden Signals: Designing Telemetry for Non-Deterministic Infrastructure

 🔄MLOps
devops.com·

Latest technical articles & videos.

 ⚙️Systems Programming
certdepot.net·
