Inference

LLM inference, model serving, vLLM, TensorRT, latency

TFLite Edge Model Quantizer Snippet

 🤖AI

Making LLMs faster and more efficient across multiple languages

 🧠LLMs
techxplore.com

Machinic Psychopharmacology: Do LLMs Self-Medicate?

 🧠LLMs
lesswrong.com · Hacker News

Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss

 🧠LLMs  Content type: News
digg.com

Re-quantizing a local LLM 14x faster by skipping the tensors that didn't change

 🧠LLMs  Content type: News  Content type: Blog

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

 🤖AI Agents  Content type: Blog
tilert.ai · Hacker News

Anatomy of a high-performance EP kernel

 🧠LLMs  Content type: Blog
fergusfinn.com · Hacker News

Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good

 🧠LLMs  Content type: Blog
towardsai.net

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

 🧠LLMs  Content type: Blog
ziraph.com · Hacker News

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

 🧠LLMs

Token4Token — pay-per-token inference on Gnosis + Swarm

 🧠LLMs

Optimal Post-Training Quantization Scales and Where to Find Them

 🧠LLMs  Content type: Academic
arxiv.org

Massive AI Storage Demand Creates a New Memory Wall

 🧠LLMs  Content type: News
eetimes.com

Making Local LLM Go Brrr

 🧠LLMs

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

 🧠LLMs  Content type: Blog
dnhkng.github.io

On-device AI is a margin decision

 🧠LLMs  Content type: Blog
ziraph.com · Hacker News

Where to Host Your Open-Source Model (Under 10B Parameters)

 📊AI Models
digitalocean.com

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

 ☸️K8S  Content type: Blog
jimmysong.io

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

 🧠LLMs
sleepingrobots.com

Google’s DiffusionGemma is 4x faster than its other Gemma models

 🧠LLMs
thenewstack.io