Inference

LLM inference, vLLM, TensorRT, model serving, inference optimization


harshuljain13/llm-inference-at-scale: A practitioner handbook for production LLM serving.

LLMs | Content type: Code
github.com | Hacker News

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

LLMs | Content type: News

Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%

LLMs
zozo123.github.io | Hacker News

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

AI Engineering | Content type: Academic
arxiv.org

CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?

AI Engineering
uccl-project.github.io | Hacker News

A system programmer's guide to LLM inference

AI Engineering | Content type: Blog

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

Memory Management | Content type: Blog
medium.com

NVIDIA Accelerates Google DeepMind's DiffusionGemma for Local AI

LLMs | Content type: Blog
blogs.nvidia.com

How we fight GPU scarcity without compromise

LLMs | Content type: Blog
equixly.com | Hacker News

Qwen 3.6 27B AutoRound GGUF, need your feedback

LLMs

AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support

LLMs
phoronix.com | r/artificial

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

AI Engineering | Content type: News | Content type: Blog
blog.google | Hacker News

DiffusionGemma: The Developer Guide - Google Developers Blog

LLMs | Content type: Blog

China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)

LLMs | Content type: News
decrypt.co | Hacker News

Making LLMs faster and more efficient across multiple languages

AI Engineering
techxplore.com

Machinic Psychopharmacology: Do LLMs Self-Medicate?

AI Engineering
lesswrong.com | Hacker News

AI Serving Platform That Adapts to Your Model

MLOps | Content type: Blog
databricks.com

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

AI Engineering

Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!

AI Engineering
gizchina.com

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

AI Engineering | Content type: News | Content type: Blog
