Systems-level optimizations for LLM serving

High-end Hitachi Vantara arrays and Nvidia AI support

 🤖Agents using LLMs  Content type: News
blocksandfiles.com·

Qwen 3.6 27B AutoRound GGUF, need your feedback

 ✨Model optimizations in LLMs
huggingface.co··r/LocalLLaMA

High Bandwidth Flash | A New Memory for AI Data Centers and Edge Computing | Sandisk

 📊AI Performance Profiling
ncnonline.net·

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

 🧠Large Language Models (LLMs)
smolhub.com··r/LocalLLaMA

How to Measure Time To First Token (TTFT) in AI Systems

 💬Prompt optimizations for LLM serving
qainsights.com··Hacker News

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

 💬Prompt optimizations for LLM serving  Content type: Academic
arxiv.org·

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

 ✨Model optimizations in LLMs  Content type: News  Content type: Blog
blog.google··Hacker News

Machinic Psychopharmacology: Do LLMs Self-Medicate?

 🚀LLM serving frameworks
lesswrong.com··Hacker News

China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)

 🧠Large Language Models (LLMs)  Content type: News
decrypt.co··Hacker News

Making Local LLM Fast

 🧠Large Language Models (LLMs)
bogdan.nimblex.net··Hacker News

libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.

 📊AI Performance Profiling  Content type: Code
github.com·

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

 ✨Model optimizations in LLMs  Content type: Blog

Claude Fable 5 🚀, Gemini 3.5 Live Translate 📱, scaling test time compute 📈

 🤖Agents using LLMs
tldr.tech·

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

 🧠Large Language Models (LLMs)  Content type: Blog
dnhkng.github.io·

Youssof Altoukhi (@Youssofal_)

 🧠Large Language Models (LLMs)
xcancel.com··r/LocalLLaMA

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

 🚀LLM serving frameworks  Content type: Academic
arxiv.org·

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

 🚀LLM serving frameworks  Content type: Code
github.com··Hacker News

LLM Research Papers: The 2026 List (January to May)

 🧠Large Language Models (LLMs)  Content type: News

Rebellions Bets on Memory-Centric Architecture as it Weighs IPO Options

 ⚡Real-time AI Systems  Content type: News
eetimes.com·

iOS Security SDKs & Audits for Production Teams

 ✨Model optimizations in LLMs  Content type: Discussion
sentinelden.com··Hacker News