Model optimizations in LLMs

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

 📊AI Performance Profiling

Apple WWDC On-Device AI Deep Dive - Google Docs

 🧠Large Language Models (LLMs)
gist.is··Hacker News

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

 🔧Systems-level optimizations for LLM serving  Content type: Blog
dnhkng.github.io·

Two old GPUs I salvaged are doing more AI work than a brand new $2000 card, and I won't be upgrading anytime soon

 🧠Large Language Models (LLMs)
xda-developers.com·

Create Your Own Programming Language with Rust

 🧠Large Language Models (LLMs)
createlang.rs··Hacker News

NetX-lab/Frontier: Frontier: A Discrete-Event Simulator for Modern LLM Serving

 🔧Systems-level optimizations for LLM serving  Content type: Code
github.com··Hacker News

HNSW vs LSH: How Elasticsearch hits 0.99 recall@10 at 15,000 QPS — and what it costs

 🔍Retrieval-augmented generation  Content type: Blog
elastic.co·

SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving

 💬Prompt optimizations for LLM serving  Content type: Academic
arxiv.org·

Alduin 4B, an uncensored Vision LLM just released.

 🚀LLM serving frameworks

TurboQuant in PostgreSQL

 🔍Retrieval-augmented generation  Content type: Blog
blog.mayflower.de·

Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss

 🚀LLM serving frameworks  Content type: News
digg.com·

[AINews] FrontierCode: Benchmarking for Code Quality over Slop

 🧠Large Language Models (LLMs)  Content type: News
latent.space·

What's in the Box? A Field Guide to AI Models

 🧠Large Language Models (LLMs)  Content type: Blog
iankduncan.com·

Google’s DiffusionGemma is 4x faster than its other Gemma models

 🧠Large Language Models (LLMs)
thenewstack.io·

A system programmer’s guide to LLM inference

 🔧Systems-level optimizations for LLM serving  Content type: Blog

Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!

 🔧Systems-level optimizations for LLM serving
gizchina.com·

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

 🚀LLM serving frameworks  Content type: Blog
ziraph.com··Hacker News

Complexifying the Complex

 🤖Agents using LLMs  Content type: Academic
math.columbia.edu·

How One MSAI Student Built an AI Tool to Predict Supply Chain Disruptions

 🔢Quantization of LLMs  Content type: Academic
cs.utexas.edu·

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

 🧠Large Language Models (LLMs)  Content type: News  Content type: Blog
developer.nvidia.com·
