LLM Inference

On-device AI is a margin decision

 🧠Local llm  Content type: Blog
ziraph.com··Hacker News

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++

 🤖Machine Learning  Content type: Code
github.com··Hacker News

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

 🤖Qwen
smolhub.com··r/LocalLLaMA

Google’s DiffusionGemma is 4x faster than its other Gemma models

 LLM Quantization
thenewstack.io

LLM Research Papers: The 2026 List (January to May)

 🧠Local llm  Content type: News

defai-digital/ax-engine: Apple Silicon LLM runtime supporting Gemma 4 and Qwen 3.6 MTP modes

 🤖Qwen  Content type: Code
github.com··Hacker News

DiffusionGemma: 4x Faster Text Generation

 LLM Quantization  Content type: News  Content type: Blog

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

 🦀Rust

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

 🧠Local llm

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

 LLM Quantization  Content type: Academic
arxiv.org··Hacker News

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

 LLM Quantization  Content type: Code
github.com··Hacker News

Magenta RealTime 2: Open and Local Live Music Models

 LLM Quantization

Here's a llama.cpp CLI Command builder.

 🧠Local llm

heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.

 🤖Qwen  Content type: Code
github.com··r/LocalLLaMA

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

 LLM Quantization  Content type: Blog

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

 🧠Local llm  Content type: Blog
ziraph.com··Hacker News

Anatomy of a high-performance EP kernel

 👁️Observability  Content type: Blog

Introducing Granite Libraries and Project Granite Switch

 🔌Model Context Protocol  Content type: Blog

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

 🧠Local llm  Content type: Code
github.com··r/LocalLLaMA

How to Measure Time To First Token (TTFT) in AI Systems

 🧠Local llm
