Prefill Decoding

3x Faster Search: Parallel Test-Time Scaling with Instructed-Retriever-1

 💰Inference Cost  Content type: Blog
databricks.com·

LLM Observability: What To Instrument and How To Act on It

 🔭Observability  Content type: Blog
blog.n8n.io·

Apple rebuilt its on-device AI stack at WWDC 2026

 🔢GEMM Optimization  Content type: Blog
ziraph.com··Hacker News

Breaking the Ice: Analyzing Cold Start Latency in vLLM

 💾KV Cache  Content type: Academic
arxiv.org··Hacker News

heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.

 💾KV Cache  Content type: Code
github.com··r/LocalLLaMA

Token4Token — pay-per-token inference on Gnosis + Swarm

 🧠Inference Engineering

Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax

 🧠Inference Engineering  Content type: Blog
lucebox.com··Hacker News

"North Mini Code"; open weights, 30B param, Canadian coding model

 🎮GPU Computing  Content type: Blog
cohere.com··Hacker News

How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops

 🧠Inference Engineering  Content type: Video
youtube.com·

Machinic Psychopharmacology: Do LLMs Self-Medicate?

 💾KV Cache
lesswrong.com··Hacker News

Youssof Altoukhi (@Youssofal_)

 🧠Inference Engineering
xcancel.com··r/LocalLLaMA

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

 🧠Inference Engineering  Content type: Academic
arxiv.org·

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

 💾KV Cache  Content type: Code
github.com··Hacker News

Architecting the Control Plane for Intelligence: System Design of an Enterprise AI Gateway

 ☁️Cloud Infrastructure  Content type: Blog
medium.com·

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

 🎮GPU Computing  Content type: Blog
dnhkng.github.io·

Build a local voice agent with Red Hat OpenShift AI

 🎮GPU Computing
developers.redhat.com·

The Memory Problem is Solved: How Google’s Memory Caching Makes RNNs Smart Again

 FlashAttention  Content type: Blog
medium.com·

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

 🧠Inference Engineering  Content type: Academic
arxiv.org·

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

 🧠Inference Engineering  Content type: Code
github.com··r/LocalLLaMA

Benchmarking dots.tts on Strix Halo

 🎮GPU Computing
sleepingrobots.com·