Model Serving

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

 🖨️3D Printing  Content type: Blog
dnhkng.github.io·

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

 💾CPU Architecture  Content type: Code
github.com··Hacker News

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

 🧠LLM Internals  Content type: Academic
arxiv.org·

Show HN: Taliesin – bit-exact KV-cache restore, 21x faster, cross-GPU verified

 💾CPU Architecture  Content type: Blog
medium.com··Hacker News

OpenEnv is now owned by HF, Torch, Prime Intellect, Unsloth, Modal, Mercor, and more! Use it for training agents.

 🧠LLM Internals  Content type: Blog

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

 🧠LLM Internals  Content type: Academic
arxiv.org·

Build a local voice agent with Red Hat OpenShift AI

 🧠LLM Internals
developers.redhat.com·

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++

 🖥️Systems Programming  Content type: Code
github.com··Hacker News

The economics of speculative decoding

 🧠LLM Internals  Content type: Blog

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

 🧠LLM Internals  Content type: News  Content type: Blog
blog.google··Hacker News

Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good

 💾CPU Architecture  Content type: Blog
towardsai.net·

heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.

 🖥️Systems Programming  Content type: Code
github.com··r/LocalLLaMA

End-to-End Context Compression at Scale

 🧠LLM Internals  Content type: Academic
arxiv.org·

Report: GKE Inference Gateway delivers up to 92% faster AI responses

 🏗️System Design  Content type: Blog

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

 🧠LLM Internals

zhongkaifu/TensorSharp: A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama/OpenAI-compatible HTTP APIs for programmatic access. It supports Windows/MacOS/Linux with full GPU capability

 🖥️Systems Programming  Content type: Code
github.com··Hacker News

Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends

 🧠LLM Internals  Content type: Academic
arxiv.org·

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

 🖥️Systems Programming  Content type: Blog
tilert.ai··Hacker News

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

 🧠LLM Internals  Content type: Academic
arxiv.org·

fix(gateway): fail closed for unknown model auth · openclaw/openclaw@85343ea

 🦀Rust  Content type: Code
github.com·