🔧 Model Serving - meghanipankaj5scour · Scour

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

🖨️3D Printing Blog

dnhkng.github.io

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

💾CPU Architecture Code

github.com · Hacker News

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

🧠LLM Internals Academic

Show HN: Taliesin – bit-exact KV-cache restore, 21x faster, cross-GPU verified

💾CPU Architecture Blog

··Hacker News

OpenEnv is now owned by HF, Torch, Prime Intellect, Unsloth, Modal, Mercor, and more! Use it for training agents.

🧠LLM Internals Blog

huggingface.co · Hacker News, r/LocalLLaMA

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

🧠LLM Internals Academic

Build a local voice agent with Red Hat OpenShift AI

🧠LLM Internals

developers.redhat.com

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++

🖥️Systems Programming Code

github.com · Hacker News

The economics of speculative decoding

🧠LLM Internals Blog

fergusfinn.com · Hacker News

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

🧠LLM Internals News Blog

blog.google · Hacker News

Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good

💾CPU Architecture Blog

towardsai.net

heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.

🖥️Systems Programming Code

github.com · r/LocalLLaMA

End-to-End Context Compression at Scale

🧠LLM Internals Academic

Report: GKE Inference Gateway delivers up to 92% faster AI responses

🏗️System Design Blog

cloud.google.com · Hacker News

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

🧠LLM Internals

vettedconsumer.com · Hacker News

zhongkaifu/TensorSharp: A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama/OpenAI-compatible HTTP APIs for programmatic access. It supports Windows/MacOS/Linux with full GPU capability

🖥️Systems Programming Code

github.com · Hacker News

Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends

🧠LLM Internals Academic

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

🖥️Systems Programming Blog

tilert.ai · Hacker News

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

🧠LLM Internals Academic

fix(gateway): fail closed for unknown model auth · openclaw/openclaw@85343ea

🦀Rust Code
