💬 Prompt optimizations for LLM serving

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

🧠Large Language Models (LLMs) Academic

arxiv.org

aussiealex/agentmeter: Know what your agents cost. Cost intelligence for AI coding agents.

🤖Agents using LLMs Code

github.com · Hacker News

Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving

🔧Systems-level optimizations for LLM serving Academic

arxiv.org · Hacker News

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

🔧Systems-level optimizations for LLM serving Code

github.com · Hacker News

OpenPCC: Open and Confidential LLM Serving on Commodity TEEs

🤖Agents using LLMs Academic

arxiv.org

TjWheeler/deep-memory: A GraphRAG implementation with a Vocabulary system to optimise AI integration

🤖Agents using LLMs Code

github.com · Hacker News

Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production

🤖Agents using LLMs Academic

arxiv.org

Week 1 of building Quantamind: Ditching Electron for Rust & Tauri 🦀

🚀LLM serving frameworks Code

github.com · DEV

How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models

🧠Large Language Models (LLMs) Academic

arxiv.org

hansstam86/wibeos

🚀LLM serving frameworks Code

github.com · Hacker News

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

🤖Agents using LLMs Academic

arxiv.org

kenn-io/agentsview: Local-first session intelligence and analytics for coding agents, supporting Claude Code, Codex, and more than 20 other agents. Also: 100x faster replacement for ccusage!

🤖Agents using LLMs Code

github.com

AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

🔧Systems-level optimizations for LLM serving Academic

arxiv.org

tigerless-labs/cost-xray: See what Claude Code and Codex actually send to the API — and what each part costs.

🧠Large Language Models (LLMs) Code

github.com · Hacker News

Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder

⚡Real-time AI Systems Academic

arxiv.org

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

Enabling KV Caching of Shared Prefix for Diffusion Language Models

Show HN: Kikubot – Each AI agent is an inbox

Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention
