💾 Prompt Caching - emschwartz · Scour

🔓Open Source AI Anyscale blog posts·

High Performance Distributed Inference with Ray Serve LLM

Covered by Google Cloud Blog

Discussed on Hacker News

🔓Open Source AI fitservers.com·

The Complete Guide to Deploying DeepSeek R1 on a Dedicated Server

⚡Fast AI Inference arxiv.org·

CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

🔓Open Source AI GitHub·

Sors: a Rust proxy that reorders prompts to maximize vLLM prefix cache hits

Discussed on Hacker News

🏗️LLM Infrastructure medium.com·

vLLM, Function Calling, and World Models explained

🔄LLM RAG Pipelines pyimagesearch.com·

RAG Observability with Langfuse, vLLM, and FAISS

🔌Claude Plugins code.claude.com·

How Claude Code uses prompt caching

Covers How I built a three-tier content quality ladder for programmatic directory ETL

Covered by DEV Community

🧠Memory Management thecomputersciencebook.com·

PagedAttention is more than virtual memory

Covers Efficient Memory Management for Large Language Model Serving with PagedAttention

Discussed on Hacker News

🏗️LLM Infrastructure GitHub·

Profile(v2.1.4) physics-aware optimizer for vLLM (31→470 tok/s on A100)

Discussed on Hacker News

🏗️LLM Infrastructure vettedconsumer.com·

The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)

Covers 2 stories including Efficient Memory Management for Large Language Model Serving with PagedAttention

Discussed on Hacker News

🧠LLM Inference arxiv.org·

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

🏗️LLM Infrastructure abhishek.it·

Running GLM-5.2 5x faster at 500tps with limitation

Discussed on Hacker News

⚡Fast AI Inference thecybersidekick.beehiiv.com·

AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm

Discussed on DEV

🏗️LLM Infrastructure GitHub·

I got tired of not understanding how vLLM works under the hood, so I built my own mini inference engine from scratch.

Discussed on r/LLM

🔓Open Source AI medium.com·

Deploying NVIDIA Nemotron-3 Ultra 550B, with B200 GPUs, vLLM on Google Kubernetes Engine — Football…

🏗️LLM Infrastructure Google Cloud Blog·

Scaling Ray Serve LLM on GKE: Performance without losing the developer experience

🔓Open Source AI GitHub·

datalab-to/lift: Extract structured data from documents quickly and accurately.

Covered by habr.com

🏗️LLM Infrastructure Anyscale blog posts·

67% Cost Savings with PD Disaggregation Using Ray and vLLM on AMD MI325X

Discussed on Hacker News

🏗️LLM Infrastructure arxiv.org·

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

🤖AI GitHub·

ahwurm/localharness: Model-agnostic agent harness for local LLMs — configure agents in YAML and run them on your own hardware (vLLM, Ollama, LM Studio, llama.cpp).

Covers uv

Discussed on Hacker News
