⚡ Fast AI Inference - emschwartz · Scour

How the hell is Groq raising more money? 🧬Mythos

zach.be·3d·Hacker News

Free vLLM Course: Inference, Compression, Benchmarks 🧠Inference Serving

deeplearning.ai·2d·Hacker News, r/selfhosted

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag. 🏗️LLM Infrastructure Code

github.com·15h·Hacker News

Fast and Efficient LLM Inference with vLLM: A New Course with Deeplearning.ai 🧠Inference Serving Blog

vllm.ai·2d·Hacker News

NVIDIA releases Nemotron 3 Ultra, claiming five times the speed and 30 percent lower costs than prior models. The model delivers 300 tokens per second on benchmar... 🗄️Web Datasets

Serving vLLM for LLM Inference 🏗️LLM Infrastructure Blog

DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference 🧠Inference Serving Academic

What Actually Happens When You Send a Prompt to Claude A Full Breakdown 🪄Prompt Engineering

pub.towardsai.net·1h

Intel's mysterious new datacenter GPU is what Nvidia's Rubin CPX nearly was 🖥️Hardware Architecture

theregister.com·18h

Making Local LLM Go Brrr 🤖AI

seanpedersen.github.io·1d

Sources: ByteDance has partnered with chipmaker InnoStar to develop an AI inference chip modeled after Groq's LPUs, which are built to run AI models at low cost... 🏗️LLM Infrastructure

·6d

mirkolenz/llmhop: Tiny, stateless Go router that dispatches OpenAI-compatible requests to single-model vLLM and sglang backends with zero external dependencies 🤖AI Code

github.com·6h·Hacker News

Experimenting with TPUs, GKE Managed DRANET, and Multi-cluster Inference Gateway 🌍Distributed Systems Blog

cloud.google.com·2d

Step 3.7 Flash – 198B-A11B MoE vision-language model 🤖AI

huggingface.co·5d·Hacker News

Nemotron 3 Ultra now available on AI Gateway 🪄Prompt Engineering

Introducing Granite Libraries and Project Granite Switch 🏗️LLM Infrastructure Blog

research.ibm.com·18h

Llama.cpp now has an official website: llama.app 🤖AI

llama.app·6d·Hacker News

Qwen3.7 Plus - Intelligence, Performance & Price Analysis 💰Tokenomics

artificialanalysis.ai·1d·Hacker News

Build Personal AI Agents on Windows PCs with New Tools from Microsoft and Nvidia 🤖AI Blog

developer.nvidia.com·2d·Hacker News

Lodestar: An Online-Learning LLM Inference Router 🏗️LLM Infrastructure Academic
