⚡ Fast AI Inference - emschwartz · Scour

Nvidia paid Groq $20 billion and took its top engineers. Now Groq is raising $650 million for what’s left. 🇨🇳Chinese AI

thenextweb.com·5d

Part 3 — Implementation/Engine-Level: Choosing the Runtime That Gives You These for Free 🏗️LLM Infrastructure

towardsai.net·1d

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag. 🏗️LLM Infrastructure Code

github.com·17h·Hacker News

Free vLLM Course: Inference, Compression, Benchmarks 🧠Inference Serving

deeplearning.ai·2d·Hacker News, r/selfhosted

Build a Medical Report Analyzer on Dedicated Inference with Python 🇨🇳Chinese AI

digitalocean.com·21h

How the hell is Groq raising more money? 🧬Mythos

zach.be·3d·Hacker News

Fast and Efficient LLM Inference with vLLM: A New Course with Deeplearning.ai 🧠Inference Serving Blog

vllm.ai·2d·Hacker News

NVIDIA releases Nemotron 3 Ultra, claiming five times the speed and 30 percent lower costs than prior models. The model delivers 300 tokens per second on benchmar... 🗄️Web Datasets

LLM Inference Engineering Room — Part 3: The Orchestration Layer 🧠LLM Inference Blog

vimal-dwarampudi.medium.com·1d

Serving vLLM for LLM Inference 🏗️LLM Infrastructure Blog

Scale On-Prem AI with Foundry Local on Azure Local: Multi-Node Inference and vLLM Support 🧠Inference Serving

techcommunity.microsoft.com·2d

New comment by tjsawyer in "Ask HN: Who wants to be hired? (June 2026)" 🤖AI Discussion

news.ycombinator.com·19h·Hacker News

After Nvidia’s $20B not-acqui-hire, AI chip startup Groq reportedly raising $650M 🖥GPUs

techcrunch.com·6d

mirkolenz/llmhop: Tiny, stateless Go router that dispatches OpenAI-compatible requests to single-model vLLM and sglang backends with zero external dependencies 🤖AI Code

github.com·8h·Hacker News

DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference 🧠Inference Serving Academic

Accelerate autoscaling inference in Red Hat AI with Everpure 🏗️LLM Infrastructure

Intel's mysterious new datacenter GPU is what Nvidia's Rubin CPX nearly was 🖥️Hardware Architecture

theregister.com·19h

How attackers are gaining access to LLM inference 🤖AI Blog

intezer.com·1d

Speculators v0.5.0: DFlash support and online training 🏗️LLM Infrastructure

developers.redhat.com·1d

How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops 🤖AI Video

youtube.com·22h
