⚡ Fast AI Inference - emschwartz · Scour

🏗️LLM Infrastructure groq.com·

Groq Raises Another $650M

Covered by 6 sources including TechCrunch, TNW | Artificial-Intelligence

Discussed on Hacker News

🤖AI NVIDIA Technical Blog·

Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding

Covers 3 stories including NVIDIA/TensorRT-LLM

🏗️LLM Infrastructure GitHub·

For users with 4x-8x 6000 PROs, how is your experience with bigger models lately? (GLM 5.2, Kimi 2.7, DeepSeek V4 Pro)

Discussed on r/LocalLLaMA

⚡Hardware Acceleration The Register·

ZTE builds a TCO-optimal AI factory to fuel token economy

🏗️LLM Infrastructure blog.skypilot.co·

SkyPilot Endpoints: Production-Ready Inference on Every Cluster You Own

Discussed on Hacker News

🔓Open Source AI Anyscale blog posts·

High Performance Distributed Inference with Ray Serve LLM

Covered by Google Cloud Blog

Discussed on Hacker News

🏗️LLM Infrastructure primeintellect.ai·

RL at 1T Scale: prime-rl Performance Deep Dive

Covers 6 stories including Kimi K2.7-Code: open-source coding model with better token efficiency

🧩MoE Modal·

Achieve state-of-the-art inference latencies with speculative decoding

Covers DFlash: Block Diffusion for Flash Speculative Decoding

🤖AI cerebras.ai·

Gemma 4 on Cerebras—The Fastest Inference is Now Multimodal

Covers Home | ArtificialAnalysis.ai

Covered by habr.com

🏗️LLM Infrastructure Baseten·

We built the fastest API for GLM-5.2 (280 TPS)

Covers GLM-5.2 (6 minute read)

Discussed on Hacker News

🏗️LLM Infrastructure GitHub·

Show HN: ParseHawk – 100% Local Document AI with API, CLI, and Web UI

Covers 2 stories including Installation

Discussed on Hacker News

🔓Open Source AI portal.neuralwatt.com·

Neuralwatt: Energy-based pricing for AI inference. Efficient prompts cost less

Discussed on Hacker News

🔓Open Source AI IBM Research·

Running AI on mixed hardware for speed and affordability

Covers Introduction to llm-d Open-source Kubernetes-native Framework for Distributed LLM Inference | Ep 140 #cloudnativefm

🏗️LLM Infrastructure Towards AI·

Stop Crashing and Start Cooking with vLLM on AMD and Lemonade Server

🆕New AI Hugging Face·

225B-A23B

Covered by mail.bycloud.ai, news.smol.ai

Discussed on r/LocalLLaMA

🧠LLM Inference arXiv·

Recency/Frequency Adaptive KV Caching for Large Language Model Serving

⚡Performance graphsignal.com·

CUDA Profiler for Production Inference

Discussed on Hacker News

🏗️LLM Infrastructure GitHub·

Generate per-session LoRA adapters in <1s for agentic inference efficiency

Discussed on Hacker News

⚡Performance GitHub·

Show HN: CUDA Profiler for Production Inference

Covered by tldr.tech

Discussed on Hacker News

🏗️LLM Infrastructure GitHub·

Profile(v2.1.4) physics-aware optimizer for vLLM (31→470 tok/s on A100)

Discussed on Hacker News
