🤖 LLM Inference - teslartifex · Scour

🤖AI GitHub·

I got tired of not understanding how vLLM works under the hood, so I built my own mini inference engine from scratch.

Discussed on r/LLM

🔮Speculative Decoding NVIDIA Technical Blog·

Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding

Covers 4 stories including NVIDIA Blackwell Architecture

⏱️Latency Engineering Phoronix·

AMD Contributes ONNX Runtime Backend To FFmpeg DNN Filter

✂️Prefill Disaggregation arXiv·

SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference

🤖AI Hugging Face·

Could you help me test MTP for GLM-4.7-Flash?

Discussed on r/LocalLLaMA

🤖AI David Noel Ng·

2x GH200 for LLM inference, Part 3: GLM-5.2, expert offload, and the CPU question

⚡Flash Attention vucense.com·

TurboQuant on Windows and LM Studio 2026: Complete Setup Guide

Covers 2 stories including Discover and run local LLMs

✂️Prefill Disaggregation medium.com·

The Hidden Memory Problem Behind Fast LLM Inference

✂️Prefill Disaggregation fitservers.com·

The Complete Guide to Deploying DeepSeek R1 on a Dedicated Server

🔮Speculative Decoding Modal·

Achieve state-of-the-art inference latencies with speculative decoding

Covers DFlash: Block Diffusion for Flash Speculative Decoding

Less-relevant results

✂️Prefill Disaggregation nextbigfuture.com·

Optimus Teslabot Would Be an Edge Computing Beast

✂️Prefill Disaggregation medium.com·

Debugging Deployments with Gemma 12B, TPU v6e-1, MCP, and Antigravity CLI

✂️Prefill Disaggregation lemmy.ml·

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

✂️Prefill Disaggregation blog.skypilot.co·

SkyPilot Endpoints: Production-Ready Inference on Every Cluster You Own

Discussed on Hacker News

🤖Agentic AI medium.com·

vLLM, Function Calling, and World Models explained

✂️Prefill Disaggregation Red Hat Developer·

Optimizing distributed AI inference: Advanced deployment patterns

Covers 3 stories including DeepSeek-V3 Technical Report

✂️Prefill Disaggregation Ubuntu·

Developing web apps with local LLM inference

✂️Prefill Disaggregation OpenAI News·

OpenAI and Broadcom unveil LLM-optimized inference chip

Covered by Mark Smith's Blog Feed

🤖AI Hugging Face·

Run a vLLM Server on HF Jobs in One Command

Covers 2 stories including Pi.dev: There are many coding agents, but this one is mine

✂️Prefill Disaggregation supercomputing-system-ai-lab.github.io·

VoltanaLLM: Energy-Efficient LLM Serving

Discussed on Hacker News
