⚡ LLM Serving - rdksupe · Scour

🤖AI Agents medium.com·

The Context Budget That Will Decide Everyday AI

Less-relevant results

🏗️Systems Design blocksandfiles·

Dell and data physics

📈LLM Scaling arXiv·

Human-Less LLM Serving: Quantifying the Human Tax on Throughput

📊Machine Learning jimmysong.io·

Why GPUs Became the Foundation of AI: A GPU Primer for K8s Veterans

📈LLM Scaling medium.com·

LLM Inference Optimization: The Difference Between an AI Demo and an AI Business

🖥️GPU Computing Hugging Face·

GLM-5.2: Built for Long-Horizon Tasks

Covers 5 stories including New model GLM-Experimental is quite good (not local so far)

Covered by 3 sources including vettedconsumer.com, DEV Community

Discussed on Hacker News and r/LocalLLaMA

🔬Deep Learning GitHub·

100+ t/s on Qwen3.6-27B Q8 across a 5090 + 3090 Ti — switching to tensor split-mode got me from 70 to 100+

Covered by imil.net, NVIDIA Technical Blog

Discussed on r/LocalLLaMA

⚙️MLOps AWS·

Monitor and debug generative AI inference with SageMaker detailed metrics and Insights dashboard on CloudWatch

📈LLM Scaling arXiv·

Geometry-Aware Online Scheduling for LLM Serving: From Theoretical Bound to System Practice

🖥️GPU Computing digitalocean.com·

Efficient LLM Compression with SparseGPT and Wanda on GPU Cloud

Covers NVIDIA Triton Inference Server

🏗️Systems Design ServeTheHome·

This is the Storage of Spaceborne Computer 4 Bringing AI Compute to the Moon

🔬Deep Learning certdepot.net·

Recent Technical

🖥️GPU Computing arXiv·

HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

🧠Transformer Architecture news.smol.ai·

not much happened today | AINews

🔬Deep Learning GitHub·

DeepSeek V4 Flash optimized framework and model variants for DGX Spark

Covers Nvidia RTX Spark

Discussed on Hacker News

⚙️MLOps alternativeto.net·

Z.ai debuts GLM-5.2 with stable 1M-token context and top coding scores

🧠Transformer Architecture arXiv·

Delay-Adaptive Speculation Control for Low-Latency Edge-Cloud LLM Inference

🧠Transformer Architecture XDA·

I tested Google's new Gemma 4 12B on my 8GB GPU, and now I don't want to go back to smaller models

Discussed on Hacker News

🔬Deep Learning together.ai·

ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)

Covers 3 stories including Show HN: Mini-swe-agent achieves 65% on SWE-bench in 100 lines of python

🧠Transformer Architecture Martin Alderson·

Expert-aware quantisation: near-Q4 quality at near-Q2 size?

Discussed on Hacker News
