📊 AI Performance Profiling - pleto · Scour

harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.

🔧Systems-level optimizations for LLM serving Code

github.com · Hacker News, r/LLM

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

⚙️AI Infrastructure Automation Blog

Beyond Per-Token Pricing: A Concurrency-Aware Methodology for LLM Infrastructure Cost Estimation

✨Model optimizations in LLMs Academic

Google's new open-weights model brings image-generation tricks to AI text generation

🧠Large Language Models (LLMs) News

theregister.com

LLM Routing: From Strategy Selection to Production Architecture

🧠Large Language Models (LLMs) Blog

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

🧠Large Language Models (LLMs)

local-llm.utop.workers.dev · Hacker News

Stop Treating Your Models Like Microservices

⚙️AI Infrastructure Automation

cloudnativenow.com

Bare-metal MSX2+ Emulator for ESP32-S3 offers custom LCD_CAM VGA implementation & Z80 optimizations - CNX Software

🔧Systems-level optimizations for LLM serving News

cnx-software.com

Big Blue’s Redbook on Storage Scale KV Cache management

🔧Systems-level optimizations for LLM serving News

blocksandfiles.com

Optimize EC2 costs with AWS Compute Optimizer right sizing

⚙️AI Infrastructure Automation Blog

aws.amazon.com

Running LLM Inference on Kubernetes: What It Actually Takes

🚀LLM serving frameworks Blog

fairwinds.com

Monitor Nebius AI Cloud with Datadog

⚙️AI Infrastructure Automation Blog

datadoghq.com

AI Serving Platform That Adapts to Your Model

🚀LLM serving frameworks Blog

databricks.com

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

🔧Systems-level optimizations for LLM serving Blog

tilert.ai · Hacker News

Apple rebuilt its on-device AI stack at WWDC 2026

✨Model optimizations in LLMs Blog

ziraph.com · Hacker News

NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

🚀LLM serving frameworks Blog

blogs.nvidia.com

Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

🔧Systems-level optimizations for LLM serving Blog

huggingface.co · Hacker News

CoreML vs TFLite: iPhone 15 Pro GPU 2.3x Faster

✨Model optimizations in LLMs Blog Discussion

Anatomy of a high-performance EP kernel

🔧Systems-level optimizations for LLM serving Blog

fergusfinn.com · Hacker News

Infrastructure Options for Scalable AI Inference

⚙️AI Infrastructure Automation Blog
