AI Performance Profiling

harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.

 🔧Systems-level optimizations for LLM serving  Content type: Code
github.com··Hacker News, r/LLM

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

 ⚙️AI Infrastructure Automation  Content type: Blog
jimmysong.io·

Beyond Per-Token Pricing: A Concurrency-Aware Methodology for LLM Infrastructure Cost Estimation

 Model optimizations in LLMs  Content type: Academic
arxiv.org·

Google's new open-weights model brings image-generation tricks to AI text generation

 🧠Large Language Models (LLMs)  Content type: News
theregister.com·

LLM Routing: From Strategy Selection to Production Architecture

 🧠Large Language Models (LLMs)  Content type: Blog
blog.n8n.io·

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

 🧠Large Language Models (LLMs)

Stop Treating Your Models Like Microservices

 ⚙️AI Infrastructure Automation
cloudnativenow.com·

Bare-metal MSX2+ Emulator for ESP32-S3 offers custom LCD_CAM VGA implementation & Z80 optimizations - CNX Software

 🔧Systems-level optimizations for LLM serving  Content type: News
cnx-software.com·

Big Blue’s Redbook on Storage Scale KV Cache management

 🔧Systems-level optimizations for LLM serving  Content type: News
blocksandfiles.com·

Optimize EC2 costs with AWS Compute Optimizer right sizing

 ⚙️AI Infrastructure Automation  Content type: Blog
aws.amazon.com·

Running LLM Inference on Kubernetes: What It Actually Takes

 🚀LLM serving frameworks  Content type: Blog
fairwinds.com·

Monitor Nebius AI Cloud with Datadog

 ⚙️AI Infrastructure Automation  Content type: Blog
datadoghq.com·

AI Serving Platform That Adapts to Your Model

 🚀LLM serving frameworks  Content type: Blog
databricks.com·

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

 🔧Systems-level optimizations for LLM serving  Content type: Blog
tilert.ai··Hacker News

Apple rebuilt its on-device AI stack at WWDC 2026

 Model optimizations in LLMs  Content type: Blog
ziraph.com··Hacker News

NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

 🚀LLM serving frameworks  Content type: Blog
blogs.nvidia.com·

Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

 🔧Systems-level optimizations for LLM serving  Content type: Blog
huggingface.co··Hacker News

CoreML vs TFLite: iPhone 15 Pro GPU 2.3x Faster

 Model optimizations in LLMs  Content type: Blog, Discussion
tildalice.io·

Anatomy of a high-performance EP kernel

 🔧Systems-level optimizations for LLM serving  Content type: Blog

Infrastructure Options for Scalable AI Inference

 ⚙️AI Infrastructure Automation  Content type: Blog
mirantis.com·
