Model Serving Economics

Characterizing Software Aging in GPU-Based LLM Serving Systems

 🏗️LLM Infrastructure  Content type: Academic
arxiv.org

harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.

 🧠LLM Inference  Content type: Code
github.com · Hacker News, r/LLM

NetX-lab/Frontier: Frontier: A Discrete-Event Simulator for Modern LLM Serving

 🏗️LLM Infrastructure  Content type: Code
github.com · Hacker News

BUDDY: BUdget-Driven DYnamic Depth Routing for Adaptive Large Language Model Inference

 🤖AI  Content type: Academic
arxiv.org

SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving

 🔬RaBitQ  Content type: Academic
arxiv.org

Beyond Per-Token Pricing: A Concurrency-Aware Methodology for LLM Infrastructure Cost Estimation

 🏗️LLM Infrastructure  Content type: Academic
arxiv.org
Less-relevant results

TabSwift: An Efficient Tabular Foundation Model with Row-Wise Attention

 🦀New Rust Features  Content type: Academic
arxiv.org

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

 Fast AI Inference  Content type: Academic
arxiv.org

Heterophily-Aware Adaptive Knowledge Distillation for Hypergraph Neural Networks

 📱Edge AI Optimization  Content type: Academic
arxiv.org

AVIS: Adaptive Test-Time Scaling for Vision-Language Models

 🪄Prompt Engineering  Content type: Academic
arxiv.org

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

 📐TLA+  Content type: Academic
arxiv.org

Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination

 Fast AI Inference  Content type: Academic
arxiv.org

FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse

 🔄Incremental Computation  Content type: Academic
arxiv.org

Beyond Raw Signals: Undecoded Generative Latents as Privileged Synthetic Data

 Gemini  Content type: Academic
arxiv.org

Teaching Diffusion to Speculate Left-to-Right

 Fast AI Inference  Content type: Academic
arxiv.org

Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

 🧠Agent Memory  Content type: Academic
arxiv.org

Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

 🏗️LLM Infrastructure  Content type: Academic
arxiv.org

FAME: Forecastability-Aware Mixture of Experts for Heterogeneous Time Series Forecasting

 🧩MoE  Content type: Academic
arxiv.org

Stop Early, Spend Less: Hidden-State Probes as a Practical Recipe for Streaming Moderation of LLM Outputs

 🧠LLM Inference  Content type: Academic
arxiv.org

Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents

 🎭Claude  Content type: Academic
arxiv.org
