🧠 Inference Serving - emschwartz · Scour

Introducing Dedicated Container Inference: Delivering 2.6x faster inference for custom AI models

together.ai·16h

🏗️LLM Infrastructure

PARD: Enhancing Goodput for Inference Pipeline via Proactive Request Dropping

arxiv.org·2d

📱Edge AI Optimization

AI Inference Needs A Mix-And-Match Memory Strategy

semiengineering.com·8h

🏗️LLM Infrastructure

Show HN: A header-only C++ benchmark for predictive models on raw binary streams

github.com·8h·Discuss: Hacker News

Real-Time AI Streaming in Production: What We Built at Helpmaton

metaduck.com·1h

💾Prompt Caching

Supercharging Inference for AI Factories: KV Cache Offload as a Memory-Hierarchy Problem

blog.min.io·2h

🏗️LLM Infrastructure

Leading Inference Providers Cut AI Costs by up to 10x With Open Source Models on NVIDIA Blackwell

blogs.nvidia.com·52m

📊Model Serving Economics

Mastering Amazon Bedrock throttling and service availability: A comprehensive guide

aws.amazon.com·1d

🛡️DDoS Mitigation

Compute Only Once: UG-Separation for Efficient Large Recommendation Models

arxiv.org·11h

🎛️Feed Filtering

Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization

machinelearning.apple.com·2d

📦Batch Embeddings

Uber’s Rate Limiting System

uber.com·2h·Discuss: Hacker News

🛡️DDoS Mitigation

Scheduling in a changing world: Maximizing throughput with time-varying capacity

research.google·1d

📅Resource Scheduling

Cache-aware disaggregated inference for up to 40% faster long-context LLM serving

together.ai·1d·Discuss: Hacker News, r/LocalLLaMA

💾Prompt Caching

Deterministic Inference with EigenAI

deterministicinference.com·22h

🧠LLM Inference

Fine Grained Everything, and what comes after React Server Components

blog.logrocket.com·1d

🦀Rust Web Services

Introducing AutoDiscovery: Automated scientific discovery, now in AstaLabs

allenai.org·57m

📊IVF Indexes

Kong Context Mesh prepares enterprise APIs for AI agents

techzine.eu·1d

How neoclouds meet the demands of AI workloads

infoworld.com·7h

ShareChat hit a billion features per second, then it had to make it 10x cheaper

thenewstack.io·2h

🏗️Infrastructure Economics

Functional Optics for Modern Java

blog.scottlogic.com·16h

🌊Async Patterns
