Scour
🤖 LLM Inference: Model Serving, Quantization, vLLM, ONNX Runtime
Scoured 170,929 posts in 13.2 ms
Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees
⚡ Inference Optimization · arxiv.org · 1d
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
⚡ Inference Optimization · vldb.org · 6d
Introducing dotLLM - Building an LLM Inference Engine in C#
⚙️ AI Infrastructure · kokosa.dev · 13h · Hacker News
I-DLM: Introspective Diffusion Language Models
👁️ Multimodal LLMs · introspective-diffusion.github.io · 21h · Hacker News, r/LocalLLaMA
amitshekhariitbhu/llm-internals: Learn LLM internals step by step - from tokenization to attention to inference optimization
👁️ Multimodal LLMs · github.com · 1d · Hacker News
AMD makes a big splash with the MI355X in MLPerf Inference 6.0: Over one million tokens per second in multi-node inference
⚡ Inference Optimization · igorslab.de · 1h
The Engine Behind Modern LLM Inference, Part 1: Continuous Batching, PagedAttention, and the End of…
⚡ Inference Optimization · medium.com · 5d
Stop benchmarking inference providers, a guide to easy evaluation
⚡ Inference Optimization · huggingface.co · 14h · r/LocalLLaMA
Three AIs enter. One survives. What a SIGKILL race reveals about inference speed
⚙️ AI Infrastructure · cline.ghost.io · 1d
MiniLM-L6-v2 on the JVM:
⚡ Inference Optimization · medium.com · 5h
Four Reasons Why FPGAs Hit the Sweet Spot for LLM Inference
⚡ Inference Optimization · pub.towardsai.net · 14h
OxiBonsai: The World’s First Pure Rust 1-Bit LLM Inference Engine
⚡ Inference Optimization · kitasanio.medium.com · 2d
Quantization, LoRA, and the 8% Problem: Benchmarking Local LLMs for Production AI
⚙️ AI Infrastructure · walsenburgtech.com · 3d · Hacker News
Model API Performance
⚡ Inference Optimization · news.ycombinator.com · 19h · Hacker News
Google Released Gemma 4 with a Focus On Local-First, On-Device AI Inference
⚙️ AI Infrastructure · infoq.com · 1d
The Global Optimum: An In-Depth Look at TurboQuant and KV Cache Compression
⚡ Inference Optimization · thegradientdescent.medium.com · 7h
Inside the Token Factory: A First-Principles Comparison of vLLM and SGLang
⚙️ AI Infrastructure · hxu296.github.io · 3d · Hacker News
Beyond Helpfulness: Specialized Fine-Tuning for Empathetic AI with Gemma 2B and QLoRA
⚙️ AI Infrastructure · ecorbari.medium.com · 2d
Taalas bets on hard-wired models to beat GPUs at inference
⚙️ AI Infrastructure · jonpeddie.com · 11h
From AGI to LLMs and hallucinations: unpacking confusing AI terms
⚙️ AI Infrastructure · digitaltoday.co.kr · 1d