⚡ Inference - jobz · Scour

LLM inference engine from scratch in C++ 🧠LLMs

anirudhsathiya.com·4d·Hacker News

The Engine Behind Modern LLM Inference, Part 1: Continuous Batching, PagedAttention, and the End of… 🧠LLMs

medium.com·23h

Inside LLM Inference: KV Cache, Prefill, and the Decode Bottleneck 🧠LLMs

pub.towardsai.net·1d

AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention 🧠LLMs

arxiv.org·12h

RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference 🧠LLMs

vldb.org·1d

Semidynamics Secures SK hynix Investment to Advance Memory-Centric AI Inference Architecture 💾Agent Memory

hpcwire.com·5h·Hacker News

Overcoming inference challenges 🧠Reasoning Models

redhat.com·3d

I Ran My KYB Engine at Three Quantization Levels. Accuracy Didn't Move. Cost Dropped 6x. 📊Model Evaluation

walsenburgtech.com·23h·Hacker News

We Put a Gaming Box in the Inference Loop 🧠Reasoning Models

write.as·2d

Prediction: The "Inference Supercycle" Could Be Bigger Than the Training Boom. 1 Growth Stock to Own. 🧠Reasoning Models

finance.yahoo.com·23h

milanm/AutoGrad-Engine: A complete GPT language model (training and inference) in ~600 lines of pure C#, zero dependencies 🧠LLMs

github.com·1d·Hacker News

Inside the LLM Black Box: The True Architecture of Latency and Cost 🧠LLMs

akanuri.medium.com·6d

New course: Efficient Inference with SGLang: Text and Image Generation, built in partnership with LMSys @lmsysorg and RadixArk @radixark, and taught by Richard ... 🧠LLMs

twitter.macworks.dev·23h

UCCL-EP: Portable Expert-Parallel Communication 🔌MCP

uccl-project.github.io·2d·Hacker News

How to achieve P90 sub-microsecond latency in a C++ FIX engine 🔌MCP

akinocal1.substack.com·19h·Substack

TurboQuant Is Quietly Solving LLM Inference’s Worst Memory Problem 🧠LLMs

medium.com·5d

Attn-QAT: Making 4-Bit Attention Actually Work 🎛️Fine-tuning

haoailab.com·1d

Better MoE model inference with warp decode 🧠LLMs

cursor.com·4d·Hacker News

Building the Blueprint for Premium Inference 🧪Synthetic Data

sambanova.ai·1d

GPU Memory for LLM Inference: Why Llama-70B Doesn't Fit 🧠LLMs

darshanfofadiya.com·4d·Hacker News
