🤖 LLM Inference - buckman · Scour

RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference 🧠LLM

vldb.org·6d

Introducing dotLLM - Building an LLM Inference Engine in C# 🦙Ollama

kokosa.dev·14h·Hacker News

amitshekhariitbhu/llm-internals: Learn LLM internals step by step - from tokenization to attention to inference optimization. 🧠LLM

github.com·1d·Hacker News

I-DLM: Introspective Diffusion Language Models 🧠LLM

introspective-diffusion.github.io·22h·Hacker News, r/LocalLLaMA

AMD makes a big splash with the MI355X in MLPerf Inference 6.0: Over one million tokens per second in multi-node inference 🚀Performance

igorslab.de·2h

The Engine Behind Modern LLM Inference, Part 1: Continuous Batching, PagedAttention, and the End of… 🔀Model Routing

medium.com·5d

Stop benchmarking inference providers, a guide to easy evaluation 📊Performance Tools

huggingface.co·15h·r/LocalLLaMA

Four Reasons Why FPGAs Hit the Sweet Spot for LLM Inference ⚡Hardware Acceleration

pub.towardsai.net·15h

Quantization, LoRA, and the 8% Problem: Benchmarking Local LLMs for Production AI ⚙️MLOps

walsenburgtech.com·3d·Hacker News

Model API Performance 🦙Ollama

news.ycombinator.com·19h·Hacker News

patilyashvardhan2002-byte/lazy-moe: The GPU-free LLM inference engine. Combines lazy expert loading + TurboQuant KV compression to run models that shouldn't fit on your hardware. Built from scratch, fully local, zero cloud. 🦙Ollama

github.com·2d·r/LocalLLaMA

Inside the Token Factory: A First-Principles Comparison of vLLM and SGLang 🔌LSP

hxu296.github.io·3d·Hacker News

LLM inference, optimized for your Mac 🦙Ollama

omlx.ai·4d·Hacker News

LLM inference engine written ground-up natively in C#/.NET 🦙Ollama

dotllm.dev·13h·Hacker News

Tutorial: ZML Understanding Deep Learning Inference: From Black Box to Bare Metal with ResNet-18 🧠Deep Learning

neudinger.medium.com·4d

We Put a Gaming Box in the Inference Loop 💸Inference Costs

write.as·6d

Inside LLM Inference: KV Cache, Prefill, and the Decode Bottleneck 💸Inference Costs

pub.towardsai.net·6d

milanm/AutoGrad-Engine: A complete GPT language model (training and inference) in ~600 lines of pure C#, zero dependencies 🧠LLM

github.com·5d·Hacker News

I Ran My KYB Engine at Three Quantization Levels. Accuracy Didn't Move. Cost Dropped 6x. 💸Inference Costs

walsenburgtech.com·5d·Hacker News

Beledarian/wgpu-llm: A from-scratch LLM inference engine that uses wgpu (the cross-platform WebGPU implementation) to dispatch WGSL compute shaders for every math operation a Transformer needs. No CUDA. No Python. No massive framework dependencies. Just Rust, raw shaders, and your GPU. 🦙Ollama

github.com·3d·Hacker News