🧠 LLM Inference - emschwartz

Discussed on Hacker News

🏗️LLM Infrastructure medium.com

The Transformer Pipeline: A Complete Mathematical and Visual Guide

🔓Open Source AI GitHub·

datalab-to/lift: Extract structured data from documents quickly and accurately.

Covered by habr.com

🏗️LLM Infrastructure Anyscale blog posts·

67% Cost Savings with PD Disaggregation Using Ray and vLLM on AMD MI325X

Discussed on Hacker News

📱Edge AI Optimization arxiv.org·

Quantization as a Malicious Task: Removing Quantization-Conditioned Backdoors via Task Arithmetic

🏗️LLM Infrastructure arxiv.org·

SwiftCache: Efficient LLM Serving for Multi-turn Conversations with Heterogeneous KV Cache Sharing

🏗️LLM Infrastructure arxiv.org·

AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor

🤖AI GitHub·

fix(ollama): preserve configured API during discovery (#93729)

🤖AI GitHub·

[Bug]: ollama-cloud runtime fails DNS lookup for ai.ollama.com, while…

🏗️LLM Infrastructure arxiv.org·

ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training

🏗️LLM Infrastructure arxiv.org·

Unified KV Pooling to Accelerate Long-Context LLM Serving

🤖AI GitHub·

How I Architected a Multi-Provider Fallback for Local RAG

Discussed on DEV

🧠Inference Serving arxiv.org·

PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

🤖AI GitHub·

Keep key-free web search providers opt-in (#93616)

🏗️LLM Infrastructure arxiv.org·

SAC: Disaggregated KV Cache System for Sparse Attention LLMs with CXL

🤖AI GitHub·

Building a Safe, Local AI Coding Agent with Node.js

Discussed on DEV

🏗️LLM Infrastructure arxiv.org·

SMEPilot: Characterizing and Optimizing LLM Inference with Scalable Matrix Extensions

🏗️LLM Infrastructure GitHub·

Profile(v2.1.4) physics-aware optimizer for vLLM (31→470 tok/s on A100)

Discussed on Hacker News

Alpaca doesn't work with Ollama Cloud

Pipeline-parallel LLM inference across GPUs on separate machines
