🧠 LLM Inference - emschwartz · Scour

🤖AI GitHub·

ahwurm/localharness: Model-agnostic agent harness for local LLMs — configure agents in YAML and run them on your own hardware (vLLM, Ollama, LM Studio, llama.cpp).

Covers uv

Discussed on Hacker News

🤖AI unsloth.ai·

GLM-5.2 – How to Run Locally

Covers 2 stories including GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inferen...

Covered by news.smol.ai

Discussed on Hacker News

🧠Memory Management thecomputersciencebook.com·

PagedAttention is more than virtual memory

Covers Efficient Memory Management for Large Language Model Serving with PagedAttention

Discussed on Hacker News

🏗️LLM Infrastructure Towards AI·

“Running Local Models Is Good Now” Was Written on a 64GB Mac. Half of You Have 16GB or Less

🏗️LLM Infrastructure arxiv.org·

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

Less-relevant results

🤖AI threadreaderapp.com·

A YouTuber just did what $60 billion in funding could not stop.

Covers 2 stories including Ollama

🏗️LLM Infrastructure vettedconsumer.com·

The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)

Covers 2 stories including Efficient Memory Management for Large Language Model Serving with PagedAttention

Discussed on Hacker News

🤖AI GitHub·

I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers!

Covers 2 stories including Language models are few-shot learners (2020)

Discussed on r/LocalLLaMA

🔓Open Source AI Anyscale blog posts·

High Performance Distributed Inference with Ray Serve LLM

Covered by Google Cloud Blog

Discussed on Hacker News

🆕New AI huggingface.co·

225B-A23B

Covered by news.smol.ai

Discussed on r/LocalLLaMA

🤖AI GitHub·

How do I set the right llama.cpp parameters?

Covers JSON Schema

Covered by DEV Community, Alex Ewerlöf Notes

Discussed on r/LocalLLaMA

🤖AI devashish.me·

Two Qwen3 models on one DGX Spark: the residency math

Discussed on Hacker News

🏗️LLM Infrastructure Google Cloud Blog·

Scaling Ray Serve LLM on GKE: Performance without losing the developer experience

🔓Open Source AI mstar.stanford.edu·

M* (M-Star): A Modular, Extensible, Serving System for Multimodal Models

Discussed on Hacker News

🏗️LLM Infrastructure GitHub·

Pipeline-parallel LLM inference across GPUs on separate machines

Discussed on Hacker News

🏗️LLM Infrastructure abhishek.it·

Running GLM-5.2 5x faster at 500tps with limitation

Discussed on Hacker News

📱Edge AI Optimization arxiv.org·

From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads

🤖AI GitHub·

Running a 35B MoE model on a 2017 AMD RX 580 8GB via Vulkan (no ROCm/CUDA)

Discussed on Hacker News

🤖AI rocm.blogs.amd.com·

Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization

Discussed on Hacker News

🤖AI lmsys.org·

DFlash and Spec V2 Decoding (14 minute read)

Covers 5 stories including Looking for a self-hosted alternative to Modal.com for running ML workloads

Discussed on Hacker News
