🧠 LLM Inference - emschwartz · Scour

🏗️LLM Infrastructure GitHub·

Pipeline-parallel LLM inference across GPUs on separate machines

Discussed on Hacker News

🏗️LLM Infrastructure Anyscale blog posts·

67% Cost Savings with PD Disaggregation Using Ray and vLLM on AMD MI325X

Discussed on Hacker News

🤖AI GitHub·

Running a 35B MoE model on a 2017 AMD RX 580 8GB via Vulkan (no ROCm/CUDA)

Discussed on Hacker News

🧠Inference Serving Towards AI·

Continuous Batching: How to Keep Your GPU Actually Busy

Less-relevant results

🤖AI GitHub·

Show HN: Alloy – a PyTorch backend and inference engine for Apple Silicon

Discussed on Hacker News

📱Edge AI Optimization arxiv.org·

From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads

🔓Open Source AI GitHub·

yifanfeng97/Hyper-Extract

Covered by 何夕2077's personal site

🏗️LLM Infrastructure arxiv.org·

SwiftCache: Efficient LLM Serving for Multi-turn Conversations with Heterogeneous KV Cache Sharing

🏗️LLM Infrastructure arxiv.org·

AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor

🏗️LLM Infrastructure arxiv.org·

ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training

🤖AI Towards AI·

How to Run NVIDIA’s Nemotron Locally on Your Laptop or Desktop

🤖AI GitHub·

Native Inference Engine for macOS 14 or newer

Discussed on Hacker News

🧠Inference Serving arxiv.org·

PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

💾Prompt Caching GitHub·

Sors: a Rust proxy that reorders prompts to maximize vLLM prefix cache hits

Discussed on Hacker News

🏗️LLM Infrastructure arxiv.org·

SMEPilot: Characterizing and Optimizing LLM Inference with Scalable Matrix Extensions

🏗️LLM Infrastructure GitHub·

Cosmicgpt – A GPT-in-space simulator to research SpaceX AI satellite viability

Discussed on Hacker News

🏆LLM Benchmarking arxiv.org·

Towards Distributed Inference of LLMs on a P2P Network

🤖AI GitHub·

robert-mcdermott/phlox: Phlox is a self-hostable chat application with an agentic harness, document RAG, code execution, and MCP integration — running over any model provider: AWS Bedrock or any OpenAI-compatible endpoint (OpenAI, Ollama, vLLM, LiteLLM, LM Studio, local models).

Covers 2 stories including Ollama

Discussed on Hacker News

📱Edge AI Optimization arxiv.org·

Efficient On-Device Diffusion LLM Inference with Mobile NPU

🤖AI GitHub·

Show HN: Selora – local model for Home Assistant

Covers 4 stories including Model Context Protocol And OAuth

Discussed on Hacker News
