LLM Inference
Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 176 posts in 52.1 ms
🤖 AI · GitHub · 6 days ago
ahwurm/localharness: Model-agnostic agent harness for local LLMs — configure agents in YAML and run them on your own hardware (vLLM, Ollama, LM Studio, llama.cpp).
Covers: uv
Discussed on: Hacker News
🤖 AI · unsloth.ai · 2 days ago
GLM-5.2 – How to Run Locally
Covers: 2 stories, including "GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inferen..."
Covered by: news.smol.ai
Discussed on: Hacker News
🧠 Memory Management · thecomputersciencebook.com · 6 days ago
PagedAttention is more than virtual memory
Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention
Discussed on: Hacker News
🏗️ LLM Infrastructure · Towards AI · 2 days ago
“Running Local Models Is Good Now” Was Written on a 64GB Mac. Half of You Have 16GB or Less
🏗️ LLM Infrastructure · arxiv.org · 3 days ago
UltraQuant: 4-bit KV Caching for Context-Heavy Agents
Less-relevant results
🤖 AI · threadreaderapp.com · 2 days ago
A YouTuber just did what $60 billion in funding could not stop.
Covers: 2 stories, including Ollama
🏗️ LLM Infrastructure · vettedconsumer.com · 6 days ago
The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)
Covers: 2 stories, including Efficient Memory Management for Large Language Model Serving with PagedAttention
Discussed on: Hacker News
🤖 AI · GitHub · 14 hours ago
I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers!
Covers: 2 stories, including Language models are few-shot learners (2020)
Discussed on: r/LocalLLaMA
🔓 Open Source AI · Anyscale blog posts · 3 days ago
High Performance Distributed Inference with Ray Serve LLM
Covered by: Google Cloud Blog
Discussed on: Hacker News
🆕 New AI · huggingface.co · 3 days ago
225B-A23B
Covered by: news.smol.ai
Discussed on: r/LocalLLaMA
🤖 AI · GitHub · 2 days ago
How do I set the right llama.cpp parameters?
Covers: JSON Schema
Covered by: DEV Community, Alex Ewerlöf Notes
Discussed on: r/LocalLLaMA
🤖 AI · devashish.me · 5 days ago
Two Qwen3 models on one DGX Spark: the residency math
Discussed on: Hacker News
🏗️ LLM Infrastructure · Google Cloud Blog · 4 days ago
Scaling Ray Serve LLM on GKE: Performance without losing the developer experience
🔓 Open Source AI · mstar.stanford.edu · 3 days ago
M* (M-Star): A Modular, Extensible, Serving System for Multimodal Models
Discussed on: Hacker News
🏗️ LLM Infrastructure · GitHub · 2 days ago
Pipeline-parallel LLM inference across GPUs on separate machines
Discussed on: Hacker News
🏗️ LLM Infrastructure · abhishek.it · 3 days ago
Running GLM-5.2 5x faster at 500tps with limitation
Discussed on: Hacker News
📱 Edge AI Optimization · arxiv.org · 4 days ago
From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads
🤖 AI · GitHub · 1 day ago
Running a 35B MoE model on a 2017 AMD RX 580 8GB via Vulkan (no ROCm/CUDA)
Discussed on: Hacker News
🤖 AI · rocm.blogs.amd.com · 5 days ago
Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization
Discussed on: Hacker News
🤖 AI · lmsys.org · 6 days ago
DFlash and Spec V2 Decoding (14 minute read)
Covers: 5 stories, including Looking for a self-hosted alternative to Modal.com for running ML workloads
Discussed on: Hacker News