🧠 LLM Inference
Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 295 posts in 33.4 ms
🤖 AI · GitHub · 6 days ago
ahwurm/localharness: Model-agnostic agent harness for local LLMs — configure agents in YAML and run them on your own hardware (vLLM, Ollama, LM Studio, llama.cpp).
Covers: uv
Discussed on: Hacker News
🔓 Open Source AI · fitservers.com · 2 days ago
The Complete Guide to Deploying DeepSeek R1 on a Dedicated Server
🏗️ LLM Infrastructure · medium.com · 2 days ago
vLLM, Function Calling, and World Models explained
🧠 Memory Management · thecomputersciencebook.com · 6 days ago
PagedAttention is more than virtual memory
Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention
Discussed on: Hacker News
🤖 AI · medium.com · 2 days ago
Don’t Use Ollama for Local LLMs
🏗️ LLM Infrastructure · arxiv.org · 3 days ago
UltraQuant: 4-bit KV Caching for Context-Heavy Agents
🤖 AI · GitHub · 8 hours ago
fix(ollama): support GLM-5.2 cloud discovery
🤖 AI · lemmy.world · 1 day ago
Wrote up a full guide for running AI locally on Windows (LM Studio + Ollama + Open WebUI)
🏗️ LLM Infrastructure · networkworld.com · 4 days ago
Tether is shipping TurboQuant KV-cache quantization with Vulkan support into its QVAC SDK
🏗️ LLM Infrastructure · medium.com · 2 days ago
The Context Budget That Will Decide Everyday AI
🏗️ LLM Infrastructure · vettedconsumer.com · 6 days ago
The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)
Covers: 2 stories, including Efficient Memory Management for Large Language Model Serving with PagedAttention
Discussed on: Hacker News
🔄 LLM RAG Pipelines · pyimagesearch.com · 6 days ago
RAG Observability with Langfuse, vLLM, and FAISS
🔓 Open Source AI · Anyscale blog posts · 3 days ago
High Performance Distributed Inference with Ray Serve LLM
Covered by: Google Cloud Blog
Discussed on: Hacker News
🤖 AI · GitHub · 13 hours ago
I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers!
Covers: 2 stories, including Language models are few-shot learners (2020)
Discussed on: r/LocalLLaMA
🤖 AI · pypi.org · 6 days ago
Show HN: Subagent-fleet – AI coding subagents across local Ollama machines
Discussed on: Hacker News
📱 Edge AI Optimization · medium.com · 3 days ago
How I Shrunk a Plant Disease Classifier from 16MB to 5MB with Less Than 1% Accuracy Loss
Covered by: habr.com
🏗️ LLM Infrastructure · GitHub · 2 days ago
I got tired of not understanding how vLLM works under the hood, so I built my own mini inference engine from scratch.
Discussed on: r/LLM
⚡ Fast AI Inference · thecybersidekick.beehiiv.com · 3 days ago
AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm
Discussed on: DEV
🆕 New AI · huggingface.co · 3 days ago
225B-A23B
Covered by: news.smol.ai
Discussed on: r/LocalLLaMA
🤖 AI · GitHub · 2 days ago
How do I set the right llama.cpp parameters?
Covers: JSON Schema
Covered by: DEV Community, Alex Ewerlöf Notes
Discussed on: r/LocalLLaMA