⚡ Inference
LLM inference, model serving, vLLM, TensorRT, latency
Scoured 316 posts in 7.3 ms
TFLite Edge Model Quantizer Snippet
🤖 AI · itsevilduck.gumroad.com · 2 days ago · DEV
Making LLMs faster and more efficient across multiple languages
🧠 LLMs · techxplore.com · 6 days ago
Machinic Psychopharmacology: Do LLMs Self-Medicate?
🧠 LLMs · lesswrong.com · 7 hours ago · Hacker News
Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss
🧠 LLMs · News · digg.com · 5 days ago
Re-quantizing a local LLM 14x faster by skipping the tensors that didn't change
🧠 LLMs · News, Blog · andreaborio.substack.com · 9 hours ago · Substack
Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design
🤖 AI Agents · Blog · tilert.ai · 2 days ago · Hacker News
Anatomy of a high-performance EP kernel
🧠 LLMs · Blog · fergusfinn.com · 22 hours ago · Hacker News
Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good
🧠 LLMs · Blog · towardsai.net · 2 days ago
Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB
🧠 LLMs · Blog · ziraph.com · 5 days ago · Hacker News
Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM
🧠 LLMs · deemwar-products.github.io · 5 days ago · Hacker News
Token4Token — pay-per-token inference on Gnosis + Swarm
🧠 LLMs · t4t.eth.link · 1 day ago · Hacker News
Optimal Post-Training Quantization Scales and Where to Find Them
🧠 LLMs · Academic · arxiv.org · 18 hours ago
Massive AI Storage Demand Creates a New Memory Wall
🧠 LLMs · News · eetimes.com · 7 hours ago
Making Local LLM Go Brrr
🧠 LLMs · seanpedersen.github.io · 6 days ago
Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent
🧠 LLMs · Blog · dnhkng.github.io · 2 days ago
On-device AI is a margin decision
🧠 LLMs · Blog · ziraph.com · 3 hours ago · Hacker News
Where to Host Your Open-Source Model (Under 10B Parameters)
📊 AI Models · digitalocean.com · 6 days ago
From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure
☸️ K8S · Blog · jimmysong.io · 1 day ago
BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster
🧠 LLMs · sleepingrobots.com · 3 days ago
Google’s DiffusionGemma is 4x faster than its other Gemma models
🧠 LLMs · thenewstack.io · 4 hours ago