🧠 LLM Inference
Specific
Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 305 posts in 7.8 ms
KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++
🤖 Machine Learning
Content type: Code
github.com · 4 days ago · Hacker News
TFLite Edge Model Quantizer Snippet
🤖 Machine Learning
itsevilduck.gumroad.com · 2 days ago · DEV
Less-relevant results
AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support
🧠 Local llm
phoronix.com · 16 hours ago · r/artificial
Re-quantizing a local LLM 14x faster by skipping the tensors that didn't change
⚡ LLM Quantization
Content type: News, Blog
andreaborio.substack.com · 19 hours ago · Substack
High Bandwidth Flash | A New Memory for AI Data Centers and Edge Computing | Sandisk
🐢 Turso
ncnonline.net · 2 days ago
LLM Research Papers: The 2026 List (January to May)
🧠 Local llm
Content type: News
magazine.sebastianraschka.com · 4 days ago · Hacker News
China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)
⚡ LLM Quantization
Content type: News
decrypt.co · 2 days ago · Hacker News
Where to Host Your Open-Source Model (Under 10B Parameters)
🧠 Local llm
digitalocean.com · 6 days ago
SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving
⚡ LLM Quantization
Content type: Academic
arxiv.org · 4 hours ago
Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script
⚡ LLM Quantization
Content type: Code
github.com · 5 days ago · Hacker News
From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure
☸️ Kubernetes
Content type: Blog
jimmysong.io · 2 days ago
Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks
⚡ LLM Quantization
aarushgupta.io · 1 day ago · Lobsters, Hacker News
The latest Gemma 4 models use a training trick to slash their on-device memory footprint
🧠 Local llm
androidauthority.com · 5 days ago
On-device AI is a margin decision
🧠 Local llm
Content type: Blog
ziraph.com · 14 hours ago · Hacker News
libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.
⚡ LLM Quantization
Content type: Code
github.com · 1 day ago
AI Serving Platform That Adapts to Your Model
☸️ Kubernetes
Content type: Blog
databricks.com · 16 hours ago
Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM
🧠 Local llm
deemwar-products.github.io · 5 days ago · Hacker News
Google’s DiffusionGemma is 4x faster than its other Gemma models
⚡ LLM Quantization
thenewstack.io · 15 hours ago
OpenCV 5.0 Computer Vision Library Released with Rewritten DNN Engine
🤖 Machine Learning
linuxiac.com · 2 days ago
heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.
🤖 Qwen
Content type: Code
github.com · 4 days ago · r/LocalLLaMA