🧠 LLM Inference
Specific
Quantization, Attention Mechanisms, Batch Processing, KV Caching
On-device AI is a margin decision
🧠 Local llm · Blog · ziraph.com · 8 hours ago · Hacker News
KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++
🤖 Machine Learning · Code · github.com · 3 days ago · Hacker News
1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM
🤖 Qwen · smolhub.com · 2 days ago · r/LocalLLaMA
Google’s DiffusionGemma is 4x faster than its other Gemma models
⚡ LLM Quantization · thenewstack.io · 9 hours ago
LLM Research Papers: The 2026 List (January to May)
🧠 Local llm · News · magazine.sebastianraschka.com · 4 days ago · Hacker News
defai-digital/ax-engine: Apple Silicon LLM runtime supporting Gemma 4 and Qwen 3.6 MTP modes
🤖 Qwen · Code · github.com · 1 day ago · Hacker News
DiffusionGemma: 4x Faster Text Generation
⚡ LLM Quantization · News, Blog · blog.google · 10 hours ago · Hacker News, r/LocalLLaMA, r/singularity
google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation
🦀 Rust · huggingface.co · 2 days ago · r/LocalLLaMA
Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM
🧠 Local llm · deemwar-products.github.io · 5 days ago · Hacker News
FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention
⚡ LLM Quantization · Academic · arxiv.org · 1 day ago · Hacker News
Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script
⚡ LLM Quantization · Code · github.com · 5 days ago · Hacker News
Magenta RealTime 2: Open and Local Live Music Models
⚡ LLM Quantization · magenta.withgoogle.com · 6 days ago · Hacker News, r/LocalLLaMA
Here's a llama.cpp CLI Command builder.
🧠 Local llm · llamabuilding.com · 2 days ago · r/LocalLLaMA
heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.
🤖 Qwen · Code · github.com · 3 days ago · r/LocalLLaMA
MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS
⚡ LLM Quantization · Blog · mimo.xiaomi.com · 3 days ago · Hacker News, r/LocalLLaMA
Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB
🧠 Local llm · Blog · ziraph.com · 5 days ago · Hacker News
Anatomy of a high-performance EP kernel
👁️ Observability · Blog · fergusfinn.com · 1 day ago · Hacker News
Introducing Granite Libraries and Project Granite Switch
🔌 Model Context Protocol · Blog · research.ibm.com · 6 days ago · Hacker News
bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss
🧠 Local llm · Code · github.com · 2 days ago · r/LocalLLaMA
How to Measure Time To First Token (TTFT) in AI Systems
🧠 Local llm · qainsights.com · 4 days ago · Hacker News