LLM Quantization
Specific
quantization, GGUF, INT4, model compression, ternary inference
Scoured 82 posts in 7.3 ms
I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3)
🧠 Local llm · Code · github.com · 2 hours ago · r/LocalLLaMA
OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design
🤖 Qwen · Academic · arxiv.org · 1 day ago
google/gemma-4-12B-it-qat-q4_0-gguf
🧠 Local llm · huggingface.co · 5 days ago
Apple rebuilt its on-device AI stack at WWDC 2026
🤖 Machine Learning · Blog · ziraph.com · 1 day ago · Hacker News
not much happened today | AINews
🧠 Local llm · news.smol.ai · 5 days ago
Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script
🧠 LLM Inference · Code · github.com · 5 days ago · Hacker News
heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.
🧠 LLM Inference · Code · github.com · 4 days ago · r/LocalLLaMA
Where to Host Your Open-Source Model (Under 10B Parameters)
🧠 LLM Inference · digitalocean.com · 6 days ago
1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM
🤖 Qwen · smolhub.com · 2 days ago · r/LocalLLaMA
harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.
🧠 LLM Inference · Code · github.com · 4 days ago · Hacker News
FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model
🧠 Local llm · Academic · arxiv.org · 1 day ago
Using local LLMs for agentic coding
🧠 Local llm · Blog · blog.alexewerlof.com · 6 days ago
bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss
🧠 LLM Inference · Code · github.com · 2 days ago · r/LocalLLaMA
Show HN: Ext-Infer
🧠 Local llm · infer.displace.tech · 4 days ago · Hacker News
Florian Brand, Prime Intellect research engineer, adopts Gemma 4 E4B 6-bit quantized as his primary local Mac LLM
🤖 Qwen · News · digg.com · 3 days ago · Hacker News
Qwen3.6 + MTP: Calculated context size is smaller when I use `--spec-draft-type-* q4_0`. is this normal? · ggml-org llama.cpp · Discussion #24102
🧠 Local llm · Discussion, Code · github.com · 6 days ago · r/LocalLLaMA
Does anyone know what PCIe mode was used for these benchmarks?
🧠 LLM Inference · Code · github.com · 4 days ago · r/LocalLLaMA
2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP
🧠 LLM Inference · Blog · dnhkng.github.io · 3 days ago
defai-digital/ax-engine: Apple Silicon LLM runtime supporting Gemma 4 and Qwen 3.6 MTP modes
🤖 Qwen · Code · github.com · 1 day ago · Hacker News
NeuroBait: I fine-tuned a model to spark dopamine for ADHD brain
🧠 Local llm · Blog · huggingface.co · 1 day ago