LLM Quantization


Remove padding and multiple D2D copies for MTP by gaugarg-nv · Pull Request #24086 · ggml-org/llama.cpp

 🧠Local llm  Content type: Code
github.com · r/LocalLLaMA

146th airhacks tv: Rust, Java 25, AI Agents, BCE, Web Components, zunit, zb

 🧠Local llm  Content type: Blog
adambien.blog

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

 🧠Local llm  Content type: Blog
ziraph.com · Hacker News

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

 🧠LLM Inference  Content type: Academic
arxiv.org

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

 🧠LLM Inference

Apple WWDC On-Device AI Deep Dive - Google Docs

 🧠LLM Inference
gist.is · Hacker News

stable-diffusion.cpp/docs/quantization_and_gguf.md at master · leejet/stable-diffusion.cpp

 🧠LLM Inference  Content type: Code

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

 🧠LLM Inference  Content type: Blog
dnhkng.github.io

Ideogram4 GGUF is out!

 🧠Local llm

Gemma 4 12B: A unified, encoder-free multimodal model

 🧠Local llm  Content type: Discussion

alexziskind1/model-shelf: Model Shelf is a local-first model resolver that helps AI agents and scripts find model weights on your own storage before downloading from Hugging Face. Point it at an internal SSD, NAS, external SSD, or Thunderbolt DAS, and it returns the best local path for GGUF, MLX, safetensors, Ollama, vLLM, and other local AI workflows.

 🧠Local llm  Content type: Code
github.com

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

 🧠LLM Inference  Content type: Academic
arxiv.org

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

 🧠Local llm
sleepingrobots.com

A system programmer’s guide to LLM inference

 🧠LLM Inference  Content type: Blog

mtmd : add video input support by ngxson · Pull Request #24269 · ggml-org/llama.cpp

 🧠Local llm  Content type: Code
github.com · r/LocalLLaMA

mtp: support for gemma-4 E2B and E4B assistants by max-krasnyansky · Pull Request #24282 · ggml-org/llama.cpp

 🧠Local llm  Content type: Code
github.com · r/LocalLLaMA

OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

 🤖Qwen  Content type: Academic
arxiv.org

Dew Drop - June 8, 2026 (#4685)

 🧠Local llm
alvinashcraft.com

Launch HN: General Instinct (YC P26) – Frontier models on edge devices

 🧠Local llm  Content type: Discussion

john-rocky/apple-silicon-llm-bench: Neutral, reproducible benchmark for local LLMs on Apple Silicon (Mac · iPhone · iPad) — MLX, llama.cpp, CoreML, Apple Foundation Models

 🤖Qwen  Content type: Code
github.com · Hacker News
