Quantization

Launch HN: General Instinct (YC P26) – Frontier models on edge devices

 💰Inference Cost  Content type: Discussion

QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy

 💰Inference Cost  Content type: Academic
arxiv.org·

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

 ⏱️Prefill Decoding
smolhub.com··r/LocalLLaMA

mtmd : add video input support by ngxson · Pull Request #24269 · ggml-org/llama.cpp

 FlashAttention  Content type: Code
github.com··r/LocalLLaMA

google/gemma-4-12B-it-qat-q4_0-gguf

 🧠Inference Engineering
huggingface.co·

Where to Host Your Open-Source Model (Under 10B Parameters)

 🧠Inference Engineering
digitalocean.com·

Benchmarking dots.tts on Strix Halo

 🎮GPU Computing
sleepingrobots.com·

LLMCodec: Adapting Video Codecs for Efficient Weight Compression of Large Language Models

 💰Inference Cost  Content type: Academic
arxiv.org·

Using local LLMs for agentic coding

 💰Inference Cost  Content type: Blog
blog.alexewerlof.com·

mtp: support for gemma-4 E2B and E4B assistants by max-krasnyansky · Pull Request #24282 · ggml-org/llama.cpp

 🚀Model Serving  Content type: Code
github.com··r/LocalLLaMA

CoreML vs TFLite: iPhone 15 Pro GPU 2.3x Faster

 💰Inference Cost  Content type: Blog  Content type: Discussion
tildalice.io·

Knowledge Distillation for Visual Autoregressive Models

 ⚙️MLOps  Content type: Academic
arxiv.org·

zhongkaifu/TensorSharp: A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama/OpenAI-compatible HTTP APIs for programmatic access. It supports Windows/MacOS/Linux with full GPU capability

 💾KV Cache  Content type: Code
github.com··Hacker News

Show HN: Ext-Infer

 💰Inference Cost

Florian Brand, Prime Intellect research engineer, adopts Gemma 4 E4B 6-bit quantized as his primary local Mac LLM

 💾KV Cache  Content type: News
digg.com··Hacker News

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

 🧠Inference Engineering  Content type: Code
github.com··r/LocalLLaMA

Making Local LLM Go Brrr

 ⏱️Prefill Decoding

The Edge LLM Offload Story

 🧠Inference Engineering
semiengineering.com·

apple/coreai-models: Model export recipes, Python primitives, and Swift runtime utilities for on-device AI

 🧠Inference Engineering  Content type: Code
github.com··Hacker News

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

 🔢GEMM Optimization  Content type: Academic
arxiv.org·