⚡ Inference - ratorcvn · Scour

TFLite Edge Model Quantizer Snippet

itsevilduck.gumroad.com · DEV

Making LLMs faster and more efficient across multiple languages

techxplore.com

Machinic Psychopharmacology: Do LLMs Self-Medicate?

lesswrong.com · Hacker News

Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss

🧠LLMs News

Re-quantizing a local LLM 14x faster by skipping the tensors that didn't change

🧠LLMs News Blog

andreaborio.substack.com · Substack

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

🤖AI Agents Blog

tilert.ai · Hacker News

Anatomy of a high-performance EP kernel

🧠LLMs Blog

fergusfinn.com · Hacker News

Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good

🧠LLMs Blog

towardsai.net

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

🧠LLMs Blog

ziraph.com · Hacker News

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

deemwar-products.github.io · Hacker News

Token4Token — pay-per-token inference on Gnosis + Swarm

t4t.eth.link · Hacker News

Optimal Post-Training Quantization Scales and Where to Find Them

🧠LLMs Academic

Massive AI Storage Demand Creates a New Memory Wall

🧠LLMs News

Making Local LLM Go Brrr

seanpedersen.github.io

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

🧠LLMs Blog

dnhkng.github.io

On-device AI is a margin decision

🧠LLMs Blog

ziraph.com · Hacker News

Where to Host Your Open-Source Model (Under 10B Parameters)

digitalocean.com

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

☸️K8S Blog

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

sleepingrobots.com

Google’s DiffusionGemma is 4x faster than its other Gemma models

thenewstack.io
