💰 Inference Cost - nayyara.airlangga · Scour

Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good

🧠Inference Engineering Blog

towardsai.net·

Optimal Post-Training Quantization Scales and Where to Find Them

🗜️Quantization Academic

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

🗜️Quantization News Blog

kaitchup.substack.com··r/LocalLLaMA

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

🧠Inference Engineering Blog

tilert.ai··Hacker News

Domain-Specific Small Language Models (Manning)

⚙️ML Compilers

i-programmer.info·

Apple rebuilt its on-device AI stack at WWDC 2026

🔢GEMM Optimization Blog

ziraph.com··Hacker News

Where to Host Your Open-Source Model (Under 10B Parameters)

🧠Inference Engineering

digitalocean.com·

Researchers Build Self-Replicating AI Worm That Operates Entirely on Local, Open-Weight Models

thehackernews.com·

Unsloth Gemma 4 QAT

🗜️Quantization

ASUS ExpertBook Ultra Flagship Business Laptop Debuts In SEA Markets, Featuring Sub-1kg Chassis & Intel Core Ultra X7 Processor

🧠Inference Engineering

stable-diffusion.cpp/docs/quantization_and_gguf.md at master · leejet/stable-diffusion.cpp

🗜️Quantization Code

github.com··r/StableDiffusion

Azure OpenAI Architecture: The Decisions That Actually Matter (Part 1)

☁️Cloud Infrastructure

techcommunity.microsoft.com·

Local LLMs, Buy a GPU, and the Case for Cognitive Security

🎮GPU Computing

briefing.forwardfuture.ai·

Ask HN: Is software engineering still a good career choice for new students?

🧠Inference Engineering Discussion

news.ycombinator.com··Hacker News

Why agentic AI needs an open inference stack

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

🧠Inference Engineering Academic

TFLite Edge Model Quantizer Snippet

🔢FP8 Training

itsevilduck.gumroad.com··DEV

MLPerf and the rise of latency-aware LLM benchmarking

⏱️Prefill Decoding

Re-quantizing a local LLM 14x faster by skipping the tensors that didn't change

🗜️Quantization News Blog

andreaborio.substack.com··Substack

3x Faster Search: Parallel Test-Time Scaling with Instructed-Retriever-1

⏱️Prefill Decoding Blog

databricks.com·
