🗜️ Quantization - nayyara.airlangga · Scour

Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters

🧠Inference Engineering Academic

Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss

💰Inference Cost News

OpenAI govt stake 🇺🇸, Google compute deal 🚀, Microsoft Scout launch 🤖

💰Inference Cost

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

💰Inference Cost

local-llm.utop.workers.dev · Hacker News

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

💰Inference Cost Blog

ziraph.com · Hacker News

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

🎮GPU Computing Blog

dnhkng.github.io

Gemma 4 12B: A unified, encoder-free multimodal model

⚡FlashAttention Discussion

news.ycombinator.com · Hacker News

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

🚀Speculative Decoding

sleepingrobots.com

FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

💰Inference Cost Academic

Remove padding and multiple D2D copies for MTP by gaugarg-nv · Pull Request #24086 · ggml-org/llama.cpp

🟢CUDA Code

github.com · r/LocalLLaMA

Ideogram4 GGUF is out!

🚀Speculative Decoding

huggingface.co · r/StableDiffusion

Apple rebuilt its on-device AI stack at WWDC 2026

🔢GEMM Optimization Blog

ziraph.com · Hacker News

Dew Drop - June 8, 2026 (#4685)

🧠Inference Engineering

alvinashcraft.com

stable-diffusion.cpp/docs/quantization_and_gguf.md at master · leejet/stable-diffusion.cpp

💰Inference Cost Code

github.com · r/StableDiffusion

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

🧠Inference Engineering Blog

dnhkng.github.io

Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data

💰Inference Cost Academic

iChristGit/comfyui-llamacpp-ideogram: ComfyUI Prompt enhancer for ideogram4 powered by llama cpp

🚀Speculative Decoding Code

github.com · r/StableDiffusion

not much happened today | AINews

🧠Inference Engineering

alexziskind1/model-shelf: Model Shelf is a local-first model resolver that helps AI agents and scripts find model weights on your own storage before downloading from Hugging Face. Point it at an internal SSD, NAS, external SSD, or Thunderbolt DAS, and it returns the best local path for GGUF, MLX, safetensors, Ollama, vLLM, and other local AI workflows.

🧠Inference Engineering Code

Launch HN: General Instinct (YC P26) – Frontier models on edge devices

💰Inference Cost Discussion

news.ycombinator.com · Hacker News
