⚡ LLM Quantization - akapaka

Remove padding and multiple D2D copies for MTP by gaugarg-nv · Pull Request #24086 · ggml-org/llama.cpp

🧠LLM Inference Academic

arxiv.org·

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

🧠LLM Inference

local-llm.utop.workers.dev··Hacker News

Apple WWDC On-Device AI Deep Dive - Google Docs

🧠LLM Inference

gist.is··Hacker News

stable-diffusion.cpp/docs/quantization_and_gguf.md at master · leejet/stable-diffusion.cpp

🧠LLM Inference Code

github.com··r/StableDiffusion

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

🧠LLM Inference Blog

dnhkng.github.io·

Ideogram4 GGUF is out!

🧠Local llm

huggingface.co··r/StableDiffusion

Gemma 4 12B: A unified, encoder-free multimodal model

🧠Local llm Discussion

news.ycombinator.com··Hacker News

alexziskind1/model-shelf: Model Shelf is a local-first model resolver that helps AI agents and scripts find model weights on your own storage before downloading from Hugging Face. Point it at an internal SSD, NAS, external SSD, or Thunderbolt DAS, and it returns the best local path for GGUF, MLX, safetensors, Ollama, vLLM, and other local AI workflows.

🧠Local llm Code

github.com·

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

🧠LLM Inference Academic

arxiv.org·

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

🧠Local llm

sleepingrobots.com·

A system programmer’s guide to LLM inference

🧠LLM Inference Blog

blog.xiangpeng.systems··Hacker News

mtmd : add video input support by ngxson · Pull Request #24269 · ggml-org/llama.cpp

🧠Local llm Code

github.com··r/LocalLLaMA

mtp: support for gemma-4 E2B and E4B assistants by max-krasnyansky · Pull Request #24282 · ggml-org/llama.cpp

🧠Local llm Code

github.com··r/LocalLLaMA

OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

🤖Qwen Academic

arxiv.org·

Dew Drop - June 8, 2026 (#4685)

🧠Local llm

alvinashcraft.com·

Launch HN: General Instinct (YC P26) – Frontier models on edge devices

🧠Local llm Discussion

news.ycombinator.com··Hacker News

john-rocky/apple-silicon-llm-bench: Neutral, reproducible benchmark for local LLMs on Apple Silicon (Mac · iPhone · iPad) — MLX, llama.cpp, CoreML, Apple Foundation Models

🤖Qwen Code

github.com··Hacker News

146th airhacks tv: Rust, Java 25, AI Agents, BCE, Web Components, zunit, zb

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing
