🏠 Local LLM Deployment - masterdev

KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.

🗃️SQLite Code

github.com··Hacker News

Fixing a stuck Ollama runner and building a GPU watchdog

🖥️Self-hosted apps

patrickmccanna.net··Hacker News

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

🖥️Self-hosted apps News Blog

blog.google··Hacker News

A system programmer’s guide to LLM inference

🖥️Self-hosted apps Blog

blog.xiangpeng.systems··Hacker News

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

🖥️Self-hosted apps

deemwar-products.github.io··Hacker News

Less-relevant results

Apple WWDC On-Device AI Deep Dive - Google Docs

🖥️Self-hosted apps

gist.is··Hacker News

NexusOS v2.0 – A zero-dependency pipeline streaming server chaos to Parquet

🖥️Self-hosted apps

huggingface.co··Hacker News

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

🪟Awesome windows command-line

local-llm.utop.workers.dev··Hacker News

Looking Inside Chromium’s On-Device AI Stack

🖥️Self-hosted apps Blog

island.io··Hacker News

Integrate on-device AI models into your app using Core AI - WWDC26 - Videos

🖥️Self-hosted apps

developer.apple.com··Hacker News

Run (your largest) local models from your iPhone

🗃️SQLite Blog

lmstudio.ai··Hacker News, r/LocalLLaMA

Gemma 4 12B: A unified, encoder-free multimodal model

🗃️SQLite Discussion

news.ycombinator.com··Hacker News

Google’s DiffusionGemma is 4x faster than its other Gemma models

🗃️SQLite

thenewstack.io

local AI agents for Cursor with pre-tuned marketplace/commu

🖥️Self-hosted apps

locaible.com··Hacker News

Omnifs: APIs and data sources as files you can ls, cat, grep, and pipe

🖥️Self-hosted apps

omnifs.dev··Hacker News

Apple Silicon's on-device AI bet hasn't moved – only the chip range that runs it

🖥️Self-hosted apps

tbreak.com··Hacker News, r/apple

Token4Token — pay-per-token inference on Gnosis + Swarm

🖥️Self-hosted apps

t4t.eth.link··Hacker News

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

Ollama 0.30 delivers faster NVIDIA GPU performance and wider hardware support

On-device AI is a margin decision
