⚡ Inference Optimization - SeanNg

I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3)

🤖AI Code

github.com · r/LocalLLaMA

Unsloth Gemma 4 QAT

🦙Llama

unsloth.ai

Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good

🦙Llama Blog

towardsai.net

NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization

🎯Fine-tuning Academic

arxiv.org

local llm on laptop 780M GPU using llama + gemma 4 qat

🦙Llama Blog

alper.bearblog.dev

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

🤖LLM

huggingface.co · r/LocalLLaMA

Shrinking a Neural Network Often Makes It Smarter

🦙Llama

siliconopera.com

Ollama 0.30 delivers faster NVIDIA GPU performance and wider hardware support

🦙Llama

alternativeto.net

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

🤖LLM Blog

dnhkng.github.io

What's in the Box? A Field Guide to AI Models

🦙Llama Blog

iankduncan.com

Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss

🦙Llama News

digg.com

Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks

👁️Computer Vision

aarushgupta.io · Lobsters, Hacker News

The Inference Alpha: Maximizing Frontier Models on AMD

🤖LLM Blog

digitalocean.com

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

🪟Context Windows

sleepingrobots.com

146th airhacks tv: Rust, Java 25, AI Agents, BCE, Web Components, zunit, zb

🤖Agent Blog

adambien.blog

CoreML vs TFLite: iPhone 15 Pro GPU 2.3x Faster

🤖AI Blog Discussion

tildalice.io

AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support

OpenCV 5 brings new deep neural network engine, stronger ONNX support, and faster core

DeskDash - a free Windows tool to easily manage your GGUF files

A generalist biomedical vision-language model via multi-CLIP knowledge distillation
