🗜️ Quantization - nayyara.airlangga · Scour

Launch HN: General Instinct (YC P26) – Frontier models on edge devices

💰Inference Cost Discussion

news.ycombinator.com · Hacker News

QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy

💰Inference Cost Academic

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

⏱️Prefill Decoding

smolhub.com · r/LocalLLaMA

mtmd : add video input support by ngxson · Pull Request #24269 · ggml-org/llama.cpp

⚡FlashAttention Code

github.com · r/LocalLLaMA

google/gemma-4-12B-it-qat-q4_0-gguf

🧠Inference Engineering

huggingface.co

Where to Host Your Open-Source Model (Under 10B Parameters)

🧠Inference Engineering

digitalocean.com

Benchmarking dots.tts on Strix Halo

🎮GPU Computing

sleepingrobots.com

LLMCodec: Adapting Video Codecs for Efficient Weight Compression of Large Language Models

💰Inference Cost Academic

Using local LLMs for agentic coding

💰Inference Cost Blog

blog.alexewerlof.com

mtp: support for gemma-4 E2B and E4B assistants by max-krasnyansky · Pull Request #24282 · ggml-org/llama.cpp

🚀Model Serving Code

github.com · r/LocalLLaMA

CoreML vs TFLite: iPhone 15 Pro GPU 2.3x Faster

💰Inference Cost Blog Discussion

Knowledge Distillation for Visual Autoregressive Models

⚙️MLOps Academic

zhongkaifu/TensorSharp: A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama/OpenAI-compatible HTTP APIs for programmatic access. It supports Windows/MacOS/Linux with full GPU capability

💾KV Cache Code

github.com · Hacker News

Show HN: Ext-Infer

💰Inference Cost

infer.displace.tech · Hacker News

Florian Brand, Prime Intellect research engineer, adopts Gemma 4 E4B 6-bit quantized as his primary local Mac LLM

💾KV Cache News

digg.com · Hacker News

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

🧠Inference Engineering Code

github.com · r/LocalLLaMA

Making Local LLM Go Brrr

⏱️Prefill Decoding

seanpedersen.github.io

The Edge LLM Offload Story

🧠Inference Engineering

semiengineering.com

apple/coreai-models: Model export recipes, Python primitives, and Swift runtime utilities for on-device AI

🧠Inference Engineering Code

github.com · Hacker News

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

🔢GEMM Optimization Academic
