🧠 LLM Inference - akapaka · Scour

On-device AI is a margin decision

🧠Local llm Blog

ziraph.com··Hacker News

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++

🤖Machine Learning Code

github.com··Hacker News

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

smolhub.com··r/LocalLLaMA

Google’s DiffusionGemma is 4x faster than its other Gemma models

⚡LLM Quantization

thenewstack.io

LLM Research Papers: The 2026 List (January to May)

🧠Local llm News

magazine.sebastianraschka.com··Hacker News

defai-digital/ax-engine: Apple Silicon LLM runtime supporting Gemma 4 and Qwen 3.6 MTP modes

🤖Qwen Code

github.com··Hacker News

DiffusionGemma: 4x Faster Text Generation

⚡LLM Quantization News Blog

blog.google··Hacker News, r/LocalLLaMA, r/singularity

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

huggingface.co··r/LocalLLaMA

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

deemwar-products.github.io··Hacker News

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

⚡LLM Quantization Academic

arxiv.org··Hacker News

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

⚡LLM Quantization Code

github.com··Hacker News

Magenta RealTime 2: Open and Local Live Music Models

⚡LLM Quantization

magenta.withgoogle.com··Hacker News, r/LocalLLaMA

Here's a llama.cpp CLI Command builder.

llamabuilding.com··r/LocalLLaMA

heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.

🤖Qwen Code

github.com··r/LocalLLaMA

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

⚡LLM Quantization Blog

mimo.xiaomi.com··Hacker News, r/LocalLLaMA

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

🧠Local llm Blog

ziraph.com··Hacker News

Anatomy of a high-performance EP kernel

👁️Observability Blog

fergusfinn.com··Hacker News

Introducing Granite Libraries and Project Granite Switch

🔌Model Context Protocol Blog

research.ibm.com··Hacker News

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

🧠Local llm Code

github.com··r/LocalLLaMA

How to Measure Time To First Token (TTFT) in AI Systems

qainsights.com··Hacker News
