⚡ LLM Inference - cyberpsych12 · Scour

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

🤖LLMs Code

github.com··Hacker News

Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%

zozo123.github.io··Hacker News

How ERGO Hestia reduced time-to-market with Lakebase and Mosaic AI Model Serving

📈Performance Engineering Blog

databricks.com·

UniSVQ: 2-bit Unified Scalar-Vector Quantization

✍️Prompt Engineering Academic

Google's DiffusionGemma generates 256 tokens in parallel and self-corrects as it goes

venturebeat.com·

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

🤖LLMs News Blog

blog.google··Hacker News

How Does Attention Work in LLMs? 2026 Deep Dive

✍️Prompt Engineering Blog

Qwen 3.6 27B AutoRound GGUF, need your feedback

huggingface.co··r/LocalLLaMA

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

🤖LLMs Blog

A visionary legacy: Andy Brown

uk.themedialeader.com·

Less-relevant results

Friday Five — June 12, 2026

Pruned YOLOv8 ONNX INT8 Fails: 3 Fixes That Work

📈Performance Engineering Blog Discussion

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

🤖LLMs News

newsletter.semianalysis.com··Hacker News

Stop Treating Your Models Like Microservices

🔭Observability

cloudnativenow.com·

The latest Gemma 4 models use a training trick to slash their on-device memory footprint

androidauthority.com·

TurboQuant in PostgreSQL

🗄️Databases Blog

blog.mayflower.de·

NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

🤖LLMs Blog

blogs.nvidia.com·

Unsloth Gemma 4 QAT

🎮GPU Computing

Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks

📈Performance Engineering

aarushgupta.io··Lobsters, Hacker News

massimo92/spark: CLI tool for serving LLMs with vLLM on NVIDIA DGX Spark. One file, zero friction.

🤖LLMs Code

github.com··Hacker News
