⚡ Inference - jobz

🧠LLMs Blog

dnhkng.github.io·

Gemma 4 QAT on 10GB Laptop: Local AI with 6.7GB VRAM

👁️Multimodal AI

everylocalai.com··DEV

Neo-X7/Neo-AI: A fully offline AI assistant powered by Ollama. Stores and retrieves conversations using SQLite + LanceDB vector search. No cloud. No API keys. Runs entirely on your machine.

📐Embeddings Code

github.com··DEV

Less-relevant results

AI Serving Platform That Adapts to Your Model

🎛️Fine-tuning Blog

databricks.com·

Tales of an Ollama Honeypot (Part 3): More Traffic, More Findings

🧪Synthetic Data

posts.inthecyber.com·

The hidden bottleneck in LLM inference and the impact on MLPerf benchmarking

🧠LLMs

edn.com·

WEKA software speeds long context AI inferencing on Oracle’s public cloud

🏛️DAOs News

blocksandfiles.com·

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

🧠LLMs Academic

arxiv.org·

Anatomy of a high-performance EP kernel

🔌MCP Blog

fergusfinn.com··Hacker News

AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support

⚙️Agent Frameworks

phoronix.com·

The economics of speculative decoding

💎Token Economics Blog

fergusfinn.com··Hacker News

DiffusionGemma: The Developer Guide

🎛️Fine-tuning Blog

developers.googleblog.com·

Report: GKE Inference Gateway delivers up to 92% faster AI responses

🔍RAG Blog

cloud.google.com··Hacker News

Self-hosted remote access for Ollama without complicated setup

🏛️DAOs

oab.arc-i.co.uk··r/selfhosted

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

🤖AI Agents Blog

tilert.ai··Hacker News

A system programmer’s guide to LLM inference

🧠LLMs Blog

blog.xiangpeng.systems··Hacker News

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

🧠LLMs

huggingface.co··r/LocalLLaMA

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

👁️Multimodal AI Code

github.com··Hacker News

DiffusionGemma: 4x Faster Text Generation

🔬AI Research News Blog

blog.google··Hacker News, r/LocalLLaMA, r/singularity

Fixing a stuck Ollama runner and building a GPU watchdog

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP
