⚡ Inference - foglerek

🧠LLMs Blog

equixly.com · Hacker News

Report: GKE Inference Gateway delivers up to 92% faster AI responses

🧠LLMs Blog

cloud.google.com · Hacker News

Token4Token — pay-per-token inference on Gnosis + Swarm

🧠LLMs

t4t.eth.link · Hacker News

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

🌐Open Source AI News Blog

blog.google · Hacker News

libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.

🌐Open Source AI Code

github.com

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

🧠LLMs Blog

jimmysong.io

Speculators v0.5.0: DFlash support and online training

🧠LLMs

developers.redhat.com

Less-relevant results

Ask HN: Is software engineering still a good career choice for new students?

🧠LLMs Discussion

news.ycombinator.com · Hacker News

HNSW vs LSH: How Elasticsearch hits 0.99 recall@10 at 15,000 QPS — and what it costs

🧠LLMs Blog

elastic.co

146th airhacks tv: Rust, Java 25, AI Agents, BCE, Web Components, zunit, zb

🧠LLMs Blog

adambien.blog

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

🧠LLMs Academic

arxiv.org

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

🌐Open Source AI

vettedconsumer.com · Hacker News

High Bandwidth Flash | A New Memory for AI Data Centers and Edge Computing | Sandisk

🎛️Fine-tuning

ncnonline.net

The Bill Arrives: How to Manage Agentic AI Costs at Scale

🤖AI Agents Blog

cockroachlabs.com

Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!

🌐Open Source AI

gizchina.com

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

🌐Open Source AI News Blog

kaitchup.substack.com · r/LocalLLaMA

Pruned YOLOv8 ONNX INT8 Fails: 3 Fixes That Work

🎛️Fine-tuning Blog Discussion

tildalice.io

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

🎛️Fine-tuning Blog

mimo.xiaomi.com · Hacker News, r/LocalLLaMA

Build a Medical Report Analyzer on Dedicated Inference with Python

🧠LLMs

digitalocean.com

Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks

How we fight GPU scarcity without compromise
