LLM Inference

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

 Inference  Content type: Code
github.com··Hacker News

Doubling Qwen3.6-27B on One RTX 3090: ollama llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)

 🖥️Local AI  Content type: Blog
dev.to··DEV

I switched from LM Studio to llama.cpp, and I'm never going back to a bloated wrapper

 🖥️Local AI
howtogeek.com·

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

 🔓Open Source AI  Content type: News  Content type: Blog
blog.google··Hacker News

Speculators v0.5.0: DFlash support and online training

 Inference
developers.redhat.com·

The latest Gemma 4 models use a training trick to slash their on-device memory footprint

 🔓Open Source AI
androidauthority.com·

KVarN, Cost.dev, headroom — the week the agent runtime bill got itemized

 🤖AI Inference  Content type: Blog
dev.to··DEV
Less-relevant results

not much happened today | AINews

 🔓Open Source AI
news.smol.ai·

alexziskind1/model-shelf: Model Shelf is a local-first model resolver that helps AI agents and scripts find model weights on your own storage before downloading from Hugging Face. Point it at an internal SSD, NAS, external SSD, or Thunderbolt DAS, and it returns the best local path for GGUF, MLX, safetensors, Ollama, vLLM, and other local AI workflows.

 🖥️Local AI  Content type: Code
github.com·

ComfyUI NVFP4 in 2026: 3× Faster Image Generation on RTX 50-Series (and the Right Format for RTX 40-Series)

 🟩Nvidia  Content type: Blog
dev.to··DEV

Launch HN: General Instinct (YC P26) – Frontier models on edge devices

 🚀Frontier AI  Content type: Discussion

How to Tune llama.cpp --n-gpu-layers: A Practical VRAM Guide (2026)

 🖥️Local AI  Content type: Blog
dev.to··DEV

I kept using Claude Code. Added one thing to it. Cut AI engineering costs by 62%.

 🤖Large Language Models  Content type: Blog
dev.to··DEV

Why Self-Hosted Claude Code Was 15× Slower Than It Should Be

 🧠LLMs  Content type: Blog
dev.to··DEV

KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break

 Quantization  Content type: Blog
dev.to··DEV

NVIDIA and Apple Solved the Hardware. Here's What's Left to Build.

 Quantization  Content type: Blog
dev.to··DEV

Run Gemma-4 12B on WSL2 with llama.cpp

 🔓Open Source AI  Content type: Blog
dev.to··DEV

AI-Native Network Security: Real-Time Threat Detection at the Edge

 💻WMI Abuse  Content type: Blog
dev.to··DEV

[AINews] not much happened today

 🔓Open Source AI  Content type: News
latent.space·

The Future of AI Strategy – "Inference Economics" & Hybrid Infrastructure

 📊Compute Markets  Content type: Blog
dev.to··DEV
