Inference

LLM inference, vLLM, TensorRT, model serving, inference optimization


BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

 🧠LLMs
sleepingrobots.com

What Arm-based innovations happened in May 2026?

 🤖AI Engineering  Content type: Blog
newsroom.arm.com

defai-digital/ax-engine: Apple Silicon LLM runtime supporting Gemma 4 and Qwen 3.6 MTP modes

 🧠LLMs  Content type: Code
github.com · Hacker News

Using local LLMs for agentic coding

 🧠LLMs  Content type: Blog
blog.alexewerlof.com

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

 🧠LLMs  Content type: Academic
arxiv.org

How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops

 🤖AI Engineering  Content type: Video
youtube.com

not much happened today | AINews

 🤖AI Engineering
news.smol.ai

Youssof Altoukhi (@Youssofal_)

 🤖AI Engineering
xcancel.com · r/LocalLLaMA

Beyond Per-Token Pricing: A Concurrency-Aware Methodology for LLM Infrastructure Cost Estimation

 🤖AI Engineering  Content type: Academic
arxiv.org

The Death of the Four Golden Signals: Designing Telemetry for Non-Deterministic Infrastructure

 🛡️Reliability Engineering
devops.com

RightNow-AI/AutoMegaKernel: An agent harness that compiles a model into one provably-correct, self-retargeting CUDA megakernel and self-tunes it past cuBLAS at batch-1 LLM decode.

 🧠LLMs  Content type: Code
github.com · Hacker News

Introducing Granite Libraries and Project Granite Switch

 🤖AI Engineering  Content type: Blog

3x Faster Search: Parallel Test-Time Scaling with Instructed-Retriever-1

 🤖AI Engineering  Content type: Blog
databricks.com

[eCHO News] Episode #104: mTLS for Cilium. Lisp for eBPF

 🔬eBPF

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

 🧠LLMs  Content type: Academic
arxiv.org

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

 🧠LLMs  Content type: Code
github.com · r/LocalLLaMA

google/gemma-4-12B-it-qat-q4_0-gguf

 🧠LLMs
huggingface.co

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

 🧠LLMs  Content type: Academic
arxiv.org

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

 🧠LLMs  Content type: Code
github.com · Hacker News