🧠 LLM Inference - saeedesmaili · Scour

Machinic Psychopharmacology: Do LLMs Self-Medicate?

lesswrong.com··Hacker News

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

⚡CUDA Blog

tilert.ai··Hacker News

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

🔬Deep Learning Code

github.com··r/LocalLLaMA

NexusOS v2.0 – A zero-dependency pipeline streaming server chaos to Parquet

🎒Backpacking

huggingface.co··Hacker News

"North Mini Code"; open weights, 30B param, Canadian coding model

🤖Data science Blog

cohere.com··Hacker News

PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference

🤖LLM Academic

agentgateway Joins AAIF as an Open Gateway for Agentic AI Infrastructure

🤖AI Agents Blog

aaif.io··Hacker News

Releases · youssofal/MTPLX

🤖Data science Code

github.com··Hacker News

Making Local LLM Fast

🪟Context Windows

bogdan.nimblex.net··Hacker News

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

🤖Data science

smolhub.com··r/LocalLLaMA

Teaching Diffusion to Speculate Left-to-Right

🤖LLM Academic

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

🧠LLMs Academic

LLM Research Papers: The 2026 List (January to May)

💬Natural Language Processing News

magazine.sebastianraschka.com··Hacker News

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

huggingface.co··r/LocalLLaMA

defai-digital/ax-engine: Apple Silicon LLM runtime supporting Gemma 4 and Qwen 3.6 MTP modes

💬Natural Language Processing Code

github.com··Hacker News

Why I care so much about energy per token

🎯Fine-tuning Blog

ziraph.com··Hacker News

Arconia for Spring Boot: DevEx, Observability, Multitenancy, GenAI, Cloud Native

🏠Self-Hosting Code

arconia.io··Hacker News

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

🎯Fine-tuning Academic

Less-relevant results

I rebuilt the same SaaS plumbing four times. So I built the thing I wish existed.

🚀Bootstrapping Blog

indiehackers.com··Hacker News

LLM AI Chatbots are letting me down every single day

💬Natural Language Processing

umrashrf.github.io··Hacker News
