👁️ Attention Optimization - miterion · Scour

Cerebras: The $56.4 Billion IPO Challenging NVIDIA’s Memory Wall ⚡Flash Attention

artificialintelligencemadesimple.com·2d

A primer on how large language model works 🎓Model Distillation

mayijie.substack.com·5d·Substack

Sandisk’s AI Pivot Changes The NAND Narrative (NASDAQ:SNDK) ⚡Flash Attention

seekingalpha.com·31m

Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion ⚡Flash Attention

Ollama on Mac: Setup and Optimization Guide (2026) 📊Profiling Tools

insiderllm.com·5d

InferenceBench: A Benchmark for Open-Ended Inference Optimization by AI Agents ⚡ONNX Runtime

inferencebench.ai·17h·Hacker News

Introducing the Ettin Reranker Family 📉Model Quantization

huggingface.co·2d·r/LocalLLaMA

RT by @awnihannun: Subagents running locally and simultaneously on MacBook Pro M5 with Codex CLI + @lmstudio to review code and find bugs using Qwen 3.6 🔄ONNX

twitter.macworks.dev·20h

michelangeloromerochisco/ternative: Inference engine for ternary-weight LLMs with runtime LoRA - the llama.cpp of BitNet models 🔄ONNX

github.com·1d·Hacker News

Gemini Extended Thinking ✨, ChatGPT finance 📱, Claude Code at scale 👨‍💻 🤖AI Coding Tools

Large-scale, SRAM-based LLM Inference Deployment (Groq) ⚡ONNX Runtime

semiengineering.com·40m

AI runs on tokens. There’s a missing artifact between them. ✂️CUTLASS

·2d

DALI VEGA Wireless Hi-Fi System Delivers All-in-One Sound With BluOS, HDMI ARC, and Adaptive Orientation ⏱️Benchmarking

ecoustics.com·8h·ecoustics.com

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips ⏱️CUDA Events

supercomputing-system-ai-lab.github.io·2d·Hacker News

The Ultimate LLM Fine-Tuning Guide ⚡ONNX Runtime

promptinjection.net·4d·Hacker News

Coding Agent Inference Benchmark Revealed ⚡ONNX Runtime

startuphub.ai·1d

Ollama vs vLLM vs llama.cpp: Which Wins for Your Use Case 📊Profiling Tools

tildalice.io·5d

Blazing fast on-device GenAI with LiteRT-LM 🎯Tensor Cores

developers.googleblog.com·1d·Hacker News

New comment by easygenes in "Gemini 3.5 Flash" 🔄ONNX

news.ycombinator.com·1d·Hacker News

MegaTrain Full Precision Training of 100B+ Parameter LLMs on a Single GPU 🏎️TensorRT

github.com·4d·Hacker News
