🧮 Tensor Cores - dane8036 · Scour

NVIDIA A100 vs RTX 4090 for AI Workloads: The Cost Per FLOP Reality

🔲AI,GPU IC, SOC IC Blog

fitservers.com

A Fast Locality Simulator for GEMM Design-Space Exploration on Multi-Chiplet GPUs

🔲AI,GPU IC, SOC IC Academic

Less-relevant results

Apple WWDC On-Device AI Deep Dive - Google Docs

gist.is · Hacker News

Unsloth Gemma 4 QAT

Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

🔲AI,GPU IC, SOC IC Blog

huggingface.co · Hacker News

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++

🔲AI,GPU IC, SOC IC Code

github.com · Hacker News

OpenCV 5 Debuts with Improved ONNX Support and Native AI Upgrades

🖼Image Processing News

DiffusionGemma: The Developer Guide - Google Developers Blog

🧠NPU Blog

developers.googleblog.com · r/LocalLLaMA

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

local-llm.utop.workers.dev · Hacker News

Making Locality-aware GEMM Compatible with Page-Granularity Placement on Chiplet GPUs

🔲AI,GPU IC, SOC IC Academic

Benchmarking dots.tts on Strix Halo

🔲AI,GPU IC, SOC IC

sleepingrobots.com

NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

🧠NPU Blog

blogs.nvidia.com

The economics of speculative decoding

🧠NPU Blog

fergusfinn.com · Hacker News

CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?

🔲AI,GPU IC, SOC IC

uccl-project.github.io · Hacker News

Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon

🧠NPU Blog

tridao.me · Hacker News

Qwen 3.6 27B AutoRound GGUF, need your feedback

🔲AI,GPU IC, SOC IC

huggingface.co · r/LocalLLaMA

Vortex 3.0 Released As Full-Stack, Open-Source RISC-V GPU Now With 3D Pipeline

CoreML vs TFLite: iPhone 15 Pro GPU 2.3x Faster

📱Edge AI Blog Discussion

Density Field State Space Models: 1-Bit Distillation, Efficient Inference, and Knowledge Organization in Mamba-2

🧠NPU Academic

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

🔲AI,GPU IC, SOC IC Blog
