🎮 CUDA - dmndxld · Scour

SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

💻GPU Computing Academic

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

🧠LLMs Code

github.com · Hacker News

DiffusionGemma: The Developer Guide - Google Developers Blog

🧠LLMs Blog

developers.googleblog.com · r/LocalLLaMA
