Scour
LocalLlama · reddit.com
Achilles1089/duplex-chat: AI that thinks while you type. Speculative inference protocol that eliminates perceived latency in AI chat.
github.com · 5w · r/LocalLLaMA
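The idea behind the duplex-chat entry can be sketched as: speculatively run inference on the draft message while the user is still typing, then serve the cached result on submit if the draft did not change. A minimal synchronous sketch — the `model_generate` stub and the class below are hypothetical illustrations, not the repo's API; a real implementation would run generation asynchronously and cancel stale drafts:

```python
import time

def model_generate(prompt: str) -> str:
    """Stand-in for a local LLM call (hypothetical; a real call is slow)."""
    time.sleep(0.01)  # simulate inference latency
    return f"reply to: {prompt}"

class SpeculativeChat:
    """Speculate on the draft message during typing; on submit, serve the
    cached response if the draft is unchanged (near-zero perceived latency)."""

    def __init__(self):
        self._draft = None
        self._speculated = None

    def on_keystroke(self, draft: str) -> None:
        # Real systems would do this async and cancelable; here it is sync.
        self._draft = draft
        self._speculated = model_generate(draft)

    def on_submit(self, message: str) -> tuple[str, bool]:
        if message == self._draft and self._speculated is not None:
            return self._speculated, True        # cache hit: answer is ready
        return model_generate(message), False    # miss: normal inference path

chat = SpeculativeChat()
chat.on_keystroke("hello wor")
chat.on_keystroke("hello world")
reply, hit = chat.on_submit("hello world")
```

The win is entirely in perceived latency: total compute goes up (wasted speculations), but the common case of "submit exactly what was typed" returns instantly.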
zolotukhin/zinc: Zig INferenCe Engine — LLM inference for AMD RDNA3/RDNA4 GPUs via Vulkan
github.com · 5w · Hacker News, r/LocalLLaMA, r/Zig
Inference speed comparisons between M1 Pro and maxed-out M4 Max
github.com · 61w · r/LocalLLaMA
yashkc2025/turboquant: Python implementation of TurboQuant (arXiv 2504.19874). Data-oblivious, near-optimal 1–4 bit vector quantization for streaming KV-caches and databases.
github.com · 5w · r/LocalLLaMA
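TurboQuant's setting — data-oblivious low-bit vector quantization — can be illustrated with the classic 1-bit recipe: apply a fixed random rotation so each vector's energy spreads evenly across coordinates, then store only the signs plus one scale per vector. This is a generic sketch of that idea, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Data-oblivious preprocessing: one fixed random orthonormal rotation makes
# every input vector look roughly isotropic, so per-coordinate sign
# quantization loses little information regardless of the data distribution.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def quantize_1bit(v):
    r = Q @ v
    # One sign bit per coordinate, plus a single scale chosen so the
    # reconstruction preserves the vector's norm.
    return np.sign(r), np.linalg.norm(r) / np.sqrt(d)

def dequantize(bits, scale):
    return Q.T @ (bits * scale)

v = rng.standard_normal(d)
bits, scale = quantize_1bit(v)
v_hat = dequantize(bits, scale)
cos = v @ v_hat / (np.linalg.norm(v) * np.linalg.norm(v_hat))
```

For rotated (near-Gaussian) coordinates the expected cosine similarity of this 1-bit reconstruction is about sqrt(2/pi) ≈ 0.8; higher bit widths refine each coordinate rather than just its sign.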
mahmoudsamy7729/agentic-rag: A clean, modular implementation of an Agentic RAG (Retrieval-Augmented Generation) system built with a production-ready architecture.
github.com · 5w · r/LocalLLaMA
Nvidia Kimodo: kinematic motion diffusion model trained on mocap data
research.nvidia.com · 7w · Hacker News, r/LocalLLaMA
Day 27 of building an autonomous AI lab with real capital.
descubriendoloesencial.substack.com · 5w · r/LocalLLaMA, r/SideProject
Inference Engines — A visual deep dive into the journey of a token down the transformer layers
femiadeniran.com · 5w · r/LocalLLaMA
Al0olo/voxtral-voice-clone: Training the missing codec encoder for Mistral's Voxtral-4B-TTS, enabling zero-shot voice cloning
github.com · 5w · r/LocalLLaMA
Running Qwen 3.5 (122B) with ~72GB of VRAM
huggingface.co · 9w · r/LocalLLaMA
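The ~72GB figure is consistent with back-of-envelope arithmetic: 122B weights at roughly 4.5 bits each (a typical 4-bit quantization format once per-block scales are counted — an assumption, not the model card's number), plus a few GB of KV cache for a hypothetical model config:

```python
def quant_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Decimal gigabytes needed for the quantized weights alone."""
    return n_params * bits_per_weight / 8 / 1e9

# 122B parameters at ~4.5 effective bits/weight (assumption)
weights = quant_weights_gb(122e9, 4.5)  # ~68.6 GB

# Rough fp16 KV-cache cost for a hypothetical config: 80 layers,
# 8 KV heads, head_dim 128, 8k context, 2 bytes/element,
# and a factor 2 for keys + values:
kv = 80 * 2 * 8 * 128 * 8192 * 2 / 1e9  # ~2.7 GB

total = weights + kv  # ~71.3 GB, in line with the ~72GB claim
```

Longer contexts, larger batch sizes, or a less aggressive quant push the total past 72GB quickly, which is why KV-cache quantization often accompanies setups like this.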
Agentic OS that replaces OpenClaw
github.com · 5w · r/LocalLLaMA
nicedreamzapp/claude-code-local: Run Claude Code with local AI on Apple Silicon. 122B model at 41 tok/s with Google TurboQuant. No cloud, no API fees.
github.com · 5w · r/ClaudeAI, r/LocalLLaMA
Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF
huggingface.co · 5w · r/LocalLLaMA
Breaking change in llama-server?
github.com · 5w · r/LocalLLaMA
Llama.cpp with TurboQuant, Heavy-Hitter Oracle (H2O), and StreamingLLM. Even more performance!
github.com · 5w · r/LocalLLaMA
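Of the two cache-eviction schemes named in that entry, StreamingLLM is the simpler: keep the first few "attention sink" tokens forever plus a sliding window of recent tokens, and evict everything in between (H2O additionally retains high-attention "heavy hitter" tokens). A minimal sketch of the StreamingLLM keep/evict policy over token positions:

```python
from collections import deque

class StreamingKVCache:
    """StreamingLLM-style eviction sketch: the first `n_sink` token
    positions are kept forever (attention sinks), the most recent
    `window` positions are kept in a sliding window, and everything
    in between is evicted."""

    def __init__(self, n_sink: int = 4, window: int = 8):
        self.n_sink = n_sink
        self.sink = []                        # positions kept forever
        self.recent = deque(maxlen=window)    # sliding window of positions

    def append(self, pos: int) -> None:
        if len(self.sink) < self.n_sink:
            self.sink.append(pos)
        else:
            self.recent.append(pos)           # deque evicts the oldest itself

    def kept(self) -> list[int]:
        return self.sink + list(self.recent)

cache = StreamingKVCache(n_sink=4, window=8)
for pos in range(100):
    cache.append(pos)
```

After 100 tokens only 12 positions remain resident, which is why these schemes let llama.cpp-style runtimes hold long conversations in a bounded KV budget.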
Lowkey-Loki-SN/noflash-attention: Flash-attention-class memory efficiency for GPUs without flash attention
github.com · 5w · r/LocalLLaMA
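Flash-attention-class memory efficiency comes largely from never materializing the full attention-score matrix: keys are processed in chunks with a running max and running softmax denominator (the online-softmax trick), which needs only O(chunk) memory and can be written as plain tensor ops on GPUs lacking fused flash kernels. A single-query NumPy sketch of the general technique (not this repo's code):

```python
import numpy as np

def chunked_attention(q, K, V, chunk: int = 64):
    """Attention output for one query without the full score row in memory:
    iterate over key chunks, maintaining a running max `m`, a running
    softmax denominator, and a running weighted sum of values."""
    m = -np.inf                     # running max of scores seen so far
    denom = 0.0                     # running softmax denominator
    out = np.zeros(V.shape[1])      # running weighted value sum
    for i in range(0, len(K), chunk):
        s = K[i:i + chunk] @ q              # scores for this chunk only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()     # rescale old sums to the new max
        out = out * scale + p @ V[i:i + chunk]
        m = m_new
    return out / denom

rng = np.random.default_rng(0)
q = rng.standard_normal(16)
K = rng.standard_normal((256, 16))
V = rng.standard_normal((256, 32))
result = chunked_attention(q, K, V)
```

The rescaling by `exp(m - m_new)` is what keeps the incremental softmax numerically identical to the naive two-pass version.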
ggml: allow prefetching tensor overrides by am17an · Pull Request #21067
github.com · 5w · r/LocalLLaMA
I built a fully local GraphRAG pipeline (0 GPUs needed) using Llama 3.1, Neo4j, and LangChain. Code included!
github.com · 5w · r/LocalLLaMA, r/vibecoding
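The shape of a GraphRAG pipeline — extract entity triples, store them in a graph, and answer questions by retrieving the neighborhood of the query's entities as LLM context — can be sketched without Neo4j or LangChain. The triple store and hard-coded triples below are toys standing in for the poster's actual stack (a real pipeline would have a local LLM extract the triples):

```python
from collections import defaultdict

# Adjacency-list triple store standing in for a graph database.
graph = defaultdict(list)

def add_triple(subj: str, rel: str, obj: str) -> None:
    graph[subj].append((rel, obj))
    graph[obj].append((f"inverse:{rel}", subj))  # allow traversal both ways

def retrieve(entity: str, hops: int = 1):
    """Collect the k-hop neighborhood of an entity; in GraphRAG these
    facts become the retrieved context handed to the generator LLM."""
    frontier, seen, facts = {entity}, {entity}, []
    for _ in range(hops):
        nxt = set()
        for node in frontier:
            for rel, other in graph[node]:
                facts.append((node, rel, other))
                if other not in seen:
                    seen.add(other)
                    nxt.add(other)
        frontier = nxt
    return facts

# Hypothetical extracted triples, for illustration only:
add_triple("Llama 3.1", "created_by", "Meta")
add_triple("Llama 3.1", "used_in", "GraphRAG pipeline")
context = retrieve("Llama 3.1")
```

Graph retrieval's advantage over plain vector search is multi-hop: raising `hops` pulls in facts about entities the query never mentions but is connected to.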
RED-BASE/SpruceChat: A tiny AI that lives inside your handheld. Local LLM chat on spruceOS.
github.com · 5w · r/LocalLLaMA
ARC-AGI-3
arcprize.org · 41w · Hacker News, r/LocalLLaMA