🧠 LLMs - nate_dkz

KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.

🧠LLM Code

github.com · Hacker News

Show HN: Ext-Infer

💬Prompt Engineering

infer.displace.tech · Hacker News

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

🧠LLM

local-llm.utop.workers.dev · Hacker News

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

🤖AI

deemwar-products.github.io · Hacker News

google/gemma-4-12B-it-qat-q4_0-gguf

🤖AI

huggingface.co

Less-relevant results

AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis

🤖AI Academic

arxiv.org · Hacker News

Appraising Artworks with Joins and LLMs (Ultorg Database UI)

🤖ChatGPT

ultorg.com · Hacker News

Don't dethrone consciousness

🧠LLM News

theintrinsicperspective.com · Hacker News

Arithmetic Without Numbers – How LLMs Do Math

🧠LLM

alvaro-videla.com · Hacker News

How to Measure Time To First Token (TTFT) in AI Systems

🤖AI

qainsights.com · Hacker News

defai-digital/ax-engine: Apple Silicon LLM runtime supporting Gemma 4 and Qwen 3.6 MTP modes

🤖AI Code

github.com · Hacker News

Show HN: Audit any AI/data pairing with Veritrooper

🧠LLM

veritrooper.com · Hacker News

Introducing Granite Libraries and Project Granite Switch

💬Prompt Engineering Blog

research.ibm.com · Hacker News

Show HN: Axiomax – Cryptographic proof of AI inference carbon footprint

💬Prompt Engineering

axiomaxllc.com · Hacker News

What an LLM Actually Does With Your Prompt First

🧠LLM

siliconopera.com

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

🤖AI

vettedconsumer.com · Hacker News

vishal-dehurdle/state-harness: Runtime safety net for LLM agents. Detects token spirals, kills doomed tasks early, tells you exactly why. Rust core, Python SDK. pip install state-harness

🤖AI Code

github.com · Hacker News

Nvidia Nemotron 3 Ultra

Tokenminning: Because Tokenmaxxing Is a Bad Idea

Research Proposal: Decoupled RISC-LLM Architectures via Circadian Synaptic Consolidation
