You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inferen...

Discussed on DEV

What flipped in b9437

Build b9437, published on May 30, 2026 at 20:56 UTC, ships two targeted default-value corrections to llama-bench. Flash attention (-fa) shifts from a hard-coded off to auto (LLAMA_FLASH_ATTN_TYPE_AUTO), and the GPU-layer count (-ngl) changes from the legacy sentinel 99 to -1. Both values now match what llama-server and llama-cli already used; the bench tool was simply never updated to track them until this build.

Quick Answer: Before b9437 (published May 30, 2026), ll...
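Because the defaults changed, benchmark numbers from older builds may not be directly comparable with runs on b9437 unless both flags are passed explicitly. The following is a minimal sketch of a helper that builds a llama-bench command line with the old defaults pinned. It rests on several assumptions: the function name bench_command and the path model.gguf are hypothetical, the spelling "0" for "flash attention off" is assumed rather than taken from the article, and the values are copied from the description above, not read from llama-bench itself. Check `llama-bench --help` on your build before relying on it.

```python
import shlex

# Pre-b9437 defaults as described in the text above (assumed spellings):
# flash attention off, GPU-layer sentinel 99.
LEGACY_DEFAULTS = {"-fa": "0", "-ngl": "99"}


def bench_command(model_path, pin_legacy=False, binary="llama-bench"):
    """Return an argv list for llama-bench.

    With pin_legacy=True the pre-b9437 defaults are passed explicitly,
    so the run does not depend on which defaults the installed build uses.
    """
    argv = [binary, "-m", model_path]
    if pin_legacy:
        for flag, value in LEGACY_DEFAULTS.items():
            argv += [flag, value]
    return argv


if __name__ == "__main__":
    # Print a shell-ready command; model.gguf is a placeholder path.
    print(shlex.join(bench_command("model.gguf", pin_legacy=True)))
```

Passing the list to subprocess.run would launch the benchmark; printing it first makes it easy to confirm the flags against the help output of the build you actually have installed.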
