I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers!

Discussed on r/LocalLLaMA

A fork of ik_llama.cpp with experimental per-NUMA-node mirroring of the model weights and KV cache, intended to maximize CPU inference performance on multi-socket systems. - mikechambers84/ik_llam...
