💾 Cache Optimization | arxiv.org | Academic

A Spatio-Temporal Expert Prefetching Framework for Efficient MoE-based LLM Inference

Mixture-of-Experts (MoE) based large language models (LLMs), such as Qwen and DeepSeek, have recently emerged as an effective approach to improving model capacity without proportionally increasing computational cost. By replacing the conventional feed-forward network in dense LLMs with a set of experts and activating only a subset of them for each input token, MoE models significantly increase the total number of parameters while keeping the per...
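
The sparse activation described above can be illustrated with a minimal sketch of top-k expert routing. This is not the paper's prefetching framework, nor the routing used by Qwen or DeepSeek. The names `top_k_route` and `moe_layer`, the toy experts, and the choice to renormalize with a softmax over only the selected logits are all assumptions made for illustration.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights come from a softmax over the selected logits only,
    which is one common convention among several.
    """
    # Indices of the k largest logits (ties keep their original order).
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(token, experts, router, k=2):
    """Run one token through only k of the experts and mix their outputs."""
    routed = top_k_route(router(token), k)
    out = [0.0] * len(token)
    for idx, weight in routed:
        # Only the selected experts are evaluated; the rest stay idle.
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    active = [idx for idx, _ in routed]
    return out, active


# Hypothetical toy setup: expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]
# Hypothetical router that returns fixed logits regardless of the token.
router = lambda token: [0.1, 2.0, -1.0, 2.0]

output, active_experts = moe_layer([1.0, 2.0], experts, router, k=2)
```

In this toy setup all four experts count toward the parameter total, but only two of them run for the token. The list of active experts is the kind of per-token information an expert prefetcher would need to predict ahead of time.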
