Selective (smart) MoE experts offloading to CPU?

Abstract.

The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boos…

1. Introduction

Deploying Large Language Models (LLMs) directly on edge devices is becoming essential. This allows LLMs to interact instantly with user surroundings, enabling accurate perception and contextual understanding. Edge deployment offers enhanced data privacy and security, and improved system reliability by reducing reliance on network connectivity. These advantages are crucial for LLM applications in user-centric environments like smart homes, intelligent healthcare, autonomous transportation, and pervasive video analytics, where real-time processing, immediate environmental perception, and data privacy are paramount.

Mixture-of-Experts (MoE) LLMs, with online expert offloading, provide a solution for deploying LLMs on resource-constrained edge devices by reducing computational parameters. A primary challenge for edge hardware is its limited integrated GPU capacity, often preventing the entire model’s parameters from fitting into GPU memory. MoE architecture addresses this by segmenting traditional feed-forward layers into specialized experts. Unlike dense networks that process all parameters uniformly, only a sparse subset of these experts activates for any given input. This approach effectively eases the strain on limited GPU resources. Specifically, active experts residing in GPU memory are processed directly on the GPU, while active experts located in CPU memory are either transferred to the GPU for computation or processed directly on the CPU. Because computation is restricted to only the active experts, the inference overhead is significantly reduced compared to dense networks.

The bottleneck in online expert offloading comes from loading experts into GPU memory via PCIe or computing them on the CPU. Both processes are significantly slower than GPU-based expert computation. Specifically, PCIe loading can be two orders of magnitude slower than GPU computation. CPU computation, depending on core count, is typically one to two orders of magnitude slower and scales linearly with token count. This disparity means that after GPU-based expert computations finish, the system often faces substantial latency waiting for PCIe transfers or CPU processing, significantly hindering overall inference speed.

Current methods for managing active experts often overlook the significantly higher inference latency imposed by experts residing in CPU memory compared to those in GPU memory. Existing methods try to hide data transfer costs by prefetching experts and optimizing GPU memory. However, prefetching can’t fully eliminate all data transfer overhead. These methods fail to account for a critical disparity: selecting an expert residing in CPU memory imposes a significantly greater latency penalty than one already in GPU memory. This crucial difference is not adequately considered during MoE’s expert selection process, leading to suboptimal inference performance.

Figure 1. Traditional MoE Layer vs. Our Importance-Driven Expert Layer (via substituting low-score experts and prefetching top-score experts).

Our approach: prioritizing and substituting active experts in CPU memory based on their scores reduces their impact on inference performance. In a fine-grained Mixture-of-Experts (MoE) framework, we observe a distinct pattern: only a handful of active experts achieve high scores (termed top-score active experts), while others receive scores comparable to inactive experts (low-score active experts). To optimize performance, we prefetch only the top-score active experts that reside in CPU memory. Simultaneously, we substitute low-score active experts (that would otherwise be computed on the CPU) with inactive but GPU-resident experts of similar scores. This strategy ensures that computationally critical top-score active experts from CPU memory are loaded onto the GPU in advance, while largely avoiding the computational burden of low-score active experts on the CPU.

Challenges. Our work addresses three main challenges: (1) identifying the crucial, irreplaceable impact of top-score active experts on LLM accuracy, while substituting low-score active experts with similarly scored alternatives has negligible negative effects on accuracy; (2) designing highly accurate prefetching methods to load active experts before computation and implementing effective expert management to identify inactive experts suitable as substitutes; (3) real-world system implementation, which involves building a reliable synchronization system that leverages the CPU as an auxiliary computation unit for experts, thereby reducing transfers from CPU memory to GPU memory.

Contributions. Our contributions are summarized as follows.

We introduce the first approach to fine-grained expert optimization in MoE models, employing different strategies based on the scores of active experts.

We design a CPU-GPU-load pipeline system specifically for MoE LLMs, capable of handling online workloads without requiring any offline preparation.

We extensively evaluate our approach on practical workloads, demonstrating its effectiveness and efficiency.

2. Background

Figure 2. Inference process of LLM with MoE architecture.

Deploying Large Language Models (LLMs) directly on edge devices is crucial for instant, private, secure, and reliable interaction within user environments like smart homes and autonomous vehicles, demanding real-time processing and immediate perception. To address the limited GPU memory on edge hardware, Mixture-of-Experts (MoE) LLMs with online expert offloading offer a solution by segmenting traditional layers into specialized experts. This MoE architecture processes only a sparse subset of parameters per input, significantly easing the strain on constrained GPU resources compared to dense networks.

Our goal is to reduce the TPOT of LLMs with MoE that feature fine-grained expert segmentation on GPU-memory constrained edge devices. We discuss our background from three perspectives: (1) the performance metric (TPOT), (2) the model architecture (MoE with fine-grained expert segmentation), and (3) the online expert offloading architecture.

2.1. LLM Inference Metric TPOT

Time per output token (TPOT) is a key metric that measures the average time interval between consecutive token generations during LLM inference (nvi, 2024). LLM inference is autoregressive: each token is generated based on the previous ones. To understand TPOT, we need to examine the two main stages of LLM inference, prefill and decoding, as Figure 2 shows. The prefill stage outputs an initial token; the decoding stage then sequentially produces tokens until a maximum limit or an end-of-sequence (<EOS>) token is reached. In the decoding stage, where TPOT is measured, the model generates tokens one at a time, with each new token attending only to the previous tokens and the prompt. TPOT directly impacts user experience in real-world LLM applications, as high TPOT causes noticeable delays between token generations that make interactions feel unnatural. Optimizing TPOT is therefore crucial for developing responsive LLM systems.
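As a minimal illustration of the metric, the sketch below computes TPOT from a list of token emission times. The function name `time_per_output_token` and the timestamps are hypothetical and chosen only for this example; they are not measurements from our system.

```python
from typing import Sequence


def time_per_output_token(token_times: Sequence[float]) -> float:
    """Average gap (in seconds) between consecutive generated tokens.

    token_times[0] is the time at which the prefill stage emitted the
    first token; the remaining entries are decoding-stage emission times.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure TPOT")
    decode_span = token_times[-1] - token_times[0]
    return decode_span / (len(token_times) - 1)


# Hypothetical timestamps: first token at 0.50 s, then four decoding steps.
timestamps = [0.50, 0.70, 0.90, 1.10, 1.30]
tpot = time_per_output_token(timestamps)
print(f"TPOT = {tpot * 1000:.0f} ms per token")
```

The prefill latency (time to the first token) is deliberately excluded, since TPOT only covers the decoding stage.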

2.2. LLM with Fine-grained Expert MoE

The Transformer architecture in LLMs consists of multiple layers, each with a self-attention block and a feed-forward network (FFN). Self-attention generates embedding vectors by capturing relationships among input tokens, with different heads extracting distinct features. These head outputs are then aggregated and fed into the FFN. The FFN refines the input sequence representation through non-linear transformations via fully connected layers and activation functions. Its output then proceeds to subsequent layers or forms the LLM’s final output.

MoE transformer architecture replaces the traditional dense transformer’s FFN with multiple specialized experts, each essentially a smaller FFN, as Figure 2 shows. During inference, a router network analyzes the input tensor as it reaches an MoE layer. It then selectively activates only the most relevant experts for processing, leaving others inactive. This selective activation mechanism significantly enhances parameter efficiency compared to dense neural networks, as only a subset of experts contributes to the computation for each input.
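The routing step can be sketched as follows. This is a simplified scalar toy, not the implementation of any particular model: the experts are plain Python callables, the names `softmax`, `route_top_k`, and `moe_layer` are hypothetical, and whether the selected scores are renormalized after top-k selection varies between models (here they are not).

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the router's gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_id, score) for the k highest-scoring experts."""
    scores = softmax(gate_logits)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


def moe_layer(x: float,
              experts: Sequence[Callable[[float], float]],
              gate_logits: Sequence[float],
              k: int) -> float:
    """Weighted sum of the outputs of the activated experts only."""
    return sum(score * experts[i](x) for i, score in route_top_k(gate_logits, k))


# Toy experts: each "expert" just scales its input by a different factor.
experts = [lambda x: 1.0 * x, lambda x: 2.0 * x, lambda x: 3.0 * x, lambda x: 4.0 * x]
gate_logits = [0.0, 2.0, 1.0, -1.0]
active = route_top_k(gate_logits, k=2)
output = moe_layer(1.0, experts, gate_logits, k=2)
```

Only the two selected experts are evaluated; the other two never run, which is where the computational saving over a dense FFN comes from.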

Compared to the traditional MoE architecture, fine-grained expert segmentation divides each FFN layer into a larger number of experts, while also increasing the number of experts that are activated for each token during inference. This means that while traditional MoE models might activate only 1-2 experts per token, this approach activates more experts simultaneously to process each token, maintaining similar computational costs. This fine-grained design encourages each expert to learn more specialized and independent knowledge domains, while the increased number of activated experts enables more diverse combinations of this specialized knowledge. As a result, the model can leverage richer combinations of expertise when processing inputs, leading to more flexible and effective knowledge integration. DeepseekMoE (Dai et al., 2024) was the first to propose fine-grained expert segmentation, which has since been adopted in several large language models, including Qwen2-57B-A14B-Instruct (qwe, 2024) and XVERSE-MoE-A4.2B-Chat (xve, 2024), achieving strong model performance while significantly reducing training costs.

MoE with fine-grained expert segmentation often incorporates shared experts. These shared experts differ from regular ones by processing all tokens, irrespective of routing decisions. This design directly addresses a key limitation in traditional MoE, where common knowledge is redundantly stored across multiple experts. By dedicating specific shared experts to consolidate this common knowledge, the architecture enables other experts to focus exclusively on specialized domains. This clear separation of common and specialized knowledge processing ultimately yields superior parameter efficiency.

2.3. Expert Offloading in MoE LLMs

Figure 3. Online Expert Offloading in MoE LLMs at one layer. Step ①: Router selects the active experts. Step ②: CPU computes part of the active experts in CPU memory. Step ③: Part of active experts and CPU-computed expert results are transferred to GPU memory via PCIe. Step ④: GPU processes experts from its memory, consolidating those results with CPU-computed results.

The LLM offloading technique leverages CPU resources to run LLMs exceeding GPU memory (Song et al., 2024). GPU-centric offloading stores excess parameters in CPU memory, transferring them to the GPU as needed for processing, allowing inference for various model sizes. In contrast, hybrid offloading, like llama.cpp (lla, 2024), splits parameters between GPU and CPU at the layer level. The CPU processes its layers and sends intermediate results to the GPU, reducing latency by minimizing data transfer and mitigating slow PCIe bandwidth. This design is essential because edge devices often have limited single-GPU memory, preventing the entire LLM from fitting.

Expert Offloading is essential for deploying large Mixture-of-Experts (MoE) LLMs on edge devices, differing from LLM offloading by scheduling at the expert level rather than the layer level. This technique strategically places a subset of experts and frequently used common parameters, such as attention, token embedding, and router weights, in GPU memory, while all expert parameters are stored in CPU memory. This method significantly reduces the TPOT compared to general LLM approaches like llama.cpp, which do not optimize for MoE models. Additionally, other methods such as PowerInfer focus solely on LLMs with ReLU activation functions and lack specific offloading strategies for MoE models.

Similar to the general LLM offloading technique, two main strategies are used, as shown in Steps ② and ③ of Figure 3: either freeing up GPU memory to transfer the necessary expert parameters from CPU memory for GPU computation, or directly performing the computations on the CPU and then aggregating the results with those from the GPU. These two approaches can be pipelined: as Figure 3 shows, while multi-core CPUs compute one expert ($E_{2}$), another ($E_{0}$) can be simultaneously loaded into GPU memory via PCIe. Upon $E_{2}$’s completion, its results are then transferred to GPU memory, allowing for concurrent processing and data movement.

Our work focuses on online expert offloading approaches to address scenarios with unknown edge workloads. Expert offloading approaches are classified into online and offline MoE serving strategies, depending on whether experts are loaded into GPU memory dynamically according to the characteristics of the current requests or according to the entire workload. Online strategies handle dynamically changing edge requests, where frequent loading/computation of non-resident experts impacts latency; they adapt via flexible scheduling (e.g., MoE-Infinity (Xue et al., 2024), HybriMoE (Zhong et al., 2025)) based on the current request. Offline strategies target predetermined workloads, capturing expert activation patterns based on workload characteristics (e.g., MoE-lightning and ) alongside expert pruning to optimize GPU resource use.

3. Motivation

Figure 4. Only a few achieve high scores, significantly influencing the output, while others have low scores, similar to inactive experts.

Figure 5. Our idea: prefetching top-score experts and replacing low-score experts in each iteration at one layer.

Deploying Mixture-of-Experts (MoE) Large Language Models (LLMs) on edge devices necessitates online expert offloading due to constrained GPU memory. As workloads dynamically shift, so does the set of active experts; however, limited on-device GPU memory cannot always house all required experts. Consequently, active experts residing in GPU memory are processed directly, while those in CPU memory are either transferred to the GPU for computation or processed directly on the CPU.

Processing experts not already in GPU memory significantly impacts inference latency, regardless of the offloading method. This is primarily because PCIe loading can be two orders of magnitude slower than GPU computation, and CPU computation is typically one to two orders of magnitude slower, scaling linearly with token count. (Figure 6 illustrates the comparative latency across various settings.) Therefore, efficiently deploying GPU-memory constrained MoE LLMs hinges on minimizing the time spent loading experts or computing them on the CPU.

Several optimization strategies have emerged for online expert offloading to alleviate the bottleneck. One common approach focuses on increasing the hit rate of experts in GPU memory, often through prefetching active experts and employing caching strategies. Prefetching predicts future expert needs, enabling preemptive loading into GPU memory to pipeline parameter loading and computation, thereby reducing the TPOT impact. Caching strategies, such as LRU or those optimized for expert selection patterns, further reduce data transfer frequency. Another key strategy involves optimizing the system design for CPU computation, including more efficient CPU-load pipelines that estimate and minimize the maximum of CPU and loading times.

Efficiency leap: Current offloading strategies overlook the varying importance of activated experts, treating them uniformly. In MoE architectures, an expert’s importance to an input is reflected in its router’s gate score, with higher scores indicating greater significance. As Figure 4 illustrates, a distinct pattern of importance scores emerges among activated non-shared experts: only a few achieve high scores, significantly influencing the output, while others have low scores, similar to inactive experts. This differentiation arises because (1) shared experts handle common knowledge and (2) fine-grained segmentation creates highly specialized non-shared experts. Yet, existing online expert offloading methods, such as prefetching and CPU-load pipelining, do not consider the varying impact of activated experts on output results, despite each expert incurring a similar penalty on TPOT. This oversight results in time-consuming operations, like CPU computation and PCIe loading of experts, being used for experts that have minimal impact on the final outcome, thereby increasing TPOT.

Figure 6. Time cost of CPU and GPU computing an expert with a token, and of PCIe loading an expert, for three MoE LLMs.

Idea: Our proposed online strategy schedules experts by their importance to minimize their impact on inference latency. As Figure 5 illustrates, we first prefetch top-score experts based on predicted expert scores, leveraging pipelining to overlap their loading time with computation. Given that PCIe loading time (Figure 6) significantly exceeds computation time, this prefetching can only ensure timely resource access for critical top-score experts. After routing, when actual expert scores are determined and the top-$k$ experts (e.g., $k=3$) are to be selected, if a top-score expert (e.g., $E_{a}$) was prefetched, the remaining expert (e.g., $E_{c}$) might reside in CPU memory, necessitating PCIe transfer or CPU computation. However, unlike previous methods, our approach recognizes $E_{c}$’s low score and minimal impact on the final output. Therefore, we substitute $E_{c}$ with a similarly scored, GPU-resident inactive expert (e.g., $E_{d}$). This allows all selected experts to be computed directly on the GPU, effectively mitigating the TPOT impact of low-score CPU-resident experts at a negligible cost to accuracy, while ensuring top-score experts receive prompt computational resources.

4. System Overview

We present an inference acceleration framework specifically designed for the decoding phase of fine-grained expert MoE LLMs on edge with limited GPU memory, aiming to minimize TPOT.

Figure 7. Parameter initialization.

4.1. Architecture and workflow

The system operates through two phases: parameter initialization and online inference.

Parameter initialization. As Figure 7 shows, during the parameter initialization phase, all common parameters (that is, all parameters other than the non-shared experts) are stored in GPU memory. This approach is adopted because every token requires computation with these common parameters, and keeping them in GPU memory avoids frequent PCIe transfers. Additionally, some non-shared experts are also stored in GPU memory to maximize its utilization. Meanwhile, all non-shared experts are initially kept in CPU memory, allowing for direct computation by the CPU or transfer to GPU memory.

Figure 8. GPU computing vs CPU computing.

Online inference. Online inference is segmented into prefill and decoding phases. In the prefill phase, the system employs a traditional offloading-based LLM approach: experts that are needed for computation but are not initially in GPU memory are transferred to GPU memory via PCIe before processing. Using the CPU to compute these experts during the prefill phase is less efficient than loading them into GPU memory and processing them there, as CPU computation time scales linearly with the number of tokens computed, as shown in Figure 8. When the prompt length exceeds a certain threshold, each expert handles a significant number of tokens due to parallel processing, as illustrated in Figure LABEL:expertbatch, making CPU computation costly for prefill. To optimize the prefill phase, we implement a pipeline that computes the current layer’s experts while prefetching the next layer’s experts into GPU memory. In the decoding phase, each request processes only one token, leaving most experts inactive. Building on previous work that optimized based on expert activity, we further enhance performance by implementing the importance-driven expert scheduler, which categorizes active experts into top-score and low-score groups, focusing on expert importance for optimization.
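The benefit of the prefill pipeline can be illustrated with a simple latency model. The sketch below is an assumption-laden toy rather than our implementation: it assumes a one-layer lookahead (the load for layer i+1 starts when the computation of layer i starts), and the function names and per-layer times are hypothetical.

```python
from typing import Sequence


def sequential_prefill_time(load: Sequence[float], compute: Sequence[float]) -> float:
    """No overlap: every layer first loads its experts, then computes."""
    return sum(load) + sum(compute)


def pipelined_prefill_time(load: Sequence[float], compute: Sequence[float]) -> float:
    """Overlap the computation of layer i with the load for layer i+1.

    The next layer can start only once both the current computation and
    its own load have finished, so each step costs the larger of the two.
    """
    if len(load) != len(compute):
        raise ValueError("load and compute must have one entry per layer")
    if not load:
        return 0.0
    total = load[0]  # the first layer's experts cannot be overlapped
    for i in range(len(compute)):
        next_load = load[i + 1] if i + 1 < len(load) else 0.0
        total += max(compute[i], next_load)
    return total


# Hypothetical per-layer times in milliseconds for a three-layer model.
load_ms = [4, 4, 4]
compute_ms = [1, 1, 1]
print(sequential_prefill_time(load_ms, compute_ms))  # 15
print(pipelined_prefill_time(load_ms, compute_ms))   # 13
```

In this toy setting, loading dominates, so pipelining hides only the computation time; the larger the per-layer computation (for example, with long prompts), the more of the load is hidden.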

4.2. Pipeline Example between Layers

Figure 9. Importance-driven expert scheduler pipelines GPU, CPU, and load operations between two MoE layers.

Figure 9 illustrates how the importance-driven expert scheduler pipelines GPU, CPU, and load operations between two MoE transformer layers, $X$ and $Y$, to minimize pipeline bubbles. Three processes are pipelined, utilizing different resources: GPU, CPU, and PCIe.

CPU: CPU computation is divided into four parts, encompassing three processes of the Importance-Driven Expert Scheduler and the computation of expert parameters on the CPU. (1) The first part involves the expert-cache router calculation, which replaces some low-score active experts from layer $X$ with inactive experts from layer $X$ that reside in the GPU. (2) The second part, CPU-load balance, decides which experts should be computed directly on the CPU and which should be loaded into the GPU for the most efficient pipelining. (3) The third part uses the CPU to compute experts from layer $X$ that are not in GPU memory. (4) The fourth part manages the recent scores of experts to determine which expert to evict when loading new experts into the GPU.

GPU: GPU computation is divided into four parts, including three processes for computing expert parameters from layer X and one process for prefetching parameters from layer Y. (1) The first part computes the attention and gate parameters within the common parameters of layer X on the GPU, determining which experts from layer X are activated. The expert-cache router calculated on the CPU updates this selection of activated experts. (2) The second part involves direct computation of experts already in the GPU. (3) The third part predicts experts for prefetching, requiring the computation results of experts already in the GPU to be fed into the shared experts, attention, and gate of the next layer, predicting the top experts. (4) The fourth part continues computation of newly loaded experts once the PCIe loading is complete.

PCIe Load: This is divided into two parts: (1) one for prefetching experts from layer $Y$ and (2) another for loading experts from layer $X$.

5. Importance-Driven Expert Scheduler

5.1. Expert-cache Router

In the inference process of MoE LLM, the router component selects which experts should be used for computation at a given layer. The expert-cache router strategy embodies a trade-off between accuracy and inference performance, yielding notable enhancements in decoding efficiency while sacrificing a minor degree of accuracy relative to traditional top-k methods.

This approach is based on the observation that not all top-k experts contribute equally to the computation. As illustrated in Figure 4, a clear pattern of importance scores among activated non-shared experts emerges: a few achieve high scores and significantly influence the output, while the majority have low scores similar to those of inactive experts. In traditional top-k router methods, these low-score experts must be loaded from CPU memory to GPU memory via PCIe or computed directly by the CPU, despite their minimal impact on the output. In contrast, the expert-cache router replaces low-score active experts that are not resident in GPU memory with inactive experts that are already in GPU memory, thereby enhancing decoding efficiency.

Algorithm 1 illustrates our approach for the expert-cache router. We introduce a hyperparameter $\alpha$ to classify active experts into top-score and low-score. We denote the score of the expert ranked at $(k+1)$ in terms of descending scores as $\beta$. Those experts with scores above $(1+\alpha)\beta$ are top-score, while scores between $\beta$ and $(1+\alpha)\beta$ are low-score.
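The classification rule can be sketched in a few lines. The function name `classify_experts`, the example scores, and the choice of $\alpha = 0.5$ are hypothetical and serve only to illustrate the thresholds; a score exactly equal to $(1+\alpha)\beta$ is treated as low-score here, which is an assumption about the boundary case.

```python
from typing import List, Sequence, Tuple


def classify_experts(scores: Sequence[float], k: int,
                     alpha: float) -> Tuple[List[int], List[int], float]:
    """Split the top-k active experts of one token into top-score and low-score.

    beta is the (k+1)-th highest score, i.e. the best inactive expert.
    Active experts scoring above (1 + alpha) * beta are top-score; the
    remaining active experts are low-score.
    """
    if len(scores) <= k:
        raise ValueError("need more than k experts to define beta")
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    beta = scores[ranked[k]]
    threshold = (1 + alpha) * beta
    active = ranked[:k]
    top_score = [e for e in active if scores[e] > threshold]
    low_score = [e for e in active if scores[e] <= threshold]
    return top_score, low_score, beta


# Hypothetical gate scores for 7 experts, with k = 4 and alpha = 0.5.
scores = [0.40, 0.05, 0.20, 0.08, 0.07, 0.06, 0.14]
top_score, low_score, beta = classify_experts(scores, k=4, alpha=0.5)
print(top_score, low_score, beta)  # [0, 2, 6] [3] 0.07
```

Here expert 3 is active (score 0.08) but barely above the best inactive expert (0.07), so it is classified as low-score and becomes a candidate for substitution.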

For top-score experts, when our system processes multiple requests concurrently, multiple tokens are decoded, each possessing its own set of top-score experts at the current layer. Given their critical role in computation, these experts are retained in the results of the expert-cache router and are collectively referred to as the top-score expert set.

For low-score experts, if they are already in GPU memory or part of the top-score expert set, they are retained in the expert-cache router’s results without incurring additional CPU overhead. Otherwise, they can be replaced with an inactive expert whose score falls between $(1-\alpha)\beta$ and $\beta$ and which is present in the GPU or in the top-score expert set. These inactive experts present in the GPU or in the top-score expert set are termed alternative low-score experts and collectively form the alternative set $A$. If there are $m$ low-score experts not in the GPU and the size of $A$, denoted as $|A|$, is at least $m$, the system selects the $m$ highest-scoring alternatives from $A$ to be included in the expert-cache router’s results. If $|A|$ is less than $m$, all alternatives in $A$ are used, and only the remaining $m-|A|$ low-score experts are prepared for loading into GPU memory or for direct computation by the CPU.

Algorithm 1 Expert-Cache Router

Input: $\alpha$, $k$, $S$ (score inputs)
Output: $E$ (cacherouter_experts)

Initialize $E$; set thresholds $T$, $L$, $R$; sets $A$, $B$, $C$
for each token $t$ do
    Sort $S_{t}$ in descending order; $\beta \leftarrow$ $(k+1)$-th score of $S_{t}$
    $T \leftarrow (1+\alpha)\beta$; $L \leftarrow \beta$; $R \leftarrow (1-\alpha)\beta$
    for each expert $e$ do
        if score of $e > T$ then add $e$ to $E[t]$ and $C$
        end if
    end for
end for
for each token $t$ do
    Initialize $B_{t}$, $A_{t}$
    for each expert $e$ do
        if $L \leq$ score of $e < T$ then add $e$ to $B_{t}$
        else if $R \leq$ score of $e < L$ and $e$ in GPU or $C$ then add $e$ to $A_{t}$
        end if
    end for
    if $|A_{t}| \geq |B_{t}|$ then
        Select top $|B_{t}|$ experts from $A_{t}$; add to $E[t]$
    else
        Add $A_{t}$ to $E[t]$
        Add top $|B_{t}|-|A_{t}|$ experts from $B_{t}$ to $E[t]$
    end if
end for
return $E$
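A runnable sketch of the substitution logic is given below. It follows the prose description in this section rather than transcribing Algorithm 1 line by line: low-score experts already in GPU memory or in the top-score expert set are kept, and only the missing ones are substituted. The function name `expert_cache_router`, the argument `gpu_resident`, and the example values are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def expert_cache_router(scores_per_token: Sequence[Sequence[float]], k: int,
                        alpha: float, gpu_resident: Set[int]) -> List[List[int]]:
    """Return, for each token, the expert ids to compute after substitution."""
    per_token: List[Tuple[Sequence[float], float, List[int], List[int], List[int]]] = []
    top_score_set: Set[int] = set()  # the set C shared across all tokens

    # Pass 1: find every token's top-score experts and build C.
    for scores in scores_per_token:
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        beta = scores[ranked[k]]  # (k+1)-th highest score
        threshold = (1 + alpha) * beta
        top = [e for e in ranked[:k] if scores[e] > threshold]
        low = [e for e in ranked[:k] if scores[e] <= threshold]
        top_score_set.update(top)
        per_token.append((scores, beta, top, low, ranked[k:]))

    # Pass 2: substitute low-score experts that are neither in GPU nor in C.
    available = gpu_resident | top_score_set
    results: List[List[int]] = []
    for scores, beta, top, low, inactive in per_token:
        kept = [e for e in low if e in available]
        missing = [e for e in low if e not in available]  # m experts
        floor = (1 - alpha) * beta
        # Alternative set A, already ordered by descending score.
        alternatives = [e for e in inactive if scores[e] >= floor and e in available]
        substitutes = alternatives[:len(missing)]
        # If |A| < m, the remaining m - |A| experts must still be loaded
        # or computed on the CPU; the highest-scoring ones are retained.
        still_missing = missing[:len(missing) - len(substitutes)]
        results.append(top + kept + substitutes + still_missing)
    return results


# Hypothetical scores for one token; experts 0, 4 and 5 are GPU-resident.
scores = [0.40, 0.05, 0.20, 0.08, 0.07, 0.06, 0.14]
print(expert_cache_router([scores], k=4, alpha=0.5, gpu_resident={0, 4, 5}))
# [[0, 2, 6, 4]]
```

In this example, the low-score expert 3 is replaced by the GPU-resident inactive expert 4, while the top-score experts 2 and 6 are kept even though they are not yet in GPU memory; those are the ones that prefetching is meant to cover.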

5.2. Online Prefetching Top-score Experts

Figure 10. Our prefetching method compared to traditional method and normal workflow to output true scores.

Online prefetching proactively preloads required experts before their computation begins within an online expert offloading system, effectively overlapping the expert loading latency with ongoing computation time. Our approach has two key features that set it apart from other prefetching methods.

The first feature is that prediction requires no offline training. In contrast to traditional methods that often rely on offline training of additional parameters to predict expert scores in the next layer, our method, demonstrated in Figure 10, takes a different approach. We perform calculations on both the non-shared experts currently in GPU memory and the shared experts, producing hidden states. These hidden states are then processed using the next layer’s key-value cache to complete the attention computation. Subsequently, we carry out a gate computation to determine the scores for all experts in that layer. There are two main reasons why the gate computation results, derived from calculations involving the shared experts and experts in GPU memory, are more accurate than those obtained by using residuals as inputs for the next layer’s attention computation. The first reason is that shared experts process universal information across all inputs and remain constantly available in GPU memory. The second reason is that our cache eviction strategy in Section 5.3 ensures that high-scoring, and hence important, experts from the most recent decoding steps are retained in the cache. These factors collectively lead to predicted scores that closely align with the true scores.

Moreover, as shown in Figure 9, the prediction process is carried out entirely within the GPU. Due to the GPU’s superior speed, it tends to be more idle compared to the CPU and PCIe. As a result, the time taken by the GPU for prediction is often overshadowed by the time required for calculation by the CPU and loading by PCIe. Consequently, the impact of prediction on TPOT is rather minimal.

The second feature is to prioritize loading top-score experts. If we load all predicted experts without prioritization, the loading overhead cannot be hidden by pipelining, as Figure 6 shows. Even worse, incorrect predictions lead to additional loading. In light of these factors, we identify and rank the top-$k$ experts based on their predicted scores. These experts are then placed into the load queue. Subsequently, we initiate the loading of experts for the next layer, starting from the highest-scoring experts and progressively moving down the ranking. When the inference process moves to the next layer and the gate computation is done, we obtain the actual scores of the experts. At this point, we no longer need to load experts based on predicted results, so we clear the load queue, which then awaits the loading of the current layer’s experts. This procedure is dictated by the CPU-load balance management in Section 5.4.
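The queue handling described above can be sketched as follows. The helper names and the `load_budget` parameter (the number of experts that PCIe manages to transfer before the next layer's gate computation finishes) are hypothetical simplifications; in a real system the budget is determined by timing, not by a fixed count.

```python
from collections import deque
from typing import Deque, List, Sequence, Set


def build_prefetch_queue(predicted_scores: Sequence[float], k: int,
                         gpu_resident: Set[int]) -> Deque[int]:
    """Queue the predicted top-k experts that are not yet in GPU memory,
    ordered from the highest to the lowest predicted score."""
    ranked = sorted(range(len(predicted_scores)),
                    key=lambda e: predicted_scores[e], reverse=True)
    return deque(e for e in ranked[:k] if e not in gpu_resident)


def prefetch_until_gate(queue: Deque[int], load_budget: int) -> List[int]:
    """Load experts in rank order until the budget is exhausted, then clear
    the queue because the actual scores of the next layer are now known."""
    loaded: List[int] = []
    while queue and len(loaded) < load_budget:
        loaded.append(queue.popleft())
    queue.clear()
    return loaded


# Hypothetical predicted scores for the next layer; expert 3 is GPU-resident.
predicted = [0.10, 0.50, 0.05, 0.30, 0.20]
queue = build_prefetch_queue(predicted, k=3, gpu_resident={3})
print(list(queue))                                # [1, 4]
loaded = prefetch_until_gate(queue, load_budget=1)
print(loaded, len(queue))                         # [1] 0
```

Because the queue is ordered by predicted score, a limited transfer budget is always spent on the experts that are expected to matter most.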

5.3. Cache Eviction

Figure 11. The reuse probability of experts based on score (in descending order) in three MoE models.

The cache eviction policy governs expert removal from GPU memory when loading new experts. It employs a score-based strategy, evicting the expert exhibiting the lowest average activation score accumulated over the preceding n iterations. This prioritizes retaining experts demonstrating higher historical impact, measured by their contribution to model outputs, rather than relying solely on recency of access.

To prevent thrashing (evicting an expert immediately before its use), the system temporarily exempts from eviction any expert selected for computation by the expert router during a given layer’s processing phase. This temporary protection shield ensures the expert remains resident in GPU memory throughout its required computation window. The shield is automatically revoked upon completion of the layer’s computation, returning the expert to standard eviction eligibility based on its score history.

Critically, this score-aware eviction policy considers the activation scores of all experts accessed within the observation window (n iterations), including those not selected as top-k experts (inactive experts). This contrasts fundamentally with traditional Least Recently Used (LRU) policies, which focus only on access recency. Our approach acknowledges that experts with higher scores – even if inactive in a specific iteration – possess a greater inherent likelihood of being reused (as top-k, or alternative experts) in subsequent computations compared to low-scoring experts, as Figure 11 shows. By incorporating inactive experts’ scores, the policy better anticipates potential future utility beyond the immediate top-k selections.
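The eviction policy of this section can be sketched with a small cache class. The class `ScoreAwareCache` and its method names are hypothetical; the sketch keeps a sliding window of the last n scores per expert (active or not), evicts the resident expert with the lowest average, and skips experts under the temporary protection shield.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Optional, Set


class ScoreAwareCache:
    def __init__(self, capacity: int, window: int) -> None:
        self.capacity = capacity
        self.resident: Set[int] = set()
        self.protected: Set[int] = set()
        # Last `window` scores of every expert, including inactive ones.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record_scores(self, scores: Dict[int, float]) -> None:
        """Record one iteration's gate scores for all experts."""
        for expert, score in scores.items():
            self.history[expert].append(score)

    def average_score(self, expert: int) -> float:
        past = self.history[expert]
        return sum(past) / len(past) if past else 0.0

    def protect(self, experts: Iterable[int]) -> None:
        """Shield the experts selected for the current layer."""
        self.protected = set(experts)

    def release(self) -> None:
        """Revoke the shield once the layer's computation completes."""
        self.protected.clear()

    def load(self, expert: int) -> Optional[int]:
        """Make `expert` resident; return the evicted expert id, if any."""
        if expert in self.resident:
            return None
        evicted = None
        if len(self.resident) >= self.capacity:
            candidates = sorted(self.resident - self.protected)
            if not candidates:
                raise RuntimeError("all resident experts are protected")
            evicted = min(candidates, key=self.average_score)
            self.resident.remove(evicted)
        self.resident.add(expert)
        return evicted


# Hypothetical scores over two iterations for experts 0, 1 and 2.
cache = ScoreAwareCache(capacity=2, window=3)
cache.resident.update({0, 1})
cache.record_scores({0: 0.05, 1: 0.30, 2: 0.20})
cache.record_scores({0: 0.07, 1: 0.10, 2: 0.40})
evicted = cache.load(2)
print(evicted)  # 0, the resident expert with the lowest average score
```

An LRU policy would instead pick its victim by access recency alone, even if that expert had scored consistently higher than the others.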

5.4. CPU-Load Balancer

The CPU-Load balancer is designed to dynamically decide whether activated experts, not pre-cached in GPU memory, should be transferred to the GPU for calculation or directly computed on the CPU.

This mechanism is necessary because the two preceding strategies, the expert-cache router and the online prefetching of top-score experts, cannot always bring every required expert into GPU memory. Some active experts may remain in CPU memory for two reasons. First, prefetching sometimes cannot transfer all top-score experts into GPU memory in advance, because too many top-score experts are missing from GPU memory, because PCIe bandwidth is insufficient to load them all before computation begins, or because prediction errors cause the wrong experts to be prefetched. Second, the expert-cache router cannot replace all low-score experts with GPU-resident ones when there is a significant score difference between the low-score experts and all inactive experts.

The CPU-Load balancer handles these remaining active experts in CPU memory by choosing between two options for each one: compute it directly on the CPU, or transfer it to the GPU over PCIe and compute it there. To minimize idle time, we balance the time taken to load experts, denoted $T_{load}$, against the CPU computation time, $T_{CPU}$, according to the objective $\min \max(T_{load}, T_{CPU})$. Algorithm 2 uses a two-pointer strategy that dynamically balances the two cumulative costs: it prioritizes experts with larger batches for GPU loading, which improves GPU utilization, and offloads experts with smaller batches to the CPU where appropriate. The rationale is that once an expert is loaded into GPU memory, the GPU's many parallel execution units compute it in far less time than either the CPU computation or the load itself would take. Conversely, CPU computation time grows linearly with the batch size $B$ because of the CPU's largely sequential processing.
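One way to read this objective, consistent with how Algorithm 2 accumulates its two costs, is as a partition problem. The symbols $E$ (the set of active experts still in CPU memory) and $S$ (the subset chosen for GPU loading) are notation introduced here for illustration. The sketch also assumes that loads are serialized over PCIe and that CPU experts are computed one after another:

```latex
\min_{S \subseteq E} \; \max\bigl( T_{load}(S),\; T_{CPU}(E \setminus S) \bigr),
\qquad
T_{load}(S) = |S| \cdot t_{\text{load}},
\qquad
T_{CPU}(E \setminus S) = \sum_{i \in E \setminus S} B_i \cdot t_{\text{cpu}}
```

Under these assumptions the loading cost depends only on how many experts are loaded, while the CPU cost depends on their batch sizes, which is why sending the largest batches to the GPU and the smallest to the CPU is a natural greedy choice.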

Furthermore, the output of the expert-cache router can be refined with the CPU-Load balancer in mind. For each token, we prioritize replacing low-score experts with alternative experts that have larger batch sizes. For instance, consider token 1 with selected experts $\{1,2,4\}$, token 2 with selected experts $\{1,5,4\}$, and token 3 with selected experts $\{2,6,7\}$. For token 3, if expert 6 is a low-score expert and expert 4 is one of its alternative inactive experts, the selection can be replaced with $\{2,4,7\}$. Even if neither expert 4 nor expert 6 is in GPU memory, choosing $\{2,4,7\}$ over $\{2,6,7\}$ for token 3 leads to faster computation: expert 4 is already required by tokens 1 and 2, so a single load into GPU memory serves all three tokens, and expert 6 no longer needs to be loaded or computed at all. On the GPU, the larger batch for expert 4 has little impact on computation time.
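The three-token example can be reproduced with a few lines of Python. The helper names `expert_batch_sizes` and `substitute` are hypothetical, and the rule "pick the alternative that currently serves the most tokens" is a simplified reading of the batch-size preference described above; score thresholds are ignored here.

```python
from collections import Counter


def expert_batch_sizes(selections):
    """Count how many tokens are routed to each expert."""
    return Counter(e for experts in selections.values() for e in experts)


def substitute(selections, token, low_score_expert, alternatives):
    """Replace `low_score_expert` for `token` with the alternative expert
    that currently serves the most tokens (largest batch)."""
    sizes = expert_batch_sizes(selections)
    current = selections[token]
    usable = [e for e in alternatives if e not in current]
    if not usable:
        return selections
    best = max(usable, key=lambda e: sizes[e])
    updated = dict(selections)
    updated[token] = [best if e == low_score_expert else e for e in current]
    return updated


# Token id -> selected experts, as in the example in the text.
selections = {1: [1, 2, 4], 2: [1, 5, 4], 3: [2, 6, 7]}
after = substitute(selections, token=3, low_score_expert=6, alternatives=[4])

print(after[3])                                   # [2, 4, 7]
print(len(expert_batch_sizes(selections)), "->",
      len(expert_batch_sizes(after)))             # 6 -> 5
```

After the substitution, expert 4 serves three tokens and the number of distinct experts that must be loaded or computed drops from six to five.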

Balancing these two times with Algorithm 2 allows expert loading to overlap with CPU computation, which further improves pipeline efficiency.

Algorithm 2 CPU-Load Balancing Management

Input:
    $U_{\text{batch}}$: user batch $\{(uid_i, B_i)\}$ with batch sizes
    $t_{\text{cpu}}$: CPU time per sample
    $t_{\text{load}}$: parameter loading time
Output:
    $L_{\text{load}}$: experts to load on GPU
    $L_{\text{cpu}}$: experts to compute on CPU

Initialize $C_{\text{cpu}} \leftarrow 0$, $C_{\text{load}} \leftarrow 0$
Sort $U_{\text{batch}}$ by $B_i$ in descending order
Initialize $l \leftarrow 0$, $r \leftarrow |U_{\text{batch}}| - 1$
while $l \leq r$ do
    if $C_{\text{load}} \leq C_{\text{cpu}}$ then
        $C_{\text{load}} \leftarrow C_{\text{load}} + t_{\text{load}}$
        $L_{\text{load}}.\text{append}(uid_l)$
        $l \leftarrow l + 1$
    else
        $C_{\text{cpu}} \leftarrow C_{\text{cpu}} + B_r \cdot t_{\text{cpu}}$
        $L_{\text{cpu}}.\text{append}(uid_r)$
        $r \leftarrow r - 1$
    end if
end while
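Algorithm 2 translates almost line for line into Python. The following is a minimal sketch rather than the authors' code: the function name `balance_cpu_load`, the tuple layout of the input, and the timing values in the usage example are assumptions for illustration.

```python
def balance_cpu_load(user_batch, t_cpu, t_load):
    """Split experts between GPU loading and CPU compute (two pointers).

    user_batch: list of (uid, batch_size) pairs for experts not on the GPU.
    t_cpu:      CPU compute time per sample.
    t_load:     time to load one expert's parameters.
    Returns (load_list, cpu_list, c_load, c_cpu).
    """
    # Largest batches first, so the left pointer feeds the GPU.
    ordered = sorted(user_batch, key=lambda item: item[1], reverse=True)
    c_cpu = c_load = 0.0
    load_list, cpu_list = [], []
    left, right = 0, len(ordered) - 1
    while left <= right:
        if c_load <= c_cpu:
            # Loading is not the bottleneck: load the largest remaining batch.
            uid, _ = ordered[left]
            c_load += t_load
            load_list.append(uid)
            left += 1
        else:
            # CPU is ahead: give it the smallest remaining batch.
            uid, batch = ordered[right]
            c_cpu += batch * t_cpu
            cpu_list.append(uid)
            right -= 1
    return load_list, cpu_list, c_load, c_cpu


# Hypothetical workload: four experts with batch sizes 8, 1, 4, 2.
experts = [("e1", 8), ("e2", 1), ("e3", 4), ("e4", 2)]
load_list, cpu_list, c_load, c_cpu = balance_cpu_load(
    experts, t_cpu=1.0, t_load=3.0
)
print(load_list, cpu_list)  # ['e1', 'e3'] ['e2', 'e4']
print(c_load, c_cpu)        # 6.0 3.0
```

In this example the two largest batches go to the GPU while the two smallest are computed on the CPU during the loads, so the CPU work (3.0 time units) is hidden behind the loading time (6.0 time units).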

6. Evaluation

6.1. Setup

Table 1. Model and Hardware Configurations


Setting	Model	GPU
S1	deepseek-moe-16b (dee, 2024)	RTX 3080 Ti (12GB)
S2	XVERSE-MoE-A4.2B (xve, 2024)	RTX 4060 Ti (16GB)
S3	Qwen2-57B-A14B (qwe, 2024)	A6000 (48GB)

Table 2. Workloads


Exam	Language	Knowledge	Understanding	Reasoning
Gaokao (Clark et al., 2018)	WiC (Pilehvar and Camacho-Collados, 2018)	BoolQ (Clark et al., 2019)	Race-mid (Lai et al., 2017)	gsm8k (Bisk et al., 2020)

Figure 12. TPOT of four baselines and our method in five workloads.

Table 3. Accuracy (%) across different datasets and models at various expert substitution thresholds $\alpha$.

Dataset	Model	0.0	0.05	0.1	0.15	0.2	0.25	0.3	0.35	0.4	0.45	0.5	0.55	0.6
GaoKao	Deepseek	27.2	27.6	28.2	29.5	29.3	28.9	28.1	27.7	27.3	26.8	26.7	26.2	25.9
GaoKao	Xverse	47.2	47.5	47.9	48.1	49.0	47.2	47.9	47.5	46.5	46.8	47.5	46.2	45.9
GaoKao	Qwen	73.5	74.2	74.8	75.8	76.0	76.1	74.2	74.1	73.2	72.1	71.6	70.7	71.7
WiC	Deepseek	50.7	50.9	51.3	51.6	50.9	51.7	50.6	50.7	51.9	51.8	51.7	50.1	50.4
WiC	Xverse	50.0	50.1	50.3	50.2	50.2	50.0	50.0	50.2	50.2	50.1	50.0	49.8	49.8
WiC	Qwen	60.5	60.7	60.5	60.6	60.9	60.7	60.6	60.7	61.9	60.8	60.3	60.4	60.1
triviaqa	Deepseek	59.3	59.1	59.3	58.5	58.5	57.7	59.6	59.0	58.9	58.7	58.6	57.4	57.6
triviaqa	Xverse	53.1	53.2	52.8	52.7	53.1	53.3	53.4	53.8	52.8	53.1	53.2	52.1	52.4
triviaqa	Qwen	69.3	69.2	68.7	68.5	68.5	67.7	69.6	69.0	68.9	68.7	68.6	67.6	67.9
race	Deepseek	70.0	69.8	69.2	69.7	69.9	70.4	70.0	69.1	70.2	70.2	70.4	68.9	68.6
race	Xverse	81.4	81.6	81.8	81.0	82.3	82.0	81.9	80.9	82.9	82.3	82.4	81.1	80.3
race	Qwen	80.0	79.8	79.9	79.7	79.9	80.4	80.0	79.1	80.2	80.1	80.4	79.8	79.6
gsm8k	Deepseek	51.4	51.6	51.2	51.4	49.5	51.9	51.0	50.2	49.2	48.8	47.8	47.9	48.8
gsm8k	Xverse	62.9	62.7	62.9	62.4	63.3	61.7	62.5	61.2	61.2	60.3	60.4	60.1	58.6
gsm8k	Qwen	85.7	85.5	85.7	85.2	85.8	85.6	85.1	85.3	84.5	84.1	83.7	83.2	83.3

Figure 13. GPU cache ratio of three baselines and our method in five workloads on average.

Implementation. We implement our system in Python with nogil (pyt, 2024) on top of PyTorch, and reuse the model-structure implementations from the Transformers library.

Models. To demonstrate that our method applies to a range of MoE models built on the DeepSeekMoE architecture, we evaluate three popular ones: deepseek-moe-16b-base (dee, 2024), Qwen2-57B-A14B-Instruct (qwe, 2024), and XVERSE-MoE-A4.2B-Chat (xve, 2024). Although not evaluated here, other models compatible with the DeepSeekMoE architecture are also supported.

Hardware. To demonstrate the effectiveness of our method across different GPU memory configurations, we test three hardware settings: a single NVIDIA RTX 3080 Ti GPU (12GB), a single NVIDIA RTX 4060 Ti GPU (16GB), and a single NVIDIA A6000 GPU (48GB), each with 6 CPU cores. Table 1 lists the three model and hardware settings we evaluate.

Workloads. To demonstrate that our method improves inference speed across different workloads without a significant drop in accuracy, we test the workloads shown in Table 2. They fall into five task categories: exam, language, knowledge, understanding, and reasoning. (1) Exam: tasks that simulate testing environments and require comprehensive understanding and application of knowledge, such as the preparation and grading of academic examinations. We use the Math_I, Math_II, History, and Biology datasets from the Gaokao benchmark (Clark et al., 2018). (2) Language: tasks such as text generation and translation, emphasizing fluency and coherence. We use the WiC (Pilehvar and Camacho-Collados, 2018) dataset. (3) Knowledge: tasks that require factual recall and information retrieval, such as question answering. We use the BoolQ (Clark et al., 2019) dataset. (4) Understanding: tasks that assess comprehension, such as summarization and sentiment analysis. We use the Race-mid (Lai et al., 2017) dataset. (5) Reasoning: tasks that demand logical inference and problem-solving, such as mathematical reasoning and logical puzzles. We use the gsm8k (Bisk et al., 2020) dataset. For each category we select one dataset, each containing prompts from various thematic scenarios, and we extract 1000 prompts from each dataset for testing.

Metrics. For system performance, we follow existing works on LLM serving and analyze the performance metrics during the decoding and prefilling stages. For decoding performance, we evaluate the Time-Per-Output-Token (TPOT) as the key metric for the decoding stage. Additionally, we use the hit rate of in-memory experts to reflect the GPU memory utilization efficiency, which is determined by the expert scheduling strategies of different methods. For prefilling performance, we assess the Time-To-First-Token (TTFT) for the prefilling stage. For accuracy, since the cache router strategy might affect the inference results of LLMs, we utilize the open-source tool OpenCompass (ope, 2024) to test the system’s accuracy on relevant workloads both before and after adopting the cache router. This is to demonstrate that the impact of our strategy on accuracy is minimal.
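For concreteness, the three system metrics can be computed from per-token timestamps and expert traces as sketched below. The definitions follow common usage in LLM serving rather than a formula stated in this section, and the function names and inputs are hypothetical.

```python
def ttft(request_start, token_times):
    """Time-To-First-Token: delay until the first output token (prefill)."""
    return token_times[0] - request_start


def tpot(token_times):
    """Time-Per-Output-Token: mean gap between consecutive decoded tokens."""
    if len(token_times) < 2:
        raise ValueError("need at least two output tokens")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def hit_rate(activated, resident):
    """Fraction of activated experts already resident in GPU memory."""
    if not activated:
        return 0.0
    return sum(1 for e in activated if e in resident) / len(activated)


# Hypothetical trace: timestamps in seconds since the request arrived.
token_times = [0.5, 0.75, 1.0, 1.25]
print(ttft(0.0, token_times))            # 0.5
print(tpot(token_times))                 # 0.25
print(hit_rate([1, 2, 4, 7], {1, 4}))    # 0.5
```

Excluding the first token from TPOT keeps prefill cost out of the decoding metric, which matches the separation between TTFT and TPOT used in the evaluation.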

Baselines. We compare our work against baseline systems that support running MoE LLMs on a local platform without enough GPU memory: (1) MoE-Infinity (Xue et al., 2024) is an efficient MoE inference system tailored for personal machines with limited GPU memory. It capitalizes on the high activation sparsity of MoE models during the decode phase, where a small number of experts are frequently reused. By employing a sparsity-aware expert cache to trace and analyze the sparse activation patterns, MoE-Infinity optimizes expert replacement and prefetching. (2) Llama.cpp (lla, 2024) is a C++ implementation that enables efficient LLM inference on CPUs and is optimized for devices with little GPU memory. (3) DeepSpeed (Aminabadi et al., 2022) is an optimization framework for large-model training and inference that loads Transformer layers onto the GPU layer by layer for computation.
