By Adrien Payong and Shaoni Mukherjee
Models with hundreds of billions or even trillions of parameters achieve state-of-the-art performance in natural language processing, vision, and other areas. Training or even running these large language models is computationally expensive on current hardware. To train and serve models at this scale, researchers distribute a model’s computation over many processors or graphics processing units (GPUs). Approaches such as data parallelism, tensor parallelism, and pipeline parallelism have pushed these boundaries.
However, parallelism alone isn’t enough to address inefficiencies of computing every part of a network for every token. To further address this problem, researchers proposed the mixture‑of‑experts (MoE) architecture, which splits a network into a set of smaller, specialized subnetworks, known as experts.
Expert parallelism goes one step further, distributing these experts across multiple GPUs or nodes. In this article, we will cover what expert parallelism is, how it works, how it differs from other parallelism techniques, and why it is an essential component of modern AI systems.
At a high level, expert parallelism is the process of splitting a model’s experts across devices and routing each input to the appropriate device(s). Within a mixture‑of‑experts layer, each expert is a feed‑forward network (typically a multilayer perceptron) that operates on a set of tokens independently. Vanilla MoE uses a learned routing network to select the top‑k most relevant experts for each input.
The selected experts process the input in parallel, and their outputs are combined using gating values that specify the contribution of each expert to the final output.
Sparse activation refers to the fact that at most k experts are active for each token; the remaining experts do no work, which saves computation. Expert parallelism exploits this sparsity by assigning different experts to different devices, so each device is only responsible for the tokens routed to its experts.
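To make the routing step concrete, here is a minimal PyTorch sketch of top-k gating (the class name SimpleTopKGate and its arguments are illustrative, not part of any particular library). A full MoE layer would then weight each selected expert's output by its gate value and sum the results, as described above.

```python
import torch
import torch.nn.functional as F

class SimpleTopKGate(torch.nn.Module):
    """Minimal top-k gating sketch: scores all experts per token and keeps the top k."""
    def __init__(self, model_dim, num_experts, k=2):
        super().__init__()
        self.router = torch.nn.Linear(model_dim, num_experts)
        self.k = k

    def forward(self, tokens):                       # tokens: (num_tokens, model_dim)
        logits = self.router(tokens)                 # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        gate_vals, expert_ids = probs.topk(self.k, dim=-1)           # top-k experts per token
        gate_vals = gate_vals / gate_vals.sum(dim=-1, keepdim=True)  # renormalize gate weights
        return gate_vals, expert_ids                 # combination weights and expert indices

gate = SimpleTopKGate(model_dim=1024, num_experts=8, k=2)
weights, ids = gate(torch.randn(4, 1024))            # 4 tokens, each routed to 2 experts
```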
Expert parallelism reduces memory and computational overhead by distributing the experts. Since each GPU only stores the parameters of its experts (rather than the entire set), it enables the system to scale to MoE models with a large number of experts. As each token only “touches” a small subset of the model’s parameters, very large networks can be trained within a restricted hardware budget.
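As a rough back-of-the-envelope illustration (the sizes below are hypothetical, not taken from any specific model), distributing experts sharply reduces the parameter count each GPU has to hold:

```python
# Hypothetical sizes: each expert is an MLP with weights (d_model x d_ff) and (d_ff x d_model).
d_model, d_ff = 4096, 16384
num_experts   = 64
ep_size       = 8                                        # experts sharded across 8 GPUs

params_per_expert   = 2 * d_model * d_ff                 # ~134M parameters per expert
total_expert_params = num_experts * params_per_expert    # ~8.6B parameters in the MoE layer
per_gpu_params      = total_expert_params // ep_size     # ~1.07B parameters per GPU

print(f"{params_per_expert/1e6:.0f}M per expert, "
      f"{total_expert_params/1e9:.2f}B total, "
      f"{per_gpu_params/1e9:.2f}B per GPU with expert parallelism")
```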
It may be helpful to walk through how expert parallelism works in practice, including how tokens are routed, how experts are selected, and how their computations are coordinated across GPUs. When an input token arrives at an MoE layer using expert parallelism, the following occurs:

1. The gating network scores all experts for the token and selects the top-k.
2. The token is dispatched (typically via all-to-all communication) to the GPU(s) hosting the selected experts.
3. Each selected expert processes its assigned tokens in parallel with the other experts.
4. The expert outputs are gathered back to the token’s originating GPU and combined using the gating weights.
Expert parallelism introduces communication overhead, because tokens must be dispatched to and gathered from the correct GPUs. All-to-all communication patterns are typically used: each GPU may need to send data to and receive data from every other GPU, and this cost grows as the number of experts or GPUs increases.
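The sketch below illustrates the dispatch step with PyTorch's torch.distributed.all_to_all_single. It is a simplified sketch under the assumption that a distributed process group (e.g., NCCL, with tensors on the GPU) is already initialized and that tokens have been pre-sorted by destination rank; production implementations such as DeepSpeed or Megatron-LM fuse and optimize these steps.

```python
import torch
import torch.distributed as dist

def dispatch_tokens(sorted_tokens, send_counts):
    """Illustrative all-to-all dispatch.

    sorted_tokens: (num_local_tokens, hidden) tensor, grouped by destination rank
    send_counts:   list[int], how many of those tokens go to each rank
    """
    world_size = dist.get_world_size()

    # 1) Exchange counts so every rank knows how many tokens it will receive.
    recv_counts = torch.empty(world_size, dtype=torch.long)
    dist.all_to_all_single(recv_counts, torch.tensor(send_counts, dtype=torch.long))

    # 2) Exchange the token activations themselves.
    received = sorted_tokens.new_empty(int(recv_counts.sum()), sorted_tokens.size(1))
    dist.all_to_all_single(received, sorted_tokens,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts)
    return received   # tokens that this rank's local experts should process
```

After the experts run, a second all-to-all with the counts reversed returns the outputs to the tokens' originating devices.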
Expert parallelism is supported in several deep‑learning frameworks. In DeepSpeed, the MoE API takes an ep_size parameter, which is the number of processes in an expert-parallel group; the total number of experts is split across these processes. For instance, num_experts=8 and ep_size=2 means the eight experts are divided across each group of two GPUs (four experts per GPU), and tokens are only exchanged within that group.
TensorRT‑LLM and Megatron‑LM support Hybrid parallelism, where users specify --moe_ep_size (expert parallel size) and --moe_tp_size (tensor parallel size) when converting a checkpoint. With tensor and expert parallelism, expert weights are split evenly across devices, while entire experts are spread across GPU groups. Hybrid parallelism finds a tradeoff between memory, computation, and communication costs. An example MoE pipeline might use data parallelism to scale across nodes, tensor parallelism to split a large matrix within each expert, and expert parallelism to distribute experts across GPUs.
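As a mental model of such a hybrid layout (this is an illustrative sketch, not TensorRT-LLM or Megatron-LM code, and the rank ordering with tensor parallelism innermost is an assumption), each GPU ends up holding a tensor-parallel slice of a subset of whole experts:

```python
# Hypothetical illustration of a tp x ep grid assigning expert shards to GPUs.
def expert_layout(world_size, moe_ep_size, moe_tp_size, num_experts):
    assert world_size == moe_ep_size * moe_tp_size
    experts_per_ep_rank = num_experts // moe_ep_size
    layout = {}
    for rank in range(world_size):
        tp_rank = rank % moe_tp_size            # which slice of each expert's weight matrices
        ep_rank = rank // moe_tp_size           # which subset of whole experts
        experts = list(range(ep_rank * experts_per_ep_rank,
                             (ep_rank + 1) * experts_per_ep_rank))
        layout[rank] = {"experts": experts, "weight_shard": f"{tp_rank + 1}/{moe_tp_size}"}
    return layout

# 8 GPUs, 8 experts, expert parallel size 4, tensor parallel size 2:
# each GPU holds half of the weights of two whole experts.
print(expert_layout(world_size=8, moe_ep_size=4, moe_tp_size=2, num_experts=8))
```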
Parallel training strategies have different notions of workload splitting across devices. The table below contrasts the most common strategies, showing how expert parallelism relates to data, tensor, and pipeline parallelism.
Parallelism type | Description | Use case
---|---|---
Data parallelism | Replicates the entire model on each GPU and splits the dataset into shards. Each GPU processes its shard, computes gradients, and synchronizes updates via all-reduce. | Simple training tasks when the model fits in GPU memory; widely used in standard training pipelines.
Tensor parallelism | Splits the weights of each layer across GPUs (e.g., splitting weight matrices column-wise). Each GPU performs partial matrix multiplications; outputs are collected via all-gather. | When the model’s layers are too large to fit on a single GPU; used in large Transformer models.
Pipeline parallelism | Splits the model into stages, each residing on a different GPU. During the forward pass, activations are passed from one stage to the next; the backward pass propagates gradients in reverse. | Deep models with sequential structure; reduces memory by spreading layers across devices.
Expert parallelism | Places whole experts of an MoE layer on different GPUs. A gating network selects the top-k experts for each token, and only the GPUs hosting those experts process the token. | Sparse models such as Mixture-of-Experts; allows scaling to trillions of parameters with fewer compute resources.
The diagrams below help illustrate expert parallelism and how it contrasts with other approaches. The first visualization shows how the gating network routes tokens to experts spread across the GPUs. The second contrasts data, tensor, pipeline, and expert parallelism, illustrating which dimension of the model or data each method partitions.
Expert parallelism brings several important benefits for large‑scale deep learning:

- Scalability: because each GPU stores only its own experts, the total number of experts (and parameters) can grow well beyond what a single device could hold.
- Efficiency: sparse activation means each token touches only a few experts, so compute per token stays modest even as total model capacity grows.
- Specialization: individual experts can learn to handle different types of inputs, improving quality without a proportional increase in cost.
Let’s examine a practical implementation of expert parallelism, where we configure distributed experts on GPUs using the DeepSpeed library. The simplified PyTorch/DeepSpeed snippet below shows how to configure a Mixture-of-Experts layer with expert parallelism and integrate it into a Transformer-style model:
import torch
from deepspeed.moe.layer import MoE
from deepspeed.pipe import PipelineModule
class ExpertLayer(torch.nn.Module):
    """A single expert: a two-layer feed-forward network (MLP)."""
    def __init__(self, model_dim, hidden_dim):
        super().__init__()
        self.ff1 = torch.nn.Linear(model_dim, hidden_dim)
        self.ff2 = torch.nn.Linear(hidden_dim, model_dim)
        self.activation = torch.nn.ReLU()

    def forward(self, x):
        return self.ff2(self.activation(self.ff1(x)))
# Define the expert parallel group size and number of experts
ep_size = 2        # number of GPUs in each expert-parallel group
num_experts = 8

# Create an MoE layer with distributed experts
moe_layer = MoE(
    hidden_size=1024,
    expert=ExpertLayer(1024, 4096),  # expert module; DeepSpeed copies it per local expert
    num_experts=num_experts,
    ep_size=ep_size,
    k=1,                             # top-1 gating (Switch style)
)
# Integrate into a pipeline or Transformer model
model = PipelineModule(layers=[moe_layer, ...], loss_fn=torch.nn.CrossEntropyLoss())
This code defines a minimal feed-forward expert (ExpertLayer) and then uses DeepSpeed’s MoE layer to implement a Mixture-of-Experts block that shards whole experts across GPUs using expert parallelism (ep_size=2 means each expert-parallel group spans two GPUs, so the eight experts are split four per GPU within a group). With num_experts=8 total experts and k=1 (Switch-style, top-1 routing), a lightweight gating network inside MoE decides which expert each token is sent to. Each expert is a two-layer MLP (Linear → ReLU → Linear) that maps hidden width 1024 → 4096 → 1024. DeepSpeed manages the token routing across the expert-parallel group behind the scenes. Finally, moe_layer is added to a PipelineModule, illustrating how the MoE block would be inserted into a larger Transformer/pipeline model and trained end-to-end with a standard CrossEntropyLoss.
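A rough usage sketch follows, assuming torch.distributed and DeepSpeed have already been initialized with at least ep_size GPUs; task_loss here is only a placeholder for whatever loss your task uses. DeepSpeed’s MoE layer returns the transformed tokens together with an auxiliary load-balancing loss that is typically added to the training loss:

```python
# Rough usage sketch; assumes the distributed environment is already initialized.
task_loss = lambda y: y.pow(2).mean()           # placeholder; use your real task loss

x = torch.randn(8, 16, 1024)                    # (batch, sequence, hidden)
output, aux_loss, expert_counts = moe_layer(x)  # DeepSpeed's MoE returns a tuple
loss = task_loss(output) + 0.01 * aux_loss      # aux loss encourages balanced expert usage
loss.backward()
```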
The following pseudo-configuration shows how DeepSpeed can manage multiple parallelism dimensions when training an MoE model. It enables bfloat16 (bf16) mixed precision, uses AdamW with a learning rate of 2e-4, and applies ZeRO stage 2 to shard optimizer states and reduce memory.
It combines hybrid parallelism: data parallel size 4 (the model is replicated across 4 groups), tensor parallel size 2 (large matrices are partitioned across 2 GPUs), pipeline parallel size 1 (no pipelining), and expert parallel size 4 (experts are sharded across 4 GPUs).
{
  "bf16": { "enabled": true },
  "optimizer": { "type": "adamw", "params": { "lr": 2e-4 } },
  "zero_optimization": { "stage": 2 },
  "parallelism": {
    "data_parallel_size": 4,
    "tensor_parallel_size": 2,
    "pipeline_parallel_size": 1,
    "expert_parallel_size": 4
  },
  "moe": {
    "enabled": true,
    "num_experts": 64,
    "top_k": 2,
    "capacity_factor": 1.25,
    "load_balancing_loss_coef": 1e-2,
    "router": "softmax",
    "token_drop_policy": "capacity" // or "dropless", depending on the framework
  }
}
Expert parallelism is particularly well‑suited to models and tasks that need large capacity under a limited compute budget: because each token activates only a few experts, capacity can grow without a proportional increase in per-token computation.
Despite its benefits, expert parallelism introduces new complexities:
Routing tokens to the appropriate GPUs requires all‑to‑all communication. Each token is sent to the GPUs that hold its selected experts, and those GPUs then return the expert outputs to the originating devices. With large numbers of tokens and experts, this communication can dominate training time, so communication overhead is a primary consideration in the design of expert parallelism. Researchers often pair expert parallelism with data-parallel or tensor-parallel strategies to balance workload and reduce overhead. In the Switch Transformer paper, the authors formalize these trade‑offs using a binary assignment matrix that specifies which tokens are assigned to which experts, along with a measure of communication cost.
If some experts are assigned more tokens than others, the GPUs hosting those experts become bottlenecks. The gating network must therefore balance accuracy (routing tokens to the best expert) against load balancing (making sure no single expert or GPU is overloaded). More complex gating strategies, such as load‑balanced gating, can redistribute tokens to make better use of all the experts, but they may sacrifice the model’s ability to send each token to its best expert. The capacity factor (which determines how many tokens an expert is allowed to process before overflowing or dropping excess tokens) and the gating algorithm must be carefully tuned to prevent hot spots.
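A minimal sketch of a Switch-Transformer-style balancing loss and the capacity computation is shown below (simplified to top-1 routing; the function names are illustrative). The auxiliary loss multiplies, for each expert, the fraction of tokens it receives by the mean router probability it is assigned, encouraging a uniform spread:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_ids, num_experts):
    """Switch-style auxiliary loss sketch (top-1 routing assumed).

    router_logits: (num_tokens, num_experts) raw gate scores
    expert_ids:    (num_tokens,) index of the expert each token was sent to
    """
    probs = F.softmax(router_logits, dim=-1)
    token_fraction = F.one_hot(expert_ids, num_experts).float().mean(dim=0)  # f_i
    prob_fraction = probs.mean(dim=0)                                        # P_i
    return num_experts * torch.sum(token_fraction * prob_fraction)

def expert_capacity(tokens_per_batch, num_experts, capacity_factor=1.25):
    # Maximum tokens each expert may accept before overflow tokens are dropped or rerouted.
    return int(capacity_factor * tokens_per_batch / num_experts)
```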
Expert parallelism can also increase pressure on the GPU interconnect and CPU memory. If not optimized, the all-to-all dispatch operations can produce a large number of small messages that overload the networking hardware. Leveraging modern interconnect technologies (NVLink and NVSwitch within a node, InfiniBand across nodes) is almost a requirement for large-scale expert-parallel training.
How does Expert Parallelism differ from other forms of model parallelism?
In Expert Parallelism, entire experts (sub-networks) are distributed across GPUs, while other techniques partition the model differently: data parallelism partitions a batch of tokens, tensor parallelism partitions a matrix, and pipeline parallelism partitions layers of a model. Also, only the experts selected for each token are activated, reducing the computation and memory requirements.
Why do we use Mixture-of-Experts (MoE) layers with Expert Parallelism?
The gating network is used to route each token to only a few experts (top-k), making the MoE layer sparse. Expert Parallelism enables those experts to be placed on different GPUs or nodes. As a result, the number of experts (and therefore the capacity of the model) can scale to orders of magnitude larger than would fit on a single GPU. The combination of MoE and Expert Parallelism enables us to build trillion-parameter models that remain computationally feasible.
What are the main challenges in implementing Expert Parallelism?
Communication, load balancing, and hardware interconnect speed are the major challenges in implementing expert parallelism. Each token must be routed to the GPUs that hold the experts chosen for it, which requires all-to-all communication at every MoE layer and can be quite expensive. If the gating network isn’t well designed, it can turn some GPUs into bottlenecks by sending too many tokens to a small number of experts.
Routing algorithms, careful regularization of the gating losses to ensure load balance, and high-bandwidth GPU interconnects (like NVLink or InfiniBand) are key to overcoming these issues.
Data, tensor, and pipeline parallelism are the three most common approaches for distributing training across devices: they split the data, the matrices, or the layers. Expert parallelism, on the other hand, distributes model capacity by spreading the experts of mixture‑of‑experts layers across GPUs. Since each input token is routed to only a few experts, only the GPUs hosting those experts perform computation for it.
Mixture‑of‑experts architectures and expert parallelism have opened the door to training significantly larger and more specialized models without prohibitive costs. In the future, we can expect routing algorithms and communication strategies to improve, further alleviating load‑balancing and communication bottlenecks. Expert parallelism will likely continue to be combined with data, tensor, and pipeline parallelism in hybrid approaches that become the standard for training future LLMs and sparse models.