
Mixture of Experts

Learn about Mixture of Experts (MoE), which enables scalable, efficient deep learning by activating only a subset of specialized expert networks per input, reducing computation while increasing model capacity.

Interviewers at major tech companies often ask candidates to explain Mixture of Experts (MoE) models. This question has become common because MoE is a cutting-edge technique for training massive AI models efficiently. For example, Waymo (an Alphabet/Google company) specifically asked a candidate how to improve the efficiency of a Transformer’s feed-forward network—the expected answer was to use a Mixture of Experts approach. Likewise, Meta’s latest Llama 4 models are the first in that family to employ MoE, and even OpenAI has hinted that future systems might incorporate MoE, as it is known to yield better results with fewer resources. In short, MoE has become a hot topic across Google, Meta, Amazon, OpenAI, and others, making it a likely interview subject.

What is a Mixture of Experts (MoE)?

Imagine a team of specialists, each an expert in a different area, and a smart manager who directs tasks to the most qualified specialists. That’s the intuition behind Mixture of Experts (MoE). In an MoE model, instead of one monolithic neural network handling every input, you have multiple smaller networks—experts—each trained to handle certain inputs. A separate component called the gating network (or router) looks at each incoming sample and decides which expert (or experts) should handle it. The model “mixes” the expertise of many sub-models, but only a few are active for any given input.

This design allows the model to specialize and scale. Each expert can focus on a subset of the problem space (for example, one expert might become specialized in math problems while another specializes in language syntax). When a new input comes in, the gating network routes it to, say, the math expert if it’s a math question or the syntax expert if it’s a grammar question. For instance, in the DeepSeek MoE-based model, asking “What’s 2 + 2?” would activate the math expert (which responds “4”), while a request to “Write Python code for a loop” would engage the coding expert. Only the relevant parts of the network work on each query. By combining the outputs of the selected experts, the model produces a final result as if a panel of specialists solved the problem together.

Mixture of Experts architecture

Formally, MoE is a form of conditional computation: different inputs activate different parts of the network. This concept originated in the early 1990s with Jacobs and Hinton’s work on adaptive mixtures of local experts, but it has seen a resurgence in deep learning. The key idea is that “Mixture of Experts layers are simple and allow us to increase the size or capacity of a language model without a corresponding increase in compute”. We achieve this by having multiple copies of a layer (the experts, each with its own parameters) and a gating mechanism that sparsely selects a subset of those experts for each input. In other words, an MoE layer is an ensemble of many networks, but thanks to the gating, only a few experts’ outputs are combined for any given data point. This yields a huge model capacity (parameters across all experts) while keeping the computation per sample much lower than using all parts of the model at once.
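
Written as a formula (following the standard sparsely gated MoE formulation), the output for an input x with N experts E_1, …, E_N and gating function G is:

```
y = \sum_{i=1}^{N} G(x)_i \, E_i(x), \qquad \text{with } G(x)_i = 0 \text{ for every expert outside the selected top-}k
```

Because most gate values are zero, the corresponding E_i(x) terms are simply never computed, which is where the savings come from.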

To summarize in simple terms: MoE is like having a large team of neural networks (experts) where, for each task, only the most relevant members of the team are called upon. This gives the power of a very large model, but you pay for only a small model’s computation for each task. Next, let’s dig into how this is actually implemented in a neural network architecture.

MoE architecture: Gating and sparse expert activation

In an MoE architecture, the two main components are: (1) a set of expert networks, and (2) a gating network. The experts are typically neural networks (often all with the same architecture) that each accept the same kind of input and produce the same kind of output. For example, in a Transformer-based MoE, each expert might be a feed-forward network (FFN) layer with its own weights. You might have dozens or even hundreds of experts in one MoE layer. The gating network is a smaller network (often a simple feed-forward layer or two) whose job is to output a weight or score for each expert, based on the input. Think of the gating network as the “router” or “manager” that decides which experts are relevant for this particular input token or example.
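
As a rough sketch (the sizes and names below are illustrative, not taken from any particular model), the gating network can be as small as a single linear layer that maps each token representation to one score per expert:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_experts = 512, 8            # illustrative sizes
gate = nn.Linear(d_model, num_experts)   # the "router": one logit per expert

x = torch.randn(4, d_model)              # a batch of 4 token representations
logits = gate(x)                         # shape: (4, num_experts)
weights = F.softmax(logits, dim=-1)      # per-token weights over the 8 experts
print(weights.shape)                     # torch.Size([4, 8])
```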

Here’s how a forward pass works in an MoE layer:

  • An input (say, a representation of a token or an image, depending on the task) is fed into the gating network. The gating network produces a set of scores, one for each expert. These scores can be turned into weights (for example, via a softmax) indicating how much each expert should contribute.

  • Instead of using all experts, the MoE layer will activate only a sparse subset of them for this input. Typically, the gating network will select the top few experts with the highest scores and ignore the rest (this is what we mean by “sparsely-gated” MoE). For instance, it might pick the single best expert (top-1 gating) or the top two experts (top-2 gating) for the input.

  • The chosen expert networks each process the input (in parallel), producing their individual outputs. Because we activated only a small number of experts, most of the network’s parameters remain dormant for this input – which is exactly why MoE saves computation.

  • Finally, the outputs of the active experts are combined to form the MoE layer’s output. If only one expert was active, its output might be taken as the result (or still scaled by its gating score). If two experts are active, we might average their outputs or combine them in a weighted sum according to the gating weights. A code sketch of this full forward pass follows this list.
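
Putting these steps together, here is a minimal PyTorch sketch of a sparsely gated MoE layer with top-k routing over simple FFN experts. It is a simplified illustration rather than a production implementation (real systems batch tokens per expert and add load balancing), and the class and variable names are ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparsely gated MoE layer: route each token to its top-k experts."""

    def __init__(self, d_model, d_hidden, num_experts, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)        # router: one logit per expert
        self.experts = nn.ModuleList([                     # expert FFNs
            nn.Sequential(nn.Linear(d_model, d_hidden),
                          nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                  # x: (num_tokens, d_model)
        logits = self.gate(x)                              # (num_tokens, num_experts)
        top_vals, top_idx = logits.topk(self.k, dim=-1)    # keep only the top-k experts
        top_weights = F.softmax(top_vals, dim=-1)          # renormalize over the chosen k

        out = torch.zeros_like(x)
        for slot in range(self.k):                         # combine the k expert outputs
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Note that the softmax is taken over only the selected logits, so the mixing weights stay differentiable with respect to the chosen experts’ scores during training.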


Importantly, the gating decision can be hard (truly selecting a few experts and zeroing out the others) or soft (assigning fractional weights to all experts). In practice, modern MoEs use hard selection for efficiency (only the top-k experts are computed), but during training the gating may use techniques to keep the process differentiable (we’ll discuss that shortly). The end result is that each input only flows through a tiny fraction of the model.
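
To make the soft-versus-hard distinction concrete, the toy snippet below (illustrative values only) computes both a dense softmax over all experts and a top-2 gate that zeroes out the rest:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 0.5, 1.2, -1.0])   # gate scores for 4 experts

# Soft gating: every expert gets a (possibly tiny) weight, so all of them must run.
soft = F.softmax(logits, dim=-1)

# Hard (top-k) gating: keep the 2 highest scores, renormalize, skip the other experts.
vals, idx = logits.topk(2)
hard = torch.zeros_like(logits)
hard[idx] = F.softmax(vals, dim=-1)

print(soft)   # four nonzero weights
print(hard)   # two nonzero weights; experts with weight 0 are never evaluated
```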

To visualize it, consider a Transformer where normally you have a single FFN after attention. In an MoE Transformer, that FFN is replaced by, say, 16 FFN experts. For each token, the gating network (which could be a linear layer taking the token’s representation) outputs 16 scores, one per expert. Suppose it selects the top 2 experts for that token. Only those two FFNs compute an output for the token, and then their outputs are averaged (perhaps weighted by the gate). The other 14 experts do nothing for this token. The token then moves on in the model with this mixed output. The next token might activate a different combination of experts depending on its nature. By doing this, the model’s capacity (16 FFNs) is much larger than a single FFN, but each token only used 2 FFNs’ worth of compute.
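
To see why capacity and per-token compute decouple, here is a quick back-of-the-envelope count for this hypothetical 16-expert layer (sizes made up for illustration):

```python
# Hypothetical sizes for the 16-expert example above
d_model, d_hidden, num_experts, k = 512, 2048, 16, 2

params_per_expert = 2 * d_model * d_hidden        # two weight matrices per FFN (biases ignored)
total_params  = num_experts * params_per_expert   # capacity: parameters of all 16 experts
active_params = k * params_per_expert             # per-token compute: only the 2 routed experts

print(f"total expert parameters : {total_params:,}")
print(f"used per token          : {active_params:,} ({active_params / total_params:.1%})")
```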

Expert selection strategies

A crucial part of MoE is the strategy used by the gating network to select experts. The ...