Large Language Models: Language at Scale
Understand how large language models scale smarter through mixture of experts that activate specialized sub-networks for efficiency. Learn about reasoning models that break down complex tasks step-by-step to enhance accuracy. Discover methods to extend context windows, allowing models to handle massive inputs like entire documents.
We’ve gone from models with millions of parameters to those with billions or even trillions, gaining better language skills and broader abilities along the way. However, simply making models bigger is enormously costly in compute and energy, comparable to running a massive power plant. The real challenge is scaling smarter, not just larger.
In this section, we’ll explore three approaches that go beyond raw scale, making models more efficient, capable, and practical for real-world use.
What is a mixture of experts (MoE) model?
A mixture of experts (MoE) model splits its workload across specialized sub-networks called experts and uses a router to decide which ones to activate.
Experts: Smaller networks trained to specialize in certain kinds of inputs (math, code, everyday language, etc.).
Router: A gating mechanism that sends each token (piece of input) to a small subset of experts, often just one or two, instead of all of them.
This means the model’s total capacity is very large (the sum of all experts), but per-token compute is much smaller because only a fraction of the experts are active at once.
For example, if a model has 100 experts totaling 100B parameters but the router activates only 10 of them per token, it retains the capacity of a 100B-parameter model while running at roughly the inference cost of a 10B one.
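The routing idea above can be sketched in a few lines. This is a minimal toy illustration, not a production MoE layer: the dimensions, weight matrices, and `top_k` value are all made up for the example, and each "expert" is just a single linear transform. The key point is that only the top-k experts chosen by the router ever run for a given token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, chosen only for illustration)
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" here is a single linear layer; the router is another linear layer
expert_weights = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router_weights = rng.standard_normal((d_model, n_experts))

def moe_forward(x):
    """Route one token vector x to its top-k experts and mix their outputs."""
    logits = x @ router_weights                # one router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the k highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                       # softmax over the selected experts only
    # Only the selected experts compute anything; the rest are skipped entirely,
    # which is where the per-token savings come from.
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
```

Note that the model's total parameter count grows with `n_experts`, but each token only pays for `top_k` expert forward passes, which is exactly the capacity-versus-compute trade-off described above.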
The benefits:
Efficiency: Saves time, memory, and cost during inference.
Specialization: Different experts can excel at ...