Scaling Laws
Learn how scaling laws in language models describe the predictable patterns of improvement as we make models bigger and train them on more data.
“What is the scaling law?” is a common interview question at leading AI labs, and for good reason. Interviewers ask this to probe your grasp of how model performance scales with resources. They want to see that you understand the trade-offs in making models bigger, feeding them more data, and using more compute. In other words, can you discuss how scaling up a model affects its efficiency, performance gains, and practical costs? Top companies expect candidates to appreciate not only that bigger models can perform better, but also how and why, including diminishing returns and how to plan training resources.
Such questions test your high-level understanding of model scaling strategies. They’re looking for awareness that beyond a certain point, simply throwing more parameters or data at a model has trade-offs. For instance, do you know how much performance gain to expect by doubling the model size? How should model and dataset sizes grow together for optimal training? By asking about scaling laws, interviewers check if you understand the empirical rules-of-thumb governing efficient scaling of LLMs—an area crucial for designing state-of-the-art systems under fixed budgets.
What exactly are scaling laws?
At its core, scaling laws refer to the remarkably predictable improvements in model performance as we scale up three key factors:
Model size (number of parameters)
Dataset size (number of tokens)
Compute (training FLOPs or training time)
In large language models, researchers have found that as you increase any one of these factors (while the others are not a bottleneck), the test loss falls off as a power law. For example, one can write the loss as a function of model size N:

L(N) = (N_c / N)^{α_N}

Similarly, the loss as a function of dataset size D is:

L(D) = (D_c / D)^{α_D}

Finally, for compute C:

L(C) = (C_c / C)^{α_C}

where N_c, D_c, and C_c are fitted constants and the exponents are small positive numbers (Kaplan et al. (2020) report roughly α_N ≈ 0.076, α_D ≈ 0.095, and α_C ≈ 0.050), so the loss decreases smoothly and predictably as each resource grows.
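As a rough illustration, the power-law form can be evaluated directly in a few lines. This is a sketch, not a fitted model: the exponents below are the approximate values reported by Kaplan et al. (2020), and the constants `n_c` and `d_c` are illustrative placeholders chosen to match the rough scale of published fits.

```python
# Sketch of Kaplan-style power-law loss curves.
# Exponents ~0.076 and ~0.095 follow Kaplan et al. (2020);
# the constants n_c and d_c are illustrative assumptions, not fitted results.

def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Loss as a power law in model size N: L(N) = (N_c / N) ** alpha_N."""
    return (n_c / n_params) ** alpha_n

def loss_from_data(n_tokens, d_c=5.4e13, alpha_d=0.095):
    """Loss as a power law in dataset size D: L(D) = (D_c / D) ** alpha_D."""
    return (d_c / n_tokens) ** alpha_d

# Doubling model size always shrinks the loss by the same constant
# factor, 2 ** -alpha_n ≈ 0.949, i.e. roughly a 5% reduction.
l1 = loss_from_params(1e9)   # loss at 1B parameters
l2 = loss_from_params(2e9)   # loss at 2B parameters
print(f"ratio after doubling N: {l2 / l1:.3f}")
```

Because the curve is a pure power law, the ratio after doubling N is independent of the constants, which is exactly what makes these laws useful for extrapolation.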
The takeaway is that larger scale = better performance, following a simple curve. This empowers us to forecast model improvements without actually training gigantic models. (Notably, using such scaling laws, researchers correctly predicted GPT-4’s performance using only a tiny fraction of the final compute!) But scaling up isn’t as simple as just cranking one dial to max—you must balance model size, data quantity, and compute. Next, we’ll break down each factor and the role it plays.
How does model size affect scaling?
Generally, bigger models yield lower loss. A model with 10 billion parameters will typically outperform (i.e., have lower perplexity than) one with 1 billion parameters if all else is equal. Empirically, test loss decreases as a power law in N, the number of parameters.
This means the loss drops slowly but steadily as N grows: each doubling of model size buys a roughly constant fractional reduction in loss, so gains keep coming but at a diminishing absolute rate.
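To get a feel for how slowly the loss falls in N, one can ask how much a model must grow to cut the loss in half under a pure power law. This quick calculation assumes the approximate exponent α_N ≈ 0.076 from Kaplan et al. (2020).

```python
# If L(N) ∝ N ** -alpha_n, halving the loss requires growing N by
# a factor k where k ** -alpha_n = 0.5, i.e. k = 2 ** (1 / alpha_n).
alpha_n = 0.076  # approximate exponent from Kaplan et al. (2020)
growth_factor = 2 ** (1 / alpha_n)
print(f"N must grow ~{growth_factor:,.0f}x to halve the loss")
```

With this exponent, halving the loss takes roughly a 9,000x increase in parameter count, which is why practitioners talk about diminishing returns even though the curve never flattens entirely.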