Scaling Laws

Learn how scaling laws in language models capture the predictable patterns of improvement as we make models bigger and train them on more data.

“What is the scaling law?” is a common interview question at leading AI labs, and for good reason. Interviewers ask this to probe your grasp of how model performance scales with resources. They want to see that you understand the trade-offs in making models bigger, feeding them more data, and using more compute. In other words, can you discuss how scaling up a model affects its efficiency, performance gains, and practical costs? Top companies expect candidates to appreciate not only that bigger models can perform better, but also how and why, including where returns diminish and how to plan training resources.

Such questions test your high-level understanding of model scaling strategies. They’re looking for awareness that beyond a certain point, simply throwing more parameters or data at a model has trade-offs. For instance, do you know how much performance gain to expect by doubling the model size? How should model and dataset sizes grow together for optimal training? By asking about scaling laws, interviewers check if you understand the empirical rules-of-thumb governing efficient scaling of LLMs—an area crucial for designing state-of-the-art systems under fixed budgets.

What exactly are scaling laws?

At its core, scaling laws refer to the remarkably predictable improvements in model performance as we scale up three key factors:

  • Model size (number of parameters, N)

  • Dataset size (number of tokens, D)

  • Compute (training FLOPs, C, or training time)

In large language models, researchers have found that as you increase N, D, or C, the test loss (or perplexity) drops following a power-law curve. In other words, model quality improves smoothly and reliably as you make the model bigger, train it on more data, and spend more compute. This was first demonstrated in a seminal 2020 paper by Kaplan et al. (OpenAI) and has held across multiple orders of magnitude of scale. Mathematically, these relationships can be sketched as power laws.

For example, one can write the loss L as a function of model size, roughly as:

L(N) \approx a \cdot N^{-\alpha} + c

Similarly, as a function of dataset size:

L(D) \approx a \cdot D^{-\beta} + c

Finally, for compute:

L(C) \approx a \cdot C^{-\gamma} + c
where \alpha, \beta, \gamma > 0 are scaling exponents. Here a and c are constants (a takes a different fitted value in each relationship, and c represents an irreducible loss “floor”). The key insight is that performance vs. scale is approximately a straight line on a log-log plot. Crucially, none of these exponents is 1 or greater, meaning we get diminishing returns: doubling the scale yields less than double the improvement. However, the improvements are very consistent. There are no signs of the curves bending or saturating even at the largest scales tested (though eventually they must flatten as loss approaches the minimum).
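
To make the shape of these curves concrete, here is a minimal Python sketch of the model-size law L(N) ≈ a·N^(−α) + c. The constants a, α, and c below are made-up placeholders chosen only for illustration, not values fitted from any real training runs.

```python
# Minimal sketch of the model-size scaling law L(N) ≈ a * N^(-alpha) + c.
# All three constants are hypothetical placeholders, not fitted values.
a, alpha, c = 10.0, 0.076, 1.7   # prefactor, scaling exponent, irreducible "floor"

def predicted_loss(n_params: float) -> float:
    """Predicted test loss for a model with n_params parameters."""
    return a * n_params ** (-alpha) + c

# Doubling the model size each step shaves off a shrinking amount of loss
# (diminishing returns), yet the reducible part a * N^(-alpha) is a straight
# line on a log-log plot.
for n in (1e9, 2e9, 4e9, 8e9):
    print(f"N = {n:.0e} params -> predicted loss ≈ {predicted_loss(n):.3f}")
```

The same exercise works for the data law L(D) or the compute law L(C); only the exponent (β or γ) and the fitted constants change.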

The takeaway is that larger scale = better performance, following a simple curve. This empowers us to forecast model improvements without actually training gigantic models. (Notably, using such scaling laws, researchers correctly predicted GPT-4’s performance using only a tiny fraction of the final compute!) But scaling up isn’t as simple as just cranking one dial to max—you must balance model size, data quantity, and compute. Next, we’ll break down each factor and the role it plays.
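
As a toy illustration of that forecasting idea (not the actual procedure used for GPT-4 or any other model), the sketch below fits a power law to a handful of synthetic small-scale "runs" and extrapolates it to a model roughly a thousand times larger. The loss numbers are invented for the example; the fitting trick is simply ordinary least squares in log-log space.

```python
# Sketch of "forecast before you train": fit a power law to a few cheap
# small-scale runs, then extrapolate to a much larger model.
import numpy as np

# Hypothetical small-scale runs: (parameter count, measured test loss).
# These values are synthetic, generated purely for illustration.
small_runs = np.array([
    (1e7, 5.10),
    (3e7, 4.72),
    (1e8, 4.35),
    (3e8, 4.05),
])
N, L = small_runs[:, 0], small_runs[:, 1]

# A pure power law L = a * N^(-alpha) (ignoring the irreducible floor) is a
# straight line in log-log space, so a linear fit on the logs recovers alpha and a.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha, a = -slope, np.exp(intercept)

# Extrapolate to a model ~1000x larger than anything "trained" above.
N_big = 1e11
predicted = a * N_big ** (-alpha)
print(f"fitted alpha ≈ {alpha:.3f}, predicted loss at N = {N_big:.0e}: {predicted:.2f}")
```

This is the basic mechanics behind scaling-law forecasts: the expensive run is never needed to estimate where its loss will land, only a family of much cheaper runs that pin down the curve.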

How does model size affect scaling?

Generally, bigger models yield lower loss. A model with 10 billion parameters will typically outperform (i.e., have lower perplexity than) one with 1 billion parameters if all else is equal. Empirically, test loss decreases as a power law in N (parameters). Kaplan et al. (2020) found for Transformers that:

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076, \; N_c \approx 8.8 \times 10^{13}

This means the loss drops slowly but steadily as N ...