Scaling Laws

Learn how scaling laws in language models capture the predictable patterns of improvement as we make models bigger and train them on more data.

“What is the scaling law?” is a common interview question at leading AI labs, and for good reason. Interviewers ask this to probe your grasp of how model performance scales with resources. They want to see that you understand the trade-offs in making models bigger, feeding them more data, and using more compute. In other words, can you discuss how scaling up a model affects its efficiency, performance gains, and practical costs? Top companies expect candidates to appreciate not only that bigger models can perform better, but also how and why, including where returns diminish and how to plan training resources.

Such questions test your high-level understanding of model scaling strategies. They’re looking for awareness that beyond a certain point, simply throwing more parameters or data at a model has trade-offs. For instance, do you know how much performance gain to expect by doubling the model size? How should model and dataset sizes grow together for optimal training? By asking about scaling laws, interviewers check if you understand the empirical rules-of-thumb governing efficient scaling of LLMs—an area crucial for designing state-of-the-art systems under fixed budgets.

What exactly are scaling laws?

At its core, scaling laws refer to the remarkably predictable improvements in model performance as we scale up three key factors:

  • Model size (number of parameters, N)

  • Dataset size (number of tokens, D)

  • Compute (training FLOPs, C, or training time)

In large language models, researchers have found that as you increase N, D, or C, the test loss (or perplexity) drops following a power-law curve. In other words, model quality improves smoothly and reliably as you make the model bigger, train it on more data, and spend more compute. This was first demonstrated in a seminal 2020 paper by Kaplan et al. (OpenAI) and has held across multiple orders of magnitude of scale. Mathematically, these relationships can be sketched as power laws.

For example, one can write the loss L as a function of model size, roughly as:

L(N) \approx a \cdot N^{-\alpha} + c

Similarly, as a function of dataset size:

L(D) \approx a \cdot D^{-\beta} + c

Finally, for compute:

L(C) \approx a \cdot C^{-\gamma} + c
where \alpha, \beta, \gamma > 0 are scaling exponents. Here a and c are constants (a takes a different fitted value in each relationship, and c represents an irreducible loss “floor”). The key insight is that performance vs. scale is approximately a straight line on a log-log plot. Crucially, none of these exponents is 1 or greater, meaning we get diminishing returns: doubling the scale yields less than double the improvement. However, the improvements are very consistent. There are no signs of the curves bending or saturating even at the largest scales tested (though eventually they must flatten as loss approaches the minimum).
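
To make the shape of these curves concrete, here is a minimal Python sketch of the model-size law L(N) ≈ a·N^(−α) + c. The constants a, α, and c below are made-up placeholders chosen only for illustration, not values fitted from any real training runs.

```python
# Minimal sketch of the model-size scaling law L(N) ≈ a * N^(-alpha) + c.
# All three constants are hypothetical placeholders, not fitted values.
a, alpha, c = 10.0, 0.076, 1.7   # prefactor, scaling exponent, irreducible "floor"

def predicted_loss(n_params: float) -> float:
    """Predicted test loss for a model with n_params parameters."""
    return a * n_params ** (-alpha) + c

# Doubling the model size each step shaves off a shrinking amount of loss
# (diminishing returns), yet the reducible part a * N^(-alpha) is a straight
# line on a log-log plot.
for n in (1e9, 2e9, 4e9, 8e9):
    print(f"N = {n:.0e} params -> predicted loss ≈ {predicted_loss(n):.3f}")
```

The same exercise works for the data law L(D) or the compute law L(C); only the exponent (β or γ) and the fitted constants change.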

The takeaway is that larger scale = better performance, following a simple curve. This empowers us to forecast model improvements without actually training gigantic models. (Notably, using such scaling laws, researchers correctly predicted GPT-4’s performance using only a tiny fraction of the final compute!) But scaling up isn’t as simple as just cranking one dial to max—you must balance model size, data quantity, and compute. Next, we’ll break down each factor and the role it plays.
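
As a toy illustration of that forecasting idea (not the actual procedure used for GPT-4 or any other model), the sketch below fits a power law to a handful of synthetic small-scale "runs" and extrapolates it to a model roughly a thousand times larger. The loss numbers are invented for the example; the fitting trick is simply ordinary least squares in log-log space.

```python
# Sketch of "forecast before you train": fit a power law to a few cheap
# small-scale runs, then extrapolate to a much larger model.
import numpy as np

# Hypothetical small-scale runs: (parameter count, measured test loss).
# These values are synthetic, generated purely for illustration.
small_runs = np.array([
    (1e7, 5.10),
    (3e7, 4.72),
    (1e8, 4.35),
    (3e8, 4.05),
])
N, L = small_runs[:, 0], small_runs[:, 1]

# A pure power law L = a * N^(-alpha) (ignoring the irreducible floor) is a
# straight line in log-log space, so a linear fit on the logs recovers alpha and a.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha, a = -slope, np.exp(intercept)

# Extrapolate to a model ~1000x larger than anything "trained" above.
N_big = 1e11
predicted = a * N_big ** (-alpha)
print(f"fitted alpha ≈ {alpha:.3f}, predicted loss at N = {N_big:.0e}: {predicted:.2f}")
```

This is the basic mechanics behind scaling-law forecasts: the expensive run is never needed to estimate where its loss will land, only a family of much cheaper runs that pin down the curve.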

How does model size affect scaling?

Generally, bigger models yield lower loss. A model with 10 billion parameters will typically outperform (i.e., have lower perplexity than) one with 1 billion parameters if all else is equal. Empirically, test loss decreases as a power law in N (parameters). Kaplan et al. (2020) found for Transformers that:

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076, \; N_c \approx 8.8 \times 10^{13}

This means the loss drops slowly but steadily as N ...