
Model Architectures and Scaling

Explore how large language models are designed and scaled, focusing on trade-offs in architecture, the impact of scaling laws like Chinchilla, mixture-of-experts routing, and techniques enabling million-token context windows. Learn to reason about real-world engineering decisions and optimize inference and training effectively.

Architecture questions in AI interviews often feel abstract until you connect them to decisions that real labs are making right now. Why did Meta release a 109B parameter model that only uses 17B parameters at runtime? Why did Google train Gemini 3 at a specific token budget rather than simply training longer? Why do frontier models suddenly handle 1M token documents? These are not trivia questions. They are consequences of scaling laws, mixture-of-experts design, and engineering choices that determine what is commercially viable to build and deploy.

You do not need to memorize architecture papers. You need to understand the trade-offs each architectural choice makes. Interviewers are not quizzing you on publication dates. They are checking whether you can reason about why a design decision was made and what it costs.

What are scaling laws and why do they determine training budgets?

Scaling laws describe how model performance changes as you scale compute, parameters, and data. The landmark result is the Chinchilla scaling law (Hoffmann et al., 2022, DeepMind), which showed that most models at the time were undertrained relative to their parameter count.

The key finding: for a fixed compute budget C, the compute-optimal allocation is roughly equal scaling of parameters N and training tokens D. Specifically, the optimal training token count is approximately 20x the parameter count. Chinchilla (70B parameters, 1.4T tokens) significantly outperformed Gopher (280B parameters, 300B tokens) despite using the same training compute, because Gopher was parameter-heavy and data-light.
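The allocation can be sketched numerically. Using the common approximation that training compute is C ≈ 6ND FLOPs together with the D ≈ 20N rule of thumb, the compute-optimal sizes for a given budget follow directly. This is a back-of-envelope sketch, not the fitted Chinchilla law itself:

```python
import math

def chinchilla_optimal(c_flops: float, tokens_per_param: float = 20.0):
    """Back-of-envelope compute-optimal split for a training budget.

    Approximations: training compute C ~= 6 * N * D FLOPs, and the
    Chinchilla rule of thumb D ~= tokens_per_param * N.
    Solving C = 6 * N * (20 * N) gives N = sqrt(C / 120).
    """
    n_params = math.sqrt(c_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Sanity check against Chinchilla itself: 70B params on 1.4T tokens
# implies a budget of roughly 6 * 70e9 * 1.4e12 ~= 5.9e23 FLOPs.
c = 6.0 * 70e9 * 1.4e12
n, d = chinchilla_optimal(c)
print(f"{n:.2e} params, {d:.2e} tokens")  # ~7.00e10 params, ~1.40e12 tokens
```

Plugging Chinchilla's own budget back in recovers its 70B/1.4T split, which is the point: at that compute level, Gopher's 280B parameters sat far to the parameter-heavy side of the curve.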

The practical consequence for the industry: Llama 2 7B was trained on 2 trillion tokens (roughly 285x the parameter count), making it far over-trained relative to the Chinchilla optimum. This is intentional. Chinchilla optimal refers to the best model quality per unit of training compute. But inference is run millions of times after training, so it makes economic sense to overtrain a smaller model: you pay the extra training cost once and reduce inference cost forever.
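The economics can be made concrete with a toy lifetime-cost comparison. The model sizes and token counts below are illustrative placeholders, not a real quality-matched pair; the approximations are training cost ≈ 6ND FLOPs and inference cost ≈ 2N FLOPs per generated token:

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    train = 6.0 * n_params * train_tokens   # ~6 FLOPs per param per training token
    serve = 2.0 * n_params * served_tokens  # ~2 FLOPs per param per served token
    return train + serve

# Hypothetical quality-matched pair (illustrative numbers only):
small = (7e9, 2e12)    # overtrained small model, Llama-2-7B-like budget
large = (20e9, 4e11)   # roughly Chinchilla-optimal larger model

# Break-even serving volume: extra training cost / per-token inference savings
extra_train = 6.0 * (small[0] * small[1] - large[0] * large[1])
per_token_saving = 2.0 * (large[0] - small[0])
break_even = extra_train / per_token_saving
print(f"break-even at {break_even:.2e} served tokens")  # roughly 1.4e12
```

Under these assumptions the small model's extra training cost is amortized after about a trillion served tokens, after which every additional token is cheaper forever, which is exactly the trade the over-training strategy is making.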

When this question comes up in your interview, make this distinction explicit. Chinchilla optimal is a training compute concept, not an inference concept. Labs like Meta optimize for inference cost, not just training cost, which is why they train smaller models longer.

What is mixture of experts and why do modern LLMs use it?

A standard dense transformer applies every parameter to every token. A 70B model performs roughly 70 billion multiply-accumulate operations per token per forward pass. This is expensive at inference.
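As a rough sketch of why this matters, assume ~2 FLOPs per parameter per token (one multiply plus one add) and a hypothetical accelerator sustaining 300 TFLOPs; the compute-bound throughput ceiling for a dense 70B model falls out directly:

```python
def dense_tokens_per_sec(n_params: float, sustained_flops: float) -> float:
    # Forward pass ~= 2 FLOPs per parameter per token (1 multiply + 1 add)
    flops_per_token = 2.0 * n_params
    return sustained_flops / flops_per_token

# Hypothetical accelerator sustaining 300 TFLOPs on a dense 70B model
tps = dense_tokens_per_sec(70e9, 300e12)
print(f"~{tps:.0f} tokens/sec compute-bound ceiling")
```

In practice autoregressive decoding is usually memory-bandwidth bound, so real single-stream throughput is lower still; the takeaway is that cost scales linearly with the parameters touched per token, which is precisely what mixture-of-experts attacks.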

Mixture of Experts (MoE) breaks the feed-forward layers into N separate expert MLPs and routes each token to only k of them, typically k=2 or k=4. For a model with 64 experts and top-2 routing, each token activates roughly 2/64 = 3% of the expert parameters. This means a 400B total-parameter ...