
Model Architectures and Scaling

Explore how large language models are designed and scaled, focusing on trade-offs in architecture, the impact of scaling laws like Chinchilla, mixture-of-experts routing, and techniques enabling million-token context windows. Learn to reason about real-world engineering decisions and optimize inference and training effectively.

Architecture questions in AI interviews often feel abstract until you connect them to decisions that real labs are making right now. Why did Meta release a 109B parameter model that only uses 17B parameters at runtime? Why did Google train Gemini 3 at a specific token budget rather than simply training longer? Why do frontier models suddenly handle 1M token documents? These are not trivia questions. They are consequences of scaling laws, mixture-of-experts design, and engineering choices that determine what is commercially viable to build and deploy.

You do not need to memorize architecture papers. You need to understand the trade-offs each architectural choice makes. Interviewers are not quizzing you on publication dates. They are checking whether you can reason about why a design decision was made and what it costs.

What are scaling laws and why do they determine training budgets?

Scaling laws describe how model performance changes as you scale compute, parameters, and data. The landmark result is the Chinchilla scaling law (Hoffmann et al., 2022, DeepMind), which showed that most models at the time were undertrained relative to their parameter count.

The key finding: for a fixed compute budget C, the compute-optimal allocation is roughly equal scaling of parameters N and training tokens D. Specifically, the optimal training token count is approximately 20x the parameter count. Chinchilla (70B parameters, 1.4T tokens) significantly outperformed Gopher (280B parameters, 300B tokens) despite using the same training compute, because Gopher was parameter-heavy and data-light.
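The allocation can be sketched numerically. Using the common approximation that training compute is C ≈ 6ND FLOPs together with the D ≈ 20N rule of thumb, the compute-optimal sizes for a given budget follow directly. This is a back-of-envelope sketch, not the fitted Chinchilla law itself:

```python
import math

def chinchilla_optimal(c_flops: float, tokens_per_param: float = 20.0):
    """Back-of-envelope compute-optimal split for a training budget.

    Approximations: training compute C ~= 6 * N * D FLOPs, and the
    Chinchilla rule of thumb D ~= tokens_per_param * N.
    Solving C = 6 * N * (20 * N) gives N = sqrt(C / 120).
    """
    n_params = math.sqrt(c_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Sanity check against Chinchilla itself: 70B params on 1.4T tokens
# implies a budget of roughly 6 * 70e9 * 1.4e12 ~= 5.9e23 FLOPs.
c = 6.0 * 70e9 * 1.4e12
n, d = chinchilla_optimal(c)
print(f"{n:.2e} params, {d:.2e} tokens")  # ~7.00e10 params, ~1.40e12 tokens
```

Plugging Chinchilla's own budget back in recovers its 70B/1.4T split, which is the point: at that compute level, Gopher's 280B parameters sat far to the parameter-heavy side of the curve.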

The practical consequence for the industry: Llama 2 7B was trained on 2 trillion tokens (roughly 285x the parameter count), making it far over-trained relative to the Chinchilla optimum. This is intentional. Chinchilla optimal refers to the best model quality per unit of training compute. But inference is run millions of times after training, so it makes economic sense to overtrain a smaller model: you pay the extra training cost once and reduce inference cost forever.
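The economics can be made concrete with a toy lifetime-cost comparison. The model sizes and token counts below are illustrative placeholders, not a real quality-matched pair; the approximations are training cost ≈ 6ND FLOPs and inference cost ≈ 2N FLOPs per generated token:

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    train = 6.0 * n_params * train_tokens   # ~6 FLOPs per param per training token
    serve = 2.0 * n_params * served_tokens  # ~2 FLOPs per param per served token
    return train + serve

# Hypothetical quality-matched pair (illustrative numbers only):
small = (7e9, 2e12)    # overtrained small model, Llama-2-7B-like budget
large = (20e9, 4e11)   # roughly Chinchilla-optimal larger model

# Break-even serving volume: extra training cost / per-token inference savings
extra_train = 6.0 * (small[0] * small[1] - large[0] * large[1])
per_token_saving = 2.0 * (large[0] - small[0])
break_even = extra_train / per_token_saving
print(f"break-even at {break_even:.2e} served tokens")  # roughly 1.4e12
```

Under these assumptions the small model's extra training cost is amortized after about a trillion served tokens, after which every additional token is cheaper forever, which is exactly the trade the over-training strategy is making.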

When this question comes up in your interview, make this distinction explicit. Chinchilla optimal is a training compute concept, not an inference concept. Labs like Meta optimize for inference cost, not just training cost, which is why they train smaller models longer.

What is mixture of experts and why do modern LLMs use it?

A standard dense transformer applies every parameter to every token. A 70B model performs roughly 70 billion multiply-accumulate operations per token per forward pass. This is expensive at inference.
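As a rough sketch of why this matters, assume ~2 FLOPs per parameter per token (one multiply plus one add) and a hypothetical accelerator sustaining 300 TFLOPs; the compute-bound throughput ceiling for a dense 70B model falls out directly:

```python
def dense_tokens_per_sec(n_params: float, sustained_flops: float) -> float:
    # Forward pass ~= 2 FLOPs per parameter per token (1 multiply + 1 add)
    flops_per_token = 2.0 * n_params
    return sustained_flops / flops_per_token

# Hypothetical accelerator sustaining 300 TFLOPs on a dense 70B model
tps = dense_tokens_per_sec(70e9, 300e12)
print(f"~{tps:.0f} tokens/sec compute-bound ceiling")
```

In practice autoregressive decoding is usually memory-bandwidth bound, so real single-stream throughput is lower still; the takeaway is that cost scales linearly with the parameters touched per token, which is precisely what mixture-of-experts attacks.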

Mixture of Experts (MoE) breaks the feed-forward layers into N separate expert MLPs and routes each token to only k of them, typically k=2 or k=4. For a model with 64 experts and top-2 routing, each token activates roughly 2/64 = 3% of the expert parameters. This means a 400B total-parameter ...