
What Are Large Language Models?

Explore how large language models work, including their multi-phase training process and emergent capabilities like code generation and chain-of-thought reasoning. Learn the fundamental differences between LLMs and traditional machine learning models and discover deployment considerations for integrating LLMs into production systems.

A single model that can draft legal contracts, write Python code, summarize medical research, and hold multi-turn conversations with a user: just five years ago, each of these tasks demanded its own purpose-built system with custom training data and specialized engineering. Today, a large language model handles all of them through natural language prompts alone. This shift is not incremental; it represents a fundamental change in how software systems leverage artificial intelligence.

A large language model (LLM) is a deep neural network built on the transformer architecture, a design introduced in 2017 that uses a self-attention mechanism to process all positions in a sequence simultaneously, enabling efficient learning of long-range dependencies in text. These models are trained on internet-scale text corpora spanning hundreds of billions to trillions of tokens, and they contain billions of learned parameters. GPT-4, developed by OpenAI, and Claude, developed by Anthropic, are two flagship examples that demonstrate what becomes possible when language models operate at this scale.

Understanding what LLMs are and how they work is the prerequisite for everything else in this course. Before exploring the distinction between foundation models and task-specific models in the next lesson, you need a clear mental model of how these systems are trained, what capabilities emerge from scale, and why they differ so fundamentally from the traditional machine learning approaches you may already know. The rest of this lesson covers exactly that.

How LLMs are trained

The training process behind a modern LLM is not a single step. It follows a structured, multi-phase pipeline where each stage builds on the previous one, progressively shaping a raw statistical model into a helpful, instruction-following assistant. Think of it like educating a person: first, they read broadly to understand language, then they study specific examples of good work, and finally, they receive feedback from mentors to refine their judgment.

The three-phase training pipeline

Phase 1: Pre-training

The foundation of every LLM is pre-training, a self-supervised learning process where the model learns to predict the next token (the smallest unit of text that an LLM processes, which can be a word, a subword, or a character depending on the tokenizer) in a sequence. The training corpus is massive, drawing from sources like Common Crawl, digitized books, academic papers, and code repositories. GPT-4, for instance, was trained on trillions of tokens using clusters of thousands of GPUs such as NVIDIA A100s. The model sees no explicit labels during this phase. Instead, it learns grammar, facts, reasoning patterns, and even coding conventions purely by predicting what comes next in text. This single objective, next-token prediction, turns out to be remarkably powerful when applied at sufficient scale.
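The objective itself is easy to illustrate. The toy model below shares nothing with a transformer architecturally, but it makes the prediction task concrete: count which token tends to follow each token in a corpus, then predict the most frequent successor.

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str) -> dict:
    """Count next-token frequencies for each token (whitespace tokenization)."""
    tokens = corpus.split()
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return counts

def predict_next(counts: dict, token: str) -> str:
    """Return the most frequently observed successor of `token`."""
    return counts[token].most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once -> "cat"
```

Real pre-training replaces the frequency table with a transformer that predicts a probability distribution over the entire vocabulary, but the learning signal is the same: what comes next?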

Phase 2: Supervised fine-tuning (SFT)

A pre-trained model can complete text, but it does not naturally follow instructions. During supervised fine-tuning, the model trains on curated instruction-response pairs. For example, a prompt like “Summarize this article in three bullet points” is paired with a high-quality human-written summary. This phase teaches the model the format and style of helpful responses.
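The exact data schema varies from lab to lab, but a hypothetical instruction-response pair stored in the common JSON Lines style (one training example per line) might look like this:

```python
import json

# One hypothetical instruction-response pair; real SFT datasets contain
# many thousands of such curated, human-reviewed examples.
example = {
    "prompt": "Summarize this article in three bullet points:\n<article text>",
    "response": "- Point one\n- Point two\n- Point three",
}

# Serialize to a single JSONL line and read it back.
line = json.dumps(example)
restored = json.loads(line)
print(restored["response"].count("- "))  # 3 bullets, matching the instruction
```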

Phase 3: Reinforcement learning from human feedback

The final phase, RLHF, aligns the model with human preferences. Human annotators rank multiple model outputs for the same prompt, and these rankings train a separate reward model. The LLM is then updated using reinforcement learning to maximize the reward signal, pushing it toward responses that are more helpful, harmless, and honest. Both OpenAI (for GPT-4) and Anthropic (for Claude) use variations of this approach.
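A common formulation for training the reward model (used, for example, in OpenAI's InstructGPT line of work) minimizes a pairwise loss that pushes the score of the human-preferred response above the score of the rejected one. A scalar sketch:

```python
import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)): small when the reward model
    scores the human-preferred response above the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss is small when the reward model agrees with the annotators...
print(pairwise_preference_loss(2.0, -1.0))  # ~0.0486
# ...and large when it prefers the rejected response.
print(pairwise_preference_loss(-1.0, 2.0))  # ~3.0486
```

The LLM is then optimized against this learned reward, which is how ranked human judgments become a differentiable training signal.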

Note: Pre-training a frontier LLM can cost tens of millions of dollars in compute alone. This is why most practitioners work with pre-trained models rather than training from scratch.

The transformer’s self-attention mechanism is the architectural breakthrough that makes all of this feasible. It allows the model to weigh relationships between any two positions in a text sequence, regardless of distance, enabling coherent generation over long passages.
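The core computation is simple enough to sketch directly. The minimal example below (an illustration only: real transformers use learned query/key/value projections and many attention heads in parallel) computes scaled dot-product attention for a single query vector:

```python
import math

def attention(query, keys, values):
    """Single-query scaled dot-product attention:
    softmax(q . k / sqrt(d)) used as weights over the value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over positions
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# The query is most similar to the second key, so the output is
# pulled toward the second value vector.
out = attention([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]])
print(out)
```

Because every position attends to every other position in one step, distance in the sequence imposes no penalty, which is exactly what enables coherent long-range generation.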

The following diagram illustrates how these three phases connect in sequence.

Three-phase LLM training pipeline from pre-training through SFT to RLHF alignment

With the training pipeline established, the natural question becomes: what capabilities does this process actually produce?

Emergent capabilities at scale

Emergent capabilities are abilities that appear only when models reach a sufficient threshold of parameters and training data. They are not explicitly programmed into the model. A smaller model trained with the same objective simply does not exhibit them. This is one of the most surprising findings in modern AI research.

The key emergent behaviors that define today’s LLMs fall into several categories.

  • In-context learning (few-shot prompting): The model learns a new task from a handful of examples placed directly in the prompt, with no weight updates at all. You provide two or three input-output examples, and the model generalizes the pattern to new inputs.

  • Chain-of-thought reasoning: When prompted to “think step by step,” LLMs can solve multi-step math and logic problems that they fail on when asked for a direct answer. The intermediate reasoning steps act as a scaffold for the final output.

  • Code generation and debugging: Models like GPT-4 produce working programs across multiple languages, identify bugs in existing code, and explain their fixes in natural language.

  • Instruction following: After RLHF, models generalize to novel instructions they were never explicitly trained on, handling requests that differ substantially from their fine-tuning data.
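In-context learning requires nothing more than assembling examples into the prompt itself. A minimal sketch, using a hypothetical sentiment-labeling task:

```python
def build_few_shot_prompt(examples, new_input):
    """Assemble a few-shot prompt: worked examples followed by the new case.
    The model infers the task pattern from the examples alone, with no
    weight updates."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {new_input}\nOutput:")
    return "\n\n".join(lines)

prompt = build_few_shot_prompt(
    [("The movie was wonderful", "positive"),
     ("A dull, lifeless script", "negative")],
    "I loved every minute of it",
)
print(prompt)
```

Sending this prompt to a sufficiently large model typically yields the completion "positive"; a small model trained with the same objective generally fails to pick up the pattern.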

A concrete illustration of reasoning emergence is GPT-4 scoring around the 90th percentile on the Uniform Bar Exam. Claude 3 demonstrates similarly strong performance on graduate-level reasoning benchmarks. These results are not the product of memorization; they reflect the model’s ability to compose learned patterns into novel solutions.

Practical tip: When you encounter a task where an LLM seems to fail, try adding “Let’s think step by step” to your prompt. Chain-of-thought prompting often unlocks reasoning capabilities that a direct question does not.

These emergent capabilities are precisely what make LLMs the backbone of modern generative AI applications. But how do they compare to the machine learning models that preceded them?

LLMs vs. traditional ML models

The contrast between LLMs and traditional ML models is not just a matter of scale. It reflects a fundamentally different paradigm for how AI systems are built and deployed. The following table summarizes the key differences across six dimensions.

Traditional ML Models vs. Large Language Models

| Dimension | Traditional ML Models | Large Language Models |
| --- | --- | --- |
| Training Data | Structured, labeled datasets (numerical, categorical, labeled images) | Internet-scale unstructured text corpora (books, articles, websites) |
| Architecture | Task-specific models (SVM, Random Forest, small neural nets) | Transformer architecture with billions of parameters |
| Task Scope | Single, narrowly defined tasks (spam detection, image classification) | Multi-task generalist (text generation, translation, summarization, Q&A) |
| Adaptation Method | Retrain or fine-tune on labeled data | Prompt engineering, few-shot learning, or fine-tuning |
| Compute Requirements | CPU or single GPU | Clusters of GPUs (e.g., AWS p4d.24xlarge instances) |
| Example | scikit-learn classifier | GPT-4, Claude |

Traditional ML models are narrow specialists. A sentiment classifier trained on labeled movie reviews cannot summarize articles or translate languages. Each new task requires a separate cycle of data collection, feature engineering, and model training. LLMs invert this paradigm entirely. A single pre-trained model handles diverse tasks through natural language prompts, eliminating the need to build separate systems for each capability.
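The inversion is easy to see side by side. Below, a deliberately toy keyword classifier stands in for the traditional path (one task, trained or hand-built offline), while the LLM path is nothing but a prompt string assembled at run time:

```python
# Traditional path: a purpose-built classifier that can do exactly one
# thing. (A toy keyword model stands in for a trained model here.)
NEGATIVE_WORDS = {"terrible", "awful", "boring"}

def sentiment_classifier(review: str) -> str:
    """Single-task specialist: sentiment only, nothing else."""
    words = set(review.lower().split())
    return "negative" if words & NEGATIVE_WORDS else "positive"

# LLM path: the same capability (and many others) is requested at run
# time through natural language -- no task-specific training step.
def make_prompt(task: str, text: str) -> str:
    return f"{task}\n\nText: {text}"

print(sentiment_classifier("what a boring film"))  # negative
print(make_prompt(
    "Classify the sentiment of this review as positive or negative.",
    "what a boring film"))
```

Swapping the task instruction in `make_prompt` (summarize, translate, extract entities) requires no new model, which is the paradigm shift the table summarizes.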

This generality comes with trade-offs, however. LLMs require significantly more compute and memory for inference. They can hallucinate: generate text that sounds plausible but is factually incorrect or fabricated, a common failure mode that requires careful mitigation in production systems. They are also harder to interpret than a decision tree or logistic regression model. Monitoring tools can help track data and model quality in production, but the underlying reasoning process remains opaque.

Traditional models remain the better choice when labeled data is abundant, latency must be sub-millisecond, or full explainability is legally required, such as in credit scoring. The key insight is that LLMs and traditional ML are not competitors in every scenario; they occupy different positions on a spectrum of generality vs. specialization.

Attention: Do not assume an LLM is always the right tool. For high-throughput, low-latency classification tasks with clean labeled data, a lightweight scikit-learn model can outperform an LLM in both speed and cost by orders of magnitude.

Understanding this contrast prepares you to evaluate when a general-purpose foundation model is the right choice vs. a purpose-built alternative, which is exactly the focus of the next lesson.

The preceding sections map out the LLM ecosystem covered so far, connecting core concepts from training through emergent capabilities to production deployment.

Real-world deployment considerations

Knowing how LLMs work is only half the picture. Putting them into production introduces a distinct set of engineering challenges that connect directly to the application architectures covered throughout this course.

Deploying models like GPT-4 or Claude typically follows a few distinct paths. The first is direct API-based access through providers like the OpenAI API or Anthropic API, where the provider manages infrastructure and scaling. The second is using a managed service like Amazon Bedrock, which provides API-level simplicity for models like Claude or Llama while keeping data within the secure AWS environment. The third path is self-hosted deployment using platforms like AWS SageMaker JumpStart. This final path gives teams the most control over data privacy, fine-tuning, and latency, but requires managing GPU instances and scaling infrastructure.
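For the API-based path, requests typically follow a chat-messages shape, though exact field names and schemas vary by provider and API version. The sketch below only builds a request body (no network call, no credentials), so the structure is easy to inspect:

```python
import json

def build_chat_request(model: str, system: str, user: str,
                       max_tokens: int = 256) -> str:
    """Build a chat-completion request body in the widely used messages
    format. Field names here are illustrative; consult your provider's
    API reference for the exact schema."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }
    return json.dumps(body)

payload = build_chat_request(
    "gpt-4",
    "You are a concise assistant.",
    "Summarize this article in three bullet points.")
print(json.loads(payload)["messages"][1]["role"])  # user
```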

Practitioners monitor several key metrics in production. Inference latency measures how quickly the model returns a response. Throughput, expressed in tokens per second, determines how many concurrent users the system can serve. Cost per query directly impacts whether an LLM-based feature is economically viable at scale. Responsible AI tooling, such as AWS SageMaker Clarify for bias detection, adds another layer of operational rigor.
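These metrics reduce to simple arithmetic once token counts are tracked. The helpers below use placeholder prices (real per-token rates vary by model and provider and change frequently):

```python
def cost_per_query(prompt_tokens: int, completion_tokens: int,
                   price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate the cost of one request from per-1K-token prices.
    Prices are placeholders; check your provider's current rate card."""
    return (prompt_tokens / 1000 * price_in_per_1k
            + completion_tokens / 1000 * price_out_per_1k)

def throughput_tokens_per_s(completion_tokens: int, latency_s: float) -> float:
    """Generation throughput: tokens produced per second of wall time."""
    return completion_tokens / latency_s

# Hypothetical request: 1,500 prompt tokens, 500 completion tokens,
# $0.01 in / $0.03 out per 1K tokens, 4-second response time.
print(round(cost_per_query(1500, 500, 0.01, 0.03), 4))  # 0.03
print(throughput_tokens_per_s(500, 4.0))                # 125.0
```

At, say, one million such queries per month, the same arithmetic scales the per-query figure into a budget line, which is usually the deciding factor between the API and self-hosted paths.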

Practical tip: If you are prototyping, start with API-based access to avoid infrastructure overhead. Move to self-hosted deployment only when you have clear requirements around data residency, latency, or cost that the API cannot meet.

These deployment realities are what transform an LLM from a research artifact into a production system, and they motivate the architectural patterns you will study next.

The following quiz tests your understanding of the core concepts covered in this lesson.

Lesson Quiz

1. What is the primary training objective during the pre-training phase of an LLM?

   A. Text classification
   B. Next-token prediction
   C. Translation
   D. Document clustering

Conclusion

Large language models are transformer-based neural networks trained on internet-scale text through a pipeline of pre-training, supervised fine-tuning, and RLHF. Their scale gives rise to emergent capabilities, including in-context learning, chain-of-thought reasoning, and code generation, that set them apart from traditional ML models. Traditional approaches remain narrow specialists requiring labeled data and separate engineering for each task, while LLMs serve as multi-task generalists driven by natural language prompts. Real-world examples like GPT-4 and Claude demonstrate these capabilities at production scale. With this foundational understanding in place, the next lesson explores the distinction between foundation models and task-specific models, helping you decide when a general-purpose LLM is the right architectural choice.