How Does ChatGPT Work?
Explore ChatGPT’s inference pipeline—from tokenization to response streaming and conversational context handling.
In AI and ML engineering interviews, it’s common to be asked, “Can you explain how ChatGPT works?” This question probes your understanding of large language models and your ability to articulate complex systems clearly. Interviewers want to see that you grasp the key components of a generative AI system like ChatGPT and can explain the inference-time process: what happens from the moment a user enters a prompt to the moment ChatGPT streams back a response.
Think of your answer as a guided tour of what happens when someone interacts with ChatGPT. In this lesson, we’ll walk through the core components of ChatGPT’s inference process step by step, with analogies and diagrams to help solidify your understanding. By the end, you should have a clear mental model that you can communicate confidently in an interview.
What happens when a user sends a message to ChatGPT?
The first step in ChatGPT’s pipeline is tokenization, which converts the raw text of the user’s prompt into a form the model can understand (numbers). Like other large language models, ChatGPT doesn’t read text character by character or word by word like humans do. Instead, it breaks the input text into sub-word units called tokens. A token is typically a piece of a word (for example, “fantastic” might be broken into tokens like “fan”, “tas”, “tic”), or sometimes a whole word or just a character—it depends on the tokenizer’s rules. Each token is then mapped to a unique numerical ID. This is a fundamental first step for any NLP model. For instance, the sentence “Hello, how are you?” might become a sequence of token IDs like [15496, 11, 703, 389, ...] (illustrative values only; the exact IDs depend on the tokenizer). The model has no inherent understanding of letters or words—only these token IDs and their associated vectors.
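You can see this for yourself with OpenAI’s open-source tiktoken library, which implements the byte-pair-encoding tokenizers used by GPT-family models. The sketch below is illustrative: the exact token IDs and splits depend on which encoding you pick (e.g., "cl100k_base" for GPT-4, "o200k_base" for GPT-4o).

```python
# Minimal tokenization sketch with tiktoken (pip install tiktoken).
# Token IDs shown in comments are examples; exact values depend on the encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Hello, how are you?"
token_ids = enc.encode(text)                        # list of integer token IDs
pieces = [enc.decode([tid]) for tid in token_ids]   # the sub-word piece behind each ID

print(token_ids)                                    # a handful of integers
print(pieces)                                       # e.g., ['Hello', ',', ' how', ' are', ' you', '?']
print(enc.decode(token_ids) == text)                # decoding round-trips back to the original text
```

Notice that common words often map to a single token while rarer words split into several pieces, which is why token counts rarely match word counts.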
Once the user’s prompt is tokenized, it enters the model’s context window along with any other context that needs to be considered, such as previous conversation history or system instructions. The context window refers to the maximum number of tokens the model can handle at once—essentially, the model’s working memory for the conversation. For example, GPT-4o has a context window of about 128k tokens. This means the sum of the input and output tokens can’t exceed that limit. If the conversation or prompt is longer, earlier parts will be truncated or summarized to fit.
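To make the truncation idea concrete, here is a simplified sketch (not OpenAI’s actual context-management logic): the hypothetical fit_to_context helper drops the oldest user/assistant turns until the conversation fits a fixed token budget, while always keeping the system message.

```python
# Simplified illustration of fitting a conversation into a token budget.
# fit_to_context and the 200-token budget are hypothetical, for explanation only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(message: dict) -> int:
    """Rough per-message token count (ignores small per-message formatting overhead)."""
    return len(enc.encode(message["content"]))

def fit_to_context(messages: list[dict], max_tokens: int) -> list[dict]:
    system, history = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system)
    kept = []
    # Walk backwards from the newest message, keeping turns until the budget runs out.
    for msg in reversed(history):
        cost = count_tokens(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize our earlier discussion."},
    {"role": "assistant", "content": "We covered tokenization and context windows."},
    {"role": "user", "content": "What happens if the conversation gets too long?"},
]
print(fit_to_context(conversation, max_tokens=200))
```

Real systems may instead summarize older turns rather than dropping them outright, but the constraint is the same: everything the model sees plus everything it generates must fit inside the context window.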
In a chat setting, the prompt isn’t just the latest user question. It usually consists of a sequence of messages, each with a role (e.g., system, user, assistant). For ChatGPT, a typical prompt construction at inference time might look like this internally:
A system message (hidden from the user) that sets the stage and guidelines. For example: “You are ChatGPT, a large language model trained by OpenAI. Follow the user’s instructions and answer helpfully...” This system message defines the assistant’s persona and boundaries.
The conversation history, alternating between user and assistant messages.