How Does ChatGPT Work?
Explore ChatGPT’s inference pipeline—from tokenization to response streaming and conversational context handling.
In AI and ML engineering interviews, it’s common to be asked, “Can you explain how ChatGPT works?” This question probes your understanding of large language models and your ability to articulate complex systems clearly. Interviewers want to see that you grasp the key components of a generative AI system like ChatGPT and can explain the inference-time process: what happens from the moment a user enters a prompt to the moment ChatGPT streams back a response.
Think of your answer as a guided tour of what happens when someone interacts with ChatGPT. In this lesson, we’ll walk through the core components of ChatGPT’s inference process step by step, with analogies and diagrams to help solidify your understanding. By the end, you should have a clear mental model that you can communicate confidently in an interview.
What happens when a user sends a message to ChatGPT?
The first step in ChatGPT’s pipeline is tokenization, which converts the raw text of the user’s prompt into a form the model can understand (numbers). Like other large language models, ChatGPT doesn’t read text character by character or word by word like humans do. Instead, it breaks the input text into sub-word units called tokens. A token is typically a piece of a word (for example, “fantastic” might be broken into tokens like “fan”, “tas”, “tic”), or sometimes a whole word or just a character—it depends on the tokenizer’s rules. Each token is then mapped to a unique numerical ID. This is a fundamental first step for any NLP model. For instance, the sentence “Hello, how are you?” might become a sequence of token IDs like [15496, 11, 703, 389, ...] (illustrative values only; the exact IDs depend on the tokenizer). The model has no inherent understanding of letters or words—only these token IDs and their associated vectors.
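You can see this for yourself with OpenAI’s open-source tiktoken library, which implements the byte-pair-encoding tokenizers used by GPT-family models. The sketch below is illustrative: the exact token IDs and splits depend on which encoding you pick (e.g., "cl100k_base" for GPT-4, "o200k_base" for GPT-4o).

```python
# Minimal tokenization sketch with tiktoken (pip install tiktoken).
# Token IDs shown in comments are examples; exact values depend on the encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Hello, how are you?"
token_ids = enc.encode(text)                        # list of integer token IDs
pieces = [enc.decode([tid]) for tid in token_ids]   # the sub-word piece behind each ID

print(token_ids)                                    # a handful of integers
print(pieces)                                       # e.g., ['Hello', ',', ' how', ' are', ' you', '?']
print(enc.decode(token_ids) == text)                # decoding round-trips back to the original text
```

Notice that common words often map to a single token while rarer words split into several pieces, which is why token counts rarely match word counts.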
Once the user’s prompt is tokenized, it enters the model’s context window along with any other context that needs to be considered, such as previous conversation history or system instructions. The context window refers to the maximum number of tokens the model can handle at once—essentially, the model’s working memory for the conversation. For example, GPT-4o has a context window of about 128k tokens. This means the sum of the input and output tokens can’t exceed that limit. If the conversation or prompt is longer, earlier parts will be truncated or summarized to fit.
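To make the truncation idea concrete, here is a simplified sketch (not OpenAI’s actual context-management logic): the hypothetical fit_to_context helper drops the oldest user/assistant turns until the conversation fits a fixed token budget, while always keeping the system message.

```python
# Simplified illustration of fitting a conversation into a token budget.
# fit_to_context and the 200-token budget are hypothetical, for explanation only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(message: dict) -> int:
    """Rough per-message token count (ignores small per-message formatting overhead)."""
    return len(enc.encode(message["content"]))

def fit_to_context(messages: list[dict], max_tokens: int) -> list[dict]:
    system, history = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system)
    kept = []
    # Walk backwards from the newest message, keeping turns until the budget runs out.
    for msg in reversed(history):
        cost = count_tokens(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize our earlier discussion."},
    {"role": "assistant", "content": "We covered tokenization and context windows."},
    {"role": "user", "content": "What happens if the conversation gets too long?"},
]
print(fit_to_context(conversation, max_tokens=200))
```

Real systems may instead summarize older turns rather than dropping them outright, but the constraint is the same: everything the model sees plus everything it generates must fit inside the context window.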
In a chat setting, the prompt isn’t just the latest user question. It usually consists of a sequence of messages, each with a role (e.g., system, user, assistant). For ChatGPT, a typical prompt construction at inference time might look like this internally:
A system message (hidden from the user) that sets the stage and guidelines. For example: “You are ChatGPT, a large language model trained by OpenAI. Follow the user’s instructions and answer helpfully...” This system message defines the assistant’s persona and boundaries.
The conversation history, alternating between user and assistant messages.