Training Infrastructure of a Text-to-Text Generation System
Learn to design, train, and evaluate text-to-text LLMs, focusing on requirements, data, distributed training, and performance metrics.
Text-to-text LLMs are a subset of language models. Unlike their predecessors, which were designed primarily for one-shot text generation or translation tasks, conversational LLMs are specifically trained to engage in interactive dialogue. They can understand user input and generate human-like responses, making them ideal for applications like chatbots, virtual assistants, and interactive storytelling.
These are the brains behind those friendly AI assistants you interact with on websites or your smartphone. They are designed to understand your needs (even if you phrase them in a roundabout way) and provide helpful, informative, and often entertaining responses.
Let’s see how we can design our own conversational AI. The first step is defining the requirements to guide the design process.
Requirements
Building the backend for a robust conversational AI system requires careful consideration of both functional and nonfunctional requirements.
Functional requirements
Natural language understanding: The system must decipher the meaning behind user input, including identifying intent, entities, and sentiment. Intent refers to the purpose or goal behind a user’s query (e.g., asking for information or making a request). Entities are specific pieces of information extracted from the input, such as names, locations, or dates. Sentiment is the emotional tone or attitude conveyed in the input, which can range from positive to negative or neutral; recognizing sentiment enables the system to tailor responses appropriately. Imagine asking your AI assistant, “What’s the weather like in London tomorrow?” The system needs to understand that you’re asking about the weather (intent), that “London” is the location (entity), and “tomorrow” is the time (entity).
We can also look at an example of sentiment in a query. For instance, if the user says, “I’m so excited about the sunny weather tomorrow in London!” the system should extract:
Intent: The user is expressing enthusiasm about the weather.
Entities: London is the location, and tomorrow is the time.
Sentiment: The user’s sentiment is positive, as shown by their excitement.
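As an illustrative sketch only (real systems use trained NLU models, not keyword lists), the extraction above can be approximated with a rule-based pass. The keyword sets and labels here are hypothetical:

```python
import re

# Hypothetical keyword lists for a toy rule-based NLU pass.
POSITIVE_WORDS = {"excited", "great", "love", "sunny"}
NEGATIVE_WORDS = {"hate", "awful", "annoyed"}
KNOWN_LOCATIONS = {"london", "paris", "tokyo"}
TIME_WORDS = {"today", "tomorrow", "tonight"}

def analyze(utterance: str) -> dict:
    """Return a rough intent/entities/sentiment result for one utterance."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    entities = {}
    for tok in tokens:
        if tok in KNOWN_LOCATIONS:
            entities["location"] = tok.capitalize()
        if tok in TIME_WORDS:
            entities["time"] = tok
    pos = sum(tok in POSITIVE_WORDS for tok in tokens)
    neg = sum(tok in NEGATIVE_WORDS for tok in tokens)
    sentiment = "positive" if pos > neg else "negative" if neg > pos else "neutral"
    intent = "weather_query" if "weather" in tokens else "unknown"
    return {"intent": intent, "entities": entities, "sentiment": sentiment}

result = analyze("I'm so excited about the sunny weather tomorrow in London!")
```

Running this on the example query yields the same intent, entities, and sentiment discussed above; a production NLU component would replace the keyword lists with learned classifiers and an entity recognizer.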
Dialogue management: The system must effectively manage conversations by retaining relevant information from previous interactions (context retention) and maintaining an awareness of the conversation’s progress (state management). Context retention refers to the ability of the system to store and recall relevant details from earlier in the conversation, such as user preferences, prior topics discussed, or incomplete tasks, to provide coherent and personalized responses. State management is the process of tracking the current state of the dialogue, including the conversation’s flow, user intents, and unresolved queries, to ensure logical progression and appropriate responses. This includes keeping track of user preferences, remembering recent topics, and understanding when to revisit or conclude a topic based on the conversation’s flow.
Natural language generation: Once the system (LLM) understands the user’s input and the conversation context, it must generate an accurate and relevant response to the query.
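A minimal sketch of how context retention and state management might be wired together; the class and field names are illustrative, not taken from any specific framework:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DialogueState:
    """Tracks one conversation's context and flow (illustrative only)."""
    history: list = field(default_factory=list)       # prior (role, text) turns
    preferences: dict = field(default_factory=dict)   # e.g. {"units": "celsius"}
    open_intents: list = field(default_factory=list)  # unresolved user goals

    def add_turn(self, role: str, text: str, intent: Optional[str] = None):
        """Record a turn; an unresolved intent keeps the topic open."""
        self.history.append((role, text))
        if intent and intent not in self.open_intents:
            self.open_intents.append(intent)

    def resolve(self, intent: str):
        """Mark an intent as handled so the dialogue can conclude the topic."""
        if intent in self.open_intents:
            self.open_intents.remove(intent)

state = DialogueState()
state.preferences["units"] = "celsius"
state.add_turn("user", "What's the weather in London tomorrow?", intent="weather_query")
state.add_turn("assistant", "Around 18 degrees Celsius and sunny.")
state.resolve("weather_query")
```

Here, `history` and `preferences` provide context retention, while `open_intents` gives the system a simple way to know which topics are still pending and when one can be concluded.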
Personalization: The system should also be capable of tailoring responses based on user preferences and historical interactions.
Modern conversational bots now include the ability to tailor their responses to each user. For example, we can tell Gemini to remember that our name is ABC, and it will remember that whenever we chat. We will see how LLMs can maintain memory in the next lesson.
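One common way to implement this kind of memory (a simplified sketch, not Gemini’s actual mechanism) is to store remembered facts per user and prepend them to the model prompt:

```python
# Per-user memory store; in production this would be a database.
user_memory = {}

def remember(user_id: str, fact: str):
    """Persist a fact the user asked the assistant to remember."""
    user_memory.setdefault(user_id, []).append(fact)

def build_prompt(user_id: str, query: str) -> str:
    """Prepend remembered facts so the LLM can personalize its reply."""
    facts = user_memory.get(user_id, [])
    memory_block = "\n".join(f"- {f}" for f in facts)
    return (
        f"Known facts about this user:\n{memory_block}\n\n"
        f"User: {query}\nAssistant:"
    )

remember("u42", "The user's name is ABC.")
prompt = build_prompt("u42", "What's my name?")
```

Because the remembered facts travel inside the prompt, the base model itself stays stateless; all personalization lives in the memory store and the prompt-building step.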
Nonfunctional requirements
Low latency: The system should be optimized to minimize latency and provide a seamless conversational experience.
Note: There can be trade-offs between latency and accuracy. For instance, achieving faster responses might mean sacrificing some degree of accuracy, as complex computations or larger models may require more processing time. Balancing latency and accuracy is essential, especially in applications where real-time interaction is critical, yet the accuracy of information remains important.
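To reason about this trade-off concretely, teams typically track percentile latency rather than averages, since a few slow responses dominate the user experience. A small measurement sketch, where the model call is a stand-in stub:

```python
import time
import statistics

def model_call(prompt: str) -> str:
    """Stand-in stub for an LLM inference call."""
    time.sleep(0.001)  # simulate 1 ms of work
    return "response"

# Measure wall-clock latency over repeated calls.
latencies = []
for _ in range(50):
    start = time.perf_counter()
    model_call("hello")
    latencies.append(time.perf_counter() - start)

# 95th-percentile latency: the value 95% of requests beat.
p95 = statistics.quantiles(latencies, n=100)[94]
```

Swapping in a larger, more accurate model would shift this p95 upward, which is exactly the latency/accuracy trade-off described above.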
Scalability: As the user base grows, the system needs to handle the increased demand without compromising performance. This means efficiently processing a large volume of requests concurrently.
Availability: The text generation model should be accessible and operational whenever users need it. This means minimizing downtime and ensuring consistent uptime.
Reliability: The model should give dependable and trustworthy responses, behaving consistently across repeated queries.
Security: Protecting user data and ensuring privacy is paramount. User data typically includes personally identifiable information (PII) and, importantly, the user’s inputs to the system. Strong security measures must be implemented to safeguard this sensitive information.
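One common safeguard, shown here as a simplified regex-based sketch (real systems use dedicated PII-detection services with far broader coverage), is redacting identifiers before user inputs are logged or reused:

```python
import re

# Simplified patterns; real PII detection is far more thorough.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def redact(text: str) -> str:
    """Mask emails and phone numbers before storing user input."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

clean = redact("Contact me at jane@example.com or 555-123-4567.")
```

Redaction at ingestion time limits how much sensitive data can leak through logs or downstream training pipelines.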
Additionally, preventing unauthorized access to and misuse of this data is essential.
Note: User inputs may also be used as training data for the model. Transparency about such practices is critical to maintaining user trust and complying with ethical and legal standards.
With our requirements decided, we can now discuss how to pick a GenAI model that can fulfill our system’s needs.
Model selection
Building a conversational AI requires careful selection of the base language model, balancing capabilities with efficiency and cost-effectiveness. ...