Deploying the System Design of a Text-to-Text Generation System

Understand the System Design for deploying a text-to-text generation model like ChatGPT.

In the previous lesson, we covered the training and evaluation of the Llama 3.2 GenAI model, resulting in a fully trained, production-ready model. With the model prepared, the next critical step is deployment: making it accessible to users at scale. Cloud service providers like AWS, GCP, and Microsoft Azure offer intuitive platforms that simplify deployment for third-party users. However, we will build the system from scratch to gain a solid understanding of the various design decisions while ensuring the system meets its specific demands.

In this lesson, we’ll build the System Design for deploying a text-to-text (conversational) model, beginning with estimating the resources required for system deployment. From there, we’ll explore how different components are integrated into a robust, efficient, and scalable architecture that can support real-world use cases.

Building on the BOTECs (back-of-the-envelope calculations) chapter, we start by estimating the different resources required, including:

  • Storage estimation

  • Inference servers

  • Network bandwidth

Let’s dive into the details of each of the above:

Storage estimation

Storage estimation includes model size, user profile and interaction data, and indexing storage (a short script consolidating these numbers follows this list):

  • Model size: In the previous lesson, we established that we would use a model similar to Llama 3.2 3B to design a text-to-text generation system. For a 3-billion-parameter model at FP16 floating-point precision (2 bytes per parameter), approximately 3 B × 2 bytes = 6 GB of storage is required.

  • User profile data: For storing users’ metadata, assume that each user’s data takes approximately 10 KB. For 100 M users, this translates to:

  100 M users × 10 KB/user = 1 TB

Note: The model’s size (6 GB) is fixed, and the user profile data (1 TB) will remain constant unless the number of users increases. Therefore, we don’t include them in the subsequent storage estimation.

  • User interaction data: If we store each user interaction, the storage required will depend on the size of each interaction. Assume that each user interacts with the system 10 times daily, consuming 2 KB of space per interaction. For 100 M users, this storage requirement per day would be:

  100 M users × 10 interactions/user × 2 KB/interaction = 2 TB per day

  • Indexing storage: We would need additional storage for indexing the user interaction data for fast retrieval. Let’s assume an average storage increase of 25% for indexing.

  So, the total storage requirement for user interaction data per day would be:

  2 TB × 1.25 = 2.5 TB per day

  According to the above estimates, the monthly storage requirement is:

  2.5 TB/day × 30 days = 75 TB per month
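
To keep these numbers easy to revisit, here is a minimal Python sketch of the storage BOTEC above. All constants are the assumptions stated in this lesson (100 M users, 10 KB profiles, 10 interactions of 2 KB per day, 25% indexing overhead, and a 3B-parameter FP16 model), so adjust them to match your own workload.

```python
# Back-of-the-envelope storage estimation for the text-to-text system.
# All figures mirror the assumptions stated above; adjust them as needed.

DAILY_USERS = 100_000_000          # 100 M users (assumed)
PROFILE_KB = 10                    # per-user profile metadata
INTERACTIONS_PER_USER = 10         # interactions per user per day
INTERACTION_KB = 2                 # storage per interaction
INDEX_OVERHEAD = 0.25              # 25% extra storage for indexing
MODEL_PARAMS = 3_000_000_000       # 3B-parameter model
BYTES_PER_PARAM = 2                # FP16 precision

KB = 1_000
GB = 1_000_000_000
TB = 1_000_000_000_000

model_gb = MODEL_PARAMS * BYTES_PER_PARAM / GB
profile_tb = DAILY_USERS * PROFILE_KB * KB / TB
daily_interaction_tb = DAILY_USERS * INTERACTIONS_PER_USER * INTERACTION_KB * KB / TB
daily_total_tb = daily_interaction_tb * (1 + INDEX_OVERHEAD)
monthly_tb = daily_total_tb * 30

print(f"Model size:           {model_gb:.0f} GB")        # ~6 GB
print(f"User profile data:    {profile_tb:.0f} TB")      # ~1 TB
print(f"Interactions per day: {daily_interaction_tb:.0f} TB")  # ~2 TB
print(f"With indexing, daily: {daily_total_tb:.1f} TB")  # ~2.5 TB
print(f"Monthly storage:      {monthly_tb:.0f} TB")      # ~75 TB
```

Running this reproduces the 6 GB, 1 TB, 2.5 TB/day, and 75 TB/month figures derived above.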

Let’s move on to the estimation of inference servers.

Inference servers estimation

For 100 M daily users, each making 10 requests per day, the total number of requests per second (TRPS) is 100 M × 10 / 86,400 ≈ 11,574. Similarly, using our proposed inference formula, an average query’s inference time for a 3B model (for 500 tokens) is approximately 9.6 milliseconds, estimated at FP16 precision on an NVIDIA A100 GPU. Queries per second (QPS) measures the number of queries a server handles per second; according to this estimation, the QPS for an NVIDIA server with an A100 GPU will be 1 / (inference time) = 1 / 9.6 ms ≈ 104, which yields the following number of inference servers:

Number of inference servers = TRPS / QPS = 11,574 / 104 ≈ 112
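
For a quick sanity check, here is a minimal Python sketch of the same server-count arithmetic. The user count, request rate, and 9.6 ms per-query latency are the assumptions stated above, not measured values, so swap in your own numbers before sizing real hardware.

```python
import math

DAILY_USERS = 100_000_000      # 100 M daily users (assumed)
REQUESTS_PER_USER = 10         # requests per user per day
SECONDS_PER_DAY = 86_400
INFERENCE_TIME_S = 0.0096      # ~9.6 ms per 500-token query (A100, FP16, assumed)

trps = DAILY_USERS * REQUESTS_PER_USER / SECONDS_PER_DAY  # total requests per second
qps_per_server = 1 / INFERENCE_TIME_S                     # queries per second per A100 server
servers = math.ceil(trps / qps_per_server)                # round up to whole servers

print(f"TRPS:           {trps:,.0f}")           # ~11,574
print(f"QPS per server: {qps_per_server:.0f}")  # ~104
print(f"Servers needed: {servers}")             # ~112
```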

Bandwidth estimation

To calculate the ingress and egress network bandwidth for 11,574 TRPS, we assume each request is approximately 2 KB and each response is approximately 10 KB. According to the BOTECs, these assumptions give us the following bandwidths:

Ingress bandwidth = 11,574 requests/s × 2 KB ≈ 23.1 MB/s

Egress bandwidth = 11,574 requests/s × 10 KB ≈ 115.7 MB/s
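
The same style of sketch covers the bandwidth estimate; the 2 KB request and 10 KB response sizes are the assumed averages from above.

```python
TRPS = 11_574            # total requests per second (from the estimate above)
REQUEST_KB = 2           # assumed average request size
RESPONSE_KB = 10         # assumed average response size

ingress_mb_s = TRPS * REQUEST_KB / 1_000    # KB/s -> MB/s
egress_mb_s = TRPS * RESPONSE_KB / 1_000

print(f"Ingress bandwidth: {ingress_mb_s:.1f} MB/s")  # ~23.1 MB/s
print(f"Egress bandwidth:  {egress_mb_s:.1f} MB/s")   # ~115.7 MB/s
```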

So, we have the following estimated resources:

  • Storage required (approximately): ...