
Deployment of an Image Captioning System


Understand the system design for deploying an image-captioning system using BLIP-2.


In the previous lesson, we covered the training and evaluation of the BLIP-2 model for image caption generation. Now, we focus on designing the system architecture for deploying this model effectively in a real-world setting.

Unlike text-to-image systems, which generate visual outputs from textual prompts, image captioning requires interpreting rich visual content and translating it into coherent textual descriptions, demanding different architectural considerations, such as efficient image processing pipelines and response-time constraints. Understanding how to deploy such a system is crucial not just for ensuring scalability and reliability, but also for handling the specific challenges that arise when visual input needs to be converted into meaningful text at scale.

Drawing on the back-of-the-envelope calculations (BOTECs) chapter, let’s begin with the resource estimation for the following key areas:

  • Storage estimation

  • Inference server requirements

  • Network bandwidth estimation

Let’s start with the storage estimation:

Storage estimation

In storage estimation, we consider the size of the model files, users’ profiles, interactions, and the additional indexing storage needed to organize and quickly retrieve captions:

  • Model size: For the model we selected in the previous lesson, with approximately 4 billion parameters, we would need around 8 GB of storage when using FP16 (16-bit floating-point) precision.

  • User profile data: Assuming each user’s data takes approximately 10 KB, storing profiles for 100 million users would need 100 M × 10 KB = 1 TB.

Note: The model size (8 GB) and user profile data (1 TB) will stay the same unless the user base grows. So, for now, we won’t include them in the ongoing storage calculations.

  • User interaction data: If we store data for every user interaction, the storage needed will depend on how large each interaction is. Assume that in each interaction, the user uploads a 500×500-pixel image with a size of 250 KB. If each user has 10 interactions per day, then for 100 million users, the daily storage needed would be: 100 M × 10 × 250 KB = 250 TB per day.

  • Indexing storage: We also need extra storage to index the user interaction data for quick retrieval. Let’s assume the indexing adds about 25% more to the total storage.

  So, the total storage requirement for user interaction data per day would be: 250 TB × 1.25 = 312.5 TB per day.

  According to the above estimates, the monthly storage requirement is: 312.5 TB × 30 days = 9,375 TB ≈ 9.4 PB per month.
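The storage estimates above can be reproduced with a short back-of-the-envelope sketch. All numbers mirror the lesson’s assumptions (4 B parameters at FP16, 100 M users, 10 KB profiles, 250 KB images, 10 interactions per day, 25% indexing overhead, 30-day month); the unit constants use decimal prefixes, as in the lesson:

```python
# Back-of-the-envelope storage estimation for the captioning system.
# All inputs are the lesson's stated assumptions, not measured values.
KB, GB, TB = 1e3, 1e9, 1e12  # decimal units

# Model size: 4 B parameters x 2 bytes each (FP16)
model_size_gb = 4e9 * 2 / GB                      # 8.0 GB

# User profiles: 100 M users x 10 KB each
users = 100e6
profile_tb = users * 10 * KB / TB                 # 1.0 TB

# Interactions: 10 uploads/day per user, 250 KB per image
daily_interaction_tb = users * 10 * 250 * KB / TB # 250.0 TB/day

# Indexing adds ~25% on top of interaction storage
daily_total_tb = daily_interaction_tb * 1.25      # 312.5 TB/day

# Monthly requirement over a 30-day month
monthly_pb = daily_total_tb * 30 / 1e3            # ~9.4 PB/month

print(model_size_gb, profile_tb, daily_total_tb, monthly_pb)
```

Note that the fixed costs (the 8 GB model file and 1 TB of profiles) are negligible next to the interaction data, which is why the running calculation tracks only the per-day interaction volume.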

Let’s move on to the estimation of inference servers.

Inference servers estimation

For 100 M daily users each making 10 requests, the total number of requests per second (TRPS) is (100 M × 10) ÷ 86,400 seconds ≈ 11,574 TRPS. Models similar to BLIP-2 may take multiple steps to produce an output; assume 100 iterations. Therefore, using our proposed inference formula, an average query’s inference time for the 4 B BLIP-2 model is approximately 2.56 milliseconds, estimated using FP16 precision on an NVIDIA A100 GPU. According to this estimation, the QPS (queries per second, a metric used in online systems to measure the number of queries a server handles per second) is ...
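The request-rate arithmetic above can be sketched as follows. The server count at the end is an illustrative extrapolation (assuming one A100 GPU per inference server and the lesson’s 2.56 ms per-query estimate), not a figure stated in the lesson:

```python
import math

# Inference-server estimation, using the lesson's assumptions:
# 100 M daily users, 10 requests each, ~2.56 ms per query (FP16, A100).
users = 100e6
requests_per_user_per_day = 10
seconds_per_day = 86_400

# Total requests per second arriving at the system
trps = users * requests_per_user_per_day / seconds_per_day  # ~11,574

# Queries per second a single GPU can serve
inference_time_s = 2.56e-3
qps_per_gpu = 1 / inference_time_s                          # ~390 QPS

# Illustrative: servers needed if each hosts one GPU (assumption)
servers = math.ceil(trps / qps_per_gpu)

print(round(trps), round(qps_per_gpu), servers)
```

In practice, batching multiple requests per forward pass would raise the effective per-GPU QPS well above this single-query figure, so this count is a conservative upper bound under the stated assumptions.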