
Deployment of an Image Captioning System

Explore the deployment of a scalable image captioning system using the BLIP-2 model. Understand storage, inference server, and bandwidth requirements. Learn system components including image processing, model hosting, contextual enhancement, and moderation. Discover how these modules work together to generate accurate, personalized captions while ensuring scalability and ethical standards.

In the previous lesson, we covered the training and evaluation of the BLIP-2 model for image caption generation. Now, we focus on designing the system architecture to effectively deploy this model in a real-world setting.

Unlike text-to-image systems, which generate visual outputs from textual prompts, image captioning requires interpreting rich visual content and translating it into coherent textual descriptions, demanding different architectural considerations, such as efficient image processing pipelines and response-time constraints. Understanding how to deploy such a system is crucial not just for ensuring scalability and reliability, but also for handling the specific challenges that arise when visual input needs to be converted into meaningful text at scale.

Following the approach from the back-of-the-envelope calculations (BOTECs) chapter, let’s begin with resource estimation for the following key areas:

  • Storage estimation

  • Inference server requirements

  • Network bandwidth estimation

Let’s start with the storage estimation:

Storage estimation

In storage estimation, we consider the size of the model files, users’ profiles, interactions, and the additional indexing storage needed to organize and quickly retrieve captions:

  • Model size: For the model we selected in the previous lesson, with approximately 4 billion parameters, we would need around 8 GB of storage when using FP16 (16-bit floating-point) precision: 4 B parameters × 2 bytes ≈ 8 GB.

  • User profile data: Assuming each user’s profile takes approximately 10 KB, storing data for 100 million users would require about 1 TB: 100 M users × 10 KB ≈ 1 TB.

Note: The model size (8 GB) and user profile data (1 TB) will stay the same unless the user base grows. So, for now, we won’t include them in the ongoing storage calculations.

  • User interaction data: If we store data for every user interaction, the storage required will depend on the size of each interaction. Assume the user uploads a 500×500-pixel image of about 250 KB in each interaction. If each user has 10 interactions per day, then for 100 million users, the daily storage needed would be: 100 M users × 10 interactions × 250 KB = 250 TB per day.

  • Indexing storage: We also require additional storage to index user interaction data for quick retrieval. Let’s assume the indexing adds about 25% more to the total storage.

  So, the total storage requirement for user interactions per day would be: 250 TB × 1.25 = 312.5 TB per day.

  According to the above estimates, the monthly storage requirement is: 312.5 TB/day × 30 days ≈ 9.4 PB per month.
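The short Python sketch below reproduces these back-of-the-envelope storage numbers. The constants simply mirror the assumptions stated above (4 B parameters, FP16, 100 M users, 10 KB profiles, 250 KB images, 10 interactions per day, 25% indexing overhead); it is an illustration of the arithmetic, not part of the deployed system.

```python
# Back-of-the-envelope storage estimation for the captioning system.
PARAMS = 4e9                 # BLIP-2 model: ~4 billion parameters
BYTES_PER_PARAM_FP16 = 2     # FP16 precision: 2 bytes per parameter
USERS = 100e6                # 100 million daily users
PROFILE_KB = 10              # per-user profile size
INTERACTIONS_PER_DAY = 10    # interactions per user per day
IMAGE_KB = 250               # one 500x500 image, ~250 KB
INDEX_OVERHEAD = 0.25        # indexing adds ~25% to interaction data

KB, TB, PB = 1e3, 1e12, 1e15  # decimal units, in bytes

model_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1e9
profile_tb = USERS * PROFILE_KB * KB / TB
daily_interaction_tb = USERS * INTERACTIONS_PER_DAY * IMAGE_KB * KB / TB
daily_total_tb = daily_interaction_tb * (1 + INDEX_OVERHEAD)
monthly_pb = daily_total_tb * 30 * TB / PB

print(f"Model size:           {model_gb:.0f} GB")          # ~8 GB
print(f"User profiles:        {profile_tb:.0f} TB")         # ~1 TB
print(f"Interaction data/day: {daily_interaction_tb:.0f} TB")  # ~250 TB
print(f"With indexing/day:    {daily_total_tb:.1f} TB")     # ~312.5 TB
print(f"Monthly storage:      {monthly_pb:.1f} PB")         # ~9.4 PB
```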

Let’s move on to the estimation of inference servers.

Inference server estimation

For 100 M daily users, with each user making 10 requests per day, the total number of requests per second (TRPS) is (100 M × 10) ÷ 86,400 seconds ≈ 11,574 TRPS. Models similar to BLIP-2 may require multiple decoding steps to produce the output; assume 100 iterations per query. Therefore, using our proposed inference formula, an average query’s inference time for the 4 B BLIP-2 model is approximately 2.56 milliseconds. This time is estimated using FP16 precision on an NVIDIA A100 GPU. According to this estimation, the QPS (queries per second, i.e., the number of queries a server can handle per second) for an NVIDIA server with an A100 GPU will ...
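As a rough illustration of how the server estimate follows from these figures, the sketch below derives the per-GPU QPS and a GPU count from the stated assumptions. Note that treating each A100 as serving one query at a time at the quoted 2.56 ms latency is a simplifying assumption made here for the sketch; the resulting per-GPU QPS and GPU count are illustrative, not figures from the lesson.

```python
import math

# Assumptions taken from the estimation above.
USERS = 100e6                   # 100 million daily users
REQUESTS_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 86_400
QUERY_LATENCY_S = 2.56e-3       # ~2.56 ms per query (FP16 on an A100)

trps = USERS * REQUESTS_PER_USER_PER_DAY / SECONDS_PER_DAY
qps_per_gpu = 1 / QUERY_LATENCY_S            # assumes one query at a time per GPU (illustrative)
gpus_needed = math.ceil(trps / qps_per_gpu)  # rough count, ignoring batching and peak-load headroom

print(f"Total requests per second (TRPS): {trps:,.0f}")       # ~11,574
print(f"QPS per A100 GPU (assumed):       {qps_per_gpu:,.0f}")  # ~391
print(f"A100 GPUs needed (rough):         {gpus_needed}")       # ~30
```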