
Deployment of an Image Captioning System

Explore the deployment of a scalable image captioning system using the BLIP-2 model. Understand storage, inference server, and bandwidth requirements. Learn system components including image processing, model hosting, contextual enhancement, and moderation. Discover how these modules work together to generate accurate, personalized captions while ensuring scalability and ethical standards.

In the previous lesson, we covered the training and evaluation of the BLIP-2 model for image caption generation. Now, we focus on designing the system architecture to effectively deploy this model in a real-world setting.

Unlike text-to-image systems, which generate visual outputs from textual prompts, image captioning requires interpreting rich visual content and translating it into coherent textual descriptions, demanding different architectural considerations, such as efficient image processing pipelines and response-time constraints. Understanding how to deploy such a system is crucial not just for ensuring scalability and reliability, but also for handling the specific challenges that arise when visual input needs to be converted into meaningful text at scale.

Following the approach from the back-of-the-envelope calculations (BOTECs) chapter, let’s begin with resource estimation for the following key areas:

  • Storage estimation

  • Inference server requirements

  • Network bandwidth estimation

Let’s start with the storage estimation:

Storage estimation

In storage estimation, we consider the size of the model files, users’ profiles, interactions, and the additional indexing storage needed to organize and quickly retrieve captions:

  • Model size: For the model we selected in the previous lesson, with approximately 4 billion parameters, we would need around 8 GB of storage when using FP16 (16-bit floating-point) precision: 4 B parameters × 2 bytes ≈ 8 GB.

  • User profile data: Assuming each user’s profile takes approximately 10 KB, storing data for 100 million users would require about 1 TB: 100 M users × 10 KB ≈ 1 TB.

Note: The model size (8 GB) and user profile data (1 TB) will stay the same unless the user base grows. So, for now, we won’t include them in the ongoing storage calculations.

  • User interaction data: If we store data for every user interaction, the storage required will depend on the size of each interaction. Assume the user uploads a 500×500-pixel image of about 250 KB in each interaction. If each user has 10 interactions per day, then for 100 million users, the daily storage needed would be: 100 M users × 10 interactions × 250 KB = 250 TB per day.

  • Indexing storage: We also require additional storage to index user interaction data for quick retrieval. Let’s assume the indexing adds about 25% more to the total storage.

  So, the total storage requirement for user interactions per day would be: 250 TB × 1.25 = 312.5 TB per day.

  According to the above estimates, the monthly storage requirement is: 312.5 TB/day × 30 days ≈ 9.4 PB per month.
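The short Python sketch below reproduces these back-of-the-envelope storage numbers. The constants simply mirror the assumptions stated above (4 B parameters, FP16, 100 M users, 10 KB profiles, 250 KB images, 10 interactions per day, 25% indexing overhead); it is an illustration of the arithmetic, not part of the deployed system.

```python
# Back-of-the-envelope storage estimation for the captioning system.
PARAMS = 4e9                 # BLIP-2 model: ~4 billion parameters
BYTES_PER_PARAM_FP16 = 2     # FP16 precision: 2 bytes per parameter
USERS = 100e6                # 100 million daily users
PROFILE_KB = 10              # per-user profile size
INTERACTIONS_PER_DAY = 10    # interactions per user per day
IMAGE_KB = 250               # one 500x500 image, ~250 KB
INDEX_OVERHEAD = 0.25        # indexing adds ~25% to interaction data

KB, TB, PB = 1e3, 1e12, 1e15  # decimal units, in bytes

model_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1e9
profile_tb = USERS * PROFILE_KB * KB / TB
daily_interaction_tb = USERS * INTERACTIONS_PER_DAY * IMAGE_KB * KB / TB
daily_total_tb = daily_interaction_tb * (1 + INDEX_OVERHEAD)
monthly_pb = daily_total_tb * 30 * TB / PB

print(f"Model size:           {model_gb:.0f} GB")          # ~8 GB
print(f"User profiles:        {profile_tb:.0f} TB")         # ~1 TB
print(f"Interaction data/day: {daily_interaction_tb:.0f} TB")  # ~250 TB
print(f"With indexing/day:    {daily_total_tb:.1f} TB")     # ~312.5 TB
print(f"Monthly storage:      {monthly_pb:.1f} PB")         # ~9.4 PB
```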

Let’s move on to the estimation of inference servers.

Inference server estimation

For 100 M daily users, with each user making 10 requests per day, the total number of requests per second (TRPS) is (100 M × 10) ÷ 86,400 seconds ≈ 11,574 TRPS. Models similar to BLIP-2 may require multiple decoding steps to produce the output; assume 100 iterations per query. Therefore, using our proposed inference formula, an average query’s inference time for the 4 B BLIP-2 model is approximately 2.56 milliseconds. This time is estimated using FP16 precision on an NVIDIA A100 GPU. According to this estimation, the QPS (queries per second, i.e., the number of queries a server can handle per second) for an NVIDIA server with an A100 GPU will ...
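As a rough illustration of how the server estimate follows from these figures, the sketch below derives the per-GPU QPS and a GPU count from the stated assumptions. Note that treating each A100 as serving one query at a time at the quoted 2.56 ms latency is a simplifying assumption made here for the sketch; the resulting per-GPU QPS and GPU count are illustrative, not figures from the lesson.

```python
import math

# Assumptions taken from the estimation above.
USERS = 100e6                   # 100 million daily users
REQUESTS_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 86_400
QUERY_LATENCY_S = 2.56e-3       # ~2.56 ms per query (FP16 on an A100)

trps = USERS * REQUESTS_PER_USER_PER_DAY / SECONDS_PER_DAY
qps_per_gpu = 1 / QUERY_LATENCY_S            # assumes one query at a time per GPU (illustrative)
gpus_needed = math.ceil(trps / qps_per_gpu)  # rough count, ignoring batching and peak-load headroom

print(f"Total requests per second (TRPS): {trps:,.0f}")       # ~11,574
print(f"QPS per A100 GPU (assumed):       {qps_per_gpu:,.0f}")  # ~391
print(f"A100 GPUs needed (rough):         {gpus_needed}")       # ~30
```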