Building the Multimodal Server

Learn how to build a multimodal MCP server that uses a Gemini vision model to analyze an image and provide a detailed text description.

With our architectural blueprint defined, we can now begin building the first and most crucial sensory component of our “Image Research Assistant.” This lesson is dedicated to constructing the VisualAnalysisServer, a specialized MCP server that will act as our agent’s eyes. By the end of this lesson, we will have a self-contained, reusable module that can take any image and transform it into a rich, textual description, providing the foundational input for our agent’s research workflow.

Building the VisualAnalysisServer

We will now build the first of our two specialized workers: the VisualAnalysisServer. Since our initial implementation will be a command-line interface, our agent will need to work with a file path provided by the user. To handle this, we will build two distinct tools. The first, load_image_from_path, will be responsible for reading the image file from the path and preparing it for analysis. The second, get_image_description, will perform the actual visual analysis using the Gemini model. Before we write the core logic for the server and its tools, our first step is to prepare the development environment by installing the necessary Python libraries.
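Conceptually, the two tools form a small pipeline: the loader turns a file path into a payload dictionary, and the describer turns that payload into text. The sketch below illustrates that flow with stub stand-ins (the stub bodies are placeholders, not the real implementations we build in this lesson; the real get_image_description will call Gemini):

```python
import base64

# Stub stand-in: pretends to read a file and package it for transmission.
def load_image_from_path(file_path: str) -> dict:
    data = b"fake-image-bytes"  # placeholder; the real tool reads from disk
    return {
        "base64_image_string": base64.b64encode(data).decode("utf-8"),
        "mime_type": "image/png",
    }

# Stub stand-in: the real tool will send the payload to the Gemini model.
def get_image_description(base64_image_string: str, mime_type: str) -> str:
    size = len(base64.b64decode(base64_image_string))
    return f"A {mime_type} image ({size} bytes)."

# The agent's workflow chains the two tools together.
def run_pipeline(file_path: str) -> str:
    payload = load_image_from_path(file_path)
    if "error" in payload:
        return payload["error"]
    return get_image_description(payload["base64_image_string"], payload["mime_type"])

print(run_pipeline("photo.png"))  # A image/png image (16 bytes).
```

The key design point is the data contract between the tools: a dictionary with base64_image_string and mime_type keys on success, or an error key on failure.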


Setting up the server environment

Our vision server relies on the following library to interact with the Gemini vision model. We can install it using pip.

pip install google-genai
Installing the required libraries

The google-genai library is the Python SDK we will use to interact with Google’s Gemini models for image analysis.

Note: You don’t need to worry about these installations in the course. We have already set up the environment for you. You can focus directly on writing and executing the code. We will begin by implementing the CLI-based logic and then proceed to develop the application’s user interface.

Implementing the load_image_from_path tool

Our first tool, load_image_from_path, serves as an important pre-processing step in our workflow. Its sole responsibility is to handle the direct interaction with the filesystem. This tool takes a simple file path as input, reads the corresponding image data, and transforms that data into a standardized format suitable for API transmission: a Base64-encoded string paired with its correct MIME type.
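As a quick standalone illustration of that transformation, here is the Base64 encoding and MIME-type lookup in isolation (the file name and bytes are illustrative; the bytes shown are the standard 8-byte PNG signature):

```python
import base64
import mimetypes

raw = b"\x89PNG\r\n\x1a\n"  # the first eight bytes of any PNG file

# Binary data -> ASCII-safe Base64 string, ready to embed in a JSON API request
encoded = base64.b64encode(raw).decode("utf-8")
print(encoded)  # iVBORw0KGgo=

# MIME type is guessed from the file extension, not the file contents
mime, _ = mimetypes.guess_type("photo.png")
print(mime)  # image/png
```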

import base64
import mimetypes
from pathlib import Path

from mcp.server.fastmcp import FastMCP

# Create the MCP server instance that will expose our tools
mcp = FastMCP("VisualAnalysisServer")

@mcp.tool()
def load_image_from_path(file_path: str) -> dict:
    """
    Loads an image from a server-accessible file path, encodes it to Base64,
    and determines its MIME type.

    Args:
        file_path: The absolute path to the image file, which must be
            accessible by the server running this tool.

    Returns:
        A dictionary containing the 'base64_image_string' and 'mime_type',
        or an 'error' key if loading fails.
    """
    try:
        image_path = Path(file_path)
        if not image_path.is_file():
            return {"error": f"File not found at path: {file_path}"}

        # Open the file in binary read mode
        with open(image_path, "rb") as f:
            image_data = f.read()

        # Encode the binary data to a Base64 string
        base64_string = base64.b64encode(image_data).decode("utf-8")

        # Guess the MIME type from the file extension
        mime_type, _ = mimetypes.guess_type(image_path)
        if not mime_type:
            mime_type = "application/octet-stream"  # A generic default

        return {
            "base64_image_string": base64_string,
            "mime_type": mime_type,
        }
    except FileNotFoundError:
        return {"error": f"File not found at path: {file_path}"}
    except Exception as e:
        return {"error": f"An unexpected error occurred while loading the image: {str(e)}"}
The load_image_from_path tool

Explanation

Let’s understand the above code:

  • ...
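To check the tool's behavior end to end, we can run the same logic without the MCP decorator against a temporary file. The file name and placeholder bytes below are illustrative only; note that the tool never validates the image contents, it simply reads and encodes whatever bytes are on disk:

```python
import base64
import mimetypes
import tempfile
from pathlib import Path

# Same logic as the tool body, minus the @mcp.tool() decorator, so we can call it directly.
def load_image_from_path(file_path: str) -> dict:
    image_path = Path(file_path)
    if not image_path.is_file():
        return {"error": f"File not found at path: {file_path}"}
    image_data = image_path.read_bytes()
    mime_type, _ = mimetypes.guess_type(image_path)
    return {
        "base64_image_string": base64.b64encode(image_data).decode("utf-8"),
        "mime_type": mime_type or "application/octet-stream",
    }

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.png"
    sample.write_bytes(b"\x89PNG\r\n\x1a\n")  # placeholder bytes, not a full image
    result = load_image_from_path(str(sample))
    print(sorted(result))          # ['base64_image_string', 'mime_type']
    print(result["mime_type"])     # image/png
    # Decoding the string recovers the original bytes exactly
    assert base64.b64decode(result["base64_image_string"]) == b"\x89PNG\r\n\x1a\n"

# A bad path returns an error dict instead of raising
print(load_image_from_path("no_such_file_xyz.png"))
```

Returning an error dictionary rather than raising an exception matters here: it gives the calling agent structured information it can reason about and relay to the user.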