Building the Multimodal Server

Learn how to build a multimodal MCP server that uses a Gemini vision model to analyze an image and provide a detailed text description.

With our architectural blueprint defined, we can now begin building the first and most crucial sensory component of our “Image Research Assistant.” This lesson is dedicated to constructing the VisualAnalysisServer, a specialized MCP server that will act as our agent’s eyes. By the end of this lesson, we will have a self-contained, reusable module that can take any image and transform it into a rich, textual description, providing the foundational input for our agent’s research workflow.

Building the VisualAnalysisServer

We will now build the first of our two specialized workers: the VisualAnalysisServer. Since our initial implementation will be a command-line interface, our agent will need to work with a file path provided by the user. To handle this, we will build two distinct tools. The first, load_image_from_path, will be responsible for reading the image file from the path and preparing it for analysis. The second, get_image_description, will perform the actual visual analysis using the Gemini model. Before we write the core logic for the server and its tools, our first step is to prepare the development environment by installing the necessary Python libraries.
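Conceptually, the two tools form a small pipeline: the loader turns a file path into a payload dictionary, and the describer turns that payload into text. The sketch below illustrates that flow with stub stand-ins (the stub bodies are placeholders, not the real implementations we build in this lesson; the real get_image_description will call Gemini):

```python
import base64

# Stub stand-in: pretends to read a file and package it for transmission.
def load_image_from_path(file_path: str) -> dict:
    data = b"fake-image-bytes"  # placeholder; the real tool reads from disk
    return {
        "base64_image_string": base64.b64encode(data).decode("utf-8"),
        "mime_type": "image/png",
    }

# Stub stand-in: the real tool will send the payload to the Gemini model.
def get_image_description(base64_image_string: str, mime_type: str) -> str:
    size = len(base64.b64decode(base64_image_string))
    return f"A {mime_type} image ({size} bytes)."

# The agent's workflow chains the two tools together.
def run_pipeline(file_path: str) -> str:
    payload = load_image_from_path(file_path)
    if "error" in payload:
        return payload["error"]
    return get_image_description(payload["base64_image_string"], payload["mime_type"])

print(run_pipeline("photo.png"))  # A image/png image (16 bytes).
```

The key design point is the data contract between the tools: a dictionary with base64_image_string and mime_type keys on success, or an error key on failure.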


Setting up the server environment

Our vision server relies on the following library to interact with the Gemini vision model. We can install it using pip.

pip install google-genai
Installing the required libraries

The google-genai library is the Python SDK we will use to interact with Google’s Gemini models for image analysis.

Note: You don’t need to worry about these installations in the course. We have already set up the environment for you. You can focus directly on writing and executing the code. We will begin by implementing the CLI-based logic and then proceed to develop the application’s user interface.

Implementing the load_image_from_path tool

Our first tool, load_image_from_path, serves as an important pre-processing step in our workflow. Its sole responsibility is to handle the direct interaction with the filesystem. This tool takes a simple file path as input, reads the corresponding image data, and transforms that data into a standardized format suitable for API transmission: a Base64-encoded string paired with its correct MIME type.
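As a quick standalone illustration of that transformation, here is the Base64 encoding and MIME-type lookup in isolation (the file name and bytes are illustrative; the bytes shown are the standard 8-byte PNG signature):

```python
import base64
import mimetypes

raw = b"\x89PNG\r\n\x1a\n"  # the first eight bytes of any PNG file

# Binary data -> ASCII-safe Base64 string, ready to embed in a JSON API request
encoded = base64.b64encode(raw).decode("utf-8")
print(encoded)  # iVBORw0KGgo=

# MIME type is guessed from the file extension, not the file contents
mime, _ = mimetypes.guess_type("photo.png")
print(mime)  # image/png
```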

import base64
import mimetypes
from pathlib import Path

from mcp.server.fastmcp import FastMCP

# Create the MCP server instance that will expose our tools
mcp = FastMCP("VisualAnalysisServer")

@mcp.tool()
def load_image_from_path(file_path: str) -> dict:
    """
    Loads an image from a server-accessible file path, encodes it to Base64,
    and determines its MIME type.

    Args:
        file_path: The absolute path to the image file, which must be
            accessible by the server running this tool.

    Returns:
        A dictionary containing the 'base64_image_string' and 'mime_type',
        or an 'error' key if loading fails.
    """
    try:
        image_path = Path(file_path)
        if not image_path.is_file():
            return {"error": f"File not found at path: {file_path}"}

        # Open the file in binary read mode
        with open(image_path, "rb") as f:
            image_data = f.read()

        # Encode the binary data to a Base64 string
        base64_string = base64.b64encode(image_data).decode("utf-8")

        # Guess the MIME type from the file extension
        mime_type, _ = mimetypes.guess_type(image_path)
        if not mime_type:
            mime_type = "application/octet-stream"  # A generic default

        return {
            "base64_image_string": base64_string,
            "mime_type": mime_type,
        }
    except FileNotFoundError:
        return {"error": f"File not found at path: {file_path}"}
    except Exception as e:
        return {"error": f"An unexpected error occurred while loading the image: {str(e)}"}
The load_image_from_path tool

Explanation

Let’s understand the above code:

  • ...
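To check the tool's behavior end to end, we can run the same logic without the MCP decorator against a temporary file. The file name and placeholder bytes below are illustrative only; note that the tool never validates the image contents, it simply reads and encodes whatever bytes are on disk:

```python
import base64
import mimetypes
import tempfile
from pathlib import Path

# Same logic as the tool body, minus the @mcp.tool() decorator, so we can call it directly.
def load_image_from_path(file_path: str) -> dict:
    image_path = Path(file_path)
    if not image_path.is_file():
        return {"error": f"File not found at path: {file_path}"}
    image_data = image_path.read_bytes()
    mime_type, _ = mimetypes.guess_type(image_path)
    return {
        "base64_image_string": base64.b64encode(image_data).decode("utf-8"),
        "mime_type": mime_type or "application/octet-stream",
    }

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.png"
    sample.write_bytes(b"\x89PNG\r\n\x1a\n")  # placeholder bytes, not a full image
    result = load_image_from_path(str(sample))
    print(sorted(result))          # ['base64_image_string', 'mime_type']
    print(result["mime_type"])     # image/png
    # Decoding the string recovers the original bytes exactly
    assert base64.b64decode(result["base64_image_string"]) == b"\x89PNG\r\n\x1a\n"

# A bad path returns an error dict instead of raising
print(load_image_from_path("no_such_file_xyz.png"))
```

Returning an error dictionary rather than raising an exception matters here: it gives the calling agent structured information it can reason about and relay to the user.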