The Problem Statement and Project Blueprint

Learn how to design a multimodal AI assistant by creating an architectural blueprint that orchestrates vision and research servers using MCP.

Our agent has successfully mastered text-based tools, skillfully interacting with web APIs and private knowledge bases to perform complex tasks. However, a vast amount of information in our world is visual, locked away in images and diagrams. In this module, we will cross that frontier by building our first multimodal application: an intelligent “Image Research Assistant” that demonstrates how MCP can orchestrate two completely different types of intelligence, vision and text retrieval, to solve a single, complex problem.
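Before diving into the details, the orchestration pattern can be sketched at a high level. The snippet below is a minimal, hypothetical illustration, not the real MCP SDK: the two functions stand in for tools that a vision server and a research server would expose, and the landmark name and return values are placeholder examples.

```python
# Hypothetical sketch of the two-step orchestration the assistant performs.
# These functions are stand-ins for MCP tool calls, not actual SDK APIs.

def identify_landmark(image_path: str) -> str:
    """Stand-in for a vision server tool that names the landmark in an image."""
    # A real vision model would infer this from the pixels; the value is a placeholder.
    return "Neuschwanstein Castle"

def fetch_background(landmark: str) -> str:
    """Stand-in for a research server tool that retrieves textual context."""
    # A real implementation would query Wikipedia or a knowledge base.
    return f"{landmark}: a 19th-century palace in Bavaria, Germany."

def research_image(image_path: str) -> str:
    # Step 1: vision intelligence turns pixels into a name.
    name = identify_landmark(image_path)
    # Step 2: text-retrieval intelligence turns the name into context.
    return fetch_background(name)

print(research_image("archive/photo_017.jpg"))
```

The key design point is the hand-off: the vision step's output (a name) becomes the research step's input, letting two independent servers cooperate on one query.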

The problem statement

Imagine a researcher is browsing a digital archive and finds an intriguing photograph of a grand, historic building. The image is captivating, but it lacks any context. There’s no caption, no metadata, nothing to identify the structure or its location. The researcher is left with several fundamental questions: What is this building? Where is it located? And what is its historical significance?

To answer these questions, the researcher would have to embark on a disjointed, manual workflow. First, they might use a reverse-image search tool in the hope of identifying the landmark. Then, armed with a name, they would switch to a new browser tab to search Wikipedia or a search engine for articles ...