
From Design to a Running Multimodal Web Agent

Explore how to transition from designing a multimodal web agent to implementing it with Google ADK. This lesson shows you how the agent navigates real web tasks, manages browser state with Playwright, and uses visual and textual tools to observe, act, and extract information. You'll learn to read the project structure, inspect logs, and verify the agent's actions for reliability and transparency.

In the previous chapter, we designed the architecture of a multimodal web agent: how it observes pages, chooses actions, and stays grounded while interacting with the web. But architecture diagrams alone are not enough. To understand whether a design actually works, we need to study the implementation, run the agent on real websites, and inspect what happens step by step.

This lesson is the starting point for that transition from design to code. Before diving into individual functions and components, we will first build a practical mental map of the project. You will watch the agent perform a real task, explore the main folders and files in the repository, and learn which outputs to inspect after a run for debugging and verification.

By the end of this lesson, you should be able to:

  • Describe, at a high level, what happens during a demo run: observations, tool use, and final response.

  • Read the project tree and connect each major path to its specific responsibility.

  • List the files you would open after a run to replay what happened.
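
To make the first objective concrete, a demo run can be pictured as an ordered trace of events. The sketch below is only an illustration: the `Event` class, the three kind names, and the sample trace are hypothetical and are not taken from the project, whose logs may be structured differently.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical event kinds; the real project may name or structure these differently.
Kind = Literal["observation", "tool_call", "final_response"]


@dataclass
class Event:
    kind: Kind
    detail: str


def summarize(trace: list[Event]) -> dict[str, int]:
    """Count how many events of each kind a run produced."""
    counts = {"observation": 0, "tool_call": 0, "final_response": 0}
    for event in trace:
        counts[event.kind] += 1
    return counts


# An invented trace, for illustration only (not output from a real run).
demo_trace = [
    Event("observation", "page screenshot and visible text captured"),
    Event("tool_call", "click on a search box"),
    Event("observation", "results page captured"),
    Event("final_response", "answer returned to the user"),
]

print(summarize(demo_trace))
```

Reading a run as a sequence like this gives you a simple question to ask of any log file: which kinds of events appear, and in what order?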

Pause and reflect: Before you continue, write one sentence: what would you want to see in a log file to convince yourself the agent "really looked" at the page before clicking?

Demo: Watch the agent complete a real task

To build a mental model, we will begin with one fixed task prompt so we can consistently compare the agent's intent and execution.

Task prompt used for the demo: Go to LinkedIn, find the company "Google," and tell me the most recent post they made. Extract the complete post content and the URL.

...