Building Multimodal Web Tools
Explore how to build a multimodal web agent by integrating visual labeling and DOM interaction tools in Python using Google ADK and Playwright. Learn to annotate web elements with numeric labels, perform safe clicks using CSS selectors or fallback coordinates, and fuse DOM data with vision-language model insights to enhance agent perception and action capabilities on dynamic websites.
In the previous lesson, we set up browser_runtime.py to maintain a persistent Playwright session. However, a browser alone does not make an agent. The agent needs "eyes" to see the page and "hands" to interact with it. In this lesson, we will explore src/tools/web_tools.py. This file contains the actual Python functions (tools) exposed to the Large Language Model. We will look at three critical functionalities:
Annotating the screen: Injecting numeric labels onto interactive UI elements so the Vision-Language Model (VLM) can reference them.
Executing actions: Translating a tool call like click_element("5") into physical coordinates or CSS selector clicks.
Multimodal fusion: Combining raw DOM data with VLM visual analysis into a single observation payload for the LLM.
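To make the "executing actions" step concrete, here is a minimal sketch of a selector-first click with a coordinate fallback. It is not the code in web_tools.py: the LabeledElement record, the click_labeled_element helper, and the 2000 ms timeout are assumptions made for illustration. The only Playwright calls it relies on are page.click(selector, timeout=...) and page.mouse.click(x, y) from the sync API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class LabeledElement:
    """Hypothetical record saved for each numeric label at annotation time."""
    selector: Optional[str]  # CSS selector, if one could be derived
    x: float                 # bounding box left edge, in viewport pixels
    y: float                 # bounding box top edge, in viewport pixels
    width: float
    height: float


def click_labeled_element(
    page: Any, label: str, elements: dict[str, LabeledElement]
) -> str:
    """Click the element behind a numeric label.

    Tries the stored CSS selector first, then falls back to clicking the
    center of the stored bounding box. Returns a short status string that
    can be handed back to the LLM as the tool result.
    """
    element = elements.get(label)
    if element is None:
        return f"error: no element with label {label}"

    if element.selector:
        try:
            # Assumed timeout; a short one keeps the agent loop responsive.
            page.click(element.selector, timeout=2000)
            return f"clicked [{label}] via selector"
        except Exception:
            # The selector may be stale or the element obscured; fall through.
            pass

    center_x = element.x + element.width / 2
    center_y = element.y + element.height / 2
    page.mouse.click(center_x, center_y)
    return f"clicked [{label}] via coordinates"
```

Returning a status string instead of raising lets the model see which path was taken and decide whether to re-annotate the page before its next action.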
By the end of this lesson, you should be able to:
Explain how numeric labels are drawn over web elements using bounding box data.
Understand the fallback mechanism (Selector vs. Coordinates) used when clicking an element.
Describe the structure of a fused multimodal observation.
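As a preview of what a fused multimodal observation might look like, the sketch below bundles DOM-derived element data and a VLM description into one JSON payload. The field names (url, title, elements, visual_summary), the build_observation helper, and the 50-element cap are assumptions for illustration, not the schema used in web_tools.py.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DomElement:
    """One labeled element as read from the DOM (hypothetical fields)."""
    label: int
    tag: str
    text: str


@dataclass
class FusedObservation:
    """DOM facts and VLM analysis combined into a single observation."""
    url: str
    title: str
    elements: list[DomElement]
    visual_summary: str  # free-text description returned by the VLM


def build_observation(
    url: str,
    title: str,
    elements: list[DomElement],
    visual_summary: str,
    max_elements: int = 50,
) -> str:
    """Serialize the fused observation to JSON for the LLM.

    The element list is truncated (assumed cap) so a very large page
    does not crowd out the rest of the prompt.
    """
    observation = FusedObservation(
        url=url,
        title=title,
        elements=elements[:max_elements],
        visual_summary=visual_summary,
    )
    return json.dumps(asdict(observation), indent=2)
```

Keeping the DOM list and the visual summary side by side lets the model cross-check them: the summary can mention "the blue Search button" while the element list tells it which numeric label to pass to the click tool.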
Pause and reflect: If a web page dynamically changes a button's CSS class every time you reload, why is it safer for an LLM to interact with it using a numeric bounding-box ID rather than generating its own CSS selector?
Annotating the screenshot
To give the LLM a WebVoyager-style view of the page, we cannot just send a raw screenshot. We must label candidate elements (links, buttons, inputs) with IDs like [0], [1], [2].
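The labeling idea can be sketched in plain Python before looking at the real implementation. The example below assumes each candidate element has already been reduced to a bounding box; it skips zero-size boxes and assigns sequential IDs in the [0], [1], [2] style. The Box type, the assign_labels helper, and the min_size threshold are hypothetical names, and the actual drawing of the labels onto the page is left out.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box of a candidate element, in viewport pixels."""
    x: float
    y: float
    width: float
    height: float


def assign_labels(boxes: list[Box], min_size: float = 1.0) -> list[dict]:
    """Assign sequential numeric labels to visible bounding boxes.

    Boxes narrower or shorter than min_size are treated as invisible and
    skipped, so the IDs stay contiguous. Each label is anchored at the
    top-left corner of its box.
    """
    labels: list[dict] = []
    for box in boxes:
        if box.width < min_size or box.height < min_size:
            continue  # hidden or collapsed element: nothing to click
        label_id = len(labels)
        labels.append(
            {
                "id": label_id,
                "text": f"[{label_id}]",
                "x": box.x,
                "y": box.y,
                "box": box,  # kept so a later click can target the center
            }
        )
    return labels
```

Storing the box alongside each ID is what later allows a call such as click_element("5") to be resolved without the model ever writing a selector itself.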
Under the hood, a function called _collect_dom_elements uses JavaScript to ...