
Building Multimodal Web Tools

Explore building a multimodal web agent by managing Playwright sessions, annotating interactive elements with numeric labels, and fusing DOM data with visual language model analysis. Understand fallback mechanisms for element interaction and discover future directions in vision-based and accessibility tree parsing for robust web agents.

In the previous lesson, we set up browser_runtime.py to maintain a persistent Playwright session. However, a browser alone does not make an agent. The agent needs "eyes" to see the page and "hands" to interact with it. In this lesson, we will explore src/tools/web_tools.py. This file contains the actual Python functions (tools) exposed to the large language model (LLM). We will look at three critical responsibilities of these tools:

  1. Annotating the screen: Injecting numeric labels onto interactive UI elements so the vision-language model (VLM) can reference them.

  2. Executing actions: Translating a tool call like click_element("5") into either a CSS-selector click or a click at the element's on-screen coordinates.

  3. Multimodal fusion: Combining raw DOM data with VLM visual analysis into a single observation payload for the LLM.
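To make the second item concrete, here is a minimal sketch of how a numeric label might be resolved to a click. It assumes the selector is tried first and the bounding-box center is the fallback; the lesson's real web_tools.py may order these differently. The elements list, its "selector" and "bbox" keys, and the string return values are hypothetical names chosen for illustration. The page argument is expected to behave like a Playwright sync Page, which provides page.click(selector) and page.mouse.click(x, y).

```python
from typing import Any, Dict, List


def click_element(page: Any, elements: List[Dict[str, Any]], element_id: str) -> str:
    """Click the element labeled `element_id` and report which path was used.

    `elements` is a hypothetical list with one record per labeled element.
    Each record has an optional "selector" and a "bbox" with x, y, width, height.
    """
    index = int(element_id)
    if not 0 <= index < len(elements):
        raise ValueError(f"Unknown element id: {element_id}")
    element = elements[index]

    selector = element.get("selector")
    if selector:
        try:
            # Preferred path (an assumption): let the browser locate the node.
            page.click(selector, timeout=2000)
            return "selector"
        except Exception:
            # The selector may be stale or ambiguous, so fall through.
            pass

    # Fallback path: click the center of the recorded bounding box.
    box = element["bbox"]
    x = box["x"] + box["width"] / 2
    y = box["y"] + box["height"] / 2
    page.mouse.click(x, y)
    return "coordinates"
```

Returning the path that was taken is a design choice for this sketch only: it lets the agent's logs show how often the coordinate fallback is actually needed.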

By the end of this lesson, you should be able to:

  • Explain how numeric labels are drawn over web elements using bounding box data.

  • Understand the fallback mechanism (selector vs. coordinates) used when clicking an element.

  • Describe the structure of a fused multimodal observation.
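As a preview of what a fused observation could look like, the sketch below combines a list of DOM element records with a VLM's text description into one dictionary. The field names (url, elements, visual_summary) and the build_observation helper are assumptions made for illustration, not the actual payload produced by web_tools.py.

```python
import json
from typing import Any, Dict, List


def build_observation(
    url: str, elements: List[Dict[str, Any]], vlm_summary: str
) -> Dict[str, Any]:
    """Merge DOM-derived element data with a VLM description (hypothetical schema)."""
    return {
        "url": url,
        # One compact entry per labeled element; the id matches the on-screen label.
        "elements": [
            {"id": i, "tag": el.get("tag", ""), "text": el.get("text", "")}
            for i, el in enumerate(elements)
        ],
        # Free-text description of the annotated screenshot from the VLM.
        "visual_summary": vlm_summary,
    }


observation = build_observation(
    "https://example.com",
    [{"tag": "a", "text": "Home"}, {"tag": "button", "text": "Sign in"}],
    "A landing page with a navigation link and a sign-in button.",
)
# The payload is plain data, so it serializes directly for the LLM prompt.
payload = json.dumps(observation)
```

Keeping the element ids identical to the labels drawn on the screenshot is what lets the LLM cross-reference the two sources.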

Pause and reflect: If a web page dynamically changes a button's CSS class every time you reload, why is it safer for an LLM to interact with it using a numeric bounding-box ID rather than generating its own CSS selector?

Annotating the screenshot

To give the LLM a WebVoyager-style view of the page, we cannot just send a raw screenshot. We must label candidate elements (links, buttons, and inputs) with IDs like [0], [1], [2].
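The labeling step can be sketched in plain Python, independent of any drawing library. The example below assumes each candidate element arrives as a bounding box with x, y, width, and height in pixels; it skips boxes with no visible area, assigns sequential IDs, and computes where each label would be drawn (just above the box's top-left corner, clamped to the top of the image). The function name label_elements and the label_height parameter are hypothetical.

```python
from typing import Dict, List


def label_elements(
    boxes: List[Dict[str, float]], label_height: int = 14
) -> List[Dict[str, object]]:
    """Assign [0], [1], ... labels to visible boxes and compute label positions."""
    labeled = []
    for box in boxes:
        # Elements with no visible area cannot be seen or clicked, so skip them.
        if box["width"] <= 0 or box["height"] <= 0:
            continue
        element_id = len(labeled)
        labeled.append(
            {
                "id": element_id,
                "label": f"[{element_id}]",
                "box": box,
                # Draw the label just above the box, never off the top edge.
                "label_x": box["x"],
                "label_y": max(0, box["y"] - label_height),
            }
        )
    return labeled


labels = label_elements(
    [
        {"x": 10, "y": 5, "width": 80, "height": 20},
        {"x": 0, "y": 0, "width": 0, "height": 0},
        {"x": 30, "y": 100, "width": 120, "height": 32},
    ]
)
```

A drawing routine would then render each label string at (label_x, label_y) and outline the corresponding box on the screenshot.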

Under the hood, a function called _collect_dom_elements uses JavaScript to ...