Building Multimodal Web Tools

Explore building a multimodal web agent by managing Playwright sessions, annotating interactive elements with numeric labels, and fusing DOM data with vision-language model (VLM) analysis. Understand fallback mechanisms for element interaction and discover future directions in vision-based and accessibility tree parsing for robust web agents.

We'll cover the following...

Annotating the screenshot
Grounding actions with element IDs
Fusing multimodal observations
Limitations and improvements
Conclusion

In the previous lesson, we set up browser_runtime.py to maintain a persistent Playwright session. However, a browser alone does not make an agent. The agent needs "eyes" to see the page and "hands" to interact with it. In this lesson, we will explore src/tools/web_tools.py, the file that contains the Python functions (tools) exposed to the large language model (LLM). We will look at three critical pieces of functionality:

Annotating the screen: Injecting numeric labels onto interactive UI elements so the vision-language model (VLM) can reference them.
Executing actions: Translating a tool call like click_element("5") into either a CSS-selector click or a click at the element's physical coordinates.
Multimodal fusion: Combining raw DOM data with VLM visual analysis into a single observation payload for the LLM.
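To make the "executing actions" idea concrete before we open the real file, here is a minimal sketch of a selector-versus-coordinates fallback. It is not the code from web_tools.py: the registry layout, the return strings, and the choice to try the selector first are all assumptions made for illustration. StubPage and StubMouse are hypothetical stand-ins that mimic only the two Playwright calls the sketch relies on, page.click(selector, timeout=...) and page.mouse.click(x, y), so the example runs without a browser.

```python
class StubMouse:
    """Hypothetical stand-in for Playwright's page.mouse."""

    def __init__(self):
        self.clicks = []

    def click(self, x, y):
        self.clicks.append((x, y))


class StubPage:
    """Hypothetical stand-in for a Playwright Page (only what the sketch needs)."""

    def __init__(self, broken_selectors=()):
        self.mouse = StubMouse()
        self.broken_selectors = set(broken_selectors)
        self.selector_clicks = []

    def click(self, selector, timeout=None):
        # Simulate a selector that no longer matches anything on the page.
        if selector in self.broken_selectors:
            raise RuntimeError(f"selector not found: {selector}")
        self.selector_clicks.append(selector)


def click_element(page, registry: dict, element_id: str) -> str:
    """Click a labeled element: try its CSS selector, then fall back to coordinates.

    `registry` is an assumed mapping from label ID to
    {"selector": str, "center": (x, y)} recorded when the page was annotated.
    """
    entry = registry.get(element_id)
    if entry is None:
        return f"error: no element with ID {element_id}"

    selector = entry.get("selector")
    if selector:
        try:
            page.click(selector, timeout=2000)  # timeout in milliseconds
            return f"clicked [{element_id}] via selector"
        except Exception:
            # A real tool would catch Playwright's own error types here.
            pass

    # Fallback: click the center of the bounding box that was labeled.
    x, y = entry["center"]
    page.mouse.click(x, y)
    return f"clicked [{element_id}] via coordinates"


registry = {"5": {"selector": "button.buy", "center": (120.0, 300.0)}}
print(click_element(StubPage(), registry, "5"))
print(click_element(StubPage(broken_selectors=["button.buy"]), registry, "5"))
```

The LLM only ever supplies the ID "5". Whether the click ends up going through the selector or the coordinates is decided inside the tool, which is what keeps the model from having to guess selectors on its own.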

By the end of this lesson, you should be able to:

Explain how numeric labels are drawn over web elements using bounding box data.
Understand the fallback mechanism (selector vs. coordinates) used when clicking an element.
Describe the structure of a fused multimodal observation.

Pause and reflect: If a web page dynamically changes a button's CSS class every time you reload, why is it safer for an LLM to interact with it using a numeric bounding-box ID rather than generating its own CSS selector?

Annotating the screenshot

To give the LLM a WebVoyager-style view of the page, we cannot just send a raw screenshot. We must label candidate elements (links, buttons, and inputs) with IDs like [0], [1], [2].
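As a rough illustration of the labeling step, the sketch below assigns sequential IDs to candidate elements from their bounding boxes. It assumes each element arrives as a dictionary with tag, selector, x, y, width, and height keys; that shape, along with the LabeledElement class and the assign_labels function, is hypothetical and not taken from the lesson's code. Drawing the labels onto the screenshot itself is left out so the example stays dependency-free.

```python
from dataclasses import dataclass


@dataclass
class LabeledElement:
    element_id: int  # numeric ID shown on the screenshot, e.g. 0 for "[0]"
    tag: str
    selector: str
    x: float
    y: float
    width: float
    height: float

    @property
    def label(self) -> str:
        """Text drawn over the element, in the [0], [1], [2] style."""
        return f"[{self.element_id}]"

    @property
    def center(self) -> tuple:
        """Center of the bounding box, usable later as a click target."""
        return (self.x + self.width / 2, self.y + self.height / 2)


def assign_labels(raw_elements: list) -> list:
    """Give each visible candidate element a sequential numeric ID."""
    labeled = []
    for raw in raw_elements:
        # Skip zero-sized boxes: there is nothing visible to draw a label on.
        if raw["width"] <= 0 or raw["height"] <= 0:
            continue
        labeled.append(
            LabeledElement(
                element_id=len(labeled),  # IDs stay contiguous after filtering
                tag=raw["tag"],
                selector=raw["selector"],
                x=raw["x"],
                y=raw["y"],
                width=raw["width"],
                height=raw["height"],
            )
        )
    return labeled


elements = assign_labels([
    {"tag": "a", "selector": "a#home", "x": 10, "y": 20, "width": 100, "height": 40},
    {"tag": "div", "selector": "div.hidden", "x": 0, "y": 0, "width": 0, "height": 0},
    {"tag": "button", "selector": "button.submit", "x": 50, "y": 100, "width": 80, "height": 30},
])
print([e.label for e in elements])  # ['[0]', '[1]']
```

Because the hidden div is filtered out before IDs are handed out, the labels on the screenshot and the IDs the model can reference stay in one-to-one correspondence.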

Under the hood, a function called _collect_dom_elements uses JavaScript to ...
