The Multimodal Web Agent Challenge

Explore the design of multimodal web agents that navigate and interact with real-world websites using both visual screenshots and HTML analysis. Understand the grounding problem where agents must map high-level plans to specific webpage elements, and study the architecture of WebVoyager, a next-generation web agent designed to autonomously complete diverse tasks on live websites by mimicking human-like interactions.

Problem space: Multimodal web agents

So far in this course, we’ve seen agents designed for specific, relatively contained environments. Now we turn to a far more open-ended and unpredictable environment for agents to operate in: the live web.

What is a web agent?

A web agent is an autonomous system designed to navigate and interact with real-world websites to complete tasks on behalf of a user. Unlike a simple web scraper that only extracts data, a web agent acts as a human user would: it clicks buttons, types into forms, and makes decisions based on what it sees.

A web agent must be able to perform actions like a human, such as clicking buttons, to complete its tasks

Our goal in this chapter is to design an agent that, given a high-level instruction like “Find the cheapest flight from New York to London next Tuesday,” can autonomously browse a website and return a final answer.

The foundational tool: HTML

To perform any action, an agent must first understand the structure of a webpage. The primary source for this understanding is the website’s HTML (Hypertext Markup Language).

HTML is the skeleton of a webpage. For an agent, it is a valuable source of signals because it explicitly defines the interactive elements on a page. By parsing the HTML, an agent can do the following:

  • Identify elements: It can see every button, input field, link, and image on the page.

  • Understand element types: It knows the difference between a clickable <button> and a fillable ...
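To make the idea of parsing HTML for interactive elements concrete, here is a minimal sketch using Python's standard-library html.parser module. It is an illustration only, not how WebVoyager or any particular agent does it. The tag set, the sample page, and the names InteractiveElementFinder and find_interactive_elements are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags this sketch treats as worth reporting to an agent.
# This set is an assumption for illustration, not an exhaustive list.
ELEMENT_TAGS = {"a", "button", "input", "img", "select", "textarea"}


class InteractiveElementFinder(HTMLParser):
    """Collects the tag name and attributes of each matching element."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names and passes attributes
        # as a list of (name, value) pairs.
        if tag in ELEMENT_TAGS:
            self.elements.append({"tag": tag, "attrs": dict(attrs)})


def find_interactive_elements(html):
    """Return the matching elements in document order."""
    finder = InteractiveElementFinder()
    finder.feed(html)
    return finder.elements


# A hypothetical page fragment, invented for this example.
SAMPLE_PAGE = """
<form>
  <input type="text" name="origin" placeholder="From">
  <button id="search-btn">Search flights</button>
  <a href="/deals">Deals</a>
  <p>Not interactive</p>
</form>
"""

for element in find_interactive_elements(SAMPLE_PAGE):
    # The tag name tells the agent the element type (clickable vs. fillable);
    # the attributes give it something to refer to the element by.
    print(element["tag"], element["attrs"])
```

Run on the sample fragment, the sketch reports the input, the button, and the link, and skips the paragraph. A real page would need far more care (hidden elements, scripts that add elements after load, and so on), which this sketch deliberately ignores.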