The Multimodal Web Agent Challenge

Explore the fundamental challenges of web automation, including the limitations of text-only agents and the critical “grounding problem” that arises with multimodal models.

Problem space: Multimodal web agents

So far in this course, we’ve seen agents designed for specific, relatively contained environments. Now we turn to a far more complex and open-ended environment for agents to operate in: the live web.

What is a web agent?

A web agent is an autonomous system designed to navigate and interact with real-world websites to complete tasks on behalf of a user. Unlike a simple web scraper that only extracts data, a web agent can act as a human user would: clicking buttons, typing into forms, and making decisions based on what it sees.

Our goal in this chapter is to design an agent that, given a high-level instruction like “Find the cheapest flight from New York to London next Tuesday,” can autonomously browse a website and return a final answer.

The foundational tool: HTML

To perform any action, an agent must first understand the structure of a webpage. The primary source for this understanding is the website’s HTML (Hypertext Markup Language).

HTML is the skeleton of a webpage. For an agent, it’s a valuable source of signals because it explicitly defines the interactive elements on a page. By parsing the HTML, an agent can do the following:

Identify elements: It can see every button, input field, link, and image on the page.
Understand element types: It knows the difference between a clickable <button> and a fillable <input type="text">.
Extract textual content: It can read the text labels on buttons and links, which is crucial for understanding their function.
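
To make these three capabilities concrete, here is a minimal sketch using only Python's standard-library html.parser module. The sample page, the set of tracked tags, and the names ElementCollector and extract_elements are assumptions made for illustration; they are not part of any particular agent framework, and a real agent would likely work from the browser's live DOM rather than a static string.

```python
from html.parser import HTMLParser

# Tags this sketch treats as worth reporting (an illustrative assumption).
TRACKED_TAGS = {"button", "input", "a", "img"}
# Void elements have no closing tag and therefore no inner text.
VOID_TAGS = {"input", "img"}


class ElementCollector(HTMLParser):
    """Collects tracked elements with their type, attributes, and text label."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = None  # tracked element currently accumulating text

    def handle_starttag(self, tag, attrs):
        if tag not in TRACKED_TAGS:
            return
        element = {"tag": tag, "attrs": dict(attrs), "text": ""}
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self._open = element

    def handle_endtag(self, tag):
        if self._open is not None and self._open["tag"] == tag:
            self._open["text"] = self._open["text"].strip()
            self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data


def extract_elements(html):
    """Return a list of dicts describing the tracked elements in `html`."""
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    return collector.elements


# Hypothetical page fragment, loosely based on the flight-search task.
SAMPLE_PAGE = """
<form>
  <input type="text" name="origin" placeholder="From">
  <button type="submit">Search flights</button>
  <a href="/deals">Today's deals</a>
</form>
"""

elements = extract_elements(SAMPLE_PAGE)
for el in elements:
    print(el["tag"], el["attrs"], repr(el["text"]))
```

Each dictionary records which element was found (identification), its tag and attributes such as type="text" (element type), and its visible label (textual content). An agent could pass a list like this to its decision-making step.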

For early web agents, HTML was the only way they could perceive their environment. However, as ...
