Text vs. Multimodal Web Agents in a Real-World Task

Understand the workflows of text-only and multimodal web agents to see why multimodality is crucial for real-world tasks.

We will compare two agent workflows, one text-only and one multimodal, as they attempt the same real-world task: booking a flight. This detailed walkthrough will highlight the common failure points of text-only agents and show how a multimodal approach, like WebVoyager’s, is designed to overcome them.

A live web task: Book a flight

To truly understand the difference between a text-only web agent and a multimodal one, we need to see them in action. We’ll give both agents the same, seemingly simple task.

User instruction: “Find the cheapest one-way flight from Los Angeles (LAX) to New York (JFK) for next Friday.”

For this challenge, we will assume both agents start on the Google Flights homepage. Our goal is to analyze not whether they succeed, but how they perceive, reason, and act at each step. This will give us, as agent designers, a clear mental model of their strengths and weaknesses.

How a text-only web agent performs the task

The traditional text-only web agent relies solely on the website’s HTML code to understand and navigate the page. It cannot “see” the website; it can only read its underlying structure.

Let’s walk through how a text-only web agent works, step by step.

Step 1: Analyzing the homepage

Perception (The agent’s view): The agent’s browsing environment provides it with the complete HTML of the Google Flights homepage. This is a massive wall of text, potentially thousands of lines long, filled with <div>, <span>, <script>, and <style> tags. The agent’s LLM must parse all of this to find the relevant interactive elements.
Reasoning (The agent’s thought process): The LLM sifts through the HTML, looking for keywords like “from,” “to,” and “date” to identify the main form fields. It might generate a thought similar to the one shown below.

Thought: “I have analyzed the HTML. I’ve identified an <input> element with an aria-label of ‘Where from’ which seems to be the correct field for the departure city. I need to type ‘LAX’ into this element.”

The challenge (Ambiguity in the code): Here, the agent immediately faces a problem. Modern websites are incredibly complex. Several elements in the code might look like the “Where from” field (e.g., one in the main form, another in a hidden menu, a third in an example snippet). The HTML alone doesn’t provide enough context to know which one is the correct, visible input field.
Action and potential failure: The agent chooses one of the ...
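
To make the ambiguity in this step concrete, here is a minimal sketch of the keyword-based search a text-only agent might run over raw HTML. Everything in it is an assumption for illustration: the markup is a tiny hypothetical stand-in (not the real Google Flights HTML), and the element ids and the helper name find_origin_candidates are invented.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified page. Real flight-search markup is
# thousands of lines long and is not reproduced here.
SAMPLE_HTML = """
<div id="main-form">
  <input id="origin-main" aria-label="Where from" value="">
</div>
<div id="mobile-menu" style="display:none">
  <input id="origin-hidden" aria-label="Where from" value="">
</div>
<div id="help-example">
  <input id="origin-demo" aria-label="Where from (example)" disabled>
</div>
"""


class InputCollector(HTMLParser):
    """Collects the ids of <input> elements whose aria-label mentions 'where from'."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr_map = dict(attrs)
        label = (attr_map.get("aria-label") or "").lower()
        if "where from" in label:
            self.candidates.append(attr_map.get("id"))


def find_origin_candidates(html):
    """Return the ids of every input that looks like the departure field."""
    collector = InputCollector()
    collector.feed(html)
    return collector.candidates


candidates = find_origin_candidates(SAMPLE_HTML)
print(candidates)  # three matches, and nothing in the text marks the visible one
```

The search returns three candidates for a single logical field. In this toy page, the hidden one could be filtered out by reading the inline style on its parent, but on a real site visibility usually depends on external stylesheets and scripts, which the raw HTML text does not resolve.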
