Text vs. Multimodal Web Agents in a Real-World Task

Understand the workflows of text-only and multimodal web agents to see why multimodality is crucial for real-world tasks.

We will compare two agent workflows, one text-only and one multimodal, as they attempt the same real-world task: booking a flight. This detailed walkthrough will highlight the common failure points of text-only agents and show how a multimodal approach, like WebVoyager’s, is designed to overcome them.

A live web task: Book a flight

To truly understand the difference between a text-only web agent and a multimodal one, we need to see them in action. We’ll give both agents the same, seemingly simple task.

User instruction: “Find the cheapest one-way flight from Los Angeles (LAX) to New York (JFK) for next Friday.”

The starting point for the flight booking task. For the agents, this page represents a visually simple but structurally complex challenge.

For this challenge, we will assume both agents start on the Google Flights homepage. Our goal is to analyze not whether they succeed, but how they perceive, reason, and act at each step. This will give us, as agent designers, a clear mental model of their strengths and weaknesses.

How a text-only web agent performs the task

The traditional text-only web agent relies solely on the website’s HTML code to understand and navigate the page. It cannot “see” the website; it can only read its underlying structure.

The text-only agent’s perception: A verbose and noisy wall of HTML code. Without visual context, grounding an action to the correct, visible element is a significant challenge.

We’ll walk through how a text-only web agent works, step by step.

Step 1: Analyzing the homepage

  • Perception (The agent’s view): The agent’s browsing environment provides it with the complete HTML of the Google Flights homepage. This is a massive wall of text, potentially thousands of lines long, filled with <div>, <span>, <script>, and <style> tags. The agent’s LLM must parse all of this to find the relevant interactive elements.

  • Reasoning (The agent’s thought process): The LLM sifts through the HTML, looking for keywords like “from,” “to,” and “date” to identify the main form fields. It might generate a thought similar to the one shown below.

Thought: “I have analyzed the HTML. I’ve identified an <input> element with an aria-label of ‘Where from’ which seems to be the correct field for the departure city. I need to type ‘LAX’ into this element.”

  • The challenge (Ambiguity in the code): Here, the agent immediately faces a problem. A modern web page often contains several elements that match the same description. There might be multiple elements that look like an “Origin” field in the code (e.g., one for the main form, another in a hidden menu, a third in an example snippet). The HTML alone doesn’t provide enough context to know which one is the correct, visible input field.

  • Action and potential failure: The agent chooses one of the ...
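The ambiguity described in Step 1 can be illustrated with a minimal sketch. It assumes a tiny, hypothetical HTML fragment (the ids, the hidden menu, and the help snippet are invented for illustration and are not Google Flights’ actual markup) and uses Python’s standard html.parser module to do what a text-only agent roughly does: scan the markup for input fields whose aria-label mentions “where from.”

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified stand-in for a flight-search page.
# Real pages are far larger; these ids and wrappers are assumptions.
SAMPLE_HTML = """
<div id="main-form">
  <input id="origin-main" aria-label="Where from" type="text">
</div>
<div id="side-menu" hidden>
  <input id="origin-hidden" aria-label="Where from" type="text">
</div>
<div id="help" style="display:none">
  <input id="origin-example" aria-label="Where from?" type="text">
</div>
"""


class InputFinder(HTMLParser):
    """Collect the ids of <input> elements whose aria-label contains a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        label = (attributes.get("aria-label") or "").lower()
        if self.keyword in label:
            self.matches.append(attributes.get("id"))


def find_candidates(html, keyword):
    """Return the ids of every input field that matches the keyword."""
    finder = InputFinder(keyword)
    finder.feed(html)
    return finder.matches


candidates = find_candidates(SAMPLE_HTML, "where from")
print(candidates)  # ['origin-main', 'origin-hidden', 'origin-example']
```

All three fields match the keyword, yet only the first would be visible to a user. Telling them apart from markup alone would require reasoning about attributes and styles on ancestor elements, which this sketch deliberately does not attempt.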