Text vs. Multimodal Web Agents in a Real-World Task
Discover the differences between text-based and multimodal web agents while seeing how each tackles a real-world flight booking task. Understand why multimodal perception, combining visual screenshots with code analysis, enables agents to avoid common pitfalls that text-only agents face. Learn how these insights guide robust web agent design.
We will compare two agent workflows, one text-only and one multimodal, as they attempt the same real-world task: booking a flight. This detailed walkthrough will highlight the common failure points of text-only agents and show how a multimodal approach, like WebVoyager’s, is designed to overcome them.
A live web task: Book a flight
To truly understand the difference between a text-only web agent and a multimodal one, we need to see them in action. We’ll give both agents the same, seemingly simple task.
User instruction: “Find the cheapest one-way flight from Los Angeles (LAX) to New York (JFK) for next Friday.”
For this challenge, we will assume both agents start on the Google Flights homepage. Our goal is to analyze not whether they succeed, but how they perceive, reason, and act at each step. This will give us, as agent designers, a clear mental model of their strengths and weaknesses.
How a text-only web agent performs the task
The traditional text-only web agent relies solely on the website’s HTML code to understand and navigate the page. It cannot “see” the website; it can only read its underlying structure.
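To make “reading the underlying structure” concrete, here is a minimal sketch of how a text-only pipeline might reduce raw HTML to a list of candidate form fields. Everything in it is an assumption for illustration: the SAMPLE_HTML markup is hypothetical and far simpler than the real Google Flights page, and the keyword match on aria-label attributes is only one plausible heuristic, not how any particular agent is implemented.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified markup (an assumption for illustration);
# the real page is thousands of lines long.
SAMPLE_HTML = """
<div class="c1"><span>Flights</span>
  <input type="text" aria-label="Where from?" value="">
  <input type="text" aria-label="Where to?" value="">
  <input type="text" aria-label="Departure date">
  <button aria-label="Search">Search</button>
</div>
<script>var x = 1;</script>
"""

INTERACTIVE_TAGS = {"input", "button", "select", "textarea", "a"}
KEYWORDS = ("from", "to", "date")


class InteractiveElementCollector(HTMLParser):
    """Collects (tag, attributes) for elements a user could interact with."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            self.elements.append((tag, dict(attrs)))


def find_form_fields(html):
    """Map each keyword to the first interactive element whose label contains it."""
    collector = InteractiveElementCollector()
    collector.feed(html)
    matches = {}
    for tag, attrs in collector.elements:
        label = attrs.get("aria-label") or ""
        # Match whole words so that "to" does not match inside another word.
        words = label.lower().replace("?", "").split()
        for keyword in KEYWORDS:
            if keyword in words and keyword not in matches:
                matches[keyword] = (tag, label)
    return matches


fields = find_form_fields(SAMPLE_HTML)
for keyword, (tag, label) in fields.items():
    print(f"{keyword}: <{tag}> labeled {label!r}")
```

Note what this view leaves out: nothing here tells the agent where an element sits on the screen, whether it is visible, or what it looks like. The agent has only tag names and attribute strings to go on.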
We’ll walk through how a text-only web agent works, step by step.
Step 1: Analyzing the homepage
Perception (The agent’s view): The agent’s browsing environment provides it with the complete HTML of the Google Flights homepage. This is a massive wall of text, potentially thousands of lines long, filled with <div>, <span>, <script>, and <style> tags. The agent’s LLM must parse all of this to find the relevant interactive elements.
Reasoning (The agent’s thought process): The LLM sifts through the HTML, looking for keywords like “from,” “to,” and “date” to identify the main form fields. It might generate a thought similar to the one shown below. ...