Evaluating WebVoyager and Design Insights
Explore the evaluation process of WebVoyager, a multimodal web agent tested on 643 real-life tasks across popular sites. Understand the human and machine-led assessment protocols and analyze failure modes such as navigation errors, visual grounding failures, hallucination, and prompting issues. Learn how these insights guide improvements in agent planning, grounding, and self-critique to design better web agents.
We’ve explored WebVoyager’s architecture, but how do we know if it’s truly effective? A key challenge in agentic systems is creating a fair and realistic test. Many early web agents were tested in simplified simulators or on static, cached websites, which don’t capture the messiness of the live web.
To properly test a generalist agent like WebVoyager, we need to evaluate it on the same websites that real users interact with every day. This is the only way to test an agent's ability to handle real-world challenges such as floating ads, pop-up windows, and constantly changing page content.
This led to the creation of a new benchmark, a curated collection of 643 real-world web tasks spread across 15 popular, live websites, including Amazon, Google Flights, ESPN, and GitHub.
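To make this concrete, a benchmark like this is essentially a collection of natural-language task records, each tied to a live website. The sketch below is a minimal, hypothetical representation of such a task collection; the field names (`site`, `question`, `url`) and the sample tasks are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class WebTask:
    """One real-world web task: a question the agent must answer on a live site.

    Hypothetical schema for illustration -- not the benchmark's actual format.
    """
    site: str       # e.g. "Amazon", "Google Flights", "ESPN", "GitHub"
    question: str   # natural-language task for the agent
    url: str        # starting page on the live website

# Two illustrative (made-up) tasks in the style of the benchmark.
tasks = [
    WebTask("Amazon",
            "Find a wireless mouse under $20 with at least a 4-star rating.",
            "https://www.amazon.com"),
    WebTask("GitHub",
            "Find the most-starred repository written in Python.",
            "https://github.com"),
]

# The real benchmark spreads 643 such tasks across 15 sites; a per-site
# tally like this shows how coverage is distributed.
per_site = Counter(t.site for t in tasks)
print(per_site)  # Counter({'Amazon': 1, 'GitHub': 1})
```

Keeping each task as a plain record with a live starting URL is what distinguishes this setup from simulator-based benchmarks: the agent is graded on the real, changing page, not a cached snapshot.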
Data construction via the self-instruct method
As agent designers, creating ...