Evaluating WebVoyager and Design Insights
Explore how WebVoyager's effectiveness is evaluated using a new benchmark of real-world web tasks. Understand the design of human and LLM-based evaluation protocols, analyze WebVoyager's performance, and identify key challenges like navigation and visual grounding that guide future web agent design improvements.
We’ve explored WebVoyager’s architecture, but how do we know if it’s truly effective? A key challenge in agentic systems is creating a fair and realistic test. Many early web agents were tested in simplified simulators or on static, cached websites, which don’t capture the messiness of the live web.
To properly test a generalist agent like WebVoyager, we need to evaluate it on the same websites that real users interact with every day. This is the only way to test an agent's ability to handle real-world challenges such as floating ads, pop-up windows, and constantly changing page content.
This led to the creation of a new benchmark, a curated collection of 643 real-world web tasks spread across 15 popular, live websites, including Amazon, Google Flights, ESPN, and GitHub.
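To make this setup concrete, here is a minimal sketch of how such a benchmark could be represented and scored. The field names and helper below are illustrative assumptions, not the benchmark's actual schema; the idea is simply that each task is tied to a live website and receives a pass/fail verdict from a human or LLM-based judge, from which per-site and overall success rates are computed.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class WebTask:
    """One benchmark entry (illustrative fields, not the real schema)."""
    website: str       # e.g. "Amazon", "Google Flights", "GitHub"
    instruction: str   # natural-language task given to the agent
    succeeded: bool    # verdict from a human or LLM-based judge

def success_rates(tasks: list[WebTask]) -> dict[str, float]:
    """Compute the per-website and overall task success rate."""
    per_site: dict[str, list[bool]] = defaultdict(list)
    for t in tasks:
        per_site[t.website].append(t.succeeded)
    rates = {site: sum(v) / len(v) for site, v in per_site.items()}
    rates["overall"] = sum(t.succeeded for t in tasks) / len(tasks)
    return rates

# Hypothetical example tasks and judge verdicts:
tasks = [
    WebTask("Amazon", "Find a USB-C cable under $10.", True),
    WebTask("Amazon", "Add the cheapest HDMI cable to the cart.", False),
    WebTask("GitHub", "Open the issues page of a given repository.", True),
]
print(success_rates(tasks))
```

Aggregating per website matters because, as discussed below, agent failure modes (navigation depth, visual grounding) vary sharply by site, and an overall average alone would hide that.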
Data construction via the self-instruct method
As agent designers, creating ...