Evaluating WebVoyager and Design Insights
Explore the evaluation process of WebVoyager, a multimodal web agent tested on 643 real-life tasks across popular sites. Understand the human and machine-led assessment protocols and analyze failure modes such as navigation errors, visual grounding failures, hallucination, and prompting issues. Learn how these insights guide improvements in agent planning, grounding, and self-critique to design better web agents.
We’ve explored WebVoyager’s architecture, but how do we know if it’s truly effective? A key challenge in agentic systems is creating a fair and realistic test. Many early web agents were tested in simplified simulators or on static, cached websites, which don’t capture the messiness of the live web.
To properly test a generalist agent like WebVoyager, we need to evaluate it on the same websites that real users interact with every day. This is the only way to test an agent's ability to handle real-world challenges such as floating ads, pop-up windows, and constantly changing page content.
This led to the creation of a new benchmark, a curated collection of 643 real-world web tasks spread across 15 popular, live websites, including Amazon, Google Flights, ESPN, and GitHub.
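To make this concrete, a benchmark like this is essentially a collection of natural-language task records, each tied to a live website. The sketch below is a minimal, hypothetical representation of such a task collection; the field names (`site`, `question`, `url`) and the sample tasks are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class WebTask:
    """One real-world web task: a question the agent must answer on a live site.

    Hypothetical schema for illustration -- not the benchmark's actual format.
    """
    site: str       # e.g. "Amazon", "Google Flights", "ESPN", "GitHub"
    question: str   # natural-language task for the agent
    url: str        # starting page on the live website

# Two illustrative (made-up) tasks in the style of the benchmark.
tasks = [
    WebTask("Amazon",
            "Find a wireless mouse under $20 with at least a 4-star rating.",
            "https://www.amazon.com"),
    WebTask("GitHub",
            "Find the most-starred repository written in Python.",
            "https://github.com"),
]

# The real benchmark spreads 643 such tasks across 15 sites; a per-site
# tally like this shows how coverage is distributed.
per_site = Counter(t.site for t in tasks)
print(per_site)  # Counter({'Amazon': 1, 'GitHub': 1})
```

Keeping each task as a plain record with a live starting URL is what distinguishes this setup from simulator-based benchmarks: the agent is graded on the real, changing page, not a cached snapshot.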
Data construction via the self-instruct method
As agent designers, creating ...