AI agents are becoming a new research direction with potential applications in the real world. These agents use foundation models such as large language models (LLMs) and vision language models (VLMs) to take natural language instructions and pursue complex goals autonomously or semi-autonomously. AI agents can use a variety of tools such as browsers, search engines and code compilers to verify their actions and reason about their goals.
However, a recent analysis by researchers at Princeton University revealed several flaws in current agent benchmarks and evaluation practices that hinder their usefulness in real-world applications.
Their findings highlight that agent benchmarking comes with distinct challenges, and we cannot evaluate agents in the same way we benchmark foundation models.
Cost versus accuracy trade-off
A key issue that the researchers highlight in their study is the lack of cost control in agent evaluation. AI agents can be much more expensive to run than a single model call, as they often rely on stochastic language models that can produce different results when asked the same question multiple times.
To increase accuracy, some agent systems generate multiple responses and use mechanisms such as voting or external verification tools to select the best one. Sometimes sampling hundreds or thousands of responses can increase an agent's accuracy. While this approach can improve performance, it comes at a significant computational cost. Inference cost is not always a concern in research settings, where the goal is to maximize accuracy.
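To make the trade-off concrete, here is a minimal Python sketch of the repeated-sampling-and-voting pattern described above. The `ask_model` callable, the toy model and the flat per-call price are illustrative placeholders rather than the setups or prices evaluated in the study; the point is simply that cost grows linearly with the number of samples.

```python
import random
from collections import Counter

def sample_with_voting(ask_model, question, n_samples=5, cost_per_call=0.01):
    """Query a stochastic model several times and return the majority answer.

    ask_model: any function that takes a question and returns an answer string
               (a placeholder for a real model API).
    cost_per_call: illustrative flat price per call, not a real provider price.
    """
    answers = [ask_model(question) for _ in range(n_samples)]
    best_answer, _ = Counter(answers).most_common(1)[0]
    total_cost = n_samples * cost_per_call  # cost scales linearly with the number of samples
    return best_answer, total_cost

# Toy "model" that answers correctly most of the time.
toy_model = lambda q: random.choice(["Paris", "Paris", "Paris", "Lyon"])
answer, cost = sample_with_voting(toy_model, "What is the capital of France?", n_samples=25)
print(answer, f"${cost:.2f}")  # majority answer and the cost of obtaining it
```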
However, in practical applications, there is a limit to the budget available for each query, making it necessary to control the cost of agent evaluation. Failing to do so may encourage researchers to develop extremely expensive agents just to top the leaderboard. The Princeton researchers propose visualizing evaluation results as a Pareto curve of accuracy and inference cost, and using techniques that jointly optimize agents for these two metrics.
The researchers evaluated the cost-accuracy trade-off of different prompting techniques and agent models introduced in different papers.
“For substantially similar accuracy, the cost can vary by nearly two orders of magnitude,” the researchers write. “Nevertheless, the cost of running these agents is not a top-line metric listed in any of these papers.”
The researchers argue that optimizing both metrics can lead to “agents that cost less while maintaining accuracy.” Joint optimization can also enable researchers and developers to make trade-offs between the fixed and variable costs of running agents. For example, they can spend more on optimizing the agent's design while reducing variable costs by using fewer in-context learning examples in the agent's prompt.
The researchers tested joint optimization on HotpotQA, a popular question-answering benchmark. Their results show that the joint optimization formulation provides a way to strike an optimal balance between accuracy and inference cost.
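The paper's own optimization setup is not reproduced here, but the idea of reporting results as a Pareto curve of cost and accuracy can be illustrated with a short sketch. The agent names, costs and accuracies below are made-up numbers for illustration only, not measurements from the study.

```python
def pareto_frontier(results):
    """Return agents that are not dominated on (cost, accuracy).

    results: dict mapping an agent name to a (cost_in_dollars, accuracy) pair.
    An agent is dominated if another agent is at least as accurate and no more
    expensive, and strictly better on at least one of the two metrics.
    """
    frontier = []
    for name, (cost, acc) in results.items():
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for other, (c, a) in results.items()
            if other != name
        )
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda item: item[1])  # cheapest first

# Hypothetical evaluation results, not the paper's data.
results = {
    "single_call":       (0.02, 0.58),
    "few_shot":          (0.05, 0.64),
    "reflexion_variant": (0.60, 0.63),
    "vote_of_10":        (0.45, 0.66),
    "vote_of_100":       (4.20, 0.67),
}
print(pareto_frontier(results))
# reflexion_variant is filtered out: few_shot is both cheaper and more accurate.
# vote_of_100 stays on the frontier only if its small accuracy gain justifies ~100x the cost.
```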
“Useful agent evaluations must control for cost, even if we ultimately do not care about cost and only about identifying innovative agent designs,” the researchers write. “Accuracy alone cannot indicate progress because it can be improved by scientifically meaningless methods such as retrying.”
Model development versus downstream applications
Another issue highlighted by the researchers is the distinction between evaluating models for research purposes and developing downstream applications. In research, accuracy is often the primary focus, with inference costs largely ignored. However, when building real-world applications on top of AI agents, inference cost plays an important role in deciding which model and technique to use.
Evaluating the inference costs of AI agents is difficult. For example, different model providers may charge different amounts for the same model. Meanwhile, the cost of API calls changes regularly and can vary based on developers' decisions. For example, on some platforms, bulk API calls are charged differently.
To address this problem, the researchers created a website that adjusts model comparisons based on token pricing.
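As a rough illustration of the kind of adjustment such a comparison requires, the sketch below computes per-call cost from token counts and per-million-token prices. The token counts and prices are made-up values; real provider pricing differs by model and changes frequently, which is precisely the problem the researchers point to.

```python
def call_cost(prompt_tokens, completion_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of one API call, given token counts and $-per-million-token prices."""
    return (prompt_tokens * price_in_per_m + completion_tokens * price_out_per_m) / 1_000_000

# Hypothetical token counts for three agent runs and hypothetical prices.
runs = [(1200, 300), (900, 250), (1500, 400)]  # (prompt_tokens, completion_tokens)
price_in, price_out = 5.00, 15.00              # made-up $ per million tokens

total = sum(call_cost(p, c, price_in, price_out) for p, c in runs)
print(f"Total cost for {len(runs)} calls: ${total:.4f}")
# Re-running the same comparison with different prices can reorder which agent looks cheapest.
```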
They also conducted a case study on NovelQA, a benchmark for question answering on very long texts. They found that benchmarks designed for model evaluation can be misleading when used for downstream evaluation. For example, NovelQA makes retrieval-augmented generation (RAG) look much worse relative to long-context models than it is in real-world scenarios. Their results showed that RAG and long-context models were roughly equally accurate, while the long-context models were 20 times more expensive.
Overfitting is a problem
In learning new tasks, machine learning (ML) models often find shortcuts that allow them to score well on benchmarks. A prominent type of shortcut is “overfitting,” where the model finds ways to cheat benchmark tests and provide results that don't translate to the real world. The researchers found that overfitting is a serious problem for agent benchmarks, because they are small, typically consisting of only a few hundred samples. This problem is more serious than data contamination in training foundation models, since knowledge of test patterns can be programmed directly into the agent.
To address this problem, the researchers suggest that benchmark developers create and maintain holdout test sets composed of examples that cannot be memorized during training and can only be solved through a correct understanding of the target task. In their analysis of 17 benchmarks, the researchers found that many lack adequate holdout datasets, allowing agents to take shortcuts, even unintentionally.
“Surprisingly, we find that many agent benchmarks do not include holdout test sets,” the researchers write. “In addition to creating a test set, benchmark developers should consider keeping it confidential to prevent LLM contamination or agent overfitting.”
They also note that different types of holdout samples are needed, depending on the desired level of generality of the task the agent performs.
“Benchmark developers should do their best to ensure that shortcuts are impossible,” the researchers write. “We see this as the responsibility of benchmark developers rather than agent developers, because designing benchmarks that don't allow shortcuts is much easier than testing each agent to see if it takes a shortcut.”
The researchers tested WebArena, a benchmark that evaluates the performance of AI agents in solving problems on different websites. They found several shortcuts in the training datasets that allowed agents to overfit to tasks in ways that would easily break with minor changes in the real world. For example, an agent may make assumptions about the structure of web addresses without considering that they might change in the future or would not hold on different websites.
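A hypothetical snippet makes this kind of shortcut concrete: hard-coding an assumption about URL structure works on the benchmark site but fails silently the moment a site changes its routing. The function and site names are invented for illustration and are not taken from WebArena.

```python
def guess_profile_url(site: str, username: str) -> str:
    """Shortcut: assume every site exposes user profiles at /users/<name>."""
    return f"{site}/users/{username}"

# Works on the benchmark site the agent was tuned against...
print(guess_profile_url("https://forum.example", "alice"))       # https://forum.example/users/alice

# ...but yields a dead link on a site that routes profiles differently
# (e.g. /people/<numeric-id>), a failure a small benchmark never surfaces.
print(guess_profile_url("https://other-forum.example", "bob"))
```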
Researchers caution that these errors inflate accuracy estimates and lead to overoptimism about the agent's abilities.
With AI agents being a relatively new field, the research and developer communities still have a lot to learn about testing the limits of these new systems that may soon become an important part of everyday applications.
“AI agent benchmarking is new and best practices have not yet been established, making it difficult to separate real progress from hype,” the researchers write. “Our thesis is that agents are sufficiently different from models that benchmarking methods need to be reconsidered.”