Better Benchmarks: A Roadmap for High-Stakes Evaluation in the Age of Agentic AI

Day: Feb 7, 2025
Time: 11:30am–1pm
Session ID: Track 07
Location: CC9-CC13
Abstract:

AI models are increasingly deployed in high-stakes environments, demanding rigorous assessments of their capabilities and risks. Benchmarks remain central to evaluating model performance, guiding policy, and informing downstream tasks. Yet many commonly used benchmarks exhibit concerning gaps, such as inadequate reporting of statistical significance and limited replicability. This talk will examine these gaps and propose a framework for designing more robust benchmarks. As we enter the age of agentic AI, the discussion broadens to unique and foundational risks inherent to multiagent systems, including miscoordination, conflict, and collusion, which remain inadequately addressed by existing benchmarks. This talk will present a taxonomy of AI risks that emerge only in, are substantially more challenging in, or are qualitatively different in the multiagent setting, and explore how we might approach the measurement, benchmarking, and evaluation of such systems. This presentation aims to equip researchers, developers, and policymakers with insights for building more rigorous, transparent, and trustworthy AI benchmarks.

Speakers: