Mitigating Catastrophic Misalignment Risk With AI Control

Day: Feb 6, 2025
Time: 4:30–6pm
Session ID: Track 05
Location: CC2
Abstract:

One particularly dangerous failure mode for advanced AI systems is strategic subversion, where the AI system actively works to bypass safety measures. Recent work has provided evidence that models might become egregiously misaligned and thus engage in such strategic subversion. This type of threat is analogous to insider threats in traditional security contexts. We recently introduced AI control, a framework for developing and evaluating safety measures that are robust against intentional circumvention. Drawing from established risk-management strategies, we present novel techniques for implementing and testing these protective measures. This presentation explores the threat model underlying strategic subversion, examines the principles of AI control, and evaluates its potential as a defensive strategy against advanced AI systems that might actively work to bypass safety constraints.
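
To make the framework concrete, below is a minimal sketch of one well-known control protocol from the AI control literature, trusted monitoring with a limited audit budget. It is illustrative only, not the speakers' implementation: every name here (query_untrusted, monitor_suspicion, AUDIT_THRESHOLD, and so on) is a hypothetical placeholder for a real model call or review process.

```python
import random

# Hypothetical stand-ins for real components: an untrusted (capable) model,
# a trusted (weaker) monitor, and a human reviewer.
def query_untrusted(task: str) -> str:
    return f"untrusted answer to: {task}"

def query_trusted(task: str) -> str:
    return f"trusted (weaker) answer to: {task}"

def monitor_suspicion(task: str, output: str) -> float:
    # The trusted monitor scores how suspicious the untrusted output looks.
    return random.random()

def human_audit(task: str, output: str) -> str:
    # Placeholder for escalation to human review.
    return output

AUDIT_THRESHOLD = 0.95  # suspicion score that triggers escalation
AUDIT_BUDGET = 0.02     # fraction of tasks a human can afford to review

def run_with_control(tasks: list[str]) -> list[str]:
    """Run tasks on the untrusted model under a trusted-monitoring protocol."""
    results = []
    audits_used = 0
    max_audits = max(1, int(AUDIT_BUDGET * len(tasks)))
    for task in tasks:
        output = query_untrusted(task)
        suspicion = monitor_suspicion(task, output)
        if suspicion >= AUDIT_THRESHOLD:
            if audits_used < max_audits:
                # Escalate the most suspicious outputs to a human.
                audits_used += 1
                output = human_audit(task, output)
            else:
                # Audit budget spent: fall back to the trusted model,
                # trading some capability for safety.
                output = query_trusted(task)
        results.append(output)
    return results

if __name__ == "__main__":
    print(run_with_control([f"task {i}" for i in range(5)]))
```

The design point this sketch illustrates is that safety is evaluated assuming the untrusted model is adversarial: the protocol must remain robust even if query_untrusted tries to slip subversive outputs past the monitor.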

Speakers: