Mitigating Catastrophic Misalignment Risk With AI Control

Day: Feb 6, 2025
Time: 4:30–6pm
Session ID: Track 05
Location: CC2
Abstract:

One particularly dangerous failure mode for advanced AI systems is strategic subversion, where the AI system actively works to bypass safety measures. Recent work has provided evidence that models might become egregiously misaligned and thus engage in such strategic subversion. This type of threat is analogous to insider threats in traditional security contexts. We recently introduced AI control, a framework for developing and evaluating safety measures that are robust against intentional circumvention. Drawing from established risk-management strategies, we present novel techniques for implementing and testing these protective measures. This presentation explores the threat model underlying strategic subversion, examines the principles of AI control, and evaluates its potential as a defensive strategy against advanced AI systems that might actively work to bypass safety constraints.
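
To make the framework concrete, below is a minimal sketch of one well-known control protocol from the AI control literature, trusted monitoring with a limited audit budget. It is illustrative only, not the speakers' implementation: every name here (query_untrusted, monitor_suspicion, AUDIT_THRESHOLD, and so on) is a hypothetical placeholder for a real model call or review process.

```python
import random

# Hypothetical stand-ins for real components: an untrusted (capable) model,
# a trusted (weaker) monitor, and a human reviewer.
def query_untrusted(task: str) -> str:
    return f"untrusted answer to: {task}"

def query_trusted(task: str) -> str:
    return f"trusted (weaker) answer to: {task}"

def monitor_suspicion(task: str, output: str) -> float:
    # The trusted monitor scores how suspicious the untrusted output looks.
    return random.random()

def human_audit(task: str, output: str) -> str:
    # Placeholder for escalation to human review.
    return output

AUDIT_THRESHOLD = 0.95  # suspicion score that triggers escalation
AUDIT_BUDGET = 0.02     # fraction of tasks a human can afford to review

def run_with_control(tasks: list[str]) -> list[str]:
    """Run tasks on the untrusted model under a trusted-monitoring protocol."""
    results = []
    audits_used = 0
    max_audits = max(1, int(AUDIT_BUDGET * len(tasks)))
    for task in tasks:
        output = query_untrusted(task)
        suspicion = monitor_suspicion(task, output)
        if suspicion >= AUDIT_THRESHOLD:
            if audits_used < max_audits:
                # Escalate the most suspicious outputs to a human.
                audits_used += 1
                output = human_audit(task, output)
            else:
                # Audit budget spent: fall back to the trusted model,
                # trading some capability for safety.
                output = query_trusted(task)
        results.append(output)
    return results

if __name__ == "__main__":
    print(run_with_control([f"task {i}" for i in range(5)]))
```

The design point this sketch illustrates is that safety is evaluated assuming the untrusted model is adversarial: the protocol must remain robust even if query_untrusted tries to slip subversive outputs past the monitor.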

Speakers: