An Introduction to Mechanistic Interpretability

Day: Feb 7, 2025

Time: 2:30–4pm

Session ID: Track 12

Location: CC7

IASEAI Program Overview

Abstract:

This talk gives an overview of mechanistic interpretability: the study of how to reverse engineer what a neural network is doing. Neel will discuss why understanding what's happening inside a network is hard, some key bits of progress we've made so far, and where the field is going. He'll cover the state of the art on what interpretability can and can't do today, including key limitations and ways it's overhyped. Finally, he'll discuss some of the practical implications of good interpretability tools for AGI safety, and for understanding and evaluating the safety of current frontier systems.

Speakers:

Neel Nanda