An Introduction to Mechanistic Interpretability

Day: Feb 7, 2025
Time: 2:30–4pm
Session ID: Track 12
Location: CC7

Abstract:

This talk will be an overview of mechanistic interpretability: the study of how to reverse-engineer what a neural network is doing. Neel will discuss why understanding what's happening inside a network is hard, some key progress made so far, and where the field is going. He'll cover the state of the art on what interpretability can and can't do, including key limitations and ways it's overhyped. Finally, he'll discuss some of the practical implications of good interpretability tools for AGI safety, and for understanding and evaluating the safety of current frontier systems.

Speakers: