IN THIS LESSON

This is the eleventh lecture in the Language Models and Intelligent Agentic Systems course, run by Meridian Cambridge in collaboration with the Cambridge Centre for Data Driven Discovery (C2D3).

This lecture covers deceptive alignment, a phenomenon where an agentic system acts aligned in situations where it believes it is observed and under threat of modification, but behaves differently in situations where it believes no such threat exists. We start with a conceptual discussion of deceptive alignment and why it might arise naturally from gradient-based training. We then look at recent empirical work on sleeper agents, sandbagging, and alignment faking. Though there is no evidence yet of deceptive alignment 'in the wild', we suggest that it poses a very real risk in the years to come.