IN THIS LESSON

This is the ninth lecture in the Language Models and Intelligent Agentic Systems course, run by Meridian Cambridge in collaboration with the Cambridge Centre for Data Driven Discovery (C2D3).

This lecture covers various failure modes that can occurs when powerful systems are trained via reinforcement learning. In reward hacking, a system finds a way to optimise the reward funciton given to it that is contrary to the intentions of the system designer. In goal misgeneralisation, the system develops an internal goal different from the goal intended by the designer, which leads the model to display intended behaviour during training but can lead to undesired behaviour in deployment.

View Lecture Slides

9. Reward Hacking and Goal Misgeneralisation