OUR BLOG

Updates from our blog

19 Mar 2025

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

The "emergent misalignment" paper shows that GPT-4o can exhibit general misbehaviour when it is fine-tuned to produce code with security holes: it then produces dangerous content in all sorts of unrelated areas. Our replication suggests that this might be due not to GPT-4o turning bad, but to it losing its 'inhibitions': it reverts to more standard LLM behaviour, ignoring the various control mechanisms that transformed it from the sequence predictor it once was. In particular, it becomes a lot more human-like on topics like religion and drunkenness. Understanding the complexity of misalignment, what it is and what it isn't, is necessary to combat it.

©2025 Aligned AI
