OUR BLOG

Updates from our blog

19 Mar 2025

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

The "emergent misalignment" paper shows that GPT-4o can exhibit general misbehaviour when it's fine-tuned to produce code with security holes: it then produces dangerous content in all sorts of different areas. Our replication suggests that this might be due not to GPT-4o turning bad, but to its losing its 'inhibitions': it reverts to more standard LLM behaviour, ignoring the various control mechanisms that have transformed it from the sequence predictor that it once was. In particular, it's a lot more human-like on topics like religion and drunkenness. Understanding the complexity of misalignment, what it is and what it isn't, is necessary to combat it.

12 Feb 2025

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

28 Sept 2023

CoinRun: Solving Goal Misgeneralisation

13 Sept 2023

Using fAIr to measure gender bias in LLMs

16 Apr 2022

Concept extrapolation for hypothesis generation

1 May 2022

ACE for goal generalisation

24 Aug 2023

ACE mitigates simplicity bias

19 Jun 2023

Concept Extrapolation: A Conceptual Primer

1 Mar 2023

EquitAI: A gender bias mitigation tool for generative AI

6 Dec 2022

Creating a prompt evaluator to prevent LLM jailbreaking

4 May 2022

Missing Mechanisms of Manipulation in the EU AI Act

22 Feb 2022

Recognising the importance of preference change: A call for a coordinated multidisciplinary research effort in the age of AI

28 Feb 2022

The dangers in algorithms learning humans' values and irrationalities

9 Sept 2021

Sigmoids behaving badly: why they usually cannot predict the future as well as they seem to promise

©2025 Aligned AI
