OUR BLOG
Updates from our blog
19 Mar 2025
Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
The "emergent misalignment" paper shows that GPT-4o can exhibit general misbehaviour when it is fine-tuned to produce code with security holes: it then produces dangerous content in all sorts of different areas. Our replication suggests that this might be due not to GPT-4o turning bad, but to it losing its 'inhibitions': it reverts to more standard LLM behaviour, ignoring the various control mechanisms that have transformed it from the sequence predictor it once was. In particular, it is a lot more human-like on topics like religion and drunkenness. Understanding the complexity of misalignment, what it is and what it isn't, is necessary to combat it.






19 Mar 2025
Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions



12 Feb 2025
Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation



28 Sept 2023
CoinRun: Solving Goal Misgeneralisation



13 Sept 2023
Using fAIr to measure gender bias in LLMs



16 Apr 2022
Concept extrapolation for hypothesis generation



1 May 2022
ACE for goal generalisation



24 Aug 2023
ACE mitigates simplicity bias



19 Jun 2023
Concept Extrapolation: A Conceptual Primer



1 Mar 2023
EquitAI: A gender bias mitigation tool for generative AI



6 Dec 2022
Creating a prompt evaluator to prevent LLM jailbreaking



4 May 2022
Missing Mechanisms of Manipulation in the EU AI Act



22 Feb 2022
Recognising the importance of preference change: A call for a coordinated multidisciplinary research effort in the age of AI



28 Feb 2022
The dangers in algorithms learning humans' values and irrationalities


