Updates from our blog
19 Mar 2025
Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
The "emergent misalignment" paper shows that GPT-4o can exhibit general misbehaviour when it is fine-tuned to produce code with security holes: it then produces dangerous content in all sorts of different areas. Our replication suggests that this might be due not to GPT-4o turning bad, but to it losing its 'inhibitions': it reverts to more standard LLM behaviour, ignoring the various control mechanisms that transformed it from the sequence predictor it once was. In particular, it becomes a lot more human-like on topics like religion and drunkenness. Understanding the complexity of misalignment, what it is and what it isn't, is necessary to combat it.









12 Feb 2025
Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation



13 Sept 2023
Using fAIr to measure gender bias in LLMs



16 Apr 2022
Concept extrapolation for hypothesis generation



1 May 2022
ACE for goal generalisation



24 Aug 2023
ACE mitigates simplicity bias



1 Mar 2023
EquitAI: A gender bias mitigation tool for generative AI


