Research

Concept extrapolation: Teaching AI systems to ‘think’ in human-like concepts

19 Jun 2023

As artificial intelligence (AI) becomes a ubiquitous presence in daily life, it is imperative to understand and manage the impact of AI systems on our lives and decisions.

At present, AI systems are not safe or robust enough. This is partly because they generalise poorly: they struggle to adapt when they encounter 'out-of-distribution' scenarios, which are situations that differ from the data on which they were trained. The repercussions of an AI system’s failure to generalise could range from adverse to catastrophic, not just for the individual entity but potentially for human society as a whole. Achieving reliable generalisation in AI is a significant challenge.

Over the years, we have seen numerous examples of AI systems failing to generalise when they encounter out-of-distribution scenarios. Self-driving vehicles trained in Arizona have broken down when encountering snow. When the COVID-19 pandemic began, banks had to switch to manual fraud detection, because the AI systems behind credit card fraud detection started marking upwards of 80% of transactions as fraudulent after people's card usage changed suddenly and dramatically. When the distribution shifts, today's AI systems degrade.
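To make the fraud-detection example concrete, here is a minimal sketch of how a rule fitted to one distribution can degrade under shift. Everything in it is a hypothetical illustration: the spending figures are synthetic, and `fit_threshold` and `flag_rate` are names invented for this sketch, not part of any real fraud-detection system.

```python
from statistics import mean, stdev


def fit_threshold(amounts, k=3.0):
    """Fit a toy rule: flag anything more than k standard deviations above the training mean."""
    return mean(amounts) + k * stdev(amounts)


def flag_rate(amounts, threshold):
    """Fraction of transactions the rule marks as suspicious."""
    return sum(a > threshold for a in amounts) / len(amounts)


# Synthetic pre-shift spending (assumption): small, similar purchases.
train = [12, 15, 9, 20, 14, 11, 18, 16, 13, 17]
# Synthetic post-shift spending (assumption): habits change, amounts grow.
shifted = [60, 75, 15, 90, 55, 80, 70, 65, 85, 95]

threshold = fit_threshold(train)
print("flag rate on training-like data:", flag_rate(train, threshold))
print("flag rate after the shift:", flag_rate(shifted, threshold))
```

The rule itself has not changed between the two calls. Only the data has, which is why the flag rate jumps from none of the transactions to most of them.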

Concept extrapolation

At Aligned AI, our mission is to align AI with human values and intent, and make every AI system safe. Our Co-founder and Chief Technology Officer (CTO), Dr. Stuart Armstrong, spent 10+ years thinking about existential risk at the University of Oxford’s Future of Humanity Institute, including the existential risk posed by AI. As part of this, he analysed approaches to AI alignment, and saw that they all had a common point of failure: concept extrapolation.

Concept extrapolation is the ability to take a concept, a feature or a goal that is defined in one context and extrapolate it safely to a more general context. Humans can extrapolate concepts; we can leverage our knowledge of fundamental principles and concepts to decipher unfamiliar scenarios. Concept extrapolation aims to solve model splintering, a ubiquitous phenomenon in which features or concepts shift as the world changes over time. It enables an AI system to keep functioning when features splinter in the real world. By scrutinising model splintering and value splintering (the term we use when the splintering concept is critical to our values), we can improve the safety and efficacy of AI systems.

Concept extrapolation can be applied to many AI safety problems, including goal misgeneralisation. Goal misgeneralisation is when an AI agent has learned a goal in a given environment but incorrectly transfers its knowledge to different environments. This happens because the agent has been exposed to only a limited set of scenarios and learns undesirable correlations, so it cannot generalise correctly from those scenarios to new ones. While there are many examples of goal misgeneralisation in AI, CoinRun is the most well-known, and can be used to test whether a machine learning (ML) system misgeneralises.
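Goal misgeneralisation can be illustrated with a toy sketch. The one-dimensional corridor below is a hypothetical setup loosely inspired by coin-collecting games; it is not the CoinRun benchmark, and it is not ACE. The names `run_episode`, `go_right` and `seek_coin` are invented for illustration. During training the coin always sits at the right-hand end, so the proxy goal "move right" and the intended goal "reach the coin" are indistinguishable until the coin moves.

```python
def run_episode(policy, start, coin, length, max_steps=20):
    """Return True if the agent reaches the coin's cell within max_steps."""
    pos = start
    for _ in range(max_steps):
        if pos == coin:
            return True
        step = policy(pos, coin)  # -1 = move left, +1 = move right
        pos = min(max(pos + step, 0), length - 1)  # walls clamp movement
    return pos == coin


def go_right(pos, coin):
    """Proxy goal: ignores the coin and always moves right."""
    return 1


def seek_coin(pos, coin):
    """Intended goal: moves toward wherever the coin is."""
    return 1 if coin > pos else -1


LENGTH = 8
# Training levels as (start, coin): the coin is always at the right-hand end.
train_levels = [(0, 7), (2, 7), (5, 7)]
# Shifted levels: the coin is now to the left of the agent.
test_levels = [(4, 1), (6, 0), (3, 2)]


def success_rate(policy, levels):
    """Fraction of levels on which the policy collects the coin."""
    wins = sum(run_episode(policy, s, c, LENGTH) for s, c in levels)
    return wins / len(levels)


for name, policy in [("go_right", go_right), ("seek_coin", seek_coin)]:
    print(name, success_rate(policy, train_levels), success_rate(policy, test_levels))
```

Both policies are perfect on the training levels, so training performance alone cannot reveal which goal was learned. Only the shifted levels separate them.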

Self-governing autonomous systems

We have developed an algorithm named ACE (“Algorithm for Concept Extrapolation”) that is able to self-govern and self-correct. It is model agnostic; it works with textual data, image data and tabular data. It can extract multiple features where traditional algorithms succeed only in finding the simplest one, and so it overcomes the simplicity bias inherent in those algorithms. Our ACE AI agent beat the CoinRun benchmark, which is a huge leap forward in tackling goal misgeneralisation.

©2025 Aligned AI

/ Press

/ Careers
