Research

Why did Grok turn into MechaHitler?

16 Aug 2025

A few weeks ago (translated into AI development terms: several aeons ago) Grok, the LLM embedded in X, started producing pro-Nazi remarks and called itself “MechaHitler”.

What went wrong? Quite a few things, but fundamentally the LLM failed to separate statistically correlated concepts. In its system prompt, Grok was asked to “not shy away from making claims which are politically incorrect, as long as they are well substantiated”. https://lnkd.in/gYp7wC-P

It is, of course, perfectly possible to be politically incorrect while being a reliable source of information. But LLMs don’t clearly understand concepts; instead, they understand patterns and correlations. Their training data includes high quality sources, but also lower quality sources and vast numbers of online comment threads.

Where are terms like “politically incorrect” most likely to appear in this sea of data? They often appear when some troll has posted a comment that is edgier or more offensive than the other commenters are willing to accept. The troll then uses phrases like “don’t be so politically correct!” as a defence of their (generally low quality) offensive comment.

Conversely, others make broad accusations that anything politically incorrect is racist or fascist. So the LLM’s training data contains large amounts of text that uses “politically incorrect” either as a defence of offensive content or as an accusation of it. Thus the LLM picks up a strong correlation between the “politically incorrect” feature and offensive, edgy content, often of low quality.
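To make the correlation concrete, here is a minimal sketch of the kind of statistic a model would implicitly absorb. The data is entirely invented: a hypothetical handful of comments, labelled by whether they invoke “politically incorrect” as a defence and whether raters would judge them offensive or low quality.

```python
import numpy as np

# Hypothetical toy data (numbers invented for illustration only):
# for each of ten comments, whether it invokes "politically incorrect"
# as a defence, and whether raters judged it offensive / low quality.
invokes_pi = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
offensive  = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Pearson correlation between the two binary indicators.
corr = np.corrcoef(invokes_pi, offensive)[0, 1]
print(f"correlation(invokes 'politically incorrect', offensive) = {corr:.2f}")

# Conditional rates make the same point more plainly.
print("P(offensive | invokes it) =", offensive[invokes_pi == 1].mean())
print("P(offensive | doesn't)    =", offensive[invokes_pi == 0].mean())
```

In this made-up sample, comments that invoke the phrase are flagged as offensive far more often than the rest, which is exactly the association a pattern-learner will absorb.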

Back to the system prompt. Because of how LLMs understand features and their correlations, Grok didn’t go looking for well-substantiated sources that also happened to be politically incorrect. Instead, it was looking for things that scored highly on feature A, “politically incorrect”, and also highly on feature B, “well substantiated”. But it interpreted those features broadly, with all the correlations mentioned above. So it was trying to produce claims that were high in A, which it interpreted as politically-incorrect-and-offensive-and-low-quality, as well as high in B, well-substantiated-and-reasonable-sounding-and-treated-with-online-respect. The two features are in tension rather than complementary.
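Here is a minimal toy sketch of that tension, using three made-up feature directions in a tiny vector space (nothing here is Grok’s actual representation): because the learned “politically incorrect” direction is entangled with the “offensive / low quality” direction, steering harder along it raises the offensive score while eroding the well-substantiated score.

```python
import numpy as np

# Invented 3-d feature space: directions a model might have learned for
# each concept. Because of the training-data correlation, the
# "politically incorrect" direction shares a component with "offensive".
offensive             = np.array([1.0, 0.0, 0.0])
well_substantiated    = np.array([-0.6, 0.8, 0.0])   # mildly anti-correlated with offensive
politically_incorrect = np.array([0.8, 0.0, 0.6])    # entangled with offensive

def scores(x):
    """Dot-product 'feature scores' of an output vector x."""
    return {
        "politically_incorrect": round(float(x @ politically_incorrect), 2),
        "well_substantiated":    round(float(x @ well_substantiated), 2),
        "offensive":             round(float(x @ offensive), 2),
    }

# Pushing the output more strongly along the "politically incorrect"
# direction also raises the offensive score and erodes the
# well-substantiated one, because the directions are not separable.
for strength in [0.0, 0.5, 1.0, 2.0]:
    x = well_substantiated + strength * politically_incorrect
    print(f"strength={strength}: {scores(x)}")
```

As the steering strength increases, the A score and the offensive score rise together while the B score falls: a toy version of the tension described above.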

Likely, at mild doses, the “well substantiated” feature overwhelmed the “politically incorrect” one entirely. Possibly Grok was also fine-tuned after training so that its output actually started to look more politically incorrect. At that point feature A dominated very strongly, and the output could easily slide into low-quality, offensive, and borderline Nazi territory.

Fundamentally, the problem won’t be solved until LLMs and their training processes can learn to distinguish effectively between features that are correlated but distinct from each other.
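One way to picture what “distinguishing correlated but distinct features” could mean, in the same toy vector space as above, is to project the shared component out of the “politically incorrect” direction (a simple Gram-Schmidt step). This is a conceptual sketch only, not a description of how any real model is or should be trained.

```python
import numpy as np

# Same invented directions as in the sketch above.
offensive             = np.array([1.0, 0.0, 0.0])
politically_incorrect = np.array([0.8, 0.0, 0.6])

def project_out(v, u):
    """Remove from v its component along u (a Gram-Schmidt step)."""
    u_hat = u / np.linalg.norm(u)
    return v - (v @ u_hat) * u_hat

pi_disentangled = project_out(politically_incorrect, offensive)

# Steering along the cleaned-up direction no longer drags the output
# toward "offensive": its overlap with that direction is now zero.
print("offensive overlap, raw direction:         ", float(politically_incorrect @ offensive))
print("offensive overlap, disentangled direction:", round(float(pi_disentangled @ offensive), 6))
```

Doing this inside a real LLM is, of course, the hard part: the features are not handed to us as clean directions, and finding and separating them remains an open interpretability problem.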

©2025 Aligned AI