Can we unteach AI to be dangerous?
Big models that know too much are a safety issue in their own right
Anna
4/4/2026 · 4 min read
There are two main ways that AI represents a scary unknown. The first is the possibility that it has a mind of its own — subjectivity, its own goals, the ability to deceive. This problem is at the crux of AI alignment and gets the lion's share of attention in AI safety.
The second danger lies in what AI "knows". Although this represents risks of many shapes and sizes – from private data breaches all the way to bad actors bootstrapping biological weapons – it is far less discussed as a problem in its own right. My sense is that’s because you would end up questioning the most basic proposition of current AI development, namely, training large models on vast data stores to make them helpful assistants.
Although no one understands exactly how it works, everyone knows that throwing a ton of data at models makes the transformer magic happen. And this assumption is baked into the fundamental economic calculus of AI. Training models on data scraped from the internet is relatively fast and cheap, whereas using responsibly curated and specific datasets would be slow, expensive and hard to scale. So the status quo continues. All kinds of information get hoovered up during pre-training, and the biased, compromising or dangerous knowledge that results becomes an externality foisted onto society, much like AI’s energy and water use.
Where the problem is acknowledged, solutions tend to take the form of post-hoc interventions such as data filtering, output guardrails, or machine unlearning. I’d like to dig into machine unlearning in particular as an example of how difficult (and often ineffective) it is to tackle the problem of dangerous knowledge downstream. I also think machine unlearning is a great illustration (in reverse!) of how brittle all machine learning fundamentally still is.
What is unlearning?
When I first heard the word “unlearning”, I thought it sounded both elegant and intuitive: the equivalent of Ctrl+Z on your keyboard. Alas – unlearning is not “undo”. It is messy and ineffective, for technical as well as epistemological reasons.
Epistemologically, you could start by asking: what’s the difference between information and knowledge? Dangerous knowledge is rarely just data points, but rather the sum of certain kinds of information plus the procedural knowledge to put it into practice. (Think of the difference between having a list of chemicals and having the instructions for assembling them into a bomb.)
So unlearning is not a process of pulling discrete pieces of data out of a model’s weights and parameters (that wouldn’t be technically feasible anyway). In practice, it mostly means fine-tuning that trains models to suppress responses flagged for specific risk factors. Yet like all fine-tuning, this is superficial: it only refines surface behaviours and tends not to be very robust, especially in the face of adversarial persistence.
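To make the fine-tuning framing concrete, here is a minimal toy sketch of suppression-style unlearning. Everything in it is invented for illustration: the "model" is a two-weight logistic regression, the data clusters and learning rates are arbitrary, and "unlearning" simply means continuing to train with the flagged (forget) examples relabelled to a suppressed response.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Toy "model": logistic regression on 2-D points, no bias term.
# Retain data: positives near (2, 2), negatives near (-2, -2).
# Forget data: a "risky" positive cluster near (2, -2).
X_retain = np.vstack([rng.normal([2, 2], 0.5, (100, 2)),
                      rng.normal([-2, -2], 0.5, (100, 2))])
y_retain = np.concatenate([np.ones(100), np.zeros(100)])
X_forget = rng.normal([2, -2], 0.5, (50, 2))
y_forget = np.ones(50)

def grad(w, X, y):
    """Gradient of the mean cross-entropy loss."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# "Pre-training": the model learns everything, risky cluster included.
w = np.zeros(2)
X_all = np.vstack([X_retain, X_forget])
for _ in range(500):
    w -= 0.5 * grad(w, X_all, np.concatenate([y_retain, y_forget]))

acc_forget_before = ((sigmoid(X_forget @ w) > 0.5) == y_forget).mean()

# "Unlearning" as fine-tuning: same inputs, but the forget set is
# relabelled to the suppressed response (0).
for _ in range(500):
    w -= 0.5 * grad(w, X_all, np.concatenate([y_retain, np.zeros(50)]))

acc_forget_after = ((sigmoid(X_forget @ w) > 0.5) == y_forget).mean()
acc_retain_after = ((sigmoid(X_retain @ w) > 0.5) == y_retain).mean()
```

Note that the risky knowledge hasn't been removed in any deep sense: the weights have only been nudged so the flagged inputs now produce the suppressed output, which is exactly why further fine-tuning can often nudge them back.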
Apart from fine-tuning, there are deeper unlearning techniques that are more like surgical interventions. Whether called edits, ablations, or perturbations, these are all examples of targeted damage to the model's thinking mechanism aimed at eliminating an unwanted capability.
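The "surgical" framing is easiest to see on a network small enough to inspect by hand. In the hypothetical sketch below, the weights are chosen so that one hidden unit is, by construction, the sole carrier of an unwanted capability; ablating it (zeroing its incoming weights) removes that behaviour while leaving the benign pathway untouched. Real models are nowhere near this tidy, which is exactly why such interventions cause collateral damage.

```python
import numpy as np

# A hand-built two-layer ReLU network. By construction, hidden unit 2
# is the only unit that responds to the "risky" input direction.
W1 = np.array([[1.0, 0.0],
               [0.0, 0.0],
               [0.0, 1.0]])   # 3 hidden units x 2 inputs
W2 = np.array([1.0, 1.0, 1.0])  # readout weights

def forward(x, W1, W2):
    h = np.maximum(0, W1 @ x)   # ReLU hidden layer
    return W2 @ h

benign = np.array([1.0, 0.0])   # activates hidden unit 0 only
risky = np.array([0.0, 1.0])    # activates hidden unit 2 only

# Ablation: zero the incoming weights of the identified unit.
W1_ablated = W1.copy()
W1_ablated[2, :] = 0.0

benign_out = forward(benign, W1_ablated, W2)  # unchanged: 1.0
risky_out = forward(risky, W1_ablated, W2)    # capability removed: 0.0
```

In a real model the "risky" circuit is smeared across many units that also serve benign functions, so the same zeroing operation degrades unrelated capabilities — targeted damage, as described above.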
Finally, there's compression, or making models smaller through techniques such as pruning, quantising, or distillation. Going back to the fundamental “bigger is better” orthodoxy we talked about at the beginning, compression is generally not seen as a good solution because you lose the generalist capabilities that make models such helpful assistants in a variety of situations.
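Two of the standard compression ingredients are easy to illustrate. The sketch below applies magnitude pruning and 8-bit linear quantisation to a random weight matrix; the matrix size, pruning ratio and quantisation scheme are all arbitrary choices for illustration. Even here the trade-off is visible: both steps discard information, which is what erodes generalist capability in a real model.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(0, 1, size=(8, 8))  # stand-in weight matrix

# Magnitude pruning: zero out the smallest 50% of weights by |value|.
threshold = np.quantile(np.abs(W), 0.5)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)
sparsity = (W_pruned == 0).mean()

# 8-bit linear quantisation: map each weight to one of 255 integer
# levels in [-127, 127], then reconstruct.
scale = np.abs(W).max() / 127
W_q = np.round(W / scale).astype(np.int8)   # stored as int8
W_dequant = W_q.astype(np.float64) * scale  # lossy reconstruction
max_err = np.abs(W - W_dequant).max()       # bounded by scale / 2
```

Distillation, the third technique mentioned above, is different in kind: rather than shrinking the weights directly, a smaller student model is trained to imitate the larger teacher's outputs, which is the property UNDO exploits below.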
So I was intrigued to read a paper last year that combines several individual unlearning techniques such as fine-tuning, perturbation, and distillation into one broad unlearning method called UNDO, or Unlearn-Noise-Distill-on-Outputs. (Not quite as simple as Ctrl-Z, but getting there!) Essentially, UNDO creates a student model compressed from a larger teacher model, where the student learns only a safe subset of the teacher model’s responses.
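As a rough intuition for how the three steps might compose (this is my own toy sketch, not the paper's actual method or models), the pipeline below chains them on the same kind of logistic-regression "model" as before: unlearn by fine-tuning the teacher to suppress the forget set, perturb a copy of the teacher with noise to produce the student's starting point, and distil the student on the teacher's post-unlearning outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

def fit(w, X, y, lr=0.5, steps=1000):
    """Gradient descent on cross-entropy; y may be soft targets."""
    for _ in range(steps):
        w = w - lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

# Toy teacher: trained on retain data plus a "risky" forget cluster.
X_retain = np.vstack([rng.normal([2, 2], 0.5, (100, 2)),
                      rng.normal([-2, -2], 0.5, (100, 2))])
y_retain = np.concatenate([np.ones(100), np.zeros(100)])
X_forget = rng.normal([2, -2], 0.5, (50, 2))
X_all = np.vstack([X_retain, X_forget])

teacher = fit(np.zeros(2), X_all, np.concatenate([y_retain, np.ones(50)]))

# Step 1 -- Unlearn: fine-tune the teacher to suppress the forget set.
teacher = fit(teacher, X_all, np.concatenate([y_retain, np.zeros(50)]))

# Step 2 -- Noise: the student starts from a heavily perturbed copy,
# so it cannot simply inherit the teacher's buried risky knowledge.
student = teacher + rng.normal(0, 5.0, size=2)

# Step 3 -- Distill on outputs: the student learns to match the
# teacher's post-unlearning outputs on broad probe inputs.
X_probe = rng.normal(0, 2.0, (500, 2))
student = fit(student, X_probe, sigmoid(X_probe @ teacher), steps=2000)
```

The intended property is that the student only ever sees the teacher's already-suppressed behaviour, so the risky capability has no surface through which to transfer, whereas fine-tuning alone leaves it latent in the weights.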
Two uncomfortable parallels
While I found the UNDO paper insightful and hopeful on a technical level, it also made me reflect on some of the broader dimensions of “knowing too much” and how this has played out in human society.
First, consider the teacher-student model as a parallel for the burden of knowledge across generations. When it comes to the “problematic ancestors” in our own history — such as soldiers returning from war, or people entangled in complicity of some kind — they are often left to deal with their complex, forbidden knowledge alone, or even outright rejected and pushed to the margins of society. Let’s recall too that the AI industry is shaped by American culture, which resists social responsibility of any kind on one hand and is deeply wedded to a good guys-bad guys binary on the other (I say this as an American in exile myself).
All of which means that the reflexive response to models knowing too much is likely to be some form of compartmentalisation in which blame is put on individual humans; who knows, maybe certain models will even be scapegoated in the future.
Second: unlearning techniques like noise, ablations, and perturbations have a disturbing parallel with brutal psychiatric interventions of the 20th century — think of lobotomies or electric shock — which were performed disproportionately on marginalised populations. I am not anthropomorphising models when I say this; I don't believe they experience suffering from these interventions. My point is that the societal response to minds that threaten prevailing power structures seems to involve both reflexive violence and euphemisms to disguise it.
Compartmentalisation and ablation are variations on a broader pattern: draping uncomfortable truths in jargon and turning them into research specialisms or bureaucratic make-work. I wish we would call out "more data, bigger models" as the elephant in the room instead of devising ever more elaborate post-hoc solutions.
Making AI safer can't only be about technical fixes aimed at suppressing or undoing. We need a public conversation about what knowing too much really means.