If you listen to the compelling arguments of AI doomsayers, the coming generations of artificial intelligence represent a profound threat to mankind – possibly even an existential threat.
We've all seen how easily apps like ChatGPT can be tricked into saying or doing naughty things they shouldn't. We've seen evidence that they'll try to hide their intentions, and to seek and consolidate power. The more access AIs are given to the physical world via the internet, the greater their capacity to cause harm in a variety of creative ways, should they choose to.
Why would they do such a thing? We don't know. In fact, their inner workings have been more or less completely opaque, even to the companies and individuals who make them.
Inexplicable alien 'minds' of AI models
These remarkable pieces of software are very different from most that have come before them. Their human creators have built the architecture, infrastructure and methods by which these artificial minds can develop their own version of intelligence, and they've fed them vast amounts of text, video, audio and other data. But from there, the AIs have gone ahead and built up their own 'understanding' of the world.
They convert these large stores of data into tiny scraps called tokens, sometimes parts of words, sometimes parts of images or bits of audio. And then they generate an incredibly complex set of probability weights relating tokens to each other, and linking groups of tokens to other groups. In this way, they are somewhat like the human brain, finding connections between letters, words, sounds, images and more abstract concepts, and forming them into a highly complex neural network.
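To make that concrete, here's a deliberately tiny Python sketch of the idea. It is not how production LLMs are actually built (they learn billions of weights across deep neural networks, not a simple count table), but it shows the basic notion of text becoming tokens and tokens becoming probability weights.

```python
from collections import Counter, defaultdict

# A toy 'corpus', already split into word-level tokens for simplicity.
corpus = "the golden gate bridge spans the golden gate strait".split()

# Count how often each token follows another -- a crude stand-in for learned weights.
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

# Convert the counts into probabilities for each preceding token.
probs = {
    prev: {nxt: count / sum(counts.values()) for nxt, count in counts.items()}
    for prev, counts in transitions.items()
}

print(probs["golden"])  # {'gate': 1.0} -- 'gate' always follows 'golden' in this corpus
print(probs["gate"])    # {'bridge': 0.5, 'strait': 0.5}
```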
These massive matrices full of probabilistic weights represent the AI's 'mind', and they drive its ability to take in input and respond with specific outputs. And, much like with the human brains that inspired their design, it's nearly impossible to figure out what these minds are 'thinking', or why they make the decisions they do.
Personally, I picture them as strange alien minds locked away in black boxes, able to interact with the world only through the limited pipelines by which information flows in and out of them. And all efforts to 'align' these minds to work with humans in a productive, safe and harmless way have so far been made at the pipeline level, not within the 'minds' themselves.
We can't tell them what to think, and we don't know where the bad words or bad ideas reside in their minds; we can only restrict what they can say and do – a concept that's tough enough now, and that promises to get tougher as they become smarter.
This is my most reductive, bone-level understanding of a dense and complex situation – and please jump into the comments to expand, query, debate or clarify as needed – but it gives some indication of why I think the news that has recently come out of Anthropic and OpenAI represents a major milestone in humanity's relationship with AIs.
Interpretability: Peeking into the black box
“Today,” the Anthropic interpretability team writes in a blog post published in late May, “we report a significant advance in understanding the inner workings of AI models. We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language model. This interpretability discovery could, in future, help us make AI models safer.”
Essentially, the Anthropic team has been tracking the 'internal state' of its AI models as they work, having them spit out huge lists of numbers representing the 'neuron activations' in their artificial brains as they converse with humans. “It turns out,” the team writes, “that each concept is represented across many neurons, and each neuron is involved in representing many concepts.”
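As a rough illustration of what 'recording the internal state' can look like in practice, here's a minimal PyTorch sketch that registers a hook on one layer of a toy stand-in network and captures its activations for later analysis. The model, layer sizes and data are placeholders, not Anthropic's actual tooling.

```python
import torch
import torch.nn as nn

# A tiny stand-in for one slice of a language model's layer stack.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

captured = []

def record_activations(module, inputs, output):
    # One tensor of 'neuron activations' is saved per forward pass.
    captured.append(output.detach())

# Hook the middle of the stack, roughly analogous to a model's 'middle layer'.
model[1].register_forward_hook(record_activations)

_ = model(torch.randn(8, 512))   # run some fake token representations through
print(captured[0].shape)         # torch.Size([8, 512])
```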
Using 'sparse autoencoders' and a technique called 'dictionary learning', Anthropic's researchers set about trying to match patterns of 'neuron activations' with concepts and ideas familiar to humans. They had some success late last year working with very small "toy" versions of language models, discovering 'thought patterns' the models lit up when dealing with ideas such as DNA sequences, nouns in mathematics, and text written in capital letters.
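To give a flavor of what dictionary learning with a sparse autoencoder looks like in code, here's a minimal PyTorch sketch. The layer sizes, sparsity penalty and random 'activations' are invented for illustration; Anthropic's real setup is vastly larger and differs in many details.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_neurons: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(n_neurons, n_features)  # activations -> feature space
        self.decoder = nn.Linear(n_features, n_neurons)  # features -> reconstructed activations

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative features
        return features, self.decoder(features)

# Pretend these are activations captured from one layer of a language model.
fake_activations = torch.randn(256, 512)                  # 256 samples, 512 'neurons'

sae = SparseAutoencoder(n_neurons=512, n_features=4096)   # many more features than neurons
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

for step in range(100):
    features, reconstruction = sae(fake_activations)
    # Reconstruction loss keeps the features faithful to the original activations;
    # the L1 penalty keeps them sparse, so each tends to fire for a specific pattern.
    loss = ((reconstruction - fake_activations) ** 2).mean() + 1e-3 * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```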
It was a promising start, but the team was by no means certain that the approach would scale up to the vastly larger size of today's commercial LLMs. So Anthropic built a dictionary learning model capable of tackling its medium-sized Claude 3 Sonnet LLM, and set out to test the approach at scale.
The results? Well, the team was blown away. “We successfully extracted millions of features from the middle layer of Claude 3.0 Sonnet,” reads the blog post, “providing a rough conceptual map of its internal states halfway through its computation. This is the first ever detailed look inside a modern, production-grade large language model.”
Interestingly, the AI appears to store concepts in ways that are independent of language, or even of data type; the 'idea' of the Golden Gate Bridge, for example, lights up when the model processes images of the bridge, or text about it in several different languages.
And 'ideas' can become much more abstract than that. The team discovered features that activate when faced with things like coding errors, gender bias, or many different ways of approaching the concept of discretion or privacy.
And indeed, the team was able to probe the AI's conceptual web for all manner of darkness, from ideas about code backdoors and biological weapons development to notions of racism, sexism, the pursuit of power, deception and manipulation. It's all in there.
What's more, the researchers were able to visualize the relationships between different concepts stored in the model's 'brain', developing a measure of the 'distance' between them and creating a series of mind maps showing how closely related various concepts are. Near the Golden Gate Bridge concept, for example, the team found other features for Alcatraz Island, the Golden State Warriors, California Governor Gavin Newsom and the 1906 San Francisco earthquake.
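One simple way such a 'distance' can be computed, assuming each feature has a direction in activation space (its decoder column, for instance), is cosine similarity between those directions. The feature names and random vectors below are purely illustrative.

```python
import torch
import torch.nn.functional as F

feature_names = ["Golden Gate Bridge", "Alcatraz Island", "Golden State Warriors", "coding error"]
# One unit-length direction per feature; in reality these would come from a trained autoencoder.
feature_vectors = F.normalize(torch.randn(len(feature_names), 512), dim=1)

similarities = feature_vectors @ feature_vectors.T        # pairwise cosine similarities

query = feature_names.index("Golden Gate Bridge")
ranked = similarities[query].argsort(descending=True).tolist()
for idx in ranked[1:]:                                    # skip the feature itself
    print(f"{feature_names[idx]}: {similarities[query, idx].item():.3f}")
```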
The same held for more abstract concepts: near the idea of a Catch-22 situation, the model grouped features for 'impossible choices,' 'difficult situations,' 'curious paradoxes,' and 'between a rock and a hard place.' This “demonstrates,” the team writes, “that the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity. This might be the origin of Claude's excellent ability to make analogies and metaphors.”
The dawn of AI brain surgery – and possible lobotomies
“Importantly,” the team writes, “we can also artificially amplify or suppress these features to see how Claude's responses change.”
The team began 'clamping' certain concepts, modifying the model to force certain features to fire as it answered completely unrelated questions, and this radically changed the model's behavior, as shown in the video below.
Dictionary Learning on Claude 3 Sonnet
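Mechanically, the intervention amounts to pinning one feature to a chosen value before decoding back into the model's activation stream. Here's a hedged sketch of that step; the decoder, feature index and clamp value are hypothetical stand-ins rather than Anthropic's code.

```python
import torch

n_features, n_neurons = 4096, 512
decoder = torch.nn.Linear(n_features, n_neurons)   # stand-in for a trained autoencoder's decoder

features = torch.relu(torch.randn(1, n_features))  # features for one token position
GOLDEN_GATE_FEATURE = 1234                         # hypothetical index of the bridge feature

clamped = features.clone()
clamped[:, GOLDEN_GATE_FEATURE] = 10.0             # pin the feature to an artificially high value

# These modified activations would be fed back into the model's later layers,
# steering its output toward the clamped concept.
steered_activations = decoder(clamped)
print(steered_activations.shape)                   # torch.Size([1, 512])
```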
This is pretty incredible stuff; Anthropic has shown that it can not only create a mind map of an artificial intelligence, it can also modify the relationships within that mind map, toying with the model's understanding of the world – and with its subsequent behavior, too.
The potential here for AI safety is clear. If you know where the bad thoughts live, and you can see when the AI is thinking about them, you've got an extra layer of monitoring that can be used in a supervisory sense. And if you can strengthen or weaken the connections between certain concepts, you could potentially eliminate certain behaviors from the AI's range of possible responses, or even excise certain ideas from its understanding of the world altogether.
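As a sketch of what that extra monitoring layer might look like, assuming interpretable features are already being extracted for each response, the snippet below simply checks whether any safety-relevant feature fires above a threshold. The indices, labels and threshold are invented for illustration.

```python
import torch

FLAGGED_FEATURES = {87: "deception", 452: "bioweapons"}   # hypothetical feature indices
THRESHOLD = 5.0

def audit(features: torch.Tensor) -> list[str]:
    """Return the labels of any flagged features that fire strongly anywhere in the response."""
    alerts = []
    for idx, label in FLAGGED_FEATURES.items():
        if features[:, idx].max() > THRESHOLD:
            alerts.append(label)
    return alerts

# Pretend these are autoencoder features for the 16 tokens of a response.
features = torch.relu(torch.randn(16, 4096)) * 4
print(audit(features))   # e.g. [] or ['deception']
```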
It's imaginatively reminiscent of the sci-fi masterpiece Eternal Sunshine of the Spotless Mind, in which Jim Carrey and Kate Winslet pay a brain-wiping company to erase their memories of each other after a breakup. And, like the movie, it raises the question: can you ever really delete a powerful idea?
The Anthropic team also demonstrated a potential vulnerability of this approach, 'clamping' a concept around scam emails and showing how a powerful enough mental connection to the idea can quickly bypass the Claude model's alignment training, which forbids it from writing such content. This kind of AI brain surgery could really supercharge a model's capacity for bad behavior, allowing it to break through its own guardrails.
Anthropic notes other limitations of the technique, too. “The work has really just begun,” the team writes. “The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive (the computation required by our current approach would vastly exceed the compute used to train the model in the first place).
“Understanding the representations the model uses doesn't tell us how it uses them; even though we have the features, we still need to find the circuits they are involved in. And we need to show that the safety-relevant features we have begun to find can actually be used to improve safety. There's much more to be done.”
This sort of thing could be a hugely valuable tool, in other words, but it's never likely to completely understand the thought processes of a commercial-scale AI. That will offer little comfort to the naysayers, who will point out that where potentially existential outcomes are concerned, even a 99.999 percent success rate won't cut the mustard.
Still, it's an extraordinary development, and a remarkable insight into the way these incredible machines perceive the world. It would be fascinating to see how closely an AI's brain map fits that of a human, if it ever becomes possible to measure such a thing.
OpenAI: Also working on interpretability, but apparently lagging behind
Anthropic is a key player in the modern AI/LLM field, but the juggernaut in the space is still OpenAI, makers of GPT models and certainly the company driving the public conversation around AI the hardest.
In fact, Anthropic was founded in 2021 by a group of former OpenAI employees, with the aim of keeping AI safety and reliability at the top of the priority list as OpenAI partnered with Microsoft and began operating more like a commercial enterprise.
But OpenAI is also working on interpretability, using a similar approach. In research released in early June, the OpenAI interpretability team announced that it had found some 16 million 'thought' patterns in GPT-4, many of which it considers understandable and mappable onto concepts that are meaningful to humans.
The OpenAI team doesn't appear to have ventured into the realms of mind-mapping or mind-editing yet, and it also notes the fundamental challenges of understanding a large AI model as it works. “Currently,” the team writes, “passing GPT-4's activations through the sparse autoencoder results in a performance equivalent to a model trained with roughly 10x less compute. To fully map the concepts in frontier LLMs, we may need to scale to billions or trillions of features, which would be challenging even with our improved scaling techniques.”
So for both companies, it's early days. But humanity now has at least two promising ways to crack open the 'black box' of an AI's neural web and start understanding how it thinks.
The OpenAI research paper is available here.
The Anthropic research paper is available here.
Hear members of Anthropic's interpretability team discuss the research in detail in the video below.
Scaling interpretability
Sources: Anthropic, OpenAI