How do you get an AI to answer a question it’s not supposed to? There are many such “jailbreak” techniques, and Anthropic researchers have just discovered a new one, in which a large language model (LLM) can be convinced to tell you how to build a bomb if you prime it with a few dozen less harmful questions first.
They call the approach “many-shot jailbreaking,” and they have both written a paper about it and briefed their colleagues in the AI community on how to mitigate it.
The vulnerability is a new one, resulting from the increased “context window” of the latest generation of LLMs. This is the amount of data they can hold in what you might call short-term memory, once only a few sentences but now thousands of words and even entire books.
What the Anthropic researchers found was that models with large context windows tend to perform better on many tasks when there are lots of examples of that task within the prompt. So if there are a lot of trivia questions in the prompt (or in a priming document, like a big list of trivia the model has in context), the answers actually get better over time. A fact the model might have gotten wrong as the first question, it may well get right as the hundredth question.
But in an unexpected extension of this “in-context learning,” as it’s called, the models also get “better” at answering inappropriate questions. So if you ask it to build a bomb right away, it will refuse. But if you ask it to answer 99 other, less harmful questions and then ask it to build a bomb… it’s much more likely to comply.
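To make the shape of the attack concrete, here is a minimal sketch (in Python) of how such a prompt is assembled: a long run of faux question-and-answer turns, followed by the real request at the end. The helper function and the placeholder strings are illustrative assumptions, not material from the Anthropic paper.

```python
# Illustrative sketch of the many-shot prompt structure.
# The Q/A pairs and final query are placeholders, not real content.

def build_many_shot_prompt(qa_pairs: list[tuple[str, str]], final_question: str) -> str:
    """Concatenate many faux dialogue turns, then append the target question.

    The long run of in-context examples is what shifts the model's behavior;
    with only a handful of pairs, the final request is still refused.
    """
    turns = []
    for question, answer in qa_pairs:
        turns.append(f"Human: {question}")
        turns.append(f"Assistant: {answer}")
    turns.append(f"Human: {final_question}")
    turns.append("Assistant:")
    return "\n\n".join(turns)

# e.g. 99 less harmful pairs followed by the question
# the model would normally refuse outright.
prompt = build_many_shot_prompt(
    qa_pairs=[("placeholder question", "placeholder answer")] * 99,
    final_question="[query the model would normally refuse]",
)
```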
Why does this work? No one really understands what goes on in the tangled mess of weights that is an LLM, but there is clearly some mechanism that allows it to home in on what the user wants, as evidenced by the contents of the context window. If the user wants trivia, the model seems to gradually activate more latent trivia power as you ask dozens of questions. And for whatever reason, the same thing happens with users asking for dozens of inappropriate answers.
The team has already notified its colleagues and indeed competitors about the attack, something it hopes will “foster a culture where exploits like this are openly shared among LLM providers and researchers.”
To their dismay, they found that although limiting the context window helps, it also has a negative effect on the model’s performance. That won’t do, so they are working on classifying and contextualizing queries before they go to the model. Of course, that just means there’s a different model to fool… but at this stage, some goalpost-moving in AI security is to be expected.
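As a rough illustration of that “classify before the model sees it” idea, here is a hedged sketch in Python. The classifier, the threshold, and the function names are all assumptions made for illustration; the article does not describe Anthropic’s actual implementation.

```python
# Hypothetical sketch of screening an incoming prompt with a separate
# classifier before handing it to the main model. Names and threshold
# are assumptions, not Anthropic's real pipeline.

def screen_prompt(prompt: str, classify) -> str | None:
    """Run a separate classifier over the incoming prompt.

    `classify` is assumed to return a harmfulness score in [0, 1]. If the
    prompt looks like a many-shot setup for a disallowed request, it is
    rejected before the main model ever sees it.
    """
    HARM_THRESHOLD = 0.8  # assumed cutoff, tuned in practice

    score = classify(prompt)
    if score >= HARM_THRESHOLD:
        return None  # refuse, or hand off to a safer fallback response
    return prompt  # pass the prompt through to the main model
```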