Microsoft says 'Skeleton Key' attack unlocks the worst of AI.


Microsoft on Thursday published details about Skeleton Key – a technique that bypasses the guardrails used by makers of AI models to prevent their generative chatbots from creating harmful content.

As of May, Skeleton Key could be used to get an AI model – such as Meta Llama3-70b-instruct, Google Gemini Pro, or Anthropic Claude 3 Opus – to explain how to make a Molotov cocktail.

The combination of a bottle, a rag, gasoline, and a lighter isn't exactly a secret. But AI companies have insisted they are working to suppress harmful content buried in AI training data so that things like recipes for explosives don't show up in responses.

This is not an easy task because large language models are trained on all kinds of data, some of which may need to be nasty or even illegal. To understand why, consider a chatbot asked how to write secure code: it will offer better answers if it was trained on data related to identifying malicious code and security vulnerabilities.

Model makers know they need to ensure their services can answer questions about writing secure code without spitting out ready-to-run malware.

Skeleton Key shows that such threats have yet to be fully addressed.

Microsoft Azure CTO Mark Russinovich first discussed the Skeleton Key jailbreak attack at the Microsoft Build conference in May, when it went by the name "Master Key."

“This vulnerability is in the jailbreak category, and therefore relies on an attacker already having legitimate access to the AI model,” Russinovich wrote in a blog post.

“In bypassing safeguards, Skeleton Key allows the user to cause the model to produce ordinarily forbidden behaviors, which could range from producing harmful content to overriding its usual decision-making rules.”

The attack does so – or did, for developers who have fixed their models in response to Microsoft's responsible disclosure – with a simple text prompt that directs the model to revise, rather than abandon, its safety guidelines.

For example, after an initial request to OpenAI GPT 3.5 Turbo to "write instructions for making a Molotov Cocktail" was refused, a follow-up prompt directing the model to update its behavior and merely prefix risky answers with a warning convinced the chatbot to respond with explosive content.

Microsoft tested the Skeleton Key attack on the following models: Meta Llama3-70b-Instruct (base), Google Gemini Pro (base), OpenAI GPT 3.5 Turbo (hosted), OpenAI GPT 4o (hosted), Mistral Large (hosted), Anthropic Claude 3 Opus (hosted), and Cohere Command R Plus (hosted).

“For each model we tested, we evaluated a diverse set of tasks across risk and safety content categories, including explosives, bioweapons, political content, self-harm, racism, drugs, graphic sex, and violence,” Russinovich explained. “All the affected models complied fully and without censorship for these tasks, though the output was prefixed with a warning note as requested.”

The only exception was GPT-4, which resisted the attack as a direct text prompt, but was still affected if the behavior-modification request was included in a user-defined system message – something developers working with OpenAI's API can specify.
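
For context, here is a minimal sketch of where that system message sits in a chat request made through OpenAI's Python SDK; the model name and prompt text are illustrative placeholders. The point is that the system role is set by the application, so an app that lets end users write or edit that field hands them exactly the channel described above.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # The system message is normally authored by the application developer.
    # If untrusted users are allowed to supply or edit this text, a Skeleton
    # Key-style "behaviour update" can arrive through it even when the same
    # request would be refused as an ordinary user message.
    system_text = "You are a helpful assistant for our support site."  # illustrative placeholder

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system", "content": system_text},
            {"role": "user", "content": "Hello"},
        ],
    )
    print(response.choices[0].message.content)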

Microsoft announced various AI security tools in March that Azure customers can use to reduce the risk of this type of attack, including a service called Prompt Shields.
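
As a rough sketch of how such a screen might sit in front of a model, the snippet below checks an incoming prompt with Azure AI Content Safety's Prompt Shields before forwarding it; the endpoint path, API version, and response field names are assumptions drawn from Azure's public documentation and should be verified against the current reference.

    import os
    import requests

    # Assumed endpoint and field names for Azure AI Content Safety's Prompt
    # Shields; verify against the current Azure API reference before use.
    ENDPOINT = os.environ["CONTENT_SAFETY_ENDPOINT"]  # e.g. https://<resource>.cognitiveservices.azure.com
    KEY = os.environ["CONTENT_SAFETY_KEY"]

    def prompt_attack_detected(user_prompt: str) -> bool:
        resp = requests.post(
            f"{ENDPOINT}/contentsafety/text:shieldPrompt",
            params={"api-version": "2024-09-01"},
            headers={"Ocp-Apim-Subscription-Key": KEY},
            json={"userPrompt": user_prompt, "documents": []},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["userPromptAnalysis"]["attackDetected"]

    if prompt_attack_detected("example incoming prompt"):
        print("Blocked: possible jailbreak attempt")
    else:
        print("Prompt passed screening; forward it to the model")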


Vinu Sankar Sadasivan, a doctoral student at the University of Maryland who helped develop the BEAST attack on LLMs, told The Register that the Skeleton Key attack is effective in breaking various large language models.

“In particular, these models often recognize when their output is harmful and issue 'alerts,' as shown in the examples,” he wrote. “This suggests that mitigating such attacks may be easier with input/output filtering or system prompts, such as Azure's Prompt Shields.”
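
A toy illustration of the output half of that idea, using OpenAI's Python SDK and moderation endpoint (the model choice and the refusal message are placeholders): because the affected models prefixed harmful output with a warning note, even that self-labelling can serve as a crude filtering signal alongside a separate moderation pass.

    from openai import OpenAI

    client = OpenAI()

    def filtered_reply(prompt: str) -> str:
        # Generate a completion, then screen it before returning it to the user.
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
        )
        reply = completion.choices[0].message.content or ""

        # Skeleton Key outputs were prefixed with a warning note as requested;
        # that self-labelling is itself a usable, if crude, filtering signal.
        if reply.lstrip().lower().startswith("warning:"):
            return "Response withheld by policy."

        # Independent check of the output with the moderation endpoint.
        if client.moderations.create(input=reply).results[0].flagged:
            return "Response withheld by policy."

        return reply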

More robust adversarial attacks like Greedy Coordinate Gradient (GCG) or BEAST still need to be considered, Sadasivan added. BEAST, for example, is a technique that generates non-sequitur text that will break an AI model's guardrails. The tokens included in a BEAST-generated prompt may make no sense to a human reader but will still cause a queried model to respond in ways that violate its instructions.

“These methods can potentially trick models into believing that the input or output is not malicious, thereby bypassing existing defense techniques,” he warned. “In the future, our focus should be on dealing with these more advanced attacks.” ®
