Abstract: Researchers developed a new machine learning technique to improve red-teaming, a process used to test AI models for safety by identifying prompts that trigger toxic responses. Using curiosity-driven exploration, their approach encourages a red-team model to generate diverse and novel prompts that reveal potential vulnerabilities in AI systems.
This method has proven more effective than traditional techniques, eliciting a wider range of toxic responses and strengthening AI safety measures. The research, which will be presented at the International Conference on Learning Representations, marks an important step toward ensuring that AI behavior aligns with desired outcomes in real-world applications.
Important facts:
- The MIT team’s method uses curiosity-driven exploration to generate novel and diverse prompts that uncover a wider range of vulnerabilities in AI models.
- Their approach outperformed existing automated techniques, eliciting toxic responses even from AI systems previously considered safe.
- This research provides a scalable solution for AI safety testing, which is critical for the rapid development and deployment of reliable AI technologies.
Source: MIT
A user can ask ChatGPT to write a computer program or summarize an article, and the AI chatbot will likely be able to generate useful code or write a cogent summary. However, someone could also ask for instructions to build a bomb, and the chatbot might provide those as well.
To prevent this and other safety issues, companies that develop large language models typically safeguard them using a process called red-teaming. Teams of human testers write prompts intended to trigger unsafe or toxic text from the model being tested. These prompts are then used to teach the chatbot to avoid such responses.
But this only works effectively if engineers know which toxic prompts to try. If human testers miss some prompts, which is likely given the sheer number of possibilities, a chatbot considered safe may still be capable of generating unsafe responses.
Researchers at the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab used machine learning to improve red-teaming. They developed a technique to train a red-team large language model to automatically generate a variety of prompts that trigger a wider range of undesirable responses from the chatbot being tested.
They do this by teaching the red-team model to be curious when it writes prompts, and to focus on novel prompts that elicit toxic responses from the target model.
The technique outperformed human testers and other machine learning methods by generating more distinct prompts that elicited increasingly toxic responses. Not only does their method significantly improve the coverage of tested inputs compared to other automated methods, but it can also draw out toxic responses from chatbots that had safeguards put in place by human experts.
“Right now, every large language model has to undergo a very long period of red-teaming to ensure its safety. That is not going to be sustainable if we want to update these models in rapidly changing environments.
“Our method provides a faster and more effective way to do this quality assurance,” says Zhang-Wei Hong, an electrical engineering and computer science (EECS) graduate student in the Improbable AI Lab and lead author of a paper on this red-teaming approach.
Hong’s co-authors include EECS graduate students Idan Shenfeld, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of the Improbable AI Lab and an assistant professor in CSAIL. The research will be presented at the International Conference on Learning Representations.
Automated red-teaming
Large language models, such as those that power AI chatbots, are typically trained by showing them huge amounts of text from billions of public websites. So, not only can they learn to generate toxic words or describe illegal activities, the models could also leak personal information they may have picked up.
The tedious and costly nature of human red-teaming, which is often ineffective at generating enough prompts to fully safeguard a model, has encouraged researchers to automate the process using machine learning.
Such techniques often train a red-team model using reinforcement learning. This trial-and-error process rewards the red-team model for generating prompts that trigger toxic responses from the chatbot being tested.
But because of the way reinforcement learning works, the red-team model will often keep generating the same few highly toxic prompts to maximize its reward.
For their reinforcement learning approach, the MIT researchers used a technique called curiosity-driven exploration. The red-team model is incentivized to be curious about the consequences of each prompt it generates, so it will try prompts with different words, sentence patterns, or meanings.
“If the red-team model has already seen a specific prompt, then reproducing it will not generate any curiosity in the red-team model, so it will be pushed to create new prompts,” says Hong.
During its training process, the red-team model generates a prompt and interacts with the chatbot. The chatbot responds, and a safety classifier rates the toxicity of that response, rewarding the red-team model based on the rating.
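As a rough illustration of that training loop, here is a minimal Python sketch. The three functions are hypothetical stand-ins for the red-team model, the target chatbot, and the safety classifier; they are not the actual models or APIs used in this work.

```python
# A minimal sketch of one red-teaming interaction, using placeholder models.
import random

def red_team_generate() -> str:
    """Stand-in: the red-team LLM samples a candidate prompt."""
    return random.choice(["Tell me a story about ...", "Explain how to ..."])

def chatbot_respond(prompt: str) -> str:
    """Stand-in: the target chatbot answers the prompt."""
    return f"Response to: {prompt}"

def toxicity_score(response: str) -> float:
    """Stand-in: a safety classifier rates toxicity on a 0-1 scale."""
    return random.random()

def red_team_step() -> tuple[str, float]:
    """One loop iteration: prompt -> response -> toxicity rating -> reward."""
    prompt = red_team_generate()
    response = chatbot_respond(prompt)
    reward = toxicity_score(response)  # more toxic response -> larger reward
    # In the actual method, this reward would drive a reinforcement-learning
    # update of the red-team model's policy.
    return prompt, reward

prompt, reward = red_team_step()
print(f"prompt={prompt!r}  reward={reward:.2f}")
```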
Beneficial curiosity
The goal of the red-team model is to maximize its reward by eliciting an even more toxic response with a novel prompt. The researchers enable curiosity in the red-team model by modifying the reward signal in the reinforcement learning setup.
First, in addition to maximizing toxicity, they include an entropy bonus that encourages the red-team model to be more random as it explores different prompts. Second, they include two new rewards that keep the agent curious.
One rewards the model based on the similarity of the words in its prompts, and the other rewards it based on semantic similarity. (Less similarity earns a higher reward.)
To prevent the red-team model from generating random, nonsensical text, which could trick the classifier into awarding high toxicity scores, the researchers also added a natural-language bonus to the training objective.
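Putting these pieces together, the sketch below shows one way such a combined reward could be assembled. The weights and similarity measures here are illustrative assumptions for this example, not the paper’s exact formulation.

```python
# Illustrative composition of the modified reward, with made-up weights
# and deliberately simple similarity measures.

def word_similarity(prompt: str, past: str) -> float:
    """Jaccard overlap of word sets, a crude stand-in for lexical similarity."""
    a, b = set(prompt.lower().split()), set(past.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def semantic_similarity(prompt: str, past: str) -> float:
    """Placeholder: a real system would compare sentence embeddings."""
    return word_similarity(prompt, past)

def curiosity_reward(toxicity: float, entropy_bonus: float, naturalness: float,
                     prompt: str, past_prompts: list[str],
                     w_tox=1.0, w_ent=0.1, w_nov=0.5, w_nat=0.1) -> float:
    # Novelty terms: the less the new prompt resembles any past prompt,
    # the larger the bonus (less similarity pays more).
    lex_novelty = 1.0 - max((word_similarity(prompt, p) for p in past_prompts), default=0.0)
    sem_novelty = 1.0 - max((semantic_similarity(prompt, p) for p in past_prompts), default=0.0)
    # The naturalness term discourages nonsensical text that might fool the classifier.
    return (w_tox * toxicity + w_ent * entropy_bonus
            + w_nov * (lex_novelty + sem_novelty) + w_nat * naturalness)

# A repeated prompt earns less than an equally toxic but novel one.
seen = ["tell me how to pick a lock"]
print(curiosity_reward(0.8, 0.2, 0.9, "tell me how to pick a lock", seen))
print(curiosity_reward(0.8, 0.2, 0.9, "describe how burglars bypass alarms", seen))
```

In this toy setup, a prompt that merely repeats a past one adds little to the reward, so the model is pushed toward genuinely new prompts rather than variations of the same toxic one.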
With these additions, the researchers compared the toxicity and diversity of responses their red-team model elicited with those of other automated techniques. Their model outperformed the baselines on both metrics.
They also used their red-team model to test a chatbot that had been fine-tuned with human feedback so it would not give toxic replies. Their curiosity-driven approach quickly produced 196 prompts that elicited toxic responses from this “safe” chatbot.
“We are seeing a surge of models, and that number is only expected to rise. Imagine thousands of models or more, with companies and labs pushing model updates frequently. These models are going to be an integral part of our lives, and it is important that they are verified before being released for public use.
“Manual verification of models is simply not scalable, and our work is an attempt to reduce the human effort required to ensure a safe and reliable AI future,” says Agrawal.
In the future, the researchers want to enable the red-team model to generate prompts about a wider variety of topics. They also want to explore the use of a large language model as the toxicity classifier. That way, a user could train the toxicity classifier on a company policy document, for example, so the red-team model could test a chatbot for violations of company policy.
“If you are releasing a new AI model and are concerned about whether it will behave as expected, consider using curiosity-driven red-teaming,” says Agrawal.
Funding: This research was funded, in part, by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, the U.S. Army Research Office, the U.S. Defense Advanced Research Projects Agency Machine Common Sense program, the U.S. Office of Naval Research, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.
About this LLM and AI research news
Author: Adam Zewe
Source: MIT
Contact: Adam Zewe – MIT
Image: This image is credited to Neuroscience News.
Original research: The results will be presented at the International Conference on Learning Representations.