


Training Diffusion Models with Reinforcement Learning

Diffusion models have recently emerged as the de facto standard for generating complex, high-dimensional outputs. You may know them for their ability to produce stunning AI art and hyper-realistic synthetic images, but they have also found success in other applications such as drug design and continuous control. The key idea behind diffusion models is to iteratively transform random noise into a sample, such as an image or a protein structure. This is typically motivated as a maximum likelihood estimation problem, where the model is trained to generate samples that match the training data as closely as possible.

However, most use cases of diffusion models are not directly concerned with matching the training data, but instead with a downstream objective. We don't just want an image that looks like existing images, but one that has a specific type of appearance; we don't just want a drug molecule that is physically plausible, but one that is as effective as possible. In this post, we show how diffusion models can be trained on these downstream objectives directly using reinforcement learning (RL). To do this, we finetune Stable Diffusion on a variety of objectives, including image compressibility, human-perceived aesthetic quality, and prompt-image alignment. The last of these objectives uses feedback from a large vision-language model to improve the model's performance on unusual prompts, demonstrating how powerful AI models can be used to improve each other without any humans in the loop.

A diagram illustrating the prompt-image alignment (RLAIF) objective, which uses LLaVA, a large vision-language model, to evaluate generated images.

Denoising Diffusion Policy Optimization

When turning diffusion into an RL problem, we make only the most basic assumption: given a sample (e.g. an image), we have access to a reward function that we can evaluate to tell us how "good" that sample is. Our goal is for the diffusion model to generate samples that maximize this reward function.

Diffusion models are typically trained using a loss function derived from maximum likelihood estimation (MLE), meaning they are encouraged to generate samples that make the training data look more likely. In the RL setting, we no longer have training data, only samples from the diffusion model and their associated rewards. One way we can still use the same MLE-motivated loss function is to treat the samples as training data and incorporate the rewards by weighting the loss for each sample by its reward. This gives us an algorithm that we call reward-weighted regression (RWR), after existing algorithms from the RL literature.
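
To make this concrete, here is a minimal sketch of a reward-weighted loss, assuming a diffusers-style unet and noise scheduler and a batch of sampled latents with their rewards; the exponential (softmax) weighting and all hyperparameters are illustrative assumptions, not the exact recipe from the paper.

    import torch
    import torch.nn.functional as F

    def rwr_loss(unet, scheduler, latents, prompt_embeds, rewards, beta=1.0):
        """Sketch of reward-weighted regression: the standard denoising loss,
        with each sample's loss weighted by an exponentiated version of its reward."""
        noise = torch.randn_like(latents)
        timesteps = torch.randint(
            0, scheduler.config.num_train_timesteps,
            (latents.shape[0],), device=latents.device,
        )
        noisy_latents = scheduler.add_noise(latents, noise, timesteps)
        noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=prompt_embeds).sample
        # Per-sample denoising (MLE-style) loss.
        per_sample_loss = F.mse_loss(noise_pred, noise, reduction="none").mean(dim=(1, 2, 3))
        # Nonnegative weights derived from the rewards (assumption: softmax weighting).
        weights = torch.softmax(beta * rewards, dim=0)
        return (weights * per_sample_loss).sum()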

However, there are a few problems with this approach. One is that RWR is not a particularly exact algorithm: it maximizes the reward only approximately (see Nair et al., Appendix A). The MLE-inspired loss for diffusion is also not exact, and is instead derived using a variational bound on the true likelihood of each sample. This means that RWR maximizes the reward through two levels of approximation, which we find significantly hurts its performance.
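
In symbols (our notation, not taken verbatim from the paper), the practical RWR objective weights the variational bound rather than the true log-likelihood, and the reward weighting itself only approximately maximizes the expected reward:

    J_{\mathrm{RWR}}(\theta) = \mathbb{E}_{x_0}\!\left[\, w\big(r(x_0)\big)\,\mathrm{ELBO}_\theta(x_0) \,\right],
    \qquad \mathrm{ELBO}_\theta(x_0) \le \log p_\theta(x_0),

where w(·) is a nonnegative weighting of the reward (e.g. exponential).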

Chart comparing DDPO to RWR

We evaluate two variants of DDPO and two variants of RWR on three reward functions and find that DDPO consistently achieves the best performance.

The key insight of our algorithm, which we call Denoising Diffusion Policy Optimization (DDPO), is that we can better maximize the reward of the final sample if we pay attention to the entire sequence of denoising steps that got us there. To do this, we reframe the diffusion process as a multi-step Markov decision process (MDP). In MDP terminology: each denoising step is an action, and the agent only gets a reward on the final step of each denoising trajectory, when the final sample is produced. This framework allows us to apply many powerful algorithms from the RL literature that are designed specifically for multi-step MDPs. Instead of using the approximate likelihood of the final sample, these algorithms use the exact likelihood of each denoising step, which is extremely easy to compute.
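
A rough sketch of this MDP view is below, with hypothetical sample_step and reward_fn interfaces standing in for the denoising sampler and the reward model (these are not real library calls).

    def collect_trajectory(sample_step, reward_fn, x_T, context, timesteps):
        """Denoising as a multi-step MDP (illustrative interfaces).
        State  : (current noisy latent x_t, prompt context, timestep t)
        Action : the next latent x_{t-1}, drawn from the Gaussian denoising distribution
        Reward : r(x_0, context) on the final step only; zero everywhere else."""
        trajectory, x = [], x_T
        for t in timesteps:                                # e.g. T, T-1, ..., 1
            x_prev, log_prob = sample_step(x, t, context)  # exact per-step Gaussian log-likelihood
            trajectory.append({"state": (x, t), "action": x_prev, "log_prob": log_prob})
            x = x_prev
        return trajectory, reward_fn(x, context)           # reward only for the final sample x_0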

We chose to apply policy gradient algorithms due to their ease of implementation and past success in language model finetuning. This led to two variants of DDPO: DDPO_SF, which uses the simple score function estimator of the policy gradient, also known as REINFORCE; and DDPO_IS, which uses a more powerful importance sampled estimator. DDPO_IS is our best-performing algorithm, and its implementation closely follows that of proximal policy optimization (PPO).
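
Here is a minimal sketch of the importance-sampled (PPO-style) update, assuming the per-step log-likelihoods were saved when the trajectories were collected; the clip range and the way advantages are computed are illustrative assumptions.

    import torch

    def ddpo_is_loss(new_log_probs, old_log_probs, advantages, clip_range=1e-4):
        """Clipped importance-sampling policy gradient over denoising steps.
        new_log_probs, old_log_probs : (batch, num_steps) exact Gaussian log-likelihoods
        advantages                   : (batch,) e.g. whitened final-sample rewards"""
        ratio = torch.exp(new_log_probs - old_log_probs)   # per-step importance weights
        adv = advantages.unsqueeze(1)                      # same advantage at every denoising step
        unclipped = -adv * ratio
        clipped = -adv * torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)
        return torch.maximum(unclipped, clipped).mean()    # minimizing this maximizes the clipped objective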

Finetuning Stable Diffusion Using DDPO

For our main results, we finetune Stable Diffusion v1-4 using DDPO_IS. We have four tasks, each defined by a different reward function:

  • Compressibility: How easy is the image to compress using the JPEG algorithm? The reward is the negative file size (in kB) of the image when saved as a JPEG (a sketch of this reward follows the list below).
  • Incompressibility: How hard is the image to compress using the JPEG algorithm? The reward is the positive file size (in kB) of the image when saved as a JPEG.
  • Aesthetic Quality: How aesthetically pleasing is the image to the human eye? The reward is the output of the LAION aesthetic predictor, which is a neural network trained on human preferences.
  • Prompt-Image Alignment: How well does the image represent what was asked for in the prompt? This one is a bit more complicated: we feed the image into LLaVA, ask it to describe the image, and then compute the similarity between that description and the original prompt using BERTScore.
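
As an illustration of the first two rewards, here is a hedged sketch of a JPEG-based (in)compressibility reward using Pillow; the exact JPEG settings used in the paper may differ.

    import io
    from PIL import Image

    def jpeg_compressibility_reward(image: Image.Image) -> float:
        """Compressibility: reward is the negative JPEG file size in kB."""
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG", quality=95)  # quality setting is an assumption
        return -buffer.tell() / 1000.0

    def jpeg_incompressibility_reward(image: Image.Image) -> float:
        """Incompressibility: reward is the positive JPEG file size in kB."""
        return -jpeg_compressibility_reward(image)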

Since Stable Diffusion is a text-to-image model, we also need to pick a set of prompts to give it during finetuning. For the first three tasks, we use simple prompts of the form "a(n) [animal]". For prompt-image alignment, we use prompts of the form "a(n) [animal] [activity]", where the activities are "washing dishes", "playing chess", and "riding a bike". We found that Stable Diffusion often struggled to produce images that matched the prompt for these unusual scenarios, leaving plenty of room for improvement with RL finetuning.
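
A small sketch of how such prompts could be sampled; the animal names here are illustrative placeholders (the paper draws from a list of 45 common animals).

    import random

    ANIMALS = ["cat", "dog", "horse", "bear", "rabbit", "dolphin"]  # illustrative subset
    ACTIVITIES = ["washing dishes", "playing chess", "riding a bike"]

    def sample_prompt(with_activity: bool = False) -> str:
        """Build prompts of the form 'a(n) [animal]' or 'a(n) [animal] [activity]'."""
        animal = random.choice(ANIMALS)
        article = "an" if animal[0] in "aeiou" else "a"
        prompt = f"{article} {animal}"
        if with_activity:
            prompt += f" {random.choice(ACTIVITIES)}"
        return prompt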

First, we illustrate the performance of DDPO on the simple rewards (compressibility, incompressibility, and aesthetic quality). All of the images are generated with the same random seed. In the top left quadrant, we show what "vanilla" Stable Diffusion generates for nine different animals; all of the RL-finetuned models show a clear qualitative difference. Interestingly, the aesthetic quality model (top right) tends towards minimalist black-and-white line drawings, revealing the kinds of images that the LAION aesthetic predictor considers "more aesthetic".

Results on aesthetics, compressibility, and incompressibility

Next, we demonstrate DDPO on the more complex prompt-image alignment task. Here, we show several snapshots from the training process: each series of three images shows samples for the same prompt and random seed over time, with the first sample coming from vanilla Stable Diffusion. Interestingly, the model shifts towards a more cartoonish style, which was not intentional. We hypothesize that this is because animals doing human-like activities are more likely to appear in a cartoon-like style in the pretraining data, so the model shifts towards this style to align with the prompt more easily by leveraging what it already knows.

Prompt-image alignment results

Unexpected Generalization

Surprising generalization has been found to arise when finetuning large language models with RL: for example, models finetuned on instruction-following in English only often improve in other languages. We find that the same phenomenon occurs with text-to-image diffusion models. For example, our aesthetic quality model was finetuned using prompts selected from a list of 45 common animals. We find that it generalizes not only to unseen animals but also to everyday objects.

Generalization of the aesthetic quality model

Our prompt-image alignment model used the same list of 45 common animals during training, and only three activities. We find that it generalizes not only to unseen animals but also to unseen activities, and even to novel combinations of the two.

Generalization of the prompt-image alignment model

Overoptimization

It is well known that finetuning on a reward function, especially a learned one, can lead to reward overoptimization, where the model exploits the reward function to achieve a high reward in a way that is not actually useful. Our setting is no exception: in all the tasks, the model eventually destroys any meaningful image content to maximize reward.

Overoptimization of reward functions

We also discovered that LLaVA is susceptible to typographic attacks: when optimizing for alignment with respect to prompts of the form "[n] animals", DDPO was able to successfully fool LLaVA by instead generating text loosely resembling the correct number.

Exploiting LLaVA on the counting task

There is currently no general-purpose method for preventing overoptimization, and we highlight this problem as an important area for future work.

Conclusion

Diffusion models are hard to beat when it comes to producing complex, high-dimensional outputs. However, so far they have mostly been successful in applications where the goal is to learn patterns from lots and lots of data (for example, image-caption pairs). What we have found is a way to effectively train diffusion models that goes beyond pattern-matching, and without necessarily requiring any training data. The possibilities are limited only by the quality and creativity of your reward function.

The way we used DDPO in this work is inspired by the recent successes of language model finetuning. OpenAI's GPT models, like Stable Diffusion, are first trained on huge amounts of Internet data; they are then finetuned with RL to produce useful tools like ChatGPT. Typically, their reward function is learned from human preferences, but others have more recently figured out how to produce powerful chatbots using reward functions based on AI feedback instead. Compared to the chatbot regime, our experiments are small-scale and limited in scope. But considering the enormous success of this "pretrain + finetune" paradigm in language modeling, it certainly seems worth pursuing further in the world of diffusion models. We hope that others can build on our work to improve large diffusion models, not just for text-to-image generation, but for many exciting applications such as video generation, music generation, image editing, protein synthesis, robotics, and more.

Furthermore, the "pretrain + finetune" paradigm is not the only way to use DDPO. As long as you have a good reward function, there is nothing stopping you from training with RL from the start. While this setting is as yet unexplored, this is a place where the strengths of DDPO could really shine. Pure RL has long been applied to a wide variety of domains, ranging from playing games to robotic manipulation to nuclear fusion to chip design. Adding the powerful expressivity of diffusion models to the mix has the potential to take existing applications of RL to the next level, or even to discover new ones.


This post is based on the following paper:

  • Training Diffusion Models with Reinforcement Learning
    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, Sergey Levine
    arXiv preprint arXiv:2305.13301, 2023

If you want to learn more about DDPO, you can check out the paper, website, original code, or get the model weights on Hugging Face. If you want to use DDPO in your own project, check out my PyTorch + LoRA implementation, where you can finetune Stable Diffusion with less than 10GB of GPU memory!

If DDPO informs your work, please cite it with:

@misc{black2023ddpo,
      title={Training Diffusion Models with Reinforcement Learning}, 
      author={Kevin Black and Michael Janner and Yilun Du and Ilya Kostrikov and Sergey Levine},
      year={2023},
      eprint={2305.13301},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}


