TL;DR: In RLHF, there is a tension between the reward learning phase, which uses human preferences in the form of comparisons, and the RL fine-tuning phase, which optimizes a single, non-comparable reward. What if we performed RL in a comparative fashion?
Figure 1:
This diagram illustrates the difference between reinforcement learning from absolute feedback and reinforcement learning from relative feedback. By adding a new component, the pairwise policy gradient, we can merge the reward modeling stage and the RL stage, enabling direct updates based on pairwise responses.
Large language models (LLMs) have powered increasingly capable virtual assistants such as GPT-4, Claude-2, Bard, and Bing Chat. These systems can answer complex user questions, write code, and even produce poetry. The technique underlying these remarkable virtual assistants is Reinforcement Learning with Human Feedback (RLHF). RLHF aims to align the model with human values and eliminate unintended behaviors, which often arise from the model's exposure to a large amount of low-quality data during pre-training.
Proximal Policy Optimization (PPO), the dominant RL optimizer in this process, has been reported to exhibit instability and implementation complications. More importantly, there is a persistent discrepancy in the RLHF process: although the reward model is trained with comparisons between different responses, the RL fine-tuning stage works on individual responses without making any comparisons. This inconsistency can exacerbate problems, especially in the challenging domain of language generation.
Given this background, an intriguing question arises: Is it possible to design an RL algorithm that learns in a comparative fashion? To explore this, we propose Pairwise Proximal Policy Optimization (P3O), a method that harmonizes the training processes of the reward learning stage and the RL fine-tuning stage of RLHF and offers a satisfactory solution to this problem.
Background
Figure 2:
Illustration of the three stages of RLHF from an OpenAI blog post. Note that the third stage falls under reinforcement learning with absolute feedback, as shown on the left side of Figure 1.
In traditional RL settings, the reward is specified manually by the designer or provided by a well-defined reward function, as in Atari games. However, in order to steer a model toward responses that are helpful and harmless, defining a good reward is not straightforward. RLHF addresses this problem by learning a reward function from human feedback, particularly in the form of comparisons, and then applying RL to optimize the learned reward.
The RLHF pipeline is divided into several phases, which are detailed below:
Supervised fine-tuning (SFT) stage: A pre-trained model is fine-tuned with the maximum likelihood loss on a high-quality dataset, where it learns to respond to human queries by imitation.
Reward modeling phase: The SFT model is prompted with prompts \(x\) to produce pairs of responses \(y_1, y_2 \sim \pi^{\text{SFT}}(y\vert x)\). These generated responses form a dataset. The response pairs are presented to human labelers, who express a preference for one response over the other, denoted as \(y_w \succ y_l\). A comparative loss is then used to train the reward model \(r_\phi\):
\[\mathcal{L}_R = -\mathbb{E}_{(x,y_l,y_w)\sim\mathcal{D}}\log \sigma\left(r_\phi(y_w\vert x)-r_\phi(y_l\vert x)\right)\]
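As a concrete illustration, here is a minimal sketch of this comparative loss, assuming PyTorch; the names `r_w` and `r_l` are hypothetical and denote the scalar rewards assigned to the preferred and dispreferred responses in a batch.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood of the Bradley-Terry preference model:
    # -log sigma(r_phi(y_w|x) - r_phi(y_l|x)), averaged over the batch.
    return -F.logsigmoid(r_w - r_l).mean()
```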
RL fine-tuning stage: The SFT model serves as the starting point for this step, and an RL algorithm optimizes the policy to maximize the reward while limiting deviations from the initial policy. Formally, this is done by:
\[\max_{\pi_\theta}\mathbb{E}_{x\sim \mathcal{D},\, y\sim \pi_\theta(\cdot\vert x)}\left[r_\phi(y\vert x)-\beta D_{\text{KL}}\left(\pi_\theta(\cdot\vert x)\,\Vert\, \pi^{\text{SFT}}(\cdot\vert x)\right)\right]\]
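For intuition, here is a minimal sketch of the per-sample objective, assuming PyTorch; `logp_policy` and `logp_sft` are illustrative names for the sequence log-probabilities under the current policy and the SFT reference, and the single-sample estimate \(\log\pi_\theta(y\vert x)-\log\pi^{\text{SFT}}(y\vert x)\) stands in for the KL term.

```python
import torch

def kl_regularized_reward(reward: torch.Tensor,
                          logp_policy: torch.Tensor,
                          logp_sft: torch.Tensor,
                          beta: float = 0.1) -> torch.Tensor:
    # Single-sample estimate of the KL penalty for y ~ pi_theta(.|x).
    kl_estimate = logp_policy - logp_sft
    # Quantity maximized in the RL fine-tuning stage, per sample.
    return reward - beta * kl_estimate
```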
An inherent challenge with this approach is the non-uniqueness of the reward. For instance, given a reward function \(r(y\vert x)\), a simple prompt-dependent shift turns \(r(y\vert x)+\delta(x)\) into another valid reward function. These two reward functions yield the same loss for any pair of responses, but they lead to very different behavior when optimized with RL. In an extreme case, if the added noise \(\delta(x)\) spans a large range, an RL algorithm may be misled into increasing the probability of responses with high rewards, even though those rewards are not meaningful. In other words, the policy may be disrupted by the reward-scale information in the prompt \(x\), while failing to learn the useful part, the relative preference represented by the reward gap. To address this problem, we aim to develop an RL algorithm that is invariant to reward translation.
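Concretely, a prompt-dependent shift \(\delta(x)\) cancels inside the comparative loss:
\[\sigma\left(\left(r_\phi(y_w\vert x)+\delta(x)\right)-\left(r_\phi(y_l\vert x)+\delta(x)\right)\right)=\sigma\left(r_\phi(y_w\vert x)-r_\phi(y_l\vert x)\right),\]
so the two reward functions are indistinguishable to the reward learning objective, yet they can drive very different policy updates when treated as absolute rewards.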
Derivation of P3O
Our idea stems from the vanilla policy gradient (VPG). VPG is a widely adopted first-order RL optimizer, favored for its simplicity and ease of implementation. In the contextual bandit (CB) setting, VPG is formulated as:
\[\nabla \mathcal{L}^{\text{VPG}} = \mathbb{E}_{y\sim\pi_{\theta}}\, r(y\vert x)\nabla\log\pi_{\theta}(y\vert x)\]
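As a reference point, here is a minimal sketch of a surrogate loss whose gradient matches the VPG expression above, assuming PyTorch; `logp` and `rewards` are illustrative names for \(\log\pi_\theta(y\vert x)\) and \(r(y\vert x)\) over a batch of sampled responses.

```python
import torch

def vpg_loss(logp: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # REINFORCE-style surrogate, negated for a minimizer; rewards are treated
    # as constants so the gradient is E[ r(y|x) * grad log pi_theta(y|x) ].
    return -(rewards.detach() * logp).mean()
```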
Through some algebraic manipulation, we can rewrite the policy gradient in a comparative form that involves two responses to the same prompt. We name it the Pairwise Policy Gradient (PPG):
\[\mathbb{E}_{y_1,y_2\sim\pi_{\theta}}\left(r(y_1\vert x)-r(y_2\vert x)\right)\nabla\left(\log\frac{\pi_\theta(y_1\vert x)}{\pi_\theta(y_2\vert x)}\right)/2\]
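Under the same assumptions, here is a minimal sketch of the pairwise counterpart; `logp_1`, `logp_2`, `r_1`, and `r_2` are illustrative names for the log-probabilities and rewards of the two responses to each prompt.

```python
import torch

def ppg_loss(logp_1: torch.Tensor, logp_2: torch.Tensor,
             r_1: torch.Tensor, r_2: torch.Tensor) -> torch.Tensor:
    # Only the reward gap r(y1|x) - r(y2|x) enters the update, so a
    # prompt-dependent shift delta(x) cancels out.
    reward_gap = (r_1 - r_2).detach()
    return -(reward_gap * (logp_1 - logp_2) / 2).mean()
```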
Unlike VPG, which relies directly on the absolute reward magnitude, PPG uses the reward difference. This allows us to bypass the aforementioned problem of reward translation. To further boost performance, we incorporate a replay buffer using importance sampling and avoid excessively large gradient updates using clipping.
Importance sampling: We sample a batch of responses from the replay buffer, which consists of responses generated by \(\pi_{\text{old}}\), and then compute the importance sampling ratio for each response pair. The gradient is the weighted sum of the gradients computed from each response pair.
Clipping: We clip the importance sampling ratio as well as the gradient update to penalize excessively large updates. This technique enables the algorithm to trade off KL divergence and reward more efficiently.
There are two different ways to apply the clipping technique, distinguished by separate or joint clipping. The resulting algorithm is named Pairwise Proximal Policy Optimization (P3O), with variants V1 and V2 respectively. You can find more details in our original paper.
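To make the idea concrete, here is a hypothetical sketch of one way to combine the pairwise ratio with PPO-style clipping, assuming PyTorch; it only illustrates the joint-clipping flavor, and the exact V1/V2 objectives are the ones given in the paper. All function and variable names are illustrative.

```python
import torch

def clipped_pairwise_loss(logp_new_1, logp_new_2, logp_old_1, logp_old_2,
                          r_1, r_2, eps: float = 0.2) -> torch.Tensor:
    # Joint importance ratio of the response pair under the new vs. old policy.
    log_ratio = (logp_new_1 - logp_old_1) - (logp_new_2 - logp_old_2)
    ratio = torch.exp(log_ratio)
    reward_gap = (r_1 - r_2).detach()
    # PPO-style pessimistic combination of the unclipped and clipped terms.
    unclipped = ratio * reward_gap
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * reward_gap
    return -torch.minimum(unclipped, clipped).mean() / 2
```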
Evaluation
Figure 3:
KL-Reward frontier for TL;DR. Both the sequence-level KL and the reward are averaged over 200 test prompts and computed every 500 gradient steps. We find that a simple linear function fits the curve well. P3O has the best KL-Reward trade-off among the three.
We explore two different open-ended text generation tasks, summarization and question answering. For summarization, we use the TL;DR dataset, where the prompt \(x\) is a forum post from Reddit and \(y\) is the corresponding summary. For question answering, we use Anthropic Helpful and Harmless (HH); the prompt \(x\) is a human query on a variety of topics, and the policy should learn to produce an engaging and helpful response \(y\).
We compare our algorithm P3O with several effective and representative approaches for LLM alignment. We start with the SFT policy trained by maximum likelihood. For RL algorithms, we consider the dominant approach PPO and the newly proposed DPO. DPO directly optimizes the policy toward the closed-form solution of the KL-constrained RL problem. Although it was proposed as an offline alignment method, we make it online with the help of a proxy reward function.
Figure 4:
KL-Reward frontier for HH. Each point represents an average over 280 test prompts and is computed every 500 gradient updates. The two figures on the left compare P3O-V1 and PPO with different base model sizes. The two figures on the right compare P3O-V2 and DPO. The results show that P3O can not only achieve higher reward but also maintain better KL control.
As pointed out in previous works, deviating too far from the reference policy would cause the online policy to exploit flaws in the reward model and produce incoherent continuations. We are interested not only in the well-established metric in the RL literature, the reward, but also in how far the learned policy deviates from the initial policy, as measured by KL-divergence. Therefore, we investigate the effectiveness of each algorithm by the frontier of achieved reward and KL-divergence from the reference policy (the KL-Reward frontier). In Figure 3 and Figure 4, we find that P3O has strictly dominant frontiers compared to PPO and DPO across various model sizes.
Figure 5:
The left figure shows the win rate evaluated by GPT-4. The right figure presents the win rate based on a direct comparison of the proxy reward. Despite the high correlation between the two figures, we found that the reward win rate must be adjusted according to the KL-divergence in order to be consistent with the GPT-4 win rate.
To directly evaluate the quality of the generated responses, we also perform head-to-head comparisons between every pair of algorithms on the HH dataset. We use two metrics for evaluation: (1) Reward, the optimization target during online RL, and (2) GPT-4, as a faithful proxy for human evaluation of response helpfulness. For the latter metric, we note that previous studies have shown that GPT-4 judgments correlate strongly with humans, and human agreement with GPT-4 is typically equal to or higher than inter-annotator agreement among humans.
Figure 5 presents the overall results of these pairwise comparisons. The average KL-divergence and reward ranking of these models is DPO > P3O > PPO > SFT. Although DPO marginally outperforms P3O in reward, it has a considerably higher KL-divergence, which may be detrimental to generation quality. As a result, DPO has a reward win rate of 49.5% against P3O, but only 45.4% as evaluated by GPT-4. Compared with the other methods, P3O exhibits a GPT-4 win rate of 57.0% against PPO and 69.3% against SFT. This result is consistent with our findings from the KL-Reward frontier metric and confirms that P3O could better align with human preferences than previous baselines.
Conclusion
In this blog post, we present new insights into aligning large language models with human preferences via reinforcement learning. We proposed the Reinforcement Learning with Relative Feedback framework, as shown in Figure 1. Under this framework, we develop a novel policy gradient algorithm, P3O, which unifies the fundamental principles of reward modeling and RL fine-tuning through comparative training. Our results show that P3O outperforms prior methods in terms of the KL-Reward frontier as well as GPT-4 win rate.
BibTeX
This blog is based on our recent paper and blog. If this blog inspires your work, please consider citing it with:
@article{wu2023pairwise,
title={Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment},
author={Wu, Tianhao and Zhu, Banghua and Zhang, Ruoyu and Wen, Zhaojin and Ramchandran, Kannan and Jiao, Jiantao},
journal={arXiv preprint arXiv:2310.00212},
year={2023}
}