Supercharge your LLM with Retrieval Augmented Fine Tuning.

Introduction

Large language models (LLMs) have become increasingly valuable for answering questions in specialized domains such as medical or legal documents. To increase their performance, it is common to inject domain-specific knowledge into LLMs through techniques such as Retrieval-Augmented Generation (RAG) or fine-tuning. In this blog post, we explore a fine-tuning technique called Retrieval Augmented Fine-Tuning (RAFT) and evaluate its effectiveness in adapting pre-trained LLMs for RAG in specialized domains.

RAG today

RAG is a way to enhance LLMs with knowledge that is not already "baked in" during the training phase, such as domain-specific or more up-to-date information. A common way to build a RAG system is to retrieve the relevant document chunks from a vector store and inject them directly into the LLM prompt. A typical prompt for the LLM would look like this:

“Context information is given below:
{Context}
Answer the question using the context information and not prior knowledge.
Question: {Question}
Answer: ”
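
To make this concrete, here is a minimal sketch of how such a prompt might be assembled in Python. The retriever object and its search method are placeholders for whatever vector store client is in use, not a specific library API.

# Minimal sketch of RAG prompt assembly (illustrative only).
# "retriever" stands in for any vector-store client; its search() method is assumed.
def build_rag_prompt(question: str, retriever, top_k: int = 4) -> str:
    chunks = retriever.search(question, top_k=top_k)  # top-k chunks most similar to the question
    context = "\n\n".join(chunks)
    return (
        "Context information is given below:\n"
        f"{context}\n"
        "Answer the question using the context information and not prior knowledge.\n"
        f"Question: {question}\n"
        "Answer: "
    )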

Check out our RAG in 4 Lines of Code guide.

Although these systems are easy to build, there may still be room to squeeze out additional performance. The debate often revolves around whether RAG or fine-tuning is better for a given use case. A recent paper called RAFT studies this problem and proposes a new method that combines fine-tuning of pre-trained LLMs with retrieval-augmented question answering (QA) data.

What is RAFT?

Retrieval Augmented Fine-Tuning (RAFT), introduced by Zhang et al., is a method designed to enhance the performance of LLMs in specific domains. RAFT improves the quality of responses by leveraging chain-of-thought (CoT) answers generated from the provided data. Essentially, RAFT uses a larger pre-trained model to generate CoT responses and then fine-tunes a smaller, more specialized model on those responses. This yields high-quality CoT training data and significantly increases the smaller model's performance. By doing so, RAFT bridges the gap between general-purpose LLMs and the specialized knowledge required for specific domains.


Figure 1: An example LLM prompt for producing CoT answers, given the relevant context along with a set of distractor documents.
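
To illustrate the data-generation step, here is a rough sketch of how a RAFT-style training example could be built: the oracle chunk is mixed with a few distractor chunks, and a larger teacher model is asked for a CoT answer. The teacher_generate function and the prompt wording are placeholders, not the exact template from the paper.

import random

def make_raft_example(question, oracle_chunk, all_chunks, teacher_generate, num_distractors=3):
    # Sample distractor chunks that do not contain the answer.
    distractors = random.sample([c for c in all_chunks if c != oracle_chunk], num_distractors)
    docs = distractors + [oracle_chunk]
    random.shuffle(docs)  # so the oracle's position is not a giveaway
    context = "\n\n".join(f"Document {i + 1}:\n{d}" for i, d in enumerate(docs))
    prompt = (
        f"{context}\n\n"
        f"Question: {question}\n"
        "Think step by step, citing the relevant document, then give the final answer."
    )
    # teacher_generate is a stand-in for a call to the larger model (e.g. Llama3-70B-Instruct).
    cot_answer = teacher_generate(prompt)
    return {"question": question, "context": context, "cot_answer": cot_answer}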

Why use RAFT?

One of the main advantages of RAFT is its ability to fine-tune chat or instruct models without having to re-align them for chat functionality afterward. This saves time and resources that would otherwise be spent re-aligning the model for conversational use. By focusing on domain-specific fine-tuning, RAFT ensures that the LLM produces more accurate and contextually relevant responses.

The original RAFT paper presents experiments using the Llama2-7B model, demonstrating its effectiveness in various specialized domains. In particular, while using RAG often improves QA performance over using the LLM alone, fine-tuning and RAFT consistently outperform RAG by a large margin.

This begs the question: How does RAFT perform with newer models like Llama3-8B? By comparing these models, we can gain insight into the scalability and improvements offered by the latest developments in LLMs.

How does RAFT work on new LLMs?

The published code for RAFT is available in this GitHub repository. We used all the default settings with a few small changes:

  • While the paper uses GPT-4 to generate questions and answers, we chose the Llama3-70B-Instruct model since we host it ourselves.
  • We generated 1 question per chunk and included 3 distractor documents per data point.
  • Instead of full supervised fine-tuning, we used LoRA.

For the data, we used the HotpotQA dataset, specifically its chunked contexts, to generate the data points (i.e., questions and CoT answers). The questions and answers from HotpotQA itself are not included in the generated data, so the model cannot simply memorize them. For the sake of time, we sampled only 100 chunks. The resulting dataset is available on Hugging Face.
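
Continuing the sketch above, the loop below shows roughly how the 100 data points could be produced from HotpotQA context chunks, with one generated question and 3 distractors per row. Field names follow the public hotpot_qa dataset on Hugging Face; teacher_generate and make_raft_example are the placeholders from the earlier sketch.

from datasets import load_dataset

# The "distractor" config ships each question with its supporting context paragraphs.
ds = load_dataset("hotpot_qa", "distractor", split="train")

# Flatten the context paragraphs into standalone chunks.
chunks = []
for ex in ds.select(range(50)):
    for title, sentences in zip(ex["context"]["title"], ex["context"]["sentences"]):
        chunks.append(f"{title}: {''.join(sentences)}")
chunks = chunks[:100]  # only 100 chunks, for the sake of time

rows = []
for chunk in chunks:
    # One generated question per chunk (first of the 2 calls per row).
    question = teacher_generate(f"Write one question answerable from this passage:\n{chunk}")
    # CoT answer with 3 distractors (second call), reusing make_raft_example from above.
    rows.append(make_raft_example(question, chunk, chunks, teacher_generate, num_distractors=3))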

Since we are focused on limited-compute environments, we are interested in models in the 7-8B range or smaller. We therefore chose the Llama3 8B and Llama3.1 8B Instruct models and their 4-bit quantized variants for our experiments.

We also compare the results against Llama2-7B-Chat as a baseline. For training, we used the TRL SFT Trainer. Fine-tuned models were evaluated with EleutherAI's lm-evaluation-harness on HotpotQA's validation set (1k samples) on an NVIDIA A100-SXM4-40GB.
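
For reference, a minimal training sketch along these lines is shown below. It assumes the generated data has been written to a raft_train.jsonl file with a single "text" column holding the formatted prompt plus CoT answer; argument names vary between TRL versions, so treat this as an outline rather than the exact script we ran.

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)

# LoRA instead of full supervised fine-tuning; rank 64 on all linear layers scored best for us.
# The lora_alpha value here is illustrative, not a tuned setting.
peft_config = LoraConfig(r=64, lora_alpha=16, target_modules="all-linear", task_type="CAUSAL_LM")

# Assumed file name; each row holds one formatted RAFT prompt + CoT answer under "text".
train_dataset = load_dataset("json", data_files="raft_train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="raft-llama3-8b"),
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()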

Results

Figure 2 below shows the F1 scores of the fine-tuned and pre-trained models. We observe significant performance gains from fine-tuning on RAFT-style data for most of the tested models. In particular, the gain was greater than 60% for the Llama3 variants and greater than 100% for Llama2 7B. In comparison, fine-tuning Llama3.1 8B yields a 16% improvement.

Using 4-bit quantized variants of the Llama3 models, we were able to retain 91-94% of the performance while using only 25% of the GPU memory dedicated to model weights.
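
One common way to load such a 4-bit variant is bitsandbytes NF4 quantization through transformers, sketched below. The post does not pin down the exact quantization scheme, so take this as an assumption.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 weights with bf16 compute; weights take roughly a quarter of the bf16 footprint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)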

For the LoRA configuration, we found that using "all-linear" as the target modules is more effective than targeting only a subset of modules. Likewise, a higher LoRA rank (64) yields a higher score than a lower rank (16). Here we report the best scores from hyperparameter tuning.


Figure 2: F1 scores of the fine-tuned (blue) and pre-trained (orange) models on 1,000 samples of the HotpotQA dataset.
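
As a concrete reference for the LoRA settings above, the two configurations below contrast the best-scoring setup (rank 64, all linear layers) with a smaller variant (rank 16, a subset of attention modules). The lora_alpha value and the module names in the subset are illustrative, not tuned values from our runs.

from peft import LoraConfig

# Best-scoring configuration in our experiments: rank 64 across all linear layers.
best = LoraConfig(r=64, lora_alpha=16, target_modules="all-linear", task_type="CAUSAL_LM")

# Lower-rank variant targeting only a subset of modules, which scored lower.
smaller = LoraConfig(r=16, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")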

Caveats and limitations

Initial runs showed that CoT responses were cut off at max_new_tokens=512. With max_new_tokens=800, the models were able to generate complete CoT responses. This roughly doubles performance compared to the lower setting, but also consumes more time and GPU memory.
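
For illustration, the snippet below shows the generation setting in question with plain transformers; the evaluation harness wires this up differently, so this is only a sketch of the parameter, not our evaluation code.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

raft_prompt = "<formatted RAFT prompt with context, distractors and question>"  # placeholder
inputs = tokenizer(raft_prompt, return_tensors="pt").to(model.device)
# 512 new tokens truncated CoT answers in our runs; 800 let them finish.
outputs = model.generate(**inputs, max_new_tokens=800)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))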

Time and cost are also important factors to consider. Generating the dataset (100 rows) takes about 30 minutes. At the current estimated price ($0.0012/request, 2 calls per row), the dataset costs $0.24. Once we have the dataset, fine-tuning the model takes about 10 minutes on average. At the current training compute price ($4/hr), training costs $0.67. A fine-tuned model costs less than $1 end to end! Of course, some datasets may have different training requirements, and hyperparameter tuning can also increase costs.
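
A quick back-of-the-envelope check of those numbers:

# Cost sanity check using the figures quoted above.
rows, calls_per_row, price_per_call = 100, 2, 0.0012
data_cost = rows * calls_per_row * price_per_call   # $0.24
train_cost = (10 / 60) * 4.0                        # 10 minutes at $4/hr, about $0.67
print(f"data ${data_cost:.2f} + training ${train_cost:.2f} = ${data_cost + train_cost:.2f}")  # under $1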

We used Llama3-70B-Instruct as the question-and-answer generator, but there are higher-ranked models on the LMSYS Chatbot Arena leaderboard that could yield better-quality questions and answers.

What's next?

RAFT appears to be an effective method for adapting smaller LLMs to domain-specific data. From context chunks, questions and CoT answers can easily be generated with RAFT to build datasets for fine-tuning instruct models. This not only removes the need to re-align a fine-tuned base model for chat, but also substantially reduces the amount of data required for fine-tuning in general. If you want RAFT to be available on the Clarifai platform, send us a message in our Community Discord channel!
