1/16/2024: DPO > RLHF
Today is a light news day, so let’s talk about some exciting developments in AI research. The paper “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” was dropped on arXiv, and Andrew Ng wrote a long post about it. In essence, the paper proposes an elegant loss function for optimizing an LLM directly on paired preference data. When RLHF was first proposed, it was a clunky process: researchers had to fit a reward model to the preference data and then run PPO to optimize the model against that reward. With DPO (direct preference optimization), a simple loss function is applied directly to the preference data to update the LLM’s parameters, with no separate reward model and no RL loop. The experimental results in the paper look promising.
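For concreteness, here is a minimal sketch of what the DPO objective looks like in PyTorch. The function name, argument names, and the beta value are my own choices for illustration; the inputs are assumed to be the summed per-token log-probabilities of the preferred and rejected responses under the model being trained and under a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sketch of the DPO loss over a batch of preference pairs.

    Each tensor holds the summed log-probability of the chosen
    (preferred) or rejected response, under either the trainable
    policy or the frozen reference model.
    """
    # Log-ratio of policy vs. reference for each response
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Push the chosen response's implicit reward above the
    # rejected one's, scaled by beta, via a logistic loss
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

The implicit reward here is beta times the log-ratio of the policy against the reference model, which is exactly why the paper’s title calls the language model “secretly a reward model.”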
Andrew Ng said this paper could have a huge impact on LLMs, and it certainly makes LLM training simpler. AFAIK, most of an LLM’s power still comes from the sheer amount of data and computation; the preference data is used for value alignment, which can be controversial since different people carry different biases and political views. This is a cool and beautiful paper, and we’ll see what kind of impact it has down the road. I believe it will be great for various special-purpose open-source models. But I am afraid that monolithic models like GPT-4, Claude, or Gemini will have too much power to dictate what is right or wrong if there is no transparent process for collecting the preference data.