On August 17, 2023, researchers at Google DeepMind presented a new method for improving the quality of large language models (LLMs) by aligning them with human preferences. To demonstrate the method's effectiveness, they chose machine translation (MT) as a test bed.
The proposed method, Reinforced Self-Training (ReST), draws inspiration from growing batch reinforcement learning (RL). It works in two steps: first, the LLM generates new synthetic training data from its own outputs (the Grow step); then, the LLM is fine-tuned on this generated data, filtered by a reward model that scores each sample and thereby steers the learning process (the Improve step).
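In outline, a single ReST round can be sketched as follows. This is a minimal illustration with hypothetical `policy` and `reward_model` objects and invented helper names (`sample`, `score`, `finetune`); the paper specifies the procedure at this level of abstraction, not this code.

```python
# Minimal sketch of one ReST round. `policy`, `reward_model`, and their
# methods (sample, score, finetune) are hypothetical stand-ins, not the
# paper's implementation.

def rest_round(policy, source_sentences, reward_model,
               samples_per_source=10, thresholds=(0.7, 0.8, 0.9)):
    # Grow step: the current policy generates a synthetic dataset offline.
    grow_dataset = [
        (src, policy.sample(src))
        for src in source_sentences
        for _ in range(samples_per_source)
    ]

    # Score every generated sample once with the reward model.
    scored = [(src, out, reward_model.score(src, out))
              for src, out in grow_dataset]

    # Improve step(s): filter at increasing reward thresholds and fine-tune
    # on the survivors, reusing the same offline dataset each time.
    for tau in thresholds:
        filtered = [(src, out) for src, out, r in scored if r >= tau]
        policy = policy.finetune(filtered)
    return policy
```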
Why did DeepMind choose machine translation as a test bed for ReST? First, the researchers regard machine translation as an "impactful application" of LLMs. MT also has "strong baselines and well-defined evaluation procedures", making it an ideal benchmark for assessing the effectiveness of ReST.
Furthermore, the availability of "several reliable scoring and evaluation methods", including MetricX, BLEURT, and COMET, which can serve as reward models, makes it possible to assess the effectiveness of ReST objectively, which strengthens the credibility of the research.
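To make the reward-model idea concrete, here is how a learned MT metric can score candidate translations using the open-source `unbabel-comet` package. This only illustrates the public COMET API, not DeepMind's internal setup; the checkpoint name is simply the package's standard reference-based model.

```python
# Scoring candidate translations with COMET (public unbabel-comet package).
# Illustrative only: this shows the library's API, not the paper's setup.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # reference-based model
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Der Hund bellt.",
        "mt": "The dog is barking.",
        "ref": "The dog barks.",
    }
]
output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.scores)        # per-segment quality scores
print(output.system_score)  # corpus-level average
```

In a ReST-style loop, per-segment scores like these would play the role of the rewards used to filter the Grow dataset.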
To test the generality of their approach, the researchers evaluated ReST on several benchmark datasets, including IWSLT 2014, WMT 2020, and an internal Web Domain dataset, each with a different language pair. "We chose a different language pair for each dataset to test the generality of the results," they said.
In addition to automatic metrics, the researchers conducted human evaluations to check that ReST's outputs align with human preferences: human raters scored the translations on a scale of 0 to 6, adding a qualitative dimension to the evaluation.
ReST improves translation quality
The results show that ReST can substantially improve translation quality, as measured by both automatic metrics and human evaluation on machine translation benchmarks.
According to the researchers, what sets ReST apart is its efficiency: it is more sample- and compute-efficient than online reinforcement learning methods because it generates its training data offline, which allows the data to be reused across fine-tuning steps.
As Far El (@far__el), a technology optimist and AI accelerationist, put it in a tweet on August 21, 2023, ReST "marked another step towards fully autonomous machines and the beginning of the end for manual fine-tuning".
Beyond machine translation, ReST shows promise in other generative learning settings, including summarization, turn-based dialogue, and generative audio and video models, the authors note.
The authors conclude that this adaptability positions ReST as a versatile methodology for enhancing reinforcement learning from human feedback (RLHF) across a broad range of language-related tasks.
Authors: Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando de Freitas