Large language models distill broad knowledge from text corpora. However, they can be inconsistent at completing user-specified tasks. To address this challenge, we propose a novel offline-RL-motivated method, Implicit Language Q-Learning (ILQL), which we demonstrate to be a more effective and easier-to-use task-learning method for language models than current approaches based on supervised fine-tuning or reinforcement learning.
Reinforcement learning (RL) is a promising approach to language task learning. However, contemporary methods involve high systems complexity and expensive human interaction. Instead, an ideal RL method should be: (1) easy to use, (2) able to optimize user-specified rewards, (3) practical in interactive settings, (4) able to leverage existing data, and (5) temporally compositional.
Most contemporary approaches to task learning in NLP meet some of the above criteria, but not all. The paradigm of offline RL, on the other hand, has the potential to satisfy all of them by combining supervised learning's ability to leverage existing data (criterion 4) with the powerful reward-learning abstraction of RL (criteria 2, 3, and 5).
We present Implicit Language Q-Learning (ILQL), an effective, easy-to-use offline RL method for language task learning. ILQL combines the strong utility-optimization capabilities of dynamic-programming-based offline RL methods (criterion 5) with supervised learning's ease of use (criterion 1).
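To make the core idea concrete, the sketch below shows one way ILQL-style decoding can look at inference time: the logits of a fixed behavior (supervised fine-tuned) language model are shifted by a learned advantage, β(Q(s, a) - V(s)), before the next token is sampled. The function name, tensor shapes, and the `beta` argument here are illustrative assumptions rather than the paper's exact interface.

```python
import torch
import torch.nn.functional as F

def ilql_perturbed_sampling(lm_logits: torch.Tensor,
                            q_values: torch.Tensor,
                            v_value: torch.Tensor,
                            beta: float = 1.0) -> torch.Tensor:
    """Sample the next token after shifting behavior-policy logits by a
    learned advantage, beta * (Q(s, a) - V(s)).

    Args:
        lm_logits: (batch, vocab) logits from the behavior language model.
        q_values:  (batch, vocab) learned Q-values for each candidate token.
        v_value:   (batch, 1) learned state value V(s).
        beta:      strength of the utility perturbation.
    Returns:
        (batch,) sampled token ids.
    """
    perturbed_logits = lm_logits + beta * (q_values - v_value)
    probs = F.softmax(perturbed_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

# Toy usage with random tensors standing in for real model outputs.
batch, vocab = 2, 8
tokens = ilql_perturbed_sampling(torch.randn(batch, vocab),
                                 torch.randn(batch, vocab),
                                 torch.randn(batch, 1),
                                 beta=4.0)
print(tokens.shape)  # torch.Size([2])
```

Setting `beta = 0` recovers the behavior policy, while larger values trade closeness to the data for higher estimated utility.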
The lack of easily configurable task settings and reliable simulated evaluations has arguably slowed progress in applying sophisticated RL algorithms to language and dialogue tasks. To address this, we present Wordle as an easy-to-use yet challenging benchmark task with objective rewards for testing the capabilities of offline RL algorithms.
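To illustrate why Wordle admits an objective, simulator-free evaluation, the sketch below computes the standard per-letter color feedback for a guess along with a simple per-turn reward; the function names and the exact reward shaping are illustrative assumptions, not the precise specification of our benchmark.

```python
from collections import Counter

def wordle_feedback(guess: str, answer: str) -> list[str]:
    """Per-letter feedback: 'g' (right letter, right spot),
    'y' (right letter, wrong spot), or 'b' (letter unavailable)."""
    feedback = ['b'] * len(guess)
    remaining = Counter()
    # First pass: mark exact matches and count leftover answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = 'g'
        else:
            remaining[a] += 1
    # Second pass: mark misplaced letters, respecting letter counts.
    for i, g in enumerate(guess):
        if feedback[i] == 'b' and remaining[g] > 0:
            feedback[i] = 'y'
            remaining[g] -= 1
    return feedback

def wordle_reward(guess: str, answer: str) -> float:
    """Illustrative reward: 0 for a correct guess, -1 per incorrect turn."""
    return 0.0 if guess == answer else -1.0

print(wordle_feedback("crane", "cigar"))  # ['g', 'y', 'y', 'b', 'b']
```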
We evaluate ILQL on a range of tasks, datasets, and reward functions, and find that it is generally both better performing and more stable to train than baseline methods in each setting.
We present a set of example dialogues produced by our Visual Dialogue agents (shown in green), each optimized for a different reward. From left to right, the reward functions penalize asking yes/no questions to an increasing degree; as the penalty grows, our agents learn to ask fewer such questions.
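As a simplified illustration of the kind of reward described above, the sketch below subtracts a tunable penalty whenever a generated question is detected as a yes/no question; the keyword-based detector and parameter names are assumptions made for exposition, not our actual reward specification.

```python
YES_NO_STARTERS = ("is", "are", "was", "were", "do", "does", "did",
                   "can", "could", "will", "would", "has", "have", "had")

def is_yes_no_question(question: str) -> bool:
    """Crude heuristic: treat questions starting with an auxiliary verb
    as yes/no questions."""
    words = question.strip().lower().split()
    return bool(words) and words[0] in YES_NO_STARTERS

def penalized_reward(question: str, base_reward: float,
                     yes_no_penalty: float) -> float:
    """Subtract a configurable penalty for yes/no questions; sweeping
    yes_no_penalty yields a family of increasingly strict rewards."""
    penalty = yes_no_penalty if is_yes_no_question(question) else 0.0
    return base_reward - penalty

print(penalized_reward("is the dog brown?", 1.0, 2.0))      # -1.0
print(penalized_reward("what color is the dog?", 1.0, 2.0)) # 1.0
```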