Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution
Vihang P. Patil, Markus Hofmarcher, Marius-Constantin Dinu, Matthias Dorfer, Patrick M. Blies, Johannes Brandstetter, Jose A. Arjona-Medina, and Sepp Hochreiter
We previously developed RUDDER, a new method for model-free reinforcement learning (RL) with delayed rewards. RUDDER solves complex RL tasks with sparse and delayed rewards through reward redistribution, which is obtained via return decomposition. RUDDER replaces expected future rewards with immediate rewards by redistributing the rewards to the key events associated with changes in return expectations.
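As a rough illustration of reward redistribution via return decomposition, the following minimal sketch assumes that some model has already produced a prediction of the final return after each step of an episode. The redistributed reward at a step is then taken as the change in that prediction. The function name `redistribute` and the example numbers are hypothetical and chosen only for illustration; they are not taken from the RUDDER implementation.

```python
def redistribute(return_predictions):
    """Turn per-step predictions of the final return into per-step rewards.

    Assumption: return_predictions[t] is a model's estimate of the episode
    return after seeing the sequence up to step t. The reward assigned to
    step t is the change in that estimate (the estimate before step 0 is
    taken to be 0).
    """
    rewards = []
    previous = 0.0
    for prediction in return_predictions:
        rewards.append(prediction - previous)
        previous = prediction
    return rewards


# Hypothetical predictions: the expected return jumps at steps 2 and 4,
# which therefore act as the key events.
predictions = [0.0, 0.0, 0.5, 0.5, 1.0]
rewards = redistribute(predictions)
```

In this sketch the redistributed rewards sum to the last prediction, so the total return is preserved while the reward is moved to the steps at which the expectation changes.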
Here, we introduce a modified method, Align-RUDDER, which facilitates discovering the key events with high rewards. To this end, Align-RUDDER uses profile models known from bioinformatics instead of LSTM networks. LSTM networks require a large number of examples for learning, and exploration often fails to reach the relevant events. Align-RUDDER learns events with high rewards from demonstrations instead of exploration and requires only a few demonstrations. A profile model is obtained from a multiple sequence alignment of the demonstrations. State-action sequences are aligned to the profile model and receive scores based on their similarity. From the differences in the alignment scores of consecutive subsequences, the state-action pairs with high rewards are identified and rewards are redistributed to them. This allows adjusting the policy so that the relevant events are achieved more often, thus increasing the return.
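The scoring idea described above can be sketched in a strongly simplified form. The sketch assumes that each state-action pair has already been mapped to a single-letter event symbol, that the demonstrations are already aligned and of equal length, and that new sequences are scored position by position without gaps. A full alignment to a profile model would use dynamic programming with gap handling, which is omitted here. All names (`build_profile`, `prefix_scores`, `redistributed_rewards`), the uniform background distribution, the pseudocount, and the example sequences are assumptions made for illustration, not part of the Align-RUDDER code.

```python
import math
from collections import Counter


def build_profile(aligned_demos, alphabet, pseudocount=1.0):
    """Position-specific log-odds scores from equal-length aligned sequences.

    Assumption: a uniform background distribution over the alphabet.
    """
    background = 1.0 / len(alphabet)
    total = len(aligned_demos) + pseudocount * len(alphabet)
    profile = []
    for pos in range(len(aligned_demos[0])):
        counts = Counter(demo[pos] for demo in aligned_demos)
        profile.append({
            event: math.log(((counts[event] + pseudocount) / total) / background)
            for event in alphabet
        })
    return profile


def prefix_scores(sequence, profile):
    """Gapless score of every prefix of the sequence against the profile."""
    scores, running = [], 0.0
    for pos, event in enumerate(sequence[:len(profile)]):
        running += profile[pos][event]
        scores.append(running)
    return scores


def redistributed_rewards(sequence, profile):
    """Reward at step t = score of prefix up to t minus score of prefix up to t-1."""
    scores = prefix_scores(sequence, profile)
    return [cur - prev for cur, prev in zip(scores, [0.0] + scores[:-1])]


# Hypothetical demonstrations encoded as event symbols.
demos = ["SKDG", "SKDG", "SKLG"]
alphabet = "SKDLGX"
profile = build_profile(demos, alphabet)

# A new sequence that misses the second event seen in all demonstrations.
trajectory = "SXDG"
rewards = redistributed_rewards(trajectory, profile)
```

In this toy setup, an event that matches what the demonstrations show at a position raises the prefix score and receives a positive reward, while an event never seen at that position lowers the score and receives a negative one.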
An implementation of Align-RUDDER is available on GitHub.