Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, and Sepp Hochreiter
We introduce RUDDER (RetUrn Decomposition for DElayed Rewards), a new method for model-free reinforcement learning (RL) with delayed rewards. Model-free RL is used in many real-world applications where appropriate models are unavailable or difficult to learn. RL relies on assigning credit for received rewards to past actions, often over a long time horizon. This task becomes particularly challenging when rewards are episodic or sparse. RUDDER overcomes the delayed-reward problem through reward redistribution, which is obtained via return decomposition.
RUDDER identifies the key events (state-action pairs) associated with a change in the expected return and assigns credit by redistributing rewards to them. This replaces the expected future reward with the immediate reward and simplifies the estimation of the long-term reward. RUDDER constructs an optimal reward redistribution in which the expected future rewards are zero, which significantly speeds up learning.
The reward redistribution is obtained via return decomposition using contribution analysis. RUDDER uses a Long Short-Term Memory (LSTM) network to predict the expected return for state-action sequences. The LSTM detects the key events that correlate most strongly with the reward and stores this information in its memory cells. By analyzing the LSTM's memory, the contribution of each state-action pair to the return can be determined.
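As a rough illustration of return decomposition, the sketch below turns a sequence of return predictions into per-step rewards by taking differences of consecutive predictions. This is only one simple form of contribution analysis, assumed here for illustration. The function name `redistribute_rewards`, the hand-written `predictions` list (standing in for the output of a trained return predictor such as an LSTM), the zero starting prediction, and the choice to assign any residual prediction error to the final step are all assumptions of this sketch, not details given in the abstract.

```python
def redistribute_rewards(predicted_returns, episode_return):
    """Convert per-step return predictions into redistributed rewards.

    predicted_returns[t] is a hypothetical model's estimate of the final
    episode return after observing the state-action sequence up to step t.
    """
    if not predicted_returns:
        return []

    redistributed = []
    previous = 0.0  # assumption: the prediction before the first step is zero
    for prediction in predicted_returns:
        # A step receives credit equal to the change it causes in the
        # predicted return.
        redistributed.append(prediction - previous)
        previous = prediction

    # Assumption of this sketch: put any remaining prediction error on the
    # last step so the redistributed rewards sum to the actual return.
    redistributed[-1] += episode_return - predicted_returns[-1]
    return redistributed


# Toy episode: the environment pays 1.0 only at the end, but the
# (hand-written) predictions jump at step 2, the assumed key event.
predictions = [0.0, 0.0, 1.0, 1.0, 1.0]
rewards = redistribute_rewards(predictions, episode_return=1.0)
print(rewards)
```

In this toy episode the whole return is moved to the step where the prediction changes, so every later step carries zero reward and nothing remains to be propagated backward from the end of the episode.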
RUDDER is evaluated on several artificial tasks with delayed rewards and is shown to outperform other methods.
Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 13566; e-print also at arXiv:1806.07857v3, 2019-09-10.