Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, and Sepp Hochreiter

RUDDER
RUDDER: identifying key events (left) and reward redistribution (right).

We introduce RUDDER (RetUrn Decomposition for DElayed Rewards), a new method for model-free reinforcement learning (RL) with delayed rewards. Model-free RL is used in many real-world applications where appropriate models are not available or are difficult to learn. RL relies on assigning credit for received rewards to past actions, often over long horizons. This task becomes particularly challenging when rewards are episodic or sparse. RUDDER overcomes the delayed reward problem through a reward redistribution that is obtained via return decomposition.
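
To make the credit assignment problem concrete, here is a minimal, purely illustrative sketch (the environment, episode length, and the "key action at t = 0" are assumptions, not from the paper): the return is decided by a single early action, but the reward is only paid at the final step, so a learner must propagate credit across the whole episode.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(T=50):
    """Toy episode with a fully delayed (episodic) reward: the return
    depends only on the action at t = 0, but the reward arrives at t = T-1."""
    actions = rng.integers(0, 2, size=T)   # random binary actions
    ret = float(actions[0] == 1)           # key event at t = 0 decides the return
    rewards = np.zeros(T)
    rewards[-1] = ret                      # reward paid only at the last step
    return actions, rewards

actions, rewards = rollout()
print("return:", rewards.sum(), "| credit belongs to the action at t=0:", actions[0])
```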

RUDDER identifies the key events (state-action pairs) associated with a change in return expectation and assigns credit by redistributing rewards to them. This replaces the expected future reward by an immediate reward and simplifies the estimation of the long-term reward. RUDDER constructs an optimal reward redistribution, in which the expected future rewards are equal to zero, which significantly speeds up learning.
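
A minimal sketch of this idea: if g_t denotes a prediction of the episode return given the sequence up to step t, the redistributed reward is the step-wise change in predicted return, r_t = g_t - g_{t-1}. By telescoping, the redistributed rewards sum to the return, and credit lands exactly at the key events where the prediction jumps. The example predictions below are hypothetical.

```python
import numpy as np

def redistribute(g):
    """g[t]: predicted final return given the sequence up to step t.
    Redistributed reward r_t = g_t - g_{t-1}, with g_{-1} = 0."""
    g = np.asarray(g, dtype=float)
    return np.diff(g, prepend=0.0)

# Hypothetical return predictions along one episode: the jump at t = 2
# marks a key event that raises the expected return from 0 to 1.
g = [0.0, 0.0, 1.0, 1.0, 1.0]
r = redistribute(g)
print(r)                  # [0. 0. 1. 0. 0.]  -> all credit at the key event
print(r.sum() == g[-1])   # True: redistributed rewards sum to the return
```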

The reward redistribution is obtained via return decomposition using contribution analysis. RUDDER uses a Long Short-Term Memory (LSTM) network to predict the expected return for state-action sequences. The LSTM detects the key events that correlate most with the reward and stores this information in its memory cells. From an analysis of the LSTM's memory, the contribution of each state-action pair to the return can be determined.
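
The following is a minimal sketch of such a return-prediction network in PyTorch; the dimensions, the module name, and the random inputs are illustrative assumptions, and the weights are untrained, whereas in RUDDER the network would be fitted to predict the episode return.

```python
import torch
import torch.nn as nn

class ReturnPredictor(nn.Module):
    """Illustrative return-decomposition network: an LSTM that emits a
    prediction of the final return at every step of a state-action sequence."""
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):                # x: (batch, T, input_dim)
        h, _ = self.lstm(x)              # per-step hidden states
        return self.head(h).squeeze(-1)  # (batch, T) return predictions

model = ReturnPredictor(input_dim=8)
x = torch.randn(4, 50, 8)                # 4 episodes of length 50 (dummy data)
g = model(x)                             # per-step return predictions g_t
# Reward redistribution as step-wise prediction differences, g_t - g_{t-1}:
redistributed = torch.diff(g, dim=1, prepend=torch.zeros(g.shape[0], 1))
```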

RUDDER is evaluated on several artificial tasks with delayed rewards, where it learns substantially faster than competing methods.

Comparison of RUDDER to other methods: learning time in episodes versus delay of the reward on artificial tasks.

Source code and demos are available on GitHub.

An extended discussion is available in a JKU news post on RUDDER.

Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 13566; e-print also at arXiv:1806.07857v3, 2019-09-10.

IARAI Authors: Dr Sepp Hochreiter
Research: Reinforcement Learning
Keywords: Delayed Reward, Long Short-Term Memory, Reinforcement Learning, Reward Redistribution
