Markus Holzleitner, Lukas Gruber, José Arjona-Medina, Johannes Brandstetter, and Sepp Hochreiter
In this work, we prove the convergence of general actor-critic reinforcement learning (RL) algorithms. Actor-critic RL algorithms simultaneously learn a policy function (the actor) and a value function (the critic) that estimates values, advantages, or redistributed rewards. The critic is responsible for credit assignment, i.e., identifying the state-action pairs that lead to receiving a reward. Using this credit assignment, the actor is updated to increase the return.
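The actor-critic interplay described above can be illustrated with a minimal sketch. The two-armed bandit setup, the step sizes, and all names below are hypothetical choices for illustration, not the paper's setting: the critic learns a scalar baseline value, and the actor's softmax policy is updated with the advantage (reward minus baseline) that the critic provides.

```python
import numpy as np

# Toy actor-critic on a 2-armed bandit (illustrative only, not the paper's setup):
# the critic learns a baseline value v, the actor's softmax policy is pushed
# in the direction of the advantage r - v computed by the critic.
rng = np.random.default_rng(0)
true_means = np.array([0.0, 1.0])   # arm 1 yields higher reward
theta = np.zeros(2)                 # actor parameters (softmax logits)
v = 0.0                             # critic: scalar value baseline
alpha_actor, alpha_critic = 0.05, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(5000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)
    r = true_means[a] + rng.normal(0, 0.1)
    advantage = r - v                   # critic's credit assignment
    v += alpha_critic * advantage      # critic update (value estimation)
    grad_logpi = -pi                    # gradient of log pi(a) for softmax
    grad_logpi[a] += 1.0
    theta += alpha_actor * advantage * grad_logpi  # actor update

print(softmax(theta))  # policy concentrates on the better arm
```

Even in this toy case, the actor's update direction depends entirely on the critic's current estimate, which is why convergence analyses must treat the two learning processes jointly.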
To prove convergence, we employ recently developed techniques from two-time-scale stochastic approximation theory. Stochastic approximation algorithms are iterative procedures for finding stationary points of functions when only noisy observations are available. Two-time-scale stochastic approximation algorithms use two coupled iterations that move at different speeds. Recent advances introduced controlled Markov processes, which can treat policies that become greedy during learning and describe how to handle previous policies. Earlier convergence proofs assume linear neural networks, cannot treat episodic samples, and do not consider policies that become greedy. In our work, both the policy and the value function can be deep neural networks of arbitrary complexity, albeit without shared weights. Our results are also valid for actor-critic methods that use episodic samples.
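The two-time-scale idea can be sketched on a toy quadratic problem. Everything below (the functions, the step-size exponents, the noise levels) is a hypothetical illustration: the fast iterate `w` plays the role of the critic and tracks the solution of an inner problem for the current slow iterate `x` (the actor), whose step sizes shrink faster so that, relative to `w`, it appears quasi-static.

```python
import numpy as np

# Two coupled stochastic approximation iterations at different speeds
# (illustrative sketch, not the paper's construction):
# fast iterate w chases w*(x) = x for the current slow iterate x;
# slow iterate x takes noisy gradient steps on f(x) = x^2 / 2 using w.
rng = np.random.default_rng(1)
x, w = 5.0, 0.0
for n in range(1, 20001):
    a_n = 0.5 / n          # slow step size (actor-like)
    b_n = 0.5 / n**0.6     # fast step size (critic-like); a_n / b_n -> 0
    # fast time scale: noisy fixed-point iteration toward w*(x) = x
    w += b_n * ((x - w) + rng.normal(0, 0.1))
    # slow time scale: noisy gradient step using the tracked w in place of x
    x -= a_n * (w + rng.normal(0, 0.1))

print(x, w)  # both iterates settle near the stationary point 0
```

The separation condition `a_n / b_n -> 0` is what lets the analysis treat the fast iterate as having already equilibrated when studying the slow one, which is the core mechanism exploited in two-time-scale convergence proofs.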
We apply our convergence proof to proximal policy optimization (PPO) and RUDDER. PPO is a policy optimization method that uses multiple epochs of stochastic gradient ascent to perform each policy update. RUDDER solves the problem of sparse and delayed rewards by reward redistribution, which assigns rewards to relevant state-action pairs; the critic is a reward redistribution network, typically an LSTM. We discuss the technical assumptions and the details of the proofs.
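For concreteness, PPO's clipped surrogate objective can be written as a short function. This is the standard objective from the PPO literature, stated in a simplified form with illustrative names; it is not the paper's proof machinery. Clipping the probability ratio to `[1 - eps, 1 + eps]` is what keeps repeated epochs of gradient ascent on the same batch from moving the policy too far.

```python
import numpy as np

# PPO clipped surrogate objective (standard form, simplified for illustration):
# the ratio pi_new(a|s) / pi_old(a|s) is clipped so that multiple epochs of
# stochastic gradient ascent on one batch cannot push the policy too far.
def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    ratio = np.exp(logp_new - logp_old)            # pi_new / pi_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # pessimistic (elementwise min) bound: removes the incentive to move
    # the ratio outside the trust interval in the favorable direction
    return np.minimum(ratio * advantages, clipped * advantages).mean()

# Example: ratio 1.25 with positive advantage is clipped down to 1.2
obj = ppo_clip_objective(np.log(np.array([0.5])),
                         np.log(np.array([0.4])),
                         np.array([1.0]))
print(obj)  # -> 1.2
```

In the convergence analysis, this repeated-epoch update is exactly the kind of actor iteration that must be coupled with the critic's value estimate on the slower time scale.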