Finite-Time Performance of Distributed Temporal Difference Learning with Linear Function Approximation
While distributed reinforcement learning (RL) has emerged as an important paradigm in distributed control, we are only beginning to understand the fundamental behavior of these algorithms. Two recent papers from the DCIST alliance provide important progress in this direction.
In the multi-agent policy evaluation problem, a group of agents operates in a common environment under a fixed control policy and works together to discover the value (the global discounted cumulative reward) associated with each environmental state. Over a series of time steps, the agents act, receive rewards, update their local estimates of the value function, and communicate with their neighbors. To solve this problem, a distributed variant of the popular temporal difference (TD) learning method is proposed. The main contribution is a finite-time analysis of the performance of this distributed TD algorithm for both constant and time-varying step sizes. The results also provide a mathematical explanation for observations that have appeared previously in the literature about the choice of algorithm parameters that yield the best performance of (distributed) TD learning.
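To make the setting concrete, the following is a minimal sketch of distributed TD(0) with linear function approximation in the consensus-plus-local-update style described above. All specifics here (the problem sizes, the uniform mixing matrix `W`, the random features and rewards) are illustrative assumptions, not the authors' implementation or experimental setup.

```python
import numpy as np

# Illustrative sizes and constants (assumptions for this sketch only).
n_agents, n_states, dim = 4, 10, 5
gamma, alpha = 0.9, 0.05                      # discount factor, constant step size
rng = np.random.default_rng(0)

Phi = rng.standard_normal((n_states, dim))    # feature matrix: row s is phi(s)
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)             # transition kernel under the fixed policy
W = np.full((n_agents, n_agents), 1.0 / n_agents)  # doubly stochastic consensus weights
theta = np.zeros((n_agents, dim))             # each agent's local parameter vector

s = rng.integers(n_states)
for t in range(5000):
    s_next = rng.choice(n_states, p=P[s])     # environment transitions under the policy
    theta_mix = W @ theta                     # communicate: weighted average with neighbors
    for i in range(n_agents):
        r_i = rng.normal(loc=(i + 1) / n_agents)   # agent i's private local reward
        td_error = r_i + gamma * Phi[s_next] @ theta[i] - Phi[s] @ theta[i]
        theta[i] = theta_mix[i] + alpha * td_error * Phi[s]   # local TD(0) step
    s = s_next

# With a doubly stochastic W, the local parameters remain close to one another and
# jointly approximate the value function of the average of the agents' local rewards.
print(np.round(theta.mean(axis=0), 3))
```

The step size `alpha` is held constant here; the finite-time analysis in the paper covers both this case and time-varying step sizes.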
This work is currently being applied in the study of more complex control problems in robotic networks using reinforcement learning. A similar distributed Q-learning algorithm is being used to design an optimal sequence of coordinated behaviors for multi-robot systems operating in an unknown environment. Simulations run on Georgia Tech's Robotarium have demonstrated the effectiveness of these methods in executing complex tasks with a network of autonomous robots.
Sources:
Conference: Thinh T. Doan, Siva Theja Maguluri, and Justin Romberg, "Finite-Time Performance of Distributed Temporal Difference Learning with Linear Function Approximation," Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
Journal: Submitted to SIAM Journal on Mathematics of Data Science.
Points of Contact: Justin Romberg (PI) and Thinh T. Doan.