Finite-Time Performance of Distributed Temporal Difference Learning with Linear Function Approximation

While distributed reinforcement learning (RL) has emerged as an important paradigm in distributed control, we are only beginning to understand the fundamental behavior of these algorithms.  Two recent papers from the DCIST alliance provide important progress in this direction.

In the multi-agent policy evaluation problem, a group of agents operate in a common environment under a fixed control policy and work together to discover the value (global discounted cumulative reward) associated with each environmental state.  Over a series of time steps, the agents act, get rewarded, update their local estimate of the value function, then communicate with their neighbors.  To solve this problem, the papers propose a distributed variant of the popular temporal difference (TD) learning method. The main contribution is a finite-time analysis of the performance of this distributed TD algorithm for both constant and time-varying step sizes. In addition, the results provide a mathematical explanation for observations that have appeared previously in the literature about how the algorithm parameters should be chosen to yield the best performance of (distributed) TD learning.
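The structure of such a scheme, local TD(0) updates with linear function approximation interleaved with consensus averaging over neighbors, can be sketched as follows. Everything below (the 5-state chain, random features, agent-local rewards, ring mixing matrix, and step-size schedule) is an illustrative toy setup, not the paper's exact algorithm or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: a 5-state Markov chain under a fixed policy,
# 3 agents, and 3-dimensional linear features (all illustrative).
n_states, n_agents, n_feats = 5, 3, 3
P = rng.dirichlet(np.ones(n_states), size=n_states)   # transition matrix
Phi = rng.standard_normal((n_states, n_feats))        # feature vectors phi(s)
R = rng.standard_normal((n_agents, n_states))         # agent-local rewards
gamma = 0.9

# Doubly stochastic weight matrix for 3 agents on a ring (consensus step).
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])

theta = np.zeros((n_agents, n_feats))                 # local TD parameters
s = 0
for t in range(20000):
    s_next = rng.choice(n_states, p=P[s])
    alpha = 1.0 / (t + 100)                           # diminishing step size
    # 1) Consensus: each agent averages parameters with its neighbors.
    theta = W @ theta
    # 2) Local TD(0) update using each agent's own observed reward.
    for i in range(n_agents):
        td_err = R[i, s] + gamma * Phi[s_next] @ theta[i] - Phi[s] @ theta[i]
        theta[i] += alpha * td_err * Phi[s]
    s = s_next

# With mixing and diminishing step sizes, the agents' parameters
# (approximately) agree; 'spread' measures their disagreement.
spread = np.max(np.abs(theta - theta.mean(axis=0)))
print(spread)
```

The consensus step shrinks disagreement between agents geometrically, while the diminishing step size shrinks the perturbation each local TD update reintroduces; the interplay of these two effects is exactly what a finite-time analysis quantifies.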

This work is currently being applied in the study of more complex control problems in robotic networks using reinforcement learning. A similar distributed Q-learning algorithm is being used to design an optimal sequence of coordinated behaviors for multi-robot systems operating in an unknown environment.  Experiments carried out in Georgia Tech’s Robotarium have demonstrated the effectiveness of these methods in executing complex tasks with a network of autonomous robots.

Conference: Thinh T. Doan, Siva Theja Maguluri, and Justin Romberg, “Finite-Time Performance of Distributed Temporal Difference Learning with Linear Function Approximation,” Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
Journal: Submitted to SIAM Journal on Mathematics of Data Science.

Points of Contact: Justin Romberg (PI) and Thinh T. Doan.

Learning to Learn with Probabilistic Task Embeddings

To operate successfully in a complex and changing environment, learning agents must be able to acquire new skills quickly. Humans display remarkable skill in this area — we can learn to recognize a new object from one example, adapt to driving a different car in a matter of minutes, and add a new slang word to our vocabulary after hearing it once. Meta-learning is a promising approach for enabling such capabilities in machines. In this paradigm, the agent adapts to a new task from limited data by leveraging a wealth of experience collected in performing related tasks. For agents that must take actions and collect their own experience, meta-reinforcement learning (meta-RL) holds the promise of enabling fast adaptation to new scenarios. Unfortunately, while the trained policy can adapt quickly to new tasks, the meta-training process requires large amounts of data from a range of training tasks, exacerbating the sample inefficiency that plagues RL algorithms. As a result, existing meta-RL algorithms are largely feasible only in simulated environments.

The lustre of off-policy meta-RL

While policy gradient RL algorithms can achieve high performance on complex high-dimensional control tasks (e.g., controlling a simulated humanoid robot to
run), they are woefully sample inefficient. For example, the state-of-the-art policy gradient method PPO requires 100 million samples to learn a good policy for humanoid. If we were to run this algorithm on a real robot, running continuously with a 20 Hz controller and without counting time for resets, it would take nearly two months to learn this policy. This sample inefficiency is largely because the data to form the policy gradient update must be sampled from the current policy, precluding the re-use of previously collected data during training. Recent off-policy algorithms (TD3, SAC) have matched the performance of policy gradient algorithms while requiring up to 100X fewer samples. If we could leverage these algorithms for meta-RL, weeks of data collection could be reduced to half a day, putting meta-learning within reach of our robotic arms. Off-policy learning offers further benefits beyond better sample efficiency when training from scratch. We could also make use of previously collected static datasets, and leverage data from other robots in other locations.
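The "nearly two months" figure above follows directly from the sample count and control rate quoted in the text:

```python
samples = 100_000_000          # PPO samples quoted for the humanoid task
hz = 20                        # controller frequency from the text
seconds = samples / hz         # wall-clock seconds of continuous operation
days = seconds / (60 * 60 * 24)
print(f"{days:.1f} days")      # ≈ 57.9 days, i.e. nearly two months
```

A 100x reduction in samples, as reported for off-policy methods like TD3 and SAC, shrinks the same calculation to roughly half a day of robot time.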

Source: K. Rakelly, “BAIR: Berkeley Artificial Intelligence Research,” June 10, 2019.

Task: RA1.A1: The Swarm’s Knowledge Base: Contextual Perceptual Representations

Localization and Mapping using Instance-specific Mesh Models

A recent paper by members of the DCIST alliance proposes an approach for building semantic maps, containing object poses and shapes, in real time, onboard an autonomous robot equipped with a monocular camera. A rich understanding of the geometry and context of a robot’s surroundings is important for the specification and safe, efficient execution of complex missions. This work develops a deformable mesh model of object shape that can be optimized online based on semantic information (object parts and segmentation) extracted from camera images. Multi-view constraints on the object shape are obtained by detecting objects and extracting category-specific keypoints and segmentation masks. The paper shows that the errors between projections of the mesh model and the observed keypoints and masks can be differentiated in order to obtain accurate instance-specific object shapes. The potential of this approach to build large-scale object-level maps will be investigated in DCIST autonomous navigation and situational awareness tasks.
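To make the keypoint term concrete, the following is a minimal sketch of the kind of reprojection residual that gets differentiated with respect to shape and pose: mesh keypoints are projected through a pinhole camera and compared against detected pixel keypoints. The camera intrinsics, pose, and keypoint values here are illustrative stand-ins, not the paper's data, and the paper's full objective also includes a mask term omitted here.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of 3-D points X (N,3) to pixel coordinates (N,2)."""
    Xc = X @ R.T + t                      # world frame -> camera frame
    uv = Xc @ K.T                         # apply camera intrinsics
    return uv[:, :2] / uv[:, 2:3]         # perspective divide

def keypoint_residual(K, R, t, mesh_keypoints, detected_px):
    """Sum of squared pixel errors between projected model keypoints and
    detected keypoints -- the kind of differentiable term minimized over
    object shape and pose (values here are illustrative)."""
    err = project(K, R, t, mesh_keypoints) - detected_px
    return (err ** 2).sum()

# Illustrative camera: 500-pixel focal length, 640x480 principal point.
K = np.array([[500.,   0., 320.],
              [  0., 500., 240.],
              [  0.,   0.,   1.]])
R_wc = np.eye(3)
t_wc = np.array([0., 0., 4.])             # object 4 m in front of the camera

# Three 3-D keypoints on a hypothetical object model.
kp3d = np.array([[ 0.1, 0.0, 0.0],
                 [-0.1, 0.0, 0.0],
                 [ 0.0, 0.2, 0.0]])

detected = project(K, R_wc, t_wc, kp3d)   # perfect detections -> zero error
print(keypoint_residual(K, R_wc, t_wc, kp3d, detected))  # 0.0
```

In the actual system this residual would be minimized by gradient descent over the deformable mesh vertices and object pose; an autodiff framework such as PyTorch would supply the gradients that the finite expression above only hints at.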


Additional Details:

Source: Q. Feng, Y. Meng, M. Shan, and N. Atanasov, “Localization and Mapping using Instance-specific Mesh Models,” IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), November 2019.

Task: RA1.A1: The Swarm’s Knowledge Base: Contextual Perceptual Representations

Points of Contact: Nikolay Atanasov (PI) and Qiaojun Feng.