I am a PhD student at the University of Wisconsin-Madison advised by Josiah Hanna. I am broadly interested in improving the data efficiency of reinforcement learning (RL). Recently, my research has focused on:
- Using adaptive, off-policy sampling to improve the data efficiency of on-policy learning.
- Understanding how to most effectively leverage data augmentation for off-policy learning.
Feel free to drop me an email if you're interested in chatting!
News
- [November 2024] I received a Top Reviewer Award at NeurIPS 2024!
- [July 2024] I joined Amazon's Rufus Team as a research intern!
- [May 2024] 1 paper accepted at RLC 2024!
- [January 2024] 1 paper accepted at ICLR 2024!
- [October 2023] I gave a talk on adaptive off-policy sampling for on-policy learning at the University of Edinburgh RL reading group!
- [July 2023] 1 paper accepted at IEEE SmartGridComm 2023!
- [May 2022] 1 paper accepted at CoLLAs 2022!
Publications and Preprints
Nicholas E. Corrado, Josiah P. Hanna
Under Review, 2024
[arXiv] [bibtex]
Abstract
On-policy reinforcement learning (RL) algorithms perform policy updates using i.i.d. trajectories collected by the current policy. However, after observing only a finite number of trajectories, on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to noisy updates and data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error than on-policy sampling. Motivated by this observation, we introduce an adaptive, off-policy sampling method to improve the data efficiency of on-policy policy gradient algorithms. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled with respect to the current policy. Rather than discarding data from old policies -- as is commonly done in on-policy algorithms -- PROPS uses data collection to adjust the distribution of previously collected data to be approximately on-policy. We empirically evaluate PROPS on both continuous-action MuJoCo benchmark tasks and discrete-action tasks and demonstrate that PROPS (1) decreases sampling error throughout training and (2) improves the data efficiency of on-policy policy gradient algorithms. Our work improves the RL community's understanding of a nuance in the on-policy vs. off-policy dichotomy: on-policy learning requires on-policy data, not on-policy sampling.
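To give a rough sense of the core idea, here is a minimal toy sketch in Python (illustrative only, not the PROPS implementation; the function name and the `strength` knob are hypothetical): given the current policy's action probabilities and the empirical action counts in a data buffer, a behavior policy upweights whichever actions are under-sampled relative to the current policy.

```python
import numpy as np

def behavior_distribution(target_probs, action_counts, strength=5.0):
    """Toy sketch of adaptive off-policy sampling (not PROPS itself):
    upweight actions that are under-sampled in the buffer relative to the
    current policy, so the aggregate data drifts toward the on-policy
    distribution."""
    empirical = action_counts / max(action_counts.sum(), 1)
    deficit = target_probs - empirical            # positive = under-sampled
    logits = np.log(target_probs + 1e-8) + strength * deficit
    probs = np.exp(logits - logits.max())         # softmax, numerically stable
    return probs / probs.sum()

# The policy is uniform over 3 actions, but action 0 dominates the buffer,
# so the behavior distribution shifts probability toward actions 1 and 2.
pi = np.array([1/3, 1/3, 1/3])
counts = np.array([8.0, 1.0, 1.0])
print(behavior_distribution(pi, counts))
```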
Nicholas E. Corrado, Yuxiao Qu, John U. Balis, Adam Labiosa, & Josiah P. Hanna
Reinforcement Learning Conference (RLC), 2024
[project page] [arXiv] [bibtex]
Abstract
In offline reinforcement learning (RL), an RL agent learns to solve a task using only a fixed dataset of previously collected data. While offline RL has been successful in learning real-world robot control policies, it typically requires large amounts of expert-quality data to learn effective policies that generalize to out-of-distribution states. Unfortunately, such data is often difficult and expensive to acquire in real-world tasks. Several recent works have leveraged data augmentation (DA) to inexpensively generate additional data, but most DA works apply augmentations in a random fashion and ultimately produce highly suboptimal augmented experience. In this work, we propose Guided Data Augmentation (GuDA), a human-guided DA framework that generates expert-quality augmented data. The key insight behind GuDA is that while it may be difficult to demonstrate the sequence of actions required to produce expert data, a user can often easily characterize when an augmented trajectory segment represents progress toward task completion. Thus, a user can restrict the space of possible augmentations to automatically reject suboptimal augmented data. To extract a policy from GuDA, we use off-the-shelf offline reinforcement learning and behavior cloning algorithms. We evaluate GuDA on a physical robot soccer task as well as simulated D4RL navigation tasks, a simulated autonomous driving task, and a simulated soccer task. Empirically, GuDA enables learning given a small initial dataset of potentially suboptimal experience and outperforms a random DA strategy as well as a model-based DA strategy.
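As a toy illustration of the acceptance-check idea (purely a sketch; the function names, the 1-D navigation example, and the progress predicate are hypothetical stand-ins, not the GuDA codebase): random augmentations are proposed, and a user-supplied check keeps only those that represent progress toward task completion.

```python
import random

def guided_augmentation(segment, augment_fns, makes_progress, n_proposals=10):
    """Sketch of guided data augmentation: propose random augmentations of a
    trajectory segment, but keep only those the user's check accepts as
    progress toward task completion."""
    accepted = []
    for _ in range(n_proposals):
        candidate = random.choice(augment_fns)(segment)
        if makes_progress(candidate):
            accepted.append(candidate)
    return accepted

# Toy 1-D navigation example: a segment is a list of (x, next_x) pairs, the
# goal is x = 0, and the only augmentation shifts the whole segment.
def shift(segment):
    d = random.uniform(-5.0, 5.0)                 # one offset for the whole segment
    return [(x + d, nx + d) for x, nx in segment]

# Progress check: every step must move closer to the goal. Shifts that place
# the segment past the goal reverse the direction of progress and are rejected.
toward_goal = lambda seg: all(abs(nx) < abs(x) for x, nx in seg)
print(guided_augmentation([(3.0, 2.5), (2.5, 2.0)], [shift], toward_goal))
```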
Nicholas E. Corrado, Josiah P. Hanna
International Conference on Learning Representations (ICLR), 2024
[arXiv] [video] [poster] [slides] [bibtex]
Abstract
Recently, data augmentation (DA) has emerged as a method for leveraging domain knowledge to inexpensively generate additional data in reinforcement learning (RL) tasks, often yielding substantial improvements in data efficiency. While prior work has demonstrated the utility of incorporating augmented data directly into model-free RL updates, it is not well understood when a particular DA strategy will improve data efficiency. In this paper, we seek to identify general aspects of DA responsible for observed learning improvements. Our study focuses on sparse-reward tasks with dynamics-invariant data augmentation functions, serving as an initial step towards a more general understanding of DA and its integration into RL training. Experimentally, we isolate three relevant aspects of DA: state-action coverage, reward density, and the number of augmented transitions generated per update (the augmented replay ratio). From our experiments, we draw two conclusions: (1) increasing state-action coverage often has a much greater impact on data efficiency than increasing reward density, and (2) decreasing the augmented replay ratio substantially improves data efficiency. In fact, certain tasks in our empirical study are solvable only when the replay ratio is sufficiently low.
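A minimal sketch of how the augmented replay ratio can be treated as a tunable knob (illustrative only; `aug_per_update`, the buffers, and the even real/augmented split are assumptions, not the paper's implementation):

```python
import random

def training_step(replay_buffer, aug_buffer, augment, update_fn,
                  batch_size=256, aug_per_update=16):
    """Sketch: generate a controlled number of augmented transitions per
    gradient update (the augmented replay ratio), then update on a mix of
    observed and augmented data."""
    # Generate `aug_per_update` augmented transitions for this update.
    for transition in random.sample(replay_buffer,
                                    k=min(aug_per_update, len(replay_buffer))):
        aug_buffer.append(augment(transition))
    # Update on an even split of observed and augmented transitions.
    half = batch_size // 2
    batch = (random.sample(replay_buffer, k=min(half, len(replay_buffer)))
             + random.sample(aug_buffer, k=min(half, len(aug_buffer))))
    update_fn(batch)
```

In this sketch, lowering `aug_per_update` corresponds to lowering the augmented replay ratio, the quantity the paper finds can substantially affect data efficiency.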
Nicholas E. Corrado, Michael Livesay, Tyson Bailey, & Drew Levin
IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (IEEE SmartGridComm), 2023
[paper] [bibtex]
Abstract
Interoperable, internet-connected distributed energy resource (DER) devices in power systems create a new attack vector for malicious cyber-actors. Modern control systems primarily focus on transmission and sub-transmission operations and rarely incorporate DER or other distribution-connected equipment. While recent advances have expanded grid operator visibility and control functionality to the distribution level, these control systems struggle to scale with thousands of networked devices. Thus, to defend against potential attacks on DER devices, we must develop new real-time control algorithms to ensure grid stability. In this work, we present a new approach to power distribution control based on deep reinforcement learning (RL) algorithms, which can learn optimal control policies through experience. We evaluate four RL algorithms in novel voltage stabilization tasks based on the IEEE 13-bus and EPRI Ckt5 power distribution models. We demonstrate that RL can successfully perform voltage regulation and outperform a greedy algorithm. Our results, coupled with the established capability of RL to scale to high-dimensional settings, open a path towards real-time cyber defense of power systems with RL agents.
Nicholas E. Corrado, Yuxiao Qu, & Josiah P. Hanna
Conference on Lifelong Learning Agents (CoLLAs), 2022
[paper] [bibtex]
Abstract
Deep reinforcement learning has shown incredible promise at training high-performing agents to solve high-dimensional continuous control tasks in a particular training environment. However, to be useful in real-world settings, long-lived agents must perform well across a range of environmental conditions. Naively applying deep RL to a task where environment conditions may vary from episode to episode can be data inefficient. To address this inefficiency, we introduce a method that discovers structure in an agent’s high-dimensional continuous action space to speed up learning across a range of environmental conditions. Whereas prior work on finding so-called latent action spaces requires expert demonstrations or on-task experience, we instead propose to discover the latent, lower-dimensional action space in a simulated source environment and then transfer the learned action space for training in the target environment. We evaluate our novel method on randomized variants of simulated MuJoCo environments and find that, when there is a lower-dimensional action space to exploit, our method significantly increases data efficiency. For instance, in the Ant environment, our method reduces the 8-dimensional action space to a 3-dimensional action space and doubles the average return achieved after a training budget of 2 million timesteps.
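A rough sketch of the transfer step (illustrative; the wrapper, the random linear decoder, and the dimensions are placeholders rather than the paper's learned model): the agent acts in a low-dimensional latent action space, and a decoder learned in the source environment maps latent actions back to the raw action space of the target environment.

```python
import numpy as np

class LatentActionWrapper:
    """Sketch of acting through a transferred latent action space: the policy
    outputs a k-dimensional latent action, and a decoder learned in the
    source environment maps it to the environment's d-dimensional raw action."""
    def __init__(self, env, decoder):
        self.env = env
        self.decoder = decoder                    # maps R^k -> R^d

    def reset(self):
        return self.env.reset()

    def step(self, latent_action):
        raw_action = self.decoder(latent_action)  # e.g., 3-D latent -> 8-D Ant torques
        return self.env.step(raw_action)

# Placeholder decoder for illustration; in practice this is learned in simulation
# before being transferred to the target environment.
W = np.random.randn(8, 3)
decoder = lambda z: np.tanh(W @ z)
```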