Nicholas E. Corrado

PhD student, University of Wisconsin-Madison

I am a PhD student at the University of Wisconsin-Madison advised by Josiah Hanna. My research is motivated by the following question: How can we develop fully autonomous agents that can learn to solve real-world tasks from experience? Towards answering this question, I study reinforcement learning (RL), a branch of machine learning that enables autonomous agents to learn through trial-and-error interaction. Since current RL algorithms often require an impractical amount of data to succeed in real-world scenarios, my work focuses on developing RL algorithms that can learn effectively from limited interaction.

Recently, my research has focused on developing new data collection methods for RL:

  1. Using adaptive, off-policy sampling to improve the data efficiency of on-policy learning.
  2. Understanding how to most effectively leverage data augmentation for off-policy learning.

Previously, I was a research intern with Amazon's Rufus team (working on multi-objective alignment for LLMs) and at Sandia National Laboratories (working on RL for power grid management). During the first year of my PhD, I worked in databases with Jignesh Patel. I received a BPhil in physics and a B.S. in mathematics from the University of Pittsburgh, where I studied high-energy physics with Vladimir Savinov.

Feel free to drop me an email if you're interested in chatting!

Publications and Preprints

On-Policy Policy Gradient Learning Without On-Policy Sampling
Nicholas E. Corrado, Josiah P. Hanna
Under Review, 2024
[arXiv] [bibtex]

Abstract

On-policy reinforcement learning (RL) algorithms perform policy updates using i.i.d. trajectories collected by the current policy. However, after observing only a finite number of trajectories, on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to noisy updates and data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error than on-policy sampling can produce. Motivated by this observation, we introduce an adaptive, off-policy sampling method to improve the data efficiency of on-policy policy gradient algorithms. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled with respect to the current policy. Rather than discarding data from old policies -- as is commonly done in on-policy algorithms -- PROPS uses data collection to adjust the distribution of previously collected data to be approximately on-policy. We empirically evaluate PROPS on both continuous-action MuJoCo benchmark tasks and discrete-action tasks and demonstrate that PROPS (1) decreases sampling error throughout training and (2) improves the data efficiency of on-policy policy gradient algorithms. Our work improves the RL community's understanding of a nuance in the on-policy vs off-policy dichotomy: on-policy learning requires on-policy data, not on-policy sampling.
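
A toy illustration of the sampling-error idea in the abstract above, assuming a single state with three discrete actions. This is not the PROPS algorithm (which learns a behavior policy); it is a minimal tabular sketch in which the sampler simply picks whichever action is most under-sampled relative to the current policy. The names `sampling_error`, `on_policy_sample`, and `adaptive_sample` are hypothetical.

```python
import random
from collections import Counter


def sampling_error(counts, policy):
    """Total variation distance between empirical action frequencies and the policy."""
    n = sum(counts.values())
    return 0.5 * sum(abs(counts[a] / n - p) for a, p in policy.items())


def on_policy_sample(policy, n, rng):
    """Draw n i.i.d. actions from the policy (standard on-policy sampling)."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return Counter(rng.choices(actions, weights=weights, k=n))


def adaptive_sample(policy, n):
    """Non-i.i.d. sampling: at each step, take the most under-sampled action."""
    counts = Counter({a: 0 for a in policy})
    for t in range(1, n + 1):
        # Deficit = expected count under the policy minus the observed count.
        action = max(policy, key=lambda a: policy[a] * t - counts[a])
        counts[action] += 1
    return counts


if __name__ == "__main__":
    policy = {"left": 0.5, "right": 0.3, "stay": 0.2}  # assumed toy policy
    rng = random.Random(0)
    iid = on_policy_sample(policy, 20, rng)
    adaptive = adaptive_sample(policy, 20)
    print("i.i.d. on-policy sampling error:", sampling_error(iid, policy))
    print("adaptive sampling error:        ", sampling_error(adaptive, policy))
```

In this toy setting the adaptive sampler keeps every action count close to its expected value under the policy, so the empirical distribution stays near the policy even for small sample sizes, whereas i.i.d. sampling can drift by chance.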

Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning
Nicholas E. Corrado, Yuxiao Qu, John U. Balis, Adam Labiosa, & Josiah P. Hanna
Reinforcement Learning Conference (RLC), 2024
[project page] [arXiv] [bibtex]

Abstract

In offline reinforcement learning (RL), an RL agent learns to solve a task using only a fixed dataset of previously collected data. While offline RL has been successful in learning real-world robot control policies, it typically requires large amounts of expert-quality data to learn effective policies that generalize to out-of-distribution states. Unfortunately, such data is often difficult and expensive to acquire in real-world tasks. Several recent works have leveraged data augmentation (DA) to inexpensively generate additional data, but most DA works apply augmentations in a random fashion and ultimately produce highly suboptimal augmented experience. In this work, we propose Guided Data Augmentation (GuDA), a human-guided DA framework that generates expert-quality augmented data. The key insight behind GuDA is that while it may be difficult to demonstrate the sequence of actions required to produce expert data, a user can often easily characterize when an augmented trajectory segment represents progress toward task completion. Thus, a user can restrict the space of possible augmentations to automatically reject suboptimal augmented data. To extract a policy from GuDA, we use off-the-shelf offline reinforcement learning and behavior cloning algorithms. We evaluate GuDA on a physical robot soccer task as well as simulated D4RL navigation tasks, a simulated autonomous driving task, and a simulated soccer task. Empirically, GuDA enables learning given a small initial dataset of potentially suboptimal experience and outperforms a random DA strategy as well as a model-based DA strategy.

Understanding when Dynamics-Invariant Data Augmentations Benefit Model-free Reinforcement Learning Updates
Nicholas E. Corrado, Josiah P. Hanna
International Conference on Learning Representations (ICLR), 2024
[arXiv] [video] [poster] [slides] [bibtex]

Abstract

Recently, data augmentation (DA) has emerged as a method for leveraging domain knowledge to inexpensively generate additional data in reinforcement learning (RL) tasks, often yielding substantial improvements in data efficiency. While prior work has demonstrated the utility of incorporating augmented data directly into model-free RL updates, it is not well-understood when a particular DA strategy will improve data efficiency. In this paper, we seek to identify general aspects of DA responsible for observed learning improvements. Our study focuses on sparse-reward tasks with dynamics-invariant data augmentation functions, serving as an initial step towards a more general understanding of DA and its integration into RL training. Experimentally, we isolate three relevant aspects of DA: state-action coverage, reward density, and the number of augmented transitions generated per update (the augmented replay ratio). From our experiments, we draw two conclusions: (1) increasing state-action coverage often has a much greater impact on data efficiency than increasing reward density, and (2) decreasing the augmented replay ratio substantially improves data efficiency. In fact, certain tasks in our empirical study are solvable only when the replay ratio is sufficiently low.
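
A minimal sketch of what a dynamics-invariant augmentation can look like, assuming a 2D point-navigation task on an open plane where translating a transition leaves the dynamics unchanged. The task, the `Transition` type, and the `sparse_reward` and `translate` helpers are hypothetical and are not taken from the paper; the example only shows how one augmentation can change both state-action coverage and reward density.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    state: tuple
    action: tuple
    reward: float
    next_state: tuple


def sparse_reward(position, goal, tol=0.5):
    """Return 1.0 only when the agent is within tol of the goal, else 0.0."""
    dist = ((position[0] - goal[0]) ** 2 + (position[1] - goal[1]) ** 2) ** 0.5
    return 1.0 if dist <= tol else 0.0


def translate(t, shift, goal):
    """Shift a transition in space; valid only if dynamics are translation-invariant."""
    state = (t.state[0] + shift[0], t.state[1] + shift[1])
    next_state = (t.next_state[0] + shift[0], t.next_state[1] + shift[1])
    # The action is unchanged; the reward is recomputed for the new location.
    return Transition(state, t.action, sparse_reward(next_state, goal), next_state)


if __name__ == "__main__":
    goal = (5.0, 0.0)
    observed = Transition((0.0, 0.0), (1.0, 0.0), 0.0, (1.0, 0.0))
    augmented = translate(observed, shift=(4.0, 0.0), goal=goal)
    print(augmented)
```

Here a zero-reward transition observed far from the goal becomes a rewarded transition near the goal: the augmentation adds coverage of an unvisited region and also raises reward density, the two effects the study separates.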

Deep Reinforcement Learning for Distribution Power System Cyber-Resilience via Distributed Energy Resource Control
Nicholas E. Corrado, Michael Livesay, Tyson Bailey, & Drew Levin
IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (IEEE SmartGridComm), 2023
[paper] [bibtex]

Abstract

Interoperable, internet-connected distributed energy resource (DER) devices in power systems create a new attack vector for malicious cyber-actors. Modern control systems primarily focus on transmission and sub-transmission operations and rarely incorporate DER or other distribution-connected equipment. While recent advances have expanded grid operator visibility and control functionality to the distribution level, these control systems struggle to scale with thousands of networked devices. Thus, to defend against potential attacks on DER devices, we must develop new real-time control algorithms to ensure grid stability. In this work, we present a new approach to power distribution control based on deep reinforcement learning (RL) algorithms, which can learn optimal control policies through experience. We evaluate four RL algorithms in novel voltage stabilization tasks based on the IEEE 13-bus and EPRI Ckt5 power distribution models. We demonstrate that RL can successfully perform voltage regulation and outperform a greedy algorithm. Our results, coupled with the established capability of RL to scale in high-dimensional settings, open a path towards real-time cyber defense of power systems with RL agents.

Simulation-Acquired Latent Action Spaces for Dynamics Generalization
Nicholas E. Corrado, Yuxiao Qu, & Josiah P. Hanna
Conference on Lifelong Learning Agents (CoLLAs), 2022
[paper] [bibtex]

Abstract

Deep reinforcement learning has shown incredible promise at training high-performing agents to solve high-dimensional continuous control tasks in a particular training environment. However, to be useful in real-world settings, long-lived agents must perform well across a range of environmental conditions. Naively applying deep RL to a task where environment conditions may vary from episode to episode can be data inefficient. To address this inefficiency, we introduce a method that discovers structure in an agent’s high-dimensional continuous action space to speed up learning across a range of environmental conditions. Whereas prior work on finding so-called latent action spaces requires expert demonstrations or on-task experience, we instead propose to discover the latent, lower-dimensional action space in a simulated source environment and then transfer the learned action space for training in the target environment. We evaluate our novel method on randomized variants of simulated MuJoCo environments and find that, when there is a lower-dimensional action-space to exploit, our method significantly increases data efficiency. For instance, in the Ant environment, our method reduces the 8-dimensional action-space to a 3-dimensional action-space and doubles the average return achieved after a training budget of 2 million timesteps.

GitHub

Personal

Outside of research, I am a jazz guitarist — in fact, I almost pursued music professionally after high school. I am heavily influenced by jazz manouche (commonly called gypsy jazz) but also draw inspiration from bossa, Dixieland, and modal jazz styles. When I was 13, I had the great fortune of meeting and playing with Joe Negri (aka "Handyman Negri" from Mister Rogers' Neighborhood). He is a jazz legend in my hometown of Pittsburgh and arguably one of the best jazz guitarists in the US.