Nicholas E. Corrado

I am a Ph.D. student in Computer Sciences at the University of Wisconsin-Madison, advised by Josiah Hanna. My research focuses on improving the data efficiency of reinforcement learning (RL) algorithms by designing better data collection strategies. Since RL algorithms typically require an impractical amount of interaction with a task to perform well, my work asks: What data should we collect to learn as efficiently as possible? and How do we efficiently collect such data? Towards this end, I currently work on:

Adaptive sampling algorithms for on-policy policy gradient methods
Synthetic data generation for off-policy and offline RL

Previously, I was a research intern with Amazon's Rufus team (working on multi-objective alignment for LLMs) and Sandia National Laboratories (working on RL for power grid management). During the first year of my Ph.D., I worked in databases with Jignesh Patel. I received a BPhil in physics and a B.S. in mathematics from the University of Pittsburgh, where I studied high-energy physics with Vladimir Savinov.

Feel free to drop me an email if you're interested in chatting!

News

[July 2025] 🏆 "When Can Model-Free Reinforcement Learning be Enough for Thinking?" was awarded Most-Thought-Provoking Paper at Finding the Frame Workshop @ RLC 2025
[July 2025] "When Can Model-Free Reinforcement Learning be Enough for Thinking?" and "On-Policy Policy Gradient Learning Without On-Policy Sampling" accepted at Finding the Frame Workshop @ RLC 2025!
[May 2025] "AutoMixAlign: Adaptive Data Mixing for Multi-Task Preference Optimization in LLMs" accepted at ACL 2025 Main Conference!
[Nov 2024] I received the Top Review Award at NeurIPS 2024!
[July 2024] I joined Amazon's Rufus Team as a research intern working with Julian Katz-Samuels!
[May 2024] 1 paper accepted at RLC 2024!
[January 2024] 1 paper accepted at ICLR 2024!
[October 2023] I gave a talk on adaptive off-policy sampling for on-policy learning at the University of Edinburgh RL reading group!

Publications and Preprints

When Can Model-Free Reinforcement Learning be Enough for Thinking?
Josiah P. Hanna, Nicholas E. Corrado
Under Review
🏆 Most-Thought-Provoking Paper at Finding the Frame Workshop at the Reinforcement Learning Conference (RLC), 2025
[arXiv]

Abstract

Recent work on large language models has demonstrated the use of model-free reinforcement learning (RL) to train reasoning-like capabilities. The emergence of "thinking" through model-free RL is interesting as thinking actions neither produce reward nor change the external world state to one where the agent is more likely to get reward. This paper seeks to build a domain-independent understanding of when model-free RL will lead to "thinking" as a strategy for reward maximization. To build this understanding, we first introduce a theoretical model which we call a thought Markov decision process (MDP). Thought MDPs minimally extend the classical MDP model to include an abstract notion of thought state and thought action. Using the thought MDP model, we prove the importance of policy initialization in determining whether or not thinking emerges and show formally that thought actions are equivalent to the agent choosing to perform a step of policy improvement before continuing to act. We then show that open-source LLMs satisfy the conditions that our theory predicts are necessary for model-free RL to produce thinking-like behavior. Finally, we hypothesize sufficient conditions that would enable thinking to be learned outside of language generation and introduce a toy domain where a combination of multi-task pre-training and designated thought actions enable more data-efficient RL compared to non-thinking agents.

AutoMixAlign: Adaptive Data Mixing for Multi-Task Preference Optimization in LLMs
Nicholas E. Corrado, Julian Katz-Samuels, Adithya Devraj, Hyokun Yun, Chao Zhang, Yi Xu, Yi Pan, Bing Yin, Trishul Chilimbi
ACL (Main Conference), 2025
[arXiv] [bibtex]

Abstract

When aligning large language models (LLMs), their performance on various tasks (such as being helpful, harmless, and honest) depends heavily on the composition of their training data. However, selecting a data mixture that achieves strong performance across all tasks is challenging. Existing approaches rely on large ablation studies, heuristics, or human intuition, but these can be prohibitively expensive and suboptimal. We study this problem in the setting of preference optimization via DPO and introduce AutoMixAlign (AMA), a theoretically-grounded algorithm that adaptively mixes datasets during training to balance performance across tasks. AMA first trains specialist models for each task to determine losses that correspond to strong task performance. Then, it trains a generalist model using a novel minimax optimization that prioritizes tasks for which generalist model losses deviate most from specialist model losses. To optimize this problem, we propose two algorithms: (1) AMA-R, which adaptively reweights the objective to prioritize tasks, and (2) AMA-S, which adaptively adjusts how much data is sampled from each task to prioritize tasks. Both algorithms achieve a convergence rate of $O(1/\sqrt{T})$ in the convex case. AMA-R's convergence result follows from Sagawa et al. (2019), and we provide a convergence proof for AMA-S using online learning techniques such as EXP3. We evaluate AMA on several multitask alignment setups and find that AMA outperforms the standard alignment approach -- which simply optimizes the total loss across all tasks -- and also outperforms model merging methods.

On-Policy Policy Gradient Learning Without On-Policy Sampling
Nicholas E. Corrado, Josiah P. Hanna
Under Review
Also appears in Finding the Frame Workshop at the Reinforcement Learning Conference (RLC), 2025
[arXiv] [bibtex]

Abstract

On-policy reinforcement learning (RL) algorithms perform policy updates using i.i.d. trajectories collected by the current policy. However, after observing only a finite number of trajectories, on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to noisy updates and data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error than on-policy sampling can produce. Motivated by this observation, we introduce an adaptive, off-policy sampling method to improve the data efficiency of on-policy policy gradient algorithms. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled with respect to the current policy. Rather than discarding data from old policies -- as is commonly done in on-policy algorithms -- PROPS uses data collection to adjust the distribution of previously collected data to be approximately on-policy. We empirically evaluate PROPS on both continuous-action MuJoCo benchmark tasks and discrete-action tasks and demonstrate that PROPS (1) decreases sampling error throughout training and (2) improves the data efficiency of on-policy policy gradient algorithms. Our work improves the RL community's understanding of a nuance in the on-policy vs off-policy dichotomy: on-policy learning requires on-policy data, not on-policy sampling.
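The sampling-error idea in this abstract can be illustrated with a small toy sketch. This is not the PROPS algorithm: the abstract describes a behavior policy that raises the probability of under-sampled actions, whereas the sketch below simply picks the most under-sampled action greedily in a single state. The function names, the three-action policy, and the L1 error measure are all assumptions made for illustration.

```python
import random
from collections import Counter

def sampling_error(counts, policy):
    """L1 distance between empirical action frequencies and the policy (illustrative metric)."""
    n = sum(counts.values())
    return sum(abs(counts[a] / n - p) for a, p in policy.items())

def on_policy_sample(policy, n, rng):
    """Draw n i.i.d. actions from the policy itself."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return Counter(rng.choices(actions, weights=weights, k=n))

def adaptive_sample(policy, n):
    """Toy non-i.i.d. sampler: at each step, take the action whose count is
    furthest below its expected count under the policy."""
    counts = Counter({a: 0 for a in policy})
    for t in range(1, n + 1):
        a = max(policy, key=lambda act: policy[act] * t - counts[act])
        counts[a] += 1
    return counts

# Hypothetical single-state policy over three actions.
policy = {"left": 0.5, "right": 0.3, "stay": 0.2}

iid_counts = on_policy_sample(policy, 10, random.Random(0))
adaptive_counts = adaptive_sample(policy, 10)

print("i.i.d. sampling error:  ", sampling_error(iid_counts, policy))
print("adaptive sampling error:", sampling_error(adaptive_counts, policy))
# After 10 draws the adaptive sampler has left=5, right=3, stay=2,
# which matches the policy exactly.
```

With only 10 draws, i.i.d. sampling will often miss the target frequencies, while the non-i.i.d. sampler matches them by construction. The data ends up on-policy in distribution even though it was not sampled on-policy.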

Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning
Nicholas E. Corrado, Yuxiao Qu, John U. Balis, Adam Labiosa, & Josiah P. Hanna
Reinforcement Learning Conference (RLC), 2024
[project page] [arXiv] [bibtex]

Abstract

In offline reinforcement learning (RL), an RL agent learns to solve a task using only a fixed dataset of previously collected data. While offline RL has been successful in learning real-world robot control policies, it typically requires large amounts of expert-quality data to learn effective policies that generalize to out-of-distribution states. Unfortunately, such data is often difficult and expensive to acquire in real-world tasks. Several recent works have leveraged data augmentation (DA) to inexpensively generate additional data, but most DA works apply augmentations in a random fashion and ultimately produce highly suboptimal augmented experience. In this work, we propose Guided Data Augmentation (GuDA), a human-guided DA framework that generates expert-quality augmented data. The key insight behind GuDA is that while it may be difficult to demonstrate the sequence of actions required to produce expert data, a user can often easily characterize when an augmented trajectory segment represents progress toward task completion. Thus, a user can restrict the space of possible augmentations to automatically reject suboptimal augmented data. To extract a policy from GuDA, we use off-the-shelf offline reinforcement learning and behavior cloning algorithms. We evaluate GuDA on a physical robot soccer task as well as simulated D4RL navigation tasks, a simulated autonomous driving task, and a simulated soccer task. Empirically, GuDA enables learning given a small initial dataset of potentially suboptimal experience and outperforms a random DA strategy as well as a model-based DA strategy.

Understanding when Dynamics-Invariant Data Augmentations Benefit Model-free Reinforcement Learning Updates
Nicholas E. Corrado, Josiah P. Hanna
International Conference on Learning Representations (ICLR), 2024
[arXiv] [video] [poster] [slides] [bibtex]

Abstract

Recently, data augmentation (DA) has emerged as a method for leveraging domain knowledge to inexpensively generate additional data in reinforcement learning (RL) tasks, often yielding substantial improvements in data efficiency. While prior work has demonstrated the utility of incorporating augmented data directly into model-free RL updates, it is not well-understood when a particular DA strategy will improve data efficiency. In this paper, we seek to identify general aspects of DA responsible for observed learning improvements. Our study focuses on sparse-reward tasks with dynamics-invariant data augmentation functions, serving as an initial step towards a more general understanding of DA and its integration into RL training. Experimentally, we isolate three relevant aspects of DA: state-action coverage, reward density, and the number of augmented transitions generated per update (the augmented replay ratio). From our experiments, we draw two conclusions: (1) increasing state-action coverage often has a much greater impact on data efficiency than increasing reward density, and (2) decreasing the augmented replay ratio substantially improves data efficiency. In fact, certain tasks in our empirical study are solvable only when the replay ratio is sufficiently low.

Deep Reinforcement Learning for Distribution Power System Cyber-Resilience via Distributed Energy Resource Control
Nicholas E. Corrado, Michael Livesay, Tyson Bailey, & Drew Levin
IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (IEEE SmartGridComm), 2023
[paper] [bibtex]

Abstract

Interoperable, internet-connected distributed energy resource (DER) devices in power systems create a new attack vector for malicious cyber-actors. Modern control systems primarily focus on transmission and sub-transmission operations and rarely incorporate DER or other distribution-connected equipment. While recent advances have expanded grid operator visibility and control functionality to the distribution level, these control systems struggle to scale with thousands of networked devices. Thus, to defend against potential attacks on DER devices, we must develop new real-time control algorithms to ensure grid stability. In this work, we present a new approach to power distribution control based on deep reinforcement learning (RL) algorithms, which can learn optimal control policies through experience. We evaluate four RL algorithms in novel voltage stabilization tasks based on the IEEE 13-bus and EPRI Ckt5 power distribution models. We demonstrate that RL can successfully perform voltage regulation and outperform a greedy algorithm. Our results, coupled with the established capability of RL to scale in high-dimensional settings, open a path towards real-time cyber defense of power systems with RL agents.

Simulation-Acquired Latent Action Spaces for Dynamics Generalization
Nicholas E. Corrado, Yuxiao Qu, & Josiah P. Hanna
Conference on Lifelong Learning Agents (CoLLAs), 2022
[paper] [bibtex]

Abstract

Deep reinforcement learning has shown incredible promise at training high-performing agents to solve high-dimensional continuous control tasks in a particular training environment. However, to be useful in real-world settings, long-lived agents must perform well across a range of environmental conditions. Naively applying deep RL to a task where environment conditions may vary from episode to episode can be data inefficient. To address this inefficiency, we introduce a method that discovers structure in an agent’s high-dimensional continuous action space to speed up learning across a range of environmental conditions. Whereas prior work on finding so-called latent action spaces requires expert demonstrations or on-task experience, we instead propose to discover the latent, lower-dimensional action space in a simulated source environment and then transfer the learned action space for training in the target environment. We evaluate our novel method on randomized variants of simulated MuJoCo environments and find that, when there is a lower-dimensional action-space to exploit, our method significantly increases data efficiency. For instance, in the Ant environment, our method reduces the 8-dimensional action-space to a 3-dimensional action-space and doubles the average return achieved after a training budget of 2 million timesteps.

Personal

Outside of research, I am a jazz guitarist — in fact, I almost pursued music professionally after high school. I am heavily influenced by jazz manouche (commonly called gypsy jazz) but also draw inspiration from bossa, Dixieland, and modal jazz styles. When I was 13, I had the great fortune of meeting and playing with Joe Negri (aka "Handyman Negri" from Mister Rogers' Neighborhood). He is a jazz legend in my hometown of Pittsburgh and arguably one of the best jazz guitarists in the US.