Paper: Dichotomy of Control: Separating What You Can Control from What You Cannot by Anonymous
TL;DR: The authors propose Dichotomy of Control (DoC), a supervised-learning approach for stochastic environments that separates what is within a policy's control (its actions) from what is outside it (environment stochasticity) via a mutual-information constraint.
Paper: https://openreview.net/pdf?id=DEGjDDV22pI
Supplementary Material: https://openreview.net/attachment?id=DEGjDDV22pI&name=supplementary_material
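For intuition, here is a minimal, hypothetical sketch of a DoC-style objective in PyTorch: the policy conditions on a learned latent z instead of the raw return, and a surrogate penalty discourages z from carrying information about stochastic outcomes (rewards and next states) beyond what the state and action already explain. The function names and the likelihood-gap surrogate for the mutual-information term are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def doc_style_loss(policy, latent_encoder, outcome_nll, batch, lam=1.0):
    """Illustrative DoC-style objective (not the authors' code).

    policy(s, z)                 -> action logits conditioned on latent z
    latent_encoder(traj)         -> latent z summarizing the future trajectory
    outcome_nll(s, a, z, r, s2)  -> negative log-likelihood of (r, s') given z

    The first term clones the logged actions conditioned on z; the second
    penalizes z whenever it helps predict the environment's stochastic
    outcomes, a crude surrogate for the paper's mutual-information constraint.
    """
    s, a, r, s_next, traj = batch
    z = latent_encoder(traj)

    # Supervised action prediction, conditioned on the controllable latent
    # (assumes discrete actions for simplicity).
    bc_loss = F.cross_entropy(policy(s, z), a)

    # Likelihood gap: how much better outcomes are predicted with z than with
    # a z-free baseline. A positive gap means z leaks environment randomness.
    gap = (outcome_nll(s, a, torch.zeros_like(z), r, s_next)
           - outcome_nll(s, a, z, r, s_next))
    return bc_loss + lam * gap.mean()
```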
Podcast: John Schulman
John Schulman is a cofounder of OpenAI, and currently a researcher and engineer at OpenAI.
In this podcast he talks about tuning GPT-3 to follow instructions (InstructGPT) and answer long-form questions using the internet (WebGPT), AI alignment, AGI timelines, and more!
Link: https://www.talkrl.com/episodes/john-schulman
Podcast: Sven Mika
Sven Mika is the Reinforcement Learning Team Lead at Anyscale, and lead committer of RLlib. He holds a PhD in biomathematics, bioinformatics, and computational biology from Witten/Herdecke University.
He talks about RLlib present and future, Ray and Ray Summit 2022, applied RL in Games / Finance / RecSys, and more!
Link: https://www.talkrl.com/episodes/sven-mika
Today we’re announcing the Farama Foundation – a new nonprofit organization designed in part to house major existing open source reinforcement learning (“RL”) libraries in a neutral nonprofit body.
https://farama.org/Announcing-The-Farama-Foundation
Blog Post
Fully Autonomous Real-World Reinforcement Learning with Applications to Mobile Manipulation
Link: https://bair.berkeley.edu/blog/2023/01/20/relmm/
Podcast: Rohin Shah
Dr. Rohin Shah is a Research Scientist at DeepMind, and the editor and main contributor of the Alignment Newsletter.
He talks about Value Alignment, Learning from Human feedback, Assistance paradigm, the BASALT MineRL competition, his Alignment Newsletter, and more!
Link: https://www.talkrl.com/episodes/rohin-shah
Paper: Deploying Offline Reinforcement Learning with Human Feedback
TL;DR: Reinforcement learning (RL) has shown promise for decision-making tasks in real-world applications. One practical framework involves training parameterized policy models from an offline dataset and subsequently deploying them in an online environment. However, this approach can be risky since the offline training may not be perfect, leading to poor performance of the RL models that may take dangerous actions. To address this issue, we propose an alternative framework that involves a human supervising the RL models and providing additional feedback in the online deployment phase. We formalize this online deployment problem and develop two approaches. The first approach uses model selection and the upper confidence bound algorithm to adaptively select a model to deploy from a candidate set of trained offline RL models. The second approach involves fine-tuning the model in the online deployment phase when a supervision signal arrives. We demonstrate the effectiveness of these approaches for robot locomotion control and traffic light control tasks through empirical validation.
Paper: https://arxiv.org/abs/2303.07046
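The first approach reads like a bandit problem over the candidate policies, so a plain UCB1 sketch captures the core idea. Everything here (the evaluate_policy callable, the assumption that the human feedback is a scalar in [0, 1]) is an illustrative assumption, not the paper's code.

```python
import math

def ucb_deployment(policies, evaluate_policy, num_rounds, c=2.0):
    """Pick which offline-trained policy to deploy each round with UCB1.

    policies        : list of candidate policies trained offline
    evaluate_policy : callable returning a scalar feedback signal in [0, 1]
                      (e.g. a human-supervised score) for one deployment round
    """
    counts = [0] * len(policies)
    totals = [0.0] * len(policies)

    for t in range(1, num_rounds + 1):
        if t <= len(policies):
            # Deploy each candidate once before trusting confidence bounds.
            i = t - 1
        else:
            ucb = [
                totals[k] / counts[k] + c * math.sqrt(math.log(t) / counts[k])
                for k in range(len(policies))
            ]
            i = max(range(len(policies)), key=lambda k: ucb[k])

        reward = evaluate_policy(policies[i])
        counts[i] += 1
        totals[i] += reward

    # Return the index of the empirically best candidate.
    return max(range(len(policies)), key=lambda k: totals[k] / max(counts[k], 1))
```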
Paper: Misspecification in Inverse Reinforcement Learning
TL;DR: The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function R from a policy π. To do this, we need a model of how π relates to R. In the current literature, the most common models are optimality, Boltzmann rationality, and causal entropy maximisation. One of the primary motivations behind IRL is to infer human preferences from human behaviour. However, the true relationship between human preferences and human behaviour is much more complex than any of the models currently used in IRL. This means that they are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data. In this paper, we provide a mathematical analysis of how robust different IRL models are to misspecification, and answer precisely how the demonstrator policy may differ from each of the standard models before that model leads to faulty inferences about the reward function R. We also introduce a framework for reasoning about misspecification in IRL, together with formal tools that can be used to easily derive the misspecification robustness of new IRL models.
Paper: https://arxiv.org/abs/2212.03201
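As a reference point for the kind of behavioral model the paper analyzes, here is the Boltzmann-rationality likelihood in a tabular setting: the demonstrator is assumed to pick actions with probability proportional to exp(β·Q_R(s, a)), and IRL fits R by maximizing this likelihood. The code is an illustrative sketch (not from the paper), useful mainly for seeing exactly where a misspecified demonstrator model enters the inference.

```python
import numpy as np

def boltzmann_policy(q_values, beta=1.0):
    """Boltzmann-rational demonstrator model: pi(a|s) proportional to exp(beta * Q_R(s, a)).

    q_values : array of shape (num_states, num_actions) under reward R
    beta     : rationality coefficient (beta -> infinity recovers optimality)
    """
    logits = beta * q_values
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def demo_log_likelihood(q_values, demos, beta=1.0):
    """Log-likelihood of observed (state, action) pairs under the model.

    An IRL algorithm searches over rewards R (and hence Q_R) to maximize this;
    the paper asks how wrong the inferred R gets when the demonstrator's true
    behavior deviates from the assumed model.
    """
    pi = boltzmann_policy(q_values, beta)
    return sum(np.log(pi[s, a]) for s, a in demos)
```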
Podcast: Jakob Foerster
Jakob Foerster is an Associate Professor at the University of Oxford.
He talks about Multi-Agent learning, Cooperation vs Competition, Emergent Communication, Zero-shot coordination, Opponent Shaping, agents for Hanabi and Prisoner's Dilemma, and more!
Link: https://www.talkrl.com/episodes/jakob-foerster
Podcast: Martin Riedmiller
Martin Riedmiller is a research scientist and team lead at DeepMind.
He talks about controlling nuclear fusion plasma in a tokamak with RL, the original Deep Q-Network, Neural Fitted Q-Iteration, Collect and Infer, AGI for control systems, and tons more!
Link: https://www.talkrl.com/episodes/martin-riedmiller
Blog: To keep doing RL research, stop calling yourself an RL researcher
by Pierluca D'Oro
Link: https://www.scienceofaiagents.com/p/to-keep-doing-rl-research-stop-calling
On the role of RL researchers in the era of LLM agents.
Paper: Offline Actor-Critic Reinforcement Learning Scales to Large Models
TL;DR: We show that offline actor-critic reinforcement learning can scale to large models - such as transformers - and follows similar scaling laws as supervised learning. We find that offline actor-critic algorithms can outperform strong, supervised, behavioral cloning baselines for multi-task training on a large dataset containing both sub-optimal and expert behavior on 132 continuous control tasks. We introduce a Perceiver-based actor-critic model and elucidate the key model features needed to make offline RL work with self- and cross-attention modules. Overall, we find that: i) simple offline actor critic algorithms are a natural choice for gradually moving away from the currently predominant paradigm of behavioral cloning, and ii) via offline RL it is possible to learn multi-task policies that master many domains simultaneously, including real robotics tasks, from sub-optimal demonstrations or self-generated data.
Link: https://arxiv.org/abs/2402.05546
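For readers less familiar with the family of algorithms being scaled here, below is a generic offline actor-critic update: a TD-trained critic plus an actor that maximizes critic values while staying close to the dataset actions via a behavioral-cloning term. This is a plain stand-in, not the paper's Perceiver-based model; all names and the regularizer weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def offline_actor_critic_step(actor, critic, target_critic, batch,
                              actor_opt, critic_opt, gamma=0.99, bc_weight=0.1):
    """One generic offline actor-critic update on a batch of logged data.

    Assumes `actor(s)` returns a reparameterizable continuous action
    distribution and `critic(s, a)` returns a scalar value per sample.
    """
    s, a, r, s_next, done = batch

    # Critic: one-step TD target computed from the target network.
    with torch.no_grad():
        next_a = actor(s_next).sample()
        td_target = r + gamma * (1.0 - done) * target_critic(s_next, next_a)
    critic_loss = F.mse_loss(critic(s, a), td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximize critic value, regularized toward the dataset actions.
    dist = actor(s)
    actor_loss = (-critic(s, dist.rsample()).mean()
                  - bc_weight * dist.log_prob(a).mean())
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    return critic_loss.item(), actor_loss.item()
```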
Paper: Mixtures of Experts Unlock Parameter Scaling for Deep RL
TL;DR: The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance scales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
Link: https://arxiv.org/abs/2402.08609
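To make the architectural change concrete, here is a minimal Soft MoE layer in the spirit of Puigcerver et al. (2023): tokens are softly dispatched into per-expert slots, each expert (here a small MLP) processes its slots, and slot outputs are softly combined back into tokens. This is an illustrative sketch, not the paper's implementation; the paper studies dropping modules like this into value-based networks.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Minimal Soft MoE layer (after Puigcerver et al., 2023)."""

    def __init__(self, dim, num_experts=4, slots_per_expert=1, hidden=256):
        super().__init__()
        self.num_slots = num_experts * slots_per_expert
        self.phi = nn.Parameter(torch.randn(dim, self.num_slots) * dim ** -0.5)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (batch, tokens, dim)
        logits = x @ self.phi                  # (batch, tokens, slots)
        dispatch = logits.softmax(dim=1)       # normalize over tokens per slot
        combine = logits.softmax(dim=2)        # normalize over slots per token

        slot_inputs = dispatch.transpose(1, 2) @ x           # (batch, slots, dim)
        slot_chunks = slot_inputs.chunk(len(self.experts), dim=1)
        slot_outputs = torch.cat(
            [expert(chunk) for expert, chunk in zip(self.experts, slot_chunks)],
            dim=1,
        )
        return combine @ slot_outputs          # (batch, tokens, dim)
```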
Podcast: Sharath Chandra Raparthy
Sharath Chandra Raparthy is an AI Resident at FAIR at Meta, and did his Master's at Mila.
He talks about In-Context Learning for Sequential Decision Tasks, GFlowNets, and more!
Link: https://www.talkrl.com/episodes/sharath-chandra-raparthy
Paper: In value-based deep reinforcement learning, a pruned network is a good network
TL;DR: Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage prior insights into the advantages of sparse training techniques and demonstrate that gradual magnitude pruning enables agents to maximize parameter effectiveness. This results in networks that yield dramatic performance improvements over traditional networks and exhibit a type of "scaling law", using only a small fraction of the full network parameters.
Link: https://arxiv.org/abs/2402.12479
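A small sketch of what gradual magnitude pruning typically looks like: a polynomial sparsity schedule (in the style of Zhu & Gupta, 2017) plus a magnitude-based mask applied to a layer's weights. Function names and the mask handling are illustrative assumptions, not the paper's code.

```python
import torch

def target_sparsity(step, start_step, end_step, final_sparsity):
    """Polynomial schedule for gradual magnitude pruning (Zhu & Gupta, 2017).

    Sparsity ramps from 0 to final_sparsity between start_step and end_step;
    the cubic schedule prunes aggressively early and gently late.
    """
    if step < start_step:
        return 0.0
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity * (1.0 - (1.0 - progress) ** 3)

@torch.no_grad()
def apply_magnitude_pruning(module, sparsity):
    """Zero out the smallest-magnitude weights of a linear/conv layer.

    A real agent would keep a persistent mask and re-apply it after every
    optimizer step; this sketch just recomputes the mask from scratch.
    """
    weight = module.weight
    k = int(sparsity * weight.numel())
    if k == 0:
        return
    threshold = weight.abs().flatten().kthvalue(k).values
    weight.mul_((weight.abs() > threshold).float())
```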
In our newly published paper, we formulate a mean field game (MFG) to minimize Age of Information (AoI) by optimizing cruise control, and develop a novel solution based on Proximal Policy Optimization (PPO) that jointly optimizes continuous and discrete actions. Specifically, UAV swarms are employed to collect time-critical sensory data. Time-critical data collection is influenced by the velocity of the UAVs and their coordinated interactions within the swarm, which can be modeled as an MFG; this motivates age-optimal, MFG-based cruise control for UAVs. However, determining the equilibrium online is difficult in practical scenarios, so we propose a new mean field hybrid proximal policy optimization (MF-HPPO) scheme that minimizes the average AoI by optimizing the UAVs' trajectories and the data-collection scheduling of the ground sensors over mixed continuous and discrete actions. MF-HPPO substantially reduces complexity while minimizing the average AoI.
Please check out our paper for more information:
https://ieeexplore.ieee.org/abstract/document/10508811
Paper: https://arxiv.org/abs/2405.00056
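For readers unfamiliar with hybrid action spaces in PPO: the agent must output both continuous cruise-control commands and discrete scheduling choices, yet PPO needs a single log-probability per step for its ratio. Below is a minimal sketch of such a hybrid policy head; layer sizes, names, and the factorized Gaussian-plus-categorical parameterization are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class HybridActionPolicy(nn.Module):
    """Policy head producing both continuous and discrete actions.

    PPO can use the sum of the two log-probabilities as the joint
    log-probability in its clipped objective.
    """

    def __init__(self, obs_dim, cont_dim, num_discrete, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, cont_dim)          # e.g. UAV velocity command
        self.log_std = nn.Parameter(torch.zeros(cont_dim))
        self.logits = nn.Linear(hidden, num_discrete)  # e.g. which sensor to poll

    def forward(self, obs):
        h = self.trunk(obs)
        cont = Normal(self.mu(h), self.log_std.exp())
        disc = Categorical(logits=self.logits(h))
        return cont, disc

    def log_prob(self, obs, cont_action, disc_action):
        cont, disc = self.forward(obs)
        return cont.log_prob(cont_action).sum(-1) + disc.log_prob(disc_action)
```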
RLHF Book by Nathan Lambert
https://rlhfbook.com/
A short introduction to RLHF and post-training focused on language models