Reinforcement Learning Global
This channel is for Reinforcement Learning (RL) researchers, professionals, students, and anybody who wants to know more about RL. We share the latest research papers, algorithms, advancements, applications, and related courses.
Paper: Dichotomy of Control: Separating What You Can Control from What You Cannot by Anonymous

TL;DR: The authors propose dichotomy of control (DoC) for supervised learning in stochastic environments by separating things within a policy's control (actions) from those outside of a policy's control (environment stochasticity) through a mutual information constraint.

Paper: https://openreview.net/pdf?id=DEGjDDV22pI

Supplementary Material: https://openreview.net/attachment?id=DEGjDDV22pI&name=supplementary_material
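
As a rough sketch of the idea (my own paraphrase of the abstract, not notation taken from the paper): the policy is conditioned on a latent representation z of the future and trained by supervised learning, while a mutual-information constraint keeps z from encoding outcomes that are due only to environment stochasticity:

$$
\min_{\pi,\, q}\; \mathbb{E}_{\tau \sim \mathcal{D},\; z \sim q(\cdot \mid \tau)}\Big[-\sum_t \log \pi(a_t \mid \tau_{0:t}, z)\Big]
\quad \text{s.t.} \quad I\big(z;\, s_{t+1}, r_t \mid \tau_{0:t}, a_t\big) = 0 \;\; \forall t,
$$

with the constraint enforced in practice through a Lagrangian relaxation and a variational bound on the mutual information.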
Podcast: John Schulman

John Schulman is a cofounder of OpenAI, where he currently works as a researcher and engineer.

In this podcast he talks about tuning GPT-3 to follow instructions (InstructGPT) and answer long-form questions using the internet (WebGPT), AI alignment, AGI timelines, and more!

Link: https://www.talkrl.com/episodes/john-schulman
Podcast: Sven Mika

Sven Mika is the Reinforcement Learning Team Lead at Anyscale, and lead committer of RLlib. He holds a PhD in biomathematics, bioinformatics, and computational biology from Witten/Herdecke University. 

He talks about RLlib present and future, Ray and Ray Summit 2022, applied RL in Games / Finance / RecSys, and more!

Link: https://www.talkrl.com/episodes/sven-mika
Podcast: Rohin Shah

Dr. Rohin Shah is a Research Scientist at DeepMind, and the editor and main contributor of the Alignment Newsletter.

He talks about Value Alignment, Learning from Human Feedback, the Assistance paradigm, the BASALT MineRL competition, his Alignment Newsletter, and more!

Link: https://www.talkrl.com/episodes/rohin-shah
Paper: Deploying Offline Reinforcement Learning with Human Feedback

TL;DR: Reinforcement learning (RL) has shown promise for decision-making tasks in real-world applications. One practical framework involves training parameterized policy models from an offline dataset and subsequently deploying them in an online environment. However, this approach can be risky since the offline training may not be perfect, leading to poor performance of the RL models that may take dangerous actions. To address this issue, we propose an alternative framework that involves a human supervising the RL models and providing additional feedback in the online deployment phase. We formalize this online deployment problem and develop two approaches. The first approach uses model selection and the upper confidence bound algorithm to adaptively select a model to deploy from a candidate set of trained offline RL models. The second approach involves fine-tuning the model in the online deployment phase when a supervision signal arrives. We demonstrate the effectiveness of these approaches for robot locomotion control and traffic light control tasks through empirical validation.

Paper: https://arxiv.org/abs/2303.07046
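
For intuition about the first approach, here is a minimal, hypothetical sketch of UCB1-style selection over candidate offline policies, treating the human-supervised episodic return as the bandit reward (names, the feedback interface, and all constants are assumptions, not the authors' implementation):

```python
import math
import random


def ucb_deployment(candidate_policies, run_episode, num_rounds, c=2.0):
    """Pick a policy to deploy each round via UCB1 on observed feedback."""
    k = len(candidate_policies)
    counts = [0] * k           # deployments per candidate
    means = [0.0] * k          # running mean feedback per candidate

    for t in range(1, num_rounds + 1):
        if t <= k:
            i = t - 1          # deploy each candidate once first
        else:
            i = max(
                range(k),
                key=lambda j: means[j] + c * math.sqrt(math.log(t) / counts[j]),
            )
        feedback = run_episode(candidate_policies[i])  # human-supervised return
        counts[i] += 1
        means[i] += (feedback - means[i]) / counts[i]
    return max(range(k), key=lambda j: means[j])


# Toy usage with stand-in "policies" and a noisy scalar feedback signal.
true_quality = [0.2, 0.5, 0.8]
best = ucb_deployment(
    candidate_policies=list(range(3)),
    run_episode=lambda i: true_quality[i] + random.gauss(0, 0.1),
    num_rounds=200,
)
print("selected candidate:", best)
```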
Paper: Misspecification in Inverse Reinforcement Learning

TL;DR: The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function R from a policy π. To do this, we need a model of how π relates to R. In the current literature, the most common models are optimality, Boltzmann rationality, and causal entropy maximisation. One of the primary motivations behind IRL is to infer human preferences from human behaviour. However, the true relationship between human preferences and human behaviour is much more complex than any of the models currently used in IRL. This means that they are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data. In this paper, we provide a mathematical analysis of how robust different IRL models are to misspecification, and answer precisely how the demonstrator policy may differ from each of the standard models before that model leads to faulty inferences about the reward function R. We also introduce a framework for reasoning about misspecification in IRL, together with formal tools that can be used to easily derive the misspecification robustness of new IRL models.

Paper: https://arxiv.org/abs/2212.03201
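
For reference, the Boltzmann-rationality model mentioned above assumes the demonstrator picks actions with probability proportional to exp(β·Q_R(s, a)); a tiny illustration (the Q-values and inverse temperature are toy numbers, not from the paper):

```python
import numpy as np


def boltzmann_policy(q_values, beta=1.0):
    """pi(a | s) proportional to exp(beta * Q_R(s, a)), computed stably."""
    logits = beta * np.asarray(q_values, dtype=float)
    logits -= logits.max()
    probs = np.exp(logits)
    return probs / probs.sum()


print(boltzmann_policy([1.0, 0.5, -0.2], beta=2.0))  # higher Q -> higher probability
```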
Podcast: Jakob Foerster

Jakob Foerster is an Associate Professor at the University of Oxford.

He talks about Multi-Agent learning, Cooperation vs Competition, Emergent Communication, Zero-shot coordination, Opponent Shaping, agents for Hanabi and Prisoner's Dilemma, and more!

Link: https://www.talkrl.com/episodes/jakob-foerster
Podcast: Martin Riedmiller

Martin Riedmiller is a research scientist and team lead at DeepMind.

He talks about controlling nuclear fusion plasma in a tokamak with RL, the original Deep Q-Network, Neural Fitted Q-Iteration, Collect and Infer, AGI for control systems, and tons more! 

Link: https://www.talkrl.com/episodes/martin-riedmiller
Paper: Offline Actor-Critic Reinforcement Learning Scales to Large Models

TL;DR: We show that offline actor-critic reinforcement learning can scale to large models - such as transformers - and follows similar scaling laws as supervised learning. We find that offline actor-critic algorithms can outperform strong, supervised, behavioral cloning baselines for multi-task training on a large dataset containing both sub-optimal and expert behavior on 132 continuous control tasks. We introduce a Perceiver-based actor-critic model and elucidate the key model features needed to make offline RL work with self- and cross-attention modules. Overall, we find that: i) simple offline actor-critic algorithms are a natural choice for gradually moving away from the currently predominant paradigm of behavioral cloning, and ii) via offline RL it is possible to learn multi-task policies that master many domains simultaneously, including real robotics tasks, from sub-optimal demonstrations or self-generated data.

Link: https://arxiv.org/abs/2402.05546
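
To make the "offline actor-critic vs. behavioral cloning" distinction concrete, here is a generic sketch of the loss structure of this algorithm family (not the Perceiver-based model from the paper; the deterministic actor, the BC regularizer, and all hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn.functional as F


def offline_actor_critic_losses(actor, critic, target_critic, batch,
                                gamma=0.99, bc_weight=1.0):
    """Compute critic and actor losses on a batch of offline transitions."""
    s, a, r, s_next, done = batch

    # Critic: one-step TD regression toward a target network.
    with torch.no_grad():
        a_next = actor(s_next)
        target_q = r + gamma * (1.0 - done) * target_critic(s_next, a_next)
    critic_loss = F.mse_loss(critic(s, a), target_q)

    # Actor: maximize Q while staying close to the dataset actions;
    # the behavioral-cloning term keeps the policy in-distribution.
    pi_a = actor(s)
    actor_loss = -critic(s, pi_a).mean() + bc_weight * F.mse_loss(pi_a, a)
    return critic_loss, actor_loss
```

Pure behavioral cloning corresponds to dropping the critic term entirely; the actor-critic variant additionally exploits the reward signal in the sub-optimal data.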
Paper: Mixtures of Experts Unlock Parameter Scaling for Deep RL

TL;DR: The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance scales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.

Link: https://arxiv.org/abs/2402.08609
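
A minimal NumPy sketch of a Soft MoE layer (Puigcerver et al., 2023) of the kind that can replace a dense penultimate layer in a value network; the shapes, number of experts, and initialization below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def soft_moe(tokens, slot_params, experts):
    """tokens: (n, d); slot_params: (d, m); experts: list of callables, m % len(experts) == 0."""
    logits = tokens @ slot_params                  # (n, m)
    dispatch = softmax(logits, axis=0)             # each slot is a soft mix of tokens
    combine = softmax(logits, axis=1)              # each token is a soft mix of slots

    slot_inputs = dispatch.T @ tokens              # (m, d)
    per_expert = slot_inputs.shape[0] // len(experts)
    slot_outputs = np.concatenate([
        expert(slot_inputs[i * per_expert:(i + 1) * per_expert])
        for i, expert in enumerate(experts)
    ])                                             # (m, d)
    return combine @ slot_outputs                  # (n, d), same shape as the input


# Toy usage: 16 tokens of width 32, 4 linear "experts", 2 slots each.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32))
phi = rng.normal(size=(32, 8))
weights = [0.1 * rng.normal(size=(32, 32)) for _ in range(4)]
out = soft_moe(x, phi, [lambda s, w=w: s @ w for w in weights])
print(out.shape)  # (16, 32)
```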
Paper: In deep reinforcement learning, a pruned network is a good network

TL;DR: Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage prior insights into the advantages of sparse training techniques and demonstrate that gradual magnitude pruning enables agents to maximize parameter effectiveness. This results in networks that yield dramatic performance improvements over traditional networks and exhibit a type of "scaling law", using only a small fraction of the full network parameters.

Link: https://arxiv.org/abs/2402.12479
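
For context, gradual magnitude pruning is typically implemented with a cubic sparsity schedule (Zhu & Gupta, 2017) plus a magnitude-based mask; the sketch below is a generic version of that recipe, with start/end steps and the final sparsity chosen for illustration rather than taken from the paper:

```python
import numpy as np


def sparsity_at_step(step, start_step, end_step, final_sparsity, initial_sparsity=0.0):
    """Cubic ramp from initial_sparsity to final_sparsity between start and end."""
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def magnitude_mask(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(sparsity * weights.size)
    if k == 0:
        return np.ones_like(weights)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)


# Example: a random weight matrix pruned partway through a hypothetical schedule.
w = np.random.default_rng(0).normal(size=(64, 64))
s = sparsity_at_step(step=750_000, start_step=200_000, end_step=1_000_000,
                     final_sparsity=0.95)
w_pruned = w * magnitude_mask(w, s)
print(f"target sparsity {s:.2f}, realized {np.mean(w_pruned == 0):.2f}")
```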
In our newly published paper, we formulate a mean field game (MFG) to minimize the Age of Information (AoI) by optimizing cruise control, and we develop a novel solution based on Proximal Policy Optimization (PPO) to jointly optimize continuous and discrete actions. Specifically, UAV swarms are employed to collect time-critical sensory data. Time-critical data collection is influenced by the velocity of the UAVs and their coordinated interactions within the swarm, which can be modeled as an MFG; this motivates age-optimal cruise control for UAVs based on the MFG formulation. However, determining the equilibrium online is difficult in practical scenarios, so we propose a new mean field hybrid proximal policy optimization (MF-HPPO) scheme that minimizes the average AoI by optimizing the UAVs' trajectories and the data-collection scheduling of the ground sensors under mixed continuous and discrete actions. MF-HPPO greatly reduces complexity while minimizing the average AoI.
Please check out our paper for more information:
https://ieeexplore.ieee.org/abstract/document/10508811
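
As a hedged sketch of what a hybrid continuous/discrete PPO policy needs (not the authors' MF-HPPO implementation; the network sizes and action semantics, e.g. a Gaussian cruise-velocity branch and a categorical scheduling branch, are assumptions):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal


class HybridPolicy(nn.Module):
    """One shared body with a Gaussian head (continuous) and a categorical head (discrete)."""

    def __init__(self, obs_dim, cont_dim, num_discrete):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 128), nn.Tanh())
        self.mu = nn.Linear(128, cont_dim)           # mean of the continuous action
        self.log_std = nn.Parameter(torch.zeros(cont_dim))
        self.logits = nn.Linear(128, num_discrete)   # discrete-action logits

    def forward(self, obs):
        h = self.body(obs)
        return Normal(self.mu(h), self.log_std.exp()), Categorical(logits=self.logits(h))

    def log_prob(self, obs, cont_action, disc_action):
        cont, disc = self(obs)
        # The joint log-probability factorizes across the two branches;
        # this is what enters the PPO probability ratio.
        return cont.log_prob(cont_action).sum(-1) + disc.log_prob(disc_action)


# Example: sample a joint (continuous, discrete) action for one observation.
policy = HybridPolicy(obs_dim=10, cont_dim=1, num_discrete=5)
cont_dist, disc_dist = policy(torch.zeros(1, 10))
action = (cont_dist.sample(), disc_dist.sample())
```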
Paper: https://arxiv.org/abs/2405.00056
A short introduction to RLHF and post-training focused on language models