Minimal overview of model-based RL methods(updated regularly)
Reinforcement Learning (RL) aims to derive an optimal policy for a Markov Decision Process (MDP). When we possess a perfect world model, achieving this optimal policy is straightforward. However, in most real-world scenarios, such a model is unknown and must be learned from data or manually designed.
In Model-Free RL, we focus on deriving a controller without explicitly learning a model of the environment. Utilizing a replay buffer, which essentially acts as a non-parametric world model containing state-action transitions, we estimate the value function directly. In contrast, Model-Based RL strives to learn the underlying MDP by estimating transition probabilities \(P(s' | s, a)\) and the reward function \(R(s, a)\). This acquired parametric world model allows us to derive optimal policies through planning methods.
The debate regarding the superior approach—model-free or model-based—is ongoing. Model-free methods often demand extensive interaction with the environment, making them costly for real-world training. Consequently, these methods are primarily applied in simulated environments, with a few exceptions.
On the other hand, model-based reinforcement learning holds promise for superior sample efficiency (the amount of experience that an agent needs to generate in an environment during training in order to reach a certain level of performance). Yet, accurately learning a generalized model of the environment poses a significant challenge. The compounding error problem and the potential exploitation of learned model deficiencies are some of the challenges.
Before getting into the details, let’s clarify what we mean by “world models” to avoid any potential confusion .
An internal world model serves as a mechanism to represent both the system and its environment. For instance, a robot equipped with an internal model possesses knowledge about the underlying physics of the world and can generate and test “what-if” hypotheses, such as the consequences of different actions:
Holland et al. writes:
an internal model allows a system to look ahead to the future consequences of current actions, without actually committing itself to those actions”.
Yann LeCun highlights the necessity of world models for minimizing costly and dangerous real-world interactions
Interactions in the real world are expensive and dangerous, intelligent agents should learn as much as they can about the world without interaction (by observation) so as to minimize the number of expensive and dangerous trials necessary to learn a particular task.
In simple mathematical terms, world models learns a function \(F: S_{t} \times A→ S_{t+1}\) or predicting the next state given the observed state(s) and current action.
This picture illustrates the general paradigm of model-based RL:
Iterate for several episodes:
We have three questions to answer to accomplish this:
We need to learn a model that approximates the unknown MDP. We focus on the fully observable case (state \(x_t\) is observed).
Data consists of trajectories \((x_0, a_0, r_0, x_1, a_1, r_1, …)\)
Common settings:
Key Insight:
\(x_{t+1}\) is conditionally independent of \(x_{1:t-1}\) given \(x_t\), \(a_t\)
In fully observed environments, due to conditional independence, learning \(F\) and \(R\) is essentially a regression (density estimation) problem (supervised learning)
This allows the model to focus on capturing the changes accurately.
Errors in the model estimate compound when planning over multiple time-steps.
This compounding error is exploited by planning algorithm (MPC, policy search)
World Models, PlaNet (Hafner, ICML 19), Dreamer (Hafner, ICLR 20), MuZero(Schrittwieser, Nature 20), IRIS( Micheli + ICLR 23)
Dreamer v1, v2, DayDreamer ( application of Dreamer in 5 different real robots)
To start, assume we have a known deterministic model for the reward and dynamics
\[x_{t+1} = f (x_t, a_t)\]Then, our objective becomes to plan sequences of actions such that over a fixed horizon, our chosen actions will yield the maximum reward. Notice, here we cannot explicitly optimize over an infinite horizon due to model errors and noise.
Key idea: plan over a finite horizon h, carry out first action, then replan ( inside world model):
How to optimize for action sequences with the maximum cumulative reward during h-steps?
Answer:
Challenges(especially for large \(h\)): local minima, vanishing/exploding gradients. Therefore, in practice, this optimization is done with a variety of gradient-free sample-based optimizers below.
We use MPC to select actions via our model predictions in imagination. At each time step, the method perfroms a short-horizon trajectory optimization, using the model to predict the outcomes of different action sequences.
In short, random shooting is a direct, brute-force method for trajectory optimization. It can be a good choice where the action space isn’t too vast, or computational resources are abundant, making it inapplicable for robotic manipulation tasks. Nagabandi et al(2017) has first used Random Shooting method in continuous control tasks with learned models. Since Random Shooting has number of drawbacks due to scalability with dimensions, a lot of the recent literature focused on integrating a better planing algorithm called Cross-Entropy Method or its improved variants.
A smarter way to do random shooting is called cross-entropy method CEM. Highly recommended to refer to A Tutorial on Cross-Entropy Method
It begins as a random shooting approach, but then does this sampling for multiple iterations \(m [0, M]\) at each time step. The top \(J\) highest scoring actions sequences (elites) from each iteration are used to update mean and variance of the sampling distribution for the next iteration.
In simpler terms, you try out random stuff at the beginning and you evaluate this random actions based on some criteria (how close you are to the target or how much reward you receive). You choose \(J\) number of highest scoring elites from this sequences and in the next iteration of trials, you will try to sample closer to those elites. You’ll keep doing this.
In CEM, action plan has no temporal correlation and it doesn’t have a memory of the actions that led to better behaviours in the past iterations.
iCEM improves over CEM by adding simple but smart tricks:
The figure is from Cristina Pinneri’s presentation
The result: \(2.7-22 \times\) more sample efficient and faster \(1.2-10 \times\) faster than CEM.
From Benchmarking Model-Based Reinforcement Learning