Exploring Model-Based RL

Minimal overview of model-based RL methods (updated regularly)

Reinforcement Learning (RL) aims to derive an optimal policy for a Markov Decision Process (MDP). When we possess a perfect world model, achieving this optimal policy is straightforward. However, in most real-world scenarios, such a model is unknown and must be learned from data or manually designed.

Model-Free RL vs Model-Based RL

In Model-Free RL, we focus on deriving a controller without explicitly learning a model of the environment. Utilizing a replay buffer, which essentially acts as a non-parametric world model containing state-action transitions, we estimate the value function directly. In contrast, Model-Based RL strives to learn the underlying MDP by estimating transition probabilities \(P(s' | s, a)\) and the reward function \(R(s, a)\). This acquired parametric world model allows us to derive optimal policies through planning methods.
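
For intuition, a count-based estimate of \(P(s' \mid s, a)\) and \(R(s, a)\) from a replay buffer of discrete transitions could look like the following minimal sketch; the toy buffer and helper names are purely illustrative, not from any particular implementation.

```python
import numpy as np
from collections import defaultdict

# Hypothetical replay buffer: a list of (s, a, r, s_next) tuples with discrete states/actions.
buffer = [(0, 1, 0.0, 1), (1, 0, 1.0, 0), (0, 1, 0.0, 2), (2, 1, 0.5, 0)]

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = N(s, a, s')
reward_sums = defaultdict(float)                 # sum of rewards observed for (s, a)
visits = defaultdict(int)                        # N(s, a)

for s, a, r, s_next in buffer:
    counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    visits[(s, a)] += 1

def P(s_next, s, a):
    """Maximum-likelihood estimate of the transition probability P(s' | s, a)."""
    if visits[(s, a)] == 0:
        return 0.0
    return counts[(s, a)][s_next] / visits[(s, a)]

def R(s, a):
    """Empirical mean reward for taking action a in state s."""
    return reward_sums[(s, a)] / max(visits[(s, a)], 1)

print(P(1, 0, 1), R(0, 1))  # 0.5 and 0.0 for the toy buffer above
```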

The debate regarding the superior approach—model-free or model-based—is ongoing. Model-free methods often demand extensive interaction with the environment, making them costly for real-world training. Consequently, these methods are primarily applied in simulated environments, with a few exceptions.

On the other hand, model-based reinforcement learning holds promise for superior sample efficiency (the amount of experience that an agent needs to generate in an environment during training in order to reach a certain level of performance). Yet, accurately learning a generalized model of the environment poses a significant challenge: compounding model errors and the planner's potential exploitation of learned-model deficiencies are two notable difficulties.

What is a (world) model?

Before delving into the details, let’s clarify what we mean by “world models” to avoid any potential confusion.

An internal world model serves as a mechanism to represent both the system and its environment. For instance, a robot equipped with an internal model possesses knowledge about the underlying physics of the world and can generate and test “what-if” hypotheses, such as the consequences of different actions:

Holland et al. write:

“an internal model allows a system to look ahead to the future consequences of current actions, without actually committing itself to those actions”.

Yann LeCun highlights the necessity of world models for minimizing costly and dangerous real-world interactions:

Interactions in the real world are expensive and dangerous, intelligent agents should learn as much as they can about the world without interaction (by observation) so as to minimize the number of expensive and dangerous trials necessary to learn a particular task.

In simple mathematical terms, a world model learns a function \(F: S_t \times A \to S_{t+1}\), i.e., it predicts the next state given the observed state(s) and the current action.

General Algorithm

This picture illustrates the general paradigm of model-based RL:

[Figure: the general paradigm of model-based RL]

Iterate for several episodes: collect experience by acting in the environment, learn or update the model from this data, and plan with the learned model to select the next actions.

We have three questions to answer to accomplish this:

  1. How do we learn a model?
  2. How do we plan given a model?
  3. How do we trade off exploration and exploitation?

Model Learning

We need to learn a model that approximates the unknown MDP. We focus on the fully observable case (state \(x_t\) is observed).

“Data” in RL

Data consists of trajectories \((x_0, a_0, r_0, x_1, a_1, r_1, …)\)

Common settings:

Key Insight:

\(x_{t+1}\) is conditionally independent of \(x_{0:t-1}\) given \(x_t\) and \(a_t\) (the Markov property)

In fully observed environments, thanks to this conditional independence, learning \(F\) and \(R\) is essentially a regression (or density estimation) problem, i.e., supervised learning.

Delta state prediction

Instead of predicting the next state \(x_{t+1}\) directly, the model predicts the change \(\Delta x_t = x_{t+1} - x_t\). This allows the model to focus on capturing the changes accurately.
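
As a hypothetical sketch (dimensions, architecture, and the synthetic data are purely illustrative), fitting the dynamics then reduces to ordinary regression on \((x_t, a_t) \mapsto \Delta x_t\):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2  # illustrative dimensions

# Dynamics model: predicts the state *change*, not the next state itself.
model = nn.Sequential(
    nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, state_dim),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic transitions (x_t, a_t, x_{t+1}); in practice these come from the replay buffer.
x_t = torch.randn(1024, state_dim)
a_t = torch.randn(1024, action_dim)
x_next = x_t + 0.1 * torch.randn(1024, state_dim)

for epoch in range(100):
    delta_pred = model(torch.cat([x_t, a_t], dim=-1))
    loss = nn.functional.mse_loss(delta_pred, x_next - x_t)  # regress the delta
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def predict_next_state(x, a):
    """Next-state prediction = current state + predicted delta."""
    with torch.no_grad():
        return x + model(torch.cat([x, a], dim=-1))
```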

Key pitfalls in MBRL

Errors in the model estimate compound when planning over multiple time-steps.

These errors can be exploited by the planning algorithm (MPC, policy search), which may select action sequences that look good under the model but perform poorly in the real environment.
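
A rough back-of-the-envelope bound makes the compounding concrete (assuming deterministic dynamics and a learned model \(\hat f\) that is \(L\)-Lipschitz in the state with one-step error at most \(\epsilon\)): the open-loop prediction error \(e_h\) after \(h\) imagined steps satisfies

\[ e_{h+1} \le L\, e_h + \epsilon \quad \Rightarrow \quad e_h \le \epsilon\,\frac{L^h - 1}{L - 1} \quad (L \ne 1), \]

so for \(L > 1\) the error can grow exponentially with the rollout length.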

Epistemic uncertainty: uncertainty due to limited data, i.e., our lack of knowledge about the true dynamics; it can be reduced by collecting more data (often captured with model ensembles or Bayesian models).

Aleatoric uncertainty: the inherent stochasticity (noise) of the environment itself, which cannot be reduced by collecting more data (often captured by predicting a distribution, e.g., a Gaussian, over next states).

PETS Algorithm

PETS (Probabilistic Ensembles with Trajectory Sampling; Chua et al., 2018) handles both kinds of uncertainty: each ensemble member is a probabilistic network that outputs a distribution over next states (aleatoric), disagreement across members captures epistemic uncertainty, and planning is done with CEM over sampled trajectories.
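
A minimal sketch of the core ingredient, assuming a simple Gaussian-output network and a trajectory-sampling rollout step; layer sizes and names are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class ProbabilisticDynamics(nn.Module):
    """Predicts a Gaussian over the next-state delta: captures aleatoric uncertainty."""
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, state_dim)
        self.logvar_head = nn.Linear(hidden, state_dim)

    def forward(self, x, a):
        h = self.net(torch.cat([x, a], dim=-1))
        return self.mean_head(h), self.logvar_head(h)

# An ensemble of such models captures epistemic uncertainty:
# disagreement between members signals regions the data does not cover.
# Training would minimize the Gaussian negative log-likelihood of observed deltas.
state_dim, action_dim, ensemble_size = 4, 2, 5
ensemble = [ProbabilisticDynamics(state_dim, action_dim) for _ in range(ensemble_size)]

def sample_next_state(x, a):
    """Trajectory-sampling style step: pick a random member, then sample from its Gaussian."""
    model = ensemble[torch.randint(ensemble_size, (1,)).item()]
    with torch.no_grad():
        mean, logvar = model(x, a)
    return x + mean + torch.randn_like(mean) * (0.5 * logvar).exp()
```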

Learning latent state representations

World Models, PlaNet (Hafner, ICML 19), Dreamer (Hafner, ICLR 20), MuZero (Schrittwieser, Nature 20), IRIS (Micheli et al., ICLR 23)

Dreamer v1, v2, DayDreamer (application of Dreamer to 5 different real robots)

Planning

To start, assume we have a known deterministic model for the reward and dynamics:

\[x_{t+1} = f(x_t, a_t), \qquad r_t = r(x_t, a_t)\]

Then our objective becomes planning sequences of actions such that, over a fixed horizon, the chosen actions yield the maximum cumulative reward. Note that we cannot explicitly optimize over an infinite horizon here, due to model errors and noise.
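
One common way to write this, using the deterministic model above and a planning horizon \(h\), is the finite-horizon problem at time \(t\):

\[ \max_{a_t, \dots, a_{t+h-1}} \; \sum_{\tau = t}^{t + h - 1} r(x_\tau, a_\tau) \quad \text{s.t.} \quad x_{\tau + 1} = f(x_\tau, a_\tau). \]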

Key idea: plan over a finite horizon \(h\) (inside the world model), carry out the first action, then replan at the next step.

How do we optimize for the action sequence with the maximum cumulative reward over \(h\) steps?

Answer: if \(f\) and \(r\) are differentiable, one option is to optimize the action sequence directly with gradients, backpropagating the cumulative reward through the model.

Challenges (especially for large \(h\)): local minima and vanishing/exploding gradients. Therefore, in practice, this optimization is usually done with one of the gradient-free, sample-based optimizers described below.

Online Planning for closed-loop control

We use MPC (Model Predictive Control) to select actions via our model predictions in imagination. At each time step, the method performs a short-horizon trajectory optimization, using the model to predict the outcomes of different action sequences.
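
As a schematic sketch, the closed-loop MPC control loop might look like the following; `plan_actions` stands in for any of the optimizers below (random shooting, CEM, iCEM), and `env` is assumed to follow a gym-style `reset`/`step` interface (all names are placeholders):

```python
def mpc_control(env, model, reward_fn, plan_actions, horizon=20, episode_len=200):
    """Replan a short action sequence at every step and execute only its first action."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(episode_len):
        # Short-horizon trajectory optimization "in imagination": only the model is queried here.
        action_sequence = plan_actions(model, reward_fn, obs, horizon)
        # Execute just the first planned action in the real environment, then replan.
        obs, reward, done, _ = env.step(action_sequence[0])
        total_reward += reward
        if done:
            break
    return total_reward
```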

Random Shooting:

In short, random shooting is a direct, brute-force method for trajectory optimization. It can be a good choice when the action space isn’t too vast or computational resources are abundant, but it scales poorly to high-dimensional problems such as robotic manipulation. Nagabandi et al. (2017) first used the random shooting method for continuous control tasks with learned models. Because random shooting scales poorly with dimensionality, much of the recent literature has focused on integrating a better planning algorithm, the Cross-Entropy Method, or its improved variants.
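
A minimal random-shooting planner, assuming a deterministic `model(x, a)` that returns the next state and a known `reward_fn(x, a)` (both placeholders, hyperparameters illustrative):

```python
import numpy as np

def random_shooting(model, reward_fn, x0, horizon, n_samples=1000,
                    action_dim=2, action_bound=1.0):
    """Brute force: sample random action sequences, roll each out through the model,
    and return the sequence with the highest predicted return."""
    rng = np.random.default_rng()
    candidates = rng.uniform(-action_bound, action_bound,
                             size=(n_samples, horizon, action_dim))
    returns = np.zeros(n_samples)
    for i, actions in enumerate(candidates):
        x = x0
        for a in actions:
            returns[i] += reward_fn(x, a)
            x = model(x, a)  # imagined one-step transition
    return candidates[np.argmax(returns)]
```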

Cross-entropy method (CEM)

A smarter way to do random shooting is the cross-entropy method (CEM). I highly recommend referring to A Tutorial on the Cross-Entropy Method.

It begins as a random shooting approach, but repeats the sampling for multiple iterations \(m \in \{0, \dots, M\}\) at each time step. The top \(J\) highest-scoring action sequences (elites) from each iteration are used to update the mean and variance of the sampling distribution for the next iteration.

In simpler terms, you try out random action sequences at the beginning and evaluate them based on some criterion (how close you get to the target, or how much reward you receive). You choose the \(J\) highest-scoring elites from these sequences, and in the next iteration you sample closer to those elites. You keep repeating this.
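
A sketch of the CEM planner described above, under the same placeholder `model` and `reward_fn` assumptions; the hyperparameters are illustrative:

```python
import numpy as np

def cem_plan(model, reward_fn, x0, horizon, action_dim=2,
             n_samples=500, n_elites=50, n_iters=5, action_bound=1.0):
    """Iteratively refit a Gaussian sampling distribution to the elite action sequences."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    rng = np.random.default_rng()
    for _ in range(n_iters):
        # Sample candidate action sequences around the current distribution.
        samples = np.clip(rng.normal(mean, std, size=(n_samples, horizon, action_dim)),
                          -action_bound, action_bound)
        returns = np.zeros(n_samples)
        for i, actions in enumerate(samples):
            x = x0
            for a in actions:
                returns[i] += reward_fn(x, a)
                x = model(x, a)
        # Keep the top-J sequences (elites) and refit mean/variance to them.
        elites = samples[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # planned action sequence; in MPC only mean[0] is executed
```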

Random Shooting vs CEM

iCEM

In CEM, the action plan has no temporal correlation, and the method has no memory of the actions that led to better behaviour in past iterations.

iCEM improves over CEM by adding simple but smart tricks, most notably temporally correlated (colored-noise) action sampling (sketched below) and a memory that carries elites over to later CEM iterations and shifts them forward in time when replanning.
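
For instance, the temporally correlated samples can be drawn as colored noise whose power spectrum falls off as \(1/f^\beta\); below is a rough frequency-domain sketch of such a sampler (an illustration, not the authors' implementation):

```python
import numpy as np

def colored_noise(beta, horizon, n_samples, rng=None):
    """Sample action noise with power spectral density ~ 1/f^beta.
    beta = 0 is close to white noise (plain CEM); larger beta gives smoother,
    temporally correlated action sequences."""
    if rng is None:
        rng = np.random.default_rng()
    freqs = np.fft.rfftfreq(horizon)
    amplitude = np.ones_like(freqs)
    amplitude[1:] = freqs[1:] ** (-beta / 2)   # shape the spectrum, skip f = 0
    amplitude[0] = 0.0                         # drop the DC component
    phases = rng.uniform(0, 2 * np.pi, size=(n_samples, freqs.size))
    spectrum = amplitude * np.exp(1j * phases)
    noise = np.fft.irfft(spectrum, n=horizon, axis=-1)
    # Standardize each sequence so it can replace the unit Gaussian noise in CEM.
    noise -= noise.mean(axis=-1, keepdims=True)
    noise /= noise.std(axis=-1, keepdims=True) + 1e-8
    return noise
```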

[Figure: overview of the iCEM algorithm]

The figure is from Cristina Pinneri’s presentation

The result: iCEM is \(2.7\)–\(22\times\) more sample efficient and \(1.2\)–\(10\times\) faster than CEM.

Planning Horizon Dilemma

Increasing the planning horizon gives a more accurate estimate of long-term reward, but it also enlarges the search space and exposes compounding model errors, so in practice performance often peaks at an intermediate horizon (from Benchmarking Model-Based Reinforcement Learning, Wang et al., 2019).

Unsupervised Exploration in MBRL