Accompanying Slides: Google Presentation
This tutorial introduces an unsupervised reinforcement learning algorithm, the Forward Backward (FB) model, through the lens of multi-task policy iteration. The key insight is that the occupancy measure decouples dynamics learning from value learning, which lets us build a policy iteration scheme that optimizes for a whole family of rewards at the same time. By factorizing the occupancy measure into a forward backward representation, the FB model further enables the automatic construction of this family of rewards and their corresponding optimal policies simultaneously.
Apart from the elegant mathematical formulation, the training of the FB model can also be understood intuitively as a latent space goal-conditioned reinforcement learning framework: the latent space is constructed automatically with dynamics prediction as the self-supervision signal, and the zero-shot reward inference step can be understood as a reward-weighted latent space goal retrieval.
Note: This tutorial focuses on the Forward Backward model formulation from a fresh perspective, emphasizing intuition and algorithmic understanding. It does not cover:
- Optimization processes and convergence analysis
- Specific optimization algorithms (PPO, SAC, etc.)
Before diving into the Forward Backward model, let's establish the mathematical foundations with clear definitions of occupancy measures and Q functions.
Definition: The Q function represents the expected discounted future reward:
$$Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_{t+1}) \;\middle|\; s_0 = s,\ a_0 = a,\ \pi \right]$$
where $\gamma \in [0, 1)$ is the discount factor, $s_{t+1} \sim P(\cdot | s_t, a_t)$, and $a_t \sim \pi(\cdot | s_t)$ for $t \geq 1$.
Bellman Equation: The recursive relationship for Q functions:
$$Q^{\pi}(s, a) = \mathbb{E}_{s', a'}\left[ r(s') + \gamma \, Q^{\pi}(s', a') \right]$$
The expectation is taken over:
- Next state: $s' \sim P(s'|s,a)$
- Next action: $a' \sim \pi(a'|s')$
💡 Key Point: A Q function is defined by an MDP with a reward and a policy. It captures the discounted future reward starting from state-action pair $(s,a)$ and following policy $\pi$.
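To make these definitions concrete, here is a minimal sketch (not from the original tutorial) that evaluates $Q^\pi$ on a small made-up tabular MDP by simply iterating the Bellman equation; all numbers, shapes, and the random policy are assumptions for illustration.

```python
import numpy as np

# A tiny made-up tabular MDP: 3 states, 2 actions, state-only reward r(s').
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = P(s'|s,a)
r = np.array([0.0, 0.0, 1.0])                                      # reward collected at the next state
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # pi[s, a] = pi(a|s)

# Iterate Q(s,a) <- E_{s'}[ r(s') + gamma * E_{a'~pi} Q(s',a') ] until convergence.
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    V = (pi * Q).sum(axis=1)          # V(s') = E_{a'~pi(.|s')} Q(s', a')
    Q = P @ (r + gamma * V)           # expectation over s' ~ P(.|s,a)

print(Q)  # Q^pi(s, a) for every state-action pair
```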
Definition: The occupancy measure captures the expected discounted future state visitation probability:
$$M^{\pi}(s^+ | s, a) = \sum_{t=0}^{\infty} \gamma^t \, \Pr\left( s_{t+1} = s^+ \mid s_0 = s,\ a_0 = a,\ \pi \right)$$
Bellman Equation: The recursive relationship for occupancy measures:
$$M^{\pi}(s^+ | s, a) = p(s^+ | s, a) + \gamma \, \mathbb{E}_{s', a'}\left[ M^{\pi}(s^+ | s', a') \right]$$
where $p(s^+ | s, a)$ is the one-step transition probability of the MDP, $s' \sim P(s'|s,a)$, and $a' \sim \pi(a'|s')$.
💡 Key Point: An occupancy measure is defined by a reward-free MDP and a policy. It captures the discounted future state distribution starting from state-action pair $(s,a)$ and following policy $\pi$.
The elegant connection between these concepts emerges when we consider arbitrary reward functions. Although the occupancy measure is defined for reward-free MDPs, we can use it to evaluate the Q function for any reward function $r$:
$$Q^{\pi}(s, a) = \int_{s^+} M^{\pi}(s^+ | s, a) \cdot r(s^+)$$
Intuition: Since the occupancy measure represents the expected discounted future state distribution, we can evaluate the Q function by integrating over all possible future states, weighted by their respective rewards.
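The same toy setup can illustrate this connection (again a sketch with made-up numbers): we iterate the occupancy measure Bellman equation once, completely reward-free, and can then evaluate the Q function for any reward after the fact with a single integration (here, a dot product).

```python
import numpy as np

# Same made-up tabular MDP as above.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a]

# Iterate M(s+|s,a) <- p(s+|s,a) + gamma * E_{s',a'}[ M(s+|s',a') ]: no reward involved.
M = np.zeros((n_states, n_actions, n_states))
for _ in range(1000):
    M_pi = np.einsum('ka,kaj->kj', pi, M)              # E_{a'~pi} M(s+ | s', a')
    M = P + gamma * np.einsum('iak,kj->iaj', P, M_pi)  # expectation over s' ~ P(.|s,a)

# Any reward function can now be evaluated from the same reward-free M:
for r in (np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0, 0.0])):
    print(M @ r)   # Q^pi(s,a) = sum_{s+} M(s+|s,a) r(s+), matches the Bellman-iterated Q
```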
Mathematical Insight: Deriving the Occupancy Measure Bellman Equation
Starting with the Q function Bellman equation and substituting the occupancy measure relationship:
$$\int_{s^+} M^{\pi}(s^+ | s, a) \cdot r(s^+) = \int_{s^+} p(s^+ | s, a) \cdot r(s^+) + \gamma \cdot \mathbb{E}_{s', a'}\left[ \int_{s^+} M^{\pi}(s^+ | s', a') \cdot r(s^+) \right]$$
Since this equation must hold for all possible reward functions $r(s^+)$, we can factor out the reward term (mathematically, this can be made precise by choosing $r(s^+) = \delta(s^+ - s_0)$ for an arbitrary state $s_0$, which collapses the integrals and yields the identity pointwise):
$$M^{\pi}(s^+ | s, a) = p(s^+ | s, a) + \gamma \cdot \mathbb{E}_{s', a'}\left[ M^{\pi}(s^+ | s', a') \right]$$
This section bridges classical reinforcement learning algorithms with our multi-task framework by progressively introducing the occupancy measure perspective.
Core Update Rule: Q-learning performs Bellman updates of the following form:
$$Q(s, a) \leftarrow \mathbb{E}_{s' \sim P(s'|s,a)}\left[ r(s') + \gamma \max_{a'} Q(s', a') \right]$$
Policy Extraction: In discrete action spaces, the Q function directly defines the greedy optimal policy:
$$\pi(s) = \argmax_{a} Q(s, a)$$
In continuous action spaces we cannot take that argmax directly, so we need a separate policy to propose the action. If we think of the argmax as a rule-based selection, using a policy network to propose the action is a neural approximation of the same operation.
So the algorithm splits into two steps: one evaluates the current policy and one improves the policy.
Two-Step Process:
- Policy Evaluation:
  $$Q(s,a) \leftarrow \mathbb{E}_{s' \sim P(s'|s,a)} \, \mathbb{E}_{a' \sim \pi(a'|s')}\left[ r(s') + \gamma \, Q(s',a') \right]$$
- Policy Improvement:
  $$\mathcal{L}(\pi) = - \mathbb{E}_{s}\left[ Q(s, \pi(s)) \right]$$
We slightly abuse notation and write the policy improvement step as:
$$\pi(s) \leftarrow \argmax_{a} Q(s, a)$$
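Below is a minimal PyTorch-style sketch of these two steps for continuous actions. The network sizes, learning rates, and the random batch standing in for a replay buffer are all assumptions for illustration; this is a sketch of the two-step structure, not a faithful DDPG implementation (target networks and exploration are omitted).

```python
import torch
import torch.nn as nn

# Minimal actor-critic sketch of the two-step process.
state_dim, action_dim, gamma = 4, 2, 0.99
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def q(s, a):
    return critic(torch.cat([s, a], dim=-1)).squeeze(-1)

# A fake batch of transitions (s, a, r, s'); in practice this comes from a replay buffer.
s, a = torch.randn(32, state_dim), torch.randn(32, action_dim)
r, s_next = torch.randn(32), torch.randn(32, state_dim)

# 1) Policy evaluation: regress Q(s, a) toward r + gamma * Q(s', pi(s')).
with torch.no_grad():
    target = r + gamma * q(s_next, actor(s_next))
critic_loss = ((q(s, a) - target) ** 2).mean()
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# 2) Policy improvement: the actor is a differentiable stand-in for argmax_a Q(s, a).
actor_loss = -q(s, actor(s)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```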
Now we can rewrite the DDPG updates above in terms of the occupancy measure:
- Dynamics Learning (Reward-Free):
  $$M^\pi(s^+ | s, a) \leftarrow p(s^+ | s, a) + \gamma \mathbb{E}_{s', a'}\left[ M^\pi(s^+ | s', a') \right]$$
- Policy Improvement (Reward-Dependent):
  $$\pi(s) \leftarrow \argmax_{a} \int_{s^+} M^\pi(s^+ | s, a) \cdot r(s^+)$$
The first equation is the Bellman equation for the occupancy measure; it captures the policy-dependent transition dynamics of a reward-free MDP. The second equation takes the reward into account and optimizes the policy by maximizing the expected discounted future state visitation, weighted by the reward received at those future states.
What does this new interpretation from the occupancy measure perspective give us?
- Dynamics Factorization: The first equation provides a way to summarize the transition dynamics of the MDP without considering rewards. This lets us learn the transition dynamics in a reward-agnostic way, while still being able to evaluate the Q function by integrating over all possible future states weighted by the reward function.
- Multi-Task Potential: Only the second equation depends on the reward, so we can plug multiple reward functions $\{r_i\}$ into it and optimize the optimal policy for each reward function simultaneously, because the learned dynamics summary can be reused across reward functions.
Goal: Given a family of reward functions $\{r_i\}$, learn the optimal policy for each of them while sharing the underlying dynamics knowledge.
Algorithm:
- For each reward function $r$, learn the corresponding occupancy measure:
  $$M^{\pi_r}(s^+ | s, a) \leftarrow p(s^+ | s, a) + \gamma \mathbb{E}_{s', a'}\left[ M^{\pi_r}(s^+ | s', a') \right]$$
- For each reward function $r$, optimize the corresponding policy:
  $$\pi_r(s) \leftarrow \argmax_{a} \int_{s^+} M^{\pi_r}(s^+ | s, a) \cdot r(s^+)$$
💡 Key Insight: This framework enables us to learn multiple policies simultaneously while sharing the underlying dynamics knowledge across tasks.
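Here is a tabular sketch of this multi-task policy iteration (a made-up MDP, one "reach state $g$" reward per state, and deterministic greedy policies, all chosen for illustration). It instantiates both equations above: the dynamics-learning step never touches the reward, and only the improvement step does.

```python
import numpy as np

# Multi-task policy iteration sketch on a made-up tabular MDP.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))    # P[s, a, s']
rewards = [np.eye(n_states)[g] for g in range(n_states)]            # r_g(s+) = 1{s+ = g}

def occupancy(pi):
    """Reward-free step: iterate M = P + gamma * E_{s',a'~pi}[M] for a deterministic policy."""
    M = np.zeros((n_states, n_actions, n_states))
    for _ in range(500):
        M_pi = M[np.arange(n_states), pi]                 # M(s+ | s', pi(s'))
        M = P + gamma * np.einsum('iak,kj->iaj', P, M_pi)
    return M

policies = []
for r in rewards:
    pi = np.zeros(n_states, dtype=int)                    # arbitrary initial policy
    for _ in range(20):
        M = occupancy(pi)                                 # dynamics learning (reward-free)
        pi = np.argmax(M @ r, axis=1)                     # policy improvement (reward-dependent)
    policies.append(pi)

print(policies)  # one greedy policy per reward function
```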
Now we introduce the core contribution: the Forward Backward model, which provides an elegant factorization for multi-task policy learning.
The Forward Backward model factorizes the occupancy measure as a product of two components:
$$M^\pi(s^+ | s, a) = F^\pi(s, a)^\top B(s^+)$$
Components:
- $F^\pi(s, a)$: Forward model (policy-dependent) - projects state-action pairs to latent space
- $B(s^+)$: Backward model (policy-independent) - projects states to latent space

Intuitively:
- $B(s^+)$: Acts as an encoder mapping states into a meaningful latent representation that captures important state features
- $F^\pi(s, a)$: Measures the expected discounted sum of future latent representations under policy $\pi$ and encoding scheme $B$
This factorization enables us to separate "where we might go" (forward dynamics) from "what makes states valuable" (backward encoding).
Problem: Given a learned FB model associated with a policy $\pi$, how do we evaluate its Q function for an arbitrary reward function $r$?
Solution: Use the factorized occupancy measure:
$$Q^{\pi}(s, a) = \int_{s^+} F^{\pi}(s, a)^\top B(s^+) \cdot r(s^+) = F^{\pi}(s, a)^\top \int_{s^+} B(s^+) \cdot r(s^+)$$
Key Insight: The term $\int_{s^+} B(s^+) \cdot r(s^+)$ compresses the entire reward function into a single latent vector, so evaluating the Q function reduces to an inner product in latent space.
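A quick numerical sketch of this zero-shot evaluation, using random arrays as stand-ins for a learned $F$ and $B$ (the shapes and numbers are assumptions): the reward is first compressed into a latent vector, and the Q function is then a single inner product.

```python
import numpy as np

# Zero-shot Q evaluation with a (pretend) learned FB factorization.
n_states, n_actions, d = 6, 3, 8
rng = np.random.default_rng(2)
F = rng.normal(size=(n_states, n_actions, d))   # F^pi(s, a): forward embeddings
B = rng.normal(size=(n_states, d))              # B(s+): backward embeddings

r = rng.normal(size=n_states)                   # an arbitrary reward over states
z = B.T @ r                                     # latent summary of the reward: sum_{s+} B(s+) r(s+)
Q = F @ z                                       # Q(s, a) = F(s, a)^T z

# Identical to integrating the factorized occupancy measure against r:
M = F @ B.T                                     # M(s+|s, a) = F(s, a)^T B(s+)
assert np.allclose(Q, M @ r)
```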
Goal: Learn a family of policies $\{\pi_z\}$, indexed by a latent vector $z$, that satisfies two properties:
- Expressiveness: Good performance across a large class of reward functions
- Retrievability: Easy identification of the optimal policy for any given reward function
Simplified Representation: We can rewrite the Q function as:
$$Q^{\pi}(s, a) = F^{\pi}(s, a)^\top z$$
where $z = \int_{s^+} B(s^+) \cdot r(s^+)$ is the latent representation of the reward function.
Optimization Intuition: To maximize this inner product, the optimal policy should pick actions that make $F^{\pi}(s, a)$ point in the same direction as $z$...
...and should its length be as long as possible?
We will go through a bit of math here to answer that question.
Mathematical Constraint: Since the occupancy measure is a discounted sum of probability distributions, its total mass is fixed:
$$\int_{s^+} M^{\pi}(s^+ | s, a) = \frac{1}{1 - \gamma}$$
Factorized Form:
$$F^{\pi}(s, a)^\top \int_{s^+} B(s^+) = \frac{1}{1 - \gamma}$$
So let's answer the question: if we increase the length of $F^{\pi}(s, a)$, the factorization can simply compensate by shrinking $B$, so the length of $F$ by itself is not meaningful.
Scaling Invariance: There are infinitely many solutions for the pair $(F, B)$: scaling $F$ up by a constant and $B$ down by the same constant leaves the product $F^\top B$, and hence the occupancy measure, unchanged.
Solution: So we make it a rule that the length (norm) of $B(s^+)$ is fixed, which removes the scaling ambiguity.
Result: Under this constraint, the optimal policy corresponds to aligning $F^{\pi}(s, a)$ with the direction of $z$ and making the inner product $F^{\pi}(s, a)^\top z$ as large as possible.
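The scaling ambiguity is easy to see numerically (random stand-in arrays again, purely for illustration): multiplying $F$ by a constant and dividing $B$ by the same constant leaves the induced occupancy measure untouched, which is why pinning down the norm of $B$ is needed.

```python
import numpy as np

# (c * F) paired with (B / c) induces exactly the same occupancy measure as (F, B).
rng = np.random.default_rng(6)
F = rng.normal(size=(5, 2, 4))     # F(s, a)
B = rng.normal(size=(5, 4))        # B(s+)
c = 3.0
assert np.allclose(F @ B.T, (c * F) @ (B / c).T)
```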
Objective: Given a reward-free MDP, learn both a family of reward functions and their corresponding optimal policies using a latent space representation.
To make a concrete algorithm, we will use a latent space $\mathcal{Z} \subseteq \mathbb{R}^d$ in which both reward functions and their policies are indexed by a latent vector $z$.
For an FB model, we map a reward function to the latent space using the backward encoder:
$$z = \int_{s^+} B(s^+) \cdot r(s^+)$$
This creates a "latent goal" that summarizes the reward function's preferences, because states with higher rewards have a larger weight and thus a greater influence on the latent representation.
Q Function: For the policy $\pi_z$ associated with latent goal $z$, the Q function is:
$$Q^{\pi_z}(s, a) = F^{\pi_z}(s, a)^\top z$$
Bellman Update: Learn the forward dynamics:
$$F^{\pi_z}(s, a)^\top B(s^+) \leftarrow p(s^+ | s, a) + \gamma \, \mathbb{E}_{s', a'}\left[ F^{\pi_z}(s', a')^\top B(s^+) \right]$$
Policy Improvement: Optimize the policy to maximize latent goal alignment:
$$\pi_z(s) \leftarrow \argmax_{a} F^{\pi_z}(s, a)^\top z$$
In practice, we use a single neural network conditioned on the reward latent $z$, i.e., a goal-conditioned policy $\pi(s, z)$, to represent the whole family of policies $\{\pi_z\}$.
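Putting the pieces together, here is a toy tabular sketch of these two updates for a single latent goal $z$. It is an illustrative assumption, not the original FB training procedure: the transition matrix is known, $B$ is a fixed random encoder with normalized rows, and $F$ is simply refit by least squares each iteration; the full algorithm instead uses neural networks, sampled transitions, and many sampled $z$'s.

```python
import numpy as np

# Toy tabular FB-style training for one latent goal z on a made-up MDP.
n_states, n_actions, d, gamma = 5, 2, 4, 0.9
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))    # p(s+|s,a)
B = rng.normal(size=(n_states, d))
B /= np.linalg.norm(B, axis=1, keepdims=True)                        # fix the length of B(s+)
F = np.zeros((n_states, n_actions, d))

r = np.eye(n_states)[n_states - 1]                                   # example task: reach the last state
z = B.T @ r                                                          # latent goal for this reward

for _ in range(200):
    pi = np.argmax(F @ z, axis=1)                                    # policy improvement: argmax_a F(s,a)^T z
    F_next = F[np.arange(n_states), pi]                              # F(s', pi(s'))
    target = P + gamma * np.einsum('iak,kj->iaj', P, F_next @ B.T)   # p + gamma * E[F(s',a')^T B(s+)]
    sol, *_ = np.linalg.lstsq(B, target.reshape(-1, n_states).T, rcond=None)
    F = sol.T.reshape(n_states, n_actions, d)                        # refit F so F(s,a)^T B(s+) matches target

print(F @ z)                      # learned Q(s, a) = F(s, a)^T z for this task
print(np.argmax(F @ z, axis=1))   # greedy policy for the latent goal z
```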
Consider the visual representation of reward in state space:
Key Insights:
- $z$: Represents the "center of mass" of high-reward regions in latent space - a latent goal
- $F^{\pi_z}(s, a)$: Predicts the expected sum of future latent representations
- Optimization Goal: Align future latent trajectories with high-reward latent regions
Intuitive Understanding: Maximizing $F^{\pi_z}(s, a)^\top z$ steers the expected future latent trajectory toward the high-reward regions summarized by $z$.
- $B(s^+)$: State-to-Latent Encoder - maps explicit states into a meaningful latent goal space
- $F(s, a)$: Forward Dynamics Predictor - predicts future dynamics from the current state in latent space
- $z$: Latent Goal - the projection of a reward function into the learned latent space, a summary of the reward function's preferences
- $\pi(s, z)$: Goal-Conditioned Policy - the optimal policy for rewards whose latent projection is $z$, learned by maximizing alignment with $z$
The Forward Backward model can be understood as a latent space goal-conditioned reinforcement learning framework where:
- Unsupervised Latent Construction: The latent space emerges naturally through dynamics prediction and occupancy measure approximation as self-supervision.
- Zero-Shot Reward Inference: Goals are computed as weighted sums of the latent representations $B(s^+)$ of states in the experience buffer, weighted by the reward function.
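As a closing sketch, this is roughly what the zero-shot procedure looks like at test time. Everything here is a hypothetical stand-in (`B_buffer`, `reward_fn`, and `policy` are placeholders for a trained backward encoder applied to buffer states, a new task's reward, and a trained goal-conditioned policy), shown only to make the two steps explicit.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_buffer, state_dim = 8, 1000, 4

# Hypothetical stand-ins for trained components and logged data.
buffer_states = rng.normal(size=(n_buffer, state_dim))    # states s+ from the experience buffer
B_buffer = rng.normal(size=(n_buffer, d))                 # B(s+) for those states

def reward_fn(states):
    """The new task, specified only through its reward function (assumed example)."""
    return (states[:, 0] > 1.0).astype(float)

def policy(state, z):
    """Placeholder for the trained goal-conditioned policy pi(s, z)."""
    return np.tanh(state[:2] + z[:2])

# 1) Zero-shot reward inference: z is the reward-weighted sum of backward embeddings.
z = (reward_fn(buffer_states)[:, None] * B_buffer).mean(axis=0)

# 2) Act by conditioning the (frozen) policy on the inferred latent goal.
action = policy(rng.normal(size=state_dim), z)
print(z.shape, action)
```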
