Unlike the methods we've discussed so far, dynamic programming takes a step back and considers an entire family of related problems rather than a single optimization problem. This approach, while seemingly more complex at first glance, can often lead to efficient solutions.
Dynamic programming leverages the solution structure underlying many control problems, which allows a problem to be decomposed into smaller, more manageable subproblems. Each subproblem is itself an optimization problem embedded within the larger whole. This recursive structure is the foundation upon which dynamic programming constructs its solutions.
To ground our discussion, let us return to the domain of discrete-time optimal control problems (DOCPs). These problems frequently arise from the discretization of continuous-time optimal control problems. While the focus here will be on deterministic problems, these concepts extend naturally to stochastic problems by taking the expectation over the random quantities.
Consider a typical DOCP of Bolza type:
Rather than considering only the total cost from the initial time to the final time, dynamic programming introduces the concept of cost from an arbitrary point in time to the end. This leads to the definition of the "cost-to-go" or "value function"
This function represents the total cost incurred from stage $t$ through the end of the horizon, assuming optimal decisions at every subsequent stage.
Given knowledge of the optimal behavior from stage $t+1$ onward, the optimal cost-to-go at stage $t$ satisfies

$$
J_t^\star(\mathbf{x}) = \min_{\mathbf{u}} \left[ c_t(\mathbf{x}, \mathbf{u}) + J_{t+1}^\star(f_t(\mathbf{x}, \mathbf{u})) \right].
$$
This equation is known as Bellman's equation, named after Richard Bellman, who formulated the principle of optimality:
An optimal policy has the property that whatever the previous state and decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the previous decision.
In other words, any sub-path of an optimal path, from any intermediate point to the end, must itself be optimal. This principle is the basis for the backward induction procedure which computes the optimal value function and provides closed-loop control capabilities without having to use an explicit NLP solver.
Dynamic programming can handle nonlinear systems and non-quadratic cost functions naturally. It provides a global optimal solution, when one exists, and can incorporate state and control constraints with relative ease. However, as the dimension of the state space increases, this approach suffers from what Bellman termed the "curse of dimensionality." The computational complexity and memory requirements grow exponentially with the state dimension, rendering direct application of dynamic programming intractable for high-dimensional problems.
Fortunately, learning-based methods offer efficient tools to combat the curse of dimensionality on two fronts: by using function approximation (e.g., neural networks) to avoid explicit discretization, and by leveraging randomization through Monte Carlo methods inherent in the learning paradigm. Most of this course is dedicated to those ideas.
The principle of optimality provides a methodology for solving optimal control problems. Beginning at the final time horizon and working backwards, at each stage the local optimization problem given by Bellman's equation is solved. This process, termed backward recursion or backward induction, constructs the optimal value function stage by stage.
:label: backward-recursion-deterministic
**Input:** Terminal cost function $c_\mathrm{T}(\cdot)$, stage cost functions $c_t(\cdot, \cdot)$, system dynamics $f_t(\cdot, \cdot)$, time horizon $\mathrm{T}$
**Output:** Optimal value functions $J_t^\star(\cdot)$ and optimal control policies $\pi_t^\star(\cdot)$ for $t = 1, \ldots, T$
1. Initialize $J_T^\star(\mathbf{x}) = c_\mathrm{T}(\mathbf{x})$ for all $\mathbf{x}$ in the state space
2. For $t = T-1, T-2, \ldots, 1$:
1. For each state $\mathbf{x}$ in the state space:
1. Compute $J_t^\star(\mathbf{x}) = \min_{\mathbf{u}} \left[ c_t(\mathbf{x}, \mathbf{u}) + J_{t+1}^\star(f_t(\mathbf{x}, \mathbf{u})) \right]$
2. Compute $\pi_t^\star(\mathbf{x}) = \arg\min_{\mathbf{u}} \left[ c_t(\mathbf{x}, \mathbf{u}) + J_{t+1}^\star(f_t(\mathbf{x}, \mathbf{u})) \right]$
2. End For
3. End For
4. Return $J_t^\star(\cdot)$, $\pi_t^\star(\cdot)$ for $t = 1, \ldots, T$
Upon completion of this backward pass, we have access to the optimal control at any stage and in any state. Furthermore, we can simulate optimal trajectories from any initial state by applying the optimal policy at each stage.
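To make the backward pass concrete, here is a minimal Python sketch on a toy problem (all problem data below — five states, three controls, quadratic costs — are invented for illustration, not taken from the text):

```python
# Toy problem (invented for illustration): states {0,...,4}, controls
# {-1, 0, 1}, dynamics x' = clip(x + u, 0, 4), stage cost x^2 + u^2,
# terminal cost x^2, horizon T = 4.
states = range(5)
controls = (-1, 0, 1)
T = 4

def f(x, u):                      # deterministic dynamics
    return min(max(x + u, 0), 4)

def c(x, u):                      # stage cost
    return x**2 + u**2

# Backward recursion: J_T = terminal cost, then Bellman updates for t = T-1,...,1
J = {T: {x: float(x**2) for x in states}}
pi = {}
for t in range(T - 1, 0, -1):
    J[t], pi[t] = {}, {}
    for x in states:
        q = {u: c(x, u) + J[t + 1][f(x, u)] for u in controls}
        u_star = min(q, key=q.get)
        J[t][x], pi[t][x] = q[u_star], u_star

# Closed-loop rollout: apply the stored policy from an initial state
x, total = 4, 0.0
for t in range(1, T):
    u = pi[t][x]
    total += c(x, u)
    x = f(x, u)
total += x**2                     # add the terminal cost
# total equals J[1][4]: the realized cost matches the value function
```

Note that the rollout needs no optimization at run time: the backward pass already stored the minimizing control for every state and stage, which is exactly the closed-loop capability mentioned above.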
:label: thm-bolza-backward
**Setting.** Let $\mathbf{x}_{t+1}=\mathbf{f}_t(\mathbf{x}_t,\mathbf{u}_t)$ for $t=1,\dots,T-1$, with admissible action sets $\mathcal{U}_t(\mathbf{x})\neq\varnothing$. Let stage costs $c_t(\mathbf{x},\mathbf{u})$ and terminal cost $c_\mathrm{T}(\mathbf{x})$ be real-valued and bounded below. Assume for every $(t,\mathbf{x})$ the one-step problem
$$
\min_{\mathbf{u}\in\mathcal{U}_t(\mathbf{x})}\big\{c_t(\mathbf{x},\mathbf{u})+J_{t+1}^\star(\mathbf{f}_t(\mathbf{x},\mathbf{u}))\big\}
$$
admits a minimizer (e.g., compact $\mathcal{U}_t(\mathbf{x})$ and continuity suffice).
Define $J_T^\star(\mathbf{x}) \equiv c_\mathrm{T}(\mathbf{x})$ and for $t=T-1,\dots,1$
$$
J_t^\star(\mathbf{x}) \;\triangleq\; \min_{\mathbf{u}\in\mathcal{U}_t(\mathbf{x})}
\Big[c_t(\mathbf{x},\mathbf{u})+J_{t+1}^\star\big(\mathbf{f}_t(\mathbf{x},\mathbf{u})\big)\Big],
$$
and select any minimizer $\boldsymbol{\pi}_t^\star(\mathbf{x})\in\arg\min(\cdot)$.
**Claim.** For every initial state $\mathbf{x}_1$, the control sequence
$\boldsymbol{\pi}_1^\star(\mathbf{x}_1),\dots,\boldsymbol{\pi}_{T-1}^\star(\mathbf{x}_{T-1})$
generated by these selectors is optimal for the Bolza problem, and
$J_1^\star(\mathbf{x}_1)$ equals the optimal cost. Moreover, $J_t^\star(\cdot)$ is the optimal cost-to-go from stage $t$ for every state, i.e., backward induction recovers the entire value function.
We give a direct proof by backward induction. The general idea is that any feasible sequence can be improved by replacing its tail with an optimal continuation, so optimal solutions can be built stage by stage. This is sometimes called a "cut-and-paste" argument.
**Step 1 (verification of the recursion at a fixed stage).**
Fix $t\in\{1,\dots,T-1\}$ and $\mathbf{x}\in\mathbb{X}$. Consider any admissible control sequence $\mathbf{u}_t,\dots,\mathbf{u}_{T-1}$ starting from $\mathbf{x}_t=\mathbf{x}$ and define the induced states $\mathbf{x}_{k+1}=\mathbf{f}_k(\mathbf{x}_k,\mathbf{u}_k)$. Its total cost from $t$ is
$$
c_t(\mathbf{x}_t,\mathbf{u}_t)+\sum_{k=t+1}^{T-1}c_k(\mathbf{x}_k,\mathbf{u}_k)+c_\mathrm{T}(\mathbf{x}_T).
$$
By definition of $J_{t+1}^\star$, the tail cost satisfies
$$
\sum_{k=t+1}^{T-1}c_k(\mathbf{x}_k,\mathbf{u}_k)+c_\mathrm{T}(\mathbf{x}_T)
\;\ge\; J_{t+1}^\star(\mathbf{x}_{t+1})
\;=\; J_{t+1}^\star\big(\mathbf{f}_t(\mathbf{x},\mathbf{u}_t)\big).
$$
Hence the total cost is bounded below by
$$
c_t(\mathbf{x},\mathbf{u}_t)+J_{t+1}^\star\big(\mathbf{f}_t(\mathbf{x},\mathbf{u}_t)\big).
$$
Taking the minimum over $\mathbf{u}_t\in\mathcal{U}_t(\mathbf{x})$ yields
$$
\text{(any admissible cost from $t$)}\;\ge\;J_t^\star(\mathbf{x}).
\tag{$\ast$}
$$
**Step 2 (existence of an optimal prefix at stage $t$).**
By assumption, there exists $\boldsymbol{\pi}_t^\star(\mathbf{x})$ attaining the minimum in the definition of $J_t^\star(\mathbf{x})$. If we now **paste** to $\boldsymbol{\pi}_t^\star(\mathbf{x})$ an optimal tail policy from $t+1$ (whose existence we will establish inductively), the resulting sequence attains cost exactly
$$
c_t\big(\mathbf{x},\boldsymbol{\pi}_t^\star(\mathbf{x})\big)
+J_{t+1}^\star\!\Big(\mathbf{f}_t\big(\mathbf{x},\boldsymbol{\pi}_t^\star(\mathbf{x})\big)\Big)
=J_t^\star(\mathbf{x}),
$$
which matches the lower bound $(\ast)$; hence it is optimal from $t$.
**Step 3 (backward induction over time).**
Base case $t=T$. The statement holds because $J_T^\star(\mathbf{x})=c_\mathrm{T}(\mathbf{x})$ and there is no control to choose.
Inductive step. Assume the tail statement holds for $t+1$: from any state $\mathbf{x}_{t+1}$ there exists an optimal control sequence realizing $J_{t+1}^\star(\mathbf{x}_{t+1})$. Then by Steps 1–2, selecting $\boldsymbol{\pi}_t^\star(\mathbf{x}_t)$ at stage $t$ and concatenating the optimal tail from $t+1$ yields an optimal sequence from $t$ with value $J_t^\star(\mathbf{x}_t)$.
By backward induction, the claim holds for all $t$, in particular for $t=1$ and any initial $\mathbf{x}_1$. Therefore the backward recursion both **certifies** optimality (verification) and **constructs** an optimal policy (synthesis), while recovering the full family $\{J_t^\star\}_{t=1}^T$.
The Bolza DOCP over the whole horizon couples all controls through the dynamics and is typically posed as a single large nonlinear program. The proof shows that you can instead solve **a sequence of $T-1$ stages of one-step problems**: at each $(t,\mathbf{x})$ minimize
$$
\mathbf{u}\mapsto c_t(\mathbf{x},\mathbf{u}) + J_{t+1}^\star(\mathbf{f}_t(\mathbf{x},\mathbf{u})).
$$
In finite state–action spaces this becomes pure table lookup and argmin. In continuous spaces you still solve local one-step minimizations, but you avoid a monolithic horizon-coupled NLP.
Unroll time to form a DAG whose nodes are $(t,\mathbf{x})$ and whose edges correspond to feasible controls with edge weight $c_t(\mathbf{x},\mathbf{u})$. The terminal node cost is $c_\mathrm{T}(\cdot)$. The Bolza problem is a shortest-path problem on this DAG. The equation
$$
J_t^\star(\mathbf{x})=\min_{\mathbf{u}}\{c_t(\mathbf{x},\mathbf{u})+J_{t+1}^\star(\mathbf{f}_t(\mathbf{x},\mathbf{u}))\}
$$
is exactly the dynamic programming recursion for shortest paths on acyclic graphs, hence backward induction is optimal.
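To make the graph view concrete, here is a small sketch (the layered graph and its edge weights are invented for illustration) checking that backward recursion on the unrolled DAG matches brute-force enumeration of all paths:

```python
import itertools

# A tiny layered DAG (edge weights invented for illustration): at each of
# three stages the state is in {0, 1}, the control u in {0, 1} moves the
# system to state u, and the edge weight is given by the table below.
T = 4                                       # stages 1..T, controls at 1..T-1
cost = {(t, x, u): (x + 1) * (u + 2) + t
        for t in range(1, T) for x in (0, 1) for u in (0, 1)}

# Backward recursion on the DAG: J_T = 0 (no terminal cost here),
# J_t(x) = min_u [ cost(t, x, u) + J_{t+1}(u) ]
J = {(T, x): 0.0 for x in (0, 1)}
for t in range(T - 1, 0, -1):
    for x in (0, 1):
        J[(t, x)] = min(cost[(t, x, u)] + J[(t + 1, u)] for u in (0, 1))

# Brute force: enumerate every control sequence and keep the cheapest path
def path_cost(x0, us):
    total, x = 0.0, x0
    for t, u in enumerate(us, start=1):
        total += cost[(t, x, u)]
        x = u                               # next state is the chosen control
    return total

brute = min(path_cost(0, us)
            for us in itertools.product((0, 1), repeat=T - 1))
# J[(1, 0)] and brute coincide: both equal the shortest-path value
```

Brute force enumerates $2^{T-1}$ paths while the recursion touches each edge once; on larger graphs this is exactly the gap between exponential and linear work.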
Dynamic programming is often used in resource management and conservation biology to devise policies to be implemented by decision makers and stakeholders: for example, in fisheries or timber harvesting. Following {cite}`Conroy2013`, we consider a population of a particular species, whose abundance we denote by
Here,
In our population model, the abundance of a species
where the
Applying the principle of optimality, we can express the optimal value function
with the boundary condition
It's worth noting that while this example uses a relatively simple model, the same principles can be applied to more complex scenarios involving stochasticity, multiple species interactions, or spatial heterogeneity.
:tags: [hide-input]
# label: dp-harvest-policy
# caption: Dynamic programming harvest example: printed output shows the optimal policy table, resulting population trajectory, and per-period harvests for an initial population of 50 fish.
%config InlineBackend.figure_format = 'retina'
import numpy as np
# Parameters
r_max = 0.3
K = 125
T = 20 # Number of time steps
N_max = 100 # Maximum population size to consider
h_max = 0.5 # Maximum harvest rate
h_step = 0.1 # Step size for harvest rate
# Create state and decision spaces
N_space = np.arange(1, N_max + 1)
h_space = np.arange(0, h_max + h_step, h_step)
# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))
# Terminal value function (F_T)
def terminal_value(N):
return 0
# State return function (F)
def state_return(N, h):
return N * h
# State dynamics function
def state_dynamics(N, h):
return N + r_max * N * (1 - N / K) - N * h
# Backward iteration
for t in range(T - 1, -1, -1):
for i, N in enumerate(N_space):
max_value = float('-inf')
best_h = 0
for h in h_space:
if h > 1: # Ensure harvest rate doesn't exceed 100%
continue
next_N = state_dynamics(N, h)
if next_N < 1: # Ensure population doesn't go extinct
continue
next_N_index = np.searchsorted(N_space, next_N)
if next_N_index == len(N_space):
next_N_index -= 1
value = state_return(N, h) + V[t + 1, next_N_index]
if value > max_value:
max_value = value
best_h = h
V[t, i] = max_value
policy[t, i] = best_h
# Function to simulate the optimal policy with conversion to Python floats
def simulate_optimal_policy(initial_N, T):
trajectory = [float(initial_N)] # Ensure first value is a Python float
harvests = []
for t in range(T):
N = trajectory[-1]
N_index = np.searchsorted(N_space, N)
if N_index == len(N_space):
N_index -= 1
h = policy[t, N_index]
harvests.append(float(N * h)) # Ensure harvest is a Python float
next_N = state_dynamics(N, h)
trajectory.append(float(next_N)) # Ensure next population value is a Python float
return trajectory, harvests
# Example usage
initial_N = 50
trajectory, harvests = simulate_optimal_policy(initial_N, T)
print("Optimal policy:")
print(policy)
print("\nPopulation trajectory:", trajectory)
print("Harvests:", harvests)
print("Total harvest:", sum(harvests))
In many real-world problems, such as our resource management example, the state space is inherently continuous. Dynamic programming, however, is usually defined on discrete state spaces. To reconcile this, we approximate the value function on a finite grid of points and use interpolation to estimate its value elsewhere.
In our earlier example, we acted as if population sizes could only be whole numbers: 1 fish, 2 fish, 3 fish. But real measurements don't fit neatly. What do you do with a survey that reports 42.7 fish? Our reflex in the code example was to snap to a nearby integer, effectively saying "let's just call it 43." This corresponds to nearest-neighbor interpolation, also known as discretization. It's the zeroth-order case: you assume the value between grid points is constant and equal to the closest one. In practice, this amounts to overlaying a grid on the continuous landscape and forcing yourself to stand at the intersections. In our demo code, this step was carried out with numpy.searchsorted (which, strictly speaking, returns the insertion index and therefore rounds up to the next grid point, but the effect is the same zeroth-order scheme).
While easy to implement, nearest-neighbor interpolation can introduce artifacts:
- Decisions may change abruptly, even if the state only shifts slightly.
- Precision is lost, especially in regimes where small variations matter.
- The curse of dimensionality forces an impractically fine grid if many state variables are added.
To address these issues, we can use higher-order interpolation. Instead of taking the nearest neighbor, we estimate the value at off-grid points by leveraging multiple nearby values.
Suppose we have computed $J_{k+1}^\star$ at the grid points and fit an interpolant $I_{k+1}$ to those values. The Bellman update at any state then reads:

$$ J_k^\star(\mathbf{x}_k) = \min_{\mathbf{u}_k} \Big[ c_k(\mathbf{x}_k, \mathbf{u}_k)
+ I_{k+1}\big(\mathbf{f}_k(\mathbf{x}_k, \mathbf{u}_k)\big) \Big]. $$
For instance, in one dimension, linear interpolation gives:

$$
I(x) = \Big(1 - \tfrac{x - x_i}{x_{i+1} - x_i}\Big)\, J^\star(x_i) + \tfrac{x - x_i}{x_{i+1} - x_i}\, J^\star(x_{i+1}),
$$

where $x_i$ and $x_{i+1}$ are the grid points bracketing $x$.
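One-dimensional linear interpolation is available directly as numpy.interp; here is a small sketch (the grid values are invented for illustration) contrasting it with the nearest-neighbor rule:

```python
import numpy as np

# Value-function samples on a coarse grid (numbers invented for illustration)
N_grid = np.array([1.0, 2.0, 3.0, 4.0])
J_grid = np.array([10.0, 8.0, 5.0, 4.0])

x = 2.5  # off-grid state queried during a Bellman backup

# Zeroth order: snap to the closest grid point (ties go to the lower index)
nearest = float(J_grid[np.argmin(np.abs(N_grid - x))])

# First order: weight the two bracketing grid values linearly
linear = float(np.interp(x, N_grid, J_grid))
```

The nearest-neighbor estimate jumps between 8.0 and 5.0 as $x$ crosses the midpoint, while the linear estimate varies continuously — precisely the abrupt-decision artifact described above.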
In higher-dimensional spaces, naive interpolation becomes prohibitively expensive due to the curse of dimensionality. Several approaches such as tensorized multilinear interpolation, radial basis functions, and machine learning models address this challenge by extending a common principle: they approximate the value function at unobserved points using information from a finite set of evaluations. However, as dimensionality continues to grow, even tensor methods face scalability limits, which is why flexible parametric models like neural networks have become essential tools for high-dimensional function approximation.
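For tensor grids in a few dimensions, multilinear interpolation is available off the shelf; here is a sketch using scipy.interpolate.RegularGridInterpolator on an invented two-dimensional value table (the query is exact here because the tabulated function is itself linear):

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# A 2-D value table on a tensor grid (invented for illustration): J(x, y) = x + 2y
x = np.linspace(0.0, 1.0, 5)
y = np.linspace(0.0, 1.0, 5)
X, Y = np.meshgrid(x, y, indexing='ij')
J_table = X + 2.0 * Y

# Multilinear interpolation (method='linear' is the default)
interp = RegularGridInterpolator((x, y), J_table)
val = float(interp([[0.3, 0.6]])[0])   # off-grid query
```

With $d$ state variables a tensor grid of $n$ points per axis stores $n^d$ values, which is the storage face of the curse of dimensionality motivating the parametric models mentioned above.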
:label: backward-recursion-interp
**Input:**
- Terminal cost $c_\mathrm{T}(\cdot)$
- Stage costs $c_t(\cdot,\cdot)$
- Dynamics $f_t(\cdot,\cdot)$
- Time horizon $T$
- State grid $\mathcal{X}_\text{grid}$
- Action set $\mathcal{U}$
- Interpolation method $\mathcal{I}(\cdot)$ (e.g., linear, cubic spline, RBF, neural network)
**Output:** Value functions $J_t^\star(\cdot)$ and policies $\pi_t^\star(\cdot)$ for all $t$
1. **Initialize terminal values:**
- Compute $J_T^\star(\mathbf{x}) = c_\mathrm{T}(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}_\text{grid}$
- Fit interpolator: $I_T \leftarrow \mathcal{I}(\{\mathbf{x}, J_T^\star(\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}_\text{grid}})$
2. **Backward recursion:**
For $t = T-1, T-2, \dots, 0$:
a. **Bellman update at grid points:**
For each $\mathbf{x} \in \mathcal{X}_\text{grid}$:
- For each $\mathbf{u} \in \mathcal{U}$:
- Compute next state: $\mathbf{x}_\text{next} = f_t(\mathbf{x}, \mathbf{u})$
- **Interpolate future cost:** $\hat{J}_{t+1}(\mathbf{x}_\text{next}) = I_{t+1}(\mathbf{x}_\text{next})$
- Compute total cost: $J_t(\mathbf{x}, \mathbf{u}) = c_t(\mathbf{x}, \mathbf{u}) + \hat{J}_{t+1}(\mathbf{x}_\text{next})$
- **Minimize over actions:** $J_t^\star(\mathbf{x}) = \min_{\mathbf{u} \in \mathcal{U}} J_t(\mathbf{x}, \mathbf{u})$
- Store optimal action: $\pi_t^\star(\mathbf{x}) = \arg\min_{\mathbf{u} \in \mathcal{U}} J_t(\mathbf{x}, \mathbf{u})$
b. **Fit interpolator for current stage:**
$I_t \leftarrow \mathcal{I}(\{\mathbf{x}, J_t^\star(\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}_\text{grid}})$
3. **Return:** Interpolated value functions $\{I_t\}_{t=0}^T$ and policies $\{\pi_t^\star\}_{t=0}^{T-1}$
Here is a demonstration of the backward recursion procedure using linear interpolation.
:tags: [hide-input]
# label: dp-harvest-interp
# caption: Backward recursion with linear interpolation: console output summarizes the smoothed optimal policy, state trajectory, and harvest totals for the resource management example.
import numpy as np
# Parameters
r_max = 0.3
K = 125
T = 20 # Number of time steps
N_max = 100 # Maximum population size to consider
h_max = 0.5 # Maximum harvest rate
h_step = 0.1 # Step size for harvest rate
# Create state and decision spaces
N_space = np.arange(1, N_max + 1)
h_space = np.arange(0, h_max + h_step, h_step)
# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))
# Terminal value function (F_T)
def terminal_value(N):
return 0
# State return function (F)
def state_return(N, h):
return N * h
# State dynamics function
def state_dynamics(N, h):
return N + r_max * N * (1 - N / K) - N * h
# Function to linearly interpolate between grid points in N_space
def interpolate_value_function(V, N_space, next_N, t):
if next_N <= N_space[0]:
return V[t, 0] # Below or at minimum population, return minimum value
if next_N >= N_space[-1]:
return V[t, -1] # Above or at maximum population, return maximum value
# Find indices to interpolate between
lower_idx = np.searchsorted(N_space, next_N) - 1
upper_idx = lower_idx + 1
# Linear interpolation
N_lower = N_space[lower_idx]
N_upper = N_space[upper_idx]
weight = (next_N - N_lower) / (N_upper - N_lower)
return (1 - weight) * V[t, lower_idx] + weight * V[t, upper_idx]
# Backward iteration with interpolation
for t in range(T - 1, -1, -1):
for i, N in enumerate(N_space):
max_value = float('-inf')
best_h = 0
for h in h_space:
if h > 1: # Ensure harvest rate doesn't exceed 100%
continue
next_N = state_dynamics(N, h)
if next_N < 1: # Ensure population doesn't go extinct
continue
# Interpolate value for next_N
value = state_return(N, h) + interpolate_value_function(V, N_space, next_N, t + 1)
if value > max_value:
max_value = value
best_h = h
V[t, i] = max_value
policy[t, i] = best_h
# Function to simulate the optimal policy using interpolation
def simulate_optimal_policy(initial_N, T):
trajectory = [initial_N]
harvests = []
for t in range(T):
N = trajectory[-1]
# Interpolate optimal harvest rate
if N <= N_space[0]:
h = policy[t, 0]
elif N >= N_space[-1]:
h = policy[t, -1]
else:
lower_idx = np.searchsorted(N_space, N) - 1
upper_idx = lower_idx + 1
weight = (N - N_space[lower_idx]) / (N_space[upper_idx] - N_space[lower_idx])
h = (1 - weight) * policy[t, lower_idx] + weight * policy[t, upper_idx]
harvests.append(float(N * h)) # Ensure harvest is a Python float
next_N = state_dynamics(N, h)
trajectory.append(float(next_N)) # Ensure next population value is a Python float
return trajectory, harvests
# Example usage
initial_N = 50
trajectory, harvests = simulate_optimal_policy(initial_N, T)
print("Optimal policy:")
print(policy)
print("\nPopulation trajectory:", trajectory)
print("Harvests:", harvests)
print("Total harvest:", sum(harvests))
For pedagogical reasons, this example uses our own implementation of the linear interpolation procedure. A more general and practical approach is to use a built-in interpolation routine. Because our state space has a single dimension, we can simply use scipy.interpolate.interp1d, which offers various interpolation methods through its kind argument: 'linear', 'nearest', 'nearest-up', 'zero', 'slinear', 'quadratic', 'cubic', 'previous', or 'next'.
Here is a more general implementation using cubic interpolation through the scipy.interpolate.interp1d function:
:tags: [hide-input]
# label: dp-harvest-cubic
# caption: Cubic interpolation further smooths the optimal harvest policy—this output prints the leading rows of the policy table along with the resulting trajectory and harvest statistics.
import numpy as np
from scipy.interpolate import interp1d
# Parameters
r_max = 0.3
K = 125
T = 20 # Number of time steps
N_max = 100 # Maximum population size to consider
h_max = 0.5 # Maximum harvest rate
h_step = 0.1 # Step size for harvest rate
# Create state and decision spaces
N_space = np.arange(1, N_max + 1)
h_space = np.arange(0, h_max + h_step, h_step)
# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))
# Terminal value function (F_T)
def terminal_value(N):
return 0
# State return function (F)
def state_return(N, h):
return N * h
# State dynamics function
def state_dynamics(N, h):
return N + r_max * N * (1 - N / K) - N * h
# Function to create interpolation function for a given time step
def create_interpolator(V_t, N_space):
return interp1d(N_space, V_t, kind='cubic', bounds_error=False, fill_value=(V_t[0], V_t[-1]))
# Backward iteration with interpolation
for t in range(T - 1, -1, -1):
interpolator = create_interpolator(V[t+1], N_space)
for i, N in enumerate(N_space):
max_value = float('-inf')
best_h = 0
for h in h_space:
if h > 1: # Ensure harvest rate doesn't exceed 100%
continue
next_N = state_dynamics(N, h)
if next_N < 1: # Ensure population doesn't go extinct
continue
# Use interpolation to get the value for next_N
value = state_return(N, h) + interpolator(next_N)
if value > max_value:
max_value = value
best_h = h
V[t, i] = max_value
policy[t, i] = best_h
# Function to simulate the optimal policy using interpolation
def simulate_optimal_policy(initial_N, T):
trajectory = [initial_N]
harvests = []
for t in range(T):
N = trajectory[-1]
# Create interpolator for the policy at time t
policy_interpolator = interp1d(N_space, policy[t], kind='cubic', bounds_error=False, fill_value=(policy[t][0], policy[t][-1]))
h = policy_interpolator(N)
harvests.append(float(N * h)) # Ensure harvest is a Python float
next_N = state_dynamics(N, h)
trajectory.append(float(next_N)) # Ensure next population value is a Python float
return trajectory, harvests
# Example usage
initial_N = 50
trajectory, harvests = simulate_optimal_policy(initial_N, T)
print("Optimal policy (first few rows):")
print(policy[:5])
print("\nPopulation trajectory:", trajectory)
print("Harvests:", harvests)
print("Total harvest:", sum(harvests))
While our previous discussion centered on deterministic systems, many real-world problems involve uncertainty. Stochastic Dynamic Programming (SDP) extends our framework to handle stochasticity in both the objective function and system dynamics. This extension naturally leads us to consider more general policy classes and to formalize when simpler policies suffice.
Before diving into stochastic systems, we need to establish terminology for the different types of strategies a decision maker might employ. In the deterministic setting, we implicitly used feedback controllers of the form
A decision rule is a prescription for action selection in each state at a specified decision epoch. These rules can vary in their complexity based on two main criteria:
- Dependence on history: Markovian or History-dependent
- Action selection method: Deterministic or Randomized
Markovian decision rules depend only on the current state, while history-dependent rules consider the entire sequence of past states and actions. Formally, a history $h_t$ is the sequence of states visited and actions taken up to time $t$.
The set of all possible histories at time $t$, denoted $H_t$, grows as follows:

- $H_1 = \mathcal{S}$ (just the initial state)
- $H_2 = \mathcal{S} \times \mathcal{A} \times \mathcal{S}$
- $H_t = \mathcal{S} \times (\mathcal{A} \times \mathcal{S})^{t-1}$
Deterministic rules select an action with certainty, while randomized rules specify a probability distribution over the action space.
These classifications lead to four types of decision rules:
- **Markovian Deterministic (MD):** $\pi_t: \mathcal{S} \rightarrow \mathcal{A}_s$
- **Markovian Randomized (MR):** $\pi_t: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A}_s)$
- **History-dependent Deterministic (HD):** $\pi_t: H_t \rightarrow \mathcal{A}_s$
- **History-dependent Randomized (HR):** $\pi_t: H_t \rightarrow \mathcal{P}(\mathcal{A}_s)$

where $\mathcal{P}(\mathcal{A}_s)$ denotes the set of probability distributions over the action set $\mathcal{A}_s$.
A policy is a sequence of decision rules, one for each decision epoch.
The set of all policies of a given class $K \in \{\mathrm{MD}, \mathrm{MR}, \mathrm{HD}, \mathrm{HR}\}$ is denoted $\Pi^{K}$.
The largest set, $\Pi^{\mathrm{HR}}$, contains the other three classes as special cases.
:class: tip
- **Decision rule (kernel).** A map from information to action distributions:
- Markov, deterministic: $\pi_t:\mathcal{S}\to\mathcal{A}_s$
- Markov, randomized: $\pi_t(\cdot\mid s)\in\Delta(\mathcal{A}_s)$
- History-dependent: $\pi_t(\cdot\mid h_t)\in\Delta(\mathcal{A}_{s_t})$
- **Policy (sequence).** $\boldsymbol{\pi}=(\pi_1,\pi_2,\ldots)$.
- **Stationary policy.** $\boldsymbol{\pi}=\mathrm{const}(\pi)$ with $\pi_t\equiv\pi \ \forall t$.
By convention, we identify $\pi$ with its stationary policy $\mathrm{const}(\pi)$ when no confusion arises.
In the stochastic setting, our system evolution takes the form:
Here,
In this context, our objective shifts from minimizing a deterministic cost to minimizing the expected total cost:
where the expectation is taken over the distributions of the random variables
In practice, this expectation is often computed by discretizing the distribution of the disturbances and summing over the resulting finite set of outcomes.
When dealing with stochastic systems, a central question arises: what information should our control policy use? In the most general case, a policy might use the entire history of observations and actions. However, as we'll see, the Markovian structure of our problems allows for dramatic simplifications.
Let
where we now explicitly use the transition probabilities
:label: stoch-principle-opt
Let $u_t^*$ be the optimal expected return from epoch $t$ onward. Then:
**a.** $u_t^*$ satisfies the optimality equations:
$$u_t^*(h_t) = \sup_{a \in A_{s_t}} \left\{ r_t(s_t, a) + \sum_{j \in S} p_t(j|s_t, a) u_{t+1}^*(h_t, a, j) \right\}$$
with boundary condition $u_N^*(h_N) = r_N(s_N)$.
**b.** Any policy $\pi^*$ that selects actions attaining the supremum (or maximum) in the above equation at each history is optimal.
Intuition: This formalizes Bellman's principle of optimality: "An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." The recursive structure means that optimal local decisions (choosing the best action at each step) lead to global optimality, even with uncertainty captured by the transition probabilities.
A simplification occurs when we examine these history-based equations more closely. The Markov property of our system dynamics and rewards means that the optimal return actually depends on the history only through the current state:
:label: stoch-state-suff
In finite-horizon stochastic MDPs with Markovian dynamics and rewards, the optimal return $u_t^*(h_t)$ depends on the history only through the current state $s_t$. Thus we can write $u_t^*(h_t) = v_t^*(s_t)$ for some function $v_t^*$ that depends only on state and time.
We follow {cite:t}`Puterman1994`, Theorem 4.4.2, and proceed by backward induction.
**Base case:** At the terminal time $N$, we have $u_N^*(h_N) = r_N(s_N)$ by the boundary condition. Since the terminal reward depends only on the final state $s_N$ and not on how we arrived there, $u_N^*(h_N) = u_N^*(s_N)$.
**Inductive step:** Assume $u_{k}^*(h_{k})$ depends on $h_{k}$ only through $s_{k}$ at all stages $k = t+1, \ldots, N$. Then from the optimality equation:
$$u_t^*(h_t) = \sup_{a \in A_{s_t}} \left\{ r_t(s_t, a) + \sum_{j \in S} p_t(j|s_t, a) u_{t+1}^*(h_t, a, j) \right\}$$
By the induction hypothesis, $u_{t+1}^*(h_t, a, j)$ depends only on the next state $j$, so:
$$u_t^*(h_t) = \sup_{a \in A_{s_t}} \left\{ r_t(s_t, a) + \sum_{j \in S} p_t(j|s_t, a) u_{t+1}^*(j) \right\}$$
Since the expression in brackets depends on $h_t$ only through the current state $s_t$ (the rewards and transition probabilities are Markovian), we conclude that $u_t^*(h_t) = u_t^*(s_t)$.
Intuition: The Markov property means that the current state contains all information needed to predict future evolution. The past provides no additional value for decision-making. This result allows us to work with value functions $v_t^*(s)$ defined on states rather than on histories.
This state-sufficiency result, combined with the fact that randomization never helps when maximizing expected returns, leads to a dramatic simplification of the policy space:
:label: stoch-policy-reduction
For finite-horizon stochastic MDPs with finite state and action sets:
$$
\sup_{\pi \in \Pi^{\mathrm{HR}}} v_\pi(s,t) = \max_{\pi \in \Pi^{\mathrm{MD}}} v_\pi(s,t)
$$
That is, there exists an optimal policy that is both deterministic and Markovian.
Sketch following {cite:t}`Puterman1994` Lemma 4.3.1 and Theorem 4.4.2. First, Lemma 4.3.1 shows that for any function $w$ and any distribution $q$ over actions, $\sup_a w(a) \ge \sum_a q(a) w(a)$. Thus randomization cannot improve the expected value over choosing a single maximizing action. Second, by state sufficiency (Proposition {ref}`stoch-state-suff` and {cite:t}`Puterman1994` Thm. 4.4.2(a)), the optimal return depends on the history only through $(s_t,t)$. Therefore, selecting at each $(s_t,t)$ an action that attains the maximum yields a deterministic Markov decision rule which is optimal whenever the maximum is attained. If only a supremum exists, $\varepsilon$-optimal selectors exist by choosing actions within $\varepsilon$ of the supremum (see {cite:t}`Puterman1994` Thm. 4.3.4).
Intuition: Even in stochastic systems, randomization in the policy doesn't help when maximizing expected returns: you should always choose the action with the highest expected value. Combined with state sufficiency, this means simple state-to-action mappings are optimal.
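The first ingredient of the sketch — that randomization cannot improve on the best single action — is easy to verify numerically (the action values and the randomized rule below are arbitrary random draws, invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=6)           # arbitrary action values w(a)
q = rng.dirichlet(np.ones(6))    # an arbitrary randomized decision rule

best_single = float(w.max())     # value of the best deterministic action
randomized = float(q @ w)        # expected value under the randomized rule
# best_single >= randomized holds for every w and every distribution q,
# since a weighted average can never exceed the maximum of its terms
```

This is the whole content of Lemma 4.3.1: a convex combination of values is never larger than their maximum.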
These results justify focusing on deterministic Markov policies and lead to the backward recursion algorithm for stochastic systems:
:label: backward-recursion-stochastic
**Input:** Terminal cost function $c_\mathrm{T}(\cdot)$, stage cost functions $c_t(\cdot, \cdot, \cdot)$, system dynamics $\mathbf{f}_t(\cdot, \cdot, \cdot)$, time horizon $\mathrm{T}$, disturbance distributions
**Output:** Optimal value functions $J_t^\star(\cdot)$ and optimal control policies $\pi_t^\star(\cdot)$ for $t = 1, \ldots, T$
1. Initialize $J_T^\star(\mathbf{x}) = c_\mathrm{T}(\mathbf{x})$ for all $\mathbf{x}$ in the state space
2. For $t = T-1, T-2, \ldots, 1$:
1. For each state $\mathbf{x}$ in the state space:
1. Compute $J_t^\star(\mathbf{x}) = \min_{\mathbf{u}} \mathbb{E}_{\mathbf{w}_t}\left[c_t(\mathbf{x}, \mathbf{u}, \mathbf{w}_t) + J_{t+1}^\star(\mathbf{f}_t(\mathbf{x}, \mathbf{u}, \mathbf{w}_t))\right]$
2. Compute $\pi_t^\star(\mathbf{x}) = \arg\min_{\mathbf{u}} \mathbb{E}_{\mathbf{w}_t}\left[c_t(\mathbf{x}, \mathbf{u}, \mathbf{w}_t) + J_{t+1}^\star(\mathbf{f}_t(\mathbf{x}, \mathbf{u}, \mathbf{w}_t))\right]$
2. End For
3. End For
4. Return $J_t^\star(\cdot)$, $\pi_t^\star(\cdot)$ for $t = 1, \ldots, T$
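The stochastic recursion above can be sketched just as compactly as the deterministic one (the problem data — states, controls, disturbance distribution, quadratic costs — are invented for illustration):

```python
# Minimal sketch of the stochastic backward recursion on invented data:
# states {0,...,4}, controls {-1, 0, 1}, an additive disturbance w taking
# values {-1, 0, 1} with probabilities (0.25, 0.5, 0.25), stage cost
# x^2 + u^2, terminal cost x^2, horizon T = 4.
states = range(5)
controls = (-1, 0, 1)
w_vals, w_probs = (-1, 0, 1), (0.25, 0.5, 0.25)
T = 4

def f(x, u, w):                  # stochastic dynamics, clipped to the grid
    return min(max(x + u + w, 0), 4)

J = {T: {x: float(x**2) for x in states}}
pi = {}
for t in range(T - 1, 0, -1):
    J[t], pi[t] = {}, {}
    for x in states:
        # expected cost of each control, averaged over the disturbance
        q = {u: sum(p * (x**2 + u**2 + J[t + 1][f(x, u, w)])
                    for w, p in zip(w_vals, w_probs)) for u in controls}
        u_star = min(q, key=q.get)
        J[t][x], pi[t][x] = q[u_star], u_star
```

The only change from the deterministic sketch is the inner sum over disturbance outcomes: each one-step cost is now an expectation rather than a single evaluation, which is exactly where the combinatorial growth discussed below enters.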
While SDP provides a framework for handling uncertainty, it makes the curse of dimensionality even more difficult to handle in practice: both the state space and the disturbance space must be discretized, which can lead to a combinatorial explosion in the number of scenarios evaluated at each stage.
However, just as we tackled the challenges of continuous state spaces with discretization and interpolation, we can devise efficient methods to handle the additional complexity of evaluating expectations. This problem essentially becomes one of numerical integration. When the set of disturbances is continuous (as is often the case with continuous state spaces), we enter a domain where numerical quadrature methods could be applied. But these methods tend to scale poorly as the number of dimensions grows. This is where more efficient techniques, often rooted in Monte Carlo methods, come into play. Two ingredients tackle the curse of dimensionality:
- Function approximation (through discretization, interpolation, neural networks, etc.)
- Monte Carlo integration (simulation)
These two elements essentially distill the key ingredients of machine learning, which is the direction we'll be exploring in this course.
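To make the contrast concrete, here is a minimal sketch comparing Monte Carlo integration against one-dimensional quadrature for a single expectation. The integrand `g` is a made-up stand-in for the bracketed quantity $c_t(\mathbf{x}, \mathbf{u}, \mathbf{w}_t) + J_{t+1}^\star(\mathbf{f}_t(\mathbf{x}, \mathbf{u}, \mathbf{w}_t))$ viewed as a function of the disturbance; it is an illustrative assumption, not something defined in this chapter:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(0)

# Hypothetical integrand standing in for c_t(x, u, w) + J_{t+1}(f_t(x, u, w))
# as a function of the disturbance w ~ N(0, 1); g itself is an assumption.
def g(w):
    return np.cos(w) + 0.1 * w**2

# Monte Carlo estimate: average g over simulated disturbances.
samples = rng.normal(size=100_000)
mc_estimate = g(samples).mean()

# Reference value via 1D quadrature (feasible here, but quadrature
# scales poorly as the disturbance dimension grows).
quad_estimate, _ = quad(lambda w: g(w) * norm.pdf(w), -np.inf, np.inf)

print(mc_estimate, quad_estimate)  # both close to exp(-1/2) + 0.1
```

In one dimension the quadrature answer is essentially exact, while the Monte Carlo estimate carries $O(1/\sqrt{n})$ error; the point is that the Monte Carlo rate is independent of the disturbance dimension.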
Building upon our previous deterministic model, we now introduce stochasticity to more accurately reflect the uncertainties inherent in real-world resource management scenarios {cite:p}`Conroy2013`. As before, we consider a population of a particular species, whose abundance we denote by $x_t$.
Here,
By expressing the harvest rate as a random variable, we mean to capture the fact that harvesting is not completely under our control: we might obtain more or less than we had intended. Furthermore, we generalize the population dynamics to the stochastic case via:
$$
x_{t+1} = x_t + r_t x_t (1 - x_t/K) - h_t x_t
$$
where
where
where the expectation is taken over the harvest and growth rate random variables. The boundary condition remains
:tags: [hide-input]
# label: dp-harvest-stochastic
# caption: Stochastic resource management simulation: the cell reports the optimal policy sample, average trajectory, and visualizes ensemble trajectories plus the distribution of total harvest.
import numpy as np
from scipy.interpolate import interp1d
# Parameters
r_max = 0.3
K = 125
T = 30 # Number of time steps
N_max = 100 # Maximum population size to consider
h_max = 0.5 # Maximum harvest rate
h_step = 0.1 # Step size for harvest rate
# Create state and decision spaces
N_space = np.linspace(1, N_max, 100) # Using more granular state space
h_space = np.arange(0, h_max + h_step, h_step)
# Stochastic parameters
h_outcomes = np.array([0.75, 1.0, 1.25])
h_probs = np.array([0.25, 0.5, 0.25])
r_outcomes = np.array([0.85, 1.05, 1.15]) * r_max
r_probs = np.array([0.25, 0.5, 0.25])
# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))
# State return function (F)
def state_return(N, h):
return N * h
# State dynamics function (stochastic)
def state_dynamics(N, h, r):
return N + r * N * (1 - N / K) - h * N
# Function to create interpolation function for a given time step
def create_interpolator(V_t, N_space):
return interp1d(N_space, V_t, kind='linear', bounds_error=False, fill_value=(V_t[0], V_t[-1]))
# Backward iteration with stochastic dynamics
for t in range(T - 1, -1, -1):
interpolator = create_interpolator(V[t+1], N_space)
for i, N in enumerate(N_space):
max_value = float('-inf')
best_h = 0
for h in h_space:
if h > 1: # Ensure harvest rate doesn't exceed 100%
continue
expected_value = 0
for h_factor, h_prob in zip(h_outcomes, h_probs):
for r_factor, r_prob in zip(r_outcomes, r_probs):
realized_h = h * h_factor
realized_r = r_factor
next_N = state_dynamics(N, realized_h, realized_r)
if next_N < 1: # Ensure population doesn't go extinct
continue
# Use interpolation to get the value for next_N
value = state_return(N, realized_h) + interpolator(next_N)
expected_value += value * h_prob * r_prob
if expected_value > max_value:
max_value = expected_value
best_h = h
V[t, i] = max_value
policy[t, i] = best_h
# Function to simulate the optimal policy using interpolation (stochastic version)
def simulate_optimal_policy(initial_N, T, num_simulations=100):
all_trajectories = []
all_harvests = []
for _ in range(num_simulations):
trajectory = [initial_N]
harvests = []
for t in range(T):
N = trajectory[-1]
# Create interpolator for the policy at time t
policy_interpolator = interp1d(N_space, policy[t], kind='linear', bounds_error=False, fill_value=(policy[t][0], policy[t][-1]))
intended_h = policy_interpolator(N)
# Apply stochasticity
h_factor = np.random.choice(h_outcomes, p=h_probs)
r_factor = np.random.choice(r_outcomes, p=r_probs)
realized_h = intended_h * h_factor
harvests.append(N * realized_h)
next_N = state_dynamics(N, realized_h, r_factor)
trajectory.append(next_N)
all_trajectories.append(trajectory)
all_harvests.append(harvests)
return all_trajectories, all_harvests
# Example usage
initial_N = 50
trajectories, harvests = simulate_optimal_policy(initial_N, T)
# Calculate average trajectory and total harvest
avg_trajectory = np.mean(trajectories, axis=0)
avg_total_harvest = np.mean([sum(h) for h in harvests])
print("Optimal policy (first few rows):")
print(policy[:5])
print("\nAverage population trajectory:", avg_trajectory)
print("Average total harvest:", avg_total_harvest)
# Plot results
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
# Apply book style
try:
import scienceplots
plt.style.use(['science', 'notebook'])
except (ImportError, OSError):
pass # Use matplotlib defaults
plt.figure(figsize=(12, 6))
plt.subplot(121)
for traj in trajectories[:20]: # Plot first 20 trajectories
plt.plot(range(T+1), traj, alpha=0.3)
plt.plot(range(T+1), avg_trajectory, 'r-', linewidth=2)
plt.title('Population Trajectories')
plt.xlabel('Time')
plt.ylabel('Population')
plt.subplot(122)
plt.hist([sum(h) for h in harvests], bins=20)
plt.title('Distribution of Total Harvest')
plt.xlabel('Total Harvest')
plt.ylabel('Frequency')
plt.tight_layout()
We now examine a special case where the backward recursion admits a closed-form solution. When the system dynamics are linear and the cost function is quadratic, the optimization at each stage can be solved analytically. Moreover, the value function itself maintains a quadratic structure throughout the recursion, and the optimal policy reduces to a simple linear feedback law. This result eliminates the need for discretization, interpolation, or any function approximation. The infinite-dimensional problem collapses to tracking a finite set of matrices.
Consider a discrete-time linear system:

$$
\mathbf{x}_{t+1} = A_t\mathbf{x}_t + B_t\mathbf{u}_t
$$

where $\mathbf{x}_t$ is the state vector and $\mathbf{u}_t$ is the control input at time $t$.
The cost function to be minimized is quadratic:

$$
J = \frac{1}{2}\mathbf{x}_T^\top Q_T \mathbf{x}_T + \frac{1}{2}\sum_{t=0}^{T-1}\left(\mathbf{x}_t^\top Q_t \mathbf{x}_t + \mathbf{u}_t^\top R_t \mathbf{u}_t\right)
$$

where $Q_t \succeq 0$ and $Q_T \succeq 0$ are symmetric positive semidefinite state-cost matrices, and $R_t \succ 0$ is a symmetric positive definite control-cost matrix.
The key observation is that if the terminal cost is quadratic, then the value function at every earlier stage remains quadratic. This is not immediately obvious, but it follows from the structure of Bellman's equation combined with the linearity of the dynamics.
We claim that the optimal cost-to-go from any stage $t$ takes the quadratic form

$$
J_t^\star(\mathbf{x}_t) = \frac{1}{2}\mathbf{x}_t^\top P_t \mathbf{x}_t
$$

for some positive semidefinite matrix $P_t$.

Let's verify this structure and derive the recursion for $P_t$.
Substituting the dynamics $\mathbf{x}_{t+1} = A_t\mathbf{x}_t + B_t\mathbf{u}_t$ and the quadratic form for $J_{t+1}^\star$:

$$
J_t^\star(\mathbf{x}_t) = \min_{\mathbf{u}_t} \left[ \frac{1}{2}\mathbf{x}_t^\top Q_t \mathbf{x}_t + \frac{1}{2}\mathbf{u}_t^\top R_t \mathbf{u}_t + \frac{1}{2}(A_t\mathbf{x}_t + B_t\mathbf{u}_t)^\top P_{t+1} (A_t\mathbf{x}_t + B_t\mathbf{u}_t) \right]
$$

Expanding the last term:

$$
(A_t\mathbf{x}_t + B_t\mathbf{u}_t)^\top P_{t+1} (A_t\mathbf{x}_t + B_t\mathbf{u}_t) = \mathbf{x}_t^\top A_t^\top P_{t+1} A_t \mathbf{x}_t + 2\,\mathbf{u}_t^\top B_t^\top P_{t+1} A_t \mathbf{x}_t + \mathbf{u}_t^\top B_t^\top P_{t+1} B_t \mathbf{u}_t
$$

Collecting the terms involving $\mathbf{u}_t$, the expression inside the minimization becomes:

$$
\frac{1}{2}\mathbf{x}_t^\top \left(Q_t + A_t^\top P_{t+1} A_t\right) \mathbf{x}_t + \mathbf{u}_t^\top B_t^\top P_{t+1} A_t \mathbf{x}_t + \frac{1}{2}\mathbf{u}_t^\top \left(R_t + B_t^\top P_{t+1} B_t\right) \mathbf{u}_t
$$

This is a quadratic function of $\mathbf{u}_t$. Since $R_t + B_t^\top P_{t+1} B_t \succ 0$, the minimizer is found by setting the gradient with respect to $\mathbf{u}_t$ to zero:

$$
\left(R_t + B_t^\top P_{t+1} B_t\right)\mathbf{u}_t^\star + B_t^\top P_{t+1} A_t \mathbf{x}_t = 0
\quad\Longrightarrow\quad
\mathbf{u}_t^\star = -\left(R_t + B_t^\top P_{t+1} B_t\right)^{-1} B_t^\top P_{t+1} A_t \mathbf{x}_t
$$

Define the gain matrix:

$$
K_t = \left(R_t + B_t^\top P_{t+1} B_t\right)^{-1} B_t^\top P_{t+1} A_t
$$

so that $\mathbf{u}_t^\star = -K_t\mathbf{x}_t$.

Substituting $\mathbf{u}_t^\star$ back into the objective confirms that $J_t^\star(\mathbf{x}_t) = \frac{1}{2}\mathbf{x}_t^\top P_t \mathbf{x}_t$, where

$$
P_t = Q_t + A_t^\top P_{t+1} A_t - A_t^\top P_{t+1} B_t \left(R_t + B_t^\top P_{t+1} B_t\right)^{-1} B_t^\top P_{t+1} A_t
$$
Putting everything together, the backward induction procedure under the LQR setting then becomes:
:label: backward-recursion-lqr
**Input:** System matrices $A_t, B_t$, cost matrices $Q_t, R_t, Q_T$, time horizon $T$
**Output:** Cost matrices $P_t$ for $t = 0, \ldots, T$ and gain matrices $K_t$ for $t = 0, \ldots, T-1$
1. **Initialize:** $P_T = Q_T$
2. **For** $t = T-1, T-2, \ldots, 0$:
1. Compute the gain matrix:
$$K_t = (R_t + B_t^\top P_{t+1} B_t)^{-1} B_t^\top P_{t+1} A_t$$
2. Compute the cost matrix via the Riccati equation:
$$P_t = Q_t + A_t^\top P_{t+1} A_t - A_t^\top P_{t+1} B_t (R_t + B_t^\top P_{t+1} B_t)^{-1} B_t^\top P_{t+1} A_t$$
3. **End For**
4. **Return:** $\{P_0, \ldots, P_T\}$ and $\{K_0, \ldots, K_{T-1}\}$
**Optimal policy:** $\mathbf{u}_t^\star = -K_t\mathbf{x}_t$
**Optimal cost-to-go:** $J_t^\star(\mathbf{x}_t) = \frac{1}{2}\mathbf{x}_t^\top P_t \mathbf{x}_t$
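The recursion above is short enough to sketch directly. The system and cost matrices below (a discretized double integrator with unit state costs) are illustrative assumptions, not part of the text:

```python
import numpy as np

# Backward Riccati recursion for finite-horizon LQR on an illustrative
# time-invariant double integrator (A, B, Q, R, Q_T are assumptions).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)          # stage state cost
R = np.array([[0.1]])  # stage control cost
Q_T = np.eye(2)        # terminal cost
T = 50

P = Q_T
gains = []
for t in range(T - 1, -1, -1):
    # K_t = (R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    gains.append(K)
    # Riccati update: P_t = Q + A^T P_{t+1} A - A^T P_{t+1} B K_t
    P = Q + A.T @ P @ A - A.T @ P @ B @ K
gains.reverse()

# Closed-loop simulation with the linear feedback law u_t = -K_t x_t
x = np.array([[1.0], [0.0]])
for K in gains:
    u = -K @ x
    x = A @ x + B @ u
print(np.linalg.norm(x))  # state driven toward the origin
```

Note that no discretization or interpolation of the state space is needed: the entire value function is carried by the sequence of matrices $P_t$.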
Rather than expressing the stochasticity in our system through a disturbance term fed into a deterministic difference equation, we often work with an alternative representation, more common in operations research: the Markov Decision Process formulation. When we model our system this way, with the disturbance term drawn independently of the previous stages, the induced trajectories are those of a Markov chain. Hence, we can re-cast our control problem in that language, leading to the so-called Markov Decision Process framework, in which we express the system dynamics in terms of transition probabilities rather than explicit state equations. In this framework, we express the probability that the system is in a given state using the transition probability function:
This function gives the probability of transitioning to state
Given the control theory formulation of our problem via a deterministic dynamics function and a noise term, we can derive the corresponding transition probability function through the following relationship:
Here,
For a system with deterministic dynamics and no disturbance, the transition probabilities become much simpler and can be expressed using the indicator function. Given a deterministic system with dynamics:
The transition probability function can be expressed as:
With this transition probability function, we can recast our Bellman optimality equation:
Here,
This formulation offers several advantages:

- It makes the Markovian nature of the problem explicit: the future state depends only on the current state and action, not on the history of states and actions.
- For discrete-state problems, the entire system dynamics can be specified by a set of transition matrices, one for each possible action.
- It allows us to bridge the gap with the wealth of methods in the field of probabilistic graphical models and statistical machine learning techniques for modelling and analysis.
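As a small illustration of the second point, here is a sketch (with made-up numbers) of specifying a discrete MDP by one transition matrix per action, together with the indicator-function form induced by deterministic dynamics:

```python
import numpy as np

# Illustrative 3-state, 2-action MDP (the numbers are assumptions, not
# taken from the text): one transition matrix per action, with rows
# indexed by the current state s and columns by the next state j.
P = {
    0: np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.2],
                 [0.0, 0.0, 1.0]]),
    1: np.array([[0.2, 0.8, 0.0],
                 [0.1, 0.1, 0.8],
                 [0.0, 0.0, 1.0]]),
}

# Each row must be a probability distribution over next states.
for a, Pa in P.items():
    assert np.allclose(Pa.sum(axis=1), 1.0)

# Deterministic dynamics f(s, a) correspond to indicator (one-hot) rows:
# the transition matrix places probability 1 on j = f(s, a).
def deterministic_matrix(f, n_states, a):
    M = np.zeros((n_states, n_states))
    for s in range(n_states):
        M[s, f(s, a)] = 1.0
    return M

# Hypothetical dynamics: move one state to the right, saturating at state 2.
M = deterministic_matrix(lambda s, a: min(s + 1, 2), 3, a=0)
print(M)
```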
The presentation above was intended to bridge the gap between the control-theoretic perspective and the world of closed-loop control through the idea of determining the value function of a parametric optimal control problem. We then saw that the backward induction procedure applies to both the deterministic and stochastic cases by taking the expectation over the disturbance variable. Finally, instead of writing our model as a deterministic dynamics function taking a disturbance as input, we can work directly with its transition probability function, which gives rise to the Markov chain interpretation of our system in simulation.
Note that the notation used in control theory tends to differ from that found in operations research communities, in which the field of dynamic programming flourished. We summarize those (purely notational) differences in this section.
In operations research, the system state at each decision epoch is typically denoted by
The dynamics of the system are described by a transition probability function
It's worth noting that in operations research, we typically work with reward maximization rather than cost minimization, which is more common in control theory. However, we can easily switch between these perspectives by simply negating the quantity. That is, maximizing a reward function is equivalent to minimizing its negative, which we would then call a cost function.
The reward function is denoted by
Combined together, these elements specify a Markov decision process, which is fully described by the tuple:
where
Let's go back to the starting point and define what it means for a policy to be optimal in a Markov Decision Process. For this, we will consider different possible search spaces (policy classes) and compare policies based on the ordering of their value from any possible start state. The value of a policy
where
In finite-horizon MDPs, our goal is to identify an optimal policy, denoted by
We call $\boldsymbol{\pi}^\star$ an optimal policy because it yields the highest possible value across all states and all policies within the policy class $\Pi^{\text{HR}}$. We denote by $v^\star$ the maximum value achievable by any policy:
In reinforcement learning literature, $v^\star$ is typically referred to as the "optimal value function," while in some operations research references, it might be called the "value of an MDP." An optimal policy $\boldsymbol{\pi}^\star$ is one for which its value function equals the optimal value function:
This notion of optimality applies to every state. Policies optimal in this sense are sometimes called "uniformly optimal policies." A weaker notion of optimality, often encountered in reinforcement learning practice, is optimality with respect to an initial distribution of states. In this case, we seek a policy
where
The maximum value can be achieved by searching over the space of deterministic Markovian policies. Consequently:
This equality significantly simplifies the computational complexity of our algorithms, as the search problem can now be decomposed into
:label: backward-induction
**Input:** State space $S$, Action space $A$, Transition probabilities $p_t$, Reward function $r_t$, Time horizon $N$
**Output:** Optimal value functions $v^*$
1. Initialize:
- Set $t = N$
- For all $s_N \in S$:
$$v^*(s_N, N) = r_N(s_N)$$
2. For $t = N-1$ to $1$:
- For each $s_t \in S$:
a. Compute the optimal value function:
$$v^*(s_t, t) = \max_{a \in A_{s_t}} \left\{r_t(s_t, a) + \sum_{j \in S} p_t(j | s_t, a) v^*(j, t+1)\right\}$$
b. Determine the set of optimal actions:
$$A_{s_t,t}^* = \arg\max_{a \in A_{s_t}} \left\{r_t(s_t, a) + \sum_{j \in S} p_t(j | s_t, a) v^*(j, t+1)\right\}$$
3. Return the optimal value functions $v^*(s_t, t)$ and optimal action sets $A_{s_t,t}^*$ for all $t$ and $s_t$
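A minimal vectorized sketch of this backward induction, run on a randomly generated MDP (the states, rewards, and transition tensor below are illustrative assumptions, and the stage rewards are taken stationary for simplicity):

```python
import numpy as np

# Backward induction on a small illustrative finite-horizon MDP.
n_states, n_actions, N = 3, 2, 5
rng = np.random.default_rng(1)

# P[a, s, j] = p(j | s, a); each row drawn from a Dirichlet so it sums to 1.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.uniform(0, 1, size=(n_states, n_actions))   # r_t(s, a), stationary here
r_N = np.zeros(n_states)                            # terminal reward r_N(s)

v = np.zeros((N + 1, n_states))
pi = np.zeros((N, n_states), dtype=int)
v[N] = r_N
for t in range(N - 1, -1, -1):
    # q[s, a] = r(s, a) + sum_j p(j | s, a) * v_{t+1}(j)
    q = r + np.einsum('asj,j->sa', P, v[t + 1])
    v[t] = q.max(axis=1)        # optimal value function at stage t
    pi[t] = q.argmax(axis=1)    # one optimal action per state

print(v[0], pi[0])  # values and greedy actions at the first stage
```

With nonnegative rewards, adding a stage can only help, so the values are nondecreasing as the horizon grows, which gives a quick sanity check on the recursion.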
Note that the same procedure can also be used for finding the value of a given policy with minor changes:
:label: backward-policy-evaluation
**Input:**
- State space $S$
- Action space $A$
- Transition probabilities $p_t$
- Reward function $r_t$
- Time horizon $N$
- A Markovian deterministic policy $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_{N-1})$
**Output:** Value function $v^{\boldsymbol{\pi}}$ for policy $\boldsymbol{\pi}$
1. Initialize:
- Set $t = N$
- For all $s_N \in S$:
$$v^{\boldsymbol{\pi}}(s_N, N) = r_N(s_N)$$
2. For $t = N-1$ to $1$:
- For each $s_t \in S$:
a. Compute the value function for the given policy:
$$v^{\boldsymbol{\pi}}(s_t, t) = r_t(s_t, \pi_t(s_t)) + \sum_{j \in S} p_t(j | s_t, \pi_t(s_t)) v^{\boldsymbol{\pi}}(j, t+1)$$
3. Return the value function $v^{\boldsymbol{\pi}}(s_t, t)$ for all $t$ and $s_t$
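The policy-evaluation variant can be sketched the same way, again on randomly generated illustrative data; the only change is that the action is dictated by the policy rather than maximized over:

```python
import numpy as np

# Backward policy evaluation for a fixed Markovian deterministic policy
# on an illustrative finite-horizon MDP (all numbers are assumptions).
n_states, n_actions, N = 3, 2, 5
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, j]
r = rng.uniform(0, 1, size=(n_states, n_actions))
policy = rng.integers(n_actions, size=(N, n_states))              # pi_t(s)

v = np.zeros((N + 1, n_states))  # v[N] holds the terminal reward, zero here
for t in range(N - 1, -1, -1):
    for s in range(n_states):
        a = policy[t, s]
        # v_t(s) = r_t(s, pi_t(s)) + sum_j p_t(j | s, pi_t(s)) v_{t+1}(j)
        v[t, s] = r[s, a] + P[a, s] @ v[t + 1]

print(v[0])
```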
This procedure can also be adapted to support randomized policies by replacing the action $\pi_t(s_t)$ with an expectation over actions drawn from $\pi_t(\cdot \mid s_t)$.
Pharmaceutical development is the process of bringing a new drug from initial discovery to market availability. This process is lengthy, expensive, and risky, typically involving several stages:
- Drug Discovery: Identifying a compound that could potentially treat a disease.
- Preclinical Testing: Laboratory and animal testing to assess safety and efficacy.
- Clinical Trials: Testing the drug in humans, divided into phases:
- Phase I: Testing for safety in a small group of healthy volunteers.
- Phase II: Testing for efficacy and side effects in a larger group with the target condition.
- Phase III: Large-scale testing to confirm efficacy and monitor side effects.
- Regulatory Review: Submitting a New Drug Application (NDA) for approval.
- Post-Market Safety Monitoring: Continuing to monitor the drug's effects after market release.
This process can take 10-15 years and cost over $1 billion {cite}`Adams2009`. The high costs and risks involved call for a principled approach to decision making. We'll focus on the clinical trial phases and NDA approval, per the MDP model presented by {cite}`Chang2010`:
- States ($S$): Our state space is $S = \{s_1, s_2, s_3, s_4\}$, where:
  - $s_1$: Phase I clinical trial
  - $s_2$: Phase II clinical trial
  - $s_3$: Phase III clinical trial
  - $s_4$: NDA approval
- Actions ($A$): At each state, the action is choosing the sample size $n_i$ for the corresponding clinical trial. The action space is $A = \{10, 11, \ldots, 1000\}$, representing possible sample sizes.
- Transition Probabilities ($P$): The probability of moving from one state to the next depends on the chosen sample size and the inherent properties of the drug. We define:
  - $P(s_2|s_1, n_1) = p_{12}(n_1) = \sum_{i=0}^{\lfloor\eta_1 n_1\rfloor} \binom{n_1}{i} p_0^i (1-p_0)^{n_1-i}$, where $p_0$ is the true toxicity rate and $\eta_1$ is the toxicity threshold for Phase I.
  - Of particular interest is the transition from Phase II to Phase III, which we model as $P(s_3|s_2, n_2) = p_{23}(n_2) = \Phi\left(\frac{\sqrt{n_2}}{2}\delta - z_{1-\eta_2}\right)$, where $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution: $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-t^2/2} \, dt$. This gives us the probability that the Phase II trial produces evidence of a treatment effect strong enough to be declared significant at level $\eta_2$, making it more likely that the drug will progress to Phase III.

    In this expression, $\delta$ is called the "normalized treatment effect". In clinical trials, we're often interested in the difference between the treatment and control groups. The "normalized" part means we've adjusted this difference for the variability in the data. Specifically, $\delta = \frac{\mu_t - \mu_c}{\sigma}$, where $\mu_t$ is the mean outcome in the treatment group, $\mu_c$ is the mean outcome in the control group, and $\sigma$ is the standard deviation of the outcome. A larger $\delta$ indicates a stronger treatment effect.

    Furthermore, the term $z_{1-\eta_2}$ is the $(1-\eta_2)$-quantile of the standard normal distribution. In other words, it's the value where the probability of a standard normal random variable being greater than this value is $\eta_2$. For example, if $\eta_2 = 0.05$, then $z_{1-\eta_2} \approx 1.645$. A smaller $\eta_2$ makes the trial more conservative, requiring stronger evidence to proceed to Phase III.

    Finally, $n_2$ is the sample size for Phase II. The $\sqrt{n_2}$ term reflects that the precision of our estimate of the treatment effect improves with the square root of the sample size.
  - $P(s_4|s_3, n_3) = p_{34}(n_3) = \Phi\left(\frac{\sqrt{n_3}}{2}\delta - z_{1-\eta_3}\right)$, where $\eta_3$ is the significance level for Phase III.
- Rewards ($R$): The reward function captures the costs of running trials and the potential profit from a successful drug:
  - $r(s_i, n_i) = -c_i(n_i)$ for $i = 1, 2, 3$, where $c_i(n_i)$ is the cost of running a trial with sample size $n_i$.
  - $r(s_4) = g_4$, where $g_4$ is the expected profit from a successful drug.
- Discount Factor ($\gamma$): We use a discount factor $0 < \gamma \leq 1$ to account for the time value of money and risk preferences.
:tags: [hide-input]
# label: dp-clinical-trials
# caption: Clinical trial phase-sizing via backward induction: the console output lists the phase values, recommended enrollment for each phase, and basic sanity checks on the resulting policy.
import numpy as np
from scipy.stats import binom
from scipy.stats import norm
def binomial_pmf(k, n, p):
return binom.pmf(k, n, p)
def transition_prob_phase1(n1, eta1, p0):
return np.sum([binomial_pmf(i, n1, p0) for i in range(int(eta1 * n1) + 1)])
def transition_prob_phase2(n2, eta2, delta):
return norm.cdf((np.sqrt(n2) / 2) * delta - norm.ppf(1 - eta2))
def transition_prob_phase3(n3, eta3, delta):
return norm.cdf((np.sqrt(n3) / 2) * delta - norm.ppf(1 - eta3))
def immediate_reward(n):
return -n # Negative to represent cost
def backward_induction(S, A, gamma, g4, p0, delta, eta1, eta2, eta3):
V = np.zeros(len(S))
V[3] = g4 # Value for NDA approval state
optimal_n = [None] * 3 # Store optimal n for each phase
# Backward induction
for i in range(2, -1, -1): # Iterate backwards from Phase III to Phase I
max_value = -np.inf
for n in A:
if i == 0: # Phase I
p = transition_prob_phase1(n, eta1, p0)
elif i == 1: # Phase II
p = transition_prob_phase2(n, eta2, delta)
else: # Phase III
p = transition_prob_phase3(n, eta3, delta)
value = immediate_reward(n) + gamma * p * V[i+1]
if value > max_value:
max_value = value
optimal_n[i] = n
V[i] = max_value
return V, optimal_n
# Set up the problem parameters
S = ['Phase I', 'Phase II', 'Phase III', 'NDA approval']
A = range(10, 1001)
gamma = 0.95
g4 = 10000
p0 = 0.1 # Example toxicity rate for Phase I
delta = 0.5 # Example normalized treatment difference
eta1, eta2, eta3 = 0.2, 0.1, 0.025
# Run the backward induction algorithm
V, optimal_n = backward_induction(S, A, gamma, g4, p0, delta, eta1, eta2, eta3)
# Print results
for i, state in enumerate(S):
print(f"Value for {state}: {V[i]:.2f}")
print(f"Optimal sample sizes: Phase I: {optimal_n[0]}, Phase II: {optimal_n[1]}, Phase III: {optimal_n[2]}")
# Sanity checks
print("\nSanity checks:")
print(f"1. NDA approval value: {V[3]}")
print(f"2. All values <= NDA approval value: {all(v <= V[3] for v in V)}")
print(f"3. Optimal sample sizes in range: {all(10 <= n <= 1000 for n in optimal_n if n is not None)}")
It often makes sense to model control problems over infinite horizons. We extend the previous setting and define the expected total reward of policy
One drawback of this model is that we could easily encounter values that are
Therefore, it is often more convenient to work with an alternative formulation which guarantees the existence of a limit: the expected total discounted reward of policy
for
Finally, another possibility for the infinite-horizon setting is the so-called average reward or gain of policy
We won't be working with this formulation in this course due to its inherent practical and theoretical complexities.
Extending the previous notion of optimality from finite-horizon models, a policy
Furthermore, the value of a discounted MDP
More often, we refer to
As for the finite-horizon setting, the infinite horizon discounted model does not require history-dependent policies, since for any
The use of discounting can be motivated both from a modeling perspective and as a means to ensure that the total reward remains bounded. From the modeling perspective, we can view discounting as a way to weight more or less importance on the immediate rewards vs. the long-term consequences. There is also another interpretation which stems from that of a finite horizon model but with an uncertain end time. More precisely:
Let
:label: prop-5-3-1
Suppose that the horizon $\nu$ follows a geometric distribution with parameter $\gamma$, $0 \leq \gamma < 1$, independent of the policy such that
$P(\nu=n) = (1-\gamma) \gamma^{n-1}, \, n=1,2, \ldots$, then $v_\nu^{\boldsymbol{\pi}}(s) = v_\gamma^{\boldsymbol{\pi}}(s)$ for all $s \in \mathcal{S}$ .
See proposition 5.3.1 in {cite}`Puterman1994`.
By definition of the finite-horizon value function and the law of total expectation:
$$
v_\nu^{\boldsymbol{\pi}}(s) = \sum_{n=1}^{\infty} P(\nu=n) \cdot v_n^{\boldsymbol{\pi}}(s) = \sum_{n=1}^{\infty} (1-\gamma) \gamma^{n-1} \cdot E_s^{\boldsymbol{\pi}} \left\{\sum_{t=1}^n r(S_t, A_t)\right\}.
$$
Combining the expectation with the sum over $n$:
$$
v_\nu^{\boldsymbol{\pi}}(s) = E_s^{\boldsymbol{\pi}} \left\{\sum_{n=1}^{\infty} (1-\gamma) \gamma^{n-1} \sum_{t=1}^n r(S_t, A_t)\right\}.
$$
**Reordering the summations:** Under the bounded reward assumption $|r(s,a)| \leq R_{\max}$ and $\gamma < 1$, we have
$$
E_s^{\boldsymbol{\pi}} \left\{\sum_{n=1}^{\infty} \sum_{t=1}^n |r(S_t, A_t)| \cdot (1-\gamma) \gamma^{n-1}\right\} \leq R_{\max} \sum_{n=1}^{\infty} n (1-\gamma) \gamma^{n-1} = \frac{R_{\max}}{1-\gamma} < \infty,
$$
which justifies exchanging the order of summation by Fubini's theorem.
To reverse the order, note that the pair $(n,t)$ with $1 \leq t \leq n$ can be reindexed by fixing $t$ first and letting $n$ range from $t$ to $\infty$:
$$
\sum_{n=1}^{\infty} \sum_{t=1}^n = \sum_{t=1}^{\infty} \sum_{n=t}^{\infty}.
$$
Therefore:
\begin{align*}
v_\nu^{\boldsymbol{\pi}}(s) &= E_s^{\boldsymbol{\pi}} \left\{\sum_{t=1}^{\infty} r(S_t, A_t) \sum_{n=t}^{\infty} (1-\gamma) \gamma^{n-1}\right\}.
\end{align*}
**Evaluating the inner sum:** Using the substitution $m = n - t + 1$ (so $n = m + t - 1$):
\begin{align*}
\sum_{n=t}^{\infty} (1-\gamma) \gamma^{n-1} &= \sum_{m=1}^{\infty} (1-\gamma) \gamma^{m+t-2} \\
&= \gamma^{t-1} (1-\gamma) \sum_{m=1}^{\infty} \gamma^{m-1} \\
&= \gamma^{t-1} (1-\gamma) \cdot \frac{1}{1-\gamma} = \gamma^{t-1}.
\end{align*}
Substituting back:
$$
v_\nu^{\boldsymbol{\pi}}(s) = E_s^{\boldsymbol{\pi}} \left\{\sum_{t=1}^{\infty} \gamma^{t-1} r(S_t, A_t)\right\} = v_\gamma^{\boldsymbol{\pi}}(s).
$$
Let $V$ be the set of bounded real-valued functions on a discrete state space $S$. This means any function $f \in V$ satisfies the condition:

$$
\|f\|_\infty = \max_{s \in S} |f(s)| < \infty,
$$

where the notation $\|f\|_\infty$ represents the sup-norm (or $\ell_\infty$-norm) of the function $f$.
When working with discrete state spaces, we can interpret elements of V as vectors and linear operators on V as matrices, allowing us to leverage tools from linear algebra. The sup-norm (
where
For a Markovian decision rule
\begin{align*}
\mathbf{r}_\pi(s) &\equiv r(s, \pi(s)), \quad \mathbf{r}_\pi \in \mathbb{R}^{|S|}, \\
[\mathbf{P}_\pi]_{s,j} &\equiv p(j \mid s, \pi(s)), \quad \mathbf{P}_\pi \in \mathbb{R}^{|S| \times |S|}.
\end{align*}
For a randomized decision rule
\begin{align*}
\mathbf{r}_\pi(s) &\equiv \sum_{a \in A_s} \pi(a \mid s) \, r(s, a), \\
[\mathbf{P}_\pi]_{s,j} &\equiv \sum_{a \in A_s} \pi(a \mid s) \, p(j \mid s, a).
\end{align*}
In both cases, $\mathbf{r}_\pi$ denotes a reward vector in $\mathbb{R}^{|S|}$, with each component $\mathbf{r}_\pi(s)$ representing the reward associated with state $s$.
For a nonstationary Markovian policy
Using vector notation, this can be expressed as:
This formulation leads to a recursive relationship:
where
For a stationary policy
This last expression is called a Neumann series expansion, and it is guaranteed to exist under the assumptions of bounded rewards and a discount factor strictly less than one.
:label: neumann-series
The **spectral radius** of a matrix $\mathbf{H}$ is defined as:
$$
\rho(\mathbf{H}) \equiv \max_{i} |\lambda_i(\mathbf{H})|
$$
where $\lambda_i(\mathbf{H})$ are the eigenvalues of $\mathbf{H}$.
**Neumann Series Existence:** For any matrix $\mathbf{H}$, the Neumann series
$$
\sum_{t=0}^{\infty} \mathbf{H}^t = \mathbf{I} + \mathbf{H} + \mathbf{H}^2 + \cdots
$$
converges if and only if $\rho(\mathbf{H}) < 1$. When this condition holds, the matrix $(\mathbf{I} - \mathbf{H})$ is invertible and
$$
(\mathbf{I} - \mathbf{H})^{-1} = \sum_{t=0}^{\infty} \mathbf{H}^t.
$$
Note that for any induced matrix norm $\|\cdot\|$, the spectral radius satisfies $\rho(\mathbf{H}) \leq \|\mathbf{H}\|$. This inequality provides a practical way to verify the convergence condition $\rho(\mathbf{H}) < 1$. We can now verify that this condition holds for $\mathbf{H} = \gamma \mathbf{P}_\pi$:
- Norm of the transition matrix: Since $\mathbf{P}_\pi$ is a stochastic matrix (each row sums to 1 and all entries are non-negative), its $\ell_\infty$-norm is:

  $$\|\mathbf{P}_\pi\|_\infty = \max_{s \in S} \sum_{j \in S} [\mathbf{P}_\pi]_{s,j} = \max_{s \in S} 1 = 1.$$

- Norm of the scaled matrix: Using the homogeneity property of norms, we have:

  $$\|\gamma \mathbf{P}_\pi\|_\infty = |\gamma| \cdot \|\mathbf{P}_\pi\|_\infty = |\gamma| \cdot 1 = |\gamma|.$$

- Bounding the spectral radius: Since the spectral radius is bounded by the matrix norm:

  $$\rho(\gamma \mathbf{P}_\pi) \leq \|\gamma \mathbf{P}_\pi\|_\infty = |\gamma|.$$

- Verifying convergence: Since $0 \leq \gamma < 1$ by assumption, we have:

  $$\rho(\gamma \mathbf{P}_\pi) \leq |\gamma| < 1.$$

  This strict inequality guarantees that $(\mathbf{I} - \gamma \mathbf{P}_\pi)$ is invertible and the Neumann series converges.
Therefore, the Neumann series expansion converges and yields:
Consequently, for a stationary policy,
which can be rearranged to:
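In matrix form, the rearranged relation is $\mathbf{v}_\gamma^{\pi} = (\mathbf{I} - \gamma \mathbf{P}_\pi)^{-1}\mathbf{r}_\pi$, which we can check numerically. The transition matrix and reward vector below are random illustrative data:

```python
import numpy as np

# Policy evaluation by direct linear solve: v = (I - gamma P_pi)^{-1} r_pi.
# P_pi and r_pi are illustrative assumptions, not taken from the text.
rng = np.random.default_rng(3)
n = 4
gamma = 0.9
P_pi = rng.dirichlet(np.ones(n), size=n)  # row-stochastic |S| x |S| matrix
r_pi = rng.uniform(0, 1, size=n)

v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# The solution satisfies the fixed-point equation v = r_pi + gamma P_pi v:
assert np.allclose(v, r_pi + gamma * P_pi @ v)

# And it matches the truncated Neumann series sum_t (gamma P_pi)^t r_pi:
approx = np.zeros(n)
term = r_pi.copy()
for _ in range(500):
    approx += term
    term = gamma * P_pi @ term
print(np.max(np.abs(v - approx)))  # tiny truncation error
```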
We can also characterize $\mathbf{v}_\gamma^{\pi}$ as the solution to an operator equation. More specifically, define the transformation $\mathrm{L}_\pi$ by

$$
\mathrm{L}_\pi \mathbf{v} \equiv \mathbf{r}_\pi + \gamma \mathbf{P}_\pi \mathbf{v}
$$

for any $\mathbf{v} \in V$.
While we often refer to $\mathrm{L}_\pi$ as a "linear operator" in the RL literature, it is technically an **affine operator** (or affine transformation), not a linear operator in the strict sense. To see why, recall that a linear operator $\mathcal{T}$ must satisfy:
1. **Additivity:** $\mathcal{T}(\mathbf{v}_1 + \mathbf{v}_2) = \mathcal{T}(\mathbf{v}_1) + \mathcal{T}(\mathbf{v}_2)$
2. **Homogeneity:** $\mathcal{T}(\alpha \mathbf{v}) = \alpha \mathcal{T}(\mathbf{v})$ for all scalars $\alpha$
However, $\mathrm{L}_\pi$ fails the additivity test:
$$
\mathrm{L}_\pi(\mathbf{v}_1 + \mathbf{v}_2) = \mathbf{r}_\pi + \gamma \mathbf{P}_\pi(\mathbf{v}_1 + \mathbf{v}_2) = \mathbf{r}_\pi + \gamma \mathbf{P}_\pi\mathbf{v}_1 + \gamma \mathbf{P}_\pi\mathbf{v}_2
$$
while
$$
\mathrm{L}_\pi(\mathbf{v}_1) + \mathrm{L}_\pi(\mathbf{v}_2) = (\mathbf{r}_\pi + \gamma \mathbf{P}_\pi\mathbf{v}_1) + (\mathbf{r}_\pi + \gamma \mathbf{P}_\pi\mathbf{v}_2) = 2\mathbf{r}_\pi + \gamma \mathbf{P}_\pi\mathbf{v}_1 + \gamma \mathbf{P}_\pi\mathbf{v}_2.
$$
The presence of the constant term $\mathbf{r}_\pi$ makes $\mathrm{L}_\pi$ affine rather than linear. An affine operator has the form $\mathcal{A}(\mathbf{v}) = \mathbf{b} + \mathcal{T}(\mathbf{v})$, where $\mathbf{b}$ is a constant vector and $\mathcal{T}$ is a linear operator. In our case, $\mathbf{b} = \mathbf{r}_\pi$ and $\mathcal{T}(\mathbf{v}) = \gamma \mathbf{P}_\pi\mathbf{v}$.
Despite this technical distinction, the term "linear operator" is commonly used in the reinforcement learning literature when referring to $\mathrm{L}_\pi$, following a slight abuse of terminology.
Therefore, we view $\mathrm{L}_\pi$ as an operator mapping elements of $V$ to $V$: i.e., $\mathrm{L}_\pi: V \rightarrow V$. The fact that the value function of a policy is the solution to a fixed-point equation can then be expressed with the statement:
This is a fixed-point equation: the value function $\mathbf{v}_\gamma^{\pi}$ is a fixed point of the operator $\mathrm{L}_\pi$.
The operator equation we encountered in MDPs, $\mathbf{v}_\gamma^{\pi} = \mathrm{L}_\pi \mathbf{v}_\gamma^{\pi}$, is a specific instance of a more general class of problems known as operator equations. These equations appear in various fields of mathematics and applied sciences, ranging from differential equations to functional analysis.
Operator equations can take several forms, each with its own characteristics and solution methods:
- Fixed Point Form: $x = \mathrm{T}(x)$, where $\mathrm{T}: X \rightarrow X$. Common in fixed-point problems, such as our MDP equation; we seek a fixed point $x^*$ such that $x^* = \mathrm{T}(x^*)$.
- General Operator Equation: $\mathrm{T}(x) = y$, where $\mathrm{T}: X \rightarrow Y$. Here, $X$ and $Y$ can be different spaces. We seek an $x \in X$ that satisfies the equation for a given $y \in Y$.
- Nonlinear Equation: $\mathrm{T}(x) = 0$, where $\mathrm{T}: X \rightarrow Y$. A special case of the general operator equation where we seek roots or zeros of the operator.
- Variational Inequality: Find $x^* \in K$ such that $\langle \mathrm{T}(x^*), x - x^* \rangle \geq 0$ for all $x \in K$. Here, $K$ is a closed convex subset of $X$, and $\mathrm{T}: K \rightarrow X^*$ (the dual space of $X$). These problems often arise in optimization, game theory, and partial differential equations.
For equations in fixed point form, a common numerical solution method is successive approximation, also known as fixed-point iteration:
:label: successive-approximation
**Input:** An operator $\mathrm{T}: X \rightarrow X$, an initial guess $x_0 \in X$, and a tolerance $\epsilon > 0$
**Output:** An approximate fixed point $x^*$ such that $\|x^* - \mathrm{T}(x^*)\| < \epsilon$
1. Initialize $n = 0$
2. **repeat**
3. Compute $x_{n+1} = \mathrm{T}(x_n)$
4. If $\|x_{n+1} - x_n\| < \epsilon$, **return** $x_{n+1}$
5. Set $n = n + 1$
6. **until** convergence or maximum iterations reached
The convergence of successive approximation depends on the properties of the operator
where
However, the contraction mapping condition is not the only one that can lead to convergence. For instance, if
In practice, when dealing with specific problems like MDPs or differential equations, the properties of the operator often naturally align with one of these convergence conditions. For example, in discounted MDPs, the Bellman operator is a contraction in the supremum norm, which guarantees the convergence of value iteration.
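A quick numerical sketch of successive approximation applied to the policy-evaluation operator $\mathrm{L}_\pi \mathbf{v} = \mathbf{r}_\pi + \gamma \mathbf{P}_\pi \mathbf{v}$, which is a $\gamma$-contraction in the sup-norm. The data below are random illustrative assumptions:

```python
import numpy as np

# Fixed-point iteration v_{n+1} = L_pi(v_n) on illustrative random data.
rng = np.random.default_rng(4)
n, gamma = 5, 0.8
P_pi = rng.dirichlet(np.ones(n), size=n)  # row-stochastic transition matrix
r_pi = rng.uniform(0, 1, size=n)

def L(v):
    # Affine policy-evaluation operator: a gamma-contraction in sup-norm.
    return r_pi + gamma * P_pi @ v

v = np.zeros(n)
for k in range(1000):
    v_next = L(v)
    if np.max(np.abs(v_next - v)) < 1e-10:  # sup-norm stopping test
        break
    v = v_next

# Compare against the exact solution of (I - gamma P_pi) v = r_pi.
v_exact = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print(k, np.max(np.abs(v_next - v_exact)))  # error shrinks geometrically
```

The Banach fixed-point theorem guarantees that the iterates approach the unique fixed point at rate $\gamma$ per iteration, which matches the observed geometric decrease of the error.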
The Newton-Kantorovich method is a generalization of Newton's method from finite-dimensional vector spaces to infinite-dimensional function spaces: rather than iterating in the space of vectors, we iterate in the space of functions.
Newton's method is often written as the familiar update:
$$ x_{k+1} = x_k - [DF(x_k)]^{-1} F(x_k), $$ which makes it look as though the essence of the method is "take a derivative and invert it." But the real workhorse behind Newton's method (both in finite and infinite dimensions) is linearization.
At each step, the idea is to replace the nonlinear operator $F$ near the current iterate by its linear approximation:
$$
F(x+h) \approx F(x) + Lh,
$$
where $L$ is a bounded linear operator: the Fréchet derivative of $F$ at $x$, written $L = F'(x)$.
To find a root of $F$, we choose the increment $h$ so that the linear model vanishes:
$$
0 = F(x+h) \approx F(x) + Lh.
$$
Solving this linear equation gives the increment $h = -L^{-1} F(x)$, and hence the update $x \leftarrow x + h$, which recovers the familiar Newton step above.
But what exactly is a Fréchet derivative in infinite dimensions? To understand this, we need to generalize the concept of derivative from finite-dimensional calculus. In infinite-dimensional spaces, there are several notions of differentiability, each with different strengths and requirements:
1. Gâteaux (Directional) Derivative
We say that the Gâteaux derivative of $F$ at $x$ in the direction $h$ exists if the limit
$$
dF(x; h) = \lim_{t \to 0} \frac{F(x + t h) - F(x)}{t}
$$
exists.
This quantity measures how the function $F$ changes at $x$ when perturbed in the single direction $h$; it need not be linear in $h$, and it considers one direction at a time.
2. Hadamard Directional Derivative
Rather than considering a single direction of perturbation, we now consider a bundle of perturbations around the direction $h$: the Hadamard directional derivative requires
$$
dF(x; h) = \lim_{\substack{t \downarrow 0 \\ h' \to h}} \frac{F(x + t h') - F(x)}{t}
$$
to exist, with the limit taken jointly over $t$ and the perturbed directions $h'$.
This is a stronger condition than Gâteaux differentiability because it requires the limit to be uniform over nearby directions. However, it still doesn't guarantee linearity in $h$.
3. Fréchet Derivative
The strongest and most natural notion: $F$ is Fréchet differentiable at $x$ if there exists a bounded linear operator $L$ such that
$$
\lim_{\|h\| \to 0} \frac{\|F(x + h) - F(x) - L h\|}{\|h\|} = 0.
$$
This definition directly addresses the inadequacy of the previous notions. Unlike Gâteaux and Hadamard derivatives, the Fréchet derivative explicitly requires the existence of a linear operator $L$ with the following properties:
- $L$ must be linear in $h$ (unlike the directional derivatives above)
- The approximation error is $o(\|h\|)$, uniform in all directions
- This is the "true" derivative: it generalizes the Jacobian matrix to infinite dimensions
- Notation: $L = F'(x)$ or $DF(x)$
Relationship: Fréchet differentiability implies Hadamard directional differentiability, which in turn implies Gâteaux differentiability; the reverse implications fail in general.
In the context of the Newton-Kantorovich method, we work with an operator $F$ between Banach spaces, and the derivative $F'(x)$ appearing in the Newton step is understood as the Fréchet derivative.
Now apart from those mathematical technicalities, Newton-Kantorovich has essentially the same structure as the original Newton's method. That is, it applies the following sequence of steps:

1. **Linearize the Operator:** Given an approximation $x_n$, we consider the Fréchet derivative of $F$, denoted by $F'(x_n)$. This derivative is a linear operator that provides a local approximation of $F$ near $x_n$.
2. **Set Up the Newton Step:** The method then solves the linearized equation for a correction $h_n$:
$$ F'(x_n) h_n = -F(x_n). $$
This is a linear system where $h_n$ is chosen so that the linearized operator $F(x_n) + F'(x_n)h_n$ equals zero.
3. **Update the Solution:** The new approximation $x_{n+1}$ is then given by:
$$ x_{n+1} = x_n + h_n. $$
This correction step refines $x_n$, bringing it closer to the true solution.
4. **Repeat Until Convergence:** We repeat the linearization and update steps until the solution $x_n$ converges to the desired tolerance, which can be verified by checking that $\|F(x_n)\|$ is sufficiently small, or by monitoring the norm $\|x_{n+1} - x_n\|$.
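In finite dimensions, the same linearize/solve/update loop reads as follows (a sketch; the test operator $F$, its Jacobian, and the starting point are illustrative choices, not from the text):

```python
import numpy as np

def newton_kantorovich(F, J, x0, tol=1e-12, max_iter=50):
    """Newton iteration: linearize, solve J(x_n) h = -F(x_n), update x_{n+1} = x_n + h."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            return x
        h = np.linalg.solve(J(x), -Fx)  # Newton step: solve the linearized equation
        x = x + h                       # update the approximation
    return x

# Toy finite-dimensional instance: intersect the unit circle with the line x = y.
F = lambda z: np.array([z[0]**2 + z[1]**2 - 1.0, z[0] - z[1]])
J = lambda z: np.array([[2*z[0], 2*z[1]], [1.0, -1.0]])
root = newton_kantorovich(F, J, [1.0, 0.5])
```

The same skeleton applies verbatim in function space once `np.linalg.solve` is replaced by a solver for the linearized operator equation.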
The convergence of Newton-Kantorovich does not hinge on $F$ being a contraction over the entire domain (as is required for successive approximation). The convergence properties of the Newton-Kantorovich method are as follows:
- **Local Convergence:** Under mild conditions (e.g., $F$ is Fréchet differentiable and $F'(x)$ is invertible near the solution), the method converges locally. This means that if the initial guess is sufficiently close to the true solution, the method will converge.
- **Global Convergence:** Global convergence is not guaranteed in general. However, under stronger conditions (e.g., $F$ is analytic and satisfies certain bounds), the method can converge globally.
- **Rate of Convergence:** When the method converges, it typically exhibits quadratic convergence. This means that the error at each step is proportional to the square of the error at the previous step:
$$ \|x_{n+1} - x^*\| \leq C\|x_n - x^*\|^2 $$
where $x^*$ is the true solution and $C$ is some constant. This quadratic convergence is significantly faster than the linear convergence typically seen in methods like successive approximation.
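The difference between linear and quadratic rates is easy to observe numerically. The toy comparison below (our own construction) solves the scalar equation $x = \cos x$, once by successive approximation and once by Newton's method on $F(x) = x - \cos x$:

```python
import math

def count_iters(step, x0, tol=1e-12, max_iter=500):
    """Iterate x <- step(x) and count iterations until |step(x) - x| < tol."""
    x, n = x0, 0
    while abs(step(x) - x) >= tol and n < max_iter:
        x, n = step(x), n + 1
    return n

fp_step = math.cos                                                  # successive approximation
newton_step = lambda x: x - (x - math.cos(x)) / (1 + math.sin(x))   # Newton on F(x) = x - cos x
fp_iters = count_iters(fp_step, 1.0)
newton_iters = count_iters(newton_step, 1.0)
# Newton needs only a handful of iterations; fixed-point iteration needs dozens.
```

The contraction factor of $\cos$ near the fixed point is about $0.67$, so successive approximation gains roughly a fixed number of digits per step, while Newton doubles the number of correct digits per step.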
Recall that in the finite-horizon setting, the optimality equations are:
$$
v_t(s) = \max_{a \in \mathcal{A}_s} \left\{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v_{t+1}(j)\right\},
$$
where $v_t$ denotes the optimal value function from decision epoch $t$ onward.
Intuitively, we would expect that by taking the limit as the horizon recedes, the time dependence disappears and we obtain the stationary equations
$$
v(s) = \max_{a \in \mathcal{A}_s} \left\{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v(j)\right\},
$$
which are called the optimality equations or Bellman equations for infinite-horizon MDPs.
We can adopt an operator-theoretic perspective by defining operators on the space $V$ of bounded real-valued functions on $\mathcal{S}$. For a fixed decision rule $\pi$, the policy evaluation operator $\mathrm{L}_\pi$ is
$$
(\mathrm{L}_\pi v)(s) = r(s,\pi(s)) + \gamma \sum_{j \in \mathcal{S}} p(j|s,\pi(s))\, v(j).
$$
The Bellman optimality operator is then:
$$
(\Bellman v)(s) = \max_{a \in \mathcal{A}_s} \left\{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v(j)\right\},
$$
where the maximization is carried out separately at each state $s$.
Note that while we write $\max$ rather than $\sup$, this is justified here because each action set $\mathcal{A}_s$ is finite, so the maximum is attained.
For convenience, we define the greedy selector $\mathrm{Greedy}(v)$, which returns a decision rule $\pi$ satisfying $\pi(s) \in \arg\max_{a \in \mathcal{A}_s} \left\{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v(j)\right\}$ for every state $s$.
In Puterman's terminology, such a greedy selector is called a $v$-improving decision rule.
- Value iteration: $v_{k+1} = \Bellman v_k$, then extract $\pi = \mathrm{Greedy}(v^*)$
- Policy iteration: $\pi_{k+1} = \mathrm{Greedy}(v^{\pi_k})$ with $v^{\pi_k}$ solving $v = \mathrm{L}_{\pi_k}v$
The equivalence between these two forms can be shown mathematically, as demonstrated in the following proposition and proof.
The operator $\Bellman$ defined as a maximization over Markov deterministic decision rules:
$$(\Bellman \mathbf{v})(s) = \max_{\pi \in \Pi^{MD}} \left\{r(s,\pi(s)) + \gamma \sum_{j \in \mathcal{S}} p(j|s,\pi(s)) v(j)\right\}$$
is equivalent to the componentwise maximization over actions:
$$(\Bellman \mathbf{v})(s) = \max_{a \in \mathcal{A}_s} \left\{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j)\right\}$$
Fix $s$. Let
$$
Q_v(s,a) \triangleq r(s,a)+\gamma\sum_{j}p(j\mid s,a)\,v(j).
$$
For any rule $\pi \in \Pi^{MD}$, we have $(\BellmanPi v)(s)=Q_v(s,\pi(s))\le \max_{a\in\mathcal{A}_s}Q_v(s,a)$.
Taking the maximum over $\pi$ gives
$$
\max_{\pi\in\Pi^{MD}}(\BellmanPi v)(s) \le \max_{a\in\mathcal{A}_s}Q_v(s,a).
$$
Conversely, choose a **greedy selector** $\pi^v\in\Pi^{MD}$ such that for each $s$,
$$\pi^v(s)\in\arg\max_{a\in\mathcal{A}_s}Q_v(s,a)$$
(possible since $\mathcal{A}_s$ is finite; otherwise use a measurable $\varepsilon$-greedy selector). Then
$$
(\Bellman _{\pi^v}v)(s)=Q_v(s,\pi^v(s))=\max_{a\in\mathcal{A}_s}Q_v(s,a),
$$
so $\max_{\pi}(\BellmanPi v)(s)\ge \max_{a}Q_v(s,a)$. Combining both inequalities yields equality.
The optimality equations are operator equations. Therefore, we can apply general numerical methods to solve them. Applying the successive approximation method to the Bellman optimality equation yields a method known as "value iteration" in dynamic programming. A direct application of the blueprint for successive approximation yields the following algorithm:
:label: value-iteration
**Input:** An MDP $(S, A, P, R, \gamma)$ and tolerance $\varepsilon > 0$
**Output:** An $\varepsilon$-optimal value function $v$ and policy $\pi$
1. Initialize $v_0(s) = 0$ for all $s \in S$
2. $n \leftarrow 0$
3. **repeat**
1. For each $s \in S$:
1. $v_{n+1}(s) \leftarrow (\Bellman v_n)(s) = \max_{a \in A} \left\{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)v_n(j)\right\}$
2. $\delta \leftarrow \|v_{n+1} - v_n\|_\infty$
3. $n \leftarrow n + 1$
4. **until** $\delta < \frac{\varepsilon(1-\gamma)}{2\gamma}$
5. Extract greedy policy: $\pi \leftarrow \mathrm{Greedy}(v_n)$ where
$$\mathrm{Greedy}(v)(s) \in \arg\max_{a \in \mathcal{A}_s} \left\{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)v(j)\right\}$$
6. **return** $v_n, \pi$
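A compact implementation of the algorithm above might look as follows (a sketch for finite MDPs; the array layout `P[a, s, j]`, `r[s, a]` and the two-state toy numbers are our own choices):

```python
import numpy as np

def value_iteration(P, r, gamma, eps=1e-6):
    """Value iteration on a finite MDP.

    P: transition array of shape (A, S, S); r: reward array of shape (S, A).
    Returns an eps-optimal value function and the associated greedy policy.
    """
    n_states = r.shape[0]
    v = np.zeros(n_states)
    threshold = eps * (1 - gamma) / (2 * gamma)   # stopping bound from the algorithm
    while True:
        # Bellman optimality update: max over actions of r + gamma * P v.
        q = r + gamma * np.einsum('asj,j->sa', P, v)
        v_next = q.max(axis=1)
        if np.max(np.abs(v_next - v)) < threshold:
            break
        v = v_next
    # Greedy policy extraction from the final value function.
    q = r + gamma * np.einsum('asj,j->sa', P, v_next)
    return v_next, q.argmax(axis=1)

# Two-state, two-action toy MDP (illustrative numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # p(j | s, a=0)
              [[0.5, 0.5], [0.1, 0.9]]])   # p(j | s, a=1)
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # r[s, a]
v, pi = value_iteration(P, r, gamma=0.9)
```

The returned `v` satisfies the Bellman equation up to the advertised tolerance, which is exactly the guarantee established in the theorem below.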
The termination criterion in this algorithm is based on a specific bound that provides guarantees on the quality of the solution. This is in contrast to supervised learning, where we often use arbitrary termination criteria based on computational budget or early stopping when the learning curve flattens. This is because establishing implementable generalization bounds in supervised learning is challenging.
However, in the dynamic programming context, we can derive various bounds that can be implemented in practice. These bounds help us terminate our procedure with a guarantee on the precision of our value function and, correspondingly, on the optimality of the resulting policy.
:label: value-iteration-convergence
(Adapted from {cite:t}`Puterman1994` theorem 6.3.1)
Let $v_0$ be any initial value function, $\varepsilon > 0$ a desired accuracy, and let $\{v_n\}$ be the sequence of value functions generated by value iteration, i.e., $v_{n+1} = \Bellman v_n$ for $n \geq 0$, where $\Bellman$ is the Bellman optimality operator. Then:
1. $v_n$ converges to the optimal value function $v^*_\gamma$,
2. The algorithm terminates in finite time,
3. The resulting policy $\pi_\varepsilon$ is $\varepsilon$-optimal, and
4. When the algorithm terminates, $v_{n+1}$ is within $\varepsilon/2$ of $v^*_\gamma$.
Parts 1 and 2 follow directly from the fact that $\Bellman$ is a contraction mapping. Hence, by Banach's fixed-point theorem, it has a unique fixed point (which is $v^*_\gamma$), and repeated application of $\Bellman$ will converge to this fixed point. Moreover, this convergence happens at a geometric rate, which ensures that we reach the termination condition in finite time.
To show that the Bellman optimality operator $\Bellman$ is a contraction mapping, we need to prove that for any two value functions $v$ and $u$:
$$\|\Bellman v - \Bellman u\|_\infty \leq \gamma \|v - u\|_\infty$$
where $\gamma \in [0,1)$ is the discount factor and $\|\cdot\|_\infty$ is the supremum norm.
Let's start by writing out the definition of $\Bellman v$ and $\Bellman u$:
$$\begin{align*}
(\Bellman v)(s) &= \max_{a \in A} \left\{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)v(j)\right\}\\
(\Bellman u)(s) &= \max_{a \in A} \left\{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)u(j)\right\}
\end{align*}$$
For any state $s$, let $a_v$ be the action that achieves the maximum for $(\Bellman v)(s)$, and $a_u$ be the action that achieves the maximum for $(\Bellman u)(s)$. By the definition of these maximizers:
$$\begin{align*}
(\Bellman v)(s) &\geq r(s,a_u) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a_u)v(j)\\
(\Bellman u)(s) &\geq r(s,a_v) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a_v)u(j)
\end{align*}$$
Subtracting these inequalities:
$$\begin{align*}
(\Bellman v)(s) - (\Bellman u)(s) &\leq \gamma \sum_{j \in \mathcal{S}} p(j|s,a_v)(v(j) - u(j))\\
(\Bellman u)(s) - (\Bellman v)(s) &\leq \gamma \sum_{j \in \mathcal{S}} p(j|s,a_u)(u(j) - v(j))
\end{align*}$$
Taking the absolute value and using the fact that $\sum_{j \in \mathcal{S}} p(j|s,a) = 1$:
$$|(\Bellman v)(s) - (\Bellman u)(s)| \leq \gamma \max_{j \in \mathcal{S}} |v(j) - u(j)| = \gamma \|v - u\|_\infty$$
Since this holds for all $s \in \mathcal{S}$, taking the supremum over $s$ gives:
$$\|\Bellman v - \Bellman u\|_\infty \leq \gamma \|v - u\|_\infty$$
Thus, $\Bellman$ is a contraction mapping with contraction factor $\gamma$.
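As a numerical sanity check (not part of the proof), the contraction bound can be verified on a randomly generated MDP; the array layout `P[a, s, j]`, `r[s, a]` below is an illustrative convention:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random transition kernel (rows of P[a, s, :] sum to 1) and random rewards.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def bellman(v):
    """Bellman optimality operator on the finite MDP above."""
    return (r + gamma * np.einsum('asj,j->sa', P, v)).max(axis=1)

# Empirical check of ||Tv - Tu||_inf <= gamma * ||v - u||_inf on random pairs.
ratios = []
for _ in range(100):
    v, u = rng.normal(size=n_states), rng.normal(size=n_states)
    ratios.append(np.max(np.abs(bellman(v) - bellman(u))) / np.max(np.abs(v - u)))
max_ratio = max(ratios)
```

Every observed ratio stays at or below $\gamma$, as the proof guarantees.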
Now, let's prove parts 3 and 4. Suppose the algorithm has just terminated, i.e., $\|v_{n+1} - v_n\|_\infty < \frac{\varepsilon(1-\gamma)}{2\gamma}$ for some $n$. We want to show that our current value function $v_{n+1}$ and the policy $\pi_\varepsilon$ derived from it are close to optimal.
By the triangle inequality:
$$\|v^{\pi_\varepsilon}_\gamma - v^*_\gamma\|_\infty \leq \|v^{\pi_\varepsilon}_\gamma - v_{n+1}\|_\infty + \|v_{n+1} - v^*_\gamma\|_\infty$$
For the first term, since $v^{\pi_\varepsilon}_\gamma$ is the fixed point of $\mathrm{L}_{\pi_\varepsilon}$ and $\pi_\varepsilon$ is greedy with respect to $v_{n+1}$ (i.e., $\mathrm{L}_{\pi_\varepsilon}v_{n+1} = \Bellman v_{n+1}$):
$$
\begin{aligned}
\|v^{\pi_\varepsilon}_\gamma - v_{n+1}\|_\infty &= \|\mathrm{L}_{\pi_\varepsilon}v^{\pi_\varepsilon}_\gamma - v_{n+1}\|_\infty \\
&\leq \|\mathrm{L}_{\pi_\varepsilon}v^{\pi_\varepsilon}_\gamma - \mathrm{L}_{\pi_\varepsilon}v_{n+1}\|_\infty + \|\mathrm{L}_{\pi_\varepsilon}v_{n+1} - v_{n+1}\|_\infty \\
&= \|\mathrm{L}_{\pi_\varepsilon}v^{\pi_\varepsilon}_\gamma - \mathrm{L}_{\pi_\varepsilon}v_{n+1}\|_\infty + \|\Bellman v_{n+1} - v_{n+1}\|_\infty \\
&\leq \gamma\|v^{\pi_\varepsilon}_\gamma - v_{n+1}\|_\infty + \gamma\|v_{n+1} - v_n\|_\infty
\end{aligned}
$$
where we used that both $\Bellman$ and $\mathrm{L}_{\pi_\varepsilon}$ are contractions with factor $\gamma$, and that $v_{n+1} = \Bellman v_n$.
Rearranging:
$$\|v^{\pi_\varepsilon}_\gamma - v_{n+1}\|_\infty \leq \frac{\gamma}{1-\gamma}\|v_{n+1} - v_n\|_\infty$$
Similarly, since $v^*_\gamma$ is the fixed point of $\Bellman$:
$$\|v_{n+1} - v^*_\gamma\|_\infty = \|\Bellman v_n - \Bellman v^*_\gamma\|_\infty \leq \gamma\|v_n - v^*_\gamma\|_\infty \leq \frac{\gamma}{1-\gamma}\|v_{n+1} - v_n\|_\infty$$
Since $\|v_{n+1} - v_n\|_\infty < \frac{\varepsilon(1-\gamma)}{2\gamma}$:
$$\|v^{\pi_\varepsilon}_\gamma - v_{n+1}\|_\infty \leq \frac{\gamma}{1-\gamma} \cdot \frac{\varepsilon(1-\gamma)}{2\gamma} = \frac{\varepsilon}{2}$$
$$\|v_{n+1} - v^*_\gamma\|_\infty \leq \frac{\gamma}{1-\gamma} \cdot \frac{\varepsilon(1-\gamma)}{2\gamma} = \frac{\varepsilon}{2}$$
Combining these:
$$\|v^{\pi_\varepsilon}_\gamma - v^*_\gamma\|_\infty \leq \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon$$
This completes the proof, showing that $v_{n+1}$ is within $\varepsilon/2$ of $v^*_\gamma$ (part 4) and $\pi_\varepsilon$ is $\varepsilon$-optimal (part 3).
We now apply the Newton-Kantorovich framework to the Bellman optimality equation. Let $F(v) \triangleq v - \Bellman v$ denote the Bellman residual.
The problem is to find $v^*$ such that $F(v^*) = 0$, i.e., a fixed point of $\Bellman$.
We consider three complementary perspectives for understanding and computing its derivative.
In tabular form, for finite state and action spaces, the Bellman operator can be written as a pointwise maximum of affine maps:
$$
(\Bellman v)(s) = \max_{a \in A(s)} \left\{ r(s,a) + \gamma (P_a v)(s) \right\},
$$
where $P_a$ denotes the transition operator of action $a$, $(P_a v)(s) = \sum_{j} p(j \mid s,a)\, v(j)$, and $A(s)$ is the set of admissible actions at $s$.
At any $v$, let $A^*(s; v) = \arg\max_{a \in A(s)} \left\{ r(s,a) + \gamma (P_a v)(s) \right\}$ denote the set of maximizing (active) actions at state $s$.
Then the Hadamard directional derivative exists and is given by
$$
(\Bellman'(v; h))(s) = \max_{a \in A^*(s; v)} \gamma (P_a h)(s).
$$
If the active set is a singleton, this expression becomes linear in $h$:
$$
(\Bellman'(v; h))(s) = \gamma (P_{a^*(s)} h)(s),
$$
where $a^*(s)$ is the unique maximizing action at $s$.
Consider now a value function approximated as a linear combination of basis functions:
$$
\hat{v}(s) = \sum_{i=1}^{m} c_i \phi_i(s).
$$
At a node $s_k$, the Bellman operator applied to $\hat{v}$ reads
$$
(\Bellman \hat{v})(s_k) = \max_{a \in A(s_k)} \left\{ r(s_k,a) + \gamma \sum_{j} p(j \mid s_k, a)\, \hat{v}(j) \right\}.
$$
Define
$$
g(c) \triangleq \max_{a \in A(s_k)} \left\{ r(s_k,a) + \gamma \sum_{j} p(j \mid s_k, a) \sum_{i} c_i \phi_i(j) \right\},
$$
so that, by the envelope theorem, whenever the maximizer $a^*$ is unique,
$$
\frac{\partial g}{\partial c_i}(c) = \gamma \sum_{j} p(j \mid s_k, a^*)\, \phi_i(j).
$$
We do not need to differentiate the optimizer $a^*$ itself with respect to the coefficients $c$: at the optimum, the first-order effect of perturbing $c$ comes only through the maximand evaluated at the fixed maximizer.
This result is useful when solving the collocation equation $\hat{v}(s_k) = (\Bellman \hat{v})(s_k)$ for the coefficients $c$ with a Newton-type method, since it supplies the required Jacobian entries.
The third perspective applies the implicit function theorem to understand when the Bellman operator is differentiable despite containing a max operator. The maximization problem defines an implicit relationship between the value function and the optimal action, and the implicit function theorem tells us when this relationship is smooth enough to differentiate through.
The Bellman operator is defined as
$$
(\Bellman v)(s) = \max_{a \in \mathcal{A}_s} \left\{ r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j \mid s,a)\, v(j) \right\}.
$$
The difficulty is that the max operator encodes a discrete selection: which action achieves the maximum. To apply the implicit function theorem, we reformulate this as follows. For each action $a$, define the action-value map
$$
Q_a(v, s) \triangleq r(s,a) + \gamma \sum_{j} p(j \mid s,a)\, v(j),
$$
so that $(\Bellman v)(s) = \max_{a \in \mathcal{A}_s} Q_a(v, s)$.
The optimal action at state $s$ under value function $v$ is $a^*(s; v) \in \arg\max_{a} Q_a(v, s)$.
Now suppose that at a particular $v_0$, the maximizer at $s$ is unique and wins by a strict margin: there exists $\delta > 0$ such that $Q_{a^*(s)}(v_0, s) \geq Q_a(v_0, s) + \delta$ for all $a \neq a^*(s)$.
This strict inequality is the regularity condition needed for the implicit function theorem. It ensures that the optimal action is unique at $v_0$ and remains unchanged for all $v$ in a neighborhood of $v_0$.
To see why, consider any perturbation $h$ with $\|h\|_\infty < \delta / (2\gamma)$. Each action value moves by at most $\gamma \|h\|_\infty$:
$$
|Q_a(v_0 + h, s) - Q_a(v_0, s)| = \gamma \left| \sum_j p(j \mid s,a)\, h(j) \right| \leq \gamma \|h\|_\infty.
$$
Thus every action value changes by less than $\delta/2$, so $a^*(s)$ still achieves the strict maximum at $v_0 + h$.
The implicit function theorem now applies: in this neighborhood, the mapping $v \mapsto a^*(s; v)$ is constant (and hence smooth), taking the value $a^*(s)$. This allows us to write
$$
(\Bellman v)(s) = Q_{a^*(s)}(v, s) = r(s,a^*(s)) + \gamma \sum_j p(j \mid s,a^*(s))\, v(j)
$$
as an explicit formula that holds throughout the neighborhood. Since $Q_{a^*(s)}(\cdot, s)$ is an affine (hence smooth) function of $v$, the map $v \mapsto (\Bellman v)(s)$ is differentiable at $v_0$.
More precisely, for any perturbation $h$ small enough that the maximizer does not change,
$$
(\Bellman (v_0 + h))(s) = (\Bellman v_0)(s) + \gamma \sum_j p(j \mid s, a^*(s))\, h(j),
$$
with no higher-order remainder.
This is the Fréchet derivative:
$$
(\Bellman'(v_0)\, h)(s) = \gamma (P_{\pi^*} h)(s),
$$
where $\pi^*$ denotes the greedy decision rule $\pi^*(s) = a^*(s)$ and $P_{\pi^*}$ its transition operator.
The role of the implicit function theorem: It guarantees that when the maximizer is unique with a strict gap (the regularity condition), the argmax function $v \mapsto a^*(s; v)$ is locally constant, so that differentiating the max reduces to differentiating the single selected branch $Q_{a^*(s)}(\cdot, s)$.
We return to the Newton-Kantorovich step: solve $F'(v_n)\, h_n = -F(v_n)$ for the correction $h_n$, where $F(v) = v - \Bellman v$.
Suppose the greedy decision rule $\pi_n = \mathrm{Greedy}(v_n)$ is unique at $v_n$. Then $F'(v_n) = I - \gamma P_{\pi_n}$, and the Newton update $v_{n+1} = v_n + h_n$ satisfies
$$
(I - \gamma P_{\pi_n})\, v_{n+1} = r_{\pi_n},
$$
which is exactly policy evaluation for $\pi_n$.
Thus, policy iteration is Newton-Kantorovich applied to the Bellman optimality equation. At points of nondifferentiability (when ties occur), the operator is still semismooth, and policy iteration corresponds to a semismooth Newton method. The envelope theorem is what justifies the simplification of the Jacobian to $I - \gamma P_{\pi_n}$: we differentiate only the selected branch, never the argmax.
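This equivalence is easy to verify numerically. The sketch below (a hypothetical setup; the array layout `P[a, s, j]`, `r[s, a]` is our own convention) takes one Newton step on the residual $F(v) = v - \Bellman v$ with Jacobian $I - \gamma P_{\pi}$ and checks that it lands exactly on the policy-evaluation solution for the greedy policy:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize transition probabilities
r = rng.random((n_states, n_actions))

v = rng.normal(size=n_states)              # arbitrary current iterate
q = r + gamma * np.einsum('asj,j->sa', P, v)
pi = q.argmax(axis=1)                      # greedy policy at v
P_pi = P[pi, np.arange(n_states), :]       # P_pi[s, :] = p(. | s, pi(s))
r_pi = r[np.arange(n_states), pi]

# Newton step on the residual F(v) = v - Bellman(v), Jacobian I - gamma * P_pi.
F_v = v - q.max(axis=1)
h = np.linalg.solve(np.eye(n_states) - gamma * P_pi, -F_v)
v_newton = v + h

# Policy evaluation for the greedy policy: solve (I - gamma * P_pi) v = r_pi.
v_eval = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```

Up to floating-point round-off, `v_newton` and `v_eval` coincide: one Newton step equals one round of policy evaluation.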
The three perspectives we developed above (the active set view, the envelope theorem, and the implicit function theorem) all point toward a deeper framework for understanding Newton-type methods on non-smooth operators. This framework, known as semismooth Newton methods, was developed precisely to handle operators like the Bellman operator that are piecewise smooth but not globally differentiable. The connection between policy iteration and semismooth Newton methods has been rigorously developed in recent work {cite}`Gargiani2022`.
The classical Newton-Kantorovich method assumes the operator is Fréchet differentiable everywhere. The derivative exists, is unique, and varies continuously with the base point. But the Bellman operator fails this: it is only piecewise affine, and at value functions where two or more actions tie for the maximum, it has a kink and no Fréchet derivative.
Semismooth Newton methods address this by replacing the notion of a single Jacobian with a generalized derivative that captures the behavior of the operator near non-smooth points. The most commonly used generalized derivative is the Clarke subdifferential, which we can think of as the convex hull of all possible "candidate Jacobians" that arise from limits approaching the non-smooth point from different directions.
For the Bellman residual $F(v) = v - \Bellman v$, the generalized Jacobian at $v$ is
$$
\partial F(v) = \operatorname{conv}\left\{ I - \gamma P_\pi : \pi(s) \in \arg\max_{a \in \mathcal{A}_s} Q_v(s,a) \text{ for every } s \right\}.
$$
In words, the generalized Jacobian is the set of all matrices of the form $I - \gamma P_\pi$, where $\pi$ selects at each state some action attaining the maximum, together with their convex combinations.
The semismooth Newton method for solving $F(v) = 0$ then picks, at each iteration, any element $J_k \in \partial F(v_k)$ and performs the update $v_{k+1} = v_k - J_k^{-1} F(v_k)$.
What this tells us is that any choice from the Clarke subdifferential yields a valid Newton-like update. In the context of the Bellman equation, choosing $J_k = I - \gamma P_{\pi_k}$ with $\pi_k$ greedy for $v_k$ recovers exactly the policy iteration update.
Under appropriate regularity conditions (specifically, when the residual function is BD-regular or CD-regular), the semismooth Newton method converges locally at a quadratic rate {cite}`Gargiani2022`. This means that near the solution, the error decreases quadratically:
$$
\|v_{k+1} - v^*\| \leq C \|v_k - v^*\|^2.
$$
This theoretical result explains an empirical observation that has long been noted in practice: policy iteration typically converges in very few iterations, often just a handful, even when the state and action spaces are enormous and the space of possible policies is exponentially large.
The semismooth Newton framework also suggests a spectrum of methods interpolating between value iteration and policy iteration. Value iteration can be interpreted as a Newton-like method where we choose $J_k = I$: the update $v_{k+1} = v_k - F(v_k) = \Bellman v_k$ takes a full residual step without solving any linear system.
Between these extremes lie methods that use approximate Jacobians. One natural variant is to choose the greedy Jacobian $I - \gamma P_{\pi_k}$ but solve the resulting linear system only approximately, for instance by applying a fixed number $m$ of sweeps of the policy evaluation operator $\mathrm{L}_{\pi_k}$ instead of solving $v = \mathrm{L}_{\pi_k} v$ exactly.
This is known as modified policy iteration (or optimistic policy iteration): it recovers value iteration at $m = 0$ and approaches exact policy iteration as $m \to \infty$.
The connection to semismooth Newton methods places policy iteration within a broader mathematical framework that extends far beyond dynamic programming. Semismooth Newton methods are used in optimization (for complementarity problems and variational inequalities), in PDE-constrained optimization (for problems with control constraints), and in economics (for equilibrium problems). The Bellman equation, viewed through this lens, is simply one instance of a piecewise smooth equation, and the tools developed for such equations apply directly.
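To make the interpolation between value iteration and policy iteration concrete, here is a sketch of the $m$-sweep variant, often called modified policy iteration (an illustrative implementation; the MDP arrays, layout `P[a, s, j]`, `r[s, a]`, and parameter choices are our own):

```python
import numpy as np

def modified_policy_iteration(P, r, gamma, m=10, tol=1e-10, max_iter=10_000):
    """Greedy improvement followed by m sweeps of partial policy evaluation.

    m = 0 recovers value iteration; m -> infinity approaches exact policy iteration.
    """
    n_states = r.shape[0]
    v = np.zeros(n_states)
    pi = np.zeros(n_states, dtype=int)
    for _ in range(max_iter):
        q = r + gamma * np.einsum('asj,j->sa', P, v)
        pi = q.argmax(axis=1)                  # greedy improvement
        v_next = q.max(axis=1)                 # one Bellman update (value-iteration step)
        P_pi = P[pi, np.arange(n_states), :]
        r_pi = r[np.arange(n_states), pi]
        for _ in range(m):                     # m extra sweeps of L_pi (partial evaluation)
            v_next = r_pi + gamma * P_pi @ v_next
        if np.max(np.abs(v_next - v)) < tol:
            return v_next, pi
        v = v_next
    return v, pi

# Two-state, two-action toy MDP (illustrative numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
v_mpi, pi_mpi = modified_policy_iteration(P, r, gamma=0.9)
```

Each outer iteration is cheaper than a full linear solve, yet contracts far faster than a single Bellman update; in practice $m$ between 5 and 50 is a common compromise.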
While we derived policy iteration-like steps from the Newton-Kantorovich method, it's worth examining policy iteration as a standalone algorithm, as it has been traditionally presented in the field of dynamic programming.
The policy iteration algorithm for discounted Markov decision problems is as follows:
:label: policy-iteration-standard
**Input:** MDP $(S, A, P, R, \gamma)$
**Output:** Optimal policy $\pi^*$
1. Initialize: $n = 0$, select an arbitrary decision rule $\pi_0 \in \Pi^{MD}$
2. **repeat**
3. (Policy evaluation) Obtain $\mathbf{v}^n$ by solving:
$$(\mathbf{I}-\gamma \mathbf{P}_{\pi_n}) \mathbf{v} = \mathbf{r}_{\pi_n}$$
4. (Policy improvement) Choose $\pi_{n+1} = \mathrm{Greedy}(\mathbf{v}^n)$ where:
$$\pi_{n+1} \in \arg\max_{\pi \in \Pi^{MD}}\left\{\mathbf{r}_\pi+\gamma \mathbf{P}_\pi \mathbf{v}^n\right\}$$
equivalently, for each $s$:
$$\pi_{n+1}(s) \in \arg\max_{a \in \mathcal{A}_s}\left\{r(s,a)+\gamma \sum_j p(j|s,a) \mathbf{v}^n(j)\right\}$$
Set $\pi_{n+1} = \pi_n$ if possible (i.e., whenever $\pi_n$ itself attains the maximum); this tie-breaking rule guarantees finite termination.
5. If $\pi_{n+1} = \pi_n$, **return** $\pi^* = \pi_n$
6. $n = n + 1$
7. **until** convergence
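The algorithm above can be sketched directly (an illustrative finite-MDP implementation; the array layout `P[a, s, j]`, `r[s, a]` and the toy numbers are our own choices):

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Exact policy iteration: alternate policy evaluation and greedy improvement."""
    n_states = r.shape[0]
    pi = np.zeros(n_states, dtype=int)       # arbitrary initial decision rule
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[pi, np.arange(n_states), :]
        r_pi = r[np.arange(n_states), pi]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: greedy step (argmax breaks ties deterministically,
        # playing the role of "set pi_{n+1} = pi_n if possible").
        q = r + gamma * np.einsum('asj,j->sa', P, v)
        pi_next = q.argmax(axis=1)
        if np.array_equal(pi_next, pi):      # policy stable: optimal
            return v, pi
        pi = pi_next

# Two-state, two-action toy MDP (illustrative numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # p(j | s, a=0)
              [[0.5, 0.5], [0.1, 0.9]]])   # p(j | s, a=1)
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # r[s, a]
v_star, pi_star = policy_iteration(P, r, gamma=0.9)
```

At termination the returned value function solves the Bellman optimality equation to machine precision, since the final policy is greedy with respect to its own evaluation.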
As opposed to value iteration, this algorithm produces a sequence of both deterministic Markovian decision rules $\{\pi_n\}$ and their associated value functions $\{\mathbf{v}^n\}$.