The previous chapter showed how fitted Q-iteration handles large state spaces through function approximation. FQI maintains a unified structure across batch and online settings: a replay buffer $\mathcal{B}_t$ inducing an empirical distribution $\hat{P}_{\mathcal{B}_t}$, a target map that turns each transition into a regression target, and a fitting step that projects those targets back onto the function class.
However, this framework breaks down when the action space becomes large or continuous. Computing Bellman targets requires evaluating $\max_{a'} q(s', a'; \boldsymbol{\theta})$, which is a trivial enumeration for small discrete action sets but an intractable optimization problem when $\mathcal{A}$ is continuous.
This chapter addresses the continuous action problem while maintaining the FQI framework. We develop several approaches, unified by a common theme: amortization. Rather than solving the optimization problem $\max_a q(s,a;\boldsymbol{\theta})$ from scratch at every decision, we invest computation up front in a mapping that produces approximate maximizers directly.
The strategies we examine are:

- **Explicit optimization (Section 2):** Solve the maximization numerically for a subset of states, accepting the computational cost for exact solutions.
- **Policy network amortization (Sections 3-5):** Learn a deterministic or stochastic policy network $\pi_{\boldsymbol{w}}$ that approximates $\arg\max_a q(s,a;\boldsymbol{\theta})$ or the optimal stochastic policy, enabling fast action selection via a single forward pass. This includes both hard-max methods (DDPG, TD3) and soft-max methods (SAC, PCL).
Each approach represents a different point in the computation-accuracy trade-off, and all fit within the FQI template by modifying how targets are computed.
Recall that in fitted Q methods, the main idea is to compute the Bellman operator only at a subset of all states, relying on function approximation to generalize to the remaining states. At each step of the successive approximation loop, we build a dataset of input state-action pairs mapped to their corresponding optimality operator evaluations:

$$\mathcal{D}_n = \left\{\left((s,a),\; y_{s,a}\right) \;:\; (s,a,r,s') \in \mathcal{B}\right\}, \qquad y_{s,a} = r + \gamma \max_{a'} q(s', a'; \boldsymbol{\theta}_n)$$
This dataset is then fed to our function approximator (neural network, random forest, linear model) to obtain the next set of parameters:

$$\boldsymbol{\theta}_{n+1} = \texttt{fit}(\mathcal{D}_n)$$
While this strategy allows us to handle very large or even infinite (continuous) state spaces, it still requires maximizing over actions (the $\texttt{maximize}$ step in the algorithm below) when computing targets, an operation that is straightforward only for small discrete action sets.
```{prf:algorithm} Fitted Q-Iteration with Explicit Maximization
:label: fitted-q-iteration-explicit
**Input:** MDP $(S, \mathcal{A}, P, R, \gamma)$, base points $\mathcal{B}$, function approximator $q(s,a; \boldsymbol{\theta})$, maximum iterations $N$, tolerance $\varepsilon > 0$
**Output:** Parameters $\boldsymbol{\theta}$ for Q-function approximation
1. Initialize $\boldsymbol{\theta}_0$
2. $n \leftarrow 0$
3. **repeat**
1. $\mathcal{D} \leftarrow \emptyset$ $\quad$ // Regression dataset
2. **for** each $(s,a,r,s') \in \mathcal{B}$ **do** $\quad$ // Monte Carlo with one sample
1. $y_{s,a} \leftarrow r + \gamma \texttt{maximize}(q(s', \cdot; \boldsymbol{\theta}_n))$
2. $\mathcal{D} \leftarrow \mathcal{D} \cup \{((s,a), y_{s,a})\}$
3. $\boldsymbol{\theta}_{n+1} \leftarrow \texttt{fit}(\mathcal{D})$
4. $\delta \leftarrow \frac{1}{|\mathcal{D}||\mathcal{A}|}\sum_{(s,a) \in \mathcal{D} \times \mathcal{A}} (q(s,a; \boldsymbol{\theta}_{n+1}) - q(s,a; \boldsymbol{\theta}_n))^2$
5. $n \leftarrow n + 1$
4. **until** $\delta < \varepsilon$ or $n \geq N$
5. **return** $\boldsymbol{\theta}_n$
```
The above pseudocode introduces a generic $\texttt{maximize}$ subroutine. For discrete actions it reduces to enumeration; for continuous actions it must invoke a numerical optimizer (for example, gradient ascent on $a \mapsto q(s', a; \boldsymbol{\theta}_n)$) once per transition.
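To make the cost concrete, here is a minimal sketch of one way to instantiate $\texttt{maximize}$ for continuous actions: projected gradient ascent on the Q-function with random restarts. The setup is assumed, not prescribed by the chapter: a PyTorch module `q_net(s, a)` returning Q-values, a box-constrained action space, and illustrative hyperparameters.

```python
import torch

def maximize_q(q_net, s, action_dim, a_low=-1.0, a_high=1.0,
               n_restarts=8, n_steps=50, lr=0.1):
    """Approximate max_a q(s, a) by projected gradient ascent.

    s: state tensor of shape (1, state_dim). Random restarts guard
    against local maxima of the nonconvex Q landscape.
    """
    a = a_low + (a_high - a_low) * torch.rand(n_restarts, action_dim)
    a.requires_grad_(True)
    s_rep = s.expand(n_restarts, -1)            # same state for every restart
    opt = torch.optim.Adam([a], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        (-q_net(s_rep, a)).sum().backward()     # ascent on q == descent on -q
        opt.step()
        with torch.no_grad():
            a.clamp_(a_low, a_high)             # project back into the action box
    with torch.no_grad():
        values = q_net(s_rep, a).squeeze(-1)
        best = values.argmax()
        return a[best].clone(), values[best].item()
```

Running a loop like this for every transition in the buffer at every iteration is exactly the cost that motivates amortization.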
While explicit optimization provides exact solutions, it becomes computationally expensive when we need to compute targets for millions of transitions in a replay buffer. Can we avoid solving an optimization problem at every decision? The answer is amortization.
This process is computationally intensive. We can "amortize" some of this computation by replacing the explicit optimization for each sample with a direct mapping that gives us an approximate maximizer directly. For Q-functions, recall that the operator is given by:

$$(\mathrm{T} q)(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[\max_{a'} q(s', a')\right]$$
If $q^*$ is the optimal state-action value function, then $v^*(s) = \max_a q^*(s,a)$, and we can derive the optimal policy directly by computing the decision rule:

$$\pi^*(s) \in \arg\max_a q^*(s,a)$$
Since we do not know $q^*$, we must work with the current approximation $q(\cdot, \cdot; \boldsymbol{\theta}_n)$, and computing its argmax at every state is exactly the expensive operation we are trying to avoid. Note that a compromise is possible: solve the maximization exactly only on a subset of states, and train a policy network $\pi_{\boldsymbol{w}}$ by regression on these exact maximizers so that it generalizes to the remaining states. The following algorithm implements this idea.
```{prf:algorithm} Fitted Q-Iteration with Amortized Maximization
:label: fitted-q-iteration-amortized
**Input:** MDP $(S, \mathcal{A}, P, R, \gamma)$, base points $\mathcal{B}$, subset for exact optimization $\mathcal{B}_{\text{opt}} \subset \mathcal{B}$, Q-function approximator $q(s,a; \boldsymbol{\theta})$, policy approximator $\pi_{\boldsymbol{w}}$, maximum iterations $N$, tolerance $\varepsilon > 0$
**Output:** Parameters $\boldsymbol{\theta}$ for Q-function, $\boldsymbol{w}$ for policy
1. Initialize $\boldsymbol{\theta}_0$, $\boldsymbol{w}_0$
2. $n \leftarrow 0$
3. **repeat**
1. $\mathcal{D}_q \leftarrow \emptyset$ $\quad$ // Q-function regression dataset
2. $\mathcal{D}_\pi \leftarrow \emptyset$ $\quad$ // Policy regression dataset
3. **for** each $(s,a,r,s') \in \mathcal{B}$ **do**
1. **// Determine next state's action via exact optimization or approximation**
2. **if** $s' \in \mathcal{B}_{\text{opt}}$ **then**
1. $a^*_{s'} \leftarrow \texttt{maximize}(q(s', \cdot; \boldsymbol{\theta}_n))$
2. $\mathcal{D}_\pi \leftarrow \mathcal{D}_\pi \cup \{(s', a^*_{s'})\}$
3. **else**
1. $a^*_{s'} \leftarrow \pi_{\boldsymbol{w}_n}(s')$
4. **// Compute Q-function target using chosen action**
5. $y_{s,a} \leftarrow r + \gamma q(s', a^*_{s'}; \boldsymbol{\theta}_n)$
6. $\mathcal{D}_q \leftarrow \mathcal{D}_q \cup \{((s,a), y_{s,a})\}$
4. **// Update both function approximators**
5. $\boldsymbol{\theta}_{n+1} \leftarrow \texttt{fit}(\mathcal{D}_q)$
6. $\boldsymbol{w}_{n+1} \leftarrow \texttt{fit}(\mathcal{D}_\pi)$
7. **// Compute convergence criteria**
8. $\delta_q \leftarrow \frac{1}{|\mathcal{D}_q|}\sum_{(s,a) \in \mathcal{D}_q} (q(s,a; \boldsymbol{\theta}_{n+1}) - q(s,a; \boldsymbol{\theta}_n))^2$
9. $\delta_\pi \leftarrow \frac{1}{|\mathcal{D}_\pi|}\sum_{(s,a^*) \in \mathcal{D}_\pi} \|a^* - \pi_{\boldsymbol{w}_{n+1}}(s)\|^2$
10. $n \leftarrow n + 1$
4. **until** $\max(\delta_q, \delta_\pi) < \varepsilon$ or $n \geq N$
5. **return** $\boldsymbol{\theta}_n$, $\boldsymbol{w}_n$
```
Note that the policy regression targets are maximizers of the Q-function at the iteration when they were computed. If optimization data is accumulated across iterations (so that the policy has enough training points), older action labels correspond to outdated Q-functions and gradually drift out of date.
A natural approach to handle this staleness would be to maintain only the most recent optimization data. We could modify our procedure to keep a sliding window of the last $m$ iterations,

$$\mathcal{D}_\pi^{\text{window}} = \bigcup_{k=n-m+1}^{n} \mathcal{D}_\pi^{(k)},$$

where $\mathcal{D}_\pi^{(k)}$ denotes the exact maximizers computed at iteration $k$. This introduces a trade-off between using more data (larger $m$, lower-variance policy fits) and using fresher data (smaller $m$, targets computed under Q-functions closer to the current one).
The main limitation of this approach, beyond the out-of-distribution drift, is that it requires computing exact optimal actions via the solver for states in $\mathcal{B}_{\text{opt}}$, which remains expensive whenever the action space is continuous. The methods that follow eliminate the exact solver entirely.
In this section, we consider deterministic parametrized policies of the form $\pi_{\boldsymbol{w}}: \mathcal{S} \to \mathcal{A}$, which map each state directly to an action.
When actions are continuous, evaluating the Bellman target requires solving

$$\max_{a \in \mathcal{A}} q(s', a; \boldsymbol{\theta}),$$

where $\mathcal{A} \subseteq \mathbb{R}^k$ is the action set and the objective is a generally nonconvex function of $a$. Each target therefore hides a numerical optimization problem.
After training converges, the agent must select actions in real-time during deployment. Running interior-point methods or gradient-based optimizers at every decision creates unacceptable latency, especially in high-frequency control where decisions occur at 100Hz or faster.
The solution is to amortize the optimization cost by learning a separate policy network $\pi_{\boldsymbol{w}}(s)$ that directly outputs an approximate maximizer, reducing action selection to a single forward pass.
This introduces a second approximation beyond the Q-function. We now have two function approximators: a critic $q(s,a; \boldsymbol{\theta})$ that estimates action values, and an actor $\pi_{\boldsymbol{w}}(s)$ that approximates the maximizer. The actor is trained by gradient ascent on

$$\mathbb{E}_s\left[q(s, \pi_{\boldsymbol{w}}(s); \boldsymbol{\theta})\right],$$
where the expectation is over states in the dataset or replay buffer. This gradient ascent pushes the actor toward actions that the critic considers valuable. By the chain rule, this equals $(\nabla_a q(s,a; \boldsymbol{\theta})|_{a=\pi_{\boldsymbol{w}}(s)}) \cdot (\nabla_{\boldsymbol{w}} \pi_{\boldsymbol{w}}(s))$, which can be efficiently computed via backpropagation through the composition of the two networks.
NFQCA {cite:p}`Hafner2011` extends the NFQI template from the previous chapter to handle continuous action spaces by replacing the $\max$ over actions with a learned actor network.
The algorithm retains the two-level structure of NFQI: an outer loop performs approximate value iteration by computing Bellman targets, and an inner loop fits the Q-function to those targets. NFQCA adds a third component (policy improvement) that updates the actor after each Q-function fit.
Recall from the FQI chapter that NFQI computes Bellman targets using the hard max:

$$y_{s,a} = r + \gamma \max_{a'} q(s', a'; \boldsymbol{\theta}_n)$$
When $\mathcal{A}$ is continuous, this maximization has no closed form. NFQCA instead evaluates the Q-function at the actor's output:

$$y_{s,a} = r + \gamma\, q(s', \pi_{\boldsymbol{w}_n}(s'); \boldsymbol{\theta}_n)$$

The policy $\pi_{\boldsymbol{w}}$ thus stands in for the intractable $\arg\max$ during target computation.
To train the policy, we maximize the expected Q-value under the distribution of states in the dataset. If we had access to the optimal Q-function $q^*$, the ideal objective would be

$$\max_{\boldsymbol{w}} \; \mathbb{E}_{s \sim \hat{P}_{\mathcal{D}}}\left[q^*(s, \pi_{\boldsymbol{w}}(s))\right],$$
where $\hat{P}_{\mathcal{D}}$ is the empirical distribution over states induced by the offline dataset $\mathcal{D}$. In practice, we use the current Q-function approximation $q(s,a; \boldsymbol{\theta}_{n+1})$ after it has been fitted to the latest targets. The expectation is approximated by the sample average over states appearing in $\mathcal{D}$.
This policy improvement step runs after the Q-function has been updated, using the newly-fitted critic to guide the actor toward higher-value actions. Both the Q-function fitting and policy improvement use gradient-based optimization on the respective objectives.
```{prf:algorithm} Neural Fitted Q-Iteration with Continuous Actions (NFQCA)
:label: nfqca
**Input:** Dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^N$, Q-function $q(s,a; \boldsymbol{\theta})$, policy $\pi_{\boldsymbol{w}}(s)$, discount factor $\gamma$, learning rates $\alpha_q, \alpha_\pi$, inner optimization steps $K_q, K_\pi$
**Output:** Q-function parameters $\boldsymbol{\theta}$, policy parameters $\boldsymbol{w}$
1. Initialize $\boldsymbol{\theta}_0$, $\boldsymbol{w}_0$
2. $n \leftarrow 0$
3. **repeat** $\quad$ // **Outer loop: Policy Evaluation and Improvement**
4. $\quad$ **// Construct regression dataset with policy-based Bellman targets**
5. $\quad$ $\mathcal{D}_n^{\text{fit}} \leftarrow \emptyset$
6. $\quad$ **for** each $(s,a,r,s') \in \mathcal{D}$ **do**
7. $\quad\quad$ $a' \leftarrow \pi_{\boldsymbol{w}_n}(s')$ $\quad$ // Actor selects action (replaces max)
8. $\quad\quad$ $y_{s,a} \leftarrow r + \gamma q(s', a'; \boldsymbol{\theta}_n)$ $\quad$ // Critic evaluates
9. $\quad\quad$ $\mathcal{D}_n^{\text{fit}} \leftarrow \mathcal{D}_n^{\text{fit}} \cup \{((s,a), y_{s,a})\}$
10. $\quad$ **// Policy evaluation: fit Q-function to targets**
11. $\quad$ $\boldsymbol{\theta}_{n+1} \leftarrow \texttt{fit}_q(\mathcal{D}_n^{\text{fit}}, \boldsymbol{\theta}_n, K_q, \alpha_q)$
12. $\quad$ **// Policy improvement: maximize Q-function over dataset states**
13. $\quad$ $\boldsymbol{w}_{n+1} \leftarrow \texttt{fit}_\pi(\mathcal{D}, \boldsymbol{w}_n, \boldsymbol{\theta}_{n+1}, K_\pi, \alpha_\pi)$
14. $\quad$ $n \leftarrow n + 1$
15. **until** convergence or $n \geq n_{\max}$
16. **return** $\boldsymbol{\theta}_n$, $\boldsymbol{w}_n$
**where:**
- $\texttt{fit}_q(\mathcal{D}_n^{\text{fit}}, \boldsymbol{\theta}_n, K_q, \alpha_q)$ runs $K_q$ gradient steps minimizing $\frac{1}{|\mathcal{D}_n^{\text{fit}}|} \sum_{((s,a), y) \in \mathcal{D}_n^{\text{fit}}} (q(s,a; \boldsymbol{\theta}) - y)^2$, warm starting from $\boldsymbol{\theta}_n$
- $\texttt{fit}_\pi(\mathcal{D}, \boldsymbol{w}_n, \boldsymbol{\theta}_{n+1}, K_\pi, \alpha_\pi)$ runs $K_\pi$ gradient steps maximizing $\frac{1}{|\mathcal{D}|} \sum_{(s,a,r,s') \in \mathcal{D}} q(s, \pi_{\boldsymbol{w}}(s); \boldsymbol{\theta}_{n+1})$, warm starting from $\boldsymbol{w}_n$
```
The algorithm structure mirrors NFQI (Algorithm {prf:ref}`fitted-q-iteration-batch` in the FQI chapter) with two extensions. First, target computation (lines 7-8) replaces the discrete max with a policy network call $\pi_{\boldsymbol{w}_n}(s')$. Second, a policy improvement step (line 13) retrains the actor against the freshly fitted critic.
Both fit operations use gradient descent with warm starting, consistent with the NFQI template. The Q-function minimizes squared Bellman error using targets computed with the current policy. The policy maximizes the Q-function via gradient ascent on the composition $q(s, \pi_{\boldsymbol{w}}(s); \boldsymbol{\theta}_{n+1})$, with gradient

$$\nabla_{\boldsymbol{w}}\, q(s, \pi_{\boldsymbol{w}}(s); \boldsymbol{\theta}) = \nabla_a q(s,a;\boldsymbol{\theta})\Big|_{a=\pi_{\boldsymbol{w}}(s)} \cdot \nabla_{\boldsymbol{w}} \pi_{\boldsymbol{w}}(s)$$
computed via the chain rule (backpropagation through the actor into the critic). Modern automatic differentiation libraries handle this composition automatically.
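In code, the chain rule never has to be written out: defining the actor loss as the negative mean critic value is enough. A minimal PyTorch sketch, assuming `q_net` and `policy_net` modules (illustrative names, not part of the NFQCA paper):

```python
import torch

def actor_loss(q_net, policy_net, states):
    """Policy improvement objective: gradient ascent on E[q(s, pi_w(s))].

    Backprop flows from q through the action input into the policy
    parameters, implementing the chain-rule gradient above.
    """
    actions = policy_net(states)            # a = pi_w(s), differentiable in w
    return -q_net(states, actions).mean()   # minimize -q == maximize q

# Usage: step only the actor's optimizer so the critic stays fixed.
# loss = actor_loss(q_net, policy_net, state_batch)
# actor_opt.zero_grad(); loss.backward(); actor_opt.step()
```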
We now extend NFQCA to the online setting with evolving replay buffers, mirroring how DQN extended NFQI in the FQI chapter. Just as DQN allowed $\mathcal{B}_t$ and $\hat{P}_{\mathcal{B}_t}$ to evolve during learning instead of using a fixed offline dataset, DDPG {cite:p}`lillicrap2015continuous` collects new transitions during training and stores them in a circular replay buffer.
Like DQN, DDPG uses the flattened FQI structure with target networks. But where DQN maintains a single target network $\boldsymbol{\theta}_{\text{target}}$ for the Q-function, DDPG maintains two target networks: one for the critic, $\boldsymbol{\theta}_{\text{target}}$, and one for the actor, $\boldsymbol{w}_{\text{target}}$.
The online network now plays a triple role in DDPG: (1) the parameters being actively trained ($\boldsymbol{\theta}_t$, $\boldsymbol{w}_t$), (2) the source of behavior during data collection (the online actor, perturbed by exploration noise, selects actions), and (3) the networks through which the policy gradient is backpropagated at each update.
Since the policy $\pi_{\boldsymbol{w}}$ is deterministic, it provides no exploration on its own. DDPG therefore perturbs the policy output during data collection:

$$a = \pi_{\boldsymbol{w}_t}(s) + \eta_t,$$

where $\eta_t$ is drawn from an exploration noise process. The original paper used temporally correlated Ornstein-Uhlenbeck (OU) noise,

$$\eta_{t+1} = \eta_t + \theta(\mu - \eta_t)\,\Delta t + \sigma\sqrt{\Delta t}\;\varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, I),$$

where $\theta$ controls the rate of mean reversion toward $\mu$ (typically zero) and $\sigma$ scales the noise, motivated by physical control systems with momentum.
However, later work (including TD3, discussed below) found that simple uncorrelated Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ performs comparably, making the OU process an unnecessary complication in most settings.
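For concreteness, here is a small sketch of the OU process as commonly discretized for DDPG, alongside the simpler Gaussian alternative; the parameter values are conventional defaults, not prescriptions from this chapter.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck noise via Euler-Maruyama discretization.

    eta <- eta + theta * (mu - eta) * dt + sigma * sqrt(dt) * N(0, I).
    Successive samples are correlated and drift back toward mu.
    """
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.eta = np.full(dim, mu, dtype=np.float64)

    def sample(self):
        drift = self.theta * (self.mu - self.eta) * self.dt
        diffusion = self.sigma * np.sqrt(self.dt) * np.random.randn(self.eta.size)
        self.eta = self.eta + drift + diffusion
        return self.eta

# Uncorrelated Gaussian alternative, one line per step:
# noise = sigma_explore * np.random.randn(action_dim)
```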
```{prf:algorithm} Deep Deterministic Policy Gradient (DDPG)
:label: ddpg
**Input:** MDP $(S, \mathcal{A}, P, R, \gamma)$, Q-network $q(s,a; \boldsymbol{\theta})$, policy network $\pi_{\boldsymbol{w}}(s)$, learning rates $\alpha_q, \alpha_\pi$, target update frequency $K$, replay buffer capacity $B$, mini-batch size $b$, exploration noise $\eta$
**Output:** Q-function parameters $\boldsymbol{\theta}$, policy parameters $\boldsymbol{w}$
1. Initialize $\boldsymbol{\theta}_0$, $\boldsymbol{w}_0$ randomly
2. $\boldsymbol{\theta}_{\text{target}} \leftarrow \boldsymbol{\theta}_0$, $\boldsymbol{w}_{\text{target}} \leftarrow \boldsymbol{w}_0$
3. Initialize replay buffer $\mathcal{B}$ with capacity $B$
4. Initialize exploration noise process $\eta$ (e.g., OU process or Gaussian)
5. $t \leftarrow 0$
6. **while** training **do**
1. Observe current state $s$
2. **// Use online network to collect data with exploration noise**
3. Select action: $a \leftarrow \pi_{\boldsymbol{w}_t}(s) + \eta_t$
4. Execute $a$, observe reward $r$ and next state $s'$
5. Store $(s,a,r,s')$ in $\mathcal{B}$, replacing oldest if full
6. Sample mini-batch of $b$ transitions $\{(s_i,a_i,r_i,s_i')\}_{i=1}^b$ from $\mathcal{B}$
7. **for** each sampled transition $(s_i,a_i,r_i,s_i')$ **do**
1. $y_i \leftarrow r_i + \gamma q(s'_i, \pi_{\boldsymbol{w}_{\text{target}}}(s'_i); \boldsymbol{\theta}_{\text{target}})$ $\quad$ // Actor target selects, critic target evaluates
8. $\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t - \alpha_q \nabla_{\boldsymbol{\theta}} \frac{1}{b}\sum_{i=1}^b (q(s_i,a_i;\boldsymbol{\theta}_t) - y_i)^2$
9. $\boldsymbol{w}_{t+1} \leftarrow \boldsymbol{w}_t + \alpha_\pi \frac{1}{b}\sum_{i=1}^b \nabla_a q(s_i,a;\boldsymbol{\theta}_{t+1})\Big|_{a=\pi_{\boldsymbol{w}_t}(s_i)} \cdot \nabla_{\boldsymbol{w}} \pi_{\boldsymbol{w}_t}(s_i)$
10. **if** $t \bmod K = 0$ **then**
1. $\boldsymbol{\theta}_{\text{target}} \leftarrow \boldsymbol{\theta}_t$
2. $\boldsymbol{w}_{\text{target}} \leftarrow \boldsymbol{w}_t$
11. $t \leftarrow t + 1$
7. **return** $\boldsymbol{\theta}_t$, $\boldsymbol{w}_t$
```
The algorithm structure parallels DQN (Algorithm {prf:ref}`dqn` in the FQI chapter) with the continuous-action extensions from NFQCA. Lines 1-5 initialize both networks and their targets, following the same pattern as DQN but with an additional actor network. Line 3 uses the online actor with exploration noise for data collection, replacing DQN's $\varepsilon$-greedy rule.
The policy gradient in line 9 uses the chain rule to backpropagate through the actor-critic composition:

$$\nabla_{\boldsymbol{w}}\, q(s, \pi_{\boldsymbol{w}}(s); \boldsymbol{\theta}) = \nabla_a q(s,a;\boldsymbol{\theta})\Big|_{a=\pi_{\boldsymbol{w}}(s)} \cdot \nabla_{\boldsymbol{w}} \pi_{\boldsymbol{w}}(s)$$
This is identical to the NFQCA gradient, but now computed on mini-batches sampled from an evolving replay buffer rather than a fixed offline dataset. The critic gradient in line 8 is the standard squared-error gradient toward targets computed by the target actor and target critic.
DDPG inherits the overestimation bias from DQN's use of the max operator in Bellman targets. TD3 {cite:p}`fujimoto2018addressing` addresses this through three modifications to the DDPG template, following similar principles to Double DQN but adapted for continuous actions and taking a more conservative approach.
Recall from the Monte Carlo chapter that overestimation arises when we use the same noisy estimate both to select which action looks best and to evaluate that action. Double Q-learning breaks this coupling by maintaining two independent estimators with noise terms $\varepsilon^{(1)}$ and $\varepsilon^{(2)}$: one estimator selects the action, the other evaluates it.
When a single estimator $\mu(s,a) + \varepsilon_a$ plays both roles, the selected action is the one whose noise happens to be largest, so $\mathbb{E}[\max_a(\mu(s,a) + \varepsilon_a)] \geq \max_a \mu(s,a)$ and the resulting targets are systematically optimistic.
Double DQN (Algorithm {prf:ref}`double-dqn`) implements this principle in the discrete action setting by using the online network $\boldsymbol{\theta}_t$ for selection ($a^\star_i \leftarrow \arg\max_{a'} q(s_i',a'; \boldsymbol{\theta}_t)$) and the target network $\boldsymbol{\theta}_{\text{target}}$ for evaluation ($y_i \leftarrow r_i + \gamma q(s_i',a^\star_i; \boldsymbol{\theta}_{\text{target}})$). Since these networks experience different training noise, their errors are approximately independent, achieving the independence condition needed to eliminate evaluation bias. However, selection bias remains: the argmax still picks actions that received positive noise in the selection network, so the selection network's own estimate of its chosen action stays optimistic, $\mathbb{E}_{\varepsilon^{(1)}}\big[\mu(s,a^\star) + \varepsilon^{(1)}(s,a^\star)\big] \ge \max_a \mu(s,a)$.
TD3 takes a more conservative approach. Instead of decoupling selection from evaluation, TD3 maintains twin Q-networks $q^A$ and $q^B$ and computes targets using the minimum of the two:

$$y_i = r_i + \gamma \min\big(q^A(s'_i, \tilde{a}_i; \boldsymbol{\theta}^A_{\text{target}}),\; q^B(s'_i, \tilde{a}_i; \boldsymbol{\theta}^B_{\text{target}})\big),$$
where $\tilde{a}_i = \pi_{\boldsymbol{w}_{\text{target}}}(s'_i)$. This minimum operation provides a pessimistic estimate: if the two Q-networks have independent errors $q^A(s',a) = q^*(s',a) + \varepsilon^A$ and $q^B(s',a) = q^*(s',a) + \varepsilon^B$, then $\mathbb{E}[\min(q^A, q^B)] \leq q^*(s',a)$, since the minimum of two unbiased estimates is biased downward.
The connection to the conditional independence framework is subtle but important. While Double DQN uses independence to eliminate bias in expectation (one network selects, another evaluates), TD3 uses independence to construct a deliberate lower bound. Both approaches rely on maintaining two Q-functions with partially decorrelated errors, achieved through different initializations and stochastic gradients during training, but they aggregate these functions differently. Double DQN's decoupling targets unbiased estimation by breaking the correlation between selection and evaluation noise. TD3's minimum operation targets robust estimation by taking the most pessimistic view when the two networks disagree.
This trade-off between bias and robustness is deliberate. In actor-critic methods, the policy gradient pushes toward actions with high Q-values. Overestimation is particularly harmful because it can lead the policy to exploit erroneous high-value regions. Underestimation is generally safer: the policy may ignore some good actions, but it will not be misled into pursuing actions that only appear valuable due to approximation error. The minimum operation implements a "trust the pessimist" principle that complements the policy optimization objective.
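A quick numerical illustration of both effects, in an assumed toy setting where every action's true value is zero and estimation noise is unit Gaussian: the max of noisy estimates is biased up, while the min of two independent estimates of a fixed action is biased down.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials, sigma = 10, 100_000, 1.0

# One noisy estimate per action: taking the max overestimates the true max (0).
noisy = rng.normal(0.0, sigma, size=(n_trials, n_actions))
print("E[max_a (mu + eps)] ~", noisy.max(axis=1).mean())      # about +1.54

# Twin independent estimates of a single action: the min underestimates (0).
qa, qb = rng.normal(0.0, sigma, n_trials), rng.normal(0.0, sigma, n_trials)
print("E[min(qA, qB)]      ~", np.minimum(qa, qb).mean())     # about -0.56
```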
TD3 also introduces two additional modifications beyond the clipped double Q-learning. First, target policy smoothing adds clipped noise to the target policy's actions when computing targets:

$$\tilde{a} = \pi_{\boldsymbol{w}_{\text{target}}}(s') + \text{clip}(\varepsilon, -c, c), \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2),$$

which prevents the policy from exploiting narrow peaks in the Q-function. Second, delayed policy updates: the actor and the target networks are updated only once every $d$ critic updates, letting the value estimates settle before each policy improvement step.
TD3 also replaces DDPG's hard target updates with exponential moving average (EMA) updates, following the smooth update scheme from Algorithm {prf:ref}`nfqi-flattened-ema` in the FQI chapter. Instead of copying $\boldsymbol{\theta}_{\text{target}} \leftarrow \boldsymbol{\theta}_t$ every $K$ steps, EMA smoothly tracks the online network: $\boldsymbol{\theta}_{\text{target}} \leftarrow \tau \boldsymbol{\theta}_t + (1-\tau)\boldsymbol{\theta}_{\text{target}}$ at every update. For small $\tau$ (e.g., $\tau = 0.005$), the target drifts slowly toward the online network, providing stable regression targets without abrupt shifts.
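The EMA update is a few lines in any framework; a PyTorch sketch (an illustrative helper, not library API):

```python
import torch

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.005):
    """Polyak averaging: theta_target <- tau * theta + (1 - tau) * theta_target."""
    for p_tgt, p in zip(target_net.parameters(), online_net.parameters()):
        p_tgt.mul_(1.0 - tau).add_(tau * p)
```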
```{prf:algorithm} Twin Delayed DDPG (TD3)
:label: td3
**Input:** MDP $(S, \mathcal{A}, P, R, \gamma)$, twin Q-networks $q^A(s,a; \boldsymbol{\theta}^A)$, $q^B(s,a; \boldsymbol{\theta}^B)$, policy network $\pi_{\boldsymbol{w}}(s)$, learning rates $\alpha_q, \alpha_\pi$, replay buffer capacity $B$, mini-batch size $b$, policy delay $d$, EMA rate $\tau$, target noise $\sigma$, noise clip $c$, exploration noise $\sigma_{\text{explore}}$
**Output:** Twin Q-function parameters $\boldsymbol{\theta}^A, \boldsymbol{\theta}^B$, policy parameters $\boldsymbol{w}$
1. Initialize $\boldsymbol{\theta}^A_0$, $\boldsymbol{\theta}^B_0$, $\boldsymbol{w}_0$ randomly
2. $\boldsymbol{\theta}^A_{\text{target}} \leftarrow \boldsymbol{\theta}^A_0$, $\boldsymbol{\theta}^B_{\text{target}} \leftarrow \boldsymbol{\theta}^B_0$, $\boldsymbol{w}_{\text{target}} \leftarrow \boldsymbol{w}_0$
3. Initialize replay buffer $\mathcal{B}$ with capacity $B$
4. $t \leftarrow 0$
5. **while** training **do**
1. Observe current state $s$
2. **// Data collection with exploration noise**
3. Select action: $a \leftarrow \pi_{\boldsymbol{w}_t}(s) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma_{\text{explore}}^2)$
4. Execute $a$, observe reward $r$ and next state $s'$
5. Store $(s,a,r,s')$ in $\mathcal{B}$, replacing oldest if full
6. Sample mini-batch of $b$ transitions $\{(s_i,a_i,r_i,s_i')\}_{i=1}^b$ from $\mathcal{B}$
7. **// Compute targets with clipped double Q-learning**
8. **for** each sampled transition $(s_i,a_i,r_i,s_i')$ **do**
1. $\tilde{a}_i \leftarrow \pi_{\boldsymbol{w}_{\text{target}}}(s'_i) + \text{clip}(\varepsilon_i, -c, c)$, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ $\quad$ // Target policy smoothing
2. $y_i \leftarrow r_i + \gamma \color{blue}{\min\big(q^A(s'_i, \tilde{a}_i; \boldsymbol{\theta}^A_{\text{target}}), q^B(s'_i, \tilde{a}_i; \boldsymbol{\theta}^B_{\text{target}})\big)}$ $\quad$ // Minimum of twin targets
9. **// Update both Q-networks toward same targets**
10. $\boldsymbol{\theta}^A_{t+1} \leftarrow \boldsymbol{\theta}^A_t - \alpha_q \nabla_{\boldsymbol{\theta}} \frac{1}{b}\sum_{i=1}^b(q^A(s_i,a_i;\boldsymbol{\theta}^A_t) - y_i)^2$
11. $\boldsymbol{\theta}^B_{t+1} \leftarrow \boldsymbol{\theta}^B_t - \alpha_q \nabla_{\boldsymbol{\theta}} \frac{1}{b}\sum_{i=1}^b(q^B(s_i,a_i;\boldsymbol{\theta}^B_t) - y_i)^2$
12. **// Delayed policy update and target network updates**
13. **if** $t \bmod d = 0$ **then**
1. $\boldsymbol{w}_{t+1} \leftarrow \boldsymbol{w}_t + \alpha_\pi \frac{1}{b}\sum_{i=1}^b \nabla_a q^A(s_i,a;\boldsymbol{\theta}^A_{t+1})\Big|_{a=\pi_{\boldsymbol{w}_t}(s_i)} \cdot \nabla_{\boldsymbol{w}} \pi_{\boldsymbol{w}_t}(s_i)$
2. $\boldsymbol{\theta}^A_{\text{target}} \leftarrow \tau\boldsymbol{\theta}^A_{t+1} + (1-\tau)\boldsymbol{\theta}^A_{\text{target}}$ (EMA update)
3. $\boldsymbol{\theta}^B_{\text{target}} \leftarrow \tau\boldsymbol{\theta}^B_{t+1} + (1-\tau)\boldsymbol{\theta}^B_{\text{target}}$
4. $\boldsymbol{w}_{\text{target}} \leftarrow \tau\boldsymbol{w}_{t+1} + (1-\tau)\boldsymbol{w}_{\text{target}}$
14. $t \leftarrow t + 1$
6. **return** $\boldsymbol{\theta}^A_t$, $\boldsymbol{\theta}^B_t$, $\boldsymbol{w}_t$
```
The algorithm structure parallels Double DQN but with continuous actions. Lines 8.1-8.2 implement clipped double Q-learning: smoothing adds noise to target actions (preventing exploitation of Q-function artifacts), and the min operation (highlighted in blue) provides pessimistic value estimates. Both critics update toward the same shared target (lines 10-11), but their different initializations and stochastic gradient noise keep their errors partially decorrelated, following the same principle underlying Double DQN's independence assumption. Line 13 gates the policy and target updates to every $d$ steps, with the targets tracking the online networks via the EMA scheme of Algorithm {prf:ref}`nfqi-flattened-ema`.
TD3 simplifies exploration by replacing DDPG's Ornstein-Uhlenbeck process with uncorrelated Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma_{\text{explore}}^2)$, which performs comparably while removing the noise-process hyperparameters.
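The three target-side modifications fit in a few lines. A sketch assuming PyTorch modules for the target actor and twin target critics, with the paper's conventional default noise parameters; the `done` flag handles terminal transitions, a detail the pseudocode above leaves implicit.

```python
import torch

@torch.no_grad()
def td3_targets(r, s_next, done, actor_tgt, qA_tgt, qB_tgt,
                gamma=0.99, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    """Clipped double-Q targets with target policy smoothing."""
    a_next = actor_tgt(s_next)
    noise = (sigma * torch.randn_like(a_next)).clamp(-c, c)   # smoothing noise
    a_next = (a_next + noise).clamp(a_low, a_high)            # keep action in bounds
    q_min = torch.min(qA_tgt(s_next, a_next), qB_tgt(s_next, a_next))
    return r + gamma * (1.0 - done) * q_min                   # pessimistic bootstrap
```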
DDPG and TD3 address continuous actions by learning a deterministic policy that amortizes the $\arg\max$ over actions. The cost of determinism is exploration: the policy itself never varies its actions, so all exploration must be injected externally through noise.
The smoothing chapter presents an alternative: entropy-regularized MDPs, where the agent maximizes expected return plus a bonus for policy randomness. This yields stochastic policies with exploration built into the objective itself. The smooth Bellman operator replaces the hard max with a soft-max:

$$v(s) = \frac{1}{\beta} \log \int_{\mathcal{A}} \exp\big(\beta\, q(s,a)\big)\, da,$$
where $\beta > 0$ is the inverse temperature (the entropy weight is $\alpha = 1/\beta$): as $\beta \to \infty$ the soft-max approaches the hard max, while smaller $\beta$ rewards more stochastic behavior.
This integral is intractable. We face an infinite-dimensional sum over the continuous action space. The very smoothness that gives us stochastic policies creates a new computational barrier, distinct from but analogous to the $\max_a$ problem we started with.
Soft actor-critic (SAC) {cite:p}`haarnoja2018soft,haarnoja2018sacapplications` exploits an equivalence between the intractable integral and an expectation. The optimal policy under entropy regularization is the Boltzmann distribution $\pi^*(a|s) \propto \exp(\beta \cdot q^*(s,a))$. Under this policy, the soft value function becomes:

$$v^*(s) = \mathbb{E}_{a \sim \pi^*(\cdot|s)}\big[q^*(s,a) - \alpha \log \pi^*(a|s)\big].$$
This converts the intractable integral into an expectation we can estimate by sampling. SAC learns a parametric policy $\pi_{\boldsymbol{\phi}}(a|s)$ to stand in for the Boltzmann distribution and estimates the soft value at the next state with a single sampled action:

$$\hat{v}(s') = \min_{j=1,2} q^j(s', \tilde{a}'; \boldsymbol{\theta}^j_{\text{target}}) - \alpha \log \pi_{\boldsymbol{\phi}}(\tilde{a}'|s'), \qquad \tilde{a}' \sim \pi_{\boldsymbol{\phi}}(\cdot|s').$$
The minimum over twin Q-networks applies the clipped double-Q trick from TD3. Exploration comes from the policy's stochasticity rather than external noise.
The Q-network update assumes a policy that stays close to the Boltzmann distribution induced by the current critic. Since this target density $\pi^*(a|s) \propto \exp(\beta \cdot q(s,a;\boldsymbol{\theta}))$ is known only up to its normalizing constant, SAC trains the policy by minimizing the KL divergence to it over buffer states:

$$J_\pi(\boldsymbol{\phi}) = \mathbb{E}_{s}\, \mathbb{E}_{a \sim \pi_{\boldsymbol{\phi}}(\cdot|s)}\big[\alpha \log \pi_{\boldsymbol{\phi}}(a|s) - q(s,a;\boldsymbol{\theta})\big].$$

This pushes probability toward high Q-value actions while the $\alpha \log \pi_{\boldsymbol{\phi}}$ term penalizes concentrating the policy, preserving stochasticity.
To estimate gradients of this objective, we face a technical problem: the policy parameters $\boldsymbol{\phi}$ define the distribution the actions are sampled from, so the gradient cannot pass through the sampling step directly. The reparameterization trick rewrites the action as a deterministic function of the state and exogenous noise, $a = f_{\boldsymbol{\phi}}(s, \epsilon)$ with $\epsilon \sim \mathcal{N}(0, I)$. This moves $\boldsymbol{\phi}$ out of the sampling distribution and into the integrand. We can now differentiate through $f_{\boldsymbol{\phi}}$ and the Q-function with ordinary backpropagation, just as in the deterministic policy gradient.
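Concretely, $f_{\boldsymbol{\phi}}$ is usually a tanh-squashed Gaussian: the network outputs a state-dependent mean and log-standard-deviation, the sample is $\tanh(\mu + \sigma\epsilon)$, and the log-density picks up a change-of-variables correction from the squashing. A sketch, assuming a PyTorch `policy_net(s)` (illustrative name) that returns `(mu, log_std)`:

```python
import torch

def sample_action(policy_net, s):
    """Reparameterized sample from a tanh-squashed Gaussian policy."""
    mu, log_std = policy_net(s)              # state-dependent Gaussian parameters
    std = log_std.exp()
    eps = torch.randn_like(mu)               # exogenous noise: fixed distribution
    pre_tanh = mu + std * eps                # differentiable in (mu, std)
    a = torch.tanh(pre_tanh)                 # squash into (-1, 1)

    # log pi(a|s): Gaussian log-density plus tanh change-of-variables term.
    log_prob = torch.distributions.Normal(mu, std).log_prob(pre_tanh)
    log_prob = log_prob - torch.log1p(-a.pow(2) + 1e-6)
    return a, log_prob.sum(dim=-1, keepdim=True)
```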
```{prf:algorithm} Soft Actor-Critic (SAC)
:label: sac
**Input:** MDP $(S, \mathcal{A}, P, R, \gamma)$, twin Q-networks $q^1(s,a; \boldsymbol{\theta}^1), q^2(s,a; \boldsymbol{\theta}^2)$, policy $\pi_{\boldsymbol{\phi}}$, learning rates $\alpha_q, \alpha_\pi$, replay buffer capacity $B$, mini-batch size $b$, EMA rate $\tau$, entropy weight $\alpha$
**Output:** Q-function parameters $\boldsymbol{\theta}^1, \boldsymbol{\theta}^2$, policy parameters $\boldsymbol{\phi}$
1. Initialize $\boldsymbol{\theta}^1_0, \boldsymbol{\theta}^2_0, \boldsymbol{\phi}_0$ randomly
2. $\boldsymbol{\theta}^1_{\text{target}} \leftarrow \boldsymbol{\theta}^1_0$, $\boldsymbol{\theta}^2_{\text{target}} \leftarrow \boldsymbol{\theta}^2_0$
3. Initialize replay buffer $\mathcal{B}$ with capacity $B$
4. $t \leftarrow 0$
5. **while** training **do**
1. Observe state $s$
2. Sample action: $a \sim \pi_{\boldsymbol{\phi}_t}(\cdot|s)$ $\quad$ // Stochastic policy provides exploration
3. Execute $a$, observe reward $r$ and next state $s'$
4. Store $(s,a,r,s')$ in $\mathcal{B}$, replacing oldest if full
5. Sample mini-batch $\{(s_i,a_i,r_i,s_i')\}_{i=1}^b$ from $\mathcal{B}$
6. **// Update Q-networks: bootstrap using single-sample soft value estimate**
7. **for** each $s'_i$ **do** sample $\tilde{a}'_i \sim \pi_{\boldsymbol{\phi}_t}(\cdot|s'_i)$
8. $y_i \leftarrow r_i + \gamma \left[\min(q^1(s'_i, \tilde{a}'_i; \boldsymbol{\theta}^1_{\text{target}}), q^2(s'_i, \tilde{a}'_i; \boldsymbol{\theta}^2_{\text{target}})) - \alpha \log \pi_{\boldsymbol{\phi}_t}(\tilde{a}'_i|s'_i)\right]$
9. $\boldsymbol{\theta}^1_{t+1} \leftarrow \boldsymbol{\theta}^1_t - \alpha_q \nabla_{\boldsymbol{\theta}} \frac{1}{b}\sum_{i=1}^b(q^1(s_i,a_i;\boldsymbol{\theta}^1_t) - y_i)^2$
10. $\boldsymbol{\theta}^2_{t+1} \leftarrow \boldsymbol{\theta}^2_t - \alpha_q \nabla_{\boldsymbol{\theta}} \frac{1}{b}\sum_{i=1}^b(q^2(s_i,a_i;\boldsymbol{\theta}^2_t) - y_i)^2$
11. **// Update policy: minimize KL to Boltzmann distribution**
12. **for** each $s_i$ **do** sample $\epsilon_i \sim \mathcal{N}(0,I)$ and compute $\hat{a}_i = f_{\boldsymbol{\phi}_t}(s_i, \epsilon_i)$ $\quad$ // Reparameterization
13. $\boldsymbol{\phi}_{t+1} \leftarrow \boldsymbol{\phi}_t - \alpha_\pi \nabla_{\boldsymbol{\phi}} \frac{1}{b}\sum_{i=1}^b \left[\alpha \log \pi_{\boldsymbol{\phi}_t}(\hat{a}_i|s_i) - \min_{j=1,2} q^j(s_i, \hat{a}_i; \boldsymbol{\theta}^j_{t+1})\right]$
14. **// EMA update for Q-network targets**
15. $\boldsymbol{\theta}^1_{\text{target}} \leftarrow \tau \boldsymbol{\theta}^1_{t+1} + (1-\tau)\boldsymbol{\theta}^1_{\text{target}}$
16. $\boldsymbol{\theta}^2_{\text{target}} \leftarrow \tau \boldsymbol{\theta}^2_{t+1} + (1-\tau)\boldsymbol{\theta}^2_{\text{target}}$
17. $t \leftarrow t + 1$
6. **return** $\boldsymbol{\theta}^1_t, \boldsymbol{\theta}^2_t, \boldsymbol{\phi}_t$
```
The algorithm interleaves three updates. The Q-networks (lines 7-10) follow fitted Q-iteration with the soft Bellman target: sample a next action $\tilde{a}'_i$ from the current policy, compute the entropy-adjusted target $y_i = r_i + \gamma[\min_j q^j(s'_i, \tilde{a}'_i; \boldsymbol{\theta}^j_{\text{target}}) - \alpha \log \pi_{\boldsymbol{\phi}}(\tilde{a}'_i|s'_i)]$, and minimize squared error. The minimum over twin Q-networks mitigates overestimation as in TD3. The policy (lines 12-13) updates to match the Boltzmann distribution induced by the current Q-function, using the reparameterization trick for gradient estimation. Target networks update via EMA (lines 15-16) to stabilize training.
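Lines 7-10 in code: a sketch of the twin-critic loss with the single-sample soft value target, assuming the reparameterized sampler above and PyTorch critic modules (names illustrative; `done` handles terminal states):

```python
import torch

def sac_critic_loss(q1, q2, q1_tgt, q2_tgt, sample_action_fn, policy_net,
                    batch, alpha=0.2, gamma=0.99):
    """Soft Bellman regression loss for SAC's twin critics."""
    s, a, r, s_next, done = batch
    with torch.no_grad():                                 # targets held fixed
        a_next, logp_next = sample_action_fn(policy_net, s_next)
        q_min = torch.min(q1_tgt(s_next, a_next), q2_tgt(s_next, a_next))
        y = r + gamma * (1.0 - done) * (q_min - alpha * logp_next)
    return ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
```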
The stochastic policy serves the same amortization purpose as in DDPG and TD3: it replaces an intractable operation over the action space, here the soft-max integral, with a single forward pass (plus a cheap noise sample) at decision time.
DDPG, TD3, and SAC all follow the same solution template from fitted Q-iteration: compute Bellman targets using the current Q-function, fit the Q-function to those targets, repeat. This is successive approximation, the function iteration approach to solving the Bellman equation: apply the operator repeatedly until reaching its fixed point.
Path Consistency Learning (PCL) {cite:p}`Nachum2017` solves the Bellman equation differently. Instead of iterating the operator, it directly minimizes a residual. This is the least-squares approach from projection methods: solve the equation by driving its residual to zero through gradient descent, rather than by repeated operator application.
Consider the entropy-regularized Q-function Bellman equation from the smoothing chapter. Under general stochastic dynamics, it involves an expectation over next states:

$$q^*(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[v^*(s')\right].$$
Suppose the dynamics are deterministic: $s' = f(s,a)$. The expectation collapses to a single evaluation:

$$q^*(s,a) = r(s,a) + \gamma\, v^*(f(s,a)).$$
The value function relates to Q-functions through the soft-max:

$$v^*(s) = \alpha \log \int_{\mathcal{A}} \exp\big(q^*(s,a)/\alpha\big)\, da.$$
Contrast two cases: general policies versus the optimal Boltzmann policy.
For general policies, the value equals an expectation:
$$ v^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[q^\pi(s,a) - \alpha \log \pi(a|s)] $$ (eq:v-general-policy)
This is an average. For a single observed action $a \sim \pi(\cdot|s)$ we only have

$$q^\pi(s,a) - \alpha \log \pi(a|s) = v^\pi(s) + \varepsilon,$$

where $\varepsilon$ is a zero-mean sampling error that vanishes only after averaging over many actions.
For the optimal policy under entropy regularization, the Boltzmann structure produces an exact pointwise identity. The optimal policy is:

$$\pi^*(a|s) = \exp\big((q^*(s,a) - v^*(s))/\alpha\big).$$
Taking logarithms and rearranging:
$$ v^*(s) = q^*(s,a) - \alpha \log \pi^*(a|s) \quad \text{for all } a $$ (eq:v-q-exact-boltzmann)
This holds exactly for every action $a$, not only the optimal one: a low Q-value is offset by a low probability, and the two terms balance to the same $v^*(s)$.
Now take a trajectory segment $(s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_d)$ generated under the deterministic dynamics. Starting from $q^*(s_0,a_0) = r_0 + \gamma v^*(s_1)$, apply equation {eq}`eq:v-q-exact-boltzmann` to substitute $v^*(s_1) = q^*(s_1,a_1) - \alpha\log\pi^*(a_1|s_1)$ exactly:

$$q^*(s_0,a_0) = r_0 + \gamma\big[q^*(s_1,a_1) - \alpha\log\pi^*(a_1|s_1)\big].$$

Substitute $q^*(s_1,a_1) = r_1 + \gamma v^*(s_2)$:

$$q^*(s_0,a_0) = r_0 + \gamma\big[r_1 - \alpha\log\pi^*(a_1|s_1)\big] + \gamma^2 v^*(s_2).$$

Continue this telescoping for $d$ steps until reaching $v^*(s_d)$:

$$q^*(s_0,a_0) = \sum_{t=0}^{d-1} \gamma^t r_t - \sum_{t=1}^{d-1} \gamma^t \alpha\log\pi^*(a_t|s_t) + \gamma^d v^*(s_d).$$
Apply equation {eq}`eq:v-q-exact-boltzmann` once more to get $v^*(s_0) = q^*(s_0,a_0) - \alpha\log\pi^*(a_0|s_0)$:
$$ v^*(s_0) = \sum_{t=0}^{d-1} \gamma^t \big[r_t - \alpha\log\pi^*(a_t|s_t)\big] + \gamma^d v^*(s_d) $$ (eq:path-consistency-exact)
Rearranging gives the path consistency residual:
$$ R(s_0{:}s_d; \pi^*, v^*) = v^*(s_0) - \gamma^d v^*(s_d) - \sum_{t=0}^{d-1} \gamma^t\big[r_t - \alpha\log\pi^*(a_t|s_t)\big] = 0 $$ (eq:path-residual)
The telescoping produces an exact identity:
```{admonition} Exact identity versus noisy estimate
:class: dropdown
The distinction between equations {eq}`eq:v-general-policy` and {eq}`eq:v-q-exact-boltzmann` is subtle but crucial.
**For general policies** (equation {eq}`eq:v-general-policy`), the value is an average over actions sampled from the policy. Individual actions give noisy estimates: if we draw $a \sim \pi(\cdot|s)$, then $q^\pi(s,a) - \alpha\log\pi(a|s) = v^\pi(s) + \varepsilon$ where $\varepsilon$ is a zero-mean random variable. We need to average many samples to estimate $v^\pi(s)$ accurately. Multi-step telescoping would accumulate these sampling errors $\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_{d-1}$, producing noisy residuals even at the true solution. Off-policy learning would require importance weights to correct for using actions from a different behavior policy.
**For the optimal entropy-regularized policy** (equation {eq}`eq:v-q-exact-boltzmann`), the Boltzmann structure collapses the expectation to a pointwise identity. The relationship $v^*(s) = q^*(s,a) - \alpha\log\pi^*(a|s)$ holds exactly for every action $a$, optimal or not. A suboptimal action has low $q^*(s,a)$ (low expected return) and low $\pi^*(a|s)$ (low probability), making $-\alpha\log\pi^*(a|s)$ large. These terms balance precisely to give $v^*(s)$. No sampling error exists. The telescoping is exact, producing a residual that equals zero for every action sequence, not just in expectation. Off-policy learning works because the constraint holds as a deterministic identity for any observed path.
This property is unique to soft-max operators. For hard-max, $v^*(s) = \max_a q^*(s,a)$ holds only when $a$ is optimal. Suboptimal actions satisfy $v^*(s) > q^*(s,a)$, an inequality that cannot be used to construct a residual.
```
PCL's two structural requirements (deterministic dynamics and entropy regularization) are not arbitrary design choices. Each addresses a fundamental theoretical issue.
Under stochastic dynamics, the Q-function Bellman equation has an expectation over next states:

$$q^*(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[v^*(s')\right].$$
The exact relationship {eq}`eq:v-q-exact-boltzmann` still holds, so we can write the path consistency constraint. But now consider what PCL minimizes: the squared residual $\mathbb{E}[R^2]$, estimated from sampled trajectories.
At the true optimum $(v^*, \pi^*)$, the constraint holds in expectation over next states, $\mathbb{E}[R] = 0$, but the squared residual obeys

$$\mathbb{E}[R^2] = (\mathbb{E}[R])^2 + \mathrm{Var}(R) = \mathrm{Var}(R) \geq 0,$$

with equality only when the transition noise vanishes. Minimizing $\mathbb{E}[R^2]$ therefore pulls the solution away from $(v^*, \pi^*)$ toward functions that also shrink the residual's variance.
This is Baird's double sampling problem {cite:p}`Baird1995`. To get an unbiased gradient of $(\mathbb{E}[R])^2$ rather than $\mathbb{E}[R^2]$, we would need the product of two independent realizations of the residual. This requires two independent samples of the next state from the same state-action pair, which a single observed trajectory cannot provide.
Under deterministic dynamics, the next state is a function of $(s,a)$: the residual carries no sampling noise, $\mathbb{E}[R^2] = R^2$, and minimizing the squared residual is exact.
Attempt the same path consistency derivation with the hard-max Bellman operator. Under deterministic dynamics, the Q-function satisfies:

$$q^*(s,a) = r(s,a) + \gamma\, v^*(f(s,a)),$$
where $v^*(s) = \max_{a'} q^*(s,a')$ and the optimal policy is $\pi^*(s) = \arg\max_a q^*(s,a)$ (deterministic).
Now try to relate $v^*(s)$ to an arbitrary observed action $a$. For the optimal action $a^\star \in \arg\max_{a'} q^*(s,a')$, we have:

$$v^*(s) = q^*(s, a^\star).$$

But for a suboptimal action $a$, all we can say is

$$v^*(s) > q^*(s,a).$$
This is an inequality, not an equation. There is no formula expressing $v^*(s)$ in terms of $q^*(s,a)$ for suboptimal actions.
Attempt the multi-step telescoping. Start with $q^*(s_0, a_0) = r_0 + \gamma v^*(s_1)$. To continue, we need to express $v^*(s_1)$ in terms of $q^*(s_1, a_1)$ for the action $a_1$ actually observed, but the hard max only gives

$$v^*(s_1) \geq q^*(s_1, a_1),$$

with equality only if $a_1$ happens to be optimal. The telescoping halts at the first suboptimal action on the path.
Compare this to the soft-max case. The Boltzmann structure gives equation {eq}`eq:v-q-exact-boltzmann`: $v^*(s) = q^*(s,a) - \alpha\log\pi^*(a|s)$ for all actions $a$. The log-probability term compensates exactly for suboptimality: low-probability actions have large $-\alpha\log\pi^*(a|s)$, which adds to the low $q^*(s,a)$ to recover $v^*(s)$. This enables exact substitution at every step of the telescoping.
The telescoping proceeds without inequalities or restrictions on which actions were chosen. Multi-step hard-max Q-learning lacks theoretical justification for off-policy data because when we observe a trajectory with suboptimal actions, we cannot write an exact path consistency constraint.
Both requirements are structural:
| Requirement | Addresses |
|---|---|
| Deterministic dynamics | Double sampling bias: ensures the residual is noise-free, so squared-residual minimization is unbiased |
| Entropy regularization | All-action consistency (equation {eq}`eq:v-q-exact-boltzmann`) |
Without deterministic dynamics, residual minimization is biased. Without entropy regularization, the constraint holds only for optimal actions.
Equation {eq}`eq:path-residual` provides a constraint that the optimal $(v^*, \pi^*)$ must satisfy: the residual equals zero for every observed path. For parametric approximations $v_{\boldsymbol{\phi}}$ and $\pi_{\boldsymbol{\theta}}$, the residual is generally nonzero, and its magnitude measures how far the pair is from path consistency.
PCL minimizes the squared residual over observed path segments:

$$\min_{\boldsymbol{\theta}, \boldsymbol{\phi}} \; \mathbb{E}_{\text{segments}}\left[\tfrac{1}{2}\, R(s_i{:}s_{i+d};\, \pi_{\boldsymbol{\theta}}, v_{\boldsymbol{\phi}})^2\right].$$
This is the least-squares residual approach from the projection methods chapter. SAC computes targets and regresses toward them, holding one side of the equation fixed; PCL instead differentiates the residual with respect to all parameters at once.
Gradient descent on the squared residual gives:

$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta_\pi\, \alpha R \sum_{t=0}^{d-1} \gamma^t \nabla_{\boldsymbol{\theta}} \log\pi_{\boldsymbol{\theta}}(a_t|s_t), \qquad \boldsymbol{\phi} \leftarrow \boldsymbol{\phi} - \eta_v\, R\left[\nabla_{\boldsymbol{\phi}} v_{\boldsymbol{\phi}}(s_0) - \gamma^d \nabla_{\boldsymbol{\phi}} v_{\boldsymbol{\phi}}(s_d)\right],$$

where the shared residual $R$ scales both updates: a positive residual lowers the value at $s_0$ and the log-probabilities of the observed actions, and a negative residual raises them.
```{prf:algorithm} Path Consistency Learning (PCL)
:label: pcl
**Input:** MDP with deterministic dynamics $s_{t+1} = f(s_t, a_t)$, policy $\pi_{\boldsymbol{\theta}}$, value function $v_{\boldsymbol{\phi}}$, entropy weight $\alpha$, path length $d$, learning rates $\eta_\pi, \eta_v$, replay buffer capacity $B$
**Output:** Policy parameters $\boldsymbol{\theta}$, value parameters $\boldsymbol{\phi}$
1. Initialize $\boldsymbol{\theta}_0$, $\boldsymbol{\phi}_0$
2. Initialize replay buffer $\mathcal{R}$ with capacity $B$
3. $k \leftarrow 0$
4. **while** training **do**
1. Sample trajectory $\tau = (s_0, a_0, r_0, \ldots, s_T)$ from $\pi_{\boldsymbol{\theta}_k}$ and store in $\mathcal{R}$
2. Sample trajectory $\tau'$ from $\mathcal{R}$
3. **for** each $d$-step segment in $\tau'$ **do**
1. Compute residual: $R_i \leftarrow v_{\boldsymbol{\phi}_k}(s_i) - \gamma^d v_{\boldsymbol{\phi}_k}(s_{i+d}) - \sum_{t=0}^{d-1} \gamma^t[r_{i+t} - \alpha \log \pi_{\boldsymbol{\theta}_k}(a_{i+t}|s_{i+t})]$
2. Update policy: $\boldsymbol{\theta}_{k+1} \leftarrow \boldsymbol{\theta}_k - \eta_\pi \alpha R_i \sum_{t=0}^{d-1} \gamma^t \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}_k}(a_{i+t}|s_{i+t})$
3. Update value: $\boldsymbol{\phi}_{k+1} \leftarrow \boldsymbol{\phi}_k - \eta_v R_i\left[\nabla_{\boldsymbol{\phi}} v_{\boldsymbol{\phi}_k}(s_i) - \gamma^d \nabla_{\boldsymbol{\phi}} v_{\boldsymbol{\phi}_k}(s_{i+d})\right]$
4. Remove oldest trajectories if $|\mathcal{R}| > B$
5. $k \leftarrow k + 1$
5. **return** $\boldsymbol{\theta}_k$, $\boldsymbol{\phi}_k$
```
The algorithm collects trajectories from the current policy and stores them in a replay buffer. At each iteration, it samples a trajectory (possibly old) and performs gradient descent on the path residual for every $d$-step segment it contains. Because the consistency constraint holds for arbitrary paths, off-policy segments require no importance weighting.
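A sketch of the residual computation at the center of the algorithm, assuming a PyTorch value network `v_net(s)` and a function `log_pi(s, a)` returning $\log\pi_{\boldsymbol{\theta}}(a|s)$ (both illustrative names); squaring this quantity and calling `backward()` yields the coupled updates above.

```python
import torch

def path_residual(v_net, log_pi, states, actions, rewards, alpha=0.1, gamma=0.99):
    """Path consistency residual R over one d-step segment.

    states has length d+1; actions and rewards have length d.
    """
    d = len(rewards)
    discounts = gamma ** torch.arange(d, dtype=torch.float32)
    # Entropy-adjusted discounted return along the observed path.
    inner = torch.stack([rewards[t] - alpha * log_pi(states[t], actions[t])
                         for t in range(d)])
    path_term = (discounts * inner).sum()
    return v_net(states[0]) - gamma**d * v_net(states[d]) - path_term

# Training step: (0.5 * path_residual(...) ** 2).backward() populates gradients
# for the value and policy parameters simultaneously.
```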
Algorithm {prf:ref}`pcl` uses separate networks for policy and value. But we can use a single Q-network $q_{\boldsymbol{\rho}}$ from which both quantities are derived through the soft-max relations: the value $v_{\boldsymbol{\rho}}(s) = \alpha \log \int \exp(q_{\boldsymbol{\rho}}(s,a)/\alpha)\, da$ and the policy $\pi_{\boldsymbol{\rho}}(a|s) = \exp\big((q_{\boldsymbol{\rho}}(s,a) - v_{\boldsymbol{\rho}}(s))/\alpha\big)$.
The path residual becomes:

$$R(s_0{:}s_d; \boldsymbol{\rho}) = v_{\boldsymbol{\rho}}(s_0) - \gamma^d v_{\boldsymbol{\rho}}(s_d) - \sum_{t=0}^{d-1} \gamma^t\big[r_t - \alpha\log\pi_{\boldsymbol{\rho}}(a_t|s_t)\big],$$
and the gradient combines both value and policy contributions through the same parameters. This unified architecture eliminates the actor-critic separation: one Q-network serves both roles.
**Single-step case ($d = 1$):** the residual reduces to the soft temporal-difference error $v(s_0) - \gamma v(s_1) - [r_0 - \alpha\log\pi(a_0|s_0)]$, recovering one-step soft TD learning.

**No entropy ($\alpha \to 0$):** the soft-max collapses to the hard max, the identity {eq}`eq:v-q-exact-boltzmann` survives only at optimal actions, and the path constraint loses its off-policy validity.

**Multi-step with hard-max:** No analog exists. The hard-max Bellman operator admits no counterpart of equation {eq}`eq:v-q-exact-boltzmann`. Multi-step telescoping would accumulate errors from the max operator, making the constraint valid only in expectation under the optimal policy. The soft-max structure enables exact off-policy path consistency.
Both methods solve entropy-regularized MDPs but use fundamentally different solution strategies:
| Aspect | SAC | PCL |
|---|---|---|
| Solution method | Successive approximation: compute targets, fit, repeat | Residual minimization: minimize squared path residual directly |
| Update structure | Target computation + regression step | Single gradient step on squared residual |
| Target networks | Required (mark outer-iteration boundaries) | None (residual constraint, not target fitting) |
| Temporal horizon | Single-step TD: bootstrap from the next transition | Multi-step paths: accumulate over $d$-step segments |
| Off-policy handling | Replay buffer with single-sample bias | No importance sampling (works for any trajectory) |
| Dynamics requirement | General stochastic transitions | Deterministic transitions $s' = f(s,a)$ |
| Architecture | Twin Q-networks + policy network | Single Q-network (unified parameterization) |
PCL requires deterministic dynamics. It gains multi-step telescoping and off-policy learning without importance weights, but only for deterministic systems (robotic manipulation, many control tasks). SAC works for general stochastic MDPs.
PCL amortizes at a different level than DDPG/TD3/SAC. Those methods amortize the action maximization: learn a policy network that outputs the (approximate) maximizing action so that no optimization runs at decision time. PCL amortizes the Bellman equation itself: a single set of parameters encodes value and policy jointly, and training enforces consistency along whole paths rather than pointwise maximization.
SAC and PCL both learn policies that approximate the Boltzmann distribution $\pi^*(a|s) \propto \exp(\beta \cdot q^*(s,a))$ induced by entropy regularization. This amortization allows fast action selection at deployment: a single forward pass through the policy network. An alternative approach forgoes learning entirely and instead performs optimization at every decision.
Model Predictive Path Integral control (MPPI) {cite:p}`williams2017mppi` uses the Boltzmann weighting directly for action sequence selection. Given a dynamics model $s_{t+1} = f(s_t, a_t)$ and a cost function $c(s,a)$, MPPI samples $K$ action sequences around a nominal plan, rolls each out through the model, and accumulates its total cost $C^{(i)}$. It then weights the sequences by their exponentiated costs:

$$w^{(i)} = \frac{\exp(-C^{(i)}/\lambda)}{\sum_{j=1}^K \exp(-C^{(j)}/\lambda)},$$

where $\lambda$ is a temperature controlling how sharply the weights concentrate on low-cost rollouts, and executes the weighted average of the first actions:

$$a_0^* = \sum_{i=1}^K w^{(i)} a_0^{(i)},$$

where $a_0^{(i)}$ is the first action of the $i$-th sampled sequence.
```{prf:algorithm} Model Predictive Path Integral Control (MPPI)
:label: mppi
**Input:** Dynamics model $s_{t+1} = f(s_t, a_t)$, cost function $c(s,a)$, horizon $H$, number of samples $K$, temperature $\lambda$, noise distribution $\epsilon \sim \mathcal{N}(0, \Sigma)$
**Output:** Action $a_0^*$
1. Observe current state $s_0$
2. **for** $i = 1, \ldots, K$ **do**
1. Sample action sequence: $a_t^{(i)} \leftarrow \bar{a}_t + \epsilon_t^{(i)}$ for $t = 0, \ldots, H-1$ $\quad$ // Perturb nominal
2. Roll out: $s_{t+1}^{(i)} \leftarrow f(s_t^{(i)}, a_t^{(i)})$ for $t = 0, \ldots, H-1$
3. Compute cost: $C^{(i)} \leftarrow \sum_{t=0}^{H-1} c(s_t^{(i)}, a_t^{(i)})$
3. Compute Boltzmann weights: $w^{(i)} \leftarrow \exp(-C^{(i)}/\lambda) / \sum_j \exp(-C^{(j)}/\lambda)$
4. **return** $a_0^* = \sum_{i=1}^K w^{(i)} a_0^{(i)}$
```
The algorithm samples perturbed action sequences around a nominal trajectory $\bar{a}_0, \ldots, \bar{a}_{H-1}$, evaluates each by rolling it out through the model, and returns the Boltzmann-weighted average of the first actions. In practice, the nominal sequence is warm-started from the previous time step's solution (shifted by one), so consecutive solves reuse computation.
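A compact sketch of one MPPI decision in NumPy, assuming callables `f(s, a)` for the deterministic model and `cost(s, a)` for the per-step cost (function names and defaults are illustrative):

```python
import numpy as np

def mppi_step(f, cost, s0, a_nom, K=256, lam=1.0, noise_std=0.5):
    """One MPPI decision: sample, roll out, Boltzmann-weight, average.

    a_nom: nominal action sequence of shape (H, action_dim).
    Returns the action to execute and a shifted warm start for the next call.
    """
    H, adim = a_nom.shape
    A = a_nom[None] + noise_std * np.random.randn(K, H, adim)  # K perturbed plans
    C = np.zeros(K)
    for i in range(K):                       # roll each plan out through the model
        s = s0
        for t in range(H):
            C[i] += cost(s, A[i, t])
            s = f(s, A[i, t])
    w = np.exp(-(C - C.min()) / lam)         # subtract min for numerical stability
    w /= w.sum()
    a_new = np.tensordot(w, A, axes=1)       # Boltzmann-weighted average sequence
    return a_new[0], np.roll(a_new, -1, axis=0)   # execute first action; shift rest
```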
The contrast between MPPI and the methods in this chapter illuminates what amortization provides. SAC learns a policy once, paying the optimization cost up front during training; at deployment, each decision is a single forward pass.
MPPI performs full optimization at every decision. At each state, it samples action sequences, weights them by exponentiated costs, and returns the weighted average. No learning occurs. The policy is implicitly defined by the optimization procedure itself.
This trade-off has practical consequences:
| Aspect | Amortized (SAC, PCL) | Non-Amortized (MPPI) |
|---|---|---|
| Action selection | Single forward pass | $K$ model rollouts of horizon $H$ per decision |
| Generalization | Policy generalizes across states | Optimization from scratch at each state |
| Model requirement | None (SAC) or deterministic (PCL) | Accurate dynamics model |
| Approximation error | Policy network approximation | None (exact optimization) |
| Adaptability | Requires retraining for new tasks | Adapts immediately to new cost functions |
MPPI excels at real-time control for systems with fast, accurate models (robotics, autonomous vehicles). The replanning handles model errors and disturbances without retraining. However, the per-step computation ($K$ rollouts of horizon $H$ through the dynamics model at every decision) can be prohibitive without parallel hardware, and performance degrades when the model is inaccurate over long horizons.
The entropy regularization that connects SAC, PCL, and MPPI is not coincidental. All three methods solve variants of the soft Bellman equation. SAC and PCL amortize the solution by learning value functions and policies. MPPI solves it directly through sampling. The Boltzmann weighting emerges in all cases as the optimal policy structure under entropy regularization.
The methods developed in this chapter all parameterize policies, but they remain rooted in the Bellman equation. NFQCA, DDPG, TD3, and SAC learn Q-functions through successive approximation, then derive policies by maximizing these Q-functions. PCL minimizes a path residual derived from the soft Bellman equation. The policy serves as an amortized optimizer for a value-based objective.
There is a different approach, developed in computational economics {cite:p}`Judd1992,Rust1996,ndp`, that also parameterizes policies but solves an entirely different functional equation. Consider a control problem with continuous states and actions, deterministic dynamics $s' = f(s,a)$, and a differentiable reward $r(s,a)$. Differentiating the Bellman equation with respect to the action along an optimal trajectory yields the first-order condition:
$$ \frac{\partial r(s,a)}{\partial a}\Big|_{a=\pi^*(s)} + \gamma\, \frac{\partial v^*(s')}{\partial s'}\Big|_{s'=f(s,\pi^*(s))}\, \frac{\partial f(s,a)}{\partial a}\Big|_{a=\pi^*(s)} = 0 $$
This Euler equation expresses optimality through derivatives rather than through the max operator. For problems with special structure (the Euler class, where dynamics are affine in the controlled state), envelope theorems eliminate the unknown value derivative $\partial v^*/\partial s'$, leaving a functional equation in the policy alone.
With a parameterized policy $\pi_{\boldsymbol{w}}$, the Euler equation becomes a residual function of the state, and training seeks parameters $\boldsymbol{w}$ that drive this residual to zero at a collection of representative states.
This is root-finding, not fixed-point iteration. Newton-type methods replace the successive approximation of fitted Q-iteration. The Euler operator is not a contraction, so convergence guarantees are problem-dependent.
What does this mean for reinforcement learning? The Euler approach shares the amortization idea: learn a policy network that directly outputs actions. But the training objective comes from first-order optimality conditions rather than from Bellman residuals or Q-function maximization. This raises questions worth considering. Could Euler-style objectives provide useful training signals for actor-critic methods? When dynamics are known or learned, could first-order conditions offer advantages over value-based objectives? The connection between these traditions remains underexplored.
When actions are continuous, computing $\max_a q(s,a;\boldsymbol{\theta})$ at every Bellman backup becomes a numerical optimization problem with no closed form. This chapter developed amortization as the common answer: invest computation in a policy network that approximates the maximizer (or the optimal stochastic policy), then reuse it for fast action selection.
NFQCA, DDPG, TD3, and SAC use successive approximation: compute Bellman targets, fit Q-function to targets, update policy to maximize Q-function. Deterministic policies (DDPG, TD3) require external exploration noise; stochastic policies (SAC) explore through their inherent randomness. TD3 and SAC use twin Q-networks with minimum aggregation to counteract overestimation.
MPPI forgoes learning entirely, performing Boltzmann-weighted optimization at every decision. This avoids policy approximation error but requires an accurate dynamics model and substantial per-decision computation.
The next chapter takes a different approach: parameterize the policy directly and optimize via gradient ascent on expected return. The starting point is derivative estimation for stochastic optimization rather than Bellman equations, though value functions return as variance reduction tools in actor-critic methods.