---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---
The previous chapter established the theoretical foundations of simulation-based approximate dynamic programming: Monte Carlo integration for evaluating expectations, Q-functions for efficient action selection, and techniques for mitigating overestimation bias. Those developments assumed we could sample freely from transition distributions and choose optimization parameters without constraint. This chapter develops a unified framework for fitted Q-iteration algorithms that spans both offline and online settings. We begin with batch algorithms that learn from fixed datasets, then show how the same template generates online methods like DQN through systematic variations in data collection, optimization strategy, and function approximation.
All FQI methods share the same two-level structure built on three core ingredients: a buffer $\mathcal{B}_t$ of transitions inducing an empirical distribution $\hat{P}_{\mathcal{B}_t}$, a target function $g(\cdot; \boldsymbol{\theta})$ that converts transitions into regression targets, and a loss function $\ell$ together with an optimization budget. The fit operation minimizes the regression loss using the chosen approximator within that budget. The buffer either stays fixed (offline) or evolves as new transitions are collected (online).
```{admonition} Two empirical distributions
:class: dropdown
We maintain a careful distinction between two empirical distributions throughout:
- **Buffer distribution** $\hat{P}_{\mathcal{B}_t}$ over transitions $\tau = (s, a, r, s')$: The empirical distribution induced by the replay buffer $\mathcal{B}_t$, which contains raw experience tuples. This is fixed (offline) or evolves via online collection (adding new transitions, dropping old ones).
- **Regression distribution** $\hat{P}_t^{\text{fit}}$ over pairs $((s,a), y)$: The empirical distribution over supervised learning targets. This changes every outer iteration $n$ as we recompute targets using the current target function $g(\cdot; \boldsymbol{\theta}_n)$.
The relationship: at iteration $n$, we construct $\hat{P}_n^{\text{fit}}$ from $\hat{P}_{\mathcal{B}_n}$ by applying the target function:
$$
\hat{P}_n^{\text{fit}} = (\mathrm{id}, g(\cdot; \boldsymbol{\theta}_n))_\# \hat{P}_{\mathcal{B}_n}
$$
where $g(s,a,r,s'; \boldsymbol{\theta}_n) = r + \gamma \max_{a'} q(s', a'; \boldsymbol{\theta}_n)$ or uses the smooth logsumexp operator.
This distinction matters pedagogically: the **buffer distribution** $\hat{P}_{\mathcal{B}_t}$ is fixed (offline) or evolves via online collection, while the **regression distribution** $\hat{P}_t^{\text{fit}}$ evolves via target recomputation. Fitted Q-iteration is the outer loop over target recomputation, not the inner loop over gradient steps.
```
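To make the pushforward concrete, here is a minimal NumPy sketch (the toy MDP sizes and all names are hypothetical) that applies the hard-max target function to a buffer of transitions, producing the regression dataset whose empirical law is $\hat{P}_n^{\text{fit}}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 5 states, 3 actions, tabular Q stored as an array.
n_states, n_actions, gamma = 5, 3, 0.9
Q = rng.normal(size=(n_states, n_actions))   # current Q-function q(s, a; theta_n)

# Buffer B_t: a list of raw transitions tau = (s, a, r, s').
buffer = [(rng.integers(n_states), rng.integers(n_actions),
           rng.normal(), rng.integers(n_states)) for _ in range(8)]

def target_fn(s, a, r, s_next, Q):
    """Hard-max target g(s, a, r, s'; theta) = r + gamma * max_a' q(s', a')."""
    return r + gamma * Q[s_next].max()

# Pushforward: the regression distribution places mass 1/|B| on each
# pair ((s, a), y) with y = g(tau; theta_n).
regression_data = [((s, a), target_fn(s, a, r, sn, Q)) for (s, a, r, sn) in buffer]
for sa, y in regression_data:
    print(sa, round(float(y), 3))
```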
This template provides a blueprint for instantiating concrete algorithms. Six design axes generate algorithmic diversity: the function approximator (trees, neural networks, linear models), the Bellman operator (hard max vs smooth logsumexp, discussed in the regularized MDP chapter), the inner optimization strategy (full convergence, a fixed number of steps $K$, or a single step), the initialization scheme (cold or warm start), the data regime (offline, replay, or fully online), and the bias-correction mechanism (none, Double Q):
| Algorithm | Approximator | Bellman | Inner Loop | Initialization | Data | Bias Fix |
|---|---|---|---|---|---|---|
| FQI {cite}`ernst2005tree` | Extra Trees | Hard | Full | Cold | Offline | None |
| NFQI {cite}`riedmiller2005neural` | Neural Net | Hard | Full | Warm | Offline | None |
| Q-learning {cite}`watkins1989learning` | Any | Hard | K=1 | Warm | Online | None |
| DQN {cite}`mnih2013atari` | Deep NN | Hard | K=1 | Warm | Replay | None |
| Double DQN {cite}`van2016deep` | Deep NN | Hard | K=1 | Warm | Replay | Double Q |
| Soft Q {cite}`haarnoja2017reinforcement` | Neural Net | Smooth | K steps | Warm | Replay | None |
This table omits continuous action methods (NFQCA, DDPG, SAC), which introduce an additional design dimension. We address those in the continuous action chapter. The initialization choice becomes particularly important when moving from batch to online algorithms.
The exact Bellman operator involves expectations under the true transition law (combined with the behavior policy), which we denote abstractly by $P$. In fitted Q-iteration we never see $P$; we only observe a finite set of sampled transitions collected in the buffer, whose empirical distribution is

$$
\hat{P}_{\mathcal{B}_t} = \frac{1}{|\mathcal{B}_t|} \sum_{\tau \in \mathcal{B}_t} \delta_{\tau},
$$

where $\delta_\tau$ denotes the Dirac measure at transition $\tau$. For any integrand $f$,

$$
\mathbb{E}_{\tau \sim \hat{P}_{\mathcal{B}_t}}[f(\tau)] = \frac{1}{|\mathcal{B}_t|} \sum_{\tau \in \mathcal{B}_t} f(\tau).
$$

This is exactly the sample average estimator from the Monte Carlo chapter, applied now to transitions. Conceptually, fitted Q-iteration performs plug-in approximate dynamic programming. The plug-in principle is a general approach from statistics: when an algorithm requires an unknown population quantity, substitute its sample-based estimator. Here, we replace the unknown transition law $P$ with the empirical law $\hat{P}_{\mathcal{B}_t}$.
From a computational viewpoint, we could describe all of this using sample averages and mini-batch gradients. The empirical distribution notation provides three benefits. First, it unifies offline and online algorithms: both perform value iteration under an empirical law $\hat{P}_{\mathcal{B}_t}$, differing only in whether $\mathcal{B}_t$ remains fixed or evolves. Second, it shows that methods like DQN perform stochastic optimization of a sample average approximation objective $\mathbb{E}_{\tau \sim \hat{P}_{\mathcal{B}_t}}[\ell(\cdot)]$, not some ad hoc non-stationary procedure. Third, it cleanly separates two sources of approximation that we will examine shortly: statistical bootstrap (resampling from $\hat{P}_{\mathcal{B}_t}$) versus temporal-difference bootstrap (using estimated values in targets).
```{admonition} The empirical distribution as a probability measure
:class: dropdown
The empirical distribution $\hat{P}_{\mathcal{B}_t} = \frac{1}{|\mathcal{B}_t|}\sum_{\tau \in \mathcal{B}_t} \delta_{\tau}$ is a discrete probability measure over $|\mathcal{B}_t|$ points regardless of whether state and action spaces are continuous or discrete. For any measurable set $A \subseteq \mathcal{S} \times \mathcal{A} \times \mathbb{R} \times \mathcal{S}$:
$$
\hat{P}_{\mathcal{B}_t}(A) = \frac{1}{|\mathcal{B}_t|}\sum_{\tau \in \mathcal{B}_t} \mathbb{1}[\tau \in A] = \frac{|\{\tau \in \mathcal{B}_t : \tau \in A\}|}{|\mathcal{B}_t|}
$$
The empirical distribution assigns probability $1/|\mathcal{B}_t|$ to each observed tuple and zero elsewhere.
```
Fitted Q-iteration is built around three ingredients at any time $t$:
- A replay buffer $\mathcal{B}_t$ containing transitions, inducing an empirical distribution $\hat{P}_{\mathcal{B}_t}$
- A target function $g(s,a,r,s'; \boldsymbol{\theta}) : \mathcal{S} \times \mathcal{A} \times \mathbb{R} \times \mathcal{S} \times \Theta \to \mathbb{R}$ that computes regression targets for individual transitions. Unlike the Bellman operator $\mathcal{T}$, which acts on function spaces, $g$ operates on transition tuples, with parameters $\boldsymbol{\theta}$ (after the semicolon) specifying the current Q-function. Typically $g(s,a,r,s'; \boldsymbol{\theta}) = r + \gamma \max_{a'} q(s',a'; \boldsymbol{\theta})$ (hard max) or uses the smooth logsumexp operator.
- A loss function $\ell$ and an optimization budget (replay ratio, number of gradient steps)
Pushing transitions through the target function transforms $\hat{P}_{\mathcal{B}_t}$ into the regression distribution $\hat{P}_t^{\text{fit}}$ over pairs $((s,a), y)$.
Different algorithms correspond to different ways of evolving the buffer $\mathcal{B}_t$:

- **Offline FQI.** We start from a fixed dataset $\mathcal{D}$ and never collect new data. The buffer is constant, $\mathcal{B}_t \equiv \mathcal{D}$ and $\hat{P}_{\mathcal{B}_t} \equiv \hat{P}_{\mathcal{D}}$, and only the target function $g(\cdot; \boldsymbol{\theta}_t)$ changes as the Q-function evolves.
- **Replay (DQN-style).** The buffer is a circular buffer of fixed capacity $B$. At each interaction step we append the new transition and, if the buffer is full, drop the oldest one: $\mathcal{B}_t = \{\tau_{t-B+1}, \ldots, \tau_t\}$ and $\hat{P}_{\mathcal{B}_t} = \frac{1}{|\mathcal{B}_t|} \sum_{\tau \in \mathcal{B}_t} \delta_{\tau}$. The empirical distribution slides forward through time, but at each update we still sample uniformly from the current buffer.
- **Fully online Q-learning.** Tabular Q-learning is the degenerate case with buffer size $B=1$: we only keep the most recent transition. Then $\hat{P}_{\mathcal{B}_t}$ is supported on a single point and each update uses that one sample once.
The replay ratio (optimization steps per environment transition) quantifies data reuse:

$$
\text{replay ratio} = \frac{\text{number of gradient steps}}{\text{number of environment transitions}},
$$

where a ratio of 1 means one gradient step per collected transition and larger ratios mean each stored transition is resampled many times.
```{admonition} Two kinds of bootstrapping
:class: dropdown
This perspective separates two different "bootstraps" that appear in FQI:
1. **Statistical bootstrap over data.** By sampling with replacement from $\hat{P}_{\mathcal{B}_t}$, we approximate expectations under the (unknown) transition distribution $P$ with expectations under the empirical distribution $\hat{P}_{\mathcal{B}_t}$. This is identical to bootstrap resampling in statistics. Mini-batch training is exactly this: we treat the observed transitions as if they were the entire population and approximate expectations by repeatedly resampling from the empirical law. The replay ratio controls how many such bootstrap samples we take per environment interaction.
2. **Temporal-difference bootstrap over values.** When we compute $y = r + \gamma \max_{a'} q(s',a';\boldsymbol{\theta})$, we replace the unknown continuation value by our current estimate. This is the TD notion of bootstrapping and the source of maximization bias studied in the previous chapter.
FQI, DQN, and their variants combine both: the empirical distribution $\hat{P}_{\mathcal{B}_t}$ encodes how we reuse data (statistical bootstrap), and the target function $g$ encodes how we bootstrap values (TD bootstrap). Most bias-correction techniques (Keane–Wolpin, Double Q-learning, Gumbel losses) modify the second kind of bootstrapping while leaving the statistical bootstrap unchanged.
```
Every algorithm in this chapter minimizes an empirical risk of the form $\mathbb{E}_{((s,a),y)\sim \hat{P}_t^{\text{fit}}} [\ell(q(s,a;\boldsymbol{\theta}), y)]$, where expectations are computed via the sample average estimator over the buffer. Algorithmic diversity arises from choices of buffer evolution, target function, loss, and optimization schedule.
We begin with the simplest case from the buffer perspective. We are given a fixed transition dataset $\mathcal{D} = \{(s_i,a_i,r_i,s_i')\}_{i=1}^N$ and never collect new data. The replay buffer is frozen:

$$
\mathcal{B}_t \equiv \mathcal{D} \quad \text{and} \quad \hat{P}_{\mathcal{B}_t} \equiv \hat{P}_{\mathcal{D}} \quad \text{for all } t,
$$

so the only thing that changes across iterations is the target function $g(\cdot; \boldsymbol{\theta}_n)$ induced by the current Q-function. Every outer iteration of FQI samples from the same empirical distribution $\hat{P}_{\mathcal{D}}$ but pushes it through a new target function, producing a new regression distribution $\hat{P}_n^{\text{fit}}$.
At each outer iteration $n$, we recompute the targets for every transition in $\mathcal{D}$ using the current parameters $\boldsymbol{\theta}_n$ and then refit the Q-function to these targets. The following algorithm is simply approximate value iteration where expectations under the transition kernel $P$ are replaced by sample averages over the dataset $\mathcal{D}$.
```{prf:algorithm} Fitted Q-Iteration (Batch)
:label: fitted-q-iteration-batch
**Input:** Dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^N$, function approximator $q(s,a; \boldsymbol{\theta})$, discount factor $\gamma$, maximum iterations $n_{\max}$
**Output:** Learned Q-function parameters $\boldsymbol{\theta}$
1. Initialize $\boldsymbol{\theta}_0$
2. $n \leftarrow 0$
3. **repeat** $\quad$ // **Outer loop: Value Iteration**
4. $\quad$ **// Construct regression dataset with Bellman targets**
5. $\quad$ $X^{(n)} \leftarrow \{(s_i, a_i) : (s_i, a_i, r_i, s'_i) \in \mathcal{D}\}$
6. $\quad$ **for** each $(s_i, a_i, r_i, s'_i) \in \mathcal{D}$ **do**
7. $\quad\quad$ $y_i^{(n)} \leftarrow r_i + \gamma \max_{a' \in \mathcal{A}} q(s'_i, a'; \boldsymbol{\theta}_n)$
8. $\quad$ **end for**
9. $\quad$ **// Inner loop: Fit Q-function to targets (projection step)**
10. $\quad$ $\boldsymbol{\theta}_{n+1} \leftarrow \texttt{fit}(X^{(n)}, y^{(n)}, \boldsymbol{\theta}_{\text{init}}, K)$
11. $\quad$ $n \leftarrow n+1$
12. **until** convergence or $n \geq n_{\max}$
13. **return** $\boldsymbol{\theta}_n$
```
The fit operation in line 10 abstracts the inner optimization loop that minimizes the regression loss $\frac{1}{N}\sum_{i=1}^{N} \ell\big(q(s_i, a_i; \boldsymbol{\theta}), y_i^{(n)}\big)$ over $\boldsymbol{\theta}$, starting from $\boldsymbol{\theta}_{\text{init}}$ with budget $K$.
Fitted Q-Iteration (FQI): Ernst et al. {cite}`ernst2005tree` instantiate this template with extremely randomized trees (extra trees), an ensemble method that partitions the state-action space into regions with piecewise constant Q-values. The fit operation trains the ensemble until completion using the tree construction algorithm. Trees handle high-dimensional inputs naturally and the ensemble reduces overfitting. FQI uses cold start initialization: the trees are rebuilt from scratch at every outer iteration rather than warm-started from the previous fit.
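The following sketch runs the batch template with an extra-trees fit operation, in the spirit of Ernst et al. but on a made-up one-dimensional dataset (the toy dynamics, rewards, and all hyperparameters are illustrative, not the paper's):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(1)
gamma, n_actions, n_iters = 0.95, 2, 10

# Toy offline dataset D = {(s_i, a_i, r_i, s'_i)} with 1-d states.
N = 500
S = rng.uniform(-1, 1, size=(N, 1))
A = rng.integers(n_actions, size=N)
R = -np.abs(S[:, 0]) + 0.1 * (A == 0)        # arbitrary reward, for illustration
S_next = np.clip(S + rng.normal(0, 0.1, size=(N, 1)), -1, 1)

X = np.column_stack([S, A])                   # regression inputs (s_i, a_i)
q_model = None                                # iteration 0: targets are just rewards

for n in range(n_iters):
    if q_model is None:
        y = R.copy()
    else:
        # Bellman targets: y_i = r_i + gamma * max_a' q(s'_i, a'; theta_n)
        q_next = np.column_stack([
            q_model.predict(np.column_stack([S_next, np.full(N, a)]))
            for a in range(n_actions)])
        y = R + gamma * q_next.max(axis=1)
    # Cold start: a fresh ensemble is built from scratch each outer iteration.
    q_model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)

print("Q(0.0, a):", [float(q_model.predict([[0.0, a]])[0]) for a in range(n_actions)])
```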
Neural Fitted Q-Iteration (NFQI): Riedmiller {cite}`riedmiller2005neural` replaces the tree ensemble with a neural network. The fit operation runs gradient-based optimization (RProp, chosen for its insensitivity to hyperparameter choices) to convergence: the network is trained until the loss stops decreasing, over multiple epochs through the full dataset. NFQI uses warm start initialization, reusing the weights from the previous outer iteration.
```{admonition} NFQI targets for episodic tasks
:class: dropdown
For episodic tasks with goal states $S^+$ and forbidden states $S^-$, Riedmiller modifies the target structure:
$$
y_i^{(n)} = \begin{cases}
c(s_i, a_i, s'_i) & \text{if } s'_i \in S^+ \text{ (goal reached)} \\
C^- & \text{if } s'_i \in S^- \text{ (forbidden state, typically } C^- = 1.0\text{)} \\
c(s_i, a_i, s'_i) + \gamma \max_{a'} q(s'_i, a'; \boldsymbol{\theta}_n) & \text{otherwise}
\end{cases}
$$
where $c(s, a, s')$ is the immediate cost. Goal states have zero future cost (no bootstrapping), forbidden states have high penalty, and regular states use the standard Bellman backup. Additionally, the **hint-to-goal heuristic** adds synthetic transitions $(s, a, s')$ where $s \in S^+$ with target value $c(s,a,s') = 0$ to explicitly clamp the Q-function to zero in the goal region. This stabilizes learning by building the boundary condition directly into the training data.
```
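A minimal sketch of this case structure for a single transition (the function name and the goal/forbidden flags are hypothetical; in practice the flags come from membership tests against $S^+$ and $S^-$):

```python
def nfqi_target(c, q_next_best, is_goal, is_forbidden, gamma=0.95, C_minus=1.0):
    """Riedmiller-style target for one transition.

    c           -- immediate cost c(s, a, s')
    q_next_best -- max_a' q(s', a'; theta_n), the bootstrapped continuation
    """
    if is_goal:                        # s' in S+: zero future cost, no bootstrapping
        return c
    if is_forbidden:                   # s' in S-: fixed high penalty
        return C_minus
    return c + gamma * q_next_best     # regular Bellman backup

# One example of each case.
print(nfqi_target(0.05, 0.30, is_goal=True,  is_forbidden=False))   # 0.05
print(nfqi_target(0.05, 0.30, is_goal=False, is_forbidden=True))    # 1.0
print(nfqi_target(0.05, 0.30, is_goal=False, is_forbidden=False))   # 0.335
```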
Fitted Q-iteration has an inherently nested structure: an outer loop performs approximate value iteration by computing Bellman targets, and an inner loop performs regression by fitting the function approximator to those targets. This nested structure shows that FQI is approximate dynamic programming with function approximation, distinct from supervised learning with changing targets.
When the inner loop uses gradient descent for $K$ steps, the fit operation unrolls into an explicit parameter trajectory. This is a sequence of updates $\boldsymbol{\theta}_n^{(k+1)} = \boldsymbol{\theta}_n^{(k)} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_n^{(k)}; \mathcal{D}_n^{\text{fit}})$ for $k = 0, \ldots, K-1$, starting from $\boldsymbol{\theta}_n^{(0)} = \boldsymbol{\theta}_n$ and ending with $\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n^{(K)}$.
In the flattened form, the parameters used for computing targets are called the target network $\boldsymbol{\theta}_{\text{target}}$; they stay frozen while an outer iteration's targets are consumed. In contrast, the parameters being actively trained via gradient descent at each step are called the online network $\boldsymbol{\theta}_t$.
The online/target terminology is standard in the deep RL community. Many modern algorithms, especially those that collect data online like DQN, are presented in flattened form. This can make them appear different from batch methods when they are the same template with different design choices.
Tree ensemble methods like random forests or extra trees have no continuous parameter space and no gradient-based optimization. The fit operation builds the entire tree structure in one pass. There is no sequence of incremental updates to unfold into a single loop. Ernst's FQI {cite}`ernst2005tree` retains the explicit nested structure with cold start initialization at each outer iteration, while neural methods can be flattened.
To see how flattening works, we first make the nested structure completely explicit by expanding the fit operation to show the inner gradient descent loop. In terms of the buffer notation, the inner loop approximately minimizes the empirical risk

$$
\mathcal{L}(\boldsymbol{\theta}; \mathcal{D}_n^{\text{fit}}) = \mathbb{E}_{((s,a),y) \sim \hat{P}_n^{\text{fit}}}\big[\ell(q(s,a;\boldsymbol{\theta}), y)\big]
$$

induced by the fixed buffer. Starting from the batch template (Algorithm {prf:ref}`fitted-q-iteration-batch`), we replace the abstract fit call with explicit gradient updates:
```{prf:algorithm} NFQI with Explicit Inner Loop
:label: nfqi-explicit-inner
**Input**: MDP $(S, A, P, R, \gamma)$, offline transition dataset $\mathcal{D}$, neural network $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, inner optimization steps $K$, initialization $\boldsymbol{\theta}_0$
**Output**: Parameters $\boldsymbol{\theta}$ for Q-function approximation
1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $n \leftarrow 0$ $\quad$ // **Outer loop counter (value iteration)**
3. **repeat**
1. **// Compute Bellman targets using current Q-function**
2. $\mathcal{D}_n^{\text{fit}} \leftarrow \emptyset$
3. **for** each $(s,a,r,s') \in \mathcal{D}$ **do**
1. $y_{s,a} \leftarrow r + \gamma \max_{a' \in A} q(s',a'; \boldsymbol{\theta}_n)$
2. $\mathcal{D}_n^{\text{fit}} \leftarrow \mathcal{D}_n^{\text{fit}} \cup \{((s,a), y_{s,a})\}$
4. **// Inner optimization loop: fit to targets via gradient descent**
5. $\boldsymbol{\theta}_n^{(0)} \leftarrow \boldsymbol{\theta}_n$ $\quad$ // Warm start from previous outer iteration
6. $k \leftarrow 0$ $\quad$ // **Inner loop counter (regression)**
7. **repeat**
1. $\boldsymbol{\theta}_n^{(k+1)} \leftarrow \boldsymbol{\theta}_n^{(k)} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_n^{(k)}; \mathcal{D}_n^{\text{fit}})$
2. $k \leftarrow k + 1$
8. **until** $k = K$ $\quad$ // Partial optimization: exactly K gradient steps
9. $\boldsymbol{\theta}_{n+1} \leftarrow \boldsymbol{\theta}_n^{(K)}$
10. $n \leftarrow n + 1$
4. **until** training complete
5. **return** $\boldsymbol{\theta}_n$
```
This makes the two-level structure completely transparent. The outer loop (indexed by $n$) recomputes the Bellman targets; the inner loop (indexed by $k$) performs $K$ gradient steps on the resulting regression problem. The notation $\boldsymbol{\theta}_n^{(k)}$ tracks both levels: $n$ counts value-iteration steps and $k$ counts gradient steps within the current regression.
We can now flatten the nested structure by treating all gradient steps uniformly and using a global step counter $t = nK + k$.
This transformation is purely algebraic. No algorithmic behavior changes, only the presentation:
```{prf:algorithm} Flattened NFQI with Target Network
:label: nfqi-flattened
**Input**: MDP $(S, A, P, R, \gamma)$, offline transition dataset $\mathcal{D}$, neural network $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, target update frequency $K$, initialization $\boldsymbol{\theta}_0$
**Output**: Parameters $\boldsymbol{\theta}$ for Q-function approximation
1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $\boldsymbol{\theta}_{\text{target}} \leftarrow \boldsymbol{\theta}_0$
3. $t \leftarrow 0$ $\quad$ // **Single flattened loop counter**
4. **while** training **do**
1. **// Compute targets using fixed target network**
2. $\mathcal{D}_t^{\text{fit}} \leftarrow \emptyset$
3. **for** each $(s,a,r,s') \in \mathcal{D}$ **do**
1. $y_{s,a} \leftarrow r + \gamma \max_{a' \in A} q(s',a'; \boldsymbol{\theta}_{\text{target}})$
2. $\mathcal{D}_t^{\text{fit}} \leftarrow \mathcal{D}_t^{\text{fit}} \cup \{((s,a), y_{s,a})\}$
4. **// Gradient step on online network**
5. $\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t; \mathcal{D}_t^{\text{fit}})$
6. **// Periodic target network update (marks outer iteration boundary)**
7. **if** $t \bmod K = 0$ **then**
1. $\boldsymbol{\theta}_{\text{target}} \leftarrow \boldsymbol{\theta}_t$
8. $t \leftarrow t + 1$
5. **return** $\boldsymbol{\theta}_t$
```
At step $t$, the target network holds the parameters saved at the most recent outer-iteration boundary, so all steps within a block of $K$ updates regress toward targets computed from the same frozen parameters. Flattening replaces the pair of counters $(n, k)$ with the single counter $t = nK + k$; the nested structure survives as the target network and its periodic synchronization.
The target network arises directly from flattening the nested FQI structure. When DQN is presented with a target network that updates every $K$ steps, it is implementing the outer-loop boundary of fitted Q-iteration, not an ad hoc stabilization device.
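The bookkeeping is easiest to see in a stripped-down skeleton of the flattened loop, where a counter stands in for the parameters (purely illustrative):

```python
# theta_target changes only at multiples of K, so the K gradient steps of
# outer iteration n all use the parameters frozen at its boundary.
K, T = 4, 12
theta = 0
theta_target = theta
for t in range(T):
    if t % K == 0:               # outer-iteration boundary: hard sync
        theta_target = theta
    theta += 1                   # stand-in for one gradient step on the online network
    print(f"t={t:2d}  n={t // K}  k={t % K}  target={theta_target}")
```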
An alternative to periodic hard updates is the exponential moving average (EMA) (also called Polyak averaging), which updates the target network smoothly at every step rather than synchronizing it every $K$ steps:
```{prf:algorithm} Flattened NFQI with EMA Target Updates
:label: nfqi-flattened-ema
**Input**: MDP $(S, A, P, R, \gamma)$, offline transition dataset $\mathcal{D}$, neural network $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, EMA rate $\tau \in (0, 1)$, initialization $\boldsymbol{\theta}_0$
**Output**: Parameters $\boldsymbol{\theta}$ for Q-function approximation
1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $\boldsymbol{\theta}_{\text{target}} \leftarrow \boldsymbol{\theta}_0$
3. $t \leftarrow 0$ $\quad$ // **Single flattened loop counter**
4. **while** training **do**
1. **// Compute targets using fixed target network**
2. $\mathcal{D}_t^{\text{fit}} \leftarrow \emptyset$
3. **for** each $(s,a,r,s') \in \mathcal{D}$ **do**
1. $y_{s,a} \leftarrow r + \gamma \max_{a' \in A} q(s',a'; \boldsymbol{\theta}_{\text{target}})$
2. $\mathcal{D}_t^{\text{fit}} \leftarrow \mathcal{D}_t^{\text{fit}} \cup \{((s,a), y_{s,a})\}$
4. **// Gradient step on online network**
5. $\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t; \mathcal{D}_t^{\text{fit}})$
6. **// ← CHANGED: Smooth EMA update at every step**
7. $\boldsymbol{\theta}_{\text{target}} \leftarrow \tau\boldsymbol{\theta}_{t+1} + (1-\tau)\boldsymbol{\theta}_{\text{target}}$
8. $t \leftarrow t + 1$
5. **return** $\boldsymbol{\theta}_t$
```
With EMA updates, the target network slowly tracks the online network instead of making discrete jumps. For small $\tau$, the target changes slowly and smoothly; at $\tau = 1$ it coincides with the online network. This style of update was introduced to deep RL by DDPG {cite}`lillicrap2015continuous` for continuous control and is now standard in algorithms like TD3 {cite}`fujimoto2018addressing` and SAC {cite}`haarnoja2018soft`.
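A scalar sketch contrasting hard and EMA target updates (all numbers illustrative):

```python
# EMA target: theta_target <- tau * theta + (1 - tau) * theta_target every step.
# Small tau means the target trails the online parameters smoothly.
tau, K = 0.1, 10
theta, hard_target, ema_target = 0.0, 0.0, 0.0
for t in range(1, 31):
    theta += 1.0                                   # stand-in for a gradient step
    if t % K == 0:
        hard_target = theta                        # periodic hard sync
    ema_target = tau * theta + (1 - tau) * ema_target
    if t % 5 == 0:
        print(f"t={t:2d}  online={theta:5.1f}  hard={hard_target:5.1f}  ema={ema_target:5.2f}")
```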
We now keep the same fitted Q-iteration template but allow the buffer $\mathcal{B}_t$ and its empirical distribution $\hat{P}_{\mathcal{B}_t}$ to evolve while we learn. Instead of repeatedly sampling from a fixed $\hat{P}_{\mathcal{D}}$, we collect new transitions during learning and store them in a circular replay buffer.
Note that "online" here takes on a second meaning: we collect data online (interacting with the environment) while also maintaining the online network (actively-updated parameters $\boldsymbol{\theta}t$) and target network (frozen parameters $\boldsymbol{\theta}{\text{target}}$) distinction from the flattened FQI structure. The online network now plays a dual role: it is both the parameters being trained and the policy used to collect new data for the buffer.
Deep Q-Network (DQN) instantiates the online template with moderate choices along the design axes: a deep neural network approximator, hard max targets, a single gradient step per collected transition, warm-started parameters, and experience replay. Concretely, DQN maintains a circular replay buffer of capacity $B$: each new transition is appended, and once the buffer is full the oldest transition is dropped.
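A bounded deque gives this circular-buffer behavior directly; a minimal sketch with dummy transitions (capacity and batch size are illustrative):

```python
import random
from collections import deque

capacity, batch_size = 1000, 4
buffer = deque(maxlen=capacity)          # appending beyond capacity drops the oldest

# Store some dummy transitions (s, a, r, s').
for t in range(1200):
    buffer.append((t, t % 3, float(-t), t + 1))

print(len(buffer))                       # 1000: the first 200 transitions were evicted
minibatch = random.sample(buffer, batch_size)   # uniform sampling from the current buffer
print(minibatch)
```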
DQN uses two copies of the Q-network parameters:
- The **online network** $\boldsymbol{\theta}_t$: actively updated at each gradient step (corresponds to $\boldsymbol{\theta}_n^{(k)}$ in the nested view). It serves a dual role: (1) the policy for collecting new data via $\varepsilon$-greedy action selection, and (2) the parameters being trained.
- The **target network** $\boldsymbol{\theta}_{\text{target}}$: frozen parameters updated every $K$ steps (corresponds to $\boldsymbol{\theta}_n$ in the nested view). Used only for computing Bellman targets.
As shown in the nested-to-flattened section, the target network is not a stabilization trick but the natural consequence of periodic outer-iteration boundaries in flattened FQI. The target network keeps targets fixed for $K$ consecutive gradient steps, exactly as the nested formulation holds $\boldsymbol{\theta}_n$ fixed throughout the inner loop.
```{prf:algorithm} Deep Q-Network (DQN)
:label: dqn
**Input**: MDP $(S, A, P, R, \gamma)$, neural network $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, target update frequency $K$, replay buffer capacity $B$, mini-batch size $b$, exploration rate $\varepsilon$
**Output**: Parameters $\boldsymbol{\theta}$ for Q-function approximation
1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $\boldsymbol{\theta}_{\text{target}} \leftarrow \boldsymbol{\theta}_0$
3. Initialize replay buffer $\mathcal{B}$ with capacity $B$
4. $t \leftarrow 0$
5. **while** training **do**
1. Observe current state $s$
2. **// Use online network to collect data**
3. Select action: $a \leftarrow \begin{cases} \arg\max_{a'} q(s,a';\boldsymbol{\theta}_t) & \text{with probability } 1-\varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}$
4. Execute $a$, observe reward $r$ and next state $s'$
5. Store $(s,a,r,s')$ in $\mathcal{B}$, replacing oldest if full
6. Sample mini-batch of $b$ transitions $\{(s_i,a_i,r_i,s_i')\}_{i=1}^b$ from $\mathcal{B}$
7. For each sampled transition $(s_i,a_i,r_i,s_i')$:
1. $y_i \leftarrow r_i + \gamma \color{blue}{\max_{a' \in A} q(s_i',a'; \boldsymbol{\theta}_{\text{target}})}$ (both selection AND evaluation with target network)
8. $\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} \frac{1}{b}\sum_{i=1}^b (q(s_i,a_i;\boldsymbol{\theta}_t) - y_i)^2$
9. **if** $t \bmod K = 0$ **then**
1. $\boldsymbol{\theta}_{\text{target}} \leftarrow \boldsymbol{\theta}_t$
10. $t \leftarrow t + 1$
6. **return** $\boldsymbol{\theta}_t$
```
Double DQN addresses overestimation bias by decoupling action selection from evaluation. Instead of using the target network for both purposes (as DQN does), Double DQN uses the online network (currently-training parameters $\boldsymbol{\theta}_t$) to select which action looks best, then uses the target network (frozen parameters $\boldsymbol{\theta}_{\text{target}}$) to evaluate that action:
```{prf:algorithm} Double DQN
:label: double-dqn
**Input**: MDP $(S, A, P, R, \gamma)$, neural network $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, target update frequency $K$, replay buffer capacity $B$, mini-batch size $b$, exploration rate $\varepsilon$
**Output**: Parameters $\boldsymbol{\theta}$ for Q-function approximation
1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $\boldsymbol{\theta}_{\text{target}} \leftarrow \boldsymbol{\theta}_0$
3. Initialize replay buffer $\mathcal{B}$ with capacity $B$
4. $t \leftarrow 0$
5. **while** training **do**
1. Observe current state $s$
2. **// Use online network to collect data**
3. Select action: $a \leftarrow \begin{cases} \arg\max_{a'} q(s,a';\boldsymbol{\theta}_t) & \text{with probability } 1-\varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}$
4. Execute $a$, observe reward $r$ and next state $s'$
5. Store $(s,a,r,s')$ in $\mathcal{B}$, replacing oldest if full
6. Sample mini-batch of $b$ transitions $\{(s_i,a_i,r_i,s_i')\}_{i=1}^b$ from $\mathcal{B}$
7. For each sampled transition $(s_i,a_i,r_i,s_i')$:
1. $a^*_i \leftarrow \color{red}{\arg\max_{a' \in A} q(s_i',a'; \boldsymbol{\theta}_t)}$ (action selection with online network)
2. $y_i \leftarrow r_i + \gamma \color{green}{q(s_i',a^*_i; \boldsymbol{\theta}_{\text{target}})}$ (action evaluation with target network)
8. $\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} \frac{1}{b}\sum_{i=1}^b (q(s_i,a_i;\boldsymbol{\theta}_t) - y_i)^2$
9. **if** $t \bmod K = 0$ **then**
1. $\boldsymbol{\theta}_{\text{target}} \leftarrow \boldsymbol{\theta}_t$
10. $t \leftarrow t + 1$
6. **return** $\boldsymbol{\theta}_t$
```
Double DQN leaves $\mathcal{B}_t$ and $\hat{P}_{\mathcal{B}_t}$ unchanged and modifies only the target function $g$. The two variants differ in how the next-state value is computed:
- **DQN (blue):** Uses the target network $\boldsymbol{\theta}_{\text{target}}$ for BOTH selecting which action is best AND evaluating it: $\max_{a'} q(s',a'; \boldsymbol{\theta}_{\text{target}})$ implicitly finds $a^* = \arg\max_{a'} q(s',a'; \boldsymbol{\theta}_{\text{target}})$ and then evaluates $q(s',a^*; \boldsymbol{\theta}_{\text{target}})$.
- **Double DQN (red + green):** Uses the online network $\boldsymbol{\theta}_t$ to select $a^* = \arg\max_{a'} q(s',a'; \boldsymbol{\theta}_t)$, then uses the target network $\boldsymbol{\theta}_{\text{target}}$ to evaluate $q(s',a^*; \boldsymbol{\theta}_{\text{target}})$.
This decoupling prevents overestimation bias because the noise in action selection (from the online network) is independent of the noise in action evaluation (from the target network). Recall that the online network represents the currently-training parameters being updated at each gradient step, while the target network represents the frozen parameters used for computing targets. By using different parameters for selection versus evaluation, we break the correlation that leads to systematic overestimation in the standard max operator.
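The one-line difference between the two targets, sketched in NumPy (the Q-values are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, r = 0.99, 1.0
q_online = rng.normal(size=5)    # q(s', a'; theta_t)      over 5 actions
q_target = rng.normal(size=5)    # q(s', a'; theta_target) over the same actions

# DQN: the target network both selects and evaluates.
y_dqn = r + gamma * q_target.max()

# Double DQN: the online network selects, the target network evaluates.
a_star = int(np.argmax(q_online))
y_ddqn = r + gamma * q_target[a_star]

print(f"DQN target: {y_dqn:.3f}   Double DQN target: {y_ddqn:.3f}")
```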
DQN and Double DQN use moderate settings along the design axes: a large replay buffer, mini-batch gradient steps, and periodic target synchronization. Pushing these axes to their extremes recovers classical Q-learning {cite}`watkins1989learning`: buffer capacity $B = 1$, a single gradient step per transition, and no separate target network. In the buffer perspective, Q-learning has $\mathcal{B}_t = \{\tau_t\}$: the empirical distribution is a point mass on the most recent transition, which is used once and then discarded.
Stochastic approximation is a general framework for solving equations of the form $h(\boldsymbol{\theta}) = \mathbb{E}_{\xi}[H(\boldsymbol{\theta}, \xi)] = 0$ using noisy samples $H(\boldsymbol{\theta}_t, \xi_t)$ in place of the intractable expectation, via updates $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha_t H(\boldsymbol{\theta}_t, \xi_t)$.
Q-learning fits this framework by solving the Bellman residual equation. Recall from the projection methods chapter that the Bellman equation requires the expected residual $\mathbb{E}_{s'}[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$ to vanish at every state-action pair. A single transition $(s,a,r,s')$ provides the noisy sample $\delta = r + \gamma \max_{a'} q(s',a'; \boldsymbol{\theta}) - q(s,a; \boldsymbol{\theta})$. This is the TD error. Q-learning is a stochastic approximation method for driving the expected TD error to zero, one sampled transition at a time.
The algorithm works with any function approximator that supports incremental updates. The general form applies a gradient descent step to minimize the squared TD error. The only difference between linear and nonlinear cases is how the gradient $\nabla_{\boldsymbol{\theta}} q(s,a; \boldsymbol{\theta})$ is computed.
```{prf:algorithm} Q-Learning with Function Approximation
:label: q-learning
**Input**: MDP $(S, A, P, R, \gamma)$, function approximator $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, exploration rate $\varepsilon$
**Output**: Parameters $\boldsymbol{\theta}$ for Q-function approximation
1. Initialize $\boldsymbol{\theta}_0$
2. $t \leftarrow 0$
3. **while** training **do**
1. Observe current state $s$
2. Select action: $a \leftarrow \begin{cases} \arg\max_{a' \in A} q(s,a';\boldsymbol{\theta}_t) & \text{with probability } 1-\varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}$
3. Execute $a$, observe reward $r$ and next state $s'$
4. Compute TD target: $y \leftarrow r + \gamma \max_{a' \in A} q(s',a'; \boldsymbol{\theta}_t)$
5. Update: $\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} (q(s,a;\boldsymbol{\theta}_t) - y)^2$
6. $t \leftarrow t + 1$
4. **return** $\boldsymbol{\theta}_t$
```
The gradient in line 5 depends on the choice of function approximator:
- **Tabular:** Uses one-hot features $\boldsymbol{\phi}(s,a) = \boldsymbol{e}_{(s,a)}$, giving table-lookup updates $Q(s,a) \leftarrow Q(s,a) + \alpha(r + \gamma \max_{a'} Q(s',a') - Q(s,a))$. Converges under standard stochastic approximation conditions (detailed below).
- **Linear:** $q(s,a; \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \boldsymbol{\phi}(s,a)$ where $\boldsymbol{\phi}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ is a feature map. The gradient is $\nabla_{\boldsymbol{\theta}} q(s,a; \boldsymbol{\theta}) = \boldsymbol{\phi}(s,a)$, giving the update $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha (y - \boldsymbol{\theta}_t^\top \boldsymbol{\phi}(s,a)) \boldsymbol{\phi}(s,a)$. Convergence is not guaranteed in general; the max operator combined with function approximation can cause instability (a minimal sketch of this update follows the list).
- **Nonlinear:** $q(s,a; \boldsymbol{\theta})$ is a neural network with weights $\boldsymbol{\theta}$. The gradient $\nabla_{\boldsymbol{\theta}} q(s,a; \boldsymbol{\theta})$ is computed via backpropagation. Convergence guarantees do not exist.
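A minimal sketch of the linear case (the feature map and all dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions, gamma, alpha = 8, 3, 0.99, 0.05
theta = np.zeros(d)

def phi(s, a):
    """Hypothetical feature map phi(s, a) in R^d for a scalar state."""
    f = np.zeros(d)
    f[a] = 1.0                     # action indicator
    f[n_actions + a] = s           # state-action interaction
    return f

def q(s, a, theta):
    return theta @ phi(s, a)

# One Q-learning update on a single transition (s, a, r, s').
s, a, r, s_next = 0.4, 1, 0.7, 0.5
y = r + gamma * max(q(s_next, ap, theta) for ap in range(n_actions))
theta += alpha * (y - q(s, a, theta)) * phi(s, a)   # semi-gradient TD update
print(theta.round(3))
```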
Tabular Q-learning converges under diminishing step sizes {cite}`watkins1992qlearning,jaakkola1994convergence,tsitsiklis1994asynchronous`. The standard proof technique is the ordinary differential equation (ODE) method {cite}`kushner2003stochastic,borkar2000ode`, which analyzes stochastic approximation algorithms by studying an associated deterministic dynamical system.
The ODE method connects stochastic approximation to deterministic dynamics through a two-step argument. First, we show that as step sizes shrink, the interpolated trajectory of the discrete stochastic recursion tracks the solution of a limiting ODE. Second, we show that the ODE converges to the desired fixed point, and this convergence transfers back to the stochastic iterates.
For tabular Q-learning, the update at state-action pair $(s,a)$ is

$$
Q_{t+1}(s,a) = Q_t(s,a) + \alpha_t \big(r + \gamma \max_{a'} Q_t(s',a') - Q_t(s,a)\big),
$$

where $(r, s')$ are the sampled reward and next state. Averaging the update direction over this sampling noise gives the expected drift $\mathbb{E}_{s'}[r(s,a,s') + \gamma \max_{a'} Q(s',a')] - Q(s,a)$. Under sufficient exploration (all state-action pairs visited infinitely often), this expectation is well-defined. The associated limiting ODE is

$$
\dot{Q}(s,a) = \mathbb{E}_{s'}\big[r(s,a,s') + \gamma \max_{a'} Q(s',a')\big] - Q(s,a).
$$
This is a deterministic dynamical system. The fixed points satisfy $Q(s,a) = \mathbb{E}_{s'}[r(s,a,s') + \gamma \max_{a'} Q(s',a')]$, which is the Bellman optimality equation. Convergence is established by showing this ODE has a global Lyapunov function: the distance $\|Q - Q^*\|_\infty$ to the optimal Q-function decreases along trajectories. The Bellman operator is a contraction, which ensures the ODE trajectories converge to $Q^*$.
The interaction between the stationary distribution of the behavior policy and these dynamics is handled by asynchronous stochastic approximation: each state-action pair is updated only when visited, at a rate set by how often the behavior policy reaches it, and sufficient exploration guarantees every component of the ODE is followed.
For linear Q-learning with function approximation, the ODE analysis becomes more complex. The projection methods chapter shows that non-monotone projections (like linear least squares) can fail to preserve contraction properties. The max operator in Q-learning creates additional complications that prevent general convergence guarantees, even though the algorithm may work in practice for well-chosen features.
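A small simulation consistent with the tabular result: Q-learning with diminishing step sizes on a random 3-state, 2-action MDP, compared against $Q^*$ from value iteration on the known model (all MDP parameters are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9

# Random MDP: transition tensor P[s, a, s'] and deterministic rewards R[s, a].
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.uniform(0, 1, size=(nS, nA))

# Ground truth via value iteration on the known model.
Q_star = np.zeros((nS, nA))
for _ in range(1000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

# Tabular Q-learning: epsilon-greedy exploration, step size 1/N(s,a).
Q = np.zeros((nS, nA))
visits = np.zeros((nS, nA))
s = 0
for t in range(200_000):
    a = rng.integers(nA) if rng.random() < 0.2 else int(Q[s].argmax())
    s_next = rng.choice(nS, p=P[s, a])
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]          # diminishing (Robbins-Monro) step sizes
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("max |Q - Q*| =", np.abs(Q - Q_star).max())
```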
Fix a particular outer iteration and its regression distribution $\hat{P}_t^{\text{fit}}$. The fit operation in the inner loop is then a standard statistical estimation problem: given empirical samples from $\hat{P}_t^{\text{fit}}$, estimate the regression function mapping $(s,a)$ to targets $y$. The choice of loss $\ell$ amounts to an implicit assumption about the noise in those targets.
This section examines alternative loss functions for the regression step. The standard approach uses squared error, but the noise in Bellman targets has special structure due to the max operator and bootstrapping. Two strategies have shown empirical success: Gumbel regression, which uses the proper likelihood for extreme-value noise, and classification-based methods, which avoid parametric noise assumptions by working with distributions over value bins.
Extreme value theory tells us that the maximum of Gaussian errors has Gumbel-distributed tails. If we take this distribution seriously, maximum likelihood estimation should use a Gumbel likelihood rather than a Gaussian one. Garg, Tang, Kahn, and Levine {cite}`garg2023extreme` developed this idea in Extreme Q-Learning (XQL). Instead of modeling the Bellman error as additive Gaussian noise:

$$
y = q(s,a; \boldsymbol{\theta}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2),
$$

they model it as Gumbel noise:

$$
y = q(s,a; \boldsymbol{\theta}) + \epsilon, \qquad \epsilon \sim \text{Gumbel}(0, \beta),
$$

so that the error $q - y = -\epsilon$ follows a negative Gumbel distribution. The negative Gumbel distribution arises because we are modeling errors in targets that overestimate the true value. The corresponding maximum likelihood loss is Gumbel regression:

$$
\ell_{\text{Gumbel}}(q, y) = \exp\!\left(\frac{q - y}{\beta}\right) - \frac{q - y}{\beta} - 1.
$$
The temperature parameter $\beta$ controls the asymmetry: small $\beta$ penalizes overestimation aggressively, while larger $\beta$ is milder. When $\beta \to \infty$, a second-order expansion gives $\ell_{\text{Gumbel}}(q, y) \approx (q - y)^2 / (2\beta^2)$, so Gumbel regression degenerates to (rescaled) squared error.
The Gumbel loss can be understood as the natural likelihood for problems involving max operators, just as the Gaussian is the natural likelihood for problems involving averages. The central limit theorem tells us that sums converge to Gaussians; extreme value theory tells us that maxima converge to Gumbel (for light-tailed base distributions). Squared error is optimal for Gaussian noise; Gumbel regression is optimal for Gumbel noise.
```{code-cell} ipython3
:tags: [hide-input]
# label: fig-gumbel-loss
# caption: Gumbel regression reshapes the loss landscape (left) and gradient geometry (right), penalizing overestimation exponentially while keeping underestimation gentle.
%config InlineBackend.figure_format = 'retina'
import numpy as np
import matplotlib.pyplot as plt
# Apply book style
try:
import scienceplots
plt.style.use(['science', 'notebook'])
except (ImportError, OSError):
pass # Use matplotlib defaults
# Set up the figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Define error range (q - y)
error = np.linspace(-3, 3, 500)
# Different beta values to show control over asymmetry
betas = [0.5, 1.0, 2.0]
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']
labels = [r'$\beta = 0.5$ (aggressive)', r'$\beta = 1.0$ (moderate)', r'$\beta = 2.0$ (mild)']
# Plot 1: Gumbel loss function
for beta, color, label in zip(betas, colors, labels):
    loss = np.exp(error / beta) - error / beta - 1  # Gumbel NLL, shifted so loss(0) = 0
ax1.plot(error, loss, color=color, linewidth=2, label=label)
# Add reference: squared error for comparison
squared_error = error**2
ax1.plot(error, squared_error, 'k--', linewidth=1.5, alpha=0.5, label='Squared error')
ax1.axvline(x=0, color='gray', linestyle=':', alpha=0.5)
ax1.axhline(y=0, color='gray', linestyle=':', alpha=0.5)
ax1.set_xlabel(r'Error: $q - y$', fontsize=11)
ax1.set_ylabel('Loss', fontsize=11)
ax1.set_title('Gumbel Loss Function', fontsize=12, fontweight='bold')
ax1.legend(fontsize=9)
ax1.grid(True, alpha=0.3)
ax1.set_ylim([0, 10])
# Annotate regions
ax1.text(-2, 7, 'Underestimation\n' + r'$(q < y)$',
fontsize=9, ha='center', color='darkred', alpha=0.7)
ax1.text(2, 7, 'Overestimation\n' + r'$(q > y)$',
fontsize=9, ha='center', color='darkblue', alpha=0.7)
# Plot 2: Gradient (score function) - showing asymmetry
for beta, color, label in zip(betas, colors, labels):
    gradient = (1 / beta) * (np.exp(error / beta) - 1)  # derivative of the Gumbel loss
ax2.plot(error, gradient, color=color, linewidth=2, label=label)
# Add reference: gradient of squared error
squared_error_grad = 2 * error
ax2.plot(error, squared_error_grad, 'k--', linewidth=1.5, alpha=0.5, label='Squared error')
ax2.axvline(x=0, color='gray', linestyle=':', alpha=0.5)
ax2.axhline(y=0, color='gray', linestyle=':', alpha=0.5)
ax2.set_xlabel(r'Error: $q - y$', fontsize=11)
ax2.set_ylabel('Gradient magnitude', fontsize=11)
ax2.set_title('Gumbel Loss Gradient (Score Function)', fontsize=12, fontweight='bold')
ax2.legend(fontsize=9)
ax2.grid(True, alpha=0.3)
ax2.set_ylim([-5, 10])
# Annotate asymmetry
ax2.text(-2, -3, 'Mild gradient\n(tolerates underestimation)',
fontsize=9, ha='center', color='darkred', alpha=0.7)
ax2.text(2, 7, 'Steep gradient\n(penalizes overestimation)',
fontsize=9, ha='center', color='darkblue', alpha=0.7)
plt.tight_layout()
```
The left panel shows the Gumbel loss as a function of the error $q - y$ for several temperatures $\beta$, with squared error as a symmetric reference. The right panel shows the gradient (score function), which determines how strongly the optimizer pushes back against errors. For overestimation ($q > y$) the gradient grows exponentially, while for underestimation ($q < y$) it saturates near $-1/\beta$: the optimizer corrects overestimation aggressively but tolerates underestimation.
```{admonition} Gumbel in the loss vs. Gumbel in the target
:class: dropdown
XQL (Gumbel loss) and Soft Q-learning both involve Gumbel distributions, but they operate at different levels of the FQI template. Recall our unified framework: buffer $\mathcal{B}_t$, target function $g$, loss $\ell$, optimization budget $K$.
**Soft Q-learning** changes the target function by using the smooth Bellman operator from [regularized MDPs](smoothing.md):
$$
g^{\text{soft}}(s,a,r,s'; \boldsymbol{\theta}) = r + \gamma \frac{1}{\beta}\log\sum_{a'} \exp(\beta q(s',a'; \boldsymbol{\theta}))
$$
This replaces the hard max with logsumexp for entropy regularization. The loss remains standard: $\ell(q,y) = (q-y)^2$. The Gumbel distribution appears through the Gumbel-max trick (logsumexp = soft max), but this is about making the optimal policy stochastic, not about the noise structure in TD errors.
**XQL** keeps the hard max target function:
$$
g^{\text{hard}}(s,a,r,s'; \boldsymbol{\theta}) = r + \gamma \max_{a'} q(s',a'; \boldsymbol{\theta})
$$
and instead changes the loss to Gumbel regression: $\ell_{\text{Gumbel}}(q,y) = \exp\!\left(\frac{q-y}{\beta}\right) - \frac{q-y}{\beta} - 1$. The Gumbel distribution here models the noise in the TD errors themselves, not the policy.
The asymmetric penalization in XQL (exponential penalty for overestimation) comes from the score function of the Gumbel likelihood, which is designed to handle extreme-value noise. Soft Q-learning uses symmetric L2 loss, so it does not preferentially penalize overestimation. The logsumexp smoothing in Soft Q reduces maximization bias by averaging over actions, but this is a property of the target function, not the loss geometry.
Both can be combined: Soft Q-learning with Gumbel loss would change both the target function (logsumexp) and the loss (asymmetric penalization).
```
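The two target functions side by side, sketched in NumPy (the Q-values are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, beta, r = 0.99, 2.0, 0.5
q_next = rng.normal(size=4)                  # q(s', a'; theta) over 4 actions

# Hard target (FQI, DQN, XQL) vs. smooth logsumexp target (Soft Q).
g_hard = r + gamma * q_next.max()
g_soft = r + gamma * np.log(np.exp(beta * q_next).sum()) / beta

print(f"hard: {g_hard:.3f}   soft: {g_soft:.3f}")   # soft >= hard; gap shrinks as beta grows
```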
The practical advantage of XQL is that the loss function itself handles the asymmetric error structure through its score function. We do not need to estimate variances or compute weighted averages. XQL has shown improvements in both value-based and actor-critic methods, particularly in offline reinforcement learning where the max-operator bias compounds across iterations without corrective exploration.
From the buffer viewpoint, nothing changes upstream: we still sample transitions from $\hat{P}_{\mathcal{B}_t}$ and compute scalar TD targets. What changes is how the downstream regression problem is posed.
Choose a finite grid of support values $z_1 < z_2 < \cdots < z_K$ spanning the range of plausible returns. Each scalar TD target $y$ is then encoded as a probability distribution over this grid: find the bracketing bins $z_j \le y < z_{j+1}$ and assign two-hot weights

$$
q_j(y) = \frac{z_{j+1} - y}{z_{j+1} - z_j}, \qquad q_{j+1}(y) = \frac{y - z_j}{z_{j+1} - z_j},
$$

with all other weights zero, so that $\sum_k z_k\, q_k(y) = y$.
This is exactly the barycentric interpolation used in the interpolation-based backward recursion ({prf:ref}`backward-recursion-interp`).
```{code-cell} ipython3
:tags: [hide-input]
# label: fig-two-hot
# caption: Two-hot encoding as barycentric interpolation: the left panel shows a single target spread over two bins, while the right panel compares encodings for multiple targets.
%config InlineBackend.figure_format = 'retina'
import numpy as np
import matplotlib.pyplot as plt
# Apply book style
try:
import scienceplots
plt.style.use(['science', 'notebook'])
except (ImportError, OSError):
pass # Use matplotlib defaults
def two_hot_encode(y, z_bins):
"""Compute two-hot encoding weights for target value y on grid z_bins."""
K = len(z_bins)
q = np.zeros(K)
# Clip to grid boundaries
if y <= z_bins[0]:
q[0] = 1.0
return q, 0
if y >= z_bins[-1]:
q[-1] = 1.0
return q, K - 1
# Find bins j, j+1 such that z_j <= y < z_{j+1}
j = np.searchsorted(z_bins, y, side='right') - 1
# Barycentric coordinates (linear interpolation)
q[j+1] = (y - z_bins[j]) / (z_bins[j+1] - z_bins[j])
q[j] = (z_bins[j+1] - y) / (z_bins[j+1] - z_bins[j])
return q, j
# Set up the figure with two panels
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Define a grid of K=11 bins spanning [-5, 5]
K = 11
z_bins = np.linspace(-5, 5, K)
# Panel 1: Two-hot encoding for a specific target
y_target = 0.7
q, j = two_hot_encode(y_target, z_bins)
# Plot bins as vertical lines
for z in z_bins:
ax1.axvline(z, color='lightgray', linestyle='--', linewidth=0.8, alpha=0.5)
# Plot the distribution
colors_bar = ['#E63946' if qi > 0 else '#A8DADC' for qi in q]
ax1.bar(z_bins, q, width=(z_bins[1] - z_bins[0]) * 0.8,
color=colors_bar, alpha=0.7, edgecolor='black', linewidth=1.5)
# Highlight the target value
ax1.axvline(y_target, color='#2E86AB', linewidth=3, label=f'Target $y = {y_target}$', zorder=10)
ax1.plot(y_target, 0, 'o', color='#2E86AB', markersize=12, zorder=11)
# Annotate the two active bins with barycentric coordinates
ax1.annotate(f'$q_{{{j}}} = {q[j]:.3f}$\n$= \\frac{{z_{{{j+1}}} - y}}{{z_{{{j+1}}} - z_{{{j}}}}}$',
xy=(z_bins[j], q[j]), xytext=(z_bins[j] - 1.8, q[j] + 0.25),
fontsize=10, color='#E63946', weight='bold',
arrowprops=dict(arrowstyle='->', color='#E63946', lw=2))
ax1.annotate(f'$q_{{{j+1}}} = {q[j+1]:.3f}$\n$= \\frac{{y - z_{{{j}}}}}{{z_{{{j+1}}} - z_{{{j}}}}}$',
xy=(z_bins[j+1], q[j+1]), xytext=(z_bins[j+1] + 0.5, q[j+1] + 0.25),
fontsize=10, color='#E63946', weight='bold',
arrowprops=dict(arrowstyle='->', color='#E63946', lw=2))
# Verify barycentric property
reconstruction = np.sum(z_bins * q)
textstr = f'Barycentric property:\n$\\sum_k z_k q_k(y) = {reconstruction:.4f} = y$'
ax1.text(0.02, 0.97, textstr, transform=ax1.transAxes, fontsize=10,
verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
ax1.set_xlabel('Bin value $z_k$', fontsize=11)
ax1.set_ylabel('Probability $q_k(y)$', fontsize=11)
ax1.set_title('Two-Hot Encoding = Linear Interpolation', fontsize=12, fontweight='bold')
ax1.set_ylim([0, 1.2])
ax1.legend(loc='upper right', fontsize=10)
ax1.grid(axis='y', alpha=0.3)
# Panel 2: Multiple targets to show the pattern
targets = [0.7, -2.3, 3.8]
colors = ['#2E86AB', '#A23B72', '#F18F01']
for idx, (y_target, color) in enumerate(zip(targets, colors)):
q, _ = two_hot_encode(y_target, z_bins)
offset = idx * 0.2
ax2.bar(z_bins + offset, q, width=0.18, color=color, alpha=0.7,
label=f'$y = {y_target}$', edgecolor='black', linewidth=0.8)
for z in z_bins:
ax2.axvline(z, color='lightgray', linestyle='--', linewidth=0.5, alpha=0.4)
ax2.set_xlabel('Bin value $z_k$', fontsize=11)
ax2.set_ylabel('Probability $q_k(y)$', fontsize=11)
ax2.set_title('Multiple Targets: Each Encoded as Two-Hot', fontsize=12, fontweight='bold')
ax2.legend(loc='upper right', fontsize=10)
ax2.grid(axis='y', alpha=0.3)
plt.tight_layout()
```
The left panel shows two-hot encoding for a single target $y = 0.7$ on an 11-bin grid: only the two bracketing bins receive mass, and the barycentric property $\sum_k z_k q_k(y) = y$ holds exactly. The right panel shows the same pattern for several targets.
The loss minimizes cross-entropy between the two-hot target distribution and the network's predicted distribution $p_k(s,a; \boldsymbol{\theta})$ over bins:

$$
\ell(\boldsymbol{\theta}) = -\sum_{k=1}^{K} q_k(y) \log p_k(s,a; \boldsymbol{\theta}),
$$

which projects the target distribution onto the predicted distribution in KL geometry on the simplex rather than in L2 geometry on the reals.
This provides three sources of implicit robustness. First, gradient influence is bounded: each sample contributes a gradient expressed as a difference of probability vectors, whose entries lie in $[-1, 1]$, so no single outlier target can dominate a mini-batch. Second, quantization to a bounded grid caps the effect of extreme target values. Third, spreading mass across neighboring bins smooths the targets.
The two-hot weights are the same barycentric coordinates as in the interpolation-based backward recursion ({prf:ref}`backward-recursion-interp`). This places the encoding within Gordon's monotone approximator framework (Definition {prf:ref}`gordon-averager`): targets are convex combinations preserving order and boundedness. The neural network predicting the bin probabilities then recovers a scalar Q-value as the grid expectation $\sum_k z_k\, p_k(s,a; \boldsymbol{\theta})$.
Empirically, cross-entropy loss scales better with network capacity. Farebrother et al. {cite}`farebrother2024stop` found that L2-based DQN and CQL degrade when Q-networks scale to large ResNets, while classification loss (specifically HL-Gauss, which uses Gaussian smoothing instead of two-hot) maintains performance. The combination of KL geometry, quantization, and smoothing prevents overfitting to noisy targets that plagues L2 with high-capacity networks.
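A sketch of the resulting loss for a single sample, combining the two-hot target above with a softmax output head (the logits are random stand-ins; boundary clipping is simplified for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 11
z_bins = np.linspace(-5, 5, K)

def two_hot(y, z):
    """Two-hot weights for target y on grid z (interior targets only)."""
    q = np.zeros(len(z))
    j = np.clip(np.searchsorted(z, y, side='right') - 1, 0, len(z) - 2)
    q[j + 1] = (y - z[j]) / (z[j + 1] - z[j])
    q[j] = 1.0 - q[j + 1]
    return q

logits = rng.normal(size=K)                  # network outputs for one (s, a)
p = np.exp(logits - logits.max())
p /= p.sum()                                 # predicted distribution over bins

target = two_hot(0.7, z_bins)
cross_entropy = -np.sum(target * np.log(p))  # classification loss for this sample
q_value = np.sum(z_bins * p)                 # implied scalar Q-value (grid expectation)
print(f"loss: {cross_entropy:.3f}   Q: {q_value:.3f}")
```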
Fitted Q-iteration has a two-level structure: an outer loop applies the Bellman operator to construct targets, and an inner loop fits a function approximator to those targets. All algorithms in this chapter instantiate this template with different choices of buffer $\mathcal{B}_t$, target function $g$, loss $\ell$, and optimization budget.
The empirical distribution $\hat{P}_{\mathcal{B}_t}$ unifies offline and online methods through plug-in approximation: replace the unknown transition law $P$ with $\hat{P}_{\mathcal{B}_t}$ and minimize the empirical risk $\mathbb{E}_{((s,a),y) \sim \hat{P}_t^{\text{fit}}} [\ell(q, y)]$. Offline methods use a fixed buffer $\mathcal{B}_t \equiv \mathcal{D}$; online methods let the buffer evolve as new transitions are collected.
Target networks and online networks arise from flattening the nested loops. Merging inner gradient steps with outer value iteration creates a single loop where two sets of parameters coexist: the online network $\boldsymbol{\theta}_t$, updated at every gradient step, and the target network $\boldsymbol{\theta}_{\text{target}}$, frozen between outer-iteration boundaries.
The next chapter directly parameterizes and optimizes policies instead of searching over value functions.