The Keane-Wolpin Bias Correction Algorithm

Keane and Wolpin proposed to de-bias such estimators by essentially "learning" the bias, then subtracting it when computing the empirical Bellman operator. If we knew this bias function, we could subtract it from our empirical estimate to get an unbiased estimate of the true Bellman operator:

$$ (\widehat{\Bellman}v_n)(s) - \text{bias}(s) = (\widehat{\Bellman}v_n)(s) - (\mathbb{E}[(\widehat{\Bellman}v_n)(s)] - (\Bellman v_n)(s)) \approx (\Bellman v_n)(s) $$

This equality holds in expectation, though any individual estimate would still have variance around the true value.

So how can we estimate the bias function? The Keane-Wolpin approach relies on an observation from extreme value theory: for normal random variables, the difference between the expected maximum and the maximum of the expectations scales with the standard deviation:

$$ \mathbb{E}\left[\max_{a \in \mathcal{A}} \hat{q}_i(s,a)\right] - \max_{a \in \mathcal{A}} \mathbb{E}[\hat{q}_i(s,a)] \approx c \cdot \sqrt{\max_{a \in \mathcal{A}} \text{Var}_i(s,a)} $$

The variance term $\max_{a \in \mathcal{A}} \text{Var}_i(s,a)$ will typically be dominated by the action with the largest value -- the greedy action $a^*_i(s)$. Rather than deriving the constant $c$ theoretically, Keane-Wolpin proposed learning the relationship between variance and bias empirically through these steps:

  1. Select a small set of "benchmark" states (typically 20-50) that span the state space

  2. For these states, compute more accurate value estimates using many more Monte Carlo samples (10-100x more than usual)

  3. Compute the empirical bias at each benchmark state $s$:

    $$\hat{b}_i(s) = (\hat{\Bellman}v_i)(s) - (\hat{\Bellman}_{\text{accurate}}v_i)(s)$$

  4. Fit a linear relationship between this bias and the variance at the greedy action:

    $$\hat{b}_i(s) = \alpha_i \cdot \text{Var}_i(s,a^*_i(s)) + \epsilon$$

This creates a dataset of pairs $(\text{Var}_i(s,a^*_i(s)), \hat{b}_i(s))$ that can be used to estimate $\alpha_i$ through ordinary least squares regression. Once we have learned this bias function $\hat{b}$, we can define the bias-corrected Bellman operator:

$$ (\widetilde{\Bellman}v_i)(s) \triangleq (\hat{\Bellman}v_i)(s) - \hat{b}(s) $$
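The regression in step 4 amounts to a one-parameter least-squares fit. A minimal sketch, using synthetic benchmark data purely for illustration (the `true_alpha`, sample size, and noise level are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical benchmark data: for each benchmark state, the variance at the
# greedy action and the bias measured against the accurate operator estimate.
true_alpha = 0.8                                      # made-up ground truth
var_greedy = rng.uniform(0.5, 4.0, size=30)           # Var_i(s, a*_i(s))
bias_hat = true_alpha * var_greedy + rng.normal(0.0, 0.05, size=30)

# Ordinary least squares without intercept, matching b(s) = alpha * Var + eps
alpha_hat = np.sum(var_greedy * bias_hat) / np.sum(var_greedy ** 2)

def bias_correction(var_at_greedy):
    """Learned bias, to be subtracted from the empirical Bellman estimate."""
    return alpha_hat * var_at_greedy
```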

While this bias correction approach has been influential in econometrics, it hasn't gained much traction in the machine learning community. A major drawback is the need for accurate operator estimation at benchmark states, which requires allocating substantially more samples to these states. In the next section, we'll explore an alternative strategy that, while requiring the maintenance of two sets of value estimates, achieves bias correction without demanding additional samples.

Decoupling Selection and Evaluation

A simpler approach to addressing the upward bias is to maintain two separate Q-function estimates: one for action selection and another for evaluation. We first examine the corresponding Monte Carlo value iteration algorithm, then establish why this works mathematically. Assume a Monte Carlo integration setup over Q-factors:

:label: double-mc-value-iteration

**Input**: MDP $(S, A, P, R, \gamma)$, number of samples $N$, tolerance $\varepsilon > 0$, maximum iterations $K$
**Output**: Q-functions $q^A, q^B$

1. Initialize $q^A_0(s,a) = q^B_0(s,a) = 0$ for all $s \in S, a \in A$
2. $i \leftarrow 0$
3. repeat
    1. For each $s \in S, a \in A$:
        1. Draw $s'_j \sim p(\cdot|s,a)$ for $j = 1,\ldots,N$
        2. For network A:
            1. $a^*_j \leftarrow \arg\max_{a'} q^A_i(s'_j,a')$ for $j = 1,\ldots,N$
            2. $q^A_{i+1}(s,a) \leftarrow r(s,a) + \frac{\gamma}{N} \sum_{j=1}^N q^B_i(s'_j,a^*_j)$
        3. For network B:
            1. $b^*_j \leftarrow \arg\max_{a'} q^B_i(s'_j,a')$ for $j = 1,\ldots,N$
            2. $q^B_{i+1}(s,a) \leftarrow r(s,a) + \frac{\gamma}{N} \sum_{j=1}^N q^A_i(s'_j,b^*_j)$
    2. $\delta \leftarrow \max(\|q^A_{i+1} - q^A_i\|_{\infty}, \|q^B_{i+1} - q^B_i\|_{\infty})$
    3. $i \leftarrow i + 1$
4. until $\delta < \varepsilon$ or $i \geq K$
5. return the final estimates $q^A_i, q^B_i$

In this algorithm, we maintain two separate Q-functions ($q^A$ and $q^B$) and use them asymmetrically: when updating $q^A$, we use network A to select the best action at each sampled next state ($\arg\max_{a'} q^A_i(s'_j,a')$) but then evaluate that action using network B's estimate. We do the opposite when updating $q^B$. You can see this separation in steps 3.1.2 and 3.1.3 of the algorithm, where for each network update, we first use one network to pick the action and then plug that chosen action into the other network for evaluation. We will see that this decomposition helps mitigate the positive bias that occurs due to Jensen's inequality.
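The double Monte Carlo iteration above can be sketched in a few lines for a small finite MDP; the random MDP below is illustrative, and the sampling scheme (shared next-state samples for both networks) is one reasonable reading of the pseudocode:

```python
import numpy as np

def double_mc_value_iteration(P, R, gamma, N=50, eps=1e-6, K=300, seed=0):
    """Double Monte Carlo Q-value iteration on a finite MDP.

    P: (S, A, S) transition probabilities; R: (S, A) expected rewards.
    Returns the two Q-function estimates q^A and q^B.
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    qA = np.zeros((S, A))
    qB = np.zeros((S, A))
    for _ in range(K):
        qA_new, qB_new = np.empty_like(qA), np.empty_like(qB)
        for s in range(S):
            for a in range(A):
                nxt = rng.choice(S, size=N, p=P[s, a])    # next-state samples s'_j
                a_star = qA[nxt].argmax(axis=1)           # select with A ...
                qA_new[s, a] = R[s, a] + gamma * qB[nxt, a_star].mean()  # ... evaluate with B
                b_star = qB[nxt].argmax(axis=1)           # select with B ...
                qB_new[s, a] = R[s, a] + gamma * qA[nxt, b_star].mean()  # ... evaluate with A
        delta = max(np.abs(qA_new - qA).max(), np.abs(qB_new - qB).max())
        qA, qB = qA_new, qB_new
        if delta < eps:
            break
    return qA, qB

# Small random MDP, purely for illustration
rng = np.random.default_rng(1)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))
qA, qB = double_mc_value_iteration(P, R, gamma=0.9)
```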

An HVAC analogy

Consider a building where each HVAC unit $i$ has some true maximum power draw $\mu_i$ under worst-case conditions. Suppose we lack access to manufacturer datasheets and must estimate these maxima from actual measurements. The challenge is that power draw fluctuates with environmental conditions. If we use a single day's measurements and look at the highest power draw, we systematically overestimate the true maximum draw across all units.

To see this, let $X_A^i$ be unit $i$'s power draw on day A and $X_B^i$ its power draw on day B. While both measurements are unbiased, $\mathbb{E}[X_A^i] = \mathbb{E}[X_B^i] = \mu_i$, the maximum reading is biased upward by Jensen's inequality:

$$ \mathbb{E}[\max_i X_A^i] \geq \max_i \mathbb{E}[X_A^i] = \max_i \mu_i $$

Intuitively, this problem occurs because the highest reading tends to come from a unit that experienced particularly demanding conditions on that day (e.g., direct sunlight, full occupancy, peak humidity) rather than from the unit with the highest true maximum draw. To estimate the true maximum power draw more accurately, we use the following measurement protocol:

  1. Use day A measurements to select which unit hit the highest peak
  2. Use day B measurements to evaluate that unit's power consumption

This yields the estimator:

$$ Y = X_B^{\arg\max_i X_A^i} $$

We can show that by decoupling selection and evaluation in this fashion, our estimator $Y$ will no longer systematically overestimate the true maximum draw. First, observe that $\arg\max_i X_A^i$ is a random variable (call it $J$): it tells us which unit had the highest power draw on day A. It has some probability distribution determined by day A's conditions: $P(J = j) = P(\arg\max_i X_A^i = j)$. Using the law of total expectation:

$$ \begin{align*} \mathbb{E}[Y] = \mathbb{E}[X_B^J] &= \mathbb{E}[\mathbb{E}[X_B^J \mid J]] \text{ (by tower property)} \\ &= \sum_{j=1}^n \mathbb{E}[X_B^j \mid J = j] P(J = j) \\ &= \sum_{j=1}^n \mathbb{E}[X_B^j \mid \arg\max_i X_A^i = j] P(\arg\max_i X_A^i = j) \end{align*} $$

Note that unit $j$'s power draw on day B ($X_B^j$) is independent of the event that it had the highest reading on day A ($\{\arg\max_i X_A^i = j\}$). An extreme cold event on day A should not affect day B's readings (especially in Quebec, where the weather tends to vary widely from day to day). Therefore:

$$ \mathbb{E}[X_B^j \mid \arg\max_i X_A^i = j] = \mathbb{E}[X_B^j] = \mu_j $$

This tells us that the two-day estimator is now an average of the true underlying power consumptions:

$$ \mathbb{E}[Y] = \sum_{j=1}^n \mu_j P(\arg\max_i X_A^i = j) $$

To analyze $ \mathbb{E}[Y] $ more closely, we use a general result: if we have a real-valued function $ f $ defined on a discrete set of units $ \{1, \dots, n\} $ and a probability distribution $ q(\cdot) $ over these units, then the maximum value of $ f $ across all units is at least as large as the weighted sum of $ f $ values with weights $ q $. Formally,

$$ \max_{j \in \{1, \dots, n\}} f(j) \geq \sum_{j=1}^n q(j) f(j). $$

Applying this to our setting, we set $ f(j) = \mu_j $ (the true maximum power draw for unit $ j $) and $ q(j) = P(J = j) $ (the probability that unit $ j $ achieves the maximum reading on day A). This gives us:

$$ \max_{j \in \{1, \dots, n\}} \mu_j \geq \sum_{j=1}^n P(J = j) \mu_j = \mathbb{E}[Y]. $$

Therefore, the expected value of $ Y $ (our estimator) will always be less than or equal to the true maximum value $ \max_j \mu_j $. In other words, $ Y $ provides a conservative estimate of the true maximum: it tends not to overestimate $ \max_j \mu_j $ but instead approximates it as closely as possible without systematic upward bias.
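A quick simulation makes both inequalities concrete; the unit means and noise level below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([8.0, 9.0, 9.5, 7.5, 8.5])   # hypothetical true maxima (kW)
sigma = 1.5                                 # environmental noise
n_days = 100_000                            # simulated (day A, day B) pairs

X_A = mu + sigma * rng.standard_normal((n_days, mu.size))
X_B = mu + sigma * rng.standard_normal((n_days, mu.size))

# Single-day estimator: the highest reading on day A (biased upward)
naive = X_A.max(axis=1)

# Decoupled estimator: select on day A, evaluate on day B
J = X_A.argmax(axis=1)
Y = X_B[np.arange(n_days), J]

print(f"true max:       {mu.max():.3f}")
print(f"E[max_i X_A^i]: {naive.mean():.3f}  (overestimates)")
print(f"E[Y]:           {Y.mean():.3f}  (underestimates)")
```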

Consistency

Even though $ Y $ is not an unbiased estimator of $ \max_j \mu_j $ (since $ \mathbb{E}[Y] \leq \max_j \mu_j $), it is consistent. As more independent days (or measurements) are observed, the selection-evaluation procedure becomes more effective at isolating the intrinsic maximum, reducing the influence of day-specific environmental fluctuations. Over time, this approach yields a stable and increasingly accurate approximation of $ \max_j \mu_j $.

To show that $ Y $ is a consistent estimator of $ \max_i \mu_i $, we want to demonstrate that as the number of independent measurements (days, in this case) increases, $ Y $ converges in probability to $ \max_i \mu_i $. Suppose we have $ m $ independent days of measurements for each unit. Denote:

  • $ X_A^{(k),i} $ as the power draw for unit $ i $ on day $ A_k $, where $ k \in \{1, \dots, m\} $.
  • $ J_m = \arg\max_i \left( \frac{1}{m} \sum_{k=1}^m X_A^{(k),i} \right) $, which identifies the unit with the highest average power draw over $ m $ days.

The estimator we construct is: $ Y_m = X_B^{(J_m)}, $ where $ X_B^{(J_m)} $ is the power draw of the selected unit $ J_m $ on an independent evaluation day $ B $. We will now show that $ Y_m $ converges to $ \max_i \mu_i $ as $ m \to \infty $. This involves two main steps:

  1. Consistency of the Selection Step $ J_m $: As $ m \to \infty $, the unit selected by $ J_m $ will tend to be the one with the true maximum power draw $ \max_i \mu_i $.
  2. Convergence of $ Y_m $ to $ \mu_{J_m} $: Since the evaluation day $ B $ measurement $ X_B^{(J_m)} $ is unbiased with expectation $ \mu_{J_m} $, as $ m \to \infty $, $ Y_m $ will converge to $ \mu_{J_m} $, which in turn converges to $ \max_i \mu_i $.

The average power draw over $ m $ days for each unit $ i $ is:

$$ \frac{1}{m} \sum_{k=1}^m X_A^{(k),i}. $$

By the law of large numbers, as $ m \to \infty $, this sample average converges to the true expected power draw $ \mu_i $ for each unit $ i $:

$$ \frac{1}{m} \sum_{k=1}^m X_A^{(k),i} \xrightarrow{m \to \infty} \mu_i. $$

Since $ J_m $ selects the unit with the highest sample average, in the limit $ J_m $ will almost surely select the unit with the highest true mean, $ \arg\max_i \mu_i $. Thus, as $ m \to \infty $, $$ \mu_{J_m} \to \max_i \mu_i. $$

Given that $ J_m $ identifies the unit with the maximum true mean power draw in the limit, we now look at $ Y_m = X_B^{(J_m)} $, which is the power draw of unit $ J_m $ on the independent evaluation day $ B $.

Since $ X_B^{(J_m)} $ is an unbiased estimator of $ \mu_{J_m} $, we have:

$$ \mathbb{E}[Y_m \mid J_m] = \mu_{J_m}. $$

As $ m \to \infty $, $ \mu_{J_m} $ converges to $ \max_i \mu_i $. Thus, $ Y_m $ will also converge in probability to $ \max_i \mu_i $ because $ Y_m $ is centered around $ \mu_{J_m} $ and $ J_m $ converges to the index of the unit with $ \max_i \mu_i $.

Combining these two steps, we conclude that:

$$ Y_m \xrightarrow{m \to \infty} \max_i \mu_i \text{ in probability}. $$

This establishes the consistency of $ Y $ as an estimator for $ \max_i \mu_i $: as the number of independent measurements grows, $ Y_m $ converges to the true maximum power draw $ \max_i \mu_i $.
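We can also see this convergence numerically. The sketch below reuses the hypothetical unit means from before and averages the selection measurements over $m$ days:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([8.0, 9.0, 9.5, 7.5, 8.5])   # hypothetical true maxima (kW)
sigma = 1.5                                 # day-to-day noise

def select_and_evaluate(m, trials=2000):
    """Average of the estimator Y_m over many independent experiments."""
    # m selection days per unit, for each trial
    X_A = mu + sigma * rng.standard_normal((trials, m, mu.size))
    J_m = X_A.mean(axis=1).argmax(axis=1)   # unit with the best m-day average
    X_B = mu + sigma * rng.standard_normal((trials, mu.size))   # evaluation day
    return X_B[np.arange(trials), J_m].mean()

for m in (1, 10, 100):
    print(f"m={m:3d}: E[Y_m] ~ {select_and_evaluate(m):.3f}  (true max = {mu.max()})")
```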

Initialization and Warmstarting

Parametric dynamic programming involves solving a sequence of related optimization problems, one for each fitting procedure at each iteration. While we've presented these as independent fitting problems, in practice we can leverage the relationship between successive iterations through careful initialization. This "warmstarting" strategy can significantly impact both computational efficiency and solution quality.

The basic idea is simple: rather than starting each fitting procedure from scratch, we initialize the function approximator with parameters from the previous iteration. This can speed up convergence since successive Q-functions tend to be similar. However, recent work suggests that persistent warmstarting might sometimes be detrimental, potentially leading to a form of overfitting. Alternative "reset" strategies that occasionally reinitialize parameters have shown promise in mitigating this issue.

Warmstarting can be incorporated into parametric Q-learning with one-step Monte Carlo integration as follows:

:label: warmstarted-q-learning

**Input** Given dataset $\mathcal{D}$ with transitions $(s, a, r, s')$, function approximator class $q(s,a; \boldsymbol{\theta})$, maximum iterations $N$, tolerance $\varepsilon > 0$, reset period $k$

**Output** Parameters $\boldsymbol{\theta}$ for Q-function approximation

1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $n \leftarrow 0$
3. **repeat**
    1. $\mathcal{P} \leftarrow \emptyset$  // Regression dataset
    2. For each $(s,a,r,s') \in \mathcal{D}$:
        1. $y_{s,a} \leftarrow r + \gamma \max_{a'} q(s',a'; \boldsymbol{\theta}_n)$  // One-step Monte Carlo estimate
        2. $\mathcal{P} \leftarrow \mathcal{P} \cup \{((s,a), y_{s,a})\}$
    3. **if** $n \bmod k = 0$:  // Reset parameters periodically
        1. Initialize $\boldsymbol{\theta}_{temp}$ randomly
    4. **else**:
        1. $\boldsymbol{\theta}_{temp} \leftarrow \boldsymbol{\theta}_n$  // Warmstart from previous iteration
    5. $\boldsymbol{\theta}_{n+1} \leftarrow \texttt{fit}(\mathcal{P}, \boldsymbol{\theta}_{temp})$  // Initialize optimizer with $\boldsymbol{\theta}_{temp}$
    6. $\delta \leftarrow \frac{1}{|\mathcal{P}|}\sum_{((s,a), y_{s,a}) \in \mathcal{P}} (q(s,a; \boldsymbol{\theta}_{n+1}) - q(s,a; \boldsymbol{\theta}_n))^2$
    7. $n \leftarrow n + 1$
4. **until** ($\delta < \varepsilon$ or $n \geq N$)
5. **return** $\boldsymbol{\theta}_n$

The main addition here is the periodic reset of parameters (controlled by frequency $k$) which helps balance the benefits of warmstarting with the need to avoid potential overfitting. When $k=\infty$, we get traditional persistent warmstarting, while $k=1$ corresponds to training from scratch each iteration.
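The reset schedule can be sketched as follows; the tabular Q-function stands in for the parametric approximator, and the MDP, learning rate, and step counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite MDP with one observed transition per (s, a)
S, A, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))
D = [(s, a, R[s, a], rng.choice(S, p=P[s, a])) for s in range(S) for a in range(A)]

def fit(targets, theta_init, steps=50, lr=0.5):
    """Gradient descent on the squared regression loss, started at theta_init."""
    theta = theta_init.copy()
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for (s, a), y in targets.items():
            grad[s, a] += 2 * (theta[s, a] - y) / len(targets)
        theta -= lr * grad
    return theta

k = 5                       # reset period: k=1 trains from scratch each iteration
theta = np.zeros((S, A))    # tabular q(s, a; theta)
for n in range(30):
    targets = {(s, a): r + gamma * theta[s2].max() for s, a, r, s2 in D}
    # Periodic reset vs warmstart from the previous iteration
    theta_temp = rng.standard_normal((S, A)) if n % k == 0 else theta
    theta = fit(targets, theta_temp)
```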

Inner Loop Convergence

Beyond the choice of initialization and whether to chain optimization problems through warmstarting, we can also control how we terminate the inner optimization procedure. In the templates presented above, we implicitly assumed that $\texttt{fit}$ is run to convergence. However, this need not be the case, and different implementations handle this differently.

For example, scikit-learn's MLPRegressor terminates based on several criteria: when the improvement in loss falls below a tolerance (default tol=1e-4), when it reaches the maximum number of iterations (default max_iter=200), or when the loss fails to improve for n_iter_no_change consecutive epochs. In contrast, ExtraTreesRegressor builds trees deterministically to completion based on its splitting criteria, with termination controlled by parameters like min_samples_split and max_depth.

The intuition for using early stopping in the inner optimization mirrors that of modified policy iteration in exact dynamic programming. Just as modified policy iteration truncates the Neumann series during policy evaluation rather than solving to convergence, we might only partially optimize our function approximator at each iteration. While this complicates the theoretical analysis, it often works well in practice and can be computationally more efficient.

This perspective helps us understand modern deep reinforcement learning algorithms. For instance, DQN can be viewed as an instance of fitted Q-iteration where the inner optimization is intentionally limited. We can formalize this approach as follows:

:label: early-stopping-fqi

**Input** Given dataset $\mathcal{D}$ with transitions $(s, a, r, s')$, function approximator $q(s,a; \boldsymbol{\theta})$, maximum outer iterations $N_{outer}$, maximum inner iterations $N_{inner}$, outer tolerance $\varepsilon_{outer}$, inner tolerance $\varepsilon_{inner}$

**Output** Parameters $\boldsymbol{\theta}$ for Q-function approximation

1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $n \leftarrow 0$
3. **repeat**
    1. $\mathcal{P} \leftarrow \emptyset$
    2. For each $(s,a,r,s') \in \mathcal{D}$:
        1. $y_{s,a} \leftarrow r + \gamma \max_{a'} q(s',a'; \boldsymbol{\theta}_n)$
        2. $\mathcal{P} \leftarrow \mathcal{P} \cup \{((s,a), y_{s,a})\}$
    3. // Inner optimization loop with early stopping
    4. $\boldsymbol{\theta}_{temp} \leftarrow \boldsymbol{\theta}_n$
    5. $k \leftarrow 0$
    6. **repeat**
        1. Update $\boldsymbol{\theta}_{temp}$ using one step of optimizer on $\mathcal{P}$
        2. Compute inner loop loss $\delta_{inner}$
        3. $k \leftarrow k + 1$
    7. **until** ($\delta_{inner} < \varepsilon_{inner}$ or $k \geq N_{inner}$)
    8. $\boldsymbol{\theta}_{n+1} \leftarrow \boldsymbol{\theta}_{temp}$
    9. $\delta_{outer} \leftarrow \frac{1}{|\mathcal{D}|}\sum_{(s,a) \in \mathcal{D}} (q(s,a; \boldsymbol{\theta}_{n+1}) - q(s,a; \boldsymbol{\theta}_n))^2$
    10. $n \leftarrow n + 1$
4. **until** ($\delta_{outer} < \varepsilon_{outer}$ or $n \geq N_{outer}$)
5. **return** $\boldsymbol{\theta}_n$

This formulation makes explicit the two-level optimization structure and allows us to control the trade-off between inner loop optimization accuracy and overall computational efficiency. When $N_{inner}=1$, we recover something closer to DQN's update rule, while larger values of $N_{inner}$ bring us closer to the full fitted Q-iteration approach.
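A tabular sketch of this two-level structure, comparing `N_inner=1` (DQN-like) with a large `N_inner` (closer to full fitted Q iteration); the MDP and optimizer settings are made up, and with a fixed dataset both variants reach essentially the same fixed point:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative finite MDP with one observed transition per (s, a)
S, A, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))
D = [(s, a, R[s, a], rng.choice(S, p=P[s, a])) for s in range(S) for a in range(A)]

def fitted_q(N_inner, lr=0.25, N_outer=500, eps_outer=1e-10, eps_inner=1e-12):
    """Fitted Q iteration with an early-stopped inner optimization loop."""
    theta = np.zeros((S, A))                 # tabular q(s, a; theta)
    for _ in range(N_outer):
        # Regression targets from the current parameters
        y = np.array([r + gamma * theta[s2].max() for _, _, r, s2 in D]).reshape(S, A)
        theta_tmp = theta.copy()             # warmstart the inner loop
        for _ in range(N_inner):
            grad = 2 * (theta_tmp - y)       # gradient of the squared loss
            theta_tmp = theta_tmp - lr * grad
            if (grad ** 2).mean() < eps_inner:   # inner early stopping
                break
        delta = ((theta_tmp - theta) ** 2).mean()
        theta = theta_tmp
        if delta < eps_outer:                # outer stopping test
            break
    return theta

q_dqn_like = fitted_q(N_inner=1)     # one gradient step per target refresh
q_full_fqi = fitted_q(N_inner=500)   # inner loop run (nearly) to convergence
```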

Example Methods

There are several moving parts we can swap in and out when working with parametric dynamic programming - from the function approximator we choose, to how we warm start things, to the specific methods we use for numerical integration and inner optimization. In this section, we'll look at some concrete examples and see how they fit into this general framework.

Kernel-Based Reinforcement Learning (2002)

Ormoneit and Sen's Kernel-Based Reinforcement Learning (KBRL) {cite}Ormoneit2002 helped establish the general paradigm of batch reinforcement learning later advocated by {cite}ErnstGW05. KBRL is a purely offline method that first collects a fixed set of transitions and then uses kernel regression to solve the optimal control problem through value iteration on this dataset. While the dominant approaches at the time were online methods like temporal difference learning, KBRL showed that another path to developing reinforcement learning algorithms was possible: one capable of leveraging advances in supervised learning to provide both theoretical and practical benefits.

As the name suggests, KBRL uses kernel-based regression within the general framework outlined above.

:label: kernel-based-q-iteration

**Input** Given an MDP $(S, A, P, R, \gamma)$, dataset $\mathcal{D}$ with observed transitions $(s, a, r, s')$, kernel bandwidth $b$, maximum iterations $N$, tolerance $\varepsilon > 0$

**Output** Kernel-based Q-function approximation

1. Initialize $\hat{Q}_0$ to zero everywhere
2. $n \leftarrow 0$
3. **repeat**
    1. $\mathcal{P} \leftarrow \emptyset$  // Regression dataset
    2. For each $(s, a, r, s') \in \mathcal{D}$:
        1. $y_{s,a} \leftarrow r + \gamma \max_{a' \in A} \hat{Q}_n(s', a')$
        2. $\mathcal{P} \leftarrow \mathcal{P} \cup \{((s,a), y_{s,a})\}$
    3. $\hat{Q}_{n+1}(s,a) \leftarrow \sum_{(s_i,a_i,r_i,s_i') \in \mathcal{D}} k_b(s_i, s)\mathbb{1}[a_i=a] y_{s_i,a_i} \Big/ \sum_{(s_i,a_i,r_i,s_i') \in \mathcal{D}} k_b(s_i, s)\mathbb{1}[a_i=a]$
    4. $\delta \leftarrow \frac{1}{|\mathcal{D}|}\sum_{(s,a,r,s') \in \mathcal{D}} (\hat{Q}_{n+1}(s,a) - \hat{Q}_n(s,a))^2$
    5. $n \leftarrow n + 1$
4. **until** ($\delta < \varepsilon$ or $n \geq N$)
5. **return** $\hat{Q}_n$

Step 3 is where KBRL uses kernel regression with a normalized weighting kernel:

$$k_b(x^l_t, x) = \frac{\phi(|x^l_t - x|/b)}{\sum_{l'} \phi(|x^{l'}_t - x|/b)}$$

where $\phi$ is a kernel function (often Gaussian) and $b$ is the bandwidth parameter. Each iteration reuses the entire fixed dataset to re-estimate Q-values through this kernel regression.
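The normalized weighting is just Nadaraya-Watson kernel regression. A minimal one-dimensional sketch with a Gaussian $\phi$ (the sample locations and stand-in targets are illustrative):

```python
import numpy as np

def gaussian_kernel_regression(x_query, xs, ys, b):
    """Nadaraya-Watson estimate at x_query with a Gaussian kernel phi."""
    weights = np.exp(-0.5 * ((xs - x_query) / b) ** 2)
    weights /= weights.sum()        # normalized weighting kernel k_b
    return weights @ ys

# In KBRL, ys would be the targets r_i + gamma * max_a' Q_n(s'_i, a');
# here we use stand-in values for illustration.
xs = np.linspace(0.0, 1.0, 11)
ys = xs ** 2
estimate = gaussian_kernel_regression(0.5, xs, ys, b=0.1)
```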

An important theoretical contribution of KBRL is showing that this kernel-based approach ensures convergence of the Q-function sequence. The authors prove that, with appropriate choice of kernel bandwidth decreasing with sample size, the method is consistent - the estimated Q-function converges to the true Q-function as the number of samples grows.

The main practical limitation of KBRL is computational - being a batch method, it requires storing and using all transitions at each iteration, leading to quadratic complexity in the number of samples. The authors acknowledge this limitation for online settings, suggesting that modifications like discarding old samples or summarizing data clusters would be needed for online applications. Ernst's later work with tree-based methods would help address this limitation while maintaining many of the theoretical advantages of the batch approach.

Ernst's Fitted Q Iteration (2005)

Ernst's {cite}ErnstGW05 specific instantiation of parametric q-value iteration uses extremely randomized trees, an extension to random forests proposed by {cite:t}Geurts2006. This algorithm became particularly well-known, partly because it was one of the first to demonstrate the advantages of offline reinforcement learning in practice on several challenging benchmarks at the time.

Random Forests and Extra-Trees differ primarily in how they construct individual trees. Random Forests creates diversity in two ways: it resamples the training data (bootstrap) for each tree, and at each node it randomly selects a subset of features but then searches exhaustively for the best cut-point within each selected feature. In contrast, Extra-Trees uses the full training set for each tree and injects randomization differently: at each node, it randomly selects both features and cut-points without searching for the optimal one. It then picks the best among these completely random splits according to a variance reduction criterion. This double randomization - in both feature and cut-point selection - combined with using the full dataset makes Extra-Trees faster than Random Forests while maintaining similar predictive accuracy.

An important implementation detail concerns how tree structures can be reused across iterations of fitted Q iteration. With parametric methods like neural networks, warmstarting is straightforward - you simply initialize the weights with values from the previous iteration. For decision trees, the situation is more subtle because the model structure is determined by how splits are chosen at each node. When the number of candidate splits per node is $K=1$ (totally randomized trees), the algorithm selects both the splitting variable and threshold purely at random, without looking at the target values (the Q-values we're trying to predict) to evaluate the quality of the split. This means the tree structure only depends on the input variables and random choices, not on what we're predicting. As a result, we can build the trees once in the first iteration and reuse their structure throughout all iterations, only updating the prediction values at the leaves.

Standard Extra-Trees ($K>1$), however, uses target values to choose the best among $K$ random splits by calculating which split best reduces the variance of the predictions. Since these target values change in each iteration of fitted Q iteration (as our estimate of Q evolves), we must rebuild the trees completely. While this is computationally more expensive, it allows the trees to better adapt their structure to capture the evolving Q-function.
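The structural point about $K=1$ can be demonstrated directly: a totally randomized tree built only from the inputs can have its leaf values refit as the targets evolve. This is a toy sketch, not Ernst's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_random_tree(X, depth=0, max_depth=4):
    """Totally randomized tree (K=1): splits depend only on inputs X,
    never on targets, so the structure can be reused across iterations.
    The last column of X carries sample row indices."""
    if depth == max_depth or len(X) <= 2:
        return {"leaf": True, "idx": X[:, -1].astype(int)}
    feat = rng.integers(X.shape[1] - 1)          # random feature
    lo, hi = X[:, feat].min(), X[:, feat].max()
    if lo == hi:
        return {"leaf": True, "idx": X[:, -1].astype(int)}
    cut = rng.uniform(lo, hi)                    # random cut-point
    mask = X[:, feat] <= cut
    return {"leaf": False, "feat": feat, "cut": cut,
            "left": build_random_tree(X[mask], depth + 1, max_depth),
            "right": build_random_tree(X[~mask], depth + 1, max_depth)}

def refit_leaves(tree, y):
    """Update leaf predictions for new targets y, reusing the structure."""
    if tree["leaf"]:
        tree["value"] = y[tree["idx"]].mean() if len(tree["idx"]) else 0.0
        return
    refit_leaves(tree["left"], y)
    refit_leaves(tree["right"], y)

def predict(tree, x):
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feat"]] <= tree["cut"] else tree["right"]
    return tree["value"]

# Build once from inputs; refit leaves as the Q-iteration targets evolve
X = rng.uniform(0, 1, size=(200, 2))
X_aug = np.hstack([X, np.arange(200)[:, None]])   # carry row indices along
tree = build_random_tree(X_aug)
for iteration_targets in (X[:, 0], X[:, 0] + X[:, 1]):   # stand-ins for evolving y
    refit_leaves(tree, iteration_targets)
```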

The complete algorithm can be formalized as follows:

:label: extra-trees-fqi

**Input** Given an MDP $(S, A, P, R, \gamma)$, dataset $\mathcal{D}$ with observed transitions $(s, a, r, s')$, Extra-Trees parameters $(K, n_{min}, M)$, maximum iterations $N$, tolerance $\varepsilon > 0$

**Output** Extra-Trees model for Q-function approximation

1. Initialize $\hat{Q}_0$ to zero everywhere
2. $n \leftarrow 0$
3. **repeat**
    1. $\mathcal{P} \leftarrow \emptyset$  // Regression dataset
    2. For each $(s, a, r, s') \in \mathcal{D}$:
        1. $y_{s,a} \leftarrow r + \gamma \max_{a' \in A} \hat{Q}_n(s', a')$
        2. $\mathcal{P} \leftarrow \mathcal{P} \cup \{((s,a), y_{s,a})\}$
    3. $\hat{Q}_{n+1} \leftarrow \text{BuildExtraTrees}(\mathcal{P}, K, n_{min}, M)$
    4. $\delta \leftarrow \frac{1}{|\mathcal{D}|}\sum_{(s,a,r,s') \in \mathcal{D}} (\hat{Q}_{n+1}(s,a) - \hat{Q}_n(s,a))^2$
    5. $n \leftarrow n + 1$
4. **until** ($\delta < \varepsilon$ or $n \geq N$)
5. **return** $\hat{Q}_n$

Neural Fitted Q Iteration (2005)

Riedmiller's Neural Fitted Q Iteration (NFQI) {cite}Riedmiller05 is a natural instantiation of parametric Q-value iteration where:

  1. The function approximator $q(s,a; \boldsymbol{\theta})$ is a multi-layer perceptron
  2. The $\texttt{fit}$ function uses Rprop optimization trained to convergence on each iteration's pattern set
  3. The expected next-state values are estimated through Monte Carlo integration with $N=1$, using the observed next states from transitions

Specifically, rather than using numerical quadrature, which would require known transition probabilities, NFQI approximates the expected future value using observed transitions:

$$ \int q_n(s',a')p(ds'|s,a) \approx q_n(s'_{observed},a') $$

where $s'_{observed}$ is the actual next state that was observed after taking action $a$ in state $s$. This is equivalent to Monte Carlo integration with a single sample, making the algorithm fully model-free.
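Concretely, each observed transition yields one regression target; the tiny tabular example below is illustrative:

```python
import numpy as np

gamma = 0.9
q_n = np.array([[0.0, 1.0],     # illustrative current q_n over 3 states, 2 actions
                [0.5, 0.2],
                [0.3, 0.7]])

# One observed transition (s, a, r, s') gives one regression target
s, a, r, s_next = 0, 1, 0.25, 2
y = r + gamma * q_n[s_next].max()
```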

The algorithm follows from the parametric Q-value iteration template:

:label: neural-fitted-q-iteration

**Input** Given an MDP $(S, A, P, R, \gamma)$, dataset $\mathcal{D}$ with observed transitions $(s, a, r, s')$, MLP architecture $q(s,a; \boldsymbol{\theta})$, maximum iterations $N$, tolerance $\varepsilon > 0$

**Output** Parameters $\boldsymbol{\theta}$ for Q-function approximation

1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $n \leftarrow 0$
3. **repeat**
    1. $\mathcal{P} \leftarrow \emptyset$  // Pattern set
    2. For each $(s,a,r,s') \in \mathcal{D}$:
        1. $y_{s,a} \leftarrow r + \gamma \max_{a'} q(s',a'; \boldsymbol{\theta}_n)$  // Monte Carlo estimate with one sample
        2. $\mathcal{P} \leftarrow \mathcal{P} \cup \{((s,a), y_{s,a})\}$
    3. $\boldsymbol{\theta}_{n+1} \leftarrow \text{Rprop}(\mathcal{P})$ // Train MLP to convergence
    4. $\delta \leftarrow \frac{1}{|\mathcal{P}|}\sum_{((s,a), y_{s,a}) \in \mathcal{P}} (q(s,a; \boldsymbol{\theta}_{n+1}) - q(s,a; \boldsymbol{\theta}_n))^2$
    5. $n \leftarrow n + 1$
4. **until** ($\delta < \varepsilon$ or $n \geq N$)
5. **return** $\boldsymbol{\theta}_n$

While NFQI was originally introduced as an offline method with base points collected a priori, the authors also present a variant where base points are collected incrementally. In this online variant, new transitions are gathered using the current policy (greedy with respect to $Q_k$) and added to the experience set. This approach proves particularly useful when random exploration cannot efficiently collect representative experiences.

Deep Q Networks (2013)

DQN {cite}mnih2013atari is a close relative of NFQI - in fact, Riedmiller, the author of NFQI, was also an author on the DQN paper. What at first glance might look like a different algorithm can actually be understood as a special case of parametric dynamic programming with practical adaptations. We build this connection step by step.

First, we start with basic parametric Q-value iteration using a neural network:

:label: basic-q-iteration

**Input** Given an MDP $(S, A, P, R, \gamma)$, dataset of transitions $\mathcal{T}$, neural network $q(s,a; \boldsymbol{\theta})$, maximum iterations $N$, tolerance $\varepsilon > 0$, initialization $\boldsymbol{\theta}_0$

**Output** Parameters $\boldsymbol{\theta}$ for Q-function approximation

1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $n \leftarrow 0$
3. **repeat**
    1. $\mathcal{D}_n \leftarrow \emptyset$  // Regression dataset
    2. For each $(s,a,r,s') \in \mathcal{T}$:
        1. $y_{s,a} \leftarrow r + \gamma \max_{a' \in A} q(s',a'; \boldsymbol{\theta}_n)$
        2. $\mathcal{D}_n \leftarrow \mathcal{D}_n \cup \{((s,a), y_{s,a})\}$
    3. $\boldsymbol{\theta}_{n+1} \leftarrow \texttt{fit}(\mathcal{D}_n, \boldsymbol{\theta}_0)$ // Fit neural network using built-in convergence criterion 
    4. $n \leftarrow n + 1$
4. **until** training complete
5. **return** $\boldsymbol{\theta}_n$

Next, we open up the fit procedure to show the inner optimization loop using gradient descent:

:label: q-iteration-inner-loop

**Input** Given MDP $(S, A, P, R, \gamma)$, dataset of transitions $\mathcal{T}$, neural network $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, convergence test $\texttt{has\_converged}(\cdot)$, initialization $\boldsymbol{\theta}_0$, regression loss function $\mathcal{L}$

**Output** Parameters $\boldsymbol{\theta}$ for Q-function approximation

1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $n \leftarrow 0$  // Outer iteration index
3. **repeat**
    1. $\mathcal{D}_n \leftarrow \emptyset$  // Regression dataset
    2. For each $(s,a,r,s') \in \mathcal{T}$:
        1. $y_{s,a} \leftarrow r + \gamma \max_{a' \in A} q(s',a'; \boldsymbol{\theta}_n)$
        2. $\mathcal{D}_n \leftarrow \mathcal{D}_n \cup \{((s,a), y_{s,a})\}$
    3. // Inner optimization loop
    4. $\boldsymbol{\theta}^{(0)} \leftarrow \boldsymbol{\theta}_0$  // Start from initial parameters
    5. $k \leftarrow 0$  // Inner iteration index
    6. **repeat**
        1. $\boldsymbol{\theta}^{(k+1)} \leftarrow \boldsymbol{\theta}^{(k)} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}^{(k)}; \mathcal{D}_n)$
        2. $k \leftarrow k + 1$
    7. **until** $\texttt{has\_converged}(\boldsymbol{\theta}^{(0)}, ..., \boldsymbol{\theta}^{(k)}, \mathcal{D}_n)$
    8. $\boldsymbol{\theta}_{n+1} \leftarrow \boldsymbol{\theta}^{(k)}$
    9. $n \leftarrow n + 1$
4. **until** training complete
5. **return** $\boldsymbol{\theta}_n$

Warmstarting and Partial Fitting

A natural modification is to initialize the inner optimization loop with the previous iteration's parameters - a strategy known as warmstarting - rather than starting from $\boldsymbol{\theta}_0$ each time. Additionally, similar to how modified policy iteration performs partial policy evaluation rather than solving to convergence, we can limit ourselves to a fixed number of optimization steps. These pragmatic changes, when combined, yield:

:label: nfqi-warmstart-partial

**Input** Given MDP $(S, A, P, R, \gamma)$, dataset of transitions $\mathcal{T}$, neural network $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, number of steps $K$

**Output** Parameters $\boldsymbol{\theta}$ for Q-function approximation

1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $n \leftarrow 0$  // Outer iteration index
3. **repeat**
    1. $\mathcal{D}_n \leftarrow \emptyset$  // Regression dataset
    2. For each $(s,a,r,s') \in \mathcal{T}$:
        1. $y_{s,a} \leftarrow r + \gamma \max_{a' \in A} q(s',a'; \boldsymbol{\theta}_n)$
        2. $\mathcal{D}_n \leftarrow \mathcal{D}_n \cup \{((s,a), y_{s,a})\}$
    3. // Inner optimization loop with warmstart and fixed steps
    4. $\boldsymbol{\theta}^{(0)} \leftarrow \boldsymbol{\theta}_n$  // Warmstart from previous iteration
    5. For $k = 0$ to $K-1$:
        1. $\boldsymbol{\theta}^{(k+1)} \leftarrow \boldsymbol{\theta}^{(k)} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}^{(k)}; \mathcal{D}_n)$
    6. $\boldsymbol{\theta}_{n+1} \leftarrow \boldsymbol{\theta}^{(K)}$
    7. $n \leftarrow n + 1$
4. **until** training complete
5. **return** $\boldsymbol{\theta}_n$
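To make the warmstarted, partial-fit structure concrete, here is a toy sketch using a tabular Q-function on a randomly generated transition set (the MDP size, step counts, and all constants are illustrative assumptions, not taken from the algorithm above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha, K = 4, 2, 0.9, 0.5, 10

# Hypothetical fixed dataset of transitions (s, a, r, s').
T = [(int(rng.integers(n_states)), int(rng.integers(n_actions)),
      float(rng.random()), int(rng.integers(n_states)))
     for _ in range(50)]

theta = np.zeros((n_states, n_actions))   # tabular Q: theta[s, a]
for n in range(20):                       # outer iterations
    # Targets are built once per outer iteration, with theta_n frozen.
    targets = [(s, a, r + gamma * theta[s2].max()) for s, a, r, s2 in T]
    # Warmstarted inner loop: continue from current theta, only K steps.
    for _ in range(K):
        grad = np.zeros_like(theta)
        for s, a, y in targets:
            grad[s, a] += theta[s, a] - y  # gradient of 1/2 squared error
        theta -= alpha * grad / len(T)

print(theta.shape)  # (4, 2)
```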

Flattening the Updates with Target Swapping

Rather than maintaining two sets of indices for the outer and inner loops, we can also "flatten" this algorithm into a single loop structure using modulo arithmetic. We can rewrite it as follows:

:label: nfqi-flattened-swap

**Input** Given MDP $(S, A, P, R, \gamma)$, dataset of transitions $\mathcal{T}$, neural network $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, target update frequency $K$

**Output** Parameters $\boldsymbol{\theta}$ for Q-function approximation

1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $\boldsymbol{\theta}_{target} \leftarrow \boldsymbol{\theta}_0$  // Initialize target parameters
3. $t \leftarrow 0$  // Single iteration counter
4. **while** training:
    1. $\mathcal{D}_t \leftarrow \emptyset$  // Regression dataset
    2. For each $(s,a,r,s') \in \mathcal{T}$:
        1. $y_{s,a} \leftarrow r + \gamma \max_{a' \in A} q(s',a'; \boldsymbol{\theta}_{target})$  // Use target parameters
        2. $\mathcal{D}_t \leftarrow \mathcal{D}_t \cup \{((s,a), y_{s,a})\}$
    3. $\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t; \mathcal{D}_t)$
    4. If $t \bmod K = 0$:  // Every K steps
        1. $\boldsymbol{\theta}_{target} \leftarrow \boldsymbol{\theta}_t$  // Update target parameters
    5. $t \leftarrow t + 1$
5. **return** $\boldsymbol{\theta}_t$

The flattened version with target parameters achieves exactly the same effect as our previous nested-loop structure with warmstarting and $K$ gradient steps. In the nested version, we would create a dataset using parameters $\boldsymbol{\theta}_n$, then perform $K$ gradient steps to obtain $\boldsymbol{\theta}_{n+1}$. In our flattened version, we maintain a separate $\boldsymbol{\theta}_{target}$ that gets updated every $K$ steps, ensuring that the dataset $\mathcal{D}_t$ is created using the same parameters for $K$ consecutive iterations - just as it would be in the nested version. The only difference is that we've restructured the algorithm to avoid explicitly nesting the loops, making it more suitable for the continuous online training which we are about to introduce. The periodic synchronization of $\boldsymbol{\theta}_{target}$ with the current parameters $\boldsymbol{\theta}_t$ effectively marks the boundary of what would have been the outer loop in our previous version.
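The equivalence can be checked numerically on a scalar stand-in for the Q-update, where "fitting the targets" is a gradient step toward $y = r + \gamma\,\theta_{target}$. Note that this sketch syncs the target at the top of each $K$-step window, so the two parameter traces match exactly; all constants are illustrative:

```python
r, gamma, alpha, K, N = 1.0, 0.9, 0.3, 5, 4

# Nested version: rebuild targets, then K warmstarted gradient steps.
theta = 0.0
nested_trace = []
for n in range(N):
    y = r + gamma * theta            # targets frozen at theta_n
    for _ in range(K):
        theta -= alpha * (theta - y) # gradient step on 1/2 (theta - y)^2
        nested_trace.append(theta)

# Flattened version: one step per iteration, target synced every K steps.
theta, theta_target = 0.0, 0.0
flat_trace = []
for t in range(N * K):
    if t % K == 0:                   # boundary of the old outer loop
        theta_target = theta
    y = r + gamma * theta_target
    theta -= alpha * (theta - y)
    flat_trace.append(theta)

print(max(abs(a - b) for a, b in zip(nested_trace, flat_trace)))  # 0.0
```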

Exponential Moving Average Targets

An alternative to this periodic swap of parameters is to use an exponential moving average (EMA) of the parameters:

:label: nfqi-flattened-ema

**Input** Given MDP $(S, A, P, R, \gamma)$, dataset of transitions $\mathcal{T}$, neural network $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, EMA rate $\tau$

**Output** Parameters $\boldsymbol{\theta}$ for Q-function approximation

1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $\boldsymbol{\theta}_{target} \leftarrow \boldsymbol{\theta}_0$  // Initialize target parameters
3. $n \leftarrow 0$  // Single iteration counter
4. **while** training:
    1. $\mathcal{D}_n \leftarrow \emptyset$  // Regression dataset
    2. For each $(s,a,r,s') \in \mathcal{T}$:
        1. $y_{s,a} \leftarrow r + \gamma \max_{a' \in A} q(s',a'; \boldsymbol{\theta}_{target})$  // Use target parameters
        2. $\mathcal{D}_n \leftarrow \mathcal{D}_n \cup \{((s,a), y_{s,a})\}$
    3. $\boldsymbol{\theta}_{n+1} \leftarrow \boldsymbol{\theta}_n - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_n; \mathcal{D}_n)$
    4. $\boldsymbol{\theta}_{target} \leftarrow \tau\boldsymbol{\theta}_{n+1} + (1-\tau)\boldsymbol{\theta}_{target}$  // Smooth update of target parameters
    5. $n \leftarrow n + 1$
5. **return** $\boldsymbol{\theta}_n$

Note that the original DQN used the periodic swap of parameters rather than EMA targets. EMA targets (also called "Polyak averaging") started becoming popular in deep RL with DDPG {cite}`lillicrap2015continuous`, which used a "soft" target update $\boldsymbol{\theta}_{target} \leftarrow \tau\boldsymbol{\theta} + (1-\tau)\boldsymbol{\theta}_{target}$ with a small $\tau$ (e.g., 0.001). This has since become a common choice in many algorithms like TD3 {cite}`fujimoto2018addressing` and SAC {cite}`haarnoja2018soft`.
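As a quick sketch of how the EMA target behaves (the values here are illustrative), note that with frozen current parameters the target approaches them geometrically at rate $\tau$: after $n$ updates it has covered a fraction $1 - (1-\tau)^n$ of the gap.

```python
import numpy as np

tau = 0.001                  # DDPG-style "soft" update rate
theta = np.ones(3)           # stand-in for (frozen) current parameters
theta_target = np.zeros(3)   # target starts elsewhere

for _ in range(1000):
    theta_target = tau * theta + (1 - tau) * theta_target

# After 1000 steps the target has closed 1 - 0.999^1000 ≈ 63% of the gap.
print(np.round(theta_target, 3))  # [0.632 0.632 0.632]
```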

Online Data Collection and Experience Replay

Rather than using offline data, we now consider a modification where we incrementally gather samples under our current policy. A common exploration strategy is $\varepsilon$-greedy: with probability $\varepsilon$ we select a random action, and with probability $1-\varepsilon$ we select the greedy action $\arg\max_a q(s,a;\boldsymbol{\theta}_n)$. This ensures we maintain some exploration even as our Q-function estimates improve. Typically $\varepsilon$ is annealed over time, starting with a high value (e.g., 1.0) to encourage early exploration and gradually decreasing to a small final value (e.g., 0.01) to maintain a minimal level of exploration while mostly exploiting our learned policy.
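A minimal sketch of ε-greedy selection with a linear annealing schedule; the function names and schedule constants below are our own hypothetical choices, not part of any standard API:

```python
import random

def epsilon_greedy(q_values, eps, rng=random):
    """With probability eps pick a random action index, else the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def annealed_eps(t, eps_start=1.0, eps_final=0.01, anneal_steps=10_000):
    """Linearly decay from eps_start to eps_final over anneal_steps, then hold."""
    frac = min(t / anneal_steps, 1.0)
    return eps_start + frac * (eps_final - eps_start)

# With eps = 0 the choice is purely greedy.
print(epsilon_greedy([0.1, 0.9, 0.3], eps=0.0))  # 1
```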

:label: online-nfqi-flattened

**Input** Given MDP $(S, A, P, R, \gamma)$, neural network $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, target update frequency $K$

**Output** Parameters $\boldsymbol{\theta}$ for Q-function approximation

1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $\boldsymbol{\theta}_{target} \leftarrow \boldsymbol{\theta}_0$  // Initialize target parameters
3. Initialize $\mathcal{T} \leftarrow \emptyset$  // Initialize transition dataset
4. $n \leftarrow 0$  // Single iteration counter
5. **while** training:
    1. Observe current state $s$
    2. Select action $a$ using policy derived from $q(s,\cdot;\boldsymbol{\theta}_n)$ (e.g., ε-greedy)
    3. Execute $a$, observe reward $r$ and next state $s'$
    4. $\mathcal{T} \leftarrow \mathcal{T} \cup \{(s,a,r,s')\}$  // Add transition to dataset
    5. $\mathcal{D}_n \leftarrow \emptyset$  // Regression dataset
    6. For each $(s,a,r,s') \in \mathcal{T}$:
        1. $y_{s,a} \leftarrow r + \gamma \max_{a' \in A} q(s',a'; \boldsymbol{\theta}_{target})$  // Use target parameters
        2. $\mathcal{D}_n \leftarrow \mathcal{D}_n \cup \{((s,a), y_{s,a})\}$
    7. $\boldsymbol{\theta}_{n+1} \leftarrow \boldsymbol{\theta}_n - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_n; \mathcal{D}_n)$
    8. If $n \bmod K = 0$:  // Every K steps
        1. $\boldsymbol{\theta}_{target} \leftarrow \boldsymbol{\theta}_n$  // Update target parameters
    9. $n \leftarrow n + 1$
6. **return** $\boldsymbol{\theta}_n$

This version faces two practical challenges. First, the transition dataset $\mathcal{T}$ grows without bound over time, creating memory issues. Second, computing gradients over the entire dataset becomes increasingly expensive. These are common challenges in online learning settings, and the standard solutions from supervised learning apply here:

  1. Use a fixed-size circular buffer (often called replay buffer, in reference to "experience replay" by {cite}lin1992self) to limit memory usage
  2. Compute gradients on mini-batches rather than the full dataset
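Both fixes take only a few lines of Python: a `deque` with a `maxlen` already implements the circular eviction policy, and uniform mini-batch sampling comes from the standard library (the class and method names here are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity circular buffer of transitions (s, a, r, s')."""

    def __init__(self, capacity):
        # deque with maxlen evicts the oldest item once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement from the current contents.
        return random.sample(list(self.storage), batch_size)

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.add((i, 0, 0.0, i + 1))
print(len(buf.storage))  # 3 — the two oldest transitions were evicted
```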

We can modify our algorithm to incorporate these ideas as follows:

:label: online-nfqi-replay

**Input** Given MDP $(S, A, P, R, \gamma)$, neural network $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, target update frequency $K$, replay buffer size $B$, mini-batch size $b$

**Output** Parameters $\boldsymbol{\theta}$ for Q-function approximation

1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $\boldsymbol{\theta}_{target} \leftarrow \boldsymbol{\theta}_0$  // Initialize target parameters
3. Initialize replay buffer $\mathcal{R}$ with capacity $B$
4. $n \leftarrow 0$  // Single iteration counter
5. **while** training:
    1. Observe current state $s$
    2. Select action $a$ using policy derived from $q(s,\cdot;\boldsymbol{\theta}_n)$ (e.g., ε-greedy)
    3. Execute $a$, observe reward $r$ and next state $s'$
    4. Store $(s,a,r,s')$ in $\mathcal{R}$, replacing oldest if full  // Circular buffer update
    5. $\mathcal{D}_n \leftarrow \emptyset$  // Regression dataset
    6. Sample mini-batch of $b$ transitions $(s_i,a_i,r_i,s_i')$ from $\mathcal{R}$
    7. For each sampled $(s_i,a_i,r_i,s_i')$:
        1. $y_i \leftarrow r_i + \gamma \max_{a' \in A} q(s_i',a'; \boldsymbol{\theta}_{target})$  // Use target parameters
        2. $\mathcal{D}_n \leftarrow \mathcal{D}_n \cup \{((s_i,a_i), y_i)\}$
    8. $\boldsymbol{\theta}_{n+1} \leftarrow \boldsymbol{\theta}_n - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_n; \mathcal{D}_n)$ // Replacing this plain gradient step with RMSProp yields DQN
    9. If $n \bmod K = 0$:  // Every K steps
        1. $\boldsymbol{\theta}_{target} \leftarrow \boldsymbol{\theta}_n$  // Update target parameters
    10. $n \leftarrow n + 1$
6. **return** $\boldsymbol{\theta}_n$

This formulation highlights the replay ratio (or data reuse ratio) in deep reinforcement learning. In our algorithm, for each new transition we collect, we sample a mini-batch of size $b$ from the replay buffer and perform one update. This means we're reusing past experiences at a ratio of $b:1$: for every new piece of data, we learn from $b$ stored experiences. This ratio can be tuned as a hyperparameter: higher ratios mean more computation per environment step but better data efficiency, since we extract more learning from each collected transition. Experience replay thus decouples the rate of data collection from the rate of learning updates. Some modern algorithms like SAC or TD3 explicitly tune this ratio, sometimes performing multiple gradient steps per environment step to achieve higher data efficiency.

Double-Q Network Variant

As we saw earlier, the max operator in the target computation can lead to overestimation of Q-values. This happens because we use the same network to both select and evaluate actions in the target computation: $y_i \leftarrow r_i + \gamma \max_{a' \in A} q(s_i',a'; \boldsymbol{\theta}_{target})$. The max operator means we're both choosing the action that looks best under our current estimates and then using those same estimates to evaluate how good that action is, potentially compounding any estimation error.
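The overestimation effect is easy to reproduce numerically: even when every action's true value is zero, the max over noisy but unbiased estimates is positive on average. A small illustrative simulation (the action count and noise scale are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials, noise = 10, 10_000, 1.0

# Unbiased but noisy Q estimates; every action's true value is 0,
# so max_a E[q_hat] = 0, yet E[max_a q_hat] > 0.
q_hat = rng.normal(0.0, noise, size=(n_trials, n_actions))
bias = q_hat.max(axis=1).mean()
print(round(bias, 2))  # ≈ 1.54 for the max of 10 standard normals
```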

Double DQN {cite:t}`van2016deep` addresses this by using the current network parameters to select actions but the target network parameters to evaluate them. This leads to a simple modification of the DQN algorithm:

:label: double-dqn

**Input** Given MDP $(S, A, P, R, \gamma)$, neural network $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, target update frequency $K$, replay buffer size $B$, mini-batch size $b$

**Output** Parameters $\boldsymbol{\theta}$ for Q-function approximation

1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $\boldsymbol{\theta}_{target} \leftarrow \boldsymbol{\theta}_0$  // Initialize target parameters
3. Initialize replay buffer $\mathcal{R}$ with capacity $B$
4. $n \leftarrow 0$  // Single iteration counter
5. **while** training:
    1. Observe current state $s$
    2. Select action $a$ using policy derived from $q(s,\cdot;\boldsymbol{\theta}_n)$ (e.g., ε-greedy)
    3. Execute $a$, observe reward $r$ and next state $s'$
    4. Store $(s,a,r,s')$ in $\mathcal{R}$, replacing oldest if full
    5. $\mathcal{D}_n \leftarrow \emptyset$  // Regression dataset
    6. Sample mini-batch of $b$ transitions $(s_i,a_i,r_i,s_i')$ from $\mathcal{R}$
    7. For each sampled $(s_i,a_i,r_i,s_i')$:
        1. $a^*_i \leftarrow \arg\max_{a' \in A} q(s_i',a'; \boldsymbol{\theta}_n)$  // Select action using current network
        2. $y_i \leftarrow r_i + \gamma q(s_i',a^*_i; \boldsymbol{\theta}_{target})$  // Evaluate using target network
        3. $\mathcal{D}_n \leftarrow \mathcal{D}_n \cup \{((s_i,a_i), y_i)\}$
    8. $\boldsymbol{\theta}_{n+1} \leftarrow \boldsymbol{\theta}_n - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_n; \mathcal{D}_n)$
    9. If $n \bmod K = 0$:  // Every K steps
        1. $\boldsymbol{\theta}_{target} \leftarrow \boldsymbol{\theta}_n$  // Update target parameters
    10. $n \leftarrow n + 1$
6. **return** $\boldsymbol{\theta}_n$

The main difference from the original DQN is in step 7, where we now separate action selection from action evaluation. Rather than directly taking the max over the target network's Q-values, we first select the action using our current network ($\boldsymbol{\theta}_n$) and then evaluate that specific action using the target network ($\boldsymbol{\theta}_{target}$). This simple change has been shown to lead to more stable learning and better final performance across a range of tasks.
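The two target rules differ only in which network picks the action. A minimal sketch on hypothetical Q-vectors for a single next state:

```python
import numpy as np

def dqn_target(r, gamma, q_target_next):
    # Vanilla DQN: target net both selects and evaluates.
    return r + gamma * q_target_next.max()

def double_dqn_target(r, gamma, q_online_next, q_target_next):
    a_star = int(q_online_next.argmax())       # select with the online net
    return r + gamma * q_target_next[a_star]   # evaluate with the target net

q_online_next = np.array([1.0, 2.0, 0.5])  # hypothetical online Q(s', .)
q_target_next = np.array([1.5, 0.8, 3.0])  # hypothetical target Q(s', .)

print(round(float(dqn_target(1.0, 0.5, q_target_next)), 3))  # 2.5
print(round(float(double_dqn_target(1.0, 0.5, q_online_next, q_target_next)), 3))  # 1.4
```

Vanilla DQN backs up the target net's own maximum (action 2 here), while Double DQN evaluates the online net's greedy choice (action 1) under the target net, yielding a smaller, less optimistic target when the two disagree.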

Deep Q Networks with Resets (2022)

In flattening neural fitted Q-iteration, an important structural element was lost: the choice of inner-loop initializer inherent in the original FQI algorithm. The traditional structure explicitly separated outer iterations (computing targets) from inner optimization (fitting to those targets), with each inner optimization starting fresh from parameters $\boldsymbol{\theta}_0$.

The flattened version with persistent warmstarting seemed like a natural optimization - why throw away learned parameters? However, recent work {cite}`Doro2023` has shown that persistent warmstarting can actually be detrimental to learning. Neural networks tend to lose their ability to learn and generalize over the course of training, suggesting that occasionally starting fresh from $\boldsymbol{\theta}_0$ can be beneficial. In the context of DQN, the idea takes the following algorithmic form:

:label: dqn-hard-resets

**Input** Given MDP $(S, A, P, R, \gamma)$, neural network $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, reset interval $K$, replay buffer size $B$, mini-batch size $b$

**Output** Parameters $\boldsymbol{\theta}$ for Q-function approximation

1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $\boldsymbol{\theta}_{target} \leftarrow \boldsymbol{\theta}_0$  // Initialize target parameters
3. Initialize replay buffer $\mathcal{R}$ with capacity $B$
4. $n \leftarrow 0$  // Single iteration counter
5. **while** training:
    1. Observe current state $s$
    2. Select action $a$ using policy derived from $q(s,\cdot;\boldsymbol{\theta}_n)$ (e.g., ε-greedy)
    3. Execute $a$, observe reward $r$ and next state $s'$
    4. Store $(s,a,r,s')$ in $\mathcal{R}$, replacing oldest if full
    5. $\mathcal{D}_n \leftarrow \emptyset$  // Regression dataset
    6. Sample mini-batch of $b$ transitions $(s_i,a_i,r_i,s_i')$ from $\mathcal{R}$
    7. For each sampled $(s_i,a_i,r_i,s_i')$:
        1. $y_i \leftarrow r_i + \gamma \max_{a' \in A} q(s_i',a'; \boldsymbol{\theta}_{target})$
        2. $\mathcal{D}_n \leftarrow \mathcal{D}_n \cup \{((s_i,a_i), y_i)\}$
    8. If $n \bmod K = 0$:  // Periodic reset
        1. $\boldsymbol{\theta}_{n+1} \leftarrow \boldsymbol{\theta}_0$  // Reset to initial parameters
    9. Else:
        1. $\boldsymbol{\theta}_{n+1} \leftarrow \boldsymbol{\theta}_n - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_n; \mathcal{D}_n)$
    10. If $n \bmod K = 0$:  // Every K steps
        1. $\boldsymbol{\theta}_{target} \leftarrow \boldsymbol{\theta}_n$  // Update target parameters
    11. $n \leftarrow n + 1$
6. **return** $\boldsymbol{\theta}_n$

This algorithm change allows us to push the limits of our update ratio - the number of gradient steps we perform per environment interaction. Without resets, increasing this ratio leads to diminishing returns as the network's ability to learn degrades. However, by periodically resetting the parameters while maintaining our dataset of transitions, we can perform many more updates per interaction, effectively making our algorithm more "offline" and thus more sample efficient.

The hard reset strategy, while effective, can be too aggressive in some settings as it completely discards learned parameters. An alternative approach is to use a softer form of reset, adapting the "Shrink and Perturb" technique originally introduced by {cite:t}`ash2020warm` in the context of continual learning. In their work, they found that neural networks that had been trained on one task could better adapt to new tasks if their parameters were partially reset - interpolated with a fresh initialization - rather than either kept intact or completely reset.

We can adapt this idea to our setting. Instead of completely resetting to $\boldsymbol{\theta}_0$, we can perform a soft reset by interpolating between our current parameters and a fresh random initialization:

:label: dqn-soft-resets

**Input** Given MDP $(S, A, P, R, \gamma)$, neural network $q(s,a; \boldsymbol{\theta})$, learning rate $\alpha$, reset interval $K$, replay buffer size $B$, mini-batch size $b$, interpolation coefficient $\beta$

**Output** Parameters $\boldsymbol{\theta}$ for Q-function approximation

1. Initialize $\boldsymbol{\theta}_0$ randomly
2. $\boldsymbol{\theta}_{target} \leftarrow \boldsymbol{\theta}_0$  // Initialize target parameters
3. Initialize replay buffer $\mathcal{R}$ with capacity $B$
4. $n \leftarrow 0$  // Single iteration counter
5. **while** training:
    1. Observe current state $s$
    2. Select action $a$ using policy derived from $q(s,\cdot;\boldsymbol{\theta}_n)$ (e.g., ε-greedy)
    3. Execute $a$, observe reward $r$ and next state $s'$
    4. Store $(s,a,r,s')$ in $\mathcal{R}$, replacing oldest if full
    5. $\mathcal{D}_n \leftarrow \emptyset$  // Regression dataset
    6. Sample mini-batch of $b$ transitions $(s_i,a_i,r_i,s_i')$ from $\mathcal{R}$
    7. For each sampled $(s_i,a_i,r_i,s_i')$:
        1. $y_i \leftarrow r_i + \gamma \max_{a' \in A} q(s_i',a'; \boldsymbol{\theta}_{target})$
        2. $\mathcal{D}_n \leftarrow \mathcal{D}_n \cup \{((s_i,a_i), y_i)\}$
    8. If $n \bmod K = 0$:  // Periodic soft reset
        1. Sample $\boldsymbol{\phi} \sim$ initializer  // Fresh random parameters
        2. $\boldsymbol{\theta}_{n+1} \leftarrow \beta\boldsymbol{\theta}_n + (1-\beta)\boldsymbol{\phi}$
    9. Else:
        1. $\boldsymbol{\theta}_{n+1} \leftarrow \boldsymbol{\theta}_n - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_n; \mathcal{D}_n)$
    10. If $n \bmod K = 0$:  // Every K steps
        1. $\boldsymbol{\theta}_{target} \leftarrow \boldsymbol{\theta}_n$  // Update target parameters
    11. $n \leftarrow n + 1$
6. **return** $\boldsymbol{\theta}_n$

The interpolation coefficient $\beta$ controls how much of the learned parameters we retain, with $\beta = 0$ recovering the hard reset case and $\beta = 1$ corresponding to no reset at all. This provides a more flexible approach to restoring learning capability while potentially preserving useful features that have been learned. Like hard resets, this softer variant still enables high update ratios by preventing the degradation of learning capability, but does so in a more gradual way.
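A soft reset of this kind is a one-line interpolation. The sketch below assumes a Gaussian initializer and illustrative scales (the function name and constants are our own):

```python
import numpy as np

def soft_reset(theta, beta, rng):
    """Interpolate learned parameters with a fresh random initialization.

    beta = 1 keeps the learned parameters (no reset);
    beta = 0 recovers a hard reset to the fresh init.
    """
    phi = rng.normal(0.0, 0.1, size=theta.shape)  # assumed init distribution
    return beta * theta + (1 - beta) * phi

rng = np.random.default_rng(0)
theta = np.full(4, 10.0)                      # stand-in for trained parameters
print(soft_reset(theta, beta=1.0, rng=rng))   # [10. 10. 10. 10.] — unchanged
```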