---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---
Dynamic programming methods suffer from the curse of dimensionality and can quickly become difficult to apply in practice. We may also be dealing with large or continuous state or action spaces. We have seen so far that we can address these problems using discretization or interpolation, which were already examples of approximate dynamic programming. In this chapter, we will see other forms of approximation meant to facilitate the optimization problem, whether by approximating the optimality equations, the value function, or the policy itself. Approximation theory is at the heart of learning methods, and fundamentally, this chapter is about applying learning ideas to solve complex decision-making problems.
While the standard Bellman optimality equations use the max operator to determine the best action, an alternative formulation known as the smooth or soft Bellman optimality equations replaces it with the log-sum-exp, a smooth approximation to the max. This approach originated with {cite}`rust1987optimal` and was later rediscovered in the context of maximum entropy inverse reinforcement learning {cite}`ziebart2008maximum`, which in turn led to soft Q-learning {cite}`haarnoja2017reinforcement` and soft actor-critic {cite}`haarnoja2018soft`, a state-of-the-art deep reinforcement learning algorithm.
In the infinite-horizon setting, the smooth Bellman optimality equations take the form:

$$ v(s) = \frac{1}{\beta} \log \sum_{a \in \mathcal{A}_s} \exp\left(\beta \left( r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v(j) \right)\right). $$
Adopting an operator-theoretic perspective, we can define a nonlinear operator $\mathrm{L}_\beta$ whose fixed point is the smooth value function:

$$ (\mathrm{L}_\beta v)(s) = \frac{1}{\beta} \log \sum_{a \in \mathcal{A}_s} \exp\left(\beta \left( r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v(j) \right)\right). $$

As $\beta \to \infty$, this operator converges to the usual Bellman optimality operator, and we recover the standard equations. For finite $\beta$, the associated optimal decision rule is the softmax:

$$ \pi(a|s) = \frac{\exp\left(\beta\, q(s,a)\right)}{\sum_{a' \in \mathcal{A}_s} \exp\left(\beta\, q(s,a')\right)}, \quad \text{where } q(s,a) = r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v(j). $$
Despite the confusing terminology, the above "softmax" policy is simply the smooth counterpart to the argmax operator in the original optimality equation: it acts as a soft-argmax.
This formulation is interesting for several reasons. First, smoothness is a desirable property from an optimization standpoint. Unlike the max operator, the logsumexp is differentiable everywhere, which makes the resulting operator amenable to gradient-based methods and smooth analysis, as the short check below illustrates.
Second, while presented from an intuitive standpoint where we replace the max by the log-sum-exp (a smooth maximum) and the argmax by the softmax (a smooth argmax), this formulation can also be obtained from various other perspectives, offering theoretical tools and solution methods. For example, {cite:t}`rust1987optimal` derived these equations by considering a setting in which the rewards are stochastic and perturbed by a Gumbel noise variable. When considering the corresponding augmented state space and integrating out the noise, we obtain the smooth equations. This interpretation is leveraged by Rust for modeling purposes.
The smooth value iteration algorithm replaces the max operator in standard value iteration with the logsumexp operator. Here's the algorithm structure:
```{prf:algorithm} Smooth Value Iteration
:label: smooth-value-iteration
**Input:** MDP $(S, A, r, p, \gamma)$, inverse temperature $\beta > 0$, tolerance $\epsilon > 0$
**Output:** Approximate optimal value function $v$ and stochastic policy $\pi$
1. Initialize $v(s) \leftarrow 0$ for all $s \in S$
2. **repeat**
3. $\quad \Delta \leftarrow 0$
4. $\quad$ **for** each state $s \in S$ **do**
5. $\quad\quad$ **for** each action $a \in A_s$ **do**
6. $\quad\quad\quad q(s,a) \leftarrow r(s,a) + \gamma \sum_{j \in S} p(j|s,a) v(j)$
7. $\quad\quad$ **end for**
8. $\quad\quad v_{\text{new}}(s) \leftarrow \frac{1}{\beta} \log \sum_{a \in A_s} \exp(\beta \cdot q(s,a))$
9. $\quad\quad \Delta \leftarrow \max(\Delta, |v_{\text{new}}(s) - v(s)|)$
10. $\quad\quad v(s) \leftarrow v_{\text{new}}(s)$
11. $\quad$ **end for**
12. **until** $\Delta < \epsilon$
13. Extract policy: **for** each state $s \in S$ **do**
14. $\quad$ Compute $q(s,a)$ for all $a \in A_s$ as in lines 5-7
15. $\quad \pi(a|s) \leftarrow \frac{\exp(\beta \cdot q(s,a))}{\sum_{a' \in A_s} \exp(\beta \cdot q(s,a'))}$ for all $a \in A_s$
16. **end for**
17. **return** $v, \pi$
```
Differences from standard value iteration:
- Line 8 uses $\frac{1}{\beta} \log \sum_a \exp(\beta \cdot q(s,a))$ instead of $\max_a q(s,a)$
- Line 15 extracts a stochastic policy using softmax instead of a deterministic argmax policy
- As $\beta \to \infty$, the algorithm converges to standard value iteration
- Lower $\beta$ values produce more stochastic policies with higher entropy
There is also a way to obtain this equation by starting from the energy-based formulation often used in supervised learning, in which we convert an unnormalized probability distribution into a distribution using the softmax transformation. This is essentially what {cite:t}`ziebart2008maximum` did in their paper. Furthermore, this perspective bridges with the literature on probabilistic graphical models, in which we can now cast the problem of finding an optimal smooth policy into one of maximum likelihood estimation (an inference problem). This is the idea of control as inference, which also admits the converse, inference as control, used nowadays for deriving fast samplers and amortized inference techniques based on reinforcement learning {cite}`levine2018reinforcement`.
Finally, it's worth noting that we can also derive this form by considering an entropy-regularized formulation in which the reward is augmented with a term for the entropy of the policy. This formulation admits a solution that coincides with the smooth Bellman equations {cite}`haarnoja2017reinforcement`.
The logsumexp operator provides one way to soften the hard maximum, but alternative approaches exist. When Q-value estimates have heterogeneous uncertainty (some actions estimated more precisely than others), we can weight actions by the probability that they are optimal under a Gaussian uncertainty model. {cite:t}`deramo2016estimating` proposed computing weights as:

$$ w_a = \mathbb{P}\left(a \in \operatorname{argmax}_{a'} Q(s,a')\right) = \int_{-\infty}^{\infty} \mathcal{N}\!\left(x;\, \hat{\mu}_a, \hat{\sigma}_a^2\right) \prod_{a' \neq a} \Phi\!\left(\frac{x - \hat{\mu}_{a'}}{\hat{\sigma}_{a'}}\right) dx, $$

where $\hat{\mu}_{a'}$ and $\hat{\sigma}_{a'}$ are the sample mean and standard deviation of the Q-value estimates and $\Phi$ is the standard normal CDF; the maximum is then estimated by the weighted average $\sum_a w_a \hat{\mu}_a$. This differs from logsumexp in that it adapts to state-action-specific uncertainty (actions with tighter confidence intervals receive more weight), whereas logsumexp applies uniform smoothing via the temperature $\beta$.
We can obtain the smooth Bellman equation by considering a setting in which we have Gumbel noise added to the reward function. This derivation provides both theoretical insight and connects to practical modeling scenarios where rewards have random perturbations.
At each time period, in state $s$, the decision maker observes a vector of action-specific shocks $\boldsymbol{\epsilon} = (\epsilon(a))_{a \in \mathcal{A}_s}$, drawn i.i.d. from a Gumbel distribution with location $\mu_\epsilon$ and scale $1/\beta$. These shocks are independent across time periods, states, and actions, and are independent of the MDP transition dynamics.

The Gumbel distribution with location parameter $\mu$ and scale $1/\beta$ has CDF:

$$ F(x) = \exp\left(-\exp\left(-\beta (x - \mu)\right)\right). $$

To generate a Gumbel-distributed random variable, we can use inverse transform sampling:

$$ X = \mu - \frac{1}{\beta} \log(-\log U), \quad U \sim \mathrm{Uniform}(0,1). $$
```{admonition} Zero-mean shocks
:class: tip
To ensure the shocks have zero mean, we set $\mu_\epsilon = -\gamma_E/\beta$ where $\gamma_E \approx 0.5772$ is the Euler-Mascheroni constant. This choice eliminates an additive constant that would otherwise appear in the smooth Bellman equation. For simplicity, we will adopt this convention throughout.
```
We now define an augmented MDP with:
- **Augmented state:** $\tilde{s} = (s, \boldsymbol{\epsilon})$ where $s \in \mathcal{S}$ and $\boldsymbol{\epsilon} \in \mathbb{R}^{|\mathcal{A}_s|}$
- **Augmented reward:** $\tilde{r}(\tilde{s}, a) = r(s,a) + \epsilon(a)$
- **Augmented transition:** $\tilde{p}(\tilde{s}' | \tilde{s}, a) = p(s' | s, a) \cdot p(\boldsymbol{\epsilon}')$
The transition factorizes because the next shock vector $\boldsymbol{\epsilon}'$ is drawn independently of the current state, the current shocks, and the action taken.
```{admonition} The augmented state space is uncountable
:class: warning
Even if the original state space $\mathcal{S}$ and action space $\mathcal{A}$ are finite, the augmented state space $\tilde{\mathcal{S}} = \mathcal{S} \times \mathbb{R}^{|\mathcal{A}|}$ is **uncountably infinite** because each shock vector $\boldsymbol{\epsilon}$ is a continuous random variable. Therefore:

- We cannot enumerate the augmented states
- Tabular dynamic programming methods do not apply directly
- The augmented value function $\tilde{v}(s, \boldsymbol{\epsilon})$ maps a continuous space to $\mathbb{R}$

**This motivates why we immediately marginalize over the shocks** to obtain a finite-dimensional representation.
```
The Bellman optimality equation for the augmented MDP is:

$$ \tilde{v}(s, \boldsymbol{\epsilon}) = \max_{a \in \mathcal{A}_s} \left\{ r(s,a) + \epsilon(a) + \gamma\, \mathbb{E}\left[\tilde{v}(s', \boldsymbol{\epsilon}')\right] \right\}. $$

Here the expectation is over the next augmented state $\tilde{s}' = (s', \boldsymbol{\epsilon}')$: the next state $s' \sim p(\cdot|s,a)$ and the next shock vector $\boldsymbol{\epsilon}'$.

This is a perfectly well-defined Bellman equation, and an optimal stationary policy exists:

$$ \tilde{\pi}^*(s, \boldsymbol{\epsilon}) \in \operatorname{argmax}_{a \in \mathcal{A}_s} \left\{ r(s,a) + \epsilon(a) + \gamma\, \mathbb{E}\left[\tilde{v}(s', \boldsymbol{\epsilon}')\right] \right\}. $$
However, this equation is computationally intractable because:
- The augmented state space is continuous and uncountably infinite
- Fresh shocks are drawn each period
- We would need to solve for $\tilde{v}$ over an uncountable domain
We never solve this equation directly. Instead, we use it as a mathematical device to derive the smooth Bellman equation.
The idea here is to consider the expected value before observing the current shocks. We define what some authors in econometrics call the inclusive value or ex-ante value:

$$ v(s) := \mathbb{E}_{\boldsymbol{\epsilon}}\left[\tilde{v}(s, \boldsymbol{\epsilon})\right]. $$
This is the value of being in state $s$ before the current period's shocks are realized.
```{admonition} Ex-ante vs. ex-post values
:class: note
It is crucial to distinguish:

- $\tilde{v}(s, \boldsymbol{\epsilon})$: the value **after** observing shocks (conditional on $\boldsymbol{\epsilon}$), defined on the augmented state space
- $v(s)$: the value **before** observing shocks (marginalizing over $\boldsymbol{\epsilon}$), defined on the original state space

The function $v(s)$ is what we actually compute and care about. The augmented value $\tilde{v}$ exists only as a proof device.
```
Now we take the expectation of the augmented Bellman equation with respect to the current shocks only; everything that does not depend on the current $\boldsymbol{\epsilon}$ passes through the expectation unchanged.
First, note that by the law of iterated expectations and independence of shocks across time:

$$ \mathbb{E}\left[\tilde{v}(s', \boldsymbol{\epsilon}')\right] = \mathbb{E}_{s'}\left[\mathbb{E}_{\boldsymbol{\epsilon}'}\left[\tilde{v}(s', \boldsymbol{\epsilon}')\right]\right] = \sum_{j \in \mathcal{S}} p(j|s,a)\, v(j). $$
This follows from our definition of $v$ and the fact that the next period's shocks are independent of the current ones.
Now define the deterministic part of the right-hand side:

$$ x_a(s) := r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v(j). $$
This is the expected return from taking action $a$ in state $s$, excluding the current shock $\epsilon(a)$.
Taking the expectation over the current shocks then gives:

$$ v(s) = \mathbb{E}_{\boldsymbol{\epsilon}}\left[\max_{a \in \mathcal{A}_s} \left\{ x_a(s) + \epsilon(a) \right\}\right]. $$
```{admonition} Expectation of a max, not max of expectations
:class: important
Notice carefully: we have $\mathbb{E}[\max(\cdot)]$, **not** $\max \mathbb{E}[\cdot]$. We are **not** swapping max and expectation.

The expression $\mathbb{E}_{\boldsymbol{\epsilon}}[\max_a \{x_a + \epsilon(a)\}]$ is the expected value of the maximum of Gumbel-perturbed utilities. The Gumbel random utility identity evaluates this quantity in closed form.
```
We now invoke a result from extreme value theory:
```{prf:theorem} Gumbel random utility identity
:label: gumbel-random-utility
Let $\epsilon_1, \ldots, \epsilon_m$ be i.i.d. $\mathrm{Gumbel}(\mu_\epsilon, 1/\beta)$ random variables. For any deterministic values $x_1, \ldots, x_m \in \mathbb{R}$:
$$ \max_{i=1,\ldots,m} \{x_i + \epsilon_i\} \overset{d}{=} \frac{1}{\beta} \log \sum_{i=1}^m \exp(\beta x_i) + \zeta $$
where $\zeta \sim \mathrm{Gumbel}(\mu_\epsilon, 1/\beta)$ (same distribution as the original shocks).
Taking expectations:
$$ \mathbb{E}\left[\max_{i=1,\ldots,m} \{x_i + \epsilon_i\}\right] = \frac{1}{\beta} \log \sum_{i=1}^m \exp(\beta x_i) + \mu_\epsilon + \frac{\gamma_E}{\beta} $$
where $\gamma_E \approx 0.5772$ is the Euler-Mascheroni constant.
**With mean-zero shocks** ($\mu_\epsilon = -\gamma_E/\beta$), the constant term vanishes:
$$ \mathbb{E}\left[\max_{i=1,\ldots,m} \{x_i + \epsilon_i\}\right] = \frac{1}{\beta} \log \sum_{i=1}^m \exp(\beta x_i) $$
```
Applying this identity to our problem (with mean-zero shocks):

$$ v(s) = \frac{1}{\beta} \log \sum_{a \in \mathcal{A}_s} \exp\left(\beta\, x_a(s)\right). $$

Substituting the definition of $x_a(s)$:

$$ v(s) = \frac{1}{\beta} \log \sum_{a \in \mathcal{A}_s} \exp\left(\beta \left( r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v(j) \right)\right). $$
We have arrived at the smooth Bellman equation.
To recap the logical flow:
- We constructed an augmented MDP with state $(s, \boldsymbol{\epsilon})$ where shocks perturb rewards
- We wrote the standard Bellman equation for this augmented MDP (hard max, but over an uncountable state space)
- We defined the ex-ante value $v(s) = \mathbb{E}_{\boldsymbol{\epsilon}}[\tilde{v}(s, \boldsymbol{\epsilon})]$ to eliminate the continuous shock component
- We separated deterministic and random terms: $\tilde{v}(s, \boldsymbol{\epsilon}) = \max_a \{x_a(s) + \epsilon(a)\}$
- We applied the Gumbel identity to evaluate $\mathbb{E}_{\boldsymbol{\epsilon}}[\max_a \{\cdots\}]$ in closed form as a log-sum-exp
The augmented MDP with shocks exists only as a mathematical device. We never approximate $\tilde{v}$ on its continuous domain; marginalizing over the shocks yields the finite-dimensional smooth Bellman equation, which is what we actually solve.
Now that we have derived the smooth value function, we can also obtain the corresponding optimal policy. The question is: what policy should we follow in the original MDP (without explicitly conditioning on shocks)?
In the augmented MDP, the optimal policy is deterministic but depends on the shock realization:

$$ \tilde{\pi}^*(s, \boldsymbol{\epsilon}) \in \operatorname{argmax}_{a \in \mathcal{A}_s} \left\{ x_a(s) + \epsilon(a) \right\}. $$
However, we want a policy for the original state space $\mathcal{S}$, one that does not condition on the unobserved shocks.
Define an indicator function:

$$ \mathbb{1}_a(s, \boldsymbol{\epsilon}) = \begin{cases} 1 & \text{if } a \in \operatorname{argmax}_{a'} \{ x_{a'}(s) + \epsilon(a') \} \\ 0 & \text{otherwise,} \end{cases} $$

where ties can be broken arbitrarily, since they occur with probability zero under the continuous Gumbel distribution.
The ex-ante probability that action $a$ is selected is then:

$$ \pi(a|s) = \mathbb{E}_{\boldsymbol{\epsilon}}\left[\mathbb{1}_a(s, \boldsymbol{\epsilon})\right] = \mathbb{P}\left(a \in \operatorname{argmax}_{a'} \{ x_{a'}(s) + \epsilon(a') \}\right). $$

This is the probability that action $a$ achieves the maximum among the Gumbel-perturbed values, and a second classical result evaluates it in closed form:
```{prf:lemma} Gumbel argmax probabilities
:label: gumbel-softmax
Let $\epsilon_1, \ldots, \epsilon_m$ be i.i.d. $\mathrm{Gumbel}(\mu_\epsilon, 1/\beta)$ random variables. For any deterministic values $x_1, \ldots, x_m \in \mathbb{R}$, the probability that index $i$ achieves the maximum is:
$$ \mathbb{P}\left(i \in \operatorname{argmax}_j \{x_j + \epsilon_j\}\right) = \frac{\exp(\beta x_i)}{\sum_{j=1}^m \exp(\beta x_j)} $$
This holds regardless of the location parameter $\mu_\epsilon$.
```
Applying this result to our problem:

$$ \pi(a|s) = \frac{\exp\left(\beta\, x_a(s)\right)}{\sum_{a' \in \mathcal{A}_s} \exp\left(\beta\, x_{a'}(s)\right)}. $$

This is the softmax policy or Gibbs/Boltzmann policy with inverse temperature $\beta$.
Properties:
- As $\beta \to \infty$: the policy becomes deterministic, concentrating on the action(s) with highest $x_a(s)$ (recovers the standard greedy policy)
- As $\beta \to 0$: the policy becomes uniform over all actions (maximum entropy)
- For finite $\beta > 0$: the policy is stochastic, with probability mass proportional to exponentiated Q-values
This completes the derivation: the smooth Bellman equation yields a value function $v$ through the logsumexp, and marginalizing the Gumbel shocks yields the corresponding softmax policy.
Regularized MDPs {cite}`geist2019` provide another perspective on how the smooth Bellman equations arise. This framework offers a more general approach in which we seek optimal policies under the infinite-horizon criterion while also accounting for a regularizer that shapes the kind of policies we obtain.
Let's set up some necessary notation. First, recall that the policy evaluation operator for a stationary policy with decision rule $\pi$ is defined as:

$$ \mathrm{L}_\pi v = \mathbf{r}_\pi + \gamma \mathbf{P}_\pi v, $$

where $\mathbf{r}_\pi$ is the expected reward vector under policy $\pi$, $\gamma$ is the discount factor, and $\mathbf{P}_\pi$ is the state transition probability matrix under $\pi$.
The policy evaluation operator can then be written in terms of the q-function as:

$$ (\mathrm{L}_\pi v)(s) = \langle \pi(\cdot | s), q(s, \cdot) \rangle. $$
The workhorse behind the theory of regularized MDPs is the Legendre-Fenchel transform, also known as the convex conjugate. For a strongly convex regularizer $\Omega$ defined on the simplex of action distributions, it is given by:

$$ \Omega^*(q(s, \cdot)) = \max_{\pi(\cdot|s) \in \Delta_{\mathcal{A}_s}} \; \langle \pi(\cdot|s), q(s, \cdot) \rangle - \Omega(\pi(\cdot|s)). $$

An important property of this transform is that it has a unique maximizing argument, given by the gradient of $\Omega^*$:

$$ \nabla \Omega^*(q(s, \cdot)) = \operatorname{argmax}_{\pi(\cdot|s) \in \Delta_{\mathcal{A}_s}} \; \langle \pi(\cdot|s), q(s, \cdot) \rangle - \Omega(\pi(\cdot|s)). $$
An important example of a regularizer is the negative entropy, which gives rise to the smooth Bellman equations as we are about to see.
With these concepts in place, we can now define the regularized Bellman operators:
- **Regularized Policy Evaluation Operator** $(\mathrm{L}_{\pi,\Omega})$:

  $$ \mathrm{L}_{\pi,\Omega} v = \langle q(s,\cdot), \pi(\cdot | s) \rangle - \Omega(\pi(\cdot | s)) $$

- **Regularized Bellman Optimality Operator** $(\mathrm{L}_\Omega)$:

  $$ \mathrm{L}_\Omega v = \max_\pi \mathrm{L}_{\pi,\Omega} v = \Omega^*(q(s, \cdot)) $$
It can be shown that these regularized operators remain contractions, which guarantees the existence of a unique solution to the optimality equations and the convergence of successive approximation.
The regularized value function of a stationary policy with decision rule $\pi$, denoted $v_{\pi,\Omega}$, is the fixed point of the regularized policy evaluation operator: $v_{\pi,\Omega} = \mathrm{L}_{\pi,\Omega} v_{\pi,\Omega}$.

Under the usual assumptions on the discount factor and the boundedness of the reward, the value of a policy can also be found in closed form by solving the linear system:

$$ \mathbf{v}_{\pi,\Omega} = \left( \mathbf{I} - \gamma \mathbf{P}_\pi \right)^{-1} \left( \mathbf{r}_\pi - \boldsymbol{\Omega}_\pi \right), $$

where $[\boldsymbol{\Omega}_\pi](s) = \Omega(\pi(\cdot|s))$ is the vector of regularization terms at each state.

The associated state-action value function is:

$$ q_{\pi,\Omega}(s,a) = r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v_{\pi,\Omega}(j). $$
The regularized optimal value function $v^*_\Omega$ is then the unique fixed point of $\mathrm{L}_\Omega$:

$$ v^*_\Omega = \mathrm{L}_\Omega v^*_\Omega. $$

The associated state-action value function is:

$$ q^*_\Omega(s,a) = r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v^*_\Omega(j). $$
An important result in the theory of regularized MDPs is that there exists a unique optimal regularized policy. Specifically, if $\pi^*_\Omega$ is a conserving decision rule (i.e., $\pi^*_\Omega = \operatorname{argmax}_\pi \mathrm{L}_{\pi,\Omega} v^*_\Omega$), then the randomized stationary policy $(\pi^*_\Omega)^\infty$ is the unique optimal regularized policy.

In practice, once we have found $v^*_\Omega$ (and hence $q^*_\Omega$), the optimal decision rule is obtained directly from the gradient of the conjugate: $\pi^*_\Omega = \nabla \Omega^*(q^*_\Omega(s, \cdot))$.
Under this framework, we can recover the smooth Bellman equations by choosing $\Omega$ to be the negative entropy:
- Using the negative entropy regularizer:

  $$ \Omega(d(\cdot|s)) = \sum_{a \in \mathcal{A}_s} d(a|s) \ln d(a|s) $$

- The convex conjugate:

  $$ \Omega^*(q(s, \cdot)) = \ln \sum_{a \in \mathcal{A}_s} \exp q(s,a) $$

- Now, let's write out the regularized Bellman optimality equation:

  $$ v^*_\Omega(s) = \Omega^*(q^*_\Omega(s, \cdot)) $$

- Substituting the expressions for $\Omega^*$ and $q^*_\Omega$:

  $$ v^*_\Omega(s) = \ln \sum_{a \in \mathcal{A}_s} \exp \left( r(s, a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v^*_\Omega(j) \right) $$
This matches the form of the smooth Bellman equation we derived earlier, with the log-sum-exp operation replacing the max operation of the standard Bellman equation.
Furthermore, the optimal policy is given by the gradient of $\Omega^*$ evaluated at $q^*_\Omega$:

$$ \pi^*_\Omega(a|s) = \frac{\exp q^*_\Omega(s,a)}{\sum_{a' \in \mathcal{A}_s} \exp q^*_\Omega(s,a')}. $$
This is the familiar softmax policy we encountered in the smooth MDP setting.
Now that we've seen how the regularized MDP framework leads to smooth Bellman equations, we present smooth policy iteration. Unlike value iteration which directly iterates the Bellman operator, policy iteration alternates between policy evaluation and policy improvement steps.
```{prf:algorithm} Smooth Policy Evaluation
:label: smooth-policy-evaluation
**Input:** MDP $(S, A, r, p, \gamma)$, policy $\pi$, inverse temperature $\beta > 0$, tolerance $\epsilon > 0$
**Output:** Value function $v^\pi$ for policy $\pi$
1. Initialize $v(s) \leftarrow 0$ for all $s \in S$
2. Set $\alpha \leftarrow 1/\beta$
3. **repeat**
4. $\quad \Delta \leftarrow 0$
5. $\quad$ **for** each state $s \in S$ **do**
6. $\quad\quad v_{\text{old}} \leftarrow v(s)$
7. $\quad\quad$ **for** each action $a \in A_s$ **do**
8. $\quad\quad\quad q(s,a) \leftarrow r(s,a) + \gamma \sum_{j \in S} p(j|s,a) v(j)$
9. $\quad\quad$ **end for**
10. $\quad\quad$ Compute expected Q-value: $\bar{q} \leftarrow \sum_{a \in A_s} \pi(a|s) \cdot q(s,a)$
11. $\quad\quad$ Compute policy entropy: $H \leftarrow -\sum_{a \in A_s} \pi(a|s) \log \pi(a|s)$
12. $\quad\quad v(s) \leftarrow \bar{q} + \alpha H$
13. $\quad\quad \Delta \leftarrow \max(\Delta, |v(s) - v_{\text{old}}|)$
14. $\quad$ **end for**
15. **until** $\Delta < \epsilon$
16. **return** $v$
```
```{prf:algorithm} Smooth Policy Iteration
:label: policy-iteration-smooth
**Input:** MDP $(S, A, r, p, \gamma)$, inverse temperature $\beta > 0$, tolerance $\epsilon > 0$
**Output:** Approximate optimal value function $v$ and stochastic policy $\pi$
1. Initialize $\pi(a|s) \leftarrow 1/|A_s|$ for all $s \in S, a \in A_s$ (uniform policy)
2. **repeat**
3. $\quad$ **Policy Evaluation:**
4. $\quad\quad$ $v \leftarrow$ SmoothPolicyEvaluation($S, A, r, p, \gamma, \pi, \beta, \epsilon$)
5. $\quad$ **Policy Improvement:**
6. $\quad$ policy_stable $\leftarrow$ true
7. $\quad$ **for** each state $s \in S$ **do**
8. $\quad\quad \pi_{\text{old}}(\cdot|s) \leftarrow \pi(\cdot|s)$
9. $\quad\quad$ **for** each action $a \in A_s$ **do**
10. $\quad\quad\quad q(s,a) \leftarrow r(s,a) + \gamma \sum_{j \in S} p(j|s,a) v(j)$
11. $\quad\quad$ **end for**
12. $\quad\quad$ **for** each action $a \in A_s$ **do**
13. $\quad\quad\quad \pi(a|s) \leftarrow \frac{\exp(\beta \cdot q(s,a))}{\sum_{a' \in A_s} \exp(\beta \cdot q(s,a'))}$
14. $\quad\quad$ **end for**
15. $\quad\quad$ **if** $\|\pi(\cdot|s) - \pi_{\text{old}}(\cdot|s)\| > \epsilon$ **then**
16. $\quad\quad\quad$ policy_stable $\leftarrow$ false
17. $\quad\quad$ **end if**
18. $\quad$ **end for**
19. **until** policy_stable
20. **return** $v, \pi$
```
Key properties of smooth policy iteration:
- **Entropy-regularized evaluation:** The policy evaluation step (line 12 of Algorithm {prf:ref}`smooth-policy-evaluation`) accounts for the entropy bonus $\alpha H(\pi(\cdot|s))$ where $\alpha = 1/\beta$
- **Stochastic policy improvement:** The policy improvement step (lines 12-14 of Algorithm {prf:ref}`policy-iteration-smooth`) uses softmax instead of deterministic argmax, producing a stochastic policy
- **Temperature parameter:**
  - Higher $\beta$ → policies closer to deterministic (lower entropy)
  - Lower $\beta$ → more stochastic policies (higher entropy)
  - As $\beta \to \infty$ → recovers standard policy iteration
- **Convergence:** Like standard policy iteration, this algorithm converges to the unique optimal regularized value function and policy
We have now seen two distinct ways to arrive at smooth Bellman equations. Earlier in this chapter, we introduced the logsumexp operator as a smooth approximation to the max operator, motivated by analytical tractability and the desire for differentiability. Just now, we derived the same equations through the lens of regularized MDPs, where we explicitly penalize the entropy of policies. These two perspectives are mathematically equivalent: solving the smooth Bellman equation with inverse temperature $\beta$ is the same as solving the entropy-regularized MDP with entropy weight $\alpha = 1/\beta$.
To see this equivalence clearly, consider the standard MDP problem with rewards $r(s,a)$, augmented with a discounted entropy bonus:
$$
\max_\pi \; \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t r(s_t, a_t) \right] + \alpha\, \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t H(\pi(\cdot|s_t)) \right],
$$
where $H(\pi(\cdot|s)) = -\sum_{a} \pi(a|s) \ln \pi(a|s)$ denotes the entropy of the policy at state $s$, and $\alpha > 0$ controls the strength of the regularization.
We can rewrite this objective by absorbing the entropy term into a modified reward function. Define the entropy-augmented reward:

$$ \tilde{r}_\pi(s, a) = r(s,a) + \alpha H(\pi(\cdot|s)). $$
However, this formulation makes the reward depend on the entire policy at each state, which is awkward. We can reformulate it more cleanly by expanding the entropy term. Recall that the entropy is:

$$ H(\pi(\cdot|s)) = -\sum_{a \in \mathcal{A}_s} \pi(a|s) \ln \pi(a|s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ -\ln \pi(a|s) \right]. $$
When we take the expectation over actions drawn from $\pi(\cdot|s)$:

$$ \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ \alpha H(\pi(\cdot|s)) \right] = \alpha H(\pi(\cdot|s)), $$

since the entropy doesn't depend on which action is actually sampled. But we can also write this as:

$$ \alpha H(\pi(\cdot|s)) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ -\alpha \ln \pi(a|s) \right]. $$

This shows that adding the entropy bonus $\alpha H(\pi(\cdot|s))$ to the reward at each state is equivalent, in expectation, to adding the per-action penalty $-\alpha \ln \pi(a|s)$.
The entropy bonus at each state, when averaged over the policy, becomes a per-action penalty proportional to the negative log probability of the action taken. This reformulation is more useful because the modified reward now depends only on the state, the action taken, and the probability assigned to that specific action by the policy, not on the entire distribution over actions.
This expression shows that entropy regularization is equivalent to adding a state-action dependent penalty term $-\alpha \ln \pi(a|s)$ to the reward.
Now, when we write down the Bellman equation for this entropy-regularized problem, at each state $s$ we must solve:

$$ v(s) = \max_{d(\cdot|s)} \; \sum_{a \in \mathcal{A}_s} d(a|s) \left[ r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v(j) - \alpha \ln d(a|s) \right] \quad \text{s.t.} \quad \sum_a d(a|s) = 1, \; d(a|s) \geq 0. $$

Here $d(\cdot|s)$ denotes a candidate decision rule: a probability distribution over the actions available at state $s$.
This is a convex optimization problem with a linear constraint. We form the Lagrangian:
$$
\mathcal{L}(d, \lambda) = \sum_a d(a|s) \left[ r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j) - \alpha \ln d(a|s) \right] - \lambda \left(\sum_a d(a|s) - 1\right),
$$
where $\lambda$ is the Lagrange multiplier enforcing normalization. Setting the derivative of the Lagrangian with respect to $d(a|s)$ to zero:

$$ r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v(j) - \alpha \ln d(a|s) - \alpha - \lambda = 0. $$
Solving for $d(a|s)$:

$$ d(a|s) = \exp\left( \frac{1}{\alpha} \left( r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v(j) \right) \right) \exp\left( -1 - \frac{\lambda}{\alpha} \right). $$
Using the normalization constraint $\sum_a d(a|s) = 1$, the factor $\exp(-1 - \lambda/\alpha)$ is determined. Therefore:

$$ d(a|s) = \frac{\exp\left( \frac{1}{\alpha} q(s,a) \right)}{\sum_{a' \in \mathcal{A}_s} \exp\left( \frac{1}{\alpha} q(s,a') \right)}, \quad \text{where } q(s,a) = r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v(j). $$
Substituting this back into the Bellman equation and simplifying:

$$ v(s) = \alpha \ln \sum_{a \in \mathcal{A}_s} \exp\left( \frac{1}{\alpha} q(s,a) \right). $$
Setting $\beta = 1/\alpha$:

$$ v(s) = \frac{1}{\beta} \ln \sum_{a \in \mathcal{A}_s} \exp\left( \beta\, q(s,a) \right). $$
We recover the smooth Bellman equation we derived earlier using the logsumexp operator. The inverse temperature parameter $\beta$ is precisely the reciprocal of the entropy weight $\alpha$.
The optimal policy is:
$$ \pi^*(a|s) = \frac{\exp\left(\beta\, q^*(s,a)\right)}{\sum_{a'} \exp\left(\beta\, q^*(s,a')\right)} = \operatorname{softmax}_\beta(q^*(s,\cdot))(a), $$

which is exactly the softmax policy parametrized by the inverse temperature.
The derivation establishes the complete equivalence: the value function $v^*$ that solves the smooth Bellman equation is identical to the optimal value function $v^*_\Omega$ of the entropy-regularized MDP (with $\Omega$ the negative entropy scaled by $\alpha = 1/\beta$), and the softmax policy is optimal in both formulations.
This equivalence has important implications. When we use smooth Bellman equations with a logsumexp operator, we are implicitly solving an entropy-regularized MDP. Conversely, when we explicitly add entropy regularization to an MDP objective, we arrive at smooth Bellman equations as the natural description of optimality. This dual perspective will prove valuable in understanding various algorithms and theoretical results. For instance, in soft actor-critic methods and other maximum entropy reinforcement learning algorithms, the connection between smooth operators and entropy regularization provides both computational benefits (differentiability) and conceptual clarity (why we want stochastic policies).
While the smooth Bellman equations (using logsumexp) and entropy-regularized formulations are mathematically equivalent, it is instructive to present the algorithms explicitly in the entropy-regularized form, where the entropy bonus appears directly in the update equations.
```{prf:algorithm} Entropy-Regularized Value Iteration
:label: entropy-regularized-value-iteration
**Input:** MDP $(S, A, r, p, \gamma)$, entropy weight $\alpha > 0$, tolerance $\epsilon > 0$
**Output:** Approximate optimal value function $v$ and stochastic policy $\pi$
1. Initialize $\pi(a|s) \leftarrow 1/|A_s|$ for all $s \in S, a \in A_s$ (uniform policy)
2. Initialize $v(s) \leftarrow 0$ for all $s \in S$
3. **repeat**
4. $\quad \Delta \leftarrow 0$
5. $\quad$ **for** each state $s \in S$ **do**
6. $\quad\quad$ **Policy Improvement:** Update policy for current value estimate
7. $\quad\quad$ **for** each action $a \in A_s$ **do**
8. $\quad\quad\quad q(s,a) \leftarrow r(s,a) + \gamma \sum_{j \in S} p(j|s,a) v(j)$
9. $\quad\quad$ **end for**
10. $\quad\quad$ **for** each action $a \in A_s$ **do**
11. $\quad\quad\quad \pi_{\text{new}}(a|s) \leftarrow \frac{\exp(q(s,a)/\alpha)}{\sum_{a' \in A_s} \exp(q(s,a')/\alpha)}$
12. $\quad\quad$ **end for**
13. $\quad\quad$ **Value Update:** Compute regularized value
14. $\quad\quad v_{\text{new}}(s) \leftarrow \sum_{a \in A_s} \pi_{\text{new}}(a|s) \cdot q(s,a) + \alpha H(\pi_{\text{new}}(\cdot|s))$
15. $\quad\quad$ where $H(\pi_{\text{new}}(\cdot|s)) = -\sum_{a \in A_s} \pi_{\text{new}}(a|s) \log \pi_{\text{new}}(a|s)$
16. $\quad\quad \Delta \leftarrow \max(\Delta, |v_{\text{new}}(s) - v(s)|)$
17. $\quad\quad v(s) \leftarrow v_{\text{new}}(s)$
18. $\quad\quad \pi(\cdot|s) \leftarrow \pi_{\text{new}}(\cdot|s)$
19. $\quad$ **end for**
20. **until** $\Delta < \epsilon$
21. **return** $v, \pi$
```
Features:
- Line 11 updates the policy using the softmax of Q-values, with temperature $\alpha$
- Line 14 explicitly computes the entropy-regularized value: expected Q-value plus entropy bonus
- The algorithm maintains and updates a stochastic policy throughout
- As $\alpha \to 0$ (or equivalently $\beta \to \infty$), this recovers standard value iteration
```{prf:algorithm} Entropy-Regularized Policy Iteration
:label: policy-iteration-entropy-regularized
**Input:** MDP $(S, A, r, p, \gamma)$, entropy weight $\alpha > 0$, tolerance $\epsilon > 0$
**Output:** Approximate optimal value function $v$ and stochastic policy $\pi$
1. Initialize $\pi(a|s) \leftarrow 1/|A_s|$ for all $s \in S, a \in A_s$ (uniform policy)
2. **repeat**
3. $\quad$ **Policy Evaluation:** Solve for $v^\pi$ such that for all $s \in S$:
4. $\quad\quad$ **Option 1 (Iterative):**
5. $\quad\quad$ Initialize $v(s) \leftarrow 0$ for all $s \in S$
6. $\quad\quad$ **repeat**
7. $\quad\quad\quad$ **for** each state $s \in S$ **do**
8. $\quad\quad\quad\quad$ Compute $q^\pi(s,a) \leftarrow r(s,a) + \gamma \sum_{j \in S} p(j|s,a) v(j)$ for all $a \in A_s$
9. $\quad\quad\quad\quad v_{\text{new}}(s) \leftarrow \sum_{a \in A_s} \pi(a|s) \cdot q^\pi(s,a) + \alpha H(\pi(\cdot|s))$
10. $\quad\quad\quad$ **end for**
11. $\quad\quad\quad$ **if** $\max_s |v_{\text{new}}(s) - v(s)| < \epsilon$ **then break**
12. $\quad\quad\quad v \leftarrow v_{\text{new}}$
13. $\quad\quad$ **until** convergence
14. $\quad\quad$ **Option 2 (Direct):** Solve linear system $(\mathbf{I} - \gamma \mathbf{P}_\pi) \mathbf{v} = \mathbf{r}_\pi + \alpha \mathbf{H}_\pi$
15. $\quad\quad$ where $[\mathbf{r}_\pi](s) = \sum_a \pi(a|s) r(s,a)$ and $[\mathbf{H}_\pi](s) = H(\pi(\cdot|s))$
16. $\quad$ **Policy Improvement:**
17. $\quad$ policy_changed $\leftarrow$ false
18. $\quad$ **for** each state $s \in S$ **do**
19. $\quad\quad \pi_{\text{old}}(\cdot|s) \leftarrow \pi(\cdot|s)$
20. $\quad\quad$ **for** each action $a \in A_s$ **do**
21. $\quad\quad\quad q(s,a) \leftarrow r(s,a) + \gamma \sum_{j \in S} p(j|s,a) v(j)$
22. $\quad\quad$ **end for**
23. $\quad\quad$ **for** each action $a \in A_s$ **do**
24. $\quad\quad\quad \pi(a|s) \leftarrow \frac{\exp(q(s,a)/\alpha)}{\sum_{a' \in A_s} \exp(q(s,a')/\alpha)}$
25. $\quad\quad$ **end for**
26. $\quad\quad$ **if** $\|\pi(\cdot|s) - \pi_{\text{old}}(\cdot|s)\| > \epsilon$ **then**
27. $\quad\quad\quad$ policy_changed $\leftarrow$ true
28. $\quad\quad$ **end if**
29. $\quad$ **end for**
30. **until** policy_changed $=$ false
31. **return** $v, \pi$
```
Features:
- **Policy Evaluation** (lines 3-15): Computes the value of the current policy including the entropy bonus
  - Option 1: Iterative method (successive approximation)
  - Option 2: Direct solution via a linear system
- **Policy Improvement** (lines 16-29): Updates the policy to the softmax over Q-values
- Line 14 shows the vector form: the linear system includes the entropy vector $\mathbf{H}_\pi$
- The algorithm alternates between evaluating the current stochastic policy and improving it
- Converges to the unique optimal entropy-regularized policy