Reinforcement Learning (RL) struggles to scale to large problems such as Optimal Power Flow (OPF), where high-dimensional continuous action spaces make learning difficult. To address this, we propose a model-based RL approach built on the MuZero algorithm, which combines a learned model with Monte Carlo Tree Search (MCTS) to enable long-term planning.
To train the MuZero agent, run the `agents/MuZero/muzero_agent.py` script.
During self-play, the agent simulates complete episodes using MCTS. At each episode step, it sets the root of the search tree to the latest observation using the Initial Inference method.
This method first converts the raw observation into a hidden state using the representation network. It then uses this hidden representation to predict the value and policy of the state using the value network and policy network respectively.
Using the predicted policy, it generates the children of the node, where each edge represents an action index and each child node's prior stores that action's probability.
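The root-expansion step above can be sketched as follows. This is a minimal illustration, not the repo's implementation: the `Node` fields and the network call signatures (`representation_net`, `prediction_net`) are assumed names.

```python
import numpy as np

class Node:
    """Minimal MCTS node; field names are illustrative, not the repo's."""
    def __init__(self, prior):
        self.prior = prior          # P(a) predicted by the policy network
        self.hidden_state = None    # set when the node is expanded
        self.reward = 0.0
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}          # action index -> child Node

def expand_root(root, observation, representation_net, prediction_net):
    """Initial inference: encode the observation, then expand the root."""
    # Representation network: raw observation -> hidden state.
    root.hidden_state = representation_net(observation)
    # Prediction: value and policy logits for the hidden state.
    value, policy_logits = prediction_net(root.hidden_state)
    # Softmax over logits gives the prior stored on each action edge.
    priors = np.exp(policy_logits - policy_logits.max())
    priors /= priors.sum()
    for action, p in enumerate(priors):
        root.children[action] = Node(prior=p)
    return value
```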
Before running MCTS, we add exploration noise sampled from a Dirichlet distribution to these priors:

$$P(a) = (1 - \eta)\,\pi_a + \eta\,\epsilon_a$$

where:

- $\eta = 0.25$ = root exploration fraction
- $\epsilon = (\epsilon_1, \dots, \epsilon_k) \sim \text{Dirichlet}(\alpha, \dots, \alpha)$
- $\alpha = 0.3$ = Dirichlet concentration
- $k = 487$ = number of available actions
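The noise-mixing step can be sketched in a few lines; the function name and defaults mirror the constants above but are otherwise illustrative.

```python
import numpy as np

def add_exploration_noise(priors, alpha=0.3, frac=0.25, rng=None):
    """Mix Dirichlet noise into the root priors (MuZero-style).

    priors: array of child priors at the root (length k = number of actions).
    alpha:  Dirichlet concentration (0.3 here).
    frac:   root exploration fraction eta (0.25 here).
    """
    rng = rng or np.random.default_rng()
    noise = rng.dirichlet([alpha] * len(priors))
    # Convex combination keeps the result a valid probability distribution.
    return (1 - frac) * np.asarray(priors) + frac * noise
```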
MCTS is then performed on the root node (the last observation).
In each simulation, the agent traverses the tree by picking the action with the highest UCB score until reaching a leaf node:

$$\text{UCB}(c) = \bar{Q}(c) + \pi_c \cdot \frac{\sqrt{N_p}}{1 + N_c} \cdot \left( c_{\text{init}} + \log \frac{N_p + c_{\text{base}} + 1}{c_{\text{base}}} \right)$$

The value component is normalized using min-max statistics maintained over the tree:

$$\bar{Q}(c) = \frac{(r_c + \gamma V_c) - Q_{\min}}{Q_{\max} - Q_{\min}}$$

where:

- $N_p$ = parent node's visit count
- $N_c$ = child node's visit count
- $\pi_c$ = prior probability of child $c$
- $c_{\text{base}} = 19652$, $c_{\text{init}} = 1.25$
- $r_c$ = reward of transitioning to child $c$
- $V_c = \frac{\text{valuesum}_c}{N_c}$ = mean value of child $c$
- $\gamma = 0.9$ = discount factor
- $Q_{\min}, Q_{\max}$ = minimum and maximum $r_c + \gamma V_c$ observed in the tree so far
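The score and its min-max normalization can be sketched as below; `MinMaxStats` and `ucb_score` are illustrative names, not the repo's.

```python
import math

def ucb_score(parent_visits, child_visits, child_prior, q_normalized,
              c_base=19652, c_init=1.25):
    """UCB as defined above: normalized value plus a prior-weighted
    exploration term that shrinks as the child is visited more."""
    pb_c = c_init + math.log((parent_visits + c_base + 1) / c_base)
    exploration = child_prior * pb_c * math.sqrt(parent_visits) / (1 + child_visits)
    return q_normalized + exploration

class MinMaxStats:
    """Track the min/max Q seen in the tree to normalize values into [0, 1]."""
    def __init__(self):
        self.minimum, self.maximum = float("inf"), float("-inf")

    def update(self, value):
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def normalize(self, value):
        if self.maximum > self.minimum:
            return (value - self.minimum) / (self.maximum - self.minimum)
        return value
```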
Upon reaching a leaf node, MuZero applies recurrent inference:
- The dynamics network takes in the parent hidden state + one-hot encoded action, outputs the next hidden state.
- The reward network predicts the reward for the transition.
- The value and policy networks predict the value and policy of the new state.
The new node's evaluation is then backpropagated up the tree, updating visit counts and value sums along the search path.
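The backpropagation step can be sketched as follows, assuming each node carries the `reward`, `visit_count`, and `value_sum` fields used above (illustrative names).

```python
def backpropagate(search_path, leaf_value, discount=0.9):
    """Propagate the leaf evaluation up the search path (leaf -> root).

    search_path: list of nodes from root to leaf, each with
    reward / visit_count / value_sum attributes.
    """
    value = leaf_value
    for node in reversed(search_path):
        node.value_sum += value
        node.visit_count += 1
        # From the parent's perspective: the reward for entering this node
        # plus the discounted value of everything below it.
        value = node.reward + discount * value
```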
After all simulations are complete, an action is selected at the root by sampling from the visit-count distribution:

$$\pi(a) = \frac{N(a)^{1/T}}{\sum_b N(b)^{1/T}}$$

where $T$ is a softmax temperature: $T \to 0$ approaches greedy selection of the most-visited action, while $T = 1$ samples in proportion to visit counts.
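After the search, MuZero-style agents pick the root action by sampling from a temperature-scaled visit-count distribution; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def select_action(visit_counts, temperature=1.0, rng=None):
    """Sample an action from the root visit-count distribution.

    visit_counts: N(a) for each action at the root after all simulations.
    temperature:  0 -> greedy argmax; 1 -> proportional to visit counts.
    """
    rng = rng or np.random.default_rng()
    counts = np.asarray(visit_counts, dtype=float)
    if temperature == 0:
        return int(counts.argmax())
    probs = counts ** (1.0 / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(counts), p=probs))
```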
Training starts by sampling episodes from the replay buffer.
For each sampled observation, we unroll the model for a fixed number of steps and compute a target value, reward, and policy at each unroll step.
The target value is the n-step temporal-difference return:

$$z_t = \sum_{i=0}^{tdsteps - 1} \gamma^{i}\, r_{t+i} + \gamma^{tdsteps}\, V_{t + tdsteps}$$

where:

- $r_{t+i}$ = reward at step $t+i$
- $\gamma = 0.9$ = discount factor
- $tdsteps = 7$ = temporal difference horizon
- $V_{t + tdsteps}$ = bootstrap value at step $t + tdsteps$
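The n-step target can be computed as below; the function name is illustrative, and the early-termination handling (dropping the bootstrap when the episode ends inside the horizon) is a common convention, assumed here.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.9, td_steps=7):
    """n-step TD value target as defined above.

    rewards:         [r_t, r_{t+1}, ...] from the sampled episode,
                     possibly shorter than td_steps near the episode end.
    bootstrap_value: V_{t+td_steps}, used only if the horizon fits
                     inside the episode.
    """
    # Discounted sum of up to td_steps real rewards.
    target = sum(gamma ** i * r for i, r in enumerate(rewards[:td_steps]))
    if len(rewards) >= td_steps:
        target += gamma ** td_steps * bootstrap_value
    return target
```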
The target policy at step $t$ is the normalized visit-count distribution produced by MCTS at that step:

$$\pi_t(a) = \frac{N_t(a)}{\sum_b N_t(b)}$$
The loss function is a weighted sum of value, reward, and policy losses with L2 regularization:

$$\mathcal{L} = \sum_{k=0}^{K} \left[ \ell^{v}\!\left(z_{t+k}, v_t^{k}\right) + \ell^{r}\!\left(u_{t+k}, r_t^{k}\right) + \ell^{p}\!\left(\pi_{t+k}, p_t^{k}\right) \right] + \lambda \lVert \theta \rVert^{2}$$

where $v_t^{k}$, $r_t^{k}$, and $p_t^{k}$ are the value, reward, and policy predicted after unrolling the model $k$ steps from observation $t$; $z_{t+k}$, $u_{t+k}$, and $\pi_{t+k}$ are the corresponding targets; and $\lambda$ scales the L2 penalty on the network weights $\theta$.

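A NumPy sketch of this loss, assuming MSE for the value and reward terms and cross-entropy for the policy term (common choices, not confirmed by the source); all names are illustrative:

```python
import numpy as np

def muzero_loss(pred_values, target_values,
                pred_rewards, target_rewards,
                pred_policy_logits, target_policies,
                params, weight_decay=1e-4):
    """Unrolled loss: MSE on values/rewards, cross-entropy on the policy,
    plus L2 regularization over the network parameters."""
    value_loss = np.mean((np.asarray(pred_values) - np.asarray(target_values)) ** 2)
    reward_loss = np.mean((np.asarray(pred_rewards) - np.asarray(target_rewards)) ** 2)
    # Cross-entropy between the MCTS target policy and the predicted policy.
    logits = np.asarray(pred_policy_logits)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    policy_loss = -np.mean((np.asarray(target_policies) * log_probs).sum(axis=-1))
    # L2 penalty over all parameter arrays.
    l2 = weight_decay * sum(np.sum(p ** 2) for p in params)
    return value_loss + reward_loss + policy_loss + l2
```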