This repository implements a new class of thermodynamics-inspired algorithms for reinforcement learning, together with experiments demonstrating their usability. Just as particles in nature generally prefer to occupy states/locations with lower total energy (there are more air particles down here than 100 km above the Earth's surface), the algorithm proposed and implemented here enforces that action sequences that earn high rewards have high total probabilities. Total probability is here defined as the product of the probabilities of all actions taken:

$$P = \prod_t p_t$$

where $p_t$ is the probability of the action taken at step $t$.
As a quick demo, here's a neural network trained using the Policy Annealing algorithm to take a moonlander to a safe stop. For training, see Usage/Lunar Lander.
This project requires Python 3.10-3.12 to be installed. The current codebase does not utilize a GPU; all computations are run on the CPU.
Installing the Python dependencies requires swig, a system-level package.
On Linux:
```shell
# Ubuntu/Debian/Mint
sudo apt install swig
# Arch Linux
sudo pacman -S swig
```

Once swig is installed, you can install the Python dependencies:
```shell
# Install PyTorch (CPU version)
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cpu
# Install other Python dependencies
pip install -r requirements.txt
```

Experiments can be run with the following command:
```shell
python scripts/main.py
```

See `--help` for setting the training environment, algorithm type and parameters, and whether to render the training progress. All metadata and renders will be stored in a runs/[timestamp]/ directory.
Performance curves of different runs can be plotted and compared with:
```shell
python scripts/plot.py runs/[timestamp1]/experiment.json runs/[timestamp2]/experiment.json
```

Below is a list of environments together with hyperparameters and visualizations of the trained agents. Validation mode, used in some examples, means the agent runs deterministically, always choosing the action with the highest probability instead of sampling from the probability distribution.
The Cart Pole environment requires the agent to learn to balance a vertical pole by accelerating the cart left or right. A reward is given for each time step as long as the pole stays within 15 degrees of vertical and the cart does not veer out of the frame. There is a maximum time limit of 500 simulation steps.
Here's a very aggressive setup for training the CartPole agent. The training can be extremely fast if we're lucky, but is quite unstable and often does not converge.
```shell
python scripts/main.py --value-function direct --batch-size 4 --num-episode-batches 10 --learning-rate 0.005 --num-optim-steps 300 --env-name CartPole-v1 --render full
```

A relatively lucky from-scratch training run:
Here's a much more stable setup that almost always converges:
```shell
python scripts/main.py --value-function grouped --batch-size 32 --group-size 8 --env-name CartPole-v1 --num-episode-batches 10 --temp-start 0.5 --temp-end 0.5
```

The final solution looks more convincing:
With validation mode on (always choosing the highest probability action), the solution gets so good that it's even boring:
In this environment, the moonlander agent must learn to fire its three thrusters (one main and two side) to successfully land on a platform. Reward increases for approaching the platform and landing, and decreases for thruster usage (fuel consumption), fast movement, and non-vertical orientation.
```shell
python scripts/main.py --env-name LunarLander-v3 --num-episode-batches 300 --clip-eps 2 --temp-start 0.1 --temp-end 0.1 --learning-rate 0.0005
```

To visualize a trained agent in validation mode, load its policy file:
```shell
# Replace [timestamp], [experiment_number] and [episode_batch_number] with directory names
python scripts/main.py --env-name LunarLander-v3 --validate --render full --load-model runs/[timestamp]/[experiment_number]/models/[episode_batch_number]/policy.pth
```

At a fixed temperature, the probability of finding a particle at a given location follows the Boltzmann distribution:

$$p \propto e^{-\frac{E}{kT}}$$

where

- $p$ is the probability of the particle being at a certain location
- $E$ is the energy of the particle at that location
- $k$ is the Boltzmann constant
- $T$ is the absolute temperature of the system
Essentially, particles are more likely to be in lower energy states, with the exact probability distribution depending on the temperature.
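This behavior can be checked numerically. The snippet below is a standalone illustration (not part of the repository) that computes Boltzmann probabilities for three energy levels and shows how temperature controls the sharpness of the distribution:

```python
import math

def boltzmann_probabilities(energies, kT):
    # p_i ∝ exp(-E_i / kT), normalized so the probabilities sum to 1
    weights = [math.exp(-e / kT) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

# Lower-energy states are always more likely...
cold = boltzmann_probabilities([0.0, 1.0, 2.0], kT=0.5)
# ...but a higher temperature flattens the distribution toward uniform.
hot = boltzmann_probabilities([0.0, 1.0, 2.0], kT=50.0)
```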
In a deep reinforcement learning setting, we want the neural network to have a high probability of emitting actions that lead to high rewards. If we substitute negative energy with reward, we get:

$$p \propto e^{\frac{r}{T}}$$

where $p$ is the probability of an action and $r$ is the reward it receives (the Boltzmann constant is absorbed into the temperature $T$).
Now, let's consider a full trajectory. The total probability of a trajectory is the product of the probabilities of all actions taken, and the total reward is the sum of rewards from all steps:

$$P = \prod_t p_t, \qquad R = \sum_t r_t$$

This leads to the relationship:

$$\prod_t p_t \propto e^{\frac{\sum_t r_t}{T}}$$

To avoid dealing with products of small probabilities and make the equation easier to work with, we apply the logarithm to both sides:

$$\log \prod_t p_t = \frac{\sum_t r_t}{T} + \mathrm{const}$$

Using summation notation, this is:

$$\sum_t \log p_t = \frac{1}{T} \sum_t r_t + \mathrm{const}$$

Multiplying by $T$:

$$T \sum_t \log p_t = \sum_t r_t + \mathrm{const}$$

We can combine the terms under a single summation. Let's define the "trajectory value" $V$:

$$\boxed{V \equiv \sum_t \left( r_t - T \log p_t \right) = \mathrm{const}}$$

This equation is the basis for all variants of the policy annealing algorithm. If we can train a policy network to produce probabilities $p_t$ such that $V$ comes out identical for all trajectories (with the same starting conditions), the policy will sample trajectories according to the Boltzmann distribution over total reward.
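As a concrete illustration, the trajectory value can be computed from the per-step rewards and the probabilities of the actions actually taken. This is standalone sketch code, not the repository's batched PyTorch implementation:

```python
import math

def trajectory_value(rewards, action_probs, temperature):
    # V = sum_t (r_t - T * log p_t): each step contributes its reward
    # minus the temperature-weighted log-probability of the chosen action.
    return sum(r - temperature * math.log(p)
               for r, p in zip(rewards, action_probs))

# A two-step trajectory where both actions were taken with probability 0.5:
v = trajectory_value([1.0, 1.0], [0.5, 0.5], temperature=1.0)
```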
The easiest way to enforce the above condition is to group environment runs by identical seed (identical starting conditions), so that all trajectories in a group should share a single trajectory value $V$.
Specifically, for the loss function, we note that the condition is satisfied when the computed values of all trajectories in a group are equal to their mean. We thus use a squared error loss:

$$\mathcal{L} = \frac{1}{N} \sum_{j=1}^{N} \left( V_j - \bar{V} \right)^2$$

where $V_j$ is the trajectory value of the $j$-th run in the group, $\bar{V}$ is the mean trajectory value within the group, and $N$ is the group size.
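A minimal sketch of this loss in plain Python (the repository's implementation lives in lib/loss_*.py and operates on batched tensors):

```python
def grouped_loss(group_values):
    # Mean squared deviation of each trajectory value from the group mean.
    # A group shares one seed, so ideally all its values are identical.
    mean = sum(group_values) / len(group_values)
    return sum((v - mean) ** 2 for v in group_values) / len(group_values)

# Identical values within a group -> zero loss; spread-out values -> positive loss.
```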
To improve learning efficiency, we can enforce the boxed condition directly, requiring every sampled trajectory to satisfy:

$$V_\text{predicted} = \sum_t \left( r_t - T \log p_t \right)$$

with $V_\text{predicted}$ estimated from the trajectory's initial observation. We can then define two neural networks: one for computing the value $V_\text{predicted}$, and one for the policy producing the probabilities $p_t$. Both are trained jointly by minimizing the squared error between the two sides of this equation.
Experiments show that this technique can be quite sample-efficient; however, it may need a relatively large number of training steps on each batch of samples. This is because the value and policy networks learn together from one equation, and the policy network can pick up wrong behavior while the value network has not yet converged.
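The corresponding loss can be sketched as follows, with a (hypothetical) value network prediction per trajectory standing in for the group mean of the grouped variant:

```python
def direct_loss(trajectory_values, predicted_values):
    # Squared error between each trajectory's computed value and the value
    # network's prediction for its initial observation. In the full method,
    # gradients flow into both the policy (through the log-prob terms inside
    # the trajectory values) and the value network (through the predictions).
    n = len(trajectory_values)
    return sum((v - p) ** 2
               for v, p in zip(trajectory_values, predicted_values)) / n
```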
One problem with the Policy Annealing methods described above is that they define their own loss functions, which aren't obviously similar to the Policy Gradients method. This means that Policy Annealing, in its previous forms, is hard to apply to state-of-the-art algorithms.
To remedy this, we can define an "entropy-regularized reward" (ERR) as:

$$\tilde{r}_t = r_t - T \log p_t$$

We can then see that our core equation is simply the sum of these new rewards:

$$V = \sum_t \tilde{r}_t = \mathrm{const}$$
Thus, we can treat these entropy-regularized rewards just as we would treat standard rewards in any RL method (for both value estimation and policy updates) and still, at least theoretically, arrive at the same Boltzmann distribution. At the same time, we can use the bells and whistles that have been invented for algorithms like PPO.
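The substitution can be sketched as follows (standalone Python; names are illustrative). With an undiscounted return, the reward-to-go of the first step recovers the trajectory value $V$:

```python
import math

def entropy_regularized_rewards(rewards, action_probs, temperature):
    # Replace each raw reward r_t with r_t - T * log p_t.
    return [r - temperature * math.log(p)
            for r, p in zip(rewards, action_probs)]

def rewards_to_go(rewards, gamma=1.0):
    # Standard return computation, applied unchanged to the new rewards.
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

err = entropy_regularized_rewards([1.0, 1.0], [0.5, 0.5], temperature=1.0)
returns = rewards_to_go(err)  # returns[0] equals the trajectory value V
```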
Note: this is a work in progress and can't be run from scripts/main.py; however, some quick experiments already show much better stability properties compared to other regularization methods such as Entropy Bonus.
Entropy bonus regularization is a common technique in modern deep reinforcement learning. The goal is to encourage exploration by adding the policy's entropy, scaled by a temperature parameter, to the training objective.
Both theoretical arguments and toy simulations show that Policy Annealing based algorithms have much better stability characteristics than Entropy Bonus.
For example, let's take an RL environment that has no input, only three possible actions, each yielding a different fixed reward.
The simulation below shows how the action probabilities change for three methods: Policy Gradients without regularization (blue), Policy Gradients with Entropy Bonus (red), and Policy Gradients with Policy Annealing Regularization (green, proposed algorithm).
Note that Policy Annealing regularization allows the policy to become stationary at an optimal distribution, while the Entropy Bonus policy never truly settles. For Policy Annealing, at the optimal distribution, the rewards are fully compensated by the $-T \log p$ regularization terms, so the gradient of the objective vanishes.
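This stationarity can be verified directly. At the annealed optimum the policy is $p_i \propto e^{r_i / T}$, and the entropy-regularized reward $r_i - T \log p_i$ then works out to the same constant for every action. The sketch below uses hypothetical per-action rewards:

```python
import math

def annealed_policy(action_rewards, temperature):
    # Stationary policy under Policy Annealing: p_i ∝ exp(r_i / T).
    weights = [math.exp(r / temperature) for r in action_rewards]
    total = sum(weights)
    return [w / total for w in weights]

action_rewards = [1.0, 2.0, 3.0]  # hypothetical rewards for the three actions
T = 0.5
p = annealed_policy(action_rewards, T)
# r_i - T * log(p_i) collapses to T * log(Z) for every action i, so the
# regularized policy gradient is zero and the policy stays put.
err = [r - T * math.log(pi) for r, pi in zip(action_rewards, p)]
```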
The current project is divided into the following parts:
- demos/ # Code for some README visualizations
- lib/
- anneal_*.py # Annealing logic implementations
- loss_*.py # Loss function implementations
- batch_environment.py # Environments with grouped seeding logic
- model.py # PyTorch model generator
- sample.py # Sampling logic
- tracking.py # Visualization and experiment tracking logic. You can safely ignore this file if you just want to understand the algorithm.
- scripts/
- main.py # Run experiments
- plot.py # Compare and visualize performance plots of experiments
- tests/
- test_*.py # Pytest tests
In addition, the following directory is generated while running the experiments:
- runs/
- [timestamp]/
- experiment.json # Experiment metadata + rewards. Feed this into scripts/plot.py
- [experiment number]/ # If num_runs > 0
- models/
- [episode batch number]/
- policy.pth # Policy network
- value.pth # Value network (if applicable)
- training.mp4 # Video of all training episodes
- first_observation_value.html # Visualization of value network changes
The following coding conventions are used:
- Tensor channels are ordered as `(num_groups, group_size, steps, num_actions/num_observations)`. Note that `batch_size = num_groups * group_size`.
- Variables whose type can't be inferred by mypy are explicitly typed (e.g. function signatures, empty list initializations)
- Ensure that linting and tests succeed:
```shell
pre-commit run -a && pytest
```




