Bounded Action Space #81

Open · wants to merge 1 commit into main

Conversation

AntoineRichard

Hi there!

This PR adds support for bounded action spaces directly in the agent.
The main difference from clipping is that actions are sampled within a fixed range to begin with, so rewards are never computed on clipped actions.

To accommodate this, two options are provided (a sketch of both is given after this list):

  1. The "SAC style", where a gaussian based policy is bounded to the [-1, 1] range with a tanh on the mean and a tanh on the sampled actions. This is accounted for in the calculation of the action log dist. (Appendix C here: https://arxiv.org/pdf/1801.01290) Or one could look at: https://github.com/DLR-RM/stable-baselines3/blob/ea913a848242b2fca3cbcac255097e1d144207df/stable_baselines3/common/distributions.py#L207 ?
  2. A "beta policy", where rather than sampling on a probability distribution that's unbounded (the normal distribution for instance), we sample actions using a bounded probability distribution (the beta distribution). Original paper here: https://proceedings.mlr.press/v70/chou17a/chou17a.pdf . This is then rescaled to whatever is needed.

To allow for a smooth calculation of the KL divergence between two beta distributions, I had to slightly rework the transitions to store the distribution parameters rather than just the mean and std. Hence, in the case of the normal distribution I save the mean and standard deviation, while for the beta distribution I save alpha and beta.

Then, instead of manually computing the KL divergence, I let torch do the heavy lifting.
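
Something along these lines (a sketch under my own naming; ppo_kl and the stored-parameter tuples are placeholders, not identifiers from this PR):

    import torch
    from torch.distributions import Beta, Normal, kl_divergence

    def ppo_kl(old_params, new_params, dist_cls=Beta):
        # Rebuild the old/new action distributions from the stored parameters
        # (mean/std for Normal, alpha/beta for Beta) and let torch compute the
        # closed-form KL divergence between them.
        old_dist = dist_cls(*old_params)   # e.g. Beta(alpha, beta) or Normal(mean, std)
        new_dist = dist_cls(*new_params)
        return kl_divergence(old_dist, new_dist).sum(dim=-1).mean()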

Configuration-wise, it could look like this:

Beta

    policy = RslRlPpoActorCriticBetaCfg(
        init_noise_std=1.0,
        actor_hidden_dims=[32, 32],
        critic_hidden_dims=[32, 32],
        activation="elu",
        clip_actions=True, # Note: has no effect here, the beta policy is always bounded.
        clip_actions_range=[-1.0, 1.0],
    )

Normal

    policy = RslRlPpoActorCriticCfg(
        init_noise_std=1.0,
        actor_hidden_dims=[32, 32],
        critic_hidden_dims=[32, 32],
        activation="elu",
        clip_actions=True, # Defaults to False
        clip_actions_range=[-1.0, 1.0],
    )

I know this significantly changes the way PPO updates are done, and it's a BREAKING CHANGE, so I totally understand if the beta policy doesn't make it into the main repo! Though having a reliable action clipping mechanism would be nice :).

LMK if you want me to change anything, I'd be happy to!

Best,

Antoine

@AntoineRichard changed the title from "Added action clipping SAC style, and created a BetaPolicy which has a…" to "Bounded Action Space" on Apr 1, 2025