SAPPS-RL

Adaptive Policy Regularization for Smooth Control in Reinforcement Learning

This repository contains the official implementation of State-Adaptive Proportional Policy Smoothing (SAPPS), a policy regularization method designed to produce smooth yet responsive control policies in continuous-control reinforcement learning.

SAPPS suppresses high-frequency oscillations in learned policies without compromising performance, particularly in dynamic environments where rapid adaptation is required.

📄 Paper: Adaptive Policy Regularization for Smooth Control in Reinforcement Learning
📌 Journal submission: IEEE Transactions on Automation Science and Engineering (under review)
🔗 TechRxiv Preprint: https://doi.org/10.36227/techrxiv.177004949.91897305/v1
👤 Authors: Payam Parvizi, Abhishek Naik, Colin Bellinger, Ross Cheriton, Davide Spinello

Abstract

A significant challenge in applying reinforcement learning (RL) to continuous-control problems is the presence of high-frequency oscillations in the actions produced by learned policies. These oscillations result in abrupt control responses, leading to excessive actuator wear, increased power consumption, and instability in real-world deployments. Existing approaches to reduce such oscillations often involve trade-offs, including increased architectural complexity or degraded policy performance, particularly in environments where rapid state changes require rapid adaptation.

To address this issue, we propose State-Adaptive Proportional Policy Smoothing (SAPPS), a novel approach that adaptively adjusts smoothness constraints to suppress high-frequency components in RL policies. SAPPS is inspired by Lipschitz continuity. It introduces a state-adaptive proportional regularization during policy optimization, encouraging changes in consecutive actions to scale with changes in consecutive observations. This adaptive constraint enables smooth yet responsive control.

Results from simulation and hardware experiments demonstrate that SAPPS produces smooth control policies without compromising performance across a diverse set of environments, including MuJoCo continuous-control benchmarks, a simulated adaptive optics system for optical satellite communications, and a real-world nano quadcopter, under both slowly and rapidly changing conditions.

Method Overview

SAPPS is a general policy regularization technique that can be integrated into deep RL algorithms to improve policy smoothness in both static and dynamic continuous-control settings. Rather than directly penalizing action magnitude, SAPPS regularizes the change between consecutive actions based on the relative change between consecutive observations.

This adaptive formulation:

penalizes unnecessary action fluctuations when state changes are small
preserves responsiveness when large observation changes require rapid control adaptation
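
The two properties above can be illustrated with a small sketch. This is not the loss used in the paper or in this repository. It is a minimal pure-Python illustration that assumes the regularizer can be approximated by the ratio of consecutive action change to consecutive observation change. The function names, the `eps` term, and the ratio form itself are all assumptions made for illustration.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(x * x for x in vec))


def proportional_smoothness_penalty(observations, actions, eps=1e-8):
    """Mean ratio of action change to observation change over a trajectory.

    Illustrative assumption only, not the exact SAPPS loss:
    a given action change costs more when the observation barely moved,
    and costs less when the observation changed a lot.
    """
    if len(observations) != len(actions):
        raise ValueError("observations and actions must have the same length")
    if len(actions) < 2:
        return 0.0
    ratios = []
    for t in range(len(actions) - 1):
        d_act = l2_norm([a1 - a0 for a0, a1 in zip(actions[t], actions[t + 1])])
        d_obs = l2_norm([o1 - o0 for o0, o1 in zip(observations[t], observations[t + 1])])
        ratios.append(d_act / (d_obs + eps))  # eps avoids division by zero
    return sum(ratios) / len(ratios)


# Same action change (0.5), different observation changes.
small_obs_change = proportional_smoothness_penalty([[0.0], [0.01]], [[0.0], [0.5]])
large_obs_change = proportional_smoothness_penalty([[0.0], [1.0]], [[0.0], [0.5]])
print(small_obs_change > large_obs_change)  # the small-observation-change case is penalized more
```

In a training loop, a term of this kind would typically be added to the policy loss with a weighting coefficient. The actual formulation and its hyperparameters are given in the paper and in the environment-specific code.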

SAPPS is implemented within Proximal Policy Optimization (PPO) and compared against:

Vanilla PPO
PPO with Conditioning for Action Policy Smoothness (CAPS)
PPO with LipsNet-based architectures

Repository Structure

sapps-rl/
│
├── Adaptive_Optics_Environment/
├── MuJoCo_Environments/
├── Quadcopter_Environment/
└── README.md

Each environment directory is self-contained and includes training and evaluation scripts corresponding to the experiments reported in the paper.

Detailed instructions are provided in the environment-specific READMEs:

Adaptive Optics Environment → Adaptive_Optics_Environment/README.md
MuJoCo Continuous-Control Environments → MuJoCo_Environments/README.md
Quadcopter Environment (Real Hardware) → Quadcopter_Environment/README.md

Experimental Domains

1. MuJoCo Continuous-Control Benchmarks

SAPPS is evaluated on standard Gymnasium (Farama Foundation) MuJoCo tasks, including:

Walker2d
HalfCheetah
Ant
Reacher
Swimmer

The diagnostic experiment in the paper is conducted on a continuing version of the Reacher task. Its implementation is straightforward and closely follows the Reacher implementation in this repository.

Across these benchmarks, SAPPS improves policy smoothness while maintaining or improving task return.
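
One simple way to get a feel for policy smoothness is to measure how much the action changes from one step to the next along a rollout. The sketch below is only a proxy and is not necessarily the smoothness metric reported in the paper. The function name `mean_action_change` and the toy trajectories are hypothetical.

```python
def mean_action_change(actions):
    """Average L1 change between consecutive actions in one trajectory.

    Lower values indicate a smoother action sequence. This is a simple
    proxy, assumed here for illustration only.
    """
    if len(actions) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(actions, actions[1:]):
        total += sum(abs(c - p) for p, c in zip(prev, curr))
    return total / (len(actions) - 1)


# Toy trajectories (made up for illustration, not experimental data).
oscillating = [[1.0], [-1.0], [1.0], [-1.0]]
gradual = [[0.0], [0.1], [0.2], [0.3]]

print(mean_action_change(oscillating) > mean_action_change(gradual))
```

A proxy like this should always be read together with task return, since a policy that never moves is perfectly smooth but useless.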

2. Real-World Quadcopter Control

A nano quadcopter hovering task is used to validate real-world applicability. SAPPS demonstrates reduced control oscillations, improved actuator efficiency, and stable performance under sensor noise and disturbances.

3. Wavefront Sensorless Adaptive Optics

This domain is an optical control problem arising in satellite-to-ground optical communication. SAPPS is evaluated under both slowly varying and rapidly varying atmospheric turbulence, achieving consistently strong performance and improved robustness relative to benchmark policy smoothing methods.

Installation

All experiments are implemented in Python and use standard deep reinforcement learning libraries.

Requirements

Python ≥ 3.9
PyTorch
NumPy
SciPy
Farama Gymnasium (with MuJoCo support)
Weights & Biases (for logging)
Tianshou (training framework and rollout collection)

Each environment subdirectory includes its own requirements.txt listing any additional dependencies (e.g., specialized simulation libraries or hardware interface packages).

Quick Start (Minimal Example)

Below is a minimal example to train a SAPPS-regularized PPO policy on a MuJoCo task.

# 1. Create and activate a virtual environment
python -m venv sapps-env
source sapps-env/bin/activate  # Windows: sapps-env\Scripts\activate

# 2. Install dependencies for MuJoCo experiments
cd MuJoCo_Environments
pip install -r requirements.txt

# 3. Run a minimal training example (default environment: Ant-v4)
python run_mujoco.py \
  --regularization_case PPO_SAPPS \
  --seed 0

By default, this command trains on the Ant-v4 environment using the hyperparameters reported in the paper.

Running Experiments

Each environment directory contains its own training and evaluation scripts. Please refer to the specific environment's README and scripts for details on usage, hyperparameters, and experimental settings.

Reproducibility

All results reported in the paper are averaged over multiple random seeds, and the default hyperparameters in this repository match those described in the paper. Evaluation protocols are consistent across methods, but differences in simulator versions, hardware, or inherent randomness may cause your results to vary slightly. The qualitative performance trends should remain consistent.
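
Since results are averaged over several seeds, it can be convenient to repeat the Quick Start command for a list of seeds. The helper below is a sketch, not a script shipped with this repository. It uses only the `run_mujoco.py` script and the `--regularization_case` and `--seed` flags shown in the Quick Start. The seed values, the function names, and the `dry_run` switch are assumptions.

```python
import subprocess
import sys


def build_command(seed, regularization_case="PPO_SAPPS"):
    """Build the Quick Start training command for one seed."""
    return [
        sys.executable, "run_mujoco.py",
        "--regularization_case", regularization_case,
        "--seed", str(seed),
    ]


def run_seeds(seeds, dry_run=True):
    """Run (or, with dry_run=True, only print) one training command per seed."""
    commands = [build_command(seed) for seed in seeds]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # Expects to be launched from inside MuJoCo_Environments/.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # The seed list is an arbitrary example, not the set used in the paper.
    run_seeds(range(5), dry_run=True)
```

Set `dry_run=False` to launch the runs one after another. The number of seeds used for the reported results is stated in the paper.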

Citation

If you use this code in your research, please cite the associated paper.
Citation files are provided in CITATION.cff and CITATION.bib.

@article{parvizi2026sapps,
  title   = {Adaptive Policy Regularization for Smooth Control in Reinforcement Learning},
  author  = {Parvizi, Payam and Naik, Abhishek and Bellinger, Colin and Cheriton, Ross and Spinello, Davide},
  journal = {TechRxiv},
  year    = {2026},
  month   = {February},
  note    = {Preprint},
  doi     = {10.36227/techrxiv.177004949.91897305/v1},
  url     = {https://doi.org/10.36227/techrxiv.177004949.91897305/v1}
}

License

This project is released under the MIT License. See the LICENSE file for details.

Contact

For questions, bug reports, or feature requests, please open a GitHub Issue on this repository so others can benefit from the discussion.

For private or collaboration-related inquiries, please contact:
Payam Parvizi — Email: pparv056@uottawa.ca
Abhishek Naik — Email: Abhishek.Naik@nrc-cnrc.gc.ca

Acknowledgments

This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) and by the National Research Council Canada (NRC).
