Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation
This is the official implementation of the paper "Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation", accepted at UAI 2025.
This repository introduces PURE (Policy Update and Rolling-out Efficient CTRL), a novel continuous-time reinforcement learning (CTRL) framework that achieves competitive performance with significantly fewer policy updates and rollouts. Our approach is supported by both theoretical guarantees and empirical evidence. The codebase includes implementations for diffusion model fine-tuning and continuous control tasks.
Continuous-time reinforcement learning (CTRL) provides a principled framework for sequential decision-making in environments where interactions evolve continuously over time. Despite its empirical success, the theoretical understanding of CTRL remains limited, especially in settings with general function approximation. In this work, we propose a model-based CTRL algorithm that achieves both sample and computational efficiency. Our approach leverages optimism-based confidence sets to establish the first sample complexity guarantee for CTRL with general function approximation, showing that a near-optimal policy can be learned with a suboptimality gap of
To set up the diffusion model fine-tuning experiments (PURE_SEIKO), create a conda environment with the following commands:
cd PURE_SEIKO
conda create -n SEIKO python=3.10
conda activate SEIKO
pip install -r requirements.txt
Please use accelerate==0.17.0; other library dependencies might be flexible.
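Because the accelerate pin matters, it can help to confirm what actually got installed. The snippet below is a minimal sketch using only the Python standard library; the helper names `installed_version` and `matches_pin` are illustrative and not part of this repository.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(package, expected):
    """True only if `package` is installed at exactly the `expected` version."""
    return installed_version(package) == expected


# Report the state of the pinned dependency named in the note above.
print("accelerate version:", installed_version("accelerate"))
print("matches 0.17.0:", matches_pin("accelerate", "0.17.0"))
```

Run it inside the activated SEIKO environment; if the second line reports `False`, reinstalling with `pip install accelerate==0.17.0` is the likely fix.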
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 online/online_main_pure.py --config config/UCB.py:aesthetic --seed=31 --num_outer_loop=4
You can modify the --seed and --num_outer_loop values as needed.
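To repeat the run over several seeds, the command above can be wrapped in a small driver. This is a sketch under assumptions: the helpers `build_command` and `run_sweep` are hypothetical, the seed list is arbitrary, and only the flags shown in the command above are used. It defaults to a dry run that prints the commands instead of launching them.

```python
import os
import subprocess


def build_command(seed, num_outer_loop):
    """Assemble the launch command from the README with the given seed and loop count."""
    return [
        "accelerate", "launch", "--num_processes=1",
        "online/online_main_pure.py",
        "--config", "config/UCB.py:aesthetic",
        f"--seed={seed}",
        f"--num_outer_loop={num_outer_loop}",
    ]


def run_sweep(seeds, num_outer_loop=4, gpu="0", dry_run=True):
    """Run (or, in a dry run, just print) one launch command per seed."""
    # Pin the GPU the same way the README command does.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    commands = [build_command(seed, num_outer_loop) for seed in seeds]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, env=env, check=True)
    return commands


# Arbitrary example seeds; the dry run only prints the commands.
run_sweep([31, 32, 33])
```

Call `run_sweep(..., dry_run=False)` from the PURE_SEIKO directory to actually launch the runs one after another.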
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 online/online_main.py --config config/UCB.py:aesthetic
We compare the running time and aesthetic reward score of PURESEIKO against SEIKO. As shown below, PURESEIKO achieves comparable performance with substantially reduced fine-tuning time. All experiments were conducted on a single A6000 GPU with 5 random seeds.
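When comparing results across several random seeds, a common convention is to report the mean and sample standard deviation of each metric. The helper below is a minimal sketch of that aggregation; `summarize` is a hypothetical name, and the README does not state which statistic the authors report.

```python
from statistics import mean, stdev


def summarize(values):
    """Return (mean, sample standard deviation) of per-seed measurements."""
    values = list(values)
    if not values:
        raise ValueError("need at least one measurement")
    # A single run has no spread to estimate, so report 0.0 in that case.
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread
```

Pass it the per-seed reward scores or running times you collected, once for each method, and compare the resulting pairs.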
We also visualize generated images and their aesthetic scores from PURESEIKO and SEIKO.

To set up the continuous control experiments (PURE_ENODE), create a conda environment with the following commands:
cd PURE_ENODE
conda create -n ODERL python=3.7.7
conda activate ODERL
pip install -r requirements.txt
Please use torch 1.6.0 or later (torch>=1.6.0); other library dependencies might be flexible.
CUDA_VISIBLE_DEVICES=0 python runner_pure.py
CUDA_VISIBLE_DEVICES=0 python runner.py
We compare the performance of PUREENODE and ENODE. As shown below, PUREENODE achieves similar reward scores with significantly reduced fine-tuning time. Experiments were run on a single A6000 GPU with 20 seeds.
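To measure the running time of the two runners yourself, one option is to time each command from a small wrapper. The following is a sketch, not the authors' benchmarking setup; `time_command` is a hypothetical helper, and the demonstration times a trivial Python command so the snippet runs anywhere.

```python
import subprocess
import sys
import time


def time_command(cmd, env=None):
    """Run `cmd` to completion and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


# Demonstration with a no-op command; substitute the runner commands to benchmark them.
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"elapsed: {elapsed:.3f} s")
```

From the PURE_ENODE directory, calling `time_command(["python", "runner_pure.py"])` and `time_command(["python", "runner.py"])` with `env=dict(os.environ, CUDA_VISIBLE_DEVICES="0")` (after `import os`) would time the two runs under the same GPU setting.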

This codebase builds on top of SEIKO and ODERL. We thank the original authors for making their code available.
If you find our work useful, please consider citing:
@inproceedings{zhaosample,
title={Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation},
author={Zhao, Runze and Yu, Yue and Zhu, Adams Yiyue and Yang, Chen and Zhou, Dongruo},
booktitle={The 41st Conference on Uncertainty in Artificial Intelligence}
}