TD3 Implementation for Pendulum-v1

A complete implementation of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm for the Pendulum-v1 environment using PyTorch.

Pendulum Animation

Installation

pip install -r requirements.txt

Usage

Training

python main.py train

Trains the TD3 agent for 200 episodes and saves the model.

Testing

python main.py test

Loads the trained model and runs 10 test episodes with rendering.

Recording GIF

python main.py gif

Records a 5-second GIF of the pendulum in action using the trained model.

CLI Options

Training Options

# Basic training
python main.py train

# Custom training parameters
python main.py train --max-episodes 500 --batch-size 64 --learning-rate 1e-4

# Full parameter customization
python main.py train \
    --max-episodes 300 \
    --batch-size 128 \
    --discount 0.99 \
    --tau 0.005 \
    --policy-noise 0.2 \
    --noise-clip 0.5 \
    --policy-freq 2 \
    --exploration-noise 0.1 \
    --learning-rate 3e-4 \
    --model-name my_td3_model

Testing Options

# Basic testing
python main.py test

# Custom test parameters
python main.py test --episodes 20 --model my_model

# Test without rendering (faster)
python main.py test --no-render --episodes 50

# Custom episode length
python main.py test --max-steps 1000

GIF Recording Options

# Basic GIF recording (5 seconds)
python main.py gif

# Custom duration and output
python main.py gif --duration 10 --output my_pendulum.gif

# Use different model
python main.py gif --model my_td3_model --output custom_pendulum.gif

Monitoring

tensorboard --logdir runs/

Details

Info

  • Twin Critics: Reduces overestimation bias in Q-learning
  • Delayed Policy Updates: Updates actor less frequently than critics
  • Target Policy Smoothing: Adds noise to target actions for regularization
  • Clipped Double Q-learning: Uses minimum of two Q-values for targets

Actor Network

μ(s) = tanh(W₃ · ReLU(W₂ · ReLU(W₁ · s + b₁) + b₂) + b₃) × max_action
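A minimal PyTorch sketch of such an actor (the hidden size and names below are illustrative assumptions, not necessarily what this repo uses):

import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: maps a state to a bounded continuous action."""
    def __init__(self, state_dim, action_dim, max_action, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Tanh(),  # output in [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state):
        # Scale the tanh output to the environment's action bounds.
        return self.max_action * self.net(state)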

Critic Networks

Q₁(s, a) = W₃ · ReLU(W₂ · ReLU(W₁ · [s; a] + b₁) + b₂) + b₃
Q₂(s, a) = W₃' · ReLU(W₂' · ReLU(W₁' · [s; a] + b₁') + b₂') + b₃'
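A matching sketch of the twin critics, which concatenate state and action and return both Q-estimates (again, hidden sizes are assumptions):

import torch
import torch.nn as nn

class Critic(nn.Module):
    """Twin Q-networks: two independent estimates of Q(s, a)."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        def q_net():
            return nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )
        self.q1, self.q2 = q_net(), q_net()

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=1)  # [s; a]
        return self.q1(sa), self.q2(sa)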

Target Policy Smoothing

ã = μ'(s') + ε, where ε ~ clip(N(0, σ²), -noise_clip, noise_clip)
ã = clip(ã, -max_action, max_action)
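A sketch of this step in PyTorch (parameter names mirror the --policy-noise and --noise-clip flags; the function name is illustrative):

import torch

def smoothed_target_action(actor_target, next_state, max_action,
                           policy_noise=0.2, noise_clip=0.5):
    """Target policy smoothing: add clipped Gaussian noise to the target action."""
    with torch.no_grad():
        action = actor_target(next_state)
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        return (action + noise).clamp(-max_action, max_action)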

Clipped Double Q-Learning

Q_target = r + γ · min(Q₁'(s', ã), Q₂'(s', ã))
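Sketched as a critic loss (names are assumptions; not_done is 1 - done, so terminal transitions do not bootstrap):

import torch
import torch.nn.functional as F

def clipped_double_q_loss(critic, critic_target, state, action, reward,
                          not_done, next_state, next_action, discount=0.99):
    """Bootstrap from the smaller of the two target Q-values, then fit both critics."""
    with torch.no_grad():
        target_q1, target_q2 = critic_target(next_state, next_action)
        target_q = reward + not_done * discount * torch.min(target_q1, target_q2)
    current_q1, current_q2 = critic(state, action)
    return F.mse_loss(current_q1, target_q) + F.mse_loss(current_q2, target_q)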

Delayed Policy Updates

L_actor = -E[Q₁(s, μ(s))]
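A sketch of the delayed actor step (policy_freq corresponds to --policy-freq; all names are illustrative):

def delayed_actor_update(actor, critic, actor_optimizer, state, step, policy_freq=2):
    """Update the actor only every `policy_freq` critic updates."""
    if step % policy_freq != 0:
        return
    q1, _ = critic(state, actor(state))
    actor_loss = -q1.mean()  # ascend Q1 under the current policy
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()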

Soft Target Updates

θ' ← τθ + (1 - τ)θ'
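One way to implement the Polyak update in PyTorch (tau matches the --tau flag):

import torch

def soft_update(net, target_net, tau=0.005):
    """Slowly track the online network: θ' ← τθ + (1 - τ)θ'."""
    with torch.no_grad():
        for param, target_param in zip(net.parameters(), target_net.parameters()):
            target_param.mul_(1.0 - tau).add_(tau * param)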

References

Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML 2018.
