For a complete implementation of CartPole with OptimRL, check out our examples in the `simple_test` directory:

- `cartpole_simple.py`: Basic implementation with GRPO
- `cartpole_improved.py`: Improved implementation with tuned parameters
- `cartpole_final.py`: Final implementation with optimized performance
- `cartpole_tuned.py`: Enhanced implementation with advanced features
- `cartpole_simple_pg.py`: Vanilla Policy Gradient implementation for comparison

The vanilla policy gradient implementation (`cartpole_simple_pg.py`) achieves excellent performance on CartPole-v1, consistently reaching the maximum reward of 500. It serves as a useful baseline for comparing against the GRPO implementations.
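For context, the core of a vanilla policy gradient (REINFORCE) update is sketched below; this is a generic illustration rather than the exact code in `cartpole_simple_pg.py`.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """REINFORCE loss for one episode: -sum_t log pi(a_t | s_t) * G_t."""
    # Discounted return G_t for every timestep, computed backwards
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalize returns to reduce gradient variance
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Minimizing this objective performs gradient ascent on expected return
    return -(torch.stack(log_probs) * returns).sum()
```

Normalizing the returns is a common trick to keep gradient magnitudes comparable across episodes.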

## 🔍 Advanced Usage

Integrate OptimRL seamlessly into your **PyTorch pipelines** or custom training loops. Below is a **complete example** showcasing GRPO in action:

### Continuous Action Space Example (Pendulum)

```python
import torch
import torch.nn as nn
import torch.optim as optim
import gym
from optimrl import create_agent

# Define a continuous policy network
class ContinuousPolicyNetwork(nn.Module):
    def __init__(self, input_dim, action_dim):
        super().__init__()
        self.shared_layers = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU()
        )
        # Output both mean and log_std for each action dimension
```
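A purely illustrative sketch of how such a network commonly continues — the `mean_head` and `log_std` attributes and the `forward` method below are assumptions, not code from the OptimRL examples:

```python
import torch
import torch.nn as nn

class GaussianPolicySketch(nn.Module):
    """Illustrative continuous policy head; not the OptimRL example code."""
    def __init__(self, input_dim, action_dim):
        super().__init__()
        self.shared_layers = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.mean_head = nn.Linear(64, action_dim)             # per-dimension action mean
        self.log_std = nn.Parameter(torch.zeros(action_dim))   # learnable log standard deviation

    def forward(self, state):
        features = self.shared_layers(state)
        mean = self.mean_head(features)
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)

# Pendulum-v1 observations are 3-dimensional and actions 1-dimensional
policy = GaussianPolicySketch(input_dim=3, action_dim=1)
dist = policy(torch.randn(3))
action = dist.sample()                     # continuous action
log_prob = dist.log_prob(action).sum(-1)   # log-probability used by the policy gradient
```

Returning a `torch.distributions.Normal` makes the log-probabilities of sampled actions directly available, which is what the policy gradient update needs.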
Our simple policy gradient implementation consistently solves the CartPole-v1 environment in under 1000 episodes, achieving the maximum reward of 500. The GRPO implementations offer competitive performance with additional benefits:

- **Lower variance**: More stable learning across different random seeds
- **Improved sample efficiency**: Learns from fewer interactions with the environment
- **Better regularization**: Prevents policy collapse during training
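One simple way to check the stability claim for yourself is to train each implementation under several random seeds and compare the spread of final rewards; the `train_fn` below is a placeholder for whatever training entry point you use, not an OptimRL API.

```python
import statistics

def evaluate_over_seeds(train_fn, seeds=(0, 1, 2, 3, 4)):
    """Run `train_fn(seed)` once per seed and summarize the final episode rewards."""
    finals = [train_fn(seed) for seed in seeds]
    return {
        "mean_reward": statistics.mean(finals),
        "reward_stdev": statistics.stdev(finals),  # lower spread => more stable across seeds
    }
```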

## Kaggle Notebook
You can view the "OptimRL Trading Experiment" notebook on Kaggle: