
Commit 531750f

Add CartPole examples and update README
1 parent 51fea9d commit 531750f

20 files changed: +1,865 additions, −111 deletions

OptimRL.code-workspace

Lines changed: 0 additions & 8 deletions
This file was deleted.

README.md

Lines changed: 142 additions & 67 deletions
@@ -11,9 +11,9 @@ OptimRL is a **high-performance reinforcement learning library** that introduces
 ![PyTorch](https://img.shields.io/badge/Framework-PyTorch-EE4C2C?logo=pytorch&logoColor=white)
 ![Setuptools](https://img.shields.io/badge/Tool-Setuptools-3776AB?logo=python&logoColor=white)
 ![Build Status](https://github.com/subaashnair/optimrl/actions/workflows/tests.yml/badge.svg)
-![CI](https://github.com/subaashnair/optimrl/workflows/CI/badge.svg)
-![Coverage](https://img.shields.io/codecov/c/github/subaashnair/optimrl)
 ![License](https://img.shields.io/github/license/subaashnair/optimrl)
+<!-- ![Coverage](https://img.shields.io/codecov/c/github/subaashnair/optimrl) -->
+
 
 ## 🌟 Features
 
@@ -39,6 +39,18 @@ OptimRL is a **high-performance reinforcement learning library** that introduces
    - Native integration with deep learning workflows
    - Full automatic differentiation support
 
+5. **🔄 Experience Replay Buffer**
+   Improve sample efficiency with built-in experience replay:
+   - Learn from past experiences multiple times
+   - Reduce correlation between consecutive samples
+   - Configurable buffer capacity and batch sizes
+
+6. **🔄 Continuous Action Space Support**
+   Train agents in environments with continuous control:
+   - Gaussian policy implementation for continuous actions
+   - Configurable action bounds
+   - Adaptive standard deviation for exploration
+
 ---
 
 ## 🛠️ Installation
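
For readers skimming the new "Experience Replay Buffer" and "Continuous Action Space Support" feature bullets above: the snippet below is a minimal, generic sketch of what a capacity-bounded, uniformly sampled replay buffer does, using the same `buffer_capacity` / `batch_size` knobs that the `create_agent()` calls later in this diff expose. It illustrates the concept only; it is not OptimRL's internal buffer implementation, and the class name is made up.

```python
# Generic sketch of experience replay (NOT OptimRL's internal buffer).
# A bounded deque drops the oldest transitions once capacity is reached;
# sampling uniformly from past transitions reduces the correlation between
# consecutive environment steps and lets each experience be reused many times.
import random
from collections import deque

class SketchReplayBuffer:
    def __init__(self, capacity=10_000):
        self.storage = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Draw a random mini-batch; cap at the current buffer size.
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))

    def __len__(self):
        return len(self.storage)
```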
@@ -61,95 +73,156 @@ pip install -e '.[dev]'
 
 ## ⚡ Quick Start
 
-Here’s a **minimal working example** to get started with OptimRL:
+### Discrete Action Space Example (CartPole)
 
 ```python
 import torch
-import optimrl
-
-# Initialize the GRPO optimizer
-grpo = optimrl.GRPO(epsilon=0.2, beta=0.1)
+import torch.nn as nn
+import torch.optim as optim
+import gym
+from optimrl import create_agent
 
-# Prepare batch data (example)
-batch_data = {
-    'log_probs_old': current_policy_log_probs,
-    'log_probs_ref': reference_policy_log_probs,
-    'rewards': episode_rewards,
-    'group_size': len(episode_rewards)
-}
+# Define a simple policy network
+class PolicyNetwork(nn.Module):
+    def __init__(self, input_dim, output_dim):
+        super().__init__()
+        self.network = nn.Sequential(
+            nn.Linear(input_dim, 64),
+            nn.ReLU(),
+            nn.Linear(64, output_dim),
+            nn.LogSoftmax(dim=-1)
+        )
+
+    def forward(self, x):
+        return self.network(x)
 
-# Compute policy loss
-log_probs_new = new_policy_log_probs
-loss, gradients = grpo.compute_loss(batch_data, log_probs_new)
+# Create environment and network
+env = gym.make('CartPole-v1')
+state_dim = env.observation_space.shape[0]
+action_dim = env.action_space.n
+policy = PolicyNetwork(state_dim, action_dim)
+
+# Create GRPO agent
+agent = create_agent(
+    "grpo",
+    policy_network=policy,
+    optimizer_class=optim.Adam,
+    learning_rate=0.001,
+    gamma=0.99,
+    grpo_params={"epsilon": 0.2, "beta": 0.01},
+    buffer_capacity=10000,
+    batch_size=32
+)
 
-# Apply gradients to update the policy
-optimizer.zero_grad()
-policy_loss = torch.tensor(loss, requires_grad=True)
-policy_loss.backward()
-optimizer.step()
+# Training loop
+state, _ = env.reset()
+for step in range(1000):
+    action = agent.act(state)
+    next_state, reward, done, truncated, _ = env.step(action)
+    agent.store_experience(reward, done)
+
+    if done or truncated:
+        state, _ = env.reset()
+        agent.update()  # Update policy after episode ends
+    else:
+        state = next_state
 ```
 
----
+### Complete CartPole Implementation
+
+For a complete implementation of CartPole with OptimRL, check out our examples in the `simple_test` directory:
 
-## 🔍 Advanced Usage
+- `cartpole_simple.py`: Basic implementation with GRPO
+- `cartpole_improved.py`: Improved implementation with tuned parameters
+- `cartpole_final.py`: Final implementation with optimized performance
+- `cartpole_tuned.py`: Enhanced implementation with advanced features
+- `cartpole_simple_pg.py`: Vanilla Policy Gradient implementation for comparison
 
-Integrate OptimRL seamlessly into your **PyTorch pipelines** or custom training loops. Below is a **complete example** showcasing GRPO in action:
+The vanilla policy gradient implementation (`cartpole_simple_pg.py`) achieves excellent performance on CartPole-v1, reaching the maximum reward of 500 consistently. It serves as a useful baseline for comparing against the GRPO implementations.
+
+### Continuous Action Space Example (Pendulum)
 
 ```python
 import torch
-import optimrl
-
-class PolicyNetwork(torch.nn.Module):
-    def __init__(self, input_dim, output_dim):
+import torch.nn as nn
+import torch.optim as optim
+import gym
+from optimrl import create_agent
+
+# Define a continuous policy network
+class ContinuousPolicyNetwork(nn.Module):
+    def __init__(self, input_dim, action_dim):
         super().__init__()
-        self.network = torch.nn.Sequential(
-            torch.nn.Linear(input_dim, 64),
-            torch.nn.Tanh(),
-            torch.nn.Linear(64, output_dim),
-            torch.nn.LogSoftmax(dim=-1)
+        self.shared_layers = nn.Sequential(
+            nn.Linear(input_dim, 64),
+            nn.ReLU(),
+            nn.Linear(64, 64),
+            nn.ReLU()
         )
-
+        # Output both mean and log_std for each action dimension
+        self.output_layer = nn.Linear(64, action_dim * 2)
+
     def forward(self, x):
-        return self.network(x)
-
-# Initialize components
-policy = PolicyNetwork(input_dim=4, output_dim=2)
-reference_policy = PolicyNetwork(input_dim=4, output_dim=2)
-optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
-grpo = optimrl.GRPO(epsilon=0.2, beta=0.1)
+        x = self.shared_layers(x)
+        return self.output_layer(x)
+
+# Create environment and network
+env = gym.make('Pendulum-v1')
+state_dim = env.observation_space.shape[0]
+action_dim = env.action_space.shape[0]
+action_bounds = (env.action_space.low[0], env.action_space.high[0])
+policy = ContinuousPolicyNetwork(state_dim, action_dim)
+
+# Create Continuous GRPO agent
+agent = create_agent(
+    "continuous_grpo",
+    policy_network=policy,
+    optimizer_class=optim.Adam,
+    action_dim=action_dim,
+    learning_rate=0.0005,
+    gamma=0.99,
+    grpo_params={"epsilon": 0.2, "beta": 0.01},
+    buffer_capacity=10000,
+    batch_size=64,
+    min_std=0.01,
+    action_bounds=action_bounds
+)
 
 # Training loop
-for episode in range(1000):  # Replace with your num_episodes
-    states, actions, rewards = collect_episode()  # Replace with your data
+state, _ = env.reset()
+for step in range(1000):
+    action = agent.act(state)
+    next_state, reward, done, truncated, _ = env.step(action)
+    agent.store_experience(reward, done)
 
-    # Compute log probabilities
-    with torch.no_grad():
-        log_probs_old = policy(states)
-        log_probs_ref = reference_policy(states)
-
-    batch_data = {
-        'log_probs_old': log_probs_old.numpy(),
-        'log_probs_ref': log_probs_ref.numpy(),
-        'rewards': rewards,
-        'group_size': len(rewards)
-    }
-
-    # Policy update
-    log_probs_new = policy(states)
-    loss, gradients = grpo.compute_loss(batch_data, log_probs_new.numpy())
-
-    # Backpropagation
-    optimizer.zero_grad()
-    policy_loss = torch.tensor(loss, requires_grad=True)
-    policy_loss.backward()
-    optimizer.step()
+    if done or truncated:
+        state, _ = env.reset()
+        agent.update()  # Update policy after episode ends
+    else:
+        state = next_state
 ```
 
+## 📊 Performance Comparison
+
+Our simple policy gradient implementation consistently solves the CartPole-v1 environment in under 1000 episodes, achieving the maximum reward of 500. The GRPO implementations offer competitive performance with additional benefits:
+
+- **Lower variance**: More stable learning across different random seeds
+- **Improved sample efficiency**: Learns from fewer interactions with the environment
+- **Better regularization**: Prevents policy collapse during training
+
+## Kaggle Notebook
+
+You can view the "OptimRL Trading Experiment" notebook on Kaggle:
+[![OptimRL Trading Experiment](https://img.shields.io/badge/Kaggle-OptimRL_Trading_Experiment-orange)](https://www.kaggle.com/code/noir1112/optimrl-trading-experiment/edit)
+
+Alternatively, you can open the notebook locally as an `.ipynb` file:
+[Open the OptimRL Trading Experiment Notebook (.ipynb)](./notebooks/OptimRL_Trading_Experiment.ipynb)
+
 ---
 
 ## 🤝 Contributing
 
-Were excited to have you onboard! Heres how you can help improve **OptimRL**:
+We're excited to have you onboard! Here's how you can help improve **OptimRL**:
 1. **Fork the repo.**
 2. **Create a feature branch**:
 ```bash
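
A note on the continuous-action (Pendulum) example in the hunk above: the network's head returns `action_dim * 2` values, which the agent presumably splits into a mean and a log standard deviation, while `min_std` and `action_bounds` keep exploration noise and actions in range. The snippet below is a hand-rolled sketch of that general technique; it is not OptimRL's actual sampling code, and the helper name `sample_bounded_action` is invented for illustration.

```python
# Hand-rolled sketch (not OptimRL's sampling code): turn a (mean, log_std)
# policy output into a bounded stochastic action, Pendulum-style.
import torch

def sample_bounded_action(policy_output, low=-2.0, high=2.0, min_std=0.01):
    mean, log_std = policy_output.chunk(2, dim=-1)   # split the 2*action_dim head
    std = log_std.exp().clamp(min=min_std)           # keep a floor under exploration noise
    action = torch.distributions.Normal(mean, std).sample()
    return action.clamp(low, high)                   # respect the environment's action bounds

# Example with a fake 1-D action head output [mean, log_std]:
print(sample_bounded_action(torch.tensor([0.3, -1.0])))
```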
@@ -185,7 +258,7 @@ If you use OptimRL in your research, please cite:
 ```bibtex
 @software{optimrl2024,
   title={OptimRL: Group Relative Policy Optimization},
-  author={Your Name},
+  author={Subashan Nair},
   year={2024},
   url={https://github.com/subaashnair/optimrl}
 }
@@ -194,3 +267,5 @@ If you use OptimRL in your research, please cite:
 ---
 
 
+
+
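
As background for the `grpo_params={"epsilon": 0.2, "beta": 0.01}` settings that appear throughout the updated README: assuming OptimRL follows the standard Group Relative Policy Optimization objective from the literature, `epsilon` is the clipping range on the probability ratio and `beta` weights a KL penalty against a reference policy, with advantages normalized within each group of rewards. The removed quick-start in this diff passed `log_probs_old`, `log_probs_ref`, `rewards`, and `group_size` to `compute_loss`, which is consistent with that formulation:

$$
J_{\text{GRPO}}(\theta)=\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\Big(r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right]-\beta\,D_{\mathrm{KL}}\!\big(\pi_\theta\,\|\,\pi_{\text{ref}}\big),
\quad
r_i(\theta)=\frac{\pi_\theta(a_i\mid s_i)}{\pi_{\theta_{\text{old}}}(a_i\mid s_i)},
\quad
\hat{A}_i=\frac{R_i-\operatorname{mean}(R_{1:G})}{\operatorname{std}(R_{1:G})}.
$$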

cartpole_rewards_final.png (103 KB)
cartpole_rewards_improved.png (59.3 KB)
cartpole_rewards_simple.png (53.3 KB)
cartpole_training_progress.png (98.7 KB)
cartpole_training_progress_pg.png (91.3 KB)

examples/cartpole_example.py

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
+#!/usr/bin/env python
+# Example of training a GRPO agent on the CartPole environment
+
+import torch
+import torch.nn as nn
+import torch.optim as optim
+import gym
+import numpy as np
+import matplotlib.pyplot as plt
+from optimrl import GRPO, GRPOAgent, create_agent
+
+# Define a simple policy network for CartPole
+class PolicyNetwork(nn.Module):
+    def __init__(self, input_dim, output_dim):
+        super().__init__()
+        self.network = nn.Sequential(
+            nn.Linear(input_dim, 64),
+            nn.ReLU(),
+            nn.Linear(64, 64),
+            nn.ReLU(),
+            nn.Linear(64, output_dim),
+            nn.LogSoftmax(dim=-1)
+        )
+
+    def forward(self, x):
+        return self.network(x)
+
+def train_cartpole(episodes=500, render=False):
+    # Create the CartPole environment
+    env = gym.make('CartPole-v1')
+
+    # Get environment dimensions
+    state_dim = env.observation_space.shape[0]  # 4 for CartPole
+    action_dim = env.action_space.n  # 2 for CartPole
+
+    # Create the policy network
+    policy_network = PolicyNetwork(state_dim, action_dim)
+
+    # Initialize the GRPO agent
+    agent = create_agent(
+        "grpo",
+        policy_network=policy_network,
+        optimizer_class=optim.Adam,
+        learning_rate=0.001,
+        gamma=0.99,
+        grpo_params={"epsilon": 0.2, "beta": 0.01},
+        buffer_capacity=10000,
+        batch_size=32
+    )
+
+    # Training loop
+    rewards_history = []
+
+    for episode in range(episodes):
+        state, _ = env.reset()
+        episode_reward = 0
+        done = False
+
+        while not done:
+            if render and episode % 50 == 0:
+                env.render()
+
+            # Select an action
+            action = agent.act(state)
+
+            # Take the action in the environment
+            next_state, reward, done, truncated, _ = env.step(action)
+            done = done or truncated
+
+            # Store experience and update policy
+            agent.store_experience(reward, done)
+
+            # Update state and reward
+            state = next_state
+            episode_reward += reward
+
+        # Update policy after episode ends
+        agent.update()
+
+        # Record rewards
+        rewards_history.append(episode_reward)
+
+        # Print progress
+        if (episode + 1) % 10 == 0:
+            avg_reward = np.mean(rewards_history[-10:])
+            print(f"Episode {episode + 1}/{episodes} | Avg Reward: {avg_reward:.2f}")
+
+    env.close()
+
+    # Plot rewards
+    plt.figure(figsize=(10, 6))
+    plt.plot(rewards_history)
+    plt.xlabel('Episode')
+    plt.ylabel('Total Reward')
+    plt.title('GRPO on CartPole-v1')
+    plt.savefig('cartpole_rewards.png')
+    plt.show()
+
+    return rewards_history, policy_network
+
+if __name__ == "__main__":
+    rewards, model = train_cartpole(episodes=300, render=False)
+    print("Training completed!")
+
+    # Save the trained model
+    torch.save(model.state_dict(), "cartpole_grpo_model.pt")
+    print("Model saved to cartpole_grpo_model.pt")
