
@sourcera-ai #65

Merged
merged 1 commit into main on Nov 26, 2024
Conversation

@leonvanbokhorst (Owner) commented on Nov 26, 2024

  • Add weight checkpointing for best performance
  • Implement automatic recovery with cooldown and attempt limits
  • Add adaptive noise scaling based on performance
  • Fix tensor type/device handling for state inputs
  • Add warmup period before optimization
  • Clear replay memory during recovery to avoid learning from poor experiences
  • Reduce exploration during recovery phases

The agent now maintains performance better after loading pretrained models and
recovers more gracefully from performance drops. Tensor handling is more robust
across different device types (CPU/MPS/CUDA).
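For context, here is a minimal sketch of the best-weight checkpointing idea described above (illustrative only; best_weights, actor, and critic appear in the diff excerpts below, but the method names and the deepcopy detail are assumptions, not the actual hopper.py code):

import copy

# Methods on the agent class (illustrative sketch, not the actual implementation).
def save_best_weights(self, episode_reward: float) -> None:
    """Snapshot actor/critic weights whenever a new best episode reward is reached."""
    if episode_reward > self.best_reward:
        self.best_reward = episode_reward
        self.best_weights = {
            "actor": copy.deepcopy(self.actor.state_dict()),
            "critic": copy.deepcopy(self.critic.state_dict()),
        }

def restore_best_weights(self) -> None:
    """Roll back to the last known-good weights after a performance drop."""
    if self.best_weights is not None:
        self.actor.load_state_dict(self.best_weights["actor"])
        self.critic.load_state_dict(self.best_weights["critic"])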

Summary by Sourcery

Stabilize the Hopper training by introducing a Double DQN architecture with continuous action space, experience replay with prioritization, and advanced reward shaping. Implement weight checkpointing, automatic recovery mechanisms, and adaptive noise scaling to enhance performance stability and recovery. Fix tensor handling for compatibility across different devices.

New Features:

  • Introduce a Double DQN architecture with continuous action space for the Hopper environment.
  • Implement experience replay with prioritization and advanced reward shaping for stable hopping.

Bug Fixes:

  • Fix tensor type and device handling for state inputs to ensure compatibility across CPU, MPS, and CUDA devices.

Enhancements:

  • Add weight checkpointing for best performance and implement automatic recovery with cooldown and attempt limits.
  • Introduce adaptive noise scaling based on performance and a warmup period before optimization.
  • Clear replay memory during recovery to avoid learning from poor experiences and reduce exploration during recovery phases.

sourcery-ai bot (Contributor) commented on Nov 26, 2024

Reviewer's Guide by Sourcery

This PR implements several stability improvements for the Hopper training system, focusing on preventing catastrophic forgetting and maintaining consistent performance. The implementation includes a sophisticated weight management system with checkpointing, an adaptive recovery mechanism, and improved tensor handling across different device types (CPU/MPS/CUDA).

No diagrams generated as the changes look simple and do not need a visual representation.

File-Level Changes

Implemented a weight checkpointing and recovery system (gym/hopper.py)
  • Added a best-weights saving mechanism based on episode rewards
  • Implemented weight restoration when performance drops
  • Added recovery-attempt tracking with cooldown periods
  • Implemented small noise injection into the weights during recovery to escape local optima (see the sketch after this list)

Enhanced exploration and noise handling (gym/hopper.py)
  • Added adaptive noise scaling based on the performance ratio
  • Reduced exploration during recovery phases
  • Implemented performance-based epsilon adjustment

Improved tensor handling and device compatibility (gym/hopper.py)
  • Added automatic device selection (MPS/CUDA/CPU)
  • Implemented consistent tensor type conversion and device placement
  • Added proper tensor handling in state processing

Enhanced training stability mechanisms (gym/hopper.py)
  • Added a warmup period before optimization begins
  • Implemented replay memory clearing during recovery
  • Added gradient clipping and loss scaling based on performance
  • Reduced actor update frequency during unstable periods
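A rough sketch of the noise-injection and adaptive-noise-scaling pieces referenced above (_add_noise_to_weights matches the name used later in the review; the scaling constants and the exact formula are assumptions):

import torch

def _add_noise_to_weights(self, noise_std: float = 0.01) -> None:
    """Inject small Gaussian noise into all weights to help escape a local optimum during recovery."""
    with torch.no_grad():
        for param in list(self.actor.parameters()) + list(self.critic.parameters()):
            param.add_(torch.randn_like(param) * noise_std)

def _noise_scale(self) -> float:
    """Scale exploration noise by the ratio of current to best performance (less exploration while recovering)."""
    performance_ratio = self.current_avg_reward / max(self.best_reward, 1e-8)
    # Constants are illustrative; clamp to a sensible exploration range.
    return float(max(0.05, min(0.3, 0.3 * performance_ratio)))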


@leonvanbokhorst changed the title from "fix: stabilize hopper training and prevent catastrophic forgetting" to "@sourcera-ai" on Nov 26, 2024
@sourcery-ai bot (Contributor) left a comment

Hey @leonvanbokhorst - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider making recovery thresholds and cooldown periods configurable parameters rather than hardcoded values for better tuning flexibility
Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 2 issues found
  • 🟢 Documentation: all looks good


self.alpha = 0.6 # Priority exponent
self.beta = 0.4 # Importance sampling

def push(self, experience: Experience, error: float = None):

suggestion (bug_risk): Consider making the push operation atomic to prevent deque desynchronization

If an exception occurs between appending to memory and priorities, the deques will have different lengths. Consider wrapping both operations in a try-finally block.
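One possible shape for that fix (a sketch only; it assumes memory and priorities are parallel deques and that a max-priority default is used when no error is supplied — details that may differ from the actual implementation):

def push(self, experience: Experience, error: float = None):
    # Compute the priority before mutating anything, so a failure here leaves both deques untouched.
    if error is not None:
        priority = (abs(error) + 1e-5) ** self.alpha
    else:
        priority = max(self.priorities, default=1.0)
    self.memory.append(experience)
    try:
        self.priorities.append(priority)
    except Exception:
        # Roll back the first append so memory and priorities keep the same length.
        self.memory.pop()
        raise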


class HopperAI:
    def __init__(self):
        if torch.backends.mps.is_available():

suggestion: Add error handling for MPS device initialization

MPS device initialization can fail even when available. Consider wrapping in try-except and falling back to CPU if initialization fails.

        try:
            if torch.backends.mps.is_available():
                self.device = torch.device("mps")
            elif torch.cuda.is_available():
                self.device = torch.device("cuda")
            else:
                self.device = torch.device("cpu")
        except Exception:
            # MPS can report as available but still fail to initialize; fall back to CPU.
            self.device = torch.device("cpu")

noise = torch.randn_like(action) * noise_scale
return torch.clamp(action + noise, -ACTION_HIGH, ACTION_HIGH)

def optimize_model(self):

issue (complexity): Consider breaking down the optimize_model method into smaller focused methods with single responsibilities

The optimize_model method would be clearer if split into focused methods. Consider restructuring like this:

def optimize_model(self):
    if len(self.memory) < MIN_MEMORY_SIZE:
        return

    experiences, weights, indices = self.memory.sample(BATCH_SIZE)
    state_batch, action_batch, reward_batch, next_state_batch, done_batch = self._prepare_batches(experiences)

    loss_scale = min(1.0, max(0.1, self.current_avg_reward / self.best_reward))
    td_error = self._update_critic(state_batch, action_batch, reward_batch, next_state_batch, done_batch, weights, loss_scale)

    if self.steps_done % 2 == 0:
        self._update_actor(state_batch, weights, loss_scale)

    self._soft_update_target_networks()
    self._update_priorities(indices, td_error)

def _update_critic(self, state_batch, action_batch, reward_batch, next_state_batch, done_batch, weights, loss_scale):
    self.critic_optimizer.zero_grad()
    with torch.no_grad():
        next_actions = self.actor_target(next_state_batch)
        target_Q = self.critic_target(next_state_batch, next_actions)
        target_Q = reward_batch.unsqueeze(1) + GAMMA * target_Q * (1 - done_batch).unsqueeze(1)

    current_Q = self.critic(state_batch, action_batch)
    critic_loss = (weights.unsqueeze(1) * F.mse_loss(current_Q, target_Q, reduction='none')).mean()
    (critic_loss * loss_scale).backward()
    torch.nn.utils.clip_grad_norm_(self.critic.parameters(), max_norm=0.5)
    self.critic_optimizer.step()
    return abs(target_Q - current_Q).detach()

This improves readability while maintaining functionality and performance. Each method has a single responsibility, making the code easier to understand and maintain.
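For completeness, the target-network helper referenced above could look roughly like this (a sketch assuming a standard Polyak soft update with an illustrative tau; the real hopper.py may use a TAU constant or a different value):

def _soft_update_target_networks(self, tau: float = 0.005) -> None:
    """Polyak-average the online actor/critic weights into their target networks."""
    for target_net, online_net in ((self.actor_target, self.actor), (self.critic_target, self.critic)):
        for target_param, online_param in zip(target_net.parameters(), online_net.parameters()):
            target_param.data.copy_(tau * online_param.data + (1.0 - tau) * target_param.data)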

self.actor.load_state_dict(self.best_weights['actor'])
self.critic.load_state_dict(self.best_weights['critic'])

def train(self, num_episodes: int, previous_best: float = float('-inf')) -> bool:

issue (complexity): Consider extracting recovery logic into a dedicated RecoveryManager class to improve code organization.

The recovery mechanism adds unnecessary complexity through nested conditions and state tracking. Consider extracting it into a dedicated RecoveryManager class:

class RecoveryManager:
    def __init__(self, cooldown=10, max_attempts=3, warmup=10):
        self.attempts = 0
        self.max_attempts = max_attempts
        self.last_recovery = 0
        self.cooldown = cooldown
        self.warmup = warmup

    def should_attempt_recovery(self, episode, recent_avg, best_reward):
        if episode <= self.warmup or episode - self.last_recovery <= self.cooldown:
            return False

        return (recent_avg < best_reward * 0.3 and 
                self.attempts < self.max_attempts)

    def record_recovery(self, episode):
        self.attempts += 1
        self.last_recovery = episode

    def record_success(self):
        self.attempts = max(0, self.attempts - 1)

This simplifies the train method:

def train(self, num_episodes: int, previous_best: float = float('-inf')) -> bool:
    recovery = RecoveryManager()
    # ... setup code ...

    for episode in pbar:
        # ... training loop ...

        if len(episode_rewards) > window_size:
            recent_avg = np.mean(episode_rewards[-window_size:])

            if recovery.should_attempt_recovery(episode, recent_avg, self.best_reward):
                print(f"\n⚠️ Attempting recovery ({recovery.attempts + 1}/{recovery.max_attempts})")
                self.restore_best_weights()
                self._add_noise_to_weights(0.01)
                recovery.record_recovery(episode)
                self.memory = ReplayMemory(MEMORY_SIZE)

            if episode_reward > (self.best_reward * 0.7):
                recovery.record_success()

This refactoring:

  1. Separates recovery logic from training logic
  2. Reduces nesting and state tracking complexity
  3. Makes recovery conditions and state transitions explicit
  4. Maintains all functionality while improving maintainability


# Gradient clipping and loss scaling
critic_loss_scale = min(1.0, max(0.1, self.current_avg_reward / self.best_reward))
actor_loss_scale = critic_loss_scale # Scale both losses similarly

issue (code-quality): Use previously assigned local variable (use-assigned-variable)

self.actor.load_state_dict(self.best_weights['actor'])
self.critic.load_state_dict(self.best_weights['critic'])

def train(self, num_episodes: int, previous_best: float = float('-inf')) -> bool:

issue (code-quality): The quality score for this function is below the threshold

Explanation
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity, and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early (see the brief illustration after this list).
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.
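
A generic illustration of the guard-clause point (not code from this PR; MIN_MEMORY_SIZE and warmup_steps are placeholders):

# Nested version: the main logic sits inside two levels of indentation.
def optimize_model(self):
    if len(self.memory) >= MIN_MEMORY_SIZE:
        if self.steps_done >= self.warmup_steps:
            ...  # optimization logic

# Guard-clause version: return early and keep the main path flat.
def optimize_model(self):
    if len(self.memory) < MIN_MEMORY_SIZE:
        return
    if self.steps_done < self.warmup_steps:
        return
    ...  # optimization logic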

@leonvanbokhorst merged commit 31bfef1 into main on Nov 26, 2024
1 check failed