Hardening: Prevent Gradient Spikes in Advantage Normalization #457

## Describe the Issue

The advantage normalization logic in `example_trainer/data.py` is numerically unstable for groups with low or zero reward variance.

The original code used `scores / max(scores.std(), 1e-8)`. If all rollouts in a group receive the same score (e.g., all 1.0), the standard deviation is effectively zero. Because the `1e-8` floor is absolute, dividing by a standard deviation that small can amplify tiny floating-point noise in the scores by up to eight orders of magnitude, producing advantages far larger than the underlying reward differences justify. This leads to gradient spikes, massive `grad_norm` values, and eventual training divergence (NaNs).
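The failure mode can be illustrated with a minimal sketch. It assumes the scores are mean-centered before the division; the centering step and the helper name `normalize_original` are illustrative assumptions, not code copied from `example_trainer/data.py`.

```python
import numpy as np


def normalize_original(scores, floor=1e-8):
    """Hypothetical stand-in for the pattern quoted above:
    center the scores, then divide by the std with an absolute floor."""
    scores = np.asarray(scores, dtype=np.float64)
    centered = scores - scores.mean()  # assumption: centering happens first
    return centered / max(centered.std(), floor)


# Exactly identical rewards: centered scores are all zero, so nothing blows up.
identical = normalize_original([1.0, 1.0, 1.0, 1.0])

# Rewards that differ only by noise of about 1e-7: the std is tiny but still
# above the 1e-8 floor, so the noise is rescaled to unit-sized advantages.
noisy_scores = np.array([1.0, 1.0 + 1e-7, 1.0, 1.0 - 1e-7])
noisy = normalize_original(noisy_scores)

print("identical:", identical)
print("noisy:", noisy)
```

In this sketch the exactly-uniform group is harmless, while the near-uniform group turns differences of about `1e-7` into advantages on the order of 1, which is the kind of amplification described above.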

## Environment/API Details

- **Environment Class/Name:** `example_trainer/data.py`
- **Environment Configuration:** Any configuration with `group_size > 1`.
- **API Endpoint/Method Involved:** `pad_data_to_good_offset`

## Steps to Reproduce

1. Run a training task where multiple rollouts in the same group receive identical rewards (e.g., all succeed or all fail).
2. Observe the calculated advantages in `data.py`.
3. Monitor the `grad_norm` in wandb.
4. Observe intermittent spikes in gradient magnitude that do not correspond to actual policy changes.

## Interaction Details (if applicable)

- **Expected Behavior:** 
  1. The normalization should use a magnitude-relative epsilon (e.g., `max(1e-8, 1e-4 * abs(mean))`) to ignore statistically insignificant variance.
  2. If the standard deviation is below this threshold, the advantages should be centered but not scaled.
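The expected behavior above could be sketched roughly as follows. The function name `normalize_advantages`, its parameter names, and the choice to read `mean` as the group's mean score are assumptions for illustration; this is not the actual patch.

```python
import numpy as np


def normalize_advantages(scores, abs_eps=1e-8, rel_eps=1e-4):
    """Hypothetical sketch: center always, scale only when the spread
    is significant relative to the magnitude of the scores."""
    scores = np.asarray(scores, dtype=np.float64)
    mean = scores.mean()
    centered = scores - mean
    std = centered.std()

    # Magnitude-relative threshold, as in max(1e-8, 1e-4 * abs(mean)).
    threshold = max(abs_eps, rel_eps * abs(mean))

    if std < threshold:
        # Variance is statistically insignificant: center but do not scale.
        return centered
    return centered / std


# Near-uniform group: advantages stay at the scale of the noise.
near_uniform = normalize_advantages([1.0, 1.0 + 1e-7, 1.0, 1.0 - 1e-7])

# Group with real variance: normalized to unit standard deviation as before.
varied = normalize_advantages([0.0, 1.0, 0.0, 1.0])

print("near_uniform:", near_uniform)
print("varied:", varied)
```

One design consequence of this sketch: when the group mean is zero, the threshold falls back to the absolute `1e-8` floor, so the relative term only helps for groups whose scores sit away from zero.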

## Setup Details

- **OS:** Linux
- **Python Version:** 3.10+
- **Atropos Version:** commit c20c852
- **Relevant Libraries/Versions:** `numpy`, `torch`

## Additional Context & Logs

The proposed change keeps the RL signal stable even when the environment provides sparse or uniform rewards within a group.

