Add opt-in handling for failed SkyRLGym rollouts #1641

Open
taivu1998 wants to merge 1 commit into NovaSky-AI:main from taivu1998:tdv/issue-1613-skip-failed-rollouts

Conversation

@taivu1998

Summary

Adds an opt-in generator.skip_failed_rollouts flag for non-batched SkyRLGymGenerator rollouts. When enabled, an individual failed rollout is logged and replaced with a structurally valid, loss-masked placeholder row using stop_reason="rollout_error", allowing the rest of the generation batch to complete.

Fixes #1613.

Root Cause

SkyRLGymGenerator.generate() fans out non-batched rollouts through tqdm.gather. Under normal gather semantics, the first ordinary exception from any rollout propagates and aborts the entire training step, which is brittle for flaky multi-turn agentic environments.
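
A minimal sketch of the failure mode, using plain `asyncio.gather` as a stand-in and assuming `tqdm.gather` propagates exceptions the same way. The `rollout`, `run_step`, and `step_aborts` names are hypothetical and are not part of SkyRL.

```python
import asyncio


async def rollout(i: int) -> int:
    # Hypothetical stand-in for one environment rollout; index 1 fails.
    await asyncio.sleep(0)
    if i == 1:
        raise RuntimeError(f"env {i} crashed")
    return i


async def run_step() -> list[int]:
    # With the default return_exceptions=False, the first exception
    # propagates to the caller and the successful results are never returned.
    return await asyncio.gather(*(rollout(i) for i in range(3)))


def step_aborts() -> bool:
    # Returns True when a single failed rollout aborts the whole step.
    try:
        asyncio.run(run_step())
    except RuntimeError:
        return True
    return False
```

Here two of the three rollouts succeed, yet the caller only ever sees the `RuntimeError`.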

Changes

  • Adds generator.skip_failed_rollouts: false to the Python config, default YAML, and docs.
  • Keeps existing behavior unchanged by default.
  • Rejects skip_failed_rollouts=True with batched generation, where per-row recovery is ambiguous.
  • Wraps non-batched rollouts only when the flag is enabled.
  • Re-raises asyncio.CancelledError so cancellations and interrupts still stop the step.
  • Substitutes failed rows with zero-reward, zero-loss placeholder outputs that preserve batch shape and optional rollout-logprob shape.
  • Supports step-wise rollout output with a one-step placeholder.
  • Handles mixed VLM success/failure batches by filling missing multimodal tensor features with empty tensor placeholders.
  • Adds rollout-error count/rate metrics and recomputes them correctly after generator-output concatenation.
  • Adds best-effort env cleanup on primary rollout failure points.
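
The core of the wrapping behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `RolloutRow`, `placeholder_row`, and `safe_rollout` are hypothetical names, and the real generator output carries more fields (for example optional rollout logprobs) than this simplified row.

```python
import asyncio
import logging
from dataclasses import dataclass
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)
ROLLOUT_ERROR_STOP_REASON = "rollout_error"


@dataclass
class RolloutRow:
    # Hypothetical, simplified row; the real output type has more fields.
    response_ids: list[int]
    loss_mask: list[int]
    reward: float
    stop_reason: str


def placeholder_row(pad_token_id: int = 0) -> RolloutRow:
    # One fully masked token keeps the row well-formed while contributing
    # zero reward and zero loss.
    return RolloutRow([pad_token_id], [0], 0.0, ROLLOUT_ERROR_STOP_REASON)


async def safe_rollout(fn: Callable[[], Awaitable[RolloutRow]]) -> RolloutRow:
    try:
        return await fn()
    except asyncio.CancelledError:
        # Cancellation must still stop the step. On Python 3.8+ this is a
        # BaseException, so the explicit re-raise mainly documents intent.
        raise
    except Exception:
        logger.exception("Rollout failed; substituting a placeholder row")
        return placeholder_row()


async def demo() -> list[RolloutRow]:
    async def ok() -> RolloutRow:
        return RolloutRow([5, 6], [1, 1], 1.0, "stop")

    async def bad() -> RolloutRow:
        raise RuntimeError("boom")

    return await asyncio.gather(safe_rollout(ok), safe_rollout(bad))
```

Because the wrapper returns a row instead of raising, the gather call completes and the batch keeps its original length and ordering.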

Validation

  • python3.12 -m py_compile skyrl/train/generators/skyrl_gym_generator.py skyrl/train/generators/skyrl_vlm_generator.py skyrl/train/generators/utils.py skyrl/train/config/config.py tests/train/generators/test_skyrl_gym_generator.py tests/train/generators/test_generator_output_utils.py
  • git diff --check
  • uv run --python 3.12 --with transformers --with ruff --extra dev --extra skyrl-train --isolated ruff check skyrl/train/generators/skyrl_gym_generator.py skyrl/train/generators/skyrl_vlm_generator.py skyrl/train/generators/utils.py skyrl/train/config/config.py tests/train/generators/test_skyrl_gym_generator.py tests/train/generators/test_generator_output_utils.py
  • uv run --python 3.12 --with black --extra dev --extra skyrl-train --isolated black --check --target-version py312 skyrl/train/generators/skyrl_gym_generator.py skyrl/train/generators/skyrl_vlm_generator.py skyrl/train/generators/utils.py skyrl/train/config/config.py tests/train/generators/test_skyrl_gym_generator.py tests/train/generators/test_generator_output_utils.py
  • uv run --python 3.12 --with transformers --extra dev --extra skyrl-train --isolated pytest tests/train/generators/test_skyrl_gym_generator.py tests/train/generators/test_generator_output_utils.py tests/train/generators/test_skyrl_vlm_generator.py tests/train/generators/test_utils.py tests/train/test_config.py tests/train/test_trainer_utils.py -q

The combined pytest slice reported 173 passed, 4 warnings; the warnings were pre-existing Ray/Hydra/legacy-config noise.

@taivu1998 taivu1998 marked this pull request as ready for review May 11, 2026 03:11
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces the skip_failed_rollouts feature for non-batched generation, which replaces failed rollouts with zero-reward, loss-masked placeholders to prevent training interruptions. The implementation includes enhanced error handling across SkyRLGymGenerator and SkyRLVLMGymGenerator to ensure environments are properly closed after exceptions, along with new metrics for tracking rollout error rates. Review feedback identifies a high-severity issue where vision features are lost during step-wise trajectories and suggests adding guards against potential ZeroDivisionError when calculating error metrics for empty batches.

Comment on lines +958 to +963
pixel_values = self._normalize_optional_tensor_features(
    [getattr(output, "pixel_values", None) for output in all_outputs]
)
image_grid_thw = self._normalize_optional_tensor_features(
    [getattr(output, "image_grid_thw", None) for output in all_outputs]
)

high

The collection of vision features here does not account for StepWiseOutput when step_wise_trajectories=True. In step-wise mode, output is a StepWiseOutput object which does not have a pixel_values attribute; instead, these features are stored within the individual TrajectoryOutput objects in output.step_outputs. Consequently, vision features will be lost during flattening. Additionally, the detection logic on context line 954 will fail to identify vision features in step-wise mode for the same reason.
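
One possible shape for the fix, sketched under assumptions: the dataclasses below are reduced stand-ins for `TrajectoryOutput` and `StepWiseOutput` (only the fields named in this comment), and `flatten_outputs` and `collect_feature` are hypothetical helpers, not functions in the PR.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class TrajectoryOutput:
    # Hypothetical, reduced to the two vision fields discussed here.
    pixel_values: Optional[Any] = None
    image_grid_thw: Optional[Any] = None


@dataclass
class StepWiseOutput:
    step_outputs: list[TrajectoryOutput] = field(default_factory=list)


def flatten_outputs(all_outputs: list) -> list[TrajectoryOutput]:
    # Expand step-wise outputs into their per-step trajectory outputs so
    # that feature lookup sees the objects that actually hold the tensors.
    flat: list[TrajectoryOutput] = []
    for output in all_outputs:
        steps = getattr(output, "step_outputs", None)
        flat.extend(steps if steps is not None else [output])
    return flat


def collect_feature(all_outputs: list, name: str) -> list[Optional[Any]]:
    # One entry per flattened row; None marks a row without that feature.
    return [getattr(o, name, None) for o in flatten_outputs(all_outputs)]
```

Running detection and normalization over the flattened list would cover both step-wise and regular outputs with the same code path.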

Comment on lines +986 to +989
if num_rollout_errors == len(stop_reasons):
    logger.warning(
        "All SkyRLGym rollouts in this batch failed and were replaced with loss-masked placeholders."
    )

medium

Potential ZeroDivisionError if stop_reasons is empty. Although batches are typically non-empty, it's safer to guard against this, especially since an empty batch would also trigger the "All rollouts failed" warning incorrectly.

Suggested change
- if num_rollout_errors == len(stop_reasons):
-     logger.warning(
-         "All SkyRLGym rollouts in this batch failed and were replaced with loss-masked placeholders."
-     )
+ rollout_metrics["generate/rollout_error_rate"] = num_rollout_errors / len(stop_reasons) if stop_reasons else 0.0
+ if stop_reasons and num_rollout_errors == len(stop_reasons):
+     logger.warning(
+         "All SkyRLGym rollouts in this batch failed and were replaced with loss-masked placeholders."
+     )

if result.get("stop_reasons") is not None and has_rollout_error_metric:
    num_rollout_errors = sum(reason == ROLLOUT_ERROR_STOP_REASON for reason in result["stop_reasons"])
    rollout_metrics["generate/num_rollout_errors"] = num_rollout_errors
    rollout_metrics["generate/rollout_error_rate"] = num_rollout_errors / len(result["stop_reasons"])

medium

Potential ZeroDivisionError if result["stop_reasons"] is empty. Guarding against zero length ensures robustness for empty generator outputs.

Suggested change
- rollout_metrics["generate/rollout_error_rate"] = num_rollout_errors / len(result["stop_reasons"])
+ rollout_metrics["generate/rollout_error_rate"] = num_rollout_errors / len(result["stop_reasons"]) if result["stop_reasons"] else 0.0
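
The same guard could be factored into a small helper so both call sites share it. This is a sketch only: `rollout_error_metrics` is a hypothetical function name; the metric keys and the `ROLLOUT_ERROR_STOP_REASON` constant are taken from the diff above.

```python
ROLLOUT_ERROR_STOP_REASON = "rollout_error"


def rollout_error_metrics(stop_reasons: list[str]) -> dict[str, float]:
    # Count placeholder rows and derive the rate from the full list, so the
    # result stays correct when called again on concatenated outputs.
    num_errors = sum(reason == ROLLOUT_ERROR_STOP_REASON for reason in stop_reasons)
    total = len(stop_reasons)
    return {
        "generate/num_rollout_errors": num_errors,
        # An empty batch reports a rate of 0.0 instead of dividing by zero.
        "generate/rollout_error_rate": num_errors / total if total else 0.0,
    }
```

Recomputing from the concatenated `stop_reasons` also avoids averaging per-batch rates, which would be wrong when batches differ in size.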


Development

Successfully merging this pull request may close these issues.

SkyRLGymGenerator crashes whole training step when one rollout fails
