Add opt-in handling for failed SkyRLGym rollouts by taivu1998 · Pull Request #1641 · NovaSky-AI/SkyRL

taivu1998 · 2026-05-10T10:50:18Z

Summary

Adds an opt-in generator.skip_failed_rollouts flag for non-batched SkyRLGymGenerator rollouts. When enabled, an individual failed rollout is logged and replaced with a structurally valid, loss-masked placeholder row using stop_reason="rollout_error", allowing the rest of the generation batch to complete.

Fixes #1613.

Root Cause

SkyRLGymGenerator.generate() fans out non-batched rollouts through tqdm.gather. Under normal gather semantics, the first ordinary exception from any rollout propagates and aborts the entire training step, which is brittle for flaky multi-turn agentic environments.
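The failure mode can be reproduced with plain `asyncio.gather`. This is a minimal sketch that assumes `tqdm.gather` follows the same exception semantics. The `rollout` coroutine and its failure at index 1 are hypothetical stand-ins for a flaky environment, not code from this PR.

```python
import asyncio


async def rollout(i: int) -> str:
    # Hypothetical rollout: index 1 simulates a flaky environment.
    await asyncio.sleep(0)
    if i == 1:
        raise RuntimeError(f"env failure in rollout {i}")
    return f"trajectory-{i}"


async def run_batch() -> list[str]:
    # With the default return_exceptions=False, the first exception raised by
    # any awaitable propagates to the caller, so no results are returned.
    return await asyncio.gather(*(rollout(i) for i in range(4)))


def batch_outcome() -> str:
    try:
        asyncio.run(run_batch())
    except RuntimeError as exc:
        return f"aborted: {exc}"
    return "completed"
```

Three of the four rollouts succeed here, yet the caller only sees the exception, which is the behavior the new flag is meant to make optional.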

Changes

Adds generator.skip_failed_rollouts: false to the Python config, default YAML, and docs.
Keeps existing behavior unchanged by default.
Rejects skip_failed_rollouts=True with batched generation, where per-row recovery is ambiguous.
Wraps non-batched rollouts only when the flag is enabled.
Re-raises asyncio.CancelledError so cancellations and interrupts still stop the step.
Substitutes failed rows with zero-reward, zero-loss placeholder outputs that preserve batch shape and optional rollout-logprob shape.
Supports step-wise rollout output with a one-step placeholder.
Handles mixed VLM success/failure batches by filling missing multimodal tensor features with empty tensor placeholders.
Adds rollout-error count/rate metrics and recomputes them correctly after generator-output concatenation.
Adds best-effort env cleanup on primary rollout failure points.
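The wrapping and substitution behavior listed above could look roughly like the sketch below. It is illustrative only: `RolloutRow`, `placeholder_row`, `safe_rollout`, and the single padded token are assumptions made for this example, and the real generator output carries more fields. Only the `"rollout_error"` stop reason and the re-raise of `asyncio.CancelledError` come from the description.

```python
import asyncio
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

ROLLOUT_ERROR_STOP_REASON = "rollout_error"


@dataclass
class RolloutRow:
    # Hypothetical row shape used only for this sketch.
    response_ids: list[int]
    loss_mask: list[int]
    reward: float
    stop_reason: str


def placeholder_row(pad_token_id: int = 0) -> RolloutRow:
    # One padded token with loss_mask 0, so the row adds nothing to the loss.
    return RolloutRow([pad_token_id], [0], 0.0, ROLLOUT_ERROR_STOP_REASON)


async def safe_rollout(coro) -> RolloutRow:
    try:
        return await coro
    except asyncio.CancelledError:
        # Cancellation must still stop the step. On Python 3.8+ this is a
        # BaseException and would bypass `except Exception` anyway; the
        # explicit clause documents the intent.
        raise
    except Exception:
        logger.exception("Rollout failed; substituting a placeholder row.")
        return placeholder_row()


async def rollout(i: int) -> RolloutRow:
    # Hypothetical rollout: index 2 simulates an environment failure.
    if i == 2:
        raise ValueError("boom")
    return RolloutRow([i + 1], [1], 1.0, "stop")


async def generate(n: int) -> list[RolloutRow]:
    return await asyncio.gather(*(safe_rollout(rollout(i)) for i in range(n)))
```

Because every failed rollout still yields exactly one row, the batch keeps its shape and downstream code can identify placeholders by their stop reason.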

Validation

python3.12 -m py_compile skyrl/train/generators/skyrl_gym_generator.py skyrl/train/generators/skyrl_vlm_generator.py skyrl/train/generators/utils.py skyrl/train/config/config.py tests/train/generators/test_skyrl_gym_generator.py tests/train/generators/test_generator_output_utils.py
git diff --check
uv run --python 3.12 --with transformers --with ruff --extra dev --extra skyrl-train --isolated ruff check skyrl/train/generators/skyrl_gym_generator.py skyrl/train/generators/skyrl_vlm_generator.py skyrl/train/generators/utils.py skyrl/train/config/config.py tests/train/generators/test_skyrl_gym_generator.py tests/train/generators/test_generator_output_utils.py
uv run --python 3.12 --with black --extra dev --extra skyrl-train --isolated black --check --target-version py312 skyrl/train/generators/skyrl_gym_generator.py skyrl/train/generators/skyrl_vlm_generator.py skyrl/train/generators/utils.py skyrl/train/config/config.py tests/train/generators/test_skyrl_gym_generator.py tests/train/generators/test_generator_output_utils.py
uv run --python 3.12 --with transformers --extra dev --extra skyrl-train --isolated pytest tests/train/generators/test_skyrl_gym_generator.py tests/train/generators/test_generator_output_utils.py tests/train/generators/test_skyrl_vlm_generator.py tests/train/generators/test_utils.py tests/train/test_config.py tests/train/test_trainer_utils.py -q

The combined pytest slice finished with 173 passed, 4 warnings; the warnings were existing Ray/Hydra/legacy-config noise.

gemini-code-assist

Code Review

This pull request introduces the skip_failed_rollouts feature for non-batched generation, which replaces failed rollouts with zero-reward, loss-masked placeholders to prevent training interruptions. The implementation includes enhanced error handling across SkyRLGymGenerator and SkyRLVLMGymGenerator to ensure environments are properly closed after exceptions, along with new metrics for tracking rollout error rates. Review feedback identifies a high-severity issue where vision features are lost during step-wise trajectories and suggests adding guards against potential ZeroDivisionError when calculating error metrics for empty batches.

gemini-code-assist · 2026-05-11T03:14:01Z

+            pixel_values = self._normalize_optional_tensor_features(
+                [getattr(output, "pixel_values", None) for output in all_outputs]
+            )
+            image_grid_thw = self._normalize_optional_tensor_features(
+                [getattr(output, "image_grid_thw", None) for output in all_outputs]
+            )


The collection of vision features here does not account for StepWiseOutput when step_wise_trajectories=True. In step-wise mode, output is a StepWiseOutput object which does not have a pixel_values attribute; instead, these features are stored within the individual TrajectoryOutput objects in output.step_outputs. Consequently, vision features will be lost during flattening. Additionally, the detection logic on context line 954 will fail to identify vision features in step-wise mode for the same reason.
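One possible way to address this is to descend into `step_outputs` when collecting each feature. The sketch below is a guess at the shape of a fix, not the PR's code: the two dataclasses are stand-ins that carry only the attributes named in this comment, `collect_feature` is a hypothetical helper, and it assumes flattening produces one row per step.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class TrajectoryOutput:
    # Stand-in for the real class; only the fields mentioned in the review.
    pixel_values: Optional[Any] = None
    image_grid_thw: Optional[Any] = None


@dataclass
class StepWiseOutput:
    # Stand-in: per-step outputs hold the vision features.
    step_outputs: list[TrajectoryOutput] = field(default_factory=list)


def collect_feature(all_outputs: list, name: str) -> list:
    """Return one feature entry per flattened row, descending into steps."""
    features = []
    for output in all_outputs:
        steps = getattr(output, "step_outputs", None)
        # A step-wise output contributes one row per step; a plain
        # trajectory output contributes itself.
        rows = steps if steps is not None else [output]
        features.extend(getattr(row, name, None) for row in rows)
    return features
```

The returned list could then be passed to `_normalize_optional_tensor_features` in place of the direct `getattr` comprehension, and the same helper could back the vision-feature detection check.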

gemini-code-assist · 2026-05-11T03:14:01Z

+            if num_rollout_errors == len(stop_reasons):
+                logger.warning(
+                    "All SkyRLGym rollouts in this batch failed and were replaced with loss-masked placeholders."
+                )


Potential ZeroDivisionError if stop_reasons is empty. Although batches are typically non-empty, it's safer to guard against this, especially since an empty batch would also trigger the "All rollouts failed" warning incorrectly.

Suggested change

-            if num_rollout_errors == len(stop_reasons):
-                logger.warning(
-                    "All SkyRLGym rollouts in this batch failed and were replaced with loss-masked placeholders."
-                )
+            rollout_metrics["generate/rollout_error_rate"] = num_rollout_errors / len(stop_reasons) if stop_reasons else 0.0
+            if stop_reasons and num_rollout_errors == len(stop_reasons):
+                logger.warning(
+                    "All SkyRLGym rollouts in this batch failed and were replaced with loss-masked placeholders."
+                )

gemini-code-assist · 2026-05-11T03:14:01Z

+    if result.get("stop_reasons") is not None and has_rollout_error_metric:
+        num_rollout_errors = sum(reason == ROLLOUT_ERROR_STOP_REASON for reason in result["stop_reasons"])
+        rollout_metrics["generate/num_rollout_errors"] = num_rollout_errors
+        rollout_metrics["generate/rollout_error_rate"] = num_rollout_errors / len(result["stop_reasons"])


Potential ZeroDivisionError if result["stop_reasons"] is empty. Guarding against zero length ensures robustness for empty generator outputs.

Suggested change

-        rollout_metrics["generate/rollout_error_rate"] = num_rollout_errors / len(result["stop_reasons"])
+        rollout_metrics["generate/rollout_error_rate"] = num_rollout_errors / len(result["stop_reasons"]) if result["stop_reasons"] else 0.0
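Both guards can be expressed as one small helper so the empty-batch case is handled in a single place. This is a sketch under assumptions: `rollout_error_metrics` and `all_rollouts_failed` are hypothetical names, while the metric keys and `ROLLOUT_ERROR_STOP_REASON` mirror the diff above.

```python
ROLLOUT_ERROR_STOP_REASON = "rollout_error"


def rollout_error_metrics(stop_reasons: list[str]) -> dict[str, float]:
    # Count placeholder rows, then derive the rate without dividing by zero.
    num_errors = sum(reason == ROLLOUT_ERROR_STOP_REASON for reason in stop_reasons)
    total = len(stop_reasons)
    return {
        "generate/num_rollout_errors": num_errors,
        "generate/rollout_error_rate": num_errors / total if total else 0.0,
    }


def all_rollouts_failed(stop_reasons: list[str]) -> bool:
    # An empty batch must not count as "all failed".
    return bool(stop_reasons) and all(
        reason == ROLLOUT_ERROR_STOP_REASON for reason in stop_reasons
    )
```

Recomputing from the concatenated `stop_reasons` rather than averaging per-batch rates keeps the rate correct when the merged outputs have different sizes.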

Add skip failed rollout handling

50b4466

taivu1998 marked this pull request as ready for review May 11, 2026 03:11

gemini-code-assist Bot reviewed May 11, 2026

taivu1998 wants to merge 1 commit into NovaSky-AI:main from taivu1998:tdv/issue-1613-skip-failed-rollouts
