[worktrial] Taste reward shaping #1618
Conversation
Add fleet_task environment that integrates Fleet-hosted tasks with SkyRL via OpenEnv's FleetTaskEnv abstraction layer. Supports multi-turn tool-use and computer-use (multimodal) modalities.
- FleetTaskEnv(BaseTextEnv): provisions Fleet env, multi-turn episodes, reward via verifier, partial reward support, hint augmentation
- Tool call parser: handles <tool_call>/<function_call> tag formats with JSON repair for missing closing braces
- Multimodal observations: returns image_url content blocks for CUA, compatible with upstream's extract_images_from_conversation()
- Per-env metrics aggregation with environment breakdown
- Context management integration for long trajectories
- Trace upload support for eval telemetry
Co-Authored-By: Claude Opus 4.6 <[email protected]>
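The tool call parser described above handles both tag formats and repairs JSON bodies whose closing braces were cut off by truncation. A minimal sketch (function name and repair strategy are illustrative, not the actual implementation):

```python
import json
import re

def parse_tool_call(text: str):
    """Extract the first <tool_call> or <function_call> block and parse
    its JSON body, repairing missing closing braces if needed."""
    m = re.search(r"<(tool_call|function_call)>\s*(.*?)\s*(?:</\1>|$)",
                  text, re.DOTALL)
    if not m:
        return None
    body = m.group(2)
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # Naive repair: append closing braces until the JSON balances.
        missing = body.count("{") - body.count("}")
        if missing > 0:
            try:
                return json.loads(body + "}" * missing)
            except json.JSONDecodeError:
                return None
        return None
```

The fallback `|$` in the regex lets the parser recover a block whose closing tag was truncated along with the braces.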
…onfigs
Port Fleet-specific training infrastructure from fork to fresh SkyRL-v2:
Entrypoints:
- main_fleet.py: GRPO training on Fleet-hosted envs with S3 checkpoints
- main_task_gen.py: Task generation training entrypoint
- main_fleet_tinker.py: Tinker-based training with Fleet envs (LoRA, async)
Dataset & Checkpoints:
- prepare_dataset.py: Convert Fleet task JSON to SkyRL parquet format (stratified split, dedup, env capping, difficulty filtering)
- s3_checkpoints.py: Async S3 upload, cross-VM resume, local cleanup
- export_tasks.py: CLI to export tasks from Fleet API
Training Scripts:
- fleet-common-setup.sh: Shared setup (deps, OpenEnv, dataset download)
- fleet-common-run.sh: Multi-node Ray cluster + training launch
- fleet-35b-run.sh: Qwen3.5-35B config (TP=2, multi-node)
- fleet-qwen35-extra-setup.sh: Qwen3.5 deps (transformers 5.3, flash-attn)
- fleet-task-gen-run.sh: Task generation config
SkyPilot YAML Configs:
- openenv-fleet-grpo-qwen3_5-35b.yaml: 2-node H200 training
- task-gen-grpo-qwen3_5-9b.yaml: Single-node task gen
Also adds fleet_task and task_gen config to skyrl_gym_config/default.yaml.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Port the task generation environment from fleet-ai/SkyRL that enables RL-based training of task-generating models. The environment supports multi-turn task generation where the model generates (prompt, verifier) pairs that are evaluated via Fleet harness rollouts.
Key components:
- TaskGenEnv(BaseTextEnv): Multi-turn env with tool-based DB exploration, task generation, and reward computation via variance + hint gap
- VerifierSandbox: AST-based static analysis for generated verifier code safety (blocked imports/builtins, complexity bounds, signature checks)
- Tool call parser: Handles <tool_call>/<function_call> tag formats
Reward formula: R = gate * (base_quality + alpha * var(raw_scores) + hint_gap)
Depends on PR #2 (fleet/training) for integrations.fleet.task_gen_reward.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
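The reward formula above can be sketched in a few lines. This is an illustration only: the actual definition lives in `integrations.fleet.task_gen_reward`, and the interpretation of `hint_gap` as the mean improvement of hinted rollouts over raw rollouts is an assumption:

```python
from statistics import pvariance

def task_gen_reward(raw_scores, hinted_scores, base_quality, gate, alpha=1.0):
    """Sketch of R = gate * (base_quality + alpha * var(raw_scores) + hint_gap).

    hint_gap is assumed here to be mean(hinted) - mean(raw); the real
    definition is in integrations.fleet.task_gen_reward.
    """
    hint_gap = (sum(hinted_scores) / len(hinted_scores)
                - sum(raw_scores) / len(raw_scores))
    return gate * (base_quality + alpha * pvariance(raw_scores) + hint_gap)
```

The variance term rewards tasks whose rollouts disagree (a usable GRPO signal), while the gate zeroes out tasks that fail the quality checks entirely.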
When all raw rollout samples for a prompt score 0, hint augmentation generates additional rollouts with verifier feedback injected into the prompt. This rescues GRPO signal for otherwise dead prompts.
Key components:
- _run_hint_augmentation() in SkyRLGymGenerator: groups outputs by instance_id, identifies failing prompts, builds hint text from verifier ERROR/SUCCESS_ACCUMULATOR, launches hinted rollouts
- RLTF-SD: replaces hinted prompt_ids with the original unhinted prompt_ids so the model learns to produce hint-quality outputs from the original prompt alone (grad log pi(y_hint | x_0), not grad log pi(y_hint | x_0 + hint))
- First-turn baseline in compute_grpo_outcome_advantage: when is_hinted is present, computes group mean/std from raw samples only, preventing hinted samples from contaminating the GRPO baseline
- Metrics: hint/total_hinted_rollouts, hint/hint_success_rate, hint/prompts_hinted, hint/signal_rescued
Config: enable_hints, hint_reward_threshold, n_hint_samples in the fleet_task section of skyrl_gym_config. Only runs during training (not eval), only for non-step-wise trajectories, and only when fleet_task.enable_hints=true.
Depends on PR #1 (fleet/task-env) for FleetTaskEnv.build_hint_text().
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Port vision-language model support from SkyRL v1 (feat/vl-support-clean) to SkyRL-v2's architecture: - Generator: VL-aware chat template, image accumulation across turns, multi_modal_data construction for vLLM - Engine pipeline: thread multi_modal_data through preprocess/generate in both sync and async vLLM engines - Fleet env: Qwen coordinate adaptation ([0,1000] <-> pixel), initial screenshot capture, computer_use browser hints, done signal detection - Utilities: image extraction, base64 decode, processor loading, VL chat template with proper vision token expansion - New VL run script and SkyPilot YAML for CUA training - Update existing YAMLs to use fleet/all branch Co-Authored-By: Claude Opus 4.6 <[email protected]>
RunPod/Lambda/Nebius/Vast were all out of H200 capacity. Add GCP spot with proper NVIDIA 570 driver image. Co-Authored-By: Claude Opus 4.6 <[email protected]>
SkyRL-v2 pyproject.toml defines 'fsdp' extra (includes vllm, flash-attn, torch, flashinfer) but not a standalone 'vllm' extra. The old SkyRL had 'vllm' as a separate extra. Co-Authored-By: Claude Opus 4.6 <[email protected]>
uv pip install silently fails to build causal-conv1d CUDA extension (reports "Checked 1 package" but module is not importable). Use pip with --no-build-isolation to ensure it finds torch from the venv. Co-Authored-By: Claude Opus 4.6 <[email protected]>
In SkyRL-v2, scripts/ is directly under repo root (not nested under skyrl-train/). Changed cd from "../.." to ".." so the run scripts correctly resolve the repo root directory. Co-Authored-By: Claude Opus 4.6 <[email protected]>
TrainerConfig: loss_chunk_size, use_hybrid_env_sampling, min_samples_per_env
GeneratorConfig: inject_context_status, context_warning_threshold, trajectory_timeout_seconds
SkyRL-v2's strict Hydra config rejects unknown keys (no + prefix), so these must be defined in the dataclass and YAML defaults.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
The fleet entrypoints use @hydra.main which loads the legacy YAML directly, but validate_cfg expects generator.inference_engine.* (the new structured format). Apply translate_legacy_config to convert flat generator.* keys before validation. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The legacy YAML has flat generator.* keys (e.g. generator.backend) but validate_cfg expects generator.inference_engine.* with all fields including distributed_executor_backend. Add the full inference_engine section with defaults so all fields are present after Hydra loads the config and translate_legacy_config moves CLI overrides into it. Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Remove explicit fleet_task register() call from main_fleet.py since skyrl_gym.envs.__init__ already auto-registers it
- Remove --data-dir-name task_gen from task-gen run script so it uses the default MODALITY-based path (matching setup's download path)
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Replace the OmegaConf.create() approach (loses dataclass type info) with an in-place sync of flat generator.* CLI overrides into the structured generator.inference_engine section. This preserves the Hydra DictConfig and avoids TypeError on dataclasses.asdict().
- Remove --skip-prepare from task-gen YAML so parquet files are generated
- Remove duplicate fleet_task registration (auto-registered by __init__)
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Hydra entrypoints pass DictConfig (not dataclass instances), so dataclasses.asdict() fails. Fall back to OmegaConf.to_yaml() for DictConfig objects. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Hydra's @hydra.main produces DictConfig objects, but the codebase expects typed dataclass instances (asdict(), attribute access, etc.). Switch Fleet entrypoints to use SkyRLTrainConfig.from_cli_overrides(), which produces proper typed dataclasses via the legacy config translation path.
- Add fleet_task/task_gen as Optional[Dict] fields on SkyRLGymConfig
- Strip ++/+ Hydra prefixes from CLI args before from_cli_overrides
- Remove _sync_legacy_generator_to_inference_engine (legacy path handles it)
Co-Authored-By: Claude Opus 4.6 <[email protected]>
accelerate 1.12.0 passes param.__dict__ (which includes transformers 5.3.0's _is_hf_initialized flag) to Parameter.__new__() during init_empty_weights. PyTorch 2.10.0 rejects this unknown kwarg. Newer accelerate filters it. Co-Authored-By: Claude Opus 4.6 <[email protected]>
uv pip install -U accelerate pulls newer torch with CUDA 13.0, breaking torchvision (CUDA 12.8). Use pip install --no-deps instead to upgrade only accelerate without re-resolving transitive dependencies. Co-Authored-By: Claude Opus 4.6 <[email protected]>
accelerate's init_empty_weights passes param.__dict__ to Parameter() which includes _is_hf_initialized (set by transformers 5.x). torch 2.10 rejects this unknown kwarg. Patch Parameter.__new__ in fsdp_utils.py to filter it out. Revert accelerate upgrade attempt (latest is 1.13.0, still has the same issue). Co-Authored-By: Claude Opus 4.6 <[email protected]>
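The pattern behind the fsdp_utils.py patch is a wrapper around `__new__` that strips the offending kwarg before the real constructor sees it. The real patch targets `torch.nn.Parameter.__new__`; the sketch below demonstrates the same filtering on a pure-Python stand-in class (so it runs without torch installed):

```python
class FakeParameter:
    """Stand-in for torch.nn.Parameter: like torch 2.10, its __new__
    rejects unknown keyword arguments."""
    def __new__(cls, data=None, requires_grad=True):
        obj = super().__new__(cls)
        obj.data, obj.requires_grad = data, requires_grad
        return obj

_orig_new = FakeParameter.__new__

def _filtered_new(cls, *args, **kwargs):
    # Drop the flag transformers 5.x stores in param.__dict__ and
    # accelerate 1.12 forwards into Parameter.__new__() during
    # init_empty_weights.
    kwargs.pop("_is_hf_initialized", None)
    return _orig_new(cls, *args, **kwargs)

# Monkeypatch, as fsdp_utils.py does for torch.nn.Parameter.
FakeParameter.__new__ = _filtered_new
```

Without the patch, passing `_is_hf_initialized=True` would raise `TypeError: __new__() got an unexpected keyword argument`; with it, the flag is silently discarded.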
- config.py: Always use the legacy config path in from_cli_overrides so flat keys (generator.backend etc.) are properly translated via translate_legacy_config. Fixes VL/35B ValueError on GeneratorConfig.
- prepare_dataset.py: Add --env-class CLI arg (fleet_task|task_gen) to set per-record env_class in parquet data. Previously hardcoded to fleet_task, causing task_gen training to create FleetTaskEnv (which requires tasks_file).
- fleet-common-setup.sh: Accept --env-class and pass it to prepare_dataset.
- task-gen YAML: Pass --env-class task_gen in the setup block.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- task_gen_env.py: Default ROLLOUT_DIR to ~/rollouts instead of /workspace/rollouts. /workspace doesn't exist on GCP (only RunPod), causing PermissionError.
- config.py: Disable the OmegaConf struct flag on the base config before merging CLI overrides. Empty dicts in YAML (like chat_template_kwargs: {}) are loaded as closed structs, rejecting new keys during merge.
- config.py: Add try/except around asdict() in get_config_as_yaml_str to handle edge cases where asdict() fails on Ray-serialized dataclasses.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
… logging
FLEET_API_KEY was not being propagated to Ray workers via runtime_env, causing task_gen's import_single_task to fail with an empty API key. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The dataset prepare step stores the environment name as 'data_source' column, but TaskGenEnv.__init__ only looked for 'env_key'. This caused all import_single_task calls to use env_id='unknown', which fails with "Environment 'unknown' not found" from Fleet API. Co-Authored-By: Claude Opus 4.6 <[email protected]>
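The fix amounts to checking both column names before giving up. A sketch of the lookup (function name is illustrative; in the PR this lives in `TaskGenEnv.__init__`):

```python
def resolve_env_id(extras: dict) -> str:
    """The dataset prepare step writes the environment name into the
    'data_source' column, while older records used 'env_key'. Check
    both before falling back to 'unknown'."""
    for key in ("env_key", "data_source"):
        value = extras.get(key)
        if value:
            return value
    return "unknown"
```

With only `env_key` consulted, every record produced by the current prepare step resolved to `'unknown'` and the Fleet API rejected the import.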
expandable_segments:True in PYTORCH_CUDA_ALLOC_CONF is incompatible with vLLM's CuMemAllocator, causing AssertionError during model load. The 9B script already had this flag; the 35B was missing it. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Fleet env returns list-format content (from OpenEnv multimodal observations) that text-only templates like Qwen3.5-35B-A3B can't handle. This converts list content (strings or image_url dicts) to plain text before applying the chat template, preventing jinja2 TemplateError on non-VL models. Co-Authored-By: Claude Opus 4.6 <[email protected]>
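The conversion described above can be sketched as a pass over the message list that joins string parts and text blocks and replaces image blocks with a placeholder (the placeholder text is an assumption; the real code may drop images instead):

```python
def flatten_content(messages):
    """Convert OpenEnv list-format message content (strings plus
    text/image_url dicts) to plain text so text-only chat templates
    don't raise a jinja2 TemplateError."""
    out = []
    for msg in messages:
        content = msg["content"]
        if isinstance(content, list):
            parts = []
            for item in content:
                if isinstance(item, str):
                    parts.append(item)
                elif item.get("type") == "text":
                    parts.append(item.get("text", ""))
                elif item.get("type") == "image_url":
                    parts.append("[screenshot omitted]")
            content = "\n".join(parts)
        out.append({**msg, "content": content})
    return out
```

VL models keep the list form (their templates expand image blocks into vision tokens); only text-only templates like Qwen3.5-35B-A3B's need the flattened string.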
Hint augmentation extends trajectory_ids in generator_input in-place, but the separate uids variable in the trainer was never updated. This caused IndexError in postprocess_generator_output when uids had fewer entries than rewards (128 raw + N hinted rewards vs 128 uids). Co-Authored-By: Claude Opus 4.6 <[email protected]>
When a Fleet environment fails to provision (e.g., list_tools timeout), return a zero-reward trajectory instead of propagating the exception through tqdm.gather and crashing the entire training step. This makes training resilient to transient Fleet API / MCP failures. Co-Authored-By: Claude Opus 4.6 <[email protected]>
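The resilience pattern is a try/except around each episode coroutine that converts any exception into a zero-reward trajectory. A sketch, assuming a hypothetical `run_episode` coroutine and illustrative fallback fields:

```python
import asyncio

async def safe_rollout(run_episode, task, max_turns=50):
    """Wrap one episode so a Fleet provisioning failure (e.g. a
    list_tools timeout) yields a zero-reward trajectory instead of
    propagating through the gather and crashing the training step."""
    try:
        return await run_episode(task, max_turns=max_turns)
    except Exception as exc:  # transient Fleet API / MCP failures
        return {"reward": 0.0, "responses": [], "error": str(exc)}
```

Because the failed trajectory still carries a reward of 0.0, it flows through advantage computation like any other sample rather than aborting the batch.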
Need step 0 eval to measure training improvement vs base model. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
upload_eval_results_to_s3 was defined but never called from the trainer. Eval results were dumped to local disk only and lost when clusters terminated. Now uploads to s3://skyrl-trajectories/evals/ after every eval (both eval_before_train and periodic). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
multi-node resume
fix: multi-node checkpoint save & resume for FSDP training
- lr 1e-6 → 5e-7: prevent entropy collapse (v0 collapsed at step 3)
- max_turns 50 → 64: more headroom for browser workflows
- eval_before_train false: skip 11h initial eval (have step 0 baseline)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Steps with ~95K response length produce grad_norm=NaN with SDPA, causing entropy collapse on the next step. Reducing to 72K matches the stable 35B parity run. Sequences truncate earlier, avoiding the NaN gradient region. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
This reverts commit 6a5a81a.
grad_norm=NaN on every step with padded response_length > 90K. Capping at 80K to stay below the NaN threshold while keeping more context than the 72K 35B config. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
80K still produced NaN grad_norm at 74K response_length. The threshold is between 63K (fine) and 74K (NaN). Dropping to 64K.
zero_variance_filter=true drops prompts where all 4 rollouts get the same reward (no GRPO learning signal). With shorter context, more trajectories will truncate → more zero-reward prompts → filter them. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
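The zero-variance filter is simple to state in code: a prompt group survives only if its rollouts received at least two distinct rewards. A sketch of the idea (the real filter operates on trainer batch structures, not a dict):

```python
def zero_variance_filter(rewards_by_prompt):
    """Drop prompts where all rollouts received the same reward: the
    GRPO advantage is identically zero for such groups, so they carry
    no learning signal."""
    return {prompt: rewards
            for prompt, rewards in rewards_by_prompt.items()
            if len(set(rewards)) > 1}
```

Note the filter removes all-success groups as well as all-failure ones; both have zero within-group variance.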
Ensures minimum representation from each environment per batch. With 26 envs and batch_size=16, each batch gets at least 1 sample from every env (if min_samples_per_env=1 and enough envs fit). Remaining slots are filled proportionally by dataset size.
Previously the config fields existed but nothing read them: sampling was purely proportional, so small envs (rops-mail, 93 tasks) got zero samples while large zero-reward envs (zillow 1000, ticketmaster 1000) dominated batches.
Ported from fleet-ai/SkyRL-archived skyrl_train/utils/trainer_utils.py
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
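The allocation the sampler performs can be sketched as: reserve the minimum for every env, then split the leftover slots proportionally to dataset size, handing out rounding remainders by largest fractional share. This is an illustration of the scheme described above, not the ported trainer_utils code (the tie-breaking rule is an assumption):

```python
def allocate_batch(env_sizes, batch_size, min_samples_per_env=1):
    """Give every env at least min_samples_per_env slots, then fill
    the remainder proportionally to dataset size."""
    alloc = {env: min_samples_per_env for env in env_sizes}
    remaining = batch_size - sum(alloc.values())
    assert remaining >= 0, "batch too small to cover every env"
    total = sum(env_sizes.values())
    shares = {env: remaining * size / total for env, size in env_sizes.items()}
    for env in env_sizes:
        alloc[env] += int(shares[env])
    # Hand out the last few slots by largest fractional share.
    leftover = batch_size - sum(alloc.values())
    by_frac = sorted(shares, key=lambda e: shares[e] - int(shares[e]),
                     reverse=True)
    for env in by_frac[:leftover]:
        alloc[env] += 1
    return alloc
```

With purely proportional sampling, a 93-task env competing against two 1000-task envs gets an expected 0.7 slots in a 16-prompt batch; the minimum guarantee makes its inclusion deterministic.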
- num_nodes: 1 → 2 (16 H200 GPUs, 16 inference engines)
- train_batch_size: 16 → 50 (50 prompts × 4 samples = 200 trajectories/step)
- min_samples_per_env: 1 → 2 (guarantees 2 prompts from each of 25 envs per batch)
- policy_mini_batch_size: 16 → 50
With HybridEnvSampler, every batch covers all 25 envs with at least 2 samples each. Previously 11/25 envs were never sampled in 20 steps.
Co-authored-by: Deniz <[email protected]>
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Adds `integrations/fleet/entrypoints/main_eval.py` (FleetEvalExp), a
sibling to FleetPPOExp that mirrors the S3 download / FSDP weight load /
inference-engine sync path but skips the training loop. Calls
`trainer.eval()` once; the trainer's existing dump_eval_results path
handles S3 upload of the eval results.
Why: until now, replaying a checkpoint to get an extra independent eval
required launching `main_fleet` with `eval_before_train=true`, which then
proceeded into the full training loop and burned ~50min of GPU per
cluster after the eval was already done. The new entrypoint keeps the
exact same eval contract (same cfg, same trainer.eval(), same S3 upload
prefix), just without the trailing train loop.
Also updates `scripts/fleet-eval-only-run.sh` to drive the new
entrypoint, with RESUME_RUN_NAME / RESUME_CKPT_PATH / RESUME_MODE env
vars (auto-defaults RESUME_MODE=latest when RESUME_RUN_NAME is set,
none otherwise so the same script works for base-model evals).
Adds unit tests covering all branches of `_load_policy_only`
(NONE / LATEST without marker / LATEST with valid marker /
FROM_PATH / FROM_PATH missing dir / FROM_PATH invalid dir name).
Run: uv run --extra dev --extra skyrl-train pytest \
integrations/fleet/tests/test_main_eval.py
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
feat: Fleet eval-only entrypoint with S3 checkpoint resume
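The RESUME_MODE auto-defaulting described above can be sketched in POSIX shell (a sketch of the logic in `scripts/fleet-eval-only-run.sh`, using the variable names from the commit message):

```shell
#!/bin/sh
# Auto-default RESUME_MODE: "latest" when a resume run name is given,
# "none" otherwise, so the same script also works for base-model evals.
if [ -z "${RESUME_MODE:-}" ]; then
  if [ -n "${RESUME_RUN_NAME:-}" ]; then
    RESUME_MODE=latest
  else
    RESUME_MODE=none
  fi
fi
echo "RESUME_MODE=$RESUME_MODE"
```

An explicit RESUME_MODE (e.g. RESUME_MODE=from_path with RESUME_CKPT_PATH set) always wins, since the default only fires when the variable is empty.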
Switch from any_of to ordered in all 4 task YAMLs. Ordering: RunPod reserved H200s → GKE/GCP → other providers. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
NaN grad_norm at long sequences is safely handled by the optimizer (skips step, zeros grads) — no weight corruption. 80K allows browser_use trajectories to complete without truncation for envs that need 50+ turns. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
v6-v2 showed 70% of trajectories hit the 64-turn ceiling at step 1 while using only 28-70% of context. Turn limit, not context length, is the binding constraint. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Propagate accumulated_images from AgentLoopState through TrajectoryOutput to GeneratorOutput (multi_modal_data field was defined but never set)
- Save PIL images as JPEG alongside trajectory JSONL in dumped_trajectories/global_step_N_images/
- Store image_paths and num_screenshots in trajectory entries
- URLs are stored as-is (not downloaded during training)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Extends screenshot saving from training dumps to eval dumps (dump_per_dataset_eval_results). Eval JSONL entries now include image_paths and num_screenshots when VL images are available. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Code Review
This pull request introduces extensive Fleet-specific enhancements to the SkyRL-v2 framework, primarily focusing on multi-node FSDP2 training stability, memory optimizations for 35B models via chunked loss computation, and integrated S3 checkpoint management. It also implements multimodal support for Vision-Language tasks and a 'Taste-Judge' gated reward system. The review feedback identifies several documentation issues, such as hardcoded user paths and outdated line number references, which hinder reproducibility. Furthermore, the reviewer suggests clarifying environment variable requirements in the training scripts to better reflect when certain keys, like the OpenRouter API key, are actually mandatory.
```sh
git apply /Users/alliegu/Desktop/fleet/integration/env.py.diff

# 2. Vendor the taste-judge package into the workdir Python path.
cp -r /Users/alliegu/Desktop/fleet/integration/skyrl_taste skyrl-gym/skyrl_taste
cp -r /Users/alliegu/Desktop/fleet/research/judge research/judge

# 3. Drop the new YAML config into tasks/.
cp /Users/alliegu/Desktop/fleet/integration/configs/openenv-fleet-grpo-vl-taste.yaml \
   tasks/openenv-fleet-grpo-vl-taste.yaml
```
The launch and rollback instructions include hardcoded, user-specific absolute paths (e.g., /Users/alliegu/...). This prevents other developers from being able to follow these instructions directly. Please replace these with relative paths from the repository root or use placeholders like <path-to-repo> to make the documentation reproducible.
Repo: `https://github.com/fleet-ai/skyrl-fleet` (cloned to `/tmp/skyrl-fleet-2` in sandbox; `git clone` into `/sessions/.../outputs` failed because the existing mount blocked write to `.git/`, so we cloned to `/tmp/skyrl-fleet-2`).

The `skyrl-train` package has been merged into `skyrl/` (per `skyrl-train/README.md`). Modern code paths live under `skyrl/train/...`.

---

## Reward emit point

The Fleet env returns reward in **two places**, both in `skyrl-gym/skyrl_gym/envs/fleet_task/env.py`:

### Per-step reward — `step_async()` returns
File: `skyrl-gym/skyrl_gym/envs/fleet_task/env.py`

The reward is initialized to `0.0` at line **552**, populated from OpenEnv at lines **590–592** and **615–617**, and finally emitted on the `BaseTextEnvStepOutput` returns at lines **674, 708, 762**.
This documentation contains a few issues that hinder its usability:
- Hardcoded Paths: The document refers to a hardcoded temporary path (`/tmp/skyrl-fleet-2`), which is specific to a particular development session. Please use relative paths from the repository root to make the documentation more general and easier for others to follow.
- Outdated Line Numbers: The line numbers referenced for `skyrl-gym/skyrl_gym/envs/fleet_task/env.py` (e.g., 552, 590, 674) are incorrect and do not match the code being added in this pull request. Please update these references to reflect the current state of the file.
```sh
# Single source of truth for Qwen3.5-35B-A3B GRPO training config.
# Called by the SkyPilot YAML and by fleet-research run.sh.
#
# Required env vars: FLEET_API_KEY, WANDB_API_KEY, OPENROUTER_API_KEY
```
The comment states that OPENROUTER_API_KEY is a required environment variable. However, the script provides a default empty value for it on line 33, and the comment on line 32 clarifies it's only needed when enable_hints=true. Since hints are disabled for this configuration (line 41), this variable is not actually required. Please update the comment on line 5 to reflect that OPENROUTER_API_KEY is optional to avoid confusion.
```sh
: "${FLEET_API_KEY:?set FLEET_API_KEY}"
: "${WANDB_API_KEY:?set WANDB_API_KEY}"
: "${OPENROUTER_API_KEY:?set OPENROUTER_API_KEY}"
: "${AWS_ACCESS_KEY_ID:?set AWS_ACCESS_KEY_ID}"
: "${AWS_SECRET_ACCESS_KEY:?set AWS_SECRET_ACCESS_KEY}"
```
The script correctly checks for required environment variables using VAR:?message. However, it uses this for OPENROUTER_API_KEY, which is only needed if hint synthesis is enabled in the underlying run script. If hint synthesis is disabled by default in the training configurations this script launches, consider making this check conditional or providing a default empty value, similar to how it's handled in fleet-35b-run.sh.
Force-pushed from bc98899 to 90c528d (compare).
Force-pushed from 191bac3 to 4751143 (compare).
Cherry-picks the taste judge integration from the original taste-reward-shaping branch onto current main, which includes all fleet scripts and the async env wrapper fix for MCP transport errors.
- skyrl_taste/ package: async judge wrapper with provider routing
- env.py: taste_floor config, _apply_taste_reward gating at episode end
- YAML: single-node H200, corrected workdir URL to skyrl-fleet
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
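The `taste_floor` gating at episode end can be sketched as follows. This is a guess at the shape of `_apply_taste_reward` based only on the commit message: the exact gating and scaling behavior is an assumption.

```python
def apply_taste_reward(base_reward, taste_score, taste_floor=0.5):
    """Gate the episode reward on a taste-judge score: below the
    configured taste_floor the trajectory earns nothing; above it,
    the base reward passes through scaled by the judge score.
    (Hypothetical sketch of _apply_taste_reward.)"""
    if taste_score < taste_floor:
        return 0.0
    return base_reward * taste_score
```

The floor turns the judge into a hard filter on low-taste trajectories while still letting the judge score shape rewards above the threshold.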
Force-pushed from 4751143 to 9e9f648 (compare).
No description provided.