[ci] add SFT Megatron Tulu3 E2E test

SumanthRH · claude · SumanthRH · commit be1d20820063 · 2026-05-14T01:16:12.000Z
Adds a nightly E2E CI job that exercises the full SFT pipeline against
the Megatron backend on the Tulu3 dataset, mirroring the structure of
the existing GSM8K GPU E2E jobs.

* Backend: Megatron with TP=1, PP=1 on L4_ci (4 GPUs).
* Workload: Qwen/Qwen2.5-0.5B-Instruct, 100 steps, train[:2000] from
  allenai/tulu-3-sft-mixture, batch_size=8, lr=1e-4 (bumped from the
  source script's 1e-6 so the run produces a downward trend in 100
  steps), train_on_what=all_assistant_messages.
* Assertions: exit code 0, "SFT training complete!" appears in stdout,
  and, checked against the run's wandb logs: final step >= 100, no
  nan/inf loss values, and mean of the last 5 logged losses is less
  than the mean of the first 5 (lenient windowed trend check, no
  magnitude thresholds).
* Logger: wandb (project=skyrl_sft_ci), reusing the existing
  WANDB_API_KEY secret injection pattern.
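
The loss assertions listed above can be sketched as follows. This is a
minimal illustration that assumes the losses are already available as a
plain list of floats; `loss_trend_ok` is a hypothetical helper name used
only here, and the real check pulls its values from wandb.

```python
import math


def loss_trend_ok(losses, window=5):
    """Return True if the loss history is finite and trends downward.

    Hypothetical helper for illustration only. It compares the mean of the
    first `window` losses against the mean of the last `window` losses.
    """
    # Need enough rows for two non-overlapping windows.
    if len(losses) < 2 * window:
        return False
    # Any NaN or inf value fails the check outright.
    if not all(math.isfinite(value) for value in losses):
        return False
    first_mean = sum(losses[:window]) / window
    last_mean = sum(losses[-window:]) / window
    # Strict inequality: a flat loss curve does not pass.
    return last_mean < first_mean


if __name__ == "__main__":
    decreasing = [2.0 - 0.05 * step for step in range(20)]
    print(loss_trend_ok(decreasing))  # True
```

Averaging over windows rather than comparing single steps absorbs
step-to-step noise, which is why no magnitude threshold is needed.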

Files:
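  - tests/train/gpu_e2e_test/sft_tulu3_megatron.sh: driver + completion-marker check
  - tests/train/gpu_e2e_test/check_sft_trend.py: wandb-side loss health/trend assertions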
  - ci/gpu_e2e_test_run_sft.sh: anyscale-job entrypoint
  - ci/anyscale_gpu_e2e_test_sft.yaml: anyscale job spec (megatron image)
  - .github/workflows/gpu_e2e_ci_sft.yaml: nightly GitHub Actions workflow

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/.github/workflows/gpu_e2e_ci_sft.yaml b/.github/workflows/gpu_e2e_ci_sft.yaml
@@ -0,0 +1,47 @@
+name: SkyRL-GPU-E2E-CI-SFT
+
+on:
+  schedule:
+    - cron: '5 8 * * *'   # Every day at 08:05 UTC (~00:05 PST / ~01:05 PDT)
+  workflow_dispatch:
+
+permissions:
+  checks: write   # for status checks to appear
+  contents: read
+
+jobs:
+
+  skyrl_gpu_e2e_test_sft:
+    runs-on: ubuntu-latest
+    defaults:
+      run:
+        shell: bash
+        working-directory: .
+
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        # This is the version of the action for setting up Python, not the Python version.
+        uses: actions/setup-python@v5
+        with:
+          # Semantic version range syntax or exact version of a Python version
+          python-version: '3.12'
+          cache: 'pip'
+      - name: Install the latest version of uv
+        uses: astral-sh/setup-uv@v6
+        with:
+          activate-environment: true
+      - name: Install basic dependencies
+        run: uv pip install anyscale==0.24.79 typer==0.9.0
+      - name: Install envsubst
+        run: sudo apt-get update && sudo apt-get install -y gettext-base
+      - name: Basic convergence test
+        env:
+          ANYSCALE_CLI_TOKEN: ${{ secrets.ANYSCALE_CLI_TOKEN }}
+          ANYSCALE_HOST: https://console.anyscale.com
+          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
+        run: |
+          envsubst < ci/anyscale_gpu_e2e_test_sft.yaml > ci/anyscale_gpu_e2e_test_sft_envsubst.yaml
+          anyscale job submit -f ci/anyscale_gpu_e2e_test_sft_envsubst.yaml --timeout 4500
+          anyscale job wait --cloud sky-anyscale-aws-us-east-1 --name skyrl-train-gpu-e2e-test-sft --timeout 4500
+          rm -f ci/anyscale_gpu_e2e_test_sft_envsubst.yaml
diff --git a/ci/anyscale_gpu_e2e_test_sft.yaml b/ci/anyscale_gpu_e2e_test_sft.yaml
@@ -0,0 +1,11 @@
+name: skyrl-train-gpu-e2e-test-sft
+entrypoint: bash ci/gpu_e2e_test_run_sft.sh
+image_uri: novaskyai/skyrl-train-ray-2.51.1-py3.12-cu12.8-megatron # (Optional) Exclusive with `containerfile`.
+cloud: sky-anyscale-aws-us-east-1
+ray_version: "2.51.1"
+compute_config: l4_ci
+working_dir: . # (Optional) Use current working directory "." as the working_dir. Can be any local path or remote .zip file in cloud storage.
+env_vars:
+  RAY_OVERRIDE_JOB_RUNTIME_ENV: "1"
+  WANDB_API_KEY: $WANDB_API_KEY
+max_retries: 1 # (Optional) Maximum number of times the job will be retried before being marked failed. Defaults to `1`.
diff --git a/ci/gpu_e2e_test_run_sft.sh b/ci/gpu_e2e_test_run_sft.sh
@@ -0,0 +1,4 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+bash tests/train/gpu_e2e_test/sft_tulu3_megatron.sh
diff --git a/tests/train/gpu_e2e_test/check_sft_trend.py b/tests/train/gpu_e2e_test/check_sft_trend.py
@@ -0,0 +1,128 @@
+"""Trend/health check for an SFT CI run, sourced from wandb.
+
+Replaces the bash stdout-parsing block in ``sft_tulu3_megatron.sh``:
+pulls the run's logged ``train/loss`` history from wandb and asserts on it.
+
+Checks performed (any failure exits non-zero):
+  * Run exists in the given project (matched by display name; most recent wins).
+  * At least ``--min_steps`` ``train/loss`` rows are logged
+    (defaults to ``2 * window``, i.e. enough for non-overlapping windows).
+  * No NaN/inf in the logged loss history.
+  * ``mean(last N losses) < mean(first N losses)`` where N is ``--window``.
+  * Optionally: the run's final ``_step`` >= ``--expected_steps`` (skipped if
+    ``--expected_steps`` is not provided).
+
+The first 4 checks are CI-critical; the last is opt-in because some callers
+don't know the exact step count up front.
+"""
+
+import argparse
+import math
+import sys
+
+import wandb
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    parser.add_argument("--run_name", type=str, required=True, help="wandb run display name")
+    parser.add_argument("--project_name", type=str, required=True, help="wandb project name")
+    parser.add_argument(
+        "--entity",
+        type=str,
+        default=None,
+        help="wandb entity. If omitted, project_name is passed as-is "
+        "(matching the convention used by get_summary.py).",
+    )
+    parser.add_argument(
+        "--metric",
+        type=str,
+        default="train/loss",
+        help="History metric to pull (default: train/loss).",
+    )
+    parser.add_argument(
+        "--window",
+        type=int,
+        default=5,
+        help="Window size N for the first-vs-last mean comparison (default: 5).",
+    )
+    parser.add_argument(
+        "--min_steps",
+        type=int,
+        default=None,
+        help="Minimum number of logged loss rows required. Defaults to 2 * window.",
+    )
+    parser.add_argument(
+        "--expected_steps",
+        type=int,
+        default=None,
+        help="If set, assert the run's final _step is >= this value (completion check).",
+    )
+    return parser.parse_args()
+
+
+def main() -> int:
+    args = parse_args()
+    min_steps = args.min_steps if args.min_steps is not None else 2 * args.window
+    project_path = f"{args.entity}/{args.project_name}" if args.entity else args.project_name
+
+    api = wandb.Api()
+    runs = api.runs(project_path, filters={"display_name": args.run_name}, order="-created_at")
+    matched_run = next(iter(runs), None)
+    if matched_run is None:
+        print(f"FAIL: run '{args.run_name}' not found in project '{project_path}'", file=sys.stderr)
+        return 1
+    print(f"Matched run: id={matched_run.id} state={matched_run.state} url={matched_run.url}")
+
+    # Pull the full loss history. scan_history streams every logged row (vs.
+    # the sampled 500-point default from .history()).
+    rows = list(matched_run.scan_history(keys=[args.metric]))
+    losses = [row[args.metric] for row in rows if args.metric in row]
+    print(f"Pulled {len(losses)} '{args.metric}' rows from wandb history.")
+
+    # ---- Completion check (optional) ----
+    if args.expected_steps is not None:
+        final_step = matched_run.summary_metrics.get("_step")
+        if final_step is None or final_step < args.expected_steps:
+            print(
+                f"FAIL: run final _step={final_step} < expected_steps={args.expected_steps}",
+                file=sys.stderr,
+            )
+            return 1
+        print(f"PASS: run completed (final _step={final_step} >= {args.expected_steps}).")
+
+    # ---- Minimum-rows check ----
+    if len(losses) < min_steps:
+        print(
+            f"FAIL: only {len(losses)} '{args.metric}' rows, need at least {min_steps} "
+            f"(window={args.window}) for windowed trend check",
+            file=sys.stderr,
+        )
+        return 1
+
+    # ---- NaN/inf check ----
+    bad = [(i, v) for i, v in enumerate(losses) if not isinstance(v, (int, float)) or not math.isfinite(v)]
+    if bad:
+        print(f"FAIL: non-finite '{args.metric}' values detected: {bad[:5]}", file=sys.stderr)
+        return 1
+    print(f"PASS: no NaN/inf in '{args.metric}' history.")
+
+    # ---- Windowed trend check ----
+    n = args.window
+    first_mean = sum(losses[:n]) / n
+    last_mean = sum(losses[-n:]) / n
+    print(f"Mean of first {n} losses: {first_mean:.6f}; Mean of last {n} losses: {last_mean:.6f}")
+    if not (last_mean < first_mean):
+        print(
+            f"FAIL: mean of last {n} losses ({last_mean:.6f}) is not < " f"mean of first {n} ({first_mean:.6f})",
+            file=sys.stderr,
+        )
+        return 1
+    print(f"PASS: loss trend check (mean last {n} < mean first {n}).")
+
+    print("All SFT CI assertions passed.")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/tests/train/gpu_e2e_test/sft_tulu3_megatron.sh b/tests/train/gpu_e2e_test/sft_tulu3_megatron.sh
@@ -0,0 +1,82 @@
+#!/usr/bin/env bash
+# E2E CI test for SFT with the Megatron backend on Tulu3.
+#
+# Runs ``examples/train/sft/run_sft_megatron_tulu3_50k.sh`` with shorter
+# overrides (100 steps, train[:2000]) and asserts:
+#   * Process exits 0.
+#   * "SFT training complete!" appears in stdout.
+#   * Via ``check_sft_trend.py`` (sourcing the run's history from wandb):
+#       - The run completed all expected steps.
+#       - No NaN/inf in the ``train/loss`` history.
+#       - Mean of the last 5 logged losses is strictly less than the mean of the
+#         first 5 (lenient trend check averaged over windows to absorb step
+#         noise; no magnitude thresholds).
+#
+# Logger is wandb so that the run is visible alongside other CI runs and
+# downstream assertions can introspect the run's logged history directly.
+set -euo pipefail
+
+SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
+RUN_NAME="sft_megatron_run_$(date +%Y%m%d%H)"
+PROJECT_NAME="skyrl_sft_ci"
+ENTITY="sky-posttraining-uc-berkeley"
+NUM_STEPS=100
+LOG_FILE="${LOG_FILE:-/tmp/${RUN_NAME}.log}"
+
+# The anyscale job's working_dir is the repo root, so we can use relative paths.
+# We pipe through `tee` so the full stdout is mirrored to ``$LOG_FILE`` for
+# the downstream completion-marker check (loss assertions come from wandb).
+#
+# Notes on overrides vs the source script:
+#   * lr is bumped from 1e-6 to 1e-4 so the model produces a clear downward
+#     trend in 100 steps; the source script's 1e-6 is calibrated for 4166 steps.
+#   * batch_size=8, micro_train_batch_size_per_gpu=2 are sized for L4_ci (4 GPUs).
+bash examples/train/sft/run_sft_megatron_tulu3_50k.sh \
+  num_steps="$NUM_STEPS" \
+  dataset_split="train[:2000]" \
+  batch_size=8 \
+  micro_train_batch_size_per_gpu=2 \
+  max_length=1024 \
+  model.path=Qwen/Qwen2.5-0.5B-Instruct \
+  optimizer_config.lr=1e-4 \
+  placement.num_nodes=1 \
+  placement.num_gpus_per_node=4 \
+  megatron_config.tensor_model_parallel_size=1 \
+  megatron_config.pipeline_model_parallel_size=1 \
+  megatron_config.context_parallel_size=1 \
+  train_on_what="all_assistant_messages" \
+  logger=wandb \
+  project_name="$PROJECT_NAME" \
+  run_name="$RUN_NAME" \
+  ckpt_path="" \
+  ckpt_interval=0 \
+  hf_save_interval=0 \
+  resume_from="" \
+  2>&1 | tee "$LOG_FILE"
+
+# `set -o pipefail` ensures the failure of the training command propagates
+# through the `tee` pipeline, so by the time we get here the training run
+# itself succeeded (exit code 0).
+
+# ---- Completion marker (stdout-side, cheap sanity check) ----
+# Confirms the trainer reached its final print before exiting. The wandb-side
+# completion/trend/nan-inf assertions follow.
+if ! grep -q "SFT training complete!" "$LOG_FILE"; then
+  echo "FAIL: 'SFT training complete!' not found in $LOG_FILE"
+  exit 1
+fi
+echo "PASS: 'SFT training complete!' marker found."
+
+# ---- Wandb-side assertions ----
+# Pulls the run's logged ``train/loss`` history and asserts:
+#   * final _step >= NUM_STEPS (completion),
+#   * no NaN/inf in the history,
+#   * mean(last 5) < mean(first 5) (lenient windowed trend).
+uv run --isolated --extra fsdp "$SCRIPT_DIR/check_sft_trend.py" \
+  --run_name "$RUN_NAME" \
+  --project_name "$PROJECT_NAME" \
+  --entity "$ENTITY" \
+  --window 5 \
+  --expected_steps "$NUM_STEPS"
+
+echo "All SFT CI assertions passed."