1 | | -# Remote Execution Notes |
| 1 | +# Agent Guide: sgl-jax Data Parallelism |
2 | 2 |
3 | | -- Do not prepend Sky commands with: |
4 | | - `env -u ALL_PROXY -u all_proxy -u HTTPS_PROXY -u https_proxy -u HTTP_PROXY -u http_proxy` |
5 | | -- Use plain `sky status`, `sky exec`, `sky logs`, etc. |
6 | | -- Active TPU cluster for this task: |
7 | | - `tpu-tpu-v6e-4-pr-scheduler-mixin-v-19684` |
| 3 | +This document guides code agents working on the data-parallelism feature for the sgl-jax project. All agents work in **isolation** on separate git worktrees and submit individual PRs to a shared integration branch. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## Project Background |
| 8 | + |
| 9 | +### Goal |
| 10 | + |
| 11 | +Rebase the sgl-jax data-parallelism (DP) implementation and fix its compatibility issues so that it can be merged into `main`.
| 12 | + |
| 13 | +The DP feature has been developed on a long-lived branch and has accumulated significant drift from `main` — including rebase conflicts, API incompatibilities, and test failures. The objective of this effort is to **fix these issues one by one** so all changes can be merged cleanly. |
| 14 | + |
| 15 | +### Repository |
| 16 | + |
| 17 | +- **Project**: sgl-jax — JAX backend for SGLang |
| 18 | +- **Main source**: `python/sgl_jax/`, core serving runtime under `python/sgl_jax/srt/` |
| 19 | +- **Tests**: `test/` directory; test suite entry point: `test/srt/run_suite.py` |
| 20 | +- **Integration branch**: `feat/data-parallelism` — all agent PRs target this branch |
| 21 | +- **End goal**: `feat/data-parallelism` passes CI and merges into `main` |
| 22 | + |
| 23 | +### Key Constraints |
| 24 | + |
| 25 | +- **TPU required**: All JAX tests must run on a remote TPU cluster. Never run JAX/TPU tests locally. |
| 26 | +- **Remote execution**: Use the `sglang-jax-skypilot-dev` skill for all remote test and debug sessions. |
| 27 | +- **Sky commands**: Do not wrap Sky commands in an `env -u <PROXY_VAR>` prefix or otherwise set or unset proxy variables for them. Run `sky exec`, `sky status`, `sky logs`, etc. directly.
| 28 | + |
| 29 | +### DP Architecture Overview |
| 30 | + |
| 31 | +Core components of data-parallelism in sgl-jax: |
| 32 | + |
| 33 | +| Component | Path | Description | |
| 34 | +|-----------|------|-------------| |
| 35 | +| Scheduler | `python/sgl_jax/srt/managers/` | DP-aware request scheduling | |
| 36 | +| Allocator | `python/sgl_jax/srt/mem_cache/` | Memory allocation across DP ranks | |
| 37 | +| Radix Cache | `python/sgl_jax/srt/mem_cache/` | DP-safe KV cache management | |
| 38 | +| Control Plane | `python/sgl_jax/srt/` | Communication and coordination between DP ranks | |
| 39 | + |
| 40 | +--- |
| 41 | + |
| 42 | +## SOP |
| 43 | + |
| 44 | +Follow these phases **in order**. Do not skip ahead. |
| 45 | + |
| 46 | +### Phase 0 — Receive Task |
| 47 | + |
| 48 | +Before touching any code: |
| 49 | + |
| 50 | +- [ ] Confirm from your task description: **feature name**, **integration branch name**, **task type** (bugfix / feature), **test mode** (test file / service debug) |
| 51 | +- [ ] If any of the above is missing or ambiguous → **stop and report immediately**. Do not guess. |
| 52 | + |
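The Phase 0 checklist can be expressed as a small validation step. The sketch below is illustrative only: `TaskDescription` and `missing_or_ambiguous` are hypothetical names, not part of sgl-jax, and the accepted values are taken from the checklist wording above.

```python
from dataclasses import dataclass

# Accepted values, copied from the Phase 0 checklist.
TASK_TYPES = {"bugfix", "feature"}
TEST_MODES = {"test file", "service debug"}


@dataclass
class TaskDescription:
    """Hypothetical container for the four Phase 0 fields."""

    feature_name: str = ""
    integration_branch: str = ""
    task_type: str = ""
    test_mode: str = ""


def missing_or_ambiguous(task: TaskDescription) -> list[str]:
    """Return the names of fields that are absent or hold an unknown value."""
    problems = []
    if not task.feature_name.strip():
        problems.append("feature_name")
    if not task.integration_branch.strip():
        problems.append("integration_branch")
    if task.task_type not in TASK_TYPES:
        problems.append("task_type")
    if task.test_mode not in TEST_MODES:
        problems.append("test_mode")
    return problems
```

A non-empty result corresponds to the "stop and report immediately" case: the agent reports the listed fields instead of guessing them.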
| 53 | +--- |
| 54 | + |
| 55 | +### Phase 1 — Set Up Worktree |
| 56 | + |
| 57 | +Create an isolated worktree based on the integration branch. **All subsequent work happens inside this worktree only.** |
| 58 | + |
| 59 | +```bash |
| 60 | +# Create the worktree and the working branch in one step; -b avoids
| 61 | +# failing when <integration-branch> is already checked out elsewhere
| 62 | +git worktree add -b <feature-name> .worktrees/<feature-name> <integration-branch>
| 63 | +cd .worktrees/<feature-name>
| 64 | +``` |
| 65 | + |
| 66 | +Rules: |
| 67 | +- Never modify the main working directory or any other worktree. |
| 68 | +- Never push directly to the integration branch or `main`.
| 69 | + |
| 70 | +--- |
| 71 | + |
| 72 | +### Phase 2 — Analyze the Problem |
| 73 | + |
| 74 | +Before writing any code, produce a short written analysis (this will become the PR description draft): |
| 75 | + |
| 76 | +**For bugfix:** |
| 77 | +- Root cause of the bug |
| 78 | +- Affected modules / files |
| 79 | +- Expected behavior after the fix |
| 80 | + |
| 81 | +**For feature development:** |
| 82 | +- Functional boundary: what is in scope, what is out of scope |
| 83 | +- Affected modules / files |
| 84 | +- Expected behavior / interface |
| 85 | + |
| 86 | +If during analysis you find you need to modify files that belong to another agent's functional area → **stop and report**. Do not modify those files. |
| 87 | + |
| 88 | +--- |
| 89 | + |
| 90 | +### Phase 3 — Write Tests First |
| 91 | + |
| 92 | +Before implementing, write the tests that define success. Tests must **fail** (red) at this point. |
| 93 | + |
| 94 | +**Test File mode:** |
| 95 | +- Write or update test files under `test/` |
| 96 | +- Run the tests on the remote TPU cluster and confirm they fail |
| 97 | +- Commit the failing tests |
| 98 | + |
| 99 | +**Service Debug mode:** |
| 100 | +- Write the client script (benchmark / debug / accuracy test) that will be run against the live server |
| 101 | +- Document the expected output or pass criteria |
| 102 | +- Commit the client script |
| 103 | + |
| 104 | +Use the `sglang-jax-skypilot-dev` skill for all remote execution. |
| 105 | + |
| 106 | +If the test infrastructure itself is broken or unclear → **stop and report**. |
| 107 | + |
| 108 | +--- |
| 109 | + |
| 110 | +### Phase 4 — Implement |
| 111 | + |
| 112 | +Write the implementation to make the tests pass. |
| 113 | + |
| 114 | +Rules: |
| 115 | +- Only modify files within your functional area (identified in Phase 2) |
| 116 | +- Follow project code style: lazy log formatting (`logger.info("msg %s", var)`), Ruff-compliant code |
| 117 | +- If you discover a necessary change is outside your functional area → **stop and report** |
| 118 | + |
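As an illustration of the lazy log formatting rule, here is a minimal sketch using the standard `logging` module. The logger name and the `report_rank` function are hypothetical and exist only for this example.

```python
import logging

logger = logging.getLogger("sgl_jax.example")  # hypothetical logger name


def report_rank(dp_rank: int, num_requests: int) -> None:
    """Log a scheduling summary using lazy %-style arguments."""
    # Lazy: the arguments are interpolated only if the INFO record is emitted.
    logger.info("dp_rank=%s scheduled %s requests", dp_rank, num_requests)
    # Avoid: an f-string is formatted eagerly, even when the record is dropped,
    # e.g. logger.info(f"dp_rank={dp_rank} scheduled {num_requests} requests")
```

Passing the values as arguments keeps the formatting cost off the hot path when the log level filters the message out.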
| 119 | +Commit incrementally with clear messages. |
| 120 | + |
| 121 | +--- |
| 122 | + |
| 123 | +### Phase 5 — Verify |
| 124 | + |
| 125 | +Run your tests on the remote TPU cluster and confirm they pass (green). |
| 126 | + |
| 127 | +**Test File mode:** |
| 128 | +```bash |
| 129 | +# Via sglang-jax-skypilot-dev skill — SSH into cluster, then: |
| 130 | +uv run --extra tpu python -m pytest test/srt/<your_test_file.py> -v |
| 131 | +``` |
| 132 | + |
| 133 | +**Service Debug mode:** |
| 134 | +```bash |
| 135 | +# Via sglang-jax-skypilot-dev skill — SSH into cluster, then: |
| 136 | +# tmux session "server": start the service |
| 137 | +# tmux session "client": run the client/benchmark/accuracy test after service is ready |
| 138 | +``` |
| 139 | + |
| 140 | +Do not proceed to Phase 6 until all your tests are green. |
| 141 | + |
| 142 | +--- |
| 143 | + |
| 144 | +### Phase 6 — Submit PR |
| 145 | + |
| 146 | +Open a PR from your working branch targeting the **integration branch**. |
| 147 | + |
| 148 | +**PR title format:** `[DP] <feature-name>: <one-line description>` |
| 149 | + |
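The title format can be checked mechanically before opening the PR. The following is a sketch under stated assumptions: `format_pr_title` and `is_valid_pr_title` are hypothetical helpers, and the pattern assumes the feature name contains no whitespace or colon (as it would if it doubles as the branch name).

```python
import re

# Mirrors the stated format: [DP] <feature-name>: <one-line description>
# Assumption: the feature name has no whitespace or colon.
PR_TITLE_RE = re.compile(r"\[DP\] (?P<feature>[^\s:]+): (?P<summary>\S.*)")


def format_pr_title(feature_name: str, summary: str) -> str:
    """Build a title, collapsing whitespace so the description is one line."""
    one_line = " ".join(summary.split())
    return f"[DP] {feature_name}: {one_line}"


def is_valid_pr_title(title: str) -> bool:
    """Return True if the whole title matches the expected format."""
    return PR_TITLE_RE.fullmatch(title) is not None
```

`fullmatch` is used so that a title containing a line break is rejected rather than matched up to the first line.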
| 150 | +**PR description must include:** |
| 151 | + |
| 152 | +```markdown
| 153 | +## Problem Analysis |
| 154 | +<Root cause (bugfix) or functional boundary (feature)> |
| 155 | +
| 156 | +## Changes |
| 157 | +<List of modified files and what changed in each> |
| 158 | +
| 159 | +## Test Results |
| 160 | +Test mode: [test file | service debug] |
| 161 | +Command: <exact command used> |
| 162 | +Result: <pass/fail counts or benchmark output summary> |
| 163 | +``` |
| 164 | + |
| 165 | +- If all tests pass → open as a **ready-for-review** PR |
| 166 | +- If there are unresolved blockers → open as a **Draft PR** and add a comment explaining the blocker (what the problem is, which phase you are stuck in) |
| 167 | + |
| 168 | +--- |
| 169 | + |
| 170 | +### Blocker Protocol |
| 171 | + |
| 172 | +At **any** phase, if you encounter a blocker: |
| 173 | + |
| 174 | +1. **Stop immediately** — do not attempt workarounds |
| 175 | +2. Commit your current state with message: `WIP: blocked on <description>` |
| 176 | +3. Open a Draft PR (or add a comment to an existing one) with: |
| 177 | + - Phase you are stuck in |
| 178 | + - Exact error or conflict |
| 179 | + - What you had already tried before hitting the blocker
| 180 | +4. Wait for human intervention |
| 181 | + |
| 182 | +--- |
| 183 | + |
| 184 | +## Acceptance Criteria |
| 185 | + |
| 186 | +A PR is ready to merge into the integration branch when **all** of the following are true: |
| 187 | + |
| 188 | +| Criterion | Requirement | |
| 189 | +|-----------|-------------| |
| 190 | +| Linter | `ruff check` passes with no errors | |
| 191 | +| Code hygiene | No unresolved TODO/FIXME in newly added code | |
| 192 | +| Tests (test file mode) | All specified test files pass; no new failures introduced | |
| 193 | +| Tests (service debug mode) | Service starts successfully; client/benchmark/accuracy test completes with results meeting expected criteria | |
| 194 | +| PR description | Problem analysis, file list, and test output summary all present | |
| 195 | +| Branch state | Based on integration branch, no unresolved merge conflicts | |
| 196 | +| Commit history | No stray debug commits; history is clean or squashed | |
| 197 | +| Blockers | No open blockers; a PR with an open blocker stays a Draft, with the blocker described in a comment |
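
The code hygiene criterion applies only to newly added code, so one way to check it is to scan the added lines of a unified diff. This is a minimal sketch, not project tooling: `find_new_markers` is a hypothetical helper, and the diff text is assumed to come from something like `git diff <integration-branch>...HEAD`.

```python
import re

MARKER_RE = re.compile(r"\b(TODO|FIXME)\b")


def find_new_markers(diff_text: str) -> list[str]:
    """Return added lines of a unified diff that contain TODO or FIXME."""
    hits = []
    for line in diff_text.splitlines():
        # '+' marks an added line; '+++' is the file header and is skipped.
        # Simplification: an added line whose content starts with '++' is
        # also skipped by this check.
        if line.startswith("+") and not line.startswith("+++"):
            if MARKER_RE.search(line):
                hits.append(line[1:].strip())
    return hits
```

Removed and context lines are ignored, so markers that already existed on the integration branch do not count against the PR.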