Commit e3b40ff

agent markdown
1 parent 920baf7 commit e3b40ff

3 files changed

Lines changed: 202 additions & 7 deletions

agent.md

Lines changed: 196 additions & 6 deletions
@@ -1,7 +1,197 @@
-# Remote Execution Notes
+# Agent Guide: sgl-jax Data Parallelism
 
-- Do not prepend Sky commands with:
-  `env -u ALL_PROXY -u all_proxy -u HTTPS_PROXY -u https_proxy -u HTTP_PROXY -u http_proxy`
-- Use plain `sky status`, `sky exec`, `sky logs`, etc.
-- Active TPU cluster for this task:
-  `tpu-tpu-v6e-4-pr-scheduler-mixin-v-19684`
+This document guides code agents working on the data-parallelism feature for the sgl-jax project. All agents work in **isolation** on separate git worktrees and submit individual PRs to a shared integration branch.
+
+---
+
+## Project Background
+
+### Goal
+
+Rebase and fix compatibility issues in the sgl-jax data-parallelism (DP) implementation so it can be merged into `main`.
+
+The DP feature has been developed on a long-lived branch and has accumulated significant drift from `main`, including rebase conflicts, API incompatibilities, and test failures. The objective of this effort is to **fix these issues one by one** so all changes can be merged cleanly.
+
+### Repository
+
+- **Project**: sgl-jax, the JAX backend for SGLang
+- **Main source**: `python/sgl_jax/`, core serving runtime under `python/sgl_jax/srt/`
+- **Tests**: `test/` directory; test suite entry point: `test/srt/run_suite.py`
+- **Integration branch**: `feat/data-parallelism` (all agent PRs target this branch)
+- **End goal**: `feat/data-parallelism` passes CI and merges into `main`
+
+### Key Constraints
+
+- **TPU required**: All JAX tests must run on a remote TPU cluster. Never run JAX/TPU tests locally.
+- **Remote execution**: Use the `sglang-jax-skypilot-dev` skill for all remote test and debug sessions.
+- **Sky commands**: Do not prefix Sky commands with `env -u ...` proxy overrides. Use `sky exec`, `sky status`, etc. directly.
+
+### DP Architecture Overview
+
+Core components of data-parallelism in sgl-jax:
+
+| Component | Path | Description |
+|-----------|------|-------------|
+| Scheduler | `python/sgl_jax/srt/managers/` | DP-aware request scheduling |
+| Allocator | `python/sgl_jax/srt/mem_cache/` | Memory allocation across DP ranks |
+| Radix Cache | `python/sgl_jax/srt/mem_cache/` | DP-safe KV cache management |
+| Control Plane | `python/sgl_jax/srt/` | Communication and coordination between DP ranks |
+
+---
+
+## SOP
+
+Follow these phases **in order**. Do not skip ahead.
+
+### Phase 0: Receive Task
+
+Before touching any code:
+
+- [ ] Confirm from your task description: **feature name**, **integration branch name**, **task type** (bugfix / feature), **test mode** (test file / service debug)
+- [ ] If any of the above is missing or ambiguous → **stop and report immediately**. Do not guess.
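The Phase 0 checklist amounts to a validation gate on the task description. A minimal sketch of that gate follows, assuming a hypothetical `TaskDescription` record and `phase0_problems` helper (neither is part of sgl-jax); the allowed values mirror the checklist, and the feature name in the usage line is made up for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


# Hypothetical record for illustration only; not part of sgl-jax.
@dataclass
class TaskDescription:
    feature_name: Optional[str] = None
    integration_branch: Optional[str] = None
    task_type: Optional[str] = None  # "bugfix" or "feature"
    test_mode: Optional[str] = None  # "test file" or "service debug"


# Fields with a closed set of accepted values, taken from the checklist.
ALLOWED = {
    "task_type": {"bugfix", "feature"},
    "test_mode": {"test file", "service debug"},
}


def phase0_problems(task: TaskDescription) -> list[str]:
    """Return missing or ambiguous fields; an empty list means proceed."""
    problems = []
    for f in fields(task):
        value = getattr(task, f.name)
        if not value:
            problems.append(f"{f.name}: missing")
        elif f.name in ALLOWED and value not in ALLOWED[f.name]:
            problems.append(f"{f.name}: unrecognized value {value!r}")
    return problems


# Made-up task: two of the four required fields are absent.
task = TaskDescription(feature_name="dp-radix-cache", task_type="bugfix")
print(phase0_problems(task))
```

A non-empty result corresponds to the "stop and report" branch: the agent reports the listed fields instead of guessing them.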
+
+---
+
+### Phase 1: Set Up Worktree
+
+Create an isolated worktree based on the integration branch. **All subsequent work happens inside this worktree only.**
+
+```bash
+# Create worktree and working branch in one step (works even if
+# <integration-branch> is already checked out in another worktree)
+git worktree add -b <feature-name> .worktrees/<feature-name> <integration-branch>
+cd .worktrees/<feature-name>
+```
+
+Rules:
+- Never modify the main working directory or any other worktree.
+- Never push directly to the integration branch or main.
+
+---
+
+### Phase 2: Analyze the Problem
+
+Before writing any code, produce a short written analysis (this will become the PR description draft):
+
+**For bugfix:**
+- Root cause of the bug
+- Affected modules / files
+- Expected behavior after the fix
+
+**For feature development:**
+- Functional boundary: what is in scope, what is out of scope
+- Affected modules / files
+- Expected behavior / interface
+
+If during analysis you find you need to modify files that belong to another agent's functional area → **stop and report**. Do not modify those files.
+
+---
+
+### Phase 3: Write Tests First
+
+Before implementing, write the tests that define success. Tests must **fail** (red) at this point.
+
+**Test File mode:**
+- Write or update test files under `test/`
+- Run the tests on the remote TPU cluster and confirm they fail
+- Commit the failing tests
+
+**Service Debug mode:**
+- Write the client script (benchmark / debug / accuracy test) that will be run against the live server
+- Document the expected output or pass criteria
+- Commit the client script
+
+Use the `sglang-jax-skypilot-dev` skill for all remote execution.
+
+If the test infrastructure itself is broken or unclear → **stop and report**.
+
+---
+
+### Phase 4: Implement
+
+Write the implementation to make the tests pass.
+
+Rules:
+- Only modify files within your functional area (identified in Phase 2)
+- Follow project code style: lazy log formatting (`logger.info("msg %s", var)`), Ruff-compliant code
+- If you discover a necessary change is outside your functional area → **stop and report**
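The lazy log formatting rule above matters because `%s` arguments are only rendered when a record is actually emitted, while an f-string is built before the logger checks its level. The sketch below demonstrates the difference with the standard `logging` module; the `Expensive` class and the logger name are illustrative assumptions, not sgl-jax code.

```python
import logging

logger = logging.getLogger("dp_example")  # illustrative logger name
logger.setLevel(logging.WARNING)  # INFO records are filtered out


class Expensive:
    """Counts how many times an instance is rendered to a string."""

    renders = 0

    def __str__(self) -> str:
        Expensive.renders += 1
        return "expensive-repr"


# Lazy: the argument is formatted only if the record is emitted.
logger.info("allocated %s", Expensive())
lazy_renders = Expensive.renders

# Eager: the f-string is evaluated before logging checks the level.
logger.info(f"allocated {Expensive()}")
eager_renders = Expensive.renders - lazy_renders

print(lazy_renders, eager_renders)
```

Ruff can flag the eager form through its `G004` (f-string in logging call) rule; whether the project's Ruff configuration enables that rule is not stated in this guide.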
+
+Commit incrementally with clear messages.
+
+---
+
+### Phase 5: Verify
+
+Run your tests on the remote TPU cluster and confirm they pass (green).
+
+**Test File mode:**
+```bash
+# Via sglang-jax-skypilot-dev skill: SSH into cluster, then:
+uv run --extra tpu python -m pytest test/srt/<your_test_file.py> -v
+```
+
+**Service Debug mode:**
+```bash
+# Via sglang-jax-skypilot-dev skill: SSH into cluster, then:
+# tmux session "server": start the service
+# tmux session "client": run the client/benchmark/accuracy test after service is ready
+```
+
+Do not proceed to Phase 6 until all your tests are green.
+
+---
+
+### Phase 6: Submit PR
+
+Open a PR from your working branch targeting the **integration branch**.
+
+**PR title format:** `[DP] <feature-name>: <one-line description>`
+
+**PR description must include:**
+
+```
+## Problem Analysis
+<Root cause (bugfix) or functional boundary (feature)>
+
+## Changes
+<List of modified files and what changed in each>
+
+## Test Results
+Test mode: [test file | service debug]
+Command: <exact command used>
+Result: <pass/fail counts or benchmark output summary>
+```
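The title format and description template above can be filled in mechanically. The following is a minimal sketch, assuming hypothetical helper names `pr_title` and `pr_description` (they are not part of the repository or of any GitHub tooling); every value in the usage lines is a placeholder for illustration, not a real result.

```python
def pr_title(feature_name: str, summary: str) -> str:
    """Format a title as `[DP] <feature-name>: <one-line description>`."""
    return f"[DP] {feature_name}: {summary}"


def pr_description(
    analysis: str,
    changes: dict[str, str],
    test_mode: str,
    command: str,
    result: str,
) -> str:
    """Render the three required sections of the PR description."""
    if test_mode not in ("test file", "service debug"):
        raise ValueError(f"unknown test mode: {test_mode!r}")
    # One bullet per modified file, with a note on what changed in it.
    change_lines = "\n".join(f"- `{path}`: {what}" for path, what in changes.items())
    return (
        "## Problem Analysis\n"
        f"{analysis}\n\n"
        "## Changes\n"
        f"{change_lines}\n\n"
        "## Test Results\n"
        f"Test mode: {test_mode}\n"
        f"Command: {command}\n"
        f"Result: {result}\n"
    )


# Placeholder values only; substitute the real analysis, command, and output.
title = pr_title("<feature-name>", "<one-line description>")
body = pr_description(
    analysis="<root cause or functional boundary>",
    changes={"<path/to/file.py>": "<what changed>"},
    test_mode="test file",
    command="<exact command used>",
    result="<pass/fail counts>",
)
print(title)
print(body)
```

Rejecting an unknown test mode early keeps the "Test Results" section aligned with the two modes this guide defines.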
+
+- If all tests pass → open as a **ready-for-review** PR
+- If there are unresolved blockers → open as a **Draft PR** and add a comment explaining the blocker (what the problem is, which phase you are stuck in)
+
+---
+
+### Blocker Protocol
+
+At **any** phase, if you encounter a blocker:
+
+1. **Stop immediately**: do not attempt workarounds
+2. Commit your current state with message: `WIP: blocked on <description>`
+3. Open a Draft PR (or add a comment to an existing one) with:
+   - Phase you are stuck in
+   - Exact error or conflict
+   - What you have already tried
+4. Wait for human intervention
+
+---
+
+## Acceptance Criteria
+
+A PR is ready to merge into the integration branch when **all** of the following are true:
+
+| Criterion | Requirement |
+|-----------|-------------|
+| Linter | `ruff check` passes with no errors |
+| Code hygiene | No unresolved TODO/FIXME in newly added code |
+| Tests (test file mode) | All specified test files pass; no new failures introduced |
+| Tests (service debug mode) | Service starts successfully; client/benchmark/accuracy test completes with results meeting expected criteria |
+| PR description | Problem analysis, file list, and test output summary all present |
+| Branch state | Based on integration branch, no unresolved merge conflicts |
+| Commit history | No stray debug commits; history is clean or squashed |
+| Blockers | If any blocker exists, PR must be Draft with blocker described in a comment |
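The code hygiene criterion applies only to newly added code, so a check has to look at added lines of a diff rather than whole files. Below is a minimal sketch of such a check over unified-diff text, assuming a hypothetical helper `unresolved_markers`; the sample diff is invented for illustration and the guide does not prescribe any particular tool for this.

```python
import re

# Match TODO or FIXME as whole words.
MARKER = re.compile(r"\b(TODO|FIXME)\b")


def unresolved_markers(diff_text: str) -> list[tuple[int, str]]:
    """Return (diff line number, text) for added lines containing a marker."""
    hits = []
    for i, line in enumerate(diff_text.splitlines(), start=1):
        # Added lines start with "+"; "+++" is the new-file header.
        # Limitation: an added line whose own text begins with "++" is skipped.
        if line.startswith("+") and not line.startswith("+++"):
            if MARKER.search(line):
                hits.append((i, line[1:].strip()))
    return hits


# Invented sample: one marker is removed, one is newly added.
sample = """\
--- a/example.py
+++ b/example.py
@@ -1,2 +1,3 @@
 def f():
-    pass  # TODO: old marker, being removed
+    # FIXME: handle negative indices
+    return 1
"""
print(unresolved_markers(sample))
```

In practice the input could be the output of `git diff <integration-branch>...HEAD`, so that only changes introduced on the working branch are scanned.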

python/pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ dependencies = [
     "pandas",
     "aiohttp",
     "pybase64",
+    "datasets",
     "partial_json_parser",
     "pathwaysutils",
     "omegaconf",

python/sgl_jax/srt/mem_cache/memory_pool.py

Lines changed: 5 additions & 1 deletion
@@ -475,7 +475,11 @@ def set_kv_buffer_legacy(
         N = self.kv_buffer[layer_idx].shape[0]
         safe_loc = jnp.where(loc >= 0, loc, jnp.int32(N))
         # for jax function
-        updated_layer = self.kv_buffer[layer_idx].at[safe_loc].set(fused_kv, mode="drop")
+        updated_layer = (
+            self.kv_buffer[layer_idx]
+            .at[safe_loc]
+            .set(fused_kv, mode="drop", out_sharding=self.kv_sharding)
+        )
         return updated_layer
 
 
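The `safe_loc` remapping in this hunk is needed because negative indices in `.at[...]` wrap around NumPy-style, so a negative `loc` would overwrite a real row; remapping it to `N` (one past the end) lets `mode="drop"` discard the write instead. The sketch below shows the effect on a toy buffer. The shapes and values are illustrative assumptions, treating `-1` as a padded slot is inferred from the `loc >= 0` test, and the `out_sharding` argument from the diff is omitted because it depends on the project's mesh setup.

```python
import jax.numpy as jnp

# Toy stand-ins for one layer of the KV buffer and a batch of writes.
buf = jnp.zeros((4, 2), dtype=jnp.float32)
loc = jnp.array([0, -1, 2], dtype=jnp.int32)  # -1 assumed to mark padding
vals = jnp.ones((3, 2), dtype=jnp.float32)

N = buf.shape[0]
# Send negative locations to N, which is out of bounds for the buffer.
safe_loc = jnp.where(loc >= 0, loc, jnp.int32(N))

# With the remap, the padded write is dropped and rows 0 and 2 are set.
updated = buf.at[safe_loc].set(vals, mode="drop")

# Without the remap, -1 wraps to the last row and overwrites it.
naive = buf.at[loc].set(vals, mode="drop")

print(updated[:, 0])
print(naive[:, 0])
```

Comparing the two printed columns shows which rows each variant touched.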
