Merged
29 changes: 29 additions & 0 deletions AGENTS.md
@@ -171,6 +171,35 @@ Improve pass:
- `use_swiglu_limit`
- `use_swiglu_limit_shared`

### Eval experiment pattern

- `playground/eval/*` currently uses a thin `GenableEvalConfig` wrapper, not the trainer stack.
- The common eval skeleton is three roles in `resource_cfg.task_specs`: `router`, `vllm`, `evaluator`.
- Runtime flow is: `Exp.entrypoint()` dispatches by `ROLE` -> router publishes `VLLM_ROUTER_ADDR_PORT_<key>` in exp Redis -> vLLM worker health-checks then registers -> evaluator waits via `vllm_cfg.build_cli().wait_for_server()` and runs `eval_cfg.eval()`.
- Sample execution path is `GenableEvalConfig.eval()` -> `GenerationController.generate()` -> `SimpleTrainable.generate()` -> router `/v1/completions`; task-specific metrics are computed in the concrete eval config, not in a shared benchmark harness.
- These eval jobs depend on `STEPTRON_MEET_DIR` because `get_exp_redis()` uses the shared filesystem there for Redis rendezvous.
- For a single-node eval, prefer `python tools/mp_run.py playground/eval/...py`; for multi-node/manual scheduling, generate per-role scripts with `python tools/build_scripts.py playground/eval/...py <output_root>`.
- A fresh vLLM 0.17 eval startup can spend several minutes before `/health` opens: model inspection, distributed init, weight load, `torch.compile`, KV-cache sizing, and CUDA graph capture all happen before the controller can register success. Repeated controller-side `Connection refused` during that phase is expected if logs still show forward progress.
- Use `GenableItem` for generation-only eval items and reserve `TrainableItem` for objects that actually implement `generate_for_train()`. `GenerationController` now accepts `GenableItem` on the normal generate path and only requires `TrainableItem` when `for_train=True`.
- The chat eval wraps each `chat/completions` request with `retry_on(...)` for transient HTTP failures. Retry only transport errors and transient statuses (`408`, `425`, `429`, `5xx`); keep permanent `4xx` responses fail-fast so bad requests are not retried blindly.
- `steptronoss/generation/vllm/vllm_router.py` now aims to be timeout-transparent: its upstream `aiohttp` session disables `total/connect/sock_connect/sock_read` timeouts instead of imposing an extra router-side timeout layer. Client-side timeouts still exist, but they are not propagated over HTTP; the closest transparent behavior is for the router not to add its own.
- `GenerationController.set_tqdm(disabled, total, desc)` is the supported way to customize progress output. The callback thread owns the actual tqdm instance; callers should configure it from the main thread instead of constructing ad-hoc bars around controller callbacks.
- In `steptronoss/generation/async_generation.py`, global generation throttling must happen in the main `GenerationController` dispatch path, not by bounding the worker `mp.Queue` alone. Each worker immediately drains that queue into its own local asyncio queue, so a plain queue `maxsize` is not a real global concurrency cap. Use callback/result arrival as the ack that frees one in-flight slot and dispatches the next pending genable.
- In the eval, do not pass raw `max_decode_steps=max_seq_len` straight through to `chat/completions`. Cap each request by `max_model_len - len(prompt.tokens)`; otherwise vLLM rejects every call with `VLLMValidationError` because the prompt leaves zero completion budget.
- `vllm_gpu_memory_utilization=0.95` can make the full mixed-benchmark eval collapse with `EngineDeadError` / `Process EngineCore_DP* died` once generation starts. Lowering the vLLM flag to `0.85` stabilized the default `num_generation_workers=32` subset run (benchmark `down_sample_to=1`) and allowed the full run to start cleanly without immediate OOM spam.
- If you introduce benchmark abstractions on the OSS side, keep the base protocol under `steptronoss/generation/base_benchmark.py`, and put concrete benchmark implementations under `playground/eval/benchmarks/<BenchmarkName>/`. Avoid hiding benchmark selection behind a registry when the benchmark set is still evolving quickly; explicit construction in the eval exp is easier to audit and refactor.
- Shared simple-benchmark eval plumbing now lives in `playground/eval/eval_sets/simple_eval.py`. That module owns the simple-benchmark list itself; model-specific eval files should only bind model/resource/tokenizer config on top of `SimpleBenchmarksEvalConfig`.
- Some Step3/Step3.5 training exports under `/oss/checkpoints/.../hf` contain only safetensor shards plus `model.safetensors.index.json`, without `config.json` or tokenizer assets. Those raw dirs are not directly servable by vLLM; prepare a wrapper HF dir (for example `hf_vllm`) that adds a compatible `config.json`, and point `tokenizer_path` at a separate mounted tokenizer.
- Sampling policy for the shared simple-benchmark eval should live in `SimpleBenchmarksEvalConfig.get_sampling_params(...)`, not be hardcoded inside `SimpleChatGeneratable`. Keep the generatable responsible only for per-request normalization such as context-budget clamping and filling a default seed when the config leaves it unset.
- Benchmark-focused tests under `tests/` should live in `tests/benchmarks/` instead of the top-level `tests/` directory, so benchmark wrappers and their fixtures stay grouped together.
- Benchmark-specific code should stay inside its own benchmark folder under `playground/eval/benchmarks/<BenchmarkName>/`; avoid spreading benchmark logic, helper modules, or downloaded benchmark assets into unrelated directories.
- Benchmark class initialization must stay lightweight. Prefer lazy import, lazy data parsing, and lazy verifier/resource setup; importing a benchmark module or constructing the benchmark object should not trigger heavyweight package imports, network access, or resource downloads.
- Benchmark resources should live under one explicit root path agreed for that benchmark, and the benchmark class should receive that path through initialization parameters or derive it from a caller-provided parent such as `datasets_dir`. Do not hide resource paths across multiple hardcoded locations.
- If a benchmark depends on external resources beyond Python packages, such as NLTK corpora/models, download them ahead of time into that benchmark resource root and have runtime code read from there. Do not rely on import-time auto-download behavior.
- `playground/eval/benchmarks/IFBench/benchmark.py` should lazy-import the official AllenAI verifier from `playground/eval/benchmarks/IFBench/official/` and derive its explicit resource root from the caller-provided simple-benchmark `datasets_dir`, using `<datasets_dir>/IFBENCH/` for `IFBench_test.jsonl` plus `nltk_data/`. `simple_eval` should pass that directory root directly, and the benchmark should accept only that directory-root form instead of carrying compatibility for explicit prompt-file paths. Do not keep a second hardcoded IFBench resource root in the benchmark or helper modules. Keep NLTK/resource setup lazy too; do not trigger imports, downloads, or directory creation at module import time. The benchmark defaults to official `loose` scoring and strips inline `<think>...</think>`-style reasoning before verification; rollout/sampling settings still come from `simple_eval`, so leaderboard parity still requires matching the official generation settings such as `temperature=0`.
- Keep IFBench benchmark-owned sampling overrides narrow. `playground/eval/benchmarks/IFBench/benchmark.py` should pin official settings like `temperature=0`, but should not hardcode `extra_body.chat_template_kwargs`; IFBench thinking/chat-template behavior should flow from `SimpleBenchmarksEvalConfig.chat_template_args`.
- For vendored official benchmark helpers such as `playground/eval/benchmarks/IFBench/official/`, keep only the runtime scoring path needed by the OSS benchmark wrapper. Script-style file I/O helpers, report printers, and other standalone-binary scaffolding from the upstream repo are dead weight unless the OSS call path actually invokes them.
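
The retry policy for `chat/completions` described above (retry transport errors and `408`, `425`, `429`, `5xx`; fail fast on other `4xx`) can be sketched as a status predicate. This is a minimal sketch: `is_retryable_status` and `TRANSIENT_STATUSES` are hypothetical names, and how the predicate is wired into `retry_on(...)` depends on that helper's signature, which is not shown here.

```python
# Hypothetical helper: the status set mirrors the policy stated in this section.
TRANSIENT_STATUSES = frozenset({408, 425, 429})


def is_retryable_status(status: int) -> bool:
    """Return True for 408, 425, 429, and any 5xx; False for everything else."""
    return status in TRANSIENT_STATUSES or 500 <= status <= 599
```

Permanent client errors such as `400` or `404` return `False`, so a malformed request surfaces immediately instead of being retried.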
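
The context-budget clamp described above (cap each request by `max_model_len - len(prompt.tokens)`) amounts to a single `min`. The sketch below assumes the prompt length is already known as an integer; `clamp_completion_budget` is a hypothetical name, not a function from the repo.

```python
def clamp_completion_budget(max_decode_steps: int, max_model_len: int, prompt_len: int) -> int:
    """Cap the requested completion length by the room left after the prompt."""
    remaining = max_model_len - prompt_len
    if remaining <= 0:
        # The prompt alone fills the context window, so no completion fits.
        raise ValueError(
            f"prompt uses {prompt_len} tokens but max_model_len is {max_model_len}"
        )
    return min(max_decode_steps, remaining)
```

With `max_decode_steps=4096`, `max_model_len=4096`, and a 1000-token prompt, the request is capped at 3096 tokens instead of being rejected.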
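
The ack-based throttling described above for `async_generation.py` can be illustrated with a small controller-side limiter. This is a sketch of the idea only, not the actual `GenerationController` code: `InFlightLimiter`, `submit`, and `ack` are hypothetical names, and `dispatch` stands in for whatever puts a genable on the worker queue.

```python
from collections import deque
from typing import Callable, Deque, Generic, TypeVar

T = TypeVar("T")


class InFlightLimiter(Generic[T]):
    """Dispatch at most `max_in_flight` items; each ack releases one slot."""

    def __init__(self, max_in_flight: int, dispatch: Callable[[T], None]) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be >= 1")
        self._max = max_in_flight
        self._dispatch = dispatch
        self._pending: Deque[T] = deque()
        self._in_flight = 0

    @property
    def in_flight(self) -> int:
        return self._in_flight

    def submit(self, item: T) -> None:
        """Queue an item and dispatch it right away if a slot is free."""
        self._pending.append(item)
        self._fill()

    def ack(self) -> None:
        """Call when a result arrives: frees one slot and dispatches the next item."""
        if self._in_flight == 0:
            raise RuntimeError("ack() without a matching dispatch")
        self._in_flight -= 1
        self._fill()

    def _fill(self) -> None:
        # Hand out pending items until the global cap is reached.
        while self._pending and self._in_flight < self._max:
            self._in_flight += 1
            self._dispatch(self._pending.popleft())
```

Because the cap is enforced before anything reaches a worker, it holds no matter how eagerly each worker drains its queue into a local asyncio queue.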
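
Stripping inline `<think>...</think>`-style reasoning before verification, as described for IFBench above, could look like the following. This is a minimal sketch that handles only well-formed, closed `<think>` blocks; `strip_reasoning` is a hypothetical name, and the benchmark's actual stripping logic may cover more tag variants.

```python
import re

# Non-greedy match across newlines so each closed block is removed separately.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove closed <think>...</think> blocks and trim surrounding whitespace."""
    return _THINK_BLOCK.sub("", text).strip()
```

An unclosed `<think>` tag is left untouched by this sketch, so truncated generations would need separate handling.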
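
The lightweight-initialization and single-resource-root rules above can be sketched with a lazily loaded property. The class below is illustrative only: `LazyBenchmark`, `name`, and `prompts_file` are hypothetical, and the real base protocol in `steptronoss/generation/base_benchmark.py` is not reproduced here.

```python
import json
from functools import cached_property
from pathlib import Path
from typing import List


class LazyBenchmark:
    """Construction does path arithmetic only; data is parsed on first access."""

    def __init__(self, datasets_dir: str, name: str, prompts_file: str) -> None:
        # One explicit resource root derived from the caller-provided parent dir.
        self.root = Path(datasets_dir) / name
        self._prompts_file = prompts_file

    @cached_property
    def items(self) -> List[dict]:
        # File I/O happens here, not at import or construction time.
        path = self.root / self._prompts_file
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
```

Constructing the object never touches the filesystem, so a missing dataset only fails when `items` is first read, and heavier setup such as verifier imports can follow the same pattern.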

## 6. Parallelism and Checkpointing

### Parallel state
4 changes: 2 additions & 2 deletions README.md
@@ -180,6 +180,6 @@ set_optimization(

- [x] SFT exps
- [x] Reference configs: Qwen3 8B `playground/pretrain/qwen3/qwen3_8.py`, Step3.5 Flash `playground/pretrain/step3p5/step3p5_flash.py`
-- [ ] Eval
+- [x] Eval
 - [ ] RLVR implementation
-- [ ] Triton kernel implementation
+- [x] Triton kernel implementation
4 changes: 2 additions & 2 deletions README_ZH.md
@@ -175,6 +175,6 @@ set_optimization(

- [x] SFT exps
- [x] Reference configs: Qwen3 8B `playground/pretrain/qwen3/qwen3_8.py`, Step3.5 Flash `playground/pretrain/step3p5/step3p5_flash.py`
-- [ ] Eval
+- [x] Eval
 - [ ] RLVR 实现
-- [ ] Triton kernel 实现
+- [x] Triton kernel 实现