stepfun-ai · Randomizez · Mar 18, 2026 · Mar 10, 2026 · Mar 17, 2026 · Mar 18, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -171,6 +171,35 @@ Improve pass:
   - `use_swiglu_limit`
   - `use_swiglu_limit_shared`
 
+### Eval experiment pattern
+
+- `playground/eval/*` currently uses a thin `GenableEvalConfig` wrapper, not the trainer stack.
+- The common eval skeleton is three roles in `resource_cfg.task_specs`: `router`, `vllm`, `evaluator`.
+- Runtime flow is: `Exp.entrypoint()` dispatches by `ROLE` -> router publishes `VLLM_ROUTER_ADDR_PORT_<key>` in exp Redis -> vLLM worker health-checks then registers -> evaluator waits via `vllm_cfg.build_cli().wait_for_server()` and runs `eval_cfg.eval()`.
+- Sample execution path is `GenableEvalConfig.eval()` -> `GenerationController.generate()` -> `SimpleTrainable.generate()` -> router `/v1/completions`; task-specific metrics are computed in the concrete eval config, not in a shared benchmark harness.
+- These eval jobs depend on `STEPTRON_MEET_DIR` because `get_exp_redis()` uses the shared filesystem there for Redis rendezvous.
+- For a single-node eval, prefer `python tools/mp_run.py playground/eval/...py`; for multi-node/manual scheduling, generate per-role scripts with `python tools/build_scripts.py playground/eval/...py <output_root>`.
+- A fresh vLLM 0.17 eval startup can spend several minutes before `/health` opens: model inspection, distributed init, weight load, `torch.compile`, KV-cache sizing, and CUDA graph capture all happen before the controller can register success. Repeated controller-side `Connection refused` during that phase is expected if logs still show forward progress.
+- Use `GenableItem` for generation-only eval items and reserve `TrainableItem` for objects that actually implement `generate_for_train()`. `GenerationController` now accepts `GenableItem` on the normal generate path and only requires `TrainableItem` when `for_train=True`.
+- The chat eval wraps each `chat/completions` request with `retry_on(...)` for transient HTTP failures. Retry only transport errors and transient statuses (`408`, `425`, `429`, `5xx`); keep permanent `4xx` responses fail-fast so bad requests are not retried blindly.
+- `steptronoss/generation/vllm/vllm_router.py` now aims to be timeout-transparent: its upstream `aiohttp` session disables `total/connect/sock_connect/sock_read` timeouts instead of imposing an extra router-side timeout layer. Client-side timeouts still exist, but they are not propagated over HTTP; the closest transparent behavior is for the router not to add its own.
+- `GenerationController.set_tqdm(disabled, total, desc)` is the supported way to customize progress output. The callback thread owns the actual tqdm instance; callers should configure it from the main thread instead of constructing ad-hoc bars around controller callbacks.
+- In `steptronoss/generation/async_generation.py`, global generation throttling must happen in the main `GenerationController` dispatch path, not by bounding the worker `mp.Queue` alone. Each worker immediately drains that queue into its own local asyncio queue, so a plain queue `maxsize` is not a real global concurrency cap. Use callback/result arrival as the ack that frees one in-flight slot and dispatches the next pending genable.
+- In the eval, do not pass raw `max_decode_steps=max_seq_len` straight through to `chat/completions`. Cap each request by `max_model_len - len(prompt.tokens)`; otherwise vLLM rejects every call with `VLLMValidationError` because the prompt leaves zero completion budget.
+- `vllm_gpu_memory_utilization=0.95` can make the full mixed-benchmark eval collapse with `EngineDeadError` / `Process EngineCore_DP* died` once generation starts. Lowering the vLLM flag to `0.85` stabilized the default `num_generation_workers=32` subset run (benchmark `down_sample_to=1`) and allowed the full run to start cleanly without immediate OOM spam.
+- If you introduce benchmark abstractions on the OSS side, keep the base protocol under `steptronoss/generation/base_benchmark.py`, and put concrete benchmark implementations under `playground/eval/benchmarks/<BenchmarkName>/`. Avoid hiding benchmark selection behind a registry when the benchmark set is still evolving quickly; explicit construction in the eval exp is easier to audit and refactor.
+- Shared simple-benchmark eval plumbing now lives in `playground/eval/eval_sets/simple_eval.py`. That module owns the simple-benchmark list itself; model-specific eval files should only bind model/resource/tokenizer config on top of `SimpleBenchmarksEvalConfig`.
+- Some Step3/Step3.5 training exports under `/oss/checkpoints/.../hf` contain only safetensor shards plus `model.safetensors.index.json`, without `config.json` or tokenizer assets. Those raw dirs are not directly serveable by vLLM; prepare a wrapper HF dir (for example `hf_vllm`) that adds a compatible `config.json`, and point `tokenizer_path` at a separate mounted tokenizer.
+- Sampling policy for the shared simple-benchmark eval should live in `SimpleBenchmarksEvalConfig.get_sampling_params(...)`, not be hardcoded inside `SimpleChatGeneratable`. Keep the generatable responsible only for per-request normalization such as context-budget clamping and filling a default seed when the config leaves it unset.
+- Benchmark-focused tests under `tests/` should live in `tests/benchmarks/` instead of the top-level `tests/` directory, so benchmark wrappers and their fixtures stay grouped together.
+- Benchmark-specific code should stay inside its own benchmark folder under `playground/eval/benchmarks/<BenchmarkName>/`; avoid spreading benchmark logic, helper modules, or downloaded benchmark assets into unrelated directories.
+- Benchmark class initialization must stay lightweight. Prefer lazy import, lazy data parsing, and lazy verifier/resource setup; importing a benchmark module or constructing the benchmark object should not trigger heavyweight package imports, network access, or resource downloads.
+- Benchmark resources should live under one explicit root path agreed for that benchmark, and the benchmark class should receive that path through initialization parameters or derive it from a caller-provided parent such as `datasets_dir`. Do not hide resource paths across multiple hardcoded locations.
+- If a benchmark depends on external resources beyond Python packages, such as NLTK corpora/models, download them ahead of time into that benchmark resource root and have runtime code read from there. Do not rely on import-time auto-download behavior.
+- `playground/eval/benchmarks/IFBench/benchmark.py` should lazy-import the official AllenAI verifier from `playground/eval/benchmarks/IFBench/official/` and derive its explicit resource root from the caller-provided simple-benchmark `datasets_dir`, using `<datasets_dir>/IFBENCH/` for `IFBench_test.jsonl` plus `nltk_data/`. `simple_eval` should pass that directory root directly, and the benchmark should only accept that directory-root form instead of carrying compatibility for explicit prompt-file paths. Do not keep a second hardcoded IFBench resource root in the benchmark or helper modules. Keep NLTK/resource setup lazy too; do not trigger imports, downloads, or directory creation at module import time. The benchmark defaults to official `loose` scoring and strips inline `<think>...</think>`-style reasoning before verification; rollout/sampling settings still come from `simple_eval`, so leaderboard parity still requires matching the official generation settings such as `temperature=0`.
+- Keep IFBench benchmark-owned sampling overrides narrow. `playground/eval/benchmarks/IFBench/benchmark.py` should pin official settings like `temperature=0`, but should not hardcode `extra_body.chat_template_kwargs`; IFBench thinking/chat-template behavior should flow from `SimpleBenchmarksEvalConfig.chat_template_args`.
+- For vendored official benchmark helpers such as `playground/eval/benchmarks/IFBench/official/`, keep only the runtime scoring path needed by the OSS benchmark wrapper. Script-style file I/O helpers, report printers, and other standalone-binary scaffolding from the upstream repo are dead weight unless the OSS call path actually invokes them.
+
 ## 6. Parallelism and Checkpointing
 
 ### Parallel state
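
The throttling rule in the AGENTS.md hunk above (cap in-flight work in the main dispatch path, and treat each result arrival as the ack that frees a slot) can be sketched as follows. This is a minimal illustration, not the code in `steptronoss/generation/async_generation.py`: `InFlightLimiter`, `submit`, `on_result`, and the `send` callable are hypothetical names, and `send` stands in for whatever actually hands a genable to a worker (for example a queue `put`).

```python
from collections import deque
from typing import Callable, Deque, Generic, TypeVar

T = TypeVar("T")


class InFlightLimiter(Generic[T]):
    """Hypothetical helper: caps in-flight items on the dispatcher side."""

    def __init__(self, max_in_flight: int, send: Callable[[T], None]) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max = max_in_flight
        self._send = send  # assumed hand-off to a worker, e.g. queue.put
        self._pending: Deque[T] = deque()
        self.in_flight = 0

    def submit(self, item: T) -> None:
        """Queue an item locally; dispatch it only if a slot is free."""
        self._pending.append(item)
        self._pump()

    def on_result(self) -> None:
        """Call once per result arrival: frees one slot, dispatches the next."""
        if self.in_flight == 0:
            raise RuntimeError("result received with nothing in flight")
        self.in_flight -= 1
        self._pump()

    def _pump(self) -> None:
        # Dispatch pending items while slots remain under the global cap.
        while self._pending and self.in_flight < self._max:
            self.in_flight += 1
            self._send(self._pending.popleft())
```

Because the cap is enforced before the hand-off, it still holds when workers drain their queue eagerly. The sketch is single-threaded; if results arrive on a separate callback thread, `submit` and `on_result` would need a lock around the shared state.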

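Two request-side rules from the AGENTS.md hunk above, the retry classification (`408`, `425`, `429`, `5xx`, and transport errors) and the per-request completion cap of `max_model_len - len(prompt.tokens)`, are small enough to sketch directly. The helper names `is_retryable` and `clamp_max_tokens` are hypothetical and are not the repository's `retry_on(...)` API; representing a transport error as `status=None` is also an assumption of this sketch.

```python
from typing import Optional

# Transient statuses listed in the guidance; any 5xx is handled separately.
TRANSIENT_STATUSES = frozenset({408, 425, 429})


def is_retryable(status: Optional[int]) -> bool:
    """Return True when a failed request is worth retrying.

    `status` is None for transport-level failures where no HTTP response
    was received (an assumption of this sketch).
    """
    if status is None:
        return True
    return status in TRANSIENT_STATUSES or 500 <= status <= 599


def clamp_max_tokens(requested: int, max_model_len: int, prompt_len: int) -> int:
    """Cap the completion budget at what the context window still allows."""
    budget = max_model_len - prompt_len
    if budget <= 0:
        # The prompt alone fills the window; fail early on the client side.
        raise ValueError("prompt leaves no room for completion tokens")
    return min(requested, budget)
```

With these helpers, a `400` or `404` fails fast while a `429` or `503` is retried, and a request that asks for `max_seq_len` completion tokens is reduced to the space left after the prompt.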
diff --git a/README.md b/README.md
@@ -180,6 +180,6 @@ set_optimization(
 
 - [x] SFT exps
 - [x] Reference configs: Qwen3 8B `playground/pretrain/qwen3/qwen3_8.py`, Step3.5 Flash `playground/pretrain/step3p5/step3p5_flash.py`
-- [ ] Eval
+- [x] Eval
 - [ ] RLVR implementation
-- [ ] Triton kernel implementation
+- [x] Triton kernel implementation
diff --git a/README_ZH.md b/README_ZH.md
@@ -175,6 +175,6 @@ set_optimization(
 
 - [x] SFT exps
 - [x] Reference configs: Qwen3 8B `playground/pretrain/qwen3/qwen3_8.py`, Step3.5 Flash `playground/pretrain/step3p5/step3p5_flash.py`
-- [ ] Eval
+- [x] Eval
 - [ ] RLVR 实现
-- [ ] Triton kernel 实现
+- [x] Triton kernel 实现