This file stores repo-specific priors for future agents. Keep it short, practical, and biased toward things that save repeated exploration.
Do an improve pass for key tasks:
- new environment
- new task type
- new project area
- high risk / high cost work
- collaboration / handoff work
Improve pass:
- Identify friction
- Extract reusable priors
- Write them down in:
  - `AGENTS.md` for repo-wide priors
  - `docs/` for process / runbook details
- Core package: `steptronoss/` - core, model, data, exp, optimizer, generation, tokenizer, utils, checkpointing
- Experiments: `playground/`
- Tests: `tests/`
- Docs: `docs/`
- `steptronoss/utils/arguments.py`: config overrides from CLI
- `steptronoss/utils/comm_utils.py`: Redis rendezvous, queues, `LocalFuture`/`RemoteFuture`
- `steptronoss/utils/dist_utils.py`: broadcast / all-to-all helpers, packing helpers, balancing helpers
- `steptronoss/utils/general.py`: numeric helpers, list split/balance, RNG fork, retry, recursion helpers, git hash
- `steptronoss/utils/logger.py`: rank-aware logging and `StepWriter`
- `steptronoss/utils/metrics.py`: metrics system (`Metric`, `Avg`, `Percentage`, `Histogram`, `Text`, `GradNorm`, `GlobalMetrics`)
- `steptronoss/utils/optimizable.py`: `@optimizable(...)` and `set_optimization(...)`
- `steptronoss/utils/utils.py`: model unwrap, param norms, memory report, layer map, IO helpers, generic load
- `steptronoss/utils/weight_loader.py`: HF safetensors mapping / merge
- Config class fields should include a short triple-quoted docstring immediately after the attribute definition.
- Follow the `configurize` pattern:
  - class attrs declare sub-config types
  - instance `__init__` sets concrete values
  - use `Ref("..path")` for cross-node linkage
  - configs expose `build()`/`build_*`, `sanity_check()`, `to_dict()`
- Only `Ref(...)` the exact parameter needed, not whole config objects.
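A minimal, self-contained sketch of the docstring and configurize conventions (class names and fields here are illustrative stand-ins, not the real codebase):

```python
class OptimizerConfig:
    lr: float = 1e-4
    """Peak learning rate."""

    def to_dict(self) -> dict:
        return {"lr": self.lr}

    def sanity_check(self) -> None:
        assert self.lr > 0


class ExpConfig:
    # class attr declares the sub-config *type*
    optimizer_cfg: OptimizerConfig

    def __init__(self) -> None:
        # instance __init__ sets the concrete value
        self.optimizer_cfg = OptimizerConfig()
        self.optimizer_cfg.lr = 3e-4

    def sanity_check(self) -> None:
        self.optimizer_cfg.sanity_check()
```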
- SFT experiments under `playground/sft/qwen3/*_sft_step3_data.py` typically follow:
  - `class Exp(BaseExp)`
  - `model_cfg`/`data_cfg` declared as class attrs
  - trainer / checkpoint / model fields adjusted in `__init__`
  - entrypoint is `if __name__ == "__main__": Exp().train()`
- After `uv sync`, also install `redis-server`: `apt install -y redis-server`
- Set:
  - `CUDA_HOME=/data/cuda/cuda-12.9/cuda`
  - `CUDACXX=$CUDA_HOME/bin/nvcc`
- Install: `pip install -e /data/DeepEP --no-build-isolation`
- Do not rely on random prebuilt wheels; ABI mismatch is common.
- Use repo source in `third_party/grouped_gemm`.
- Ensure CUTLASS headers exist by linking to FlashInfer CUTLASS:
  - `rmdir third_party/grouped_gemm/third_party/cutlass`
  - `ln -s .venv/lib/python3.10/site-packages/flashinfer/data/cutlass third_party/grouped_gemm/third_party/cutlass`
- Build with CUDA 12.9:
  - `CUDA_HOME=/data/cuda/cuda-12.9/cuda CUDACXX=/data/cuda/cuda-12.9/cuda/bin/nvcc .venv/bin/pip install -e third_party/grouped_gemm --no-build-isolation`
- Runtime constraints:
  - `batch_sizes` must be CPU-visible / `torch.int64`
  - inputs must be bf16 for `nv_grouped_gemm`
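A hedged illustration of those two constraints (shapes and sizes are made up; this only demonstrates the dtype/device requirements, not a real kernel call):

```python
import torch

# batch_sizes must stay on CPU as int64; the GEMM operands must be bf16.
batch_sizes = torch.tensor([128, 256, 64], dtype=torch.int64)  # CPU tensor
total_tokens = int(batch_sizes.sum())
a = torch.randn(total_tokens, 512, dtype=torch.bfloat16)

assert batch_sizes.device.type == "cpu"
assert batch_sizes.dtype == torch.int64
assert a.dtype == torch.bfloat16
```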
- `steptronoss/exp` provides abstract `*Config` interfaces (`build_*`, `get_trainer_cls`)
- Concrete configs live mainly in `steptronoss/exp/base_exp.py`
- Ready-made experiment families:
  - `PretrainExp`/`NTPTrainerConfig` in `ntp.py`
  - `SFTExp`/`SFTDataConfig` in `sft.py`
  - inference configs in `inference.py`
- Common training configs:
  - `AdamConfig`
  - constant / linear / cosine schedulers
  - checkpoint config (`SaveOptions`, `LoadOptions`, `CheckpointConfig`)
- After creating or editing an experiment:
  - run `cfshow <exp.py>` to inspect the config tree
  - make sure `sanity_check()` passes
  - run `mypy <exp.py>`
- If experiment B is derived from experiment A, use a `cfshow` diff to verify changes.
- Pretrain configs live under `playground/pretrain/`
- `playground/pretrain/step3p5/step3p5_flash.py` is the main recent Qwen3 config reference
- When translating a full `ModelConfig` into `step3p5_flash.py`, update only existing attrs
- Some keys map indirectly:
  - `disable_qk_norm` ↔ `use_qk_norm` (inverted)
  - `use_swiglu_limit` ↔ `swiglu_limit`
- If you change `num_layers`, keep all layer-wise lists in sync:
  - `qk_rope_head_dim`
  - `rope_theta`
  - `use_fused_qknorm_and_rope`
  - `use_swiglu_limit`
  - `use_swiglu_limit_shared`
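The inverted mapping is easy to get wrong when porting configs; a sketch of the translation and the layer-wise sync check (the helper name is hypothetical):

```python
def translate_qk_norm(src_cfg: dict) -> dict:
    """Map the source use_qk_norm flag onto the inverted disable_qk_norm."""
    return {"disable_qk_norm": not src_cfg["use_qk_norm"]}

# use_qk_norm=True in the source config means disable_qk_norm=False here.
assert translate_qk_norm({"use_qk_norm": True}) == {"disable_qk_norm": False}

# When num_layers changes, every layer-wise list must be resized together:
num_layers = 4
rope_theta = [1_000_000.0] * num_layers
use_swiglu_limit = [False] * num_layers
assert len(rope_theta) == len(use_swiglu_limit) == num_layers
```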
- Global `PM` in `steptronoss.core.parallel_state` is the `ParallelManager`
- Typical flow:
  - `PM.initialize()`
  - `PM.set_mesh(parallel_cfg)`, or `with PM.use_mesh(parallel_cfg): ...`
- Common helpers: `PM.define_parallel(pattern, **sizes)`, `PM.size_of("TP")`, `PM.rank_in("DP")`, `PM.group_of("PP")`, `PM.ranks_of("EP")`
- VPP uses: `virtual_pipeline_model_parallel_size`, `get_vpp_rank()`, `set_vpp_rank()`
- `ParallelConfig.sanity_check()` requires:
  - `WORLD_SIZE` divisible by attention MP size = `PP * TP * CP`
  - `WORLD_SIZE` divisible by MoE MP size = `PP * ETP * EP`
- For an 8-GPU run with `TP=8` and `EP=8`, set `expert_tensor_parallel_size=1`; otherwise MoE MP size becomes 64 and the config is invalid
- In mixed dense/MoE topologies, expert params are reduced over `EDP`, not dense `DP`; the current gradient manager compensates with `TP/EP` scaling on expert grad buffers before the `EDP` reduction, so check that path before blaming an apparent extra `EP` factor.
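The divisibility rules above can be checked standalone; this mirrors what `ParallelConfig.sanity_check()` enforces (the function name and argument spelling are illustrative):

```python
def check_parallel_sizes(world_size: int, pp: int, tp: int, cp: int,
                         etp: int, ep: int) -> None:
    attn_mp = pp * tp * cp   # attention model-parallel size
    moe_mp = pp * etp * ep   # MoE model-parallel size
    if world_size % attn_mp != 0:
        raise ValueError(f"WORLD_SIZE {world_size} not divisible by attention MP {attn_mp}")
    if world_size % moe_mp != 0:
        raise ValueError(f"WORLD_SIZE {world_size} not divisible by MoE MP {moe_mp}")

# 8 GPUs with TP=8 and EP=8 is only valid with expert_tensor_parallel_size=1:
check_parallel_sizes(world_size=8, pp=1, tp=8, cp=1, etp=1, ep=8)  # passes
try:
    check_parallel_sizes(world_size=8, pp=1, tp=8, cp=1, etp=8, ep=8)
except ValueError:
    pass  # MoE MP would be 64, which 8 GPUs cannot satisfy
```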
- `steptronoss/checkpointing/reshape_ops.py` contains reshape primitives: `VocabPad`, `ColumnParallel`/`RowParallel`, `KeepThisTP`/`KeepThisEP`, `GQAMergeQKV`, `FFNMergeGateUp`, `UnbindMoE`, `Rename`, `Inverse`
- Typical usage:
  - build `Script(src=..., op=..., dst=...)`
  - return `OnlineReshaper(scripts)`
- For expert slicing from per-expert keys, use `Inverse(UnbindMoE(...)) + KeepThisEP()` before TP ops
- `TrainableItem` should not pickle / serialize tokenizer instances; drop them in `__getstate__`
- `model_name` often includes `exp_id`; persist a template like `deployed-model-{EXP_ID}` so resume survives `exp_id` changes
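A self-contained sketch of both rules (class and field names are illustrative stand-ins for the real `TrainableItem`):

```python
class TrainableItemSketch:
    def __init__(self, tokenizer, exp_id: str):
        self.tokenizer = tokenizer
        self.exp_id = exp_id
        # persist the template, not the resolved name, so a resumed run with a
        # new exp_id re-derives the right model_name
        self.model_name_template = "deployed-model-{EXP_ID}"

    @property
    def model_name(self) -> str:
        return self.model_name_template.format(EXP_ID=self.exp_id)

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("tokenizer", None)  # tokenizers must never be serialized
        return state


item = TrainableItemSketch(tokenizer=object(), exp_id="exp42")
state = item.__getstate__()          # what pickle would serialize
assert "tokenizer" not in state
assert item.model_name == "deployed-model-exp42"
```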
- Use `MuonConfig.mark_muon_params(model)` before grouping
- In experiments, prefer overriding `optimizer_cfg` via a `GradientManagerConfig` subclass that sets `optimizer_cfg = MuonConfig`
- Leave distributed optimizer on, but avoid byte-level sharding
- For Muon tests, prefer composing existing reshape ops instead of inventing new ones
- Follow:
  - `docs/TRITON_ACCELERATION_WORKFLOW.md`
  - `docs/TRITON_ACCELERATION_WORKFLOW_ZH.md`
- Rules:
  - optimized implementations belong under `steptronoss/model/optimizations/*`
  - semantic entrypoints stay in `steptronoss/model/utils/*`
  - expose alternatives through `@optimizable(...)`
  - alternatives must be strict drop-in replacements
  - add correctness tests, backward-aware benchmarks, and a short real experiment trace
- `rg` may be unavailable; fall back to `find`/`grep`
- `python` may be missing and `python3` may not include `pytest`; prefer project tooling if available
- `tests/conftest.py` now applies a shared skip to every `@pytest.mark.node2` test unless the run is launched under `torchrun --nproc-per-node=2`; plain `pytest` should skip them instead of hanging in distributed init.
- This environment may not have worker / GPU access; avoid running GPU-only tests when the machine does not actually have GPUs
- `@pytest.mark.node2` tests should also use `pytest.mark.xdist_group("torchrun")`
- Test layout:
  - single-node GPU tests: `tests/test_muon_optimizer.py`
  - 2-node GPU tests: `tests/test_muon_optimizer_node2.py`
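The shared node2 skip can be sketched like this (the hook body is a hypothetical reconstruction, not the exact `tests/conftest.py` contents). It relies on torchrun exporting `WORLD_SIZE`, so its absence or a wrong value means the run is plain `pytest`:

```python
import os

import pytest


def launched_with_two_ranks(environ=None) -> bool:
    """True when the process was started by torchrun --nproc-per-node=2."""
    env = os.environ if environ is None else environ
    return env.get("WORLD_SIZE") == "2"


def pytest_collection_modifyitems(config, items):
    if launched_with_two_ranks():
        return
    skip_node2 = pytest.mark.skip(reason="needs torchrun --nproc-per-node=2")
    for item in items:
        if "node2" in item.keywords:
            item.add_marker(skip_node2)
```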
- `steptronoss/model/ep_dispatcher/deepep_dispatcher.py` must keep `recv_token_probs` differentiable and pass `grad_recv_token_probs` into `buffer.combine(...)`; otherwise router main-loss gradients are cut when `TokenDispatcher="deep_ep"`.
- `steptronoss.utils.memory_tracker.CMT` only records when `MEM_DIAGNOSE=1`
- If training hangs on `Waiting for debugger... ip: ... rank: 56`, check for a stray `debug(56)` in `steptronoss/core/trainers/lm_trainer.py`
- TorchDynamo graph breaks are often triggered by `Tensor.item()` in optimizable helpers; prefer tensor-safe checks like masked `amax` + `torch._assert`