This file stores repo-specific priors for future agents. Keep it short, practical, and biased toward things that save repeated exploration.
Do an improve pass for key tasks:
- new environment
- new task type
- new project area
- high risk / high cost work
- collaboration / handoff work
Improve pass:
- Identify friction
- Extract reusable priors
- Write them down in:
  - `AGENTS.md` for repo-wide priors
  - `docs/` for process / runbook details
- Core package: `steptronoss/` - core, model, data, exp, optimizer, generation, tokenizer, utils, checkpointing
- Experiments: `playground/`
- Tests: `tests/`
- Docs: `docs/`
- `steptronoss/utils/arguments.py`: config overrides from CLI
- `steptronoss/utils/comm_utils.py`: Redis rendezvous, queues, `LocalFuture`/`RemoteFuture`
- `steptronoss/utils/dist_utils.py`: broadcast / all-to-all helpers, packing helpers, balancing helpers
- `steptronoss/utils/general.py`: numeric helpers, list split/balance, RNG fork, retry, recursion helpers, git hash
- `steptronoss/utils/logger.py`: rank-aware logging and `StepWriter`
- `steptronoss/utils/metrics.py`: metrics system (`Metric`, `Avg`, `Percentage`, `Histogram`, `Text`, `GradNorm`, `GlobalMetrics`)
- `steptronoss/utils/optimizable.py`: `@optimizable(...)` and `set_optimization(...)`
- `steptronoss/utils/utils.py`: model unwrap, param norms, memory report, layer map, IO helpers, generic load
- `steptronoss/utils/weight_loader.py`: HF safetensors mapping / merge
- Config class fields should include a short triple-quoted docstring immediately after the attribute definition.
- Follow the `configurize` pattern:
  - class attrs declare sub-config types
  - instance `__init__` sets concrete values
  - use `Ref("..path")` for cross-node linkage
  - configs expose `build()`/`build_*`, `sanity_check()`, `to_dict()`
- Only `Ref(...)` the exact parameter needed, not whole config objects.
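A minimal, self-contained sketch of the docstring and configurize conventions (class names and fields here are illustrative stand-ins, not the real codebase):

```python
class OptimizerConfig:
    lr: float = 1e-4
    """Peak learning rate."""

    def to_dict(self) -> dict:
        return {"lr": self.lr}

    def sanity_check(self) -> None:
        assert self.lr > 0


class ExpConfig:
    # class attr declares the sub-config *type*
    optimizer_cfg: OptimizerConfig

    def __init__(self) -> None:
        # instance __init__ sets the concrete value
        self.optimizer_cfg = OptimizerConfig()
        self.optimizer_cfg.lr = 3e-4

    def sanity_check(self) -> None:
        self.optimizer_cfg.sanity_check()
```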
- SFT experiments under `playground/sft/qwen3/*_sft_step3_data.py` typically follow:
  - `class Exp(BaseExp)`
  - `model_cfg`/`data_cfg` declared as class attrs
  - trainer / checkpoint / model fields adjusted in `__init__`
  - entrypoint is `if __name__ == "__main__": Exp().train()`
- After `uv sync`, also install `redis-server`: `apt install -y redis-server`
- Set:
  - `CUDA_HOME=/data/cuda/cuda-12.9/cuda`
  - `CUDACXX=$CUDA_HOME/bin/nvcc`
- Install: `pip install -e /data/DeepEP --no-build-isolation`
- Do not rely on random prebuilt wheels; ABI mismatch is common.
- Use repo source in `third_party/grouped_gemm`.
- Ensure CUTLASS headers exist by linking to FlashInfer CUTLASS:
  - `rmdir third_party/grouped_gemm/third_party/cutlass`
  - `ln -s .venv/lib/python3.10/site-packages/flashinfer/data/cutlass third_party/grouped_gemm/third_party/cutlass`
- Build with CUDA 12.9:
  - `CUDA_HOME=/data/cuda/cuda-12.9/cuda CUDACXX=/data/cuda/cuda-12.9/cuda/bin/nvcc .venv/bin/pip install -e third_party/grouped_gemm --no-build-isolation`
- Runtime constraints:
  - `batch_sizes` must be CPU-visible / `torch.int64`
  - inputs must be bf16 for `nv_grouped_gemm`
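A hedged illustration of those two constraints (shapes and sizes are made up; this only demonstrates the dtype/device requirements, not a real kernel call):

```python
import torch

# batch_sizes must stay on CPU as int64; the GEMM operands must be bf16.
batch_sizes = torch.tensor([128, 256, 64], dtype=torch.int64)  # CPU tensor
total_tokens = int(batch_sizes.sum())
a = torch.randn(total_tokens, 512, dtype=torch.bfloat16)

assert batch_sizes.device.type == "cpu"
assert batch_sizes.dtype == torch.int64
assert a.dtype == torch.bfloat16
```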
- `steptronoss/exp` provides abstract `*Config` interfaces (`build_*`, `get_trainer_cls`)
- Concrete configs live mainly in `steptronoss/exp/base_exp.py`
- Ready-made experiment families:
  - `PretrainExp`/`NTPTrainerConfig` in `ntp.py`
  - `SFTExp`/`SFTDataConfig` in `sft.py`
  - inference configs in `inference.py`
- Common training configs:
  - `AdamConfig`
  - constant / linear / cosine schedulers
  - checkpoint config (`SaveOptions`, `LoadOptions`, `CheckpointConfig`)
- After creating or editing an experiment:
  - run `cfshow <exp.py>` to inspect the config tree
  - make sure `sanity_check()` passes
  - run `mypy <exp.py>`
- If experiment B is derived from experiment A, use a `cfshow` diff to verify changes.
- Pretrain configs live under `playground/pretrain/`
- `playground/pretrain/step3p5/step3p5_flash.py` is the main recent Qwen3 config reference
- When translating a full `ModelConfig` into `step3p5_flash.py`, update only existing attrs
- Some keys map indirectly:
  - `disable_qk_norm` ↔ `use_qk_norm` (inverted)
  - `use_swiglu_limit` ↔ `swiglu_limit`
- If you change `num_layers`, keep all layer-wise lists in sync:
  - `qk_rope_head_dim`
  - `rope_theta`
  - `use_fused_qknorm_and_rope`
  - `use_swiglu_limit`
  - `use_swiglu_limit_shared`
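The inverted mapping is easy to get wrong when porting configs; a sketch of the translation and the layer-wise sync check (the helper name is hypothetical):

```python
def translate_qk_norm(src_cfg: dict) -> dict:
    """Map the source use_qk_norm flag onto the inverted disable_qk_norm."""
    return {"disable_qk_norm": not src_cfg["use_qk_norm"]}

# use_qk_norm=True in the source config means disable_qk_norm=False here.
assert translate_qk_norm({"use_qk_norm": True}) == {"disable_qk_norm": False}

# When num_layers changes, every layer-wise list must be resized together:
num_layers = 4
rope_theta = [1_000_000.0] * num_layers
use_swiglu_limit = [False] * num_layers
assert len(rope_theta) == len(use_swiglu_limit) == num_layers
```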
- Global `PM` in `steptronoss.core.parallel_state` is the `ParallelManager`
- Typical flow:
  - `PM.initialize()`
  - `PM.set_mesh(parallel_cfg)`, or `with PM.use_mesh(parallel_cfg): ...`
- Common helpers: `PM.define_parallel(pattern, **sizes)`, `PM.size_of("TP")`, `PM.rank_in("DP")`, `PM.group_of("PP")`, `PM.ranks_of("EP")`
- VPP uses: `virtual_pipeline_model_parallel_size`, `get_vpp_rank()`, `set_vpp_rank()`
- `ParallelConfig.sanity_check()` requires:
  - `WORLD_SIZE` divisible by attention MP size = `PP * TP * CP`
  - `WORLD_SIZE` divisible by MoE MP size = `PP * ETP * EP`
- For an 8-GPU run with `TP=8` and `EP=8`, set `expert_tensor_parallel_size=1`; otherwise MoE MP size becomes 64 and the config is invalid
- In mixed dense/MoE topologies, expert params are reduced over `EDP`, not dense `DP`; the current gradient manager compensates with `TP/EP` scaling on expert grad buffers before the `EDP` reduction, so check that path before blaming an apparent extra `EP` factor.
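The divisibility rules above can be checked standalone; this mirrors what `ParallelConfig.sanity_check()` enforces (the function name and argument spelling are illustrative):

```python
def check_parallel_sizes(world_size: int, pp: int, tp: int, cp: int,
                         etp: int, ep: int) -> None:
    attn_mp = pp * tp * cp   # attention model-parallel size
    moe_mp = pp * etp * ep   # MoE model-parallel size
    if world_size % attn_mp != 0:
        raise ValueError(f"WORLD_SIZE {world_size} not divisible by attention MP {attn_mp}")
    if world_size % moe_mp != 0:
        raise ValueError(f"WORLD_SIZE {world_size} not divisible by MoE MP {moe_mp}")

# 8 GPUs with TP=8 and EP=8 is only valid with expert_tensor_parallel_size=1:
check_parallel_sizes(world_size=8, pp=1, tp=8, cp=1, etp=1, ep=8)  # passes
try:
    check_parallel_sizes(world_size=8, pp=1, tp=8, cp=1, etp=8, ep=8)
except ValueError:
    pass  # MoE MP would be 64, which 8 GPUs cannot satisfy
```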
- `steptronoss/checkpointing/reshape_ops.py` contains reshape primitives: `VocabPad`, `ColumnParallel`/`RowParallel`, `KeepThisTP`/`KeepThisEP`, `GQAMergeQKV`, `FFNMergeGateUp`, `UnbindMoE`, `Rename`, `Inverse`
- Typical usage:
  - build `Script(src=..., op=..., dst=...)`
  - return `OnlineReshaper(scripts)`
- For expert slicing from per-expert keys, use `Inverse(UnbindMoE(...)) + KeepThisEP()` before TP ops
- `TrainableItem` should not pickle / serialize tokenizer instances; drop them in `__getstate__`
- `model_name` often includes `exp_id`; persist a template like `deployed-model-{EXP_ID}` so resume survives `exp_id` changes
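A self-contained sketch of both rules (class and field names are illustrative stand-ins for the real `TrainableItem`):

```python
class TrainableItemSketch:
    def __init__(self, tokenizer, exp_id: str):
        self.tokenizer = tokenizer
        self.exp_id = exp_id
        # persist the template, not the resolved name, so a resumed run with a
        # new exp_id re-derives the right model_name
        self.model_name_template = "deployed-model-{EXP_ID}"

    @property
    def model_name(self) -> str:
        return self.model_name_template.format(EXP_ID=self.exp_id)

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("tokenizer", None)  # tokenizers must never be serialized
        return state


item = TrainableItemSketch(tokenizer=object(), exp_id="exp42")
state = item.__getstate__()          # what pickle would serialize
assert "tokenizer" not in state
assert item.model_name == "deployed-model-exp42"
```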
- Use `MuonConfig.mark_muon_params(model)` before grouping
- In experiments, prefer overriding `optimizer_cfg` via a `GradientManagerConfig` subclass that sets `optimizer_cfg = MuonConfig`
- Leave distributed optimizer on, but avoid byte-level sharding
- For Muon tests, prefer composing existing reshape ops instead of inventing new ones
- Follow:
  - `docs/TRITON_ACCELERATION_WORKFLOW.md`
  - `docs/TRITON_ACCELERATION_WORKFLOW_ZH.md`
- Rules:
  - optimized implementations belong under `steptronoss/model/optimizations/*`
  - semantic entrypoints stay in `steptronoss/model/utils/*`
  - expose alternatives through `@optimizable(...)`
  - alternatives must be strict drop-in replacements
  - add correctness tests, backward-aware benchmarks, and a short real experiment trace
- `rg` may be unavailable; fall back to `find`/`grep`
- `python` may be missing and `python3` may not include `pytest`; prefer project tooling if available
- `tests/conftest.py` now applies a shared skip to every `@pytest.mark.node2` test unless the run is launched under `torchrun --nproc-per-node=2`; plain `pytest` should skip them instead of hanging in distributed init.
- This environment may not have worker / GPU access; avoid running GPU-only tests when the machine does not actually have GPUs
- `@pytest.mark.node2` tests should also use `pytest.mark.xdist_group("torchrun")`
- Test layout:
  - single-node GPU tests: `tests/test_muon_optimizer.py`
  - 2-node GPU tests: `tests/test_muon_optimizer_node2.py`
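The shared node2 skip can be sketched like this (the hook body is a hypothetical reconstruction, not the exact `tests/conftest.py` contents). It relies on torchrun exporting `WORLD_SIZE`, so its absence or a wrong value means the run is plain `pytest`:

```python
import os

import pytest


def launched_with_two_ranks(environ=None) -> bool:
    """True when the process was started by torchrun --nproc-per-node=2."""
    env = os.environ if environ is None else environ
    return env.get("WORLD_SIZE") == "2"


def pytest_collection_modifyitems(config, items):
    if launched_with_two_ranks():
        return
    skip_node2 = pytest.mark.skip(reason="needs torchrun --nproc-per-node=2")
    for item in items:
        if "node2" in item.keywords:
            item.add_marker(skip_node2)
```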
- `steptronoss/model/ep_dispatcher/deepep_dispatcher.py` must keep `recv_token_probs` differentiable and pass `grad_recv_token_probs` into `buffer.combine(...)`; otherwise router main-loss gradients are cut when `TokenDispatcher="deep_ep"`.
- `steptronoss.utils.memory_tracker.CMT` only records when `MEM_DIAGNOSE=1`
- If training hangs on `Waiting for debugger... ip: ... rank: 56`, check for a stray `debug(56)` in `steptronoss/core/trainers/lm_trainer.py`
- TorchDynamo graph breaks are often triggered by `Tensor.item()` in optimizable helpers; prefer tensor-safe checks like masked `amax` + `torch._assert`