Miles PR: #1045
Megatron for Miles PR: #28
Usage
Docker:
H200/B200 (cu129 x86):
docker pull radixark/miles:deepseek-v4
GB300 (cu130 arm64):
Coming soon
Single-node, layer-pruned model smoke test:
This run will not produce readable output; it is only a minimal check that the infrastructure works.
docker run -it --name flash-smoke --gpus all --privileged \
    --network=host --ipc=host --shm-size=16g --ulimit memlock=-1 \
    radixark/miles:deepseek-v4 bash

# inside the container:
# install miles (deepseek-v4 branch) in editable mode
rm -rf /root/miles && \
    git clone --depth 1 --branch deepseek-v4 https://github.com/yueming-yuan/miles.git /root/miles && \
    cd /root/miles && git pull --ff-only origin deepseek-v4 && \
    pip install -e . --no-deps

# install sglang (deepseek_v4 branch) in editable mode; its Python package lives in the repo's python/ subdirectory
rm -rf /root/sglang && \
    git clone --depth 1 --branch deepseek_v4 https://github.com/sgl-project/sglang.git /root/sglang && \
    cd /root/sglang && git pull --ff-only origin deepseek_v4 && \
    pip install -e python --no-deps

# launch the single-node smoke test on the 4-layer pruned model
cd /root/miles && python scripts/run_deepseek_v4.py full-train \
    --model-name DeepSeek-V4-Flash-FP8-4layer \
    --num-nodes 1 --num-gpus-per-node 8
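If the launch fails immediately, it is worth first confirming that all 8 GPUs are visible inside the container. A minimal check (assuming the image ships PyTorch, which the training stack requires):

# Sanity check: the smoke test expects 8 visible GPUs.
nvidia-smi -L
python -c "import torch; print(torch.cuda.device_count())"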
DeepSeek-V4-Flash (284B)
This command may differ depending on your cluster setup; check and configure run_deepseek_v4.py for the full training config, parallelism settings, and RL features.
By default, the run requires:
- 8 H200 nodes with 8 GPUs per node
- shared storage for /root/models and /root/datasets
- working NCCL/Gloo networking
- either the script-managed local Ray head, or an existing external Ray cluster enabled via MILES_SCRIPT_EXTERNAL_RAY=1 and RAY_ADDRESS (see the sketch after the command below)
cd /root/miles
python scripts/run_deepseek_v4.py full-train \
--model-name DeepSeek-V4-Flash-FP8 \
--num-nodes 8 \
--num-gpus-per-node 8
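For reference, a launch against a pre-existing Ray cluster might look like the following. This is a sketch: the two environment variables come from the script itself, but the head address 10.0.0.1:6379 is a placeholder you must replace with your own Ray head.

# Sketch: reuse an external Ray cluster instead of the script-managed local head.
# RAY_ADDRESS is a placeholder; point it at your actual Ray head node.
export MILES_SCRIPT_EXTERNAL_RAY=1
export RAY_ADDRESS="10.0.0.1:6379"
cd /root/miles
python scripts/run_deepseek_v4.py full-train \
    --model-name DeepSeek-V4-Flash-FP8 \
    --num-nodes 8 \
    --num-gpus-per-node 8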
DeepSeek-V4-Pro (1.6T)
TODO, ETA Apr 29/30
Verified Model Sizes
Low Precisions
Kernels
@Zhichenzzz
Features / Optimizations