Miles PR: #1045
Megatron for Miles PR: #28
Usage
Docker:
H200/B200 (cu129 x86):
docker pull radixark/miles:deepseek-v4
GB300 (cu130 arm64):
Coming soon
Single-node, layer-pruned model smoke test:
This run will not produce readable output; it is only a minimal check that the infrastructure works.
docker run -it --name flash-smoke --gpus all --privileged \
    --network=host --ipc=host --shm-size=16g --ulimit memlock=-1 \
    radixark/miles:deepseek-v4 bash

# inside the container:
# install miles (deepseek-v4 branch) in editable mode
rm -rf /root/miles && \
    git clone --depth 1 --branch deepseek-v4 https://github.com/yueming-yuan/miles.git /root/miles && \
    cd /root/miles && git pull --ff-only origin deepseek-v4 && \
    pip install -e . --no-deps

# install sglang (deepseek_v4 branch) in editable mode; its Python package lives in the repo's python/ subdirectory
rm -rf /root/sglang && \
    git clone --depth 1 --branch deepseek_v4 https://github.com/sgl-project/sglang.git /root/sglang && \
    cd /root/sglang && git pull --ff-only origin deepseek_v4 && \
    pip install -e python --no-deps

# launch the single-node smoke test on the 4-layer pruned model
cd /root/miles && python scripts/run_deepseek_v4.py full-train \
    --model-name DeepSeek-V4-Flash-FP8-4layer \
    --num-nodes 1 --num-gpus-per-node 8
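If the launch fails immediately, it is worth first confirming that all 8 GPUs are visible inside the container. A minimal check (assuming the image ships PyTorch, which the training stack requires):

# Sanity check: the smoke test expects 8 visible GPUs.
nvidia-smi -L
python -c "import torch; print(torch.cuda.device_count())"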
DeepSeek-V4-Flash (284B)
This command may differ depending on your cluster setup; check and configure run_deepseek_v4.py for the full training config, parallelism settings, and RL features.
By default, the run requires:
- 8 H200 nodes with 8 GPUs per node
- shared storage for /root/models and /root/datasets
- working NCCL/Gloo networking
- either the script-managed local Ray head, or an existing external Ray cluster enabled via MILES_SCRIPT_EXTERNAL_RAY=1 and RAY_ADDRESS (see the sketch after the command below)
cd /root/miles
python scripts/run_deepseek_v4.py full-train \
--model-name DeepSeek-V4-Flash-FP8 \
--num-nodes 8 \
--num-gpus-per-node 8
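For reference, a launch against a pre-existing Ray cluster might look like the following. This is a sketch: the two environment variables come from the script itself, but the head address 10.0.0.1:6379 is a placeholder you must replace with your own Ray head.

# Sketch: reuse an external Ray cluster instead of the script-managed local head.
# RAY_ADDRESS is a placeholder; point it at your actual Ray head node.
export MILES_SCRIPT_EXTERNAL_RAY=1
export RAY_ADDRESS="10.0.0.1:6379"
cd /root/miles
python scripts/run_deepseek_v4.py full-train \
    --model-name DeepSeek-V4-Flash-FP8 \
    --num-nodes 8 \
    --num-gpus-per-node 8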
DeepSeek-V4-Pro (1.6T)
TODO, ETA Apr 29/30
Verified Model Sizes
Low Precisions
Kernels
@Zhichenzzz
Features / Optimizations