Description
When running multi-turn reinforcement learning training on AMD MI300X GPUs, the compute node crashes consistently within 2 training steps. The crash is preceded by CPU utilization spiking to 100%, indicating a potential resource contention or memory management issue specific to the AMD platform.
Environment
- Hardware: 8x AMD MI300X GPUs
- Docker Image: `rlsys/tritonforge:stable`
- ROCm Version: 6.3.4
- Platform: Ubuntu 22.04
- Framework: SLIME with KBenchEval integration
Reproduction Steps
- Launch the AMD Docker container:

```shell
docker pull rlsys/tritonforge:stable
docker run -it \
  --device /dev/dri \
  --device /dev/kfd \
  --group-add video \
  --cap-add SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  --shm-size 128G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v "$HOME:$HOME" \
  --name tritonforge_dev \
  rlsys/tritonforge:stable \
  /bin/bash
```

- Navigate to the SLIME directory:

```shell
cd /root/TritonForge/SLIME
```

- Run the multi-turn training script:

```shell
bash scripts/run_agent_kbench_qwen3_8B_sft_amd_multiturn_robust.sh
```

Expected Behavior
The training should proceed through multiple epochs with stable CPU and GPU utilization, successfully generating and evaluating Triton kernels across multiple turns.
Actual Behavior
- Training starts normally with initial setup completing successfully
- Within 2 training steps, CPU utilization rapidly increases to 100%
- Node becomes unresponsive and crashes
- Tmux sessions (slime, buffer, eval_server) terminate unexpectedly
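Because the node becomes unresponsive before the spike can be inspected interactively, it helps to log CPU utilization continuously during a run. Below is a minimal watchdog sketch using only the Linux `/proc` interface; the threshold, interval, and tick count are illustrative values, not tuned ones:

```python
# Minimal CPU watchdog sketch (assumes a Linux /proc filesystem; the
# threshold and interval below are illustrative, not tuned values).
import time

def cpu_times():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq ..."
    with open("/proc/stat") as f:
        values = [int(v) for v in f.readline().split()[1:]]
    idle = values[3] + values[4]  # idle + iowait
    return sum(values), idle

def cpu_percent(interval=1.0):
    """Aggregate CPU utilization (%) sampled over `interval` seconds."""
    total1, idle1 = cpu_times()
    time.sleep(interval)
    total2, idle2 = cpu_times()
    busy = (total2 - total1) - (idle2 - idle1)
    return 100.0 * busy / max(total2 - total1, 1)

def watch(threshold=95.0, interval=2.0, ticks=30):
    """Print utilization each tick and flag sustained saturation."""
    for _ in range(ticks):
        pct = cpu_percent(interval)
        print(f"{time.strftime('%H:%M:%S')} cpu={pct:.1f}%", flush=True)
        if pct > threshold:
            print("WARNING: CPU saturation", flush=True)
```

Running this in a separate tmux pane (and redirecting to a file) gives a timeline that can be correlated with the training log around the crash.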
Observations
- Single-turn training works: the single-turn variant completes a 12-hour run without issues
- NVIDIA platform unaffected: the same multi-turn training runs successfully on NVIDIA H100 GPUs
- Robust evaluation server: crashes still occur despite using the enhanced `eval_server_subprocess_robust.py` with memory fault handling
- Resource allocation: GPUs 0-5 for training, GPUs 6-7 for the evaluation server
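The GPU split above can be expressed with ROCm's device-visibility variable. A sketch, assuming the container honours `HIP_VISIBLE_DEVICES` (the launch commands themselves are hypothetical stand-ins for whatever the scripts actually invoke):

```shell
# Training process sees only GPUs 0-5 (assumption: ROCm/HIP respects
# HIP_VISIBLE_DEVICES analogously to CUDA_VISIBLE_DEVICES on NVIDIA).
HIP_VISIBLE_DEVICES=0,1,2,3,4,5 bash scripts/run_agent_kbench_qwen3_8B_sft_amd_multiturn_robust.sh &

# Evaluation server sees only GPUs 6-7 (hypothetical invocation).
HIP_VISIBLE_DEVICES=6,7 python eval_server_subprocess_robust.py &
```

Note that `RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1` (set below under Attempted Mitigations) stops Ray from overwriting this variable in its workers, so both mechanisms interact.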
Attempted Mitigations
The following environment variables and configurations have been set but do not prevent the crash:
```shell
export HSA_ENABLE_COREDUMP=0
export AMD_LOG_LEVEL=0
export ROCM_DISABLE_CRASH_DUMP=1
export HIP_ENABLE_COREDUMP=0
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1
```

Logs
Key log locations for debugging:
- Training: `/root/TritonForge/SLIME/logs/slime_qwen3_sft_amd_robust_train.log`
- Buffer: `/root/TritonForge/SLIME/logs/buffer_qwen3_sft_amd_robust.log`
- Eval Server: `/root/TritonForge/SLIME/logs/eval_server_qwen3_sft_amd_robust.log`
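A quick triage pass over these logs can be scripted. The sketch below greps for fault signatures that commonly appear in ROCm and Ray logs; these patterns are generic assumptions, not strings confirmed from this particular crash:

```python
# Triage sketch: scan a log file for common ROCm/Ray fault signatures.
# The patterns are assumptions (typical ROCm/Ray error strings), not
# messages confirmed from this crash.
import re

PATTERNS = [
    r"HSA_STATUS_ERROR",           # HSA runtime faults
    r"Memory access fault",        # ROCm page-fault message
    r"out of memory|OOM",          # allocator / OS OOM killer
    r"raylet.*died|Worker.*died",  # Ray worker loss
]

def scan_log(path):
    """Return (line_number, line) pairs matching any known fault pattern."""
    rx = re.compile("|".join(PATTERNS), re.IGNORECASE)
    hits = []
    with open(path, errors="replace") as f:
        for n, line in enumerate(f, 1):
            if rx.search(line):
                hits.append((n, line.rstrip()))
    return hits
```

Running `scan_log` over all three files right after a crash narrows down which component (training, buffer, or eval server) failed first.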
Status
🔍 Under Investigation
We are actively investigating this issue and working on a fix. Current areas of investigation include:
- Ray cluster memory management on AMD platforms
- ROCm/HIP specific resource allocation conflicts
- Multi-process coordination between training and evaluation servers
- Potential memory leaks in the multi-turn feedback loop
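For the memory-leak hypothesis, a per-step host-memory probe can confirm or rule out steady growth. In this sketch, `training_step` is a hypothetical callback standing in for one multi-turn step; note that `ru_maxrss` reports the process's *peak* RSS, so a value that keeps climbing step after step is consistent with a leak:

```python
# Per-step host-memory probe (a sketch for the leak hypothesis;
# `training_step` is a hypothetical stand-in for one multi-turn step).
import resource

def rss_mb():
    # ru_maxrss is reported in kilobytes on Linux; this is the peak RSS
    # of the process so far, not the current usage.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def run_with_memory_log(training_step, num_steps):
    """Run `num_steps` steps, recording peak RSS (MB) after each one."""
    history = []
    for step in range(num_steps):
        training_step(step)
        history.append(rss_mb())
        print(f"step {step}: peak RSS {history[-1]:.1f} MB", flush=True)
    return history
```

Since peak RSS never decreases, a flat history here does not prove the absence of a leak in transient buffers, but monotonic growth across many steps is a strong signal worth profiling further.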
Workaround
For now, AMD users can:
- Use single-turn training, which remains stable
- Run evaluation on NVIDIA GPUs if available in a hybrid setup
- Reduce the number of parallel processes
Related Information
- This issue is specific to AMD MI300X GPUs
- The robust evaluation server (`eval_server_subprocess_robust.py`) was designed to handle GPU memory faults but does not prevent this CPU-related crash
- The issue may be related to Ray's interaction with ROCm in multi-GPU scenarios
Priority: High
Labels: bug, AMD, MI300X, multi-turn, training
We appreciate your patience as we work to resolve this issue. Updates will be posted here as we make progress.