[BUG] AMD MI300X Multi-Turn Training Node Crash - CPU 100% Utilization #1

@jhinpan

Description

When running multi-turn reinforcement learning training on AMD MI300X GPUs, the compute node crashes consistently within 2 training steps. The crash is preceded by CPU utilization spiking to 100%, indicating a potential resource contention or memory management issue specific to the AMD platform.

Environment

  • Hardware: 8× AMD MI300X GPUs
  • Docker Image: rlsys/tritonforge:stable
  • ROCm Version: 6.3.4
  • Platform: Ubuntu 22.04
  • Framework: SLIME with KBenchEval integration

Reproduction Steps

  1. Launch the AMD Docker container:
docker pull rlsys/tritonforge:stable
docker run -it \
  --device /dev/dri \
  --device /dev/kfd \
  --group-add video \
  --cap-add SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  --shm-size 128G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v "$HOME:$HOME" \
  --name tritonforge_dev \
  rlsys/tritonforge:stable \
  /bin/bash
  2. Navigate to the SLIME directory:
cd /root/TritonForge/SLIME
  3. Run the multi-turn training script:
bash scripts/run_agent_kbench_qwen3_8B_sft_amd_multiturn_robust.sh

Expected Behavior

The training should proceed through multiple epochs with stable CPU and GPU utilization, successfully generating and evaluating Triton kernels across multiple turns.

Actual Behavior

  • Training starts normally with initial setup completing successfully
  • Within 2 training steps, CPU utilization rapidly increases to 100%
  • Node becomes unresponsive and crashes
  • Tmux sessions (slime, buffer, eval_server) terminate unexpectedly

Observations

  1. Single-turn training works: The single-turn variant runs for 12 hours without issues
  2. NVIDIA platform unaffected: The same multi-turn training runs successfully on NVIDIA H100 GPUs
  3. Robust evaluation server: Despite using the enhanced eval_server_subprocess_robust.py with memory fault handling, crashes still occur
  4. Resource allocation: GPUs 0-5 for training, GPUs 6-7 for evaluation server
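
As a point of reference for the resource split above, a minimal launcher sketch (not the project's actual launch code) can pin each process group to its GPU range through `HIP_VISIBLE_DEVICES` / `ROCR_VISIBLE_DEVICES`, the ROCm analogues of `CUDA_VISIBLE_DEVICES`, so training and evaluation cannot land on the same devices:

```python
import os

# Sketch only: the real SLIME scripts may assign devices differently.
TRAIN_GPUS = "0,1,2,3,4,5"  # training workers
EVAL_GPUS = "6,7"           # evaluation server

def env_for(gpus: str) -> dict:
    """Copy the current environment, restricting visible ROCm devices."""
    env = dict(os.environ)
    env["HIP_VISIBLE_DEVICES"] = gpus   # honored by the HIP runtime
    env["ROCR_VISIBLE_DEVICES"] = gpus  # honored by the ROCr runtime
    return env

if __name__ == "__main__":
    # Each subprocess (e.g. via subprocess.Popen(..., env=env_for(...)))
    # would then only see its own slice of the 8 GPUs.
    for name, gpus in [("train", TRAIN_GPUS), ("eval", EVAL_GPUS)]:
        print(f"{name}: HIP_VISIBLE_DEVICES={env_for(gpus)['HIP_VISIBLE_DEVICES']}")
```

Verifying that both process groups actually see disjoint device lists is a quick way to rule out simple device contention as the crash trigger.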

Attempted Mitigations

The following environment variables and configurations have been set but do not prevent the crash:

export HSA_ENABLE_COREDUMP=0
export AMD_LOG_LEVEL=0
export ROCM_DISABLE_CRASH_DUMP=1
export HIP_ENABLE_COREDUMP=0
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1
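
Since the node dies shortly after the CPU spike, a lightweight watchdog that logs utilization with timestamps can help capture the ramp-up before the machine becomes unresponsive. This is a stdlib-only diagnostic sketch (Linux-specific, reading /proc/stat), not part of the project:

```python
import time

def cpu_percent(interval: float = 1.0) -> float:
    """Aggregate CPU utilization over `interval` seconds, from /proc/stat."""
    def snapshot():
        # First line: "cpu user nice system idle iowait irq softirq ..."
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]  # idle + iowait
        return idle, sum(fields)

    idle0, total0 = snapshot()
    time.sleep(interval)
    idle1, total1 = snapshot()
    dt = total1 - total0
    return 100.0 * (1.0 - (idle1 - idle0) / dt) if dt else 0.0

if __name__ == "__main__":
    # In practice this would loop until the crash; a few samples shown here.
    for _ in range(3):
        pct = cpu_percent(0.2)
        print(f"{time.strftime('%H:%M:%S')} cpu={pct:.1f}%", flush=True)
        if pct > 95.0:
            print("WARNING: CPU saturated", flush=True)
```

Running this (redirected to a file) alongside training would show whether the spike is gradual (suggesting a leak or runaway spawn loop) or instantaneous (suggesting a deadlocked spin).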

Logs

Key log locations for debugging:

  • Training: /root/TritonForge/SLIME/logs/slime_qwen3_sft_amd_robust_train.log
  • Buffer: /root/TritonForge/SLIME/logs/buffer_qwen3_sft_amd_robust.log
  • Eval Server: /root/TritonForge/SLIME/logs/eval_server_qwen3_sft_amd_robust.log

Status

🔍 Under Investigation

We are actively investigating this issue and working on a fix. Current areas of investigation include:

  • Ray cluster memory management on AMD platforms
  • ROCm/HIP specific resource allocation conflicts
  • Multi-process coordination between training and evaluation servers
  • Potential memory leaks in the multi-turn feedback loop
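
For the last item, one way to confirm or rule out a leak in the multi-turn loop is to log resident memory per training step. A stdlib-only sketch (Linux-specific; the `bytearray` allocation below merely simulates growth and stands in for a real training step):

```python
import re

def rss_mb() -> float:
    """Resident set size of this process in MiB, read from /proc/self/status."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(re.search(r"\d+", line).group()) / 1024.0  # kB -> MiB
    return 0.0

if __name__ == "__main__":
    baseline = rss_mb()
    blobs = []
    for step in range(3):
        # Placeholder for a real multi-turn step; simulates unbounded growth.
        blobs.append(bytearray(10 * 1024 * 1024))
        now = rss_mb()
        print(f"step {step}: RSS {now:.1f} MiB (delta {now - baseline:+.1f})")
```

A steadily growing delta across steps would point at the feedback loop retaining rollout or evaluation state; a flat delta would shift suspicion to Ray or the ROCm runtime instead.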

Workaround

For now, AMD users can:

  1. Use single-turn training which remains stable
  2. Run evaluation on NVIDIA GPUs if available in a hybrid setup
  3. Reduce the number of parallel processes

Related Information

  • This issue is specific to AMD MI300X GPUs
  • The robust evaluation server (eval_server_subprocess_robust.py) was designed to handle GPU memory faults but doesn't prevent this CPU-related crash
  • The issue may be related to Ray's interaction with ROCm in multi-GPU scenarios

Priority: High
Labels: bug, AMD, MI300X, multi-turn, training

We appreciate your patience as we work to resolve this issue. Updates will be posted here as we make progress.
