[BUG] AMD MI300X Multi-Turn Training Node Crash - CPU 100% Utilization #1

@jhinpan

Description

When running multi-turn reinforcement learning training on AMD MI300X GPUs, the compute node crashes consistently within 2 training steps. The crash is preceded by CPU utilization spiking to 100%, indicating a potential resource contention or memory management issue specific to the AMD platform.

Environment

  • Hardware: 8× AMD MI300X GPUs
  • Docker Image: rlsys/tritonforge:stable
  • ROCm Version: 6.3.4
  • Platform: Ubuntu 22.04
  • Framework: SLIME with KBenchEval integration

Reproduction Steps

  1. Launch the AMD Docker container:
docker pull rlsys/tritonforge:stable
docker run -it \
  --device /dev/dri \
  --device /dev/kfd \
  --group-add video \
  --cap-add SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  --shm-size 128G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v "$HOME:$HOME" \
  --name tritonforge_dev \
  rlsys/tritonforge:stable \
  /bin/bash
  2. Navigate to the SLIME directory:
cd /root/TritonForge/SLIME
  3. Run the multi-turn training script:
bash scripts/run_agent_kbench_qwen3_8B_sft_amd_multiturn_robust.sh

Expected Behavior

The training should proceed through multiple epochs with stable CPU and GPU utilization, successfully generating and evaluating Triton kernels across multiple turns.

Actual Behavior

  • Training starts normally with initial setup completing successfully
  • Within 2 training steps, CPU utilization rapidly increases to 100%
  • Node becomes unresponsive and crashes
  • Tmux sessions (slime, buffer, eval_server) terminate unexpectedly

Observations

  1. Single-turn training works: The single-turn variant runs for 12 hours without issues
  2. NVIDIA platform unaffected: The same multi-turn training runs successfully on NVIDIA H100 GPUs
  3. Robust evaluation server: Despite using the enhanced eval_server_subprocess_robust.py with memory fault handling, crashes still occur
  4. Resource allocation: GPUs 0-5 for training, GPUs 6-7 for evaluation server
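
As a point of reference for the resource split above, a minimal launcher sketch (not the project's actual launch code) can pin each process group to its GPU range through `HIP_VISIBLE_DEVICES` / `ROCR_VISIBLE_DEVICES`, the ROCm analogues of `CUDA_VISIBLE_DEVICES`, so training and evaluation cannot land on the same devices:

```python
import os

# Sketch only: the real SLIME scripts may assign devices differently.
TRAIN_GPUS = "0,1,2,3,4,5"  # training workers
EVAL_GPUS = "6,7"           # evaluation server

def env_for(gpus: str) -> dict:
    """Copy the current environment, restricting visible ROCm devices."""
    env = dict(os.environ)
    env["HIP_VISIBLE_DEVICES"] = gpus   # honored by the HIP runtime
    env["ROCR_VISIBLE_DEVICES"] = gpus  # honored by the ROCr runtime
    return env

if __name__ == "__main__":
    # Each subprocess (e.g. via subprocess.Popen(..., env=env_for(...)))
    # would then only see its own slice of the 8 GPUs.
    for name, gpus in [("train", TRAIN_GPUS), ("eval", EVAL_GPUS)]:
        print(f"{name}: HIP_VISIBLE_DEVICES={env_for(gpus)['HIP_VISIBLE_DEVICES']}")
```

Verifying that both process groups actually see disjoint device lists is a quick way to rule out simple device contention as the crash trigger.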

Attempted Mitigations

The following environment variables and configurations have been set but do not prevent the crash:

export HSA_ENABLE_COREDUMP=0
export AMD_LOG_LEVEL=0
export ROCM_DISABLE_CRASH_DUMP=1
export HIP_ENABLE_COREDUMP=0
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1
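
Since the node dies shortly after the CPU spike, a lightweight watchdog that logs utilization with timestamps can help capture the ramp-up before the machine becomes unresponsive. This is a stdlib-only diagnostic sketch (Linux-specific, reading /proc/stat), not part of the project:

```python
import time

def cpu_percent(interval: float = 1.0) -> float:
    """Aggregate CPU utilization over `interval` seconds, from /proc/stat."""
    def snapshot():
        # First line: "cpu user nice system idle iowait irq softirq ..."
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]  # idle + iowait
        return idle, sum(fields)

    idle0, total0 = snapshot()
    time.sleep(interval)
    idle1, total1 = snapshot()
    dt = total1 - total0
    return 100.0 * (1.0 - (idle1 - idle0) / dt) if dt else 0.0

if __name__ == "__main__":
    # In practice this would loop until the crash; a few samples shown here.
    for _ in range(3):
        pct = cpu_percent(0.2)
        print(f"{time.strftime('%H:%M:%S')} cpu={pct:.1f}%", flush=True)
        if pct > 95.0:
            print("WARNING: CPU saturated", flush=True)
```

Running this (redirected to a file) alongside training would show whether the spike is gradual (suggesting a leak or runaway spawn loop) or instantaneous (suggesting a deadlocked spin).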

Logs

Key log locations for debugging:

  • Training: /root/TritonForge/SLIME/logs/slime_qwen3_sft_amd_robust_train.log
  • Buffer: /root/TritonForge/SLIME/logs/buffer_qwen3_sft_amd_robust.log
  • Eval Server: /root/TritonForge/SLIME/logs/eval_server_qwen3_sft_amd_robust.log

Status

🔍 Under Investigation

We are actively investigating this issue and working on a fix. Current areas of investigation include:

  • Ray cluster memory management on AMD platforms
  • ROCm/HIP specific resource allocation conflicts
  • Multi-process coordination between training and evaluation servers
  • Potential memory leaks in the multi-turn feedback loop
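
For the last item, one way to confirm or rule out a leak in the multi-turn loop is to log resident memory per training step. A stdlib-only sketch (Linux-specific; the `bytearray` allocation below merely simulates growth and stands in for a real training step):

```python
import re

def rss_mb() -> float:
    """Resident set size of this process in MiB, read from /proc/self/status."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(re.search(r"\d+", line).group()) / 1024.0  # kB -> MiB
    return 0.0

if __name__ == "__main__":
    baseline = rss_mb()
    blobs = []
    for step in range(3):
        # Placeholder for a real multi-turn step; simulates unbounded growth.
        blobs.append(bytearray(10 * 1024 * 1024))
        now = rss_mb()
        print(f"step {step}: RSS {now:.1f} MiB (delta {now - baseline:+.1f})")
```

A steadily growing delta across steps would point at the feedback loop retaining rollout or evaluation state; a flat delta would shift suspicion to Ray or the ROCm runtime instead.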

Workaround

For now, AMD users can:

  1. Use single-turn training which remains stable
  2. Run evaluation on NVIDIA GPUs if available in a hybrid setup
  3. Reduce the number of parallel processes

Related Information

  • This issue is specific to AMD MI300X GPUs
  • The robust evaluation server (eval_server_subprocess_robust.py) was designed to handle GPU memory faults but doesn't prevent this CPU-related crash
  • The issue may be related to Ray's interaction with ROCm in multi-GPU scenarios

Priority: High
Labels: bug, AMD, MI300X, multi-turn, training

We appreciate your patience as we work to resolve this issue. Updates will be posted here as we make progress.
