---
blogpost: true
blog_title: "Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm Integration"
date: 24 Apr 2025
author: 'Yusheng Su, Vicky Tsang, Yao Liu, Phani Vaddadi, Vish Vadlamani, Zicheng Liu'
thumbnail: 'verl.jpg'
tags: Fine-Tuning, Reinforcement Learning, AI/ML
category: Applications & models
key_value_propositions: verl, an efficient RLHF framework designed for scalable RLHF training
language: English
myst:
  html_meta:
    "author": "Yusheng Su, Vicky Tsang, Yao Liu, Phani Vaddadi, Vish Vadlamani, Zicheng Liu"
    "description lang=en": "Deploy verl on AMD GPUs for fast, scalable RLHF training with ROCm optimization, Docker scripts, and impressive throughput-convergence results"
    "keywords": "verl, Training, AMD GPUs."
    "property=og:locale": "en_US"
---
## Key Takeaways

1. verl framework and its advantages
2. AMD ROCm software support and docker image for the [v0.3.0.post0](https://github.com/volcengine/verl/releases/tag/v0.3.0.post0) version of verl
3. Single-node and multi-node training scripts for verl
4. Throughput and convergence accuracy

## Enabling verl on AMD Instinct™ with ROCm and Docker

To ensure verl functions effectively on AMD Instinct GPUs, we contributed several key ROCm enhancements. First, we updated the verl codebase for compatibility with the ROCm kernel, enabling stable and efficient execution on AMD GPUs (see PR: [\[Hardware\] Support AMD (ROCMm Kernel) #360](https://github.com/volcengine/verl/pull/360)). We also addressed issues in the third-party library **Ray** on AMD Instinct, allowing it to handle dynamic resource allocation reliably, an essential step toward high overall training efficiency (see PR: [Replace AMD device env var with HIP_VISIBLE_DEVICES #51104](https://github.com/ray-project/ray/pull/51104)). In addition, we provide [Dockerfiles](https://github.com/volcengine/verl/blob/main/docker/Dockerfile.rocm) to simplify the setup of the verl training environment. These contributions have been merged into the official verl upstream repository and are included in the [v0.3.0.post0](https://github.com/volcengine/verl/releases/tag/v0.3.0.post0) release.

## Running verl on AMD GPUs: Single-Node and Multi-Node Setups

Let's get started with running verl on AMD Instinct:

### Single-node Training

The first step is to launch the verl docker image. You can build your own from the provided [Dockerfile](https://github.com/volcengine/verl/blob/main/docker/Dockerfile.rocm), which starts from the following base image:

```dockerfile
FROM rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
```

After building the Docker image (`verl-rocm`) using the provided Dockerfile with the command `docker build -t verl-rocm .`, you can proceed to the next steps.
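
To then start a container from that image, something like the following works on ROCm hosts (a sketch; the device and shared-memory flags are the usual ROCm Docker settings, and the volume mount and shared-memory size are illustrative):

```shell
# Expose the AMD GPUs (/dev/kfd, /dev/dri) to the container and give it enough shared memory
docker run -it \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  --ipc=host --shm-size 16G \
  -v "$PWD":/workspace -w /workspace \
  verl-rocm
```
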
You can choose any model supported by verl and assign it to the `$MODEL_PATH` variable. In our case, we use `Qwen/Qwen2-7B-Instruct` and `deepseek-ai/deepseek-llm-7b-chat`. As for the dataset, you are free to use any dataset of your choice; just ensure it is converted into the required format. In this example, we use `gsm8k`, as verl already provides preprocessing code to format it appropriately.
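
For instance, with verl's bundled gsm8k preprocessor (a sketch; the script path follows the verl repository layout and the output directory is arbitrary):

```shell
# Download gsm8k and convert it to the parquet format verl expects
python3 examples/data_preprocess/gsm8k.py --local_dir $HOME/data/gsm8k

# Any verl-supported model works here; we use Qwen2-7B-Instruct
export MODEL_PATH=Qwen/Qwen2-7B-Instruct
```
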
#### Environment Variable Setup

```shell
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export ROCR_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
GPUS_PER_NODE=8
```

You must assign `HIP_VISIBLE_DEVICES` and `ROCR_VISIBLE_DEVICES`.
This is the most crucial part, and the only difference compared to running with the CUDA version of torch.
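
A quick way to confirm the masking took effect (a sketch; ROCm builds of PyTorch expose HIP devices through the familiar `torch.cuda` API):

```shell
# Should print 8, matching the devices listed in HIP_VISIBLE_DEVICES
python3 -c "import torch; print(torch.cuda.device_count())"
```
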
Then, you can run the RLHF algorithms provided by verl, including PPO, GRPO, ReMax, REINFORCE++, RLOO, PRIME, and others. In this example, we use PPO and GRPO to illustrate the workflow.
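
For example, a single-node PPO run can be launched along these lines (a minimal sketch using verl's Hydra-style overrides; the flags here are a small subset of what the official example scripts set, and the data paths assume the gsm8k preprocessing step above):

```shell
python3 -m verl.trainer.main_ppo \
  data.train_files=$HOME/data/gsm8k/train.parquet \
  data.val_files=$HOME/data/gsm8k/test.parquet \
  actor_rollout_ref.model.path=$MODEL_PATH \
  trainer.n_gpus_per_node=$GPUS_PER_NODE \
  trainer.nnodes=1
```

Switching from PPO to GRPO is mainly a matter of choosing a different advantage estimator (for example, `algorithm.adv_estimator=grpo`, as in verl's GRPO example scripts).
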
## Performance Benchmarks: Throughput and Convergence on MI300X vs. H100

In this section, we run our modified verl (v0.3.0.post0) and present the throughput and convergence accuracy results on H100 and MI300, using the same hyperparameter settings.

Note: For throughput (measured in Tokens/GPU/Second), we run 350 training steps for each setting. To ensure stability, we exclude any step where `step % 10 == 0`, then compute the average throughput over the remaining steps.
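
For instance, given a hypothetical two-column per-step log (`step tokens_per_gpu_per_sec`), the filtered average could be computed as:

```shell
# Drop every 10th step, then average the remaining throughput samples
awk '$1 % 10 != 0 { sum += $2; n++ } END { if (n) print sum / n }' throughput_per_step.log
```
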
Note: For convergence (measured by accuracy), we run 350 training steps and report the highest accuracy achieved during this period, as the training typically converges within these steps.

## Summary

As RLHF becomes a cornerstone in fine-tuning LLMs, verl offers a scalable, open-source solution.

## Contributors

Core contributors: Yusheng Su, Vicky Tsang, Yao Liu, Zicheng Liu

<meta name="keywords" content="blog, contributor, blog author">
</head>

(guanchen-li)=

# Guanchen Li

Guanchen Li is a Software Development Engineer at AMD, where he focuses on large-scale model optimization and efficiency-oriented AI systems. His work centers on LLM compression methods, including structural pruning, sparsity, and quantization, to improve inference speed, reduce memory footprint, and enhance deployment efficiency on modern hardware.

He received his M.S. degree in Computer Technology from the University of Science and Technology Beijing and his B.S. degree in Information Management and Information Systems from the Capital University of Economics and Business. His research interests span model compression, inference acceleration, and large-scale training systems.

Guanchen has authored several peer-reviewed papers at top NLP/ML conferences, including NeurIPS, NAACL, COLING, and ICML.