
Commit febcc6b (1 parent: 701e3ef)

New author and blog update

3 files changed: +78 additions, -13 deletions

.authorlist.txt (2 additions, 0 deletions)

@@ -150,6 +150,8 @@ Gowtham
 Ramesh
 Grant
 Pinkert
+Guanchen
+Li
 Guihong
 Li
 Han

blogs/artificial-intelligence/verl-large-scale/README.md (61 additions, 13 deletions)

@@ -2,7 +2,7 @@
 blogpost: true
 blog_title: "Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm Integration"
 date: 24 Apr 2025
-author: 'Yusheng Su, Vicky Tsang, Satya Jandhyala, Yao Liu, Phani Vaddadi, Vish Vadlamani, Zicheng Liu'
+author: 'Yusheng Su, Vicky Tsang, Yao Liu, Phani Vaddadi, Vish Vadlamani, Zicheng Liu'
 thumbnail: 'verl.jpg'
 tags: Fine-Tuning, Reinforcement Learning, AI/ML
 category: Applications & models
@@ -11,7 +11,7 @@ key_value_propositions: verl, an efficient RLHF framework designed for scalable
 language: English
 myst:
     html_meta:
-        "author": "Yusheng Su, Vicky Tsang, Satya Jandhyala, Yao Liu, Phani Vaddadi, Vish Vadlamani, Zicheng Liu"
+        "author": "Yusheng Su, Vicky Tsang, Yao Liu, Phani Vaddadi, Vish Vadlamani, Zicheng Liu"
         "description lang=en": "Deploy verl on AMD GPUs for fast, scalable RLHF training with ROCm optimization, Docker scripts, and impressive throughput-convergence results"
         "keywords": "verl, Training, AMD GPUs."
         "property=og:locale": "en_US"
@@ -54,7 +54,7 @@ Follow this guide to get started with verl on AMD Instinct GPUs and accelerate y
 ## Key Takeaways
 
 1. verl framework and its advantages
-2. AMD ROCm software support and docker image for the [v0.6.0](https://github.com/volcengine/verl/tree/0eb50ec4a33cda97e05ed8caab9c7f17a30c05a9) version verl
+2. AMD ROCm software support and docker image for the [v0.3.0.post0](https://github.com/volcengine/verl/releases/tag/v0.3.0.post0) version verl
 3. Single-node and multi-node training scripts for verl
 4. Throughput and convergence accuracy

@@ -68,15 +68,62 @@ Building on this foundation, an open-source community ([Volcengine](https://gith
 
 ## Enabling verl on AMD Instinct™ with ROCm and Docker
 
-To better support AMD Instinct GPUs and ROCm 7, we a provide a pre-built [docker images](https://hub.docker.com/r/rocm/verl/tags) to simplify the setup of the verl training environment.
+To ensure verl functions effectively on AMD Instinct GPUs, we contributed several key ROCm enhancements. First, we updated the verl codebase for compatibility with the ROCm kernel, enabling stable and efficient execution on AMD GPUs (see PR: [\[Hardware\] Support AMD (ROCm Kernel) #360](https://github.com/volcengine/verl/pull/360)). We also fixed issues in the third-party library **Ray** on AMD Instinct so that it handles dynamic resource allocation reliably, an essential step toward high overall training efficiency (see PR: [Replace AMD device env var with HIP_VISIBLE_DEVICES #51104](https://github.com/ray-project/ray/pull/51104)). Finally, to better support AMD Instinct GPUs, we provide [Dockerfiles](https://github.com/volcengine/verl/blob/main/docker/Dockerfile.rocm) to simplify the setup of the verl training environment. This support has been merged into the official verl upstream repository and is included in the [v0.3.0.post0](https://github.com/volcengine/verl/releases/tag/v0.3.0.post0) release.
 
 ## Running verl on AMD GPUs: Single-Node and Multi-Node Setups
 
 Let's get started with running verl on AMD Instinct:
 
 ### Single-node Training
 
-The first step is to launch the verl docker image:
+The first step is to launch the verl Docker image. You can build your own with the following Dockerfile:
+
+```dockerfile
+FROM rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
+
+# Set working directory
+WORKDIR $PWD/app
+
+# Set environment variables
+ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
+
+# Reinstall vLLM v0.6.3 from source
+RUN pip uninstall -y vllm && \
+    rm -rf vllm && \
+    git clone -b v0.6.3 https://github.com/vllm-project/vllm.git && \
+    cd vllm && \
+    MAX_JOBS=$(nproc) python3 setup.py install && \
+    cd .. && \
+    rm -rf vllm
+
+# Copy the entire project directory
+COPY . .
+
+# Install dependencies
+RUN pip install "tensordict<0.6" --no-deps && \
+    pip install accelerate \
+        codetiming \
+        datasets \
+        dill \
+        hydra-core \
+        liger-kernel \
+        numpy \
+        pandas \
+        peft \
+        "pyarrow>=15.0.0" \
+        pylatexenc \
+        "ray[data,train,tune,serve]" \
+        torchdata \
+        transformers \
+        wandb \
+        orjson \
+        pybind11 && \
+    pip install -e . --no-deps
+```
+
+After building the Docker image (verl-rocm) from this Dockerfile with `docker build -t verl-rocm .`, you can proceed to the next steps.
+
+#### Docker Launch
 
 ```shell
 docker run --rm -it \
@@ -91,7 +138,7 @@ docker run --rm -it \
 -v $HOME:$HOME \
 --shm-size 128G \
 -w $PWD \
-rocm/verl:verl-0.6.0.amd0_rocm7.0_vllm0.11.0.dev
+verl-rocm
 ```
 
 #### Data Prepare
@@ -102,7 +149,7 @@ python3 examples/data_preprocess/gsm8k.py --local_dir ../data/gsm8k
 
 #### Model Loading
 
-```shell
+```shell
 python3 -c "import transformers;transformers.pipeline('text-generation', model='Qwen/Qwen2-7B-Instruct')"
 python3 -c "import transformers;transformers.pipeline('text-generation', model='deepseek-ai/deepseek-llm-7b-chat')"
 ```
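
As a quick sanity check after the data-preparation step (a minimal sketch, not part of the commit; the exact column names depend on what the verl gsm8k preprocessor emits), you can inspect the generated parquet files before training:

```python
# Hypothetical check: confirm the preprocessed GSM8K parquet exists and peek at it.
import pandas as pd

df = pd.read_parquet("../data/gsm8k/train.parquet")
print(df.shape)             # number of training examples x columns
print(df.columns.tolist())  # column layout produced by the preprocessor
```
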
@@ -114,18 +161,19 @@ MODEL_PATH="Qwen/Qwen2-7B-Instruct"
 train_files="../data/gsm8k/train.parquet"
 test_files="../data/gsm8k/test.parquet"
 ```
-
+
 You can choose any model supported by verl and assign it to the `$MODEL_PATH` variable. In our case, we use `Qwen/Qwen2-7B-Instruct` and `deepseek-ai/deepseek-llm-7b-chat`. As for the dataset, you are free to use any dataset of your choice---just ensure it is converted into the required format. In this example, we use `gsm8k`, as verl already provides preprocessing code to format it appropriately.
 
 #### Environment Variable Setup
 
 ```shell
 export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+export ROCR_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
 GPUS_PER_NODE=8
 ```
 
-You must assign `HIP_VISIBLE_DEVICES`.
-This is the most crucial part---and the only difference compared to running on CUDA version torch.
+You must assign both `HIP_VISIBLE_DEVICES` and `ROCR_VISIBLE_DEVICES`.
+This is the most crucial part, and the only difference compared to running with the CUDA build of torch.
 
 Then, you can run the RLHF algorithms provided by verl, including PPO, GRPO, ReMax, REINFORCE++, RLOO, PRIME, and others. In this example, we use PPO and GRPO to illustrate the workflow.
 
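Before launching training, a quick way to confirm that both variables took effect (a minimal sketch, not part of the commit) is to ask PyTorch how many devices it can see; the ROCm build of torch exposes AMD GPUs through the standard `torch.cuda` API:

```python
# Hypothetical check: verify all eight GPUs are visible to the ROCm build of PyTorch.
import os

# Set the variables before importing torch so they are honored by the runtime.
os.environ.setdefault("HIP_VISIBLE_DEVICES", "0,1,2,3,4,5,6,7")
os.environ.setdefault("ROCR_VISIBLE_DEVICES", os.environ["HIP_VISIBLE_DEVICES"])

import torch

print(torch.cuda.is_available())   # True on a working ROCm install
print(torch.cuda.device_count())   # expect 8 with the settings above
```
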
@@ -238,7 +286,7 @@ For more detailed guidance, comprehensive single-node and multi-node training tu
 
 ## Performance Benchmarks: Throughput and Convergence on MI300X vs. H100
 
-In this section, we run verl (v0.6.0) and present the throughput and convergence accuracy results on H100 and MI300, respectively, using the same hyperparameter settings.
+In this section, we run our modified verl (v0.3.0.post0) and present the throughput and convergence accuracy results on H100 and MI300, respectively, using the same hyperparameter settings.
 
 PPO:
 |Platform |Model |TP_VALUE |INFERENCE_BATCH_SIZE |GPU_MEMORY_UTILIZATION |Throughput (Token/GPU/Sec)|Convergence(Acc.)
@@ -258,7 +306,7 @@ GRPO:
 |H100| deepseek-llm-7b-chat| 2| 110| 0.4| 1624.42| 71.2|
 |MI300^\[1\]^| deepseek-llm-7b-chat| 2| 110| 0.4| 1899.04| 70.9|
 
-Note: For throughput (measured in Tokens/GPU/Second), we run 350 training steps for each setting and compute the average throughput over the remaining steps.
+Note: For throughput (measured in Tokens/GPU/Second), we run 350 training steps for each setting. To ensure stability, we exclude any step where step % 10 == 0, then compute the average throughput over the remaining steps.
 Note: For convergence (measured by accuracy), we run 350 training steps and report the highest accuracy achieved during this period, as the training typically converges within these steps.
 
 ## Summary
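
The exclusion rule in the revised throughput note is simple to state in code (a minimal sketch, not from the commit; `per_step_tps` is a hypothetical mapping from step index to measured tokens/GPU/second):

```python
# Hypothetical sketch of the averaging rule described in the note:
# drop every step whose index is divisible by 10, then average the rest.
def average_throughput(per_step_tps: dict[int, float]) -> float:
    kept = [tps for step, tps in per_step_tps.items() if step % 10 != 0]
    return sum(kept) / len(kept)

# Example over a 350-step run (steps are 1-based; measurements are placeholders):
samples = {step: 1900.0 for step in range(1, 351)}
print(average_throughput(samples))
```
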
@@ -267,7 +315,7 @@ As RLHF becomes a cornerstone in fine-tuning LLMs, verl offers a scalable, open-
 
 ## Contributors
 
-Core contributors: Yusheng Su, Vicky Tsang, Satya Jandhyala, Yao Liu, Zicheng Liu
+Core contributors: Yusheng Su, Vicky Tsang, Yao Liu, Zicheng Liu
 
 Contributors: Xiaodong Yu, Gowtham Ramesh, Jiang Liu, Zhenyu Gu, Phani Vaddadi, Vish Vadlamani, Emad Barsoum

blogs/authors/guanchen-li.md (15 additions, 0 deletions)

@@ -0,0 +1,15 @@
+<head>
+  <meta charset="UTF-8">
+  <meta name="description" content="Guanchen Li">
+  <meta name="keywords" content="blog, contributor, blog author">
+</head>
+
+(guanchen-li)=
+
+# Guanchen Li
+
+Guanchen Li is a Software Development Engineer at AMD, where he focuses on large-scale model optimization and efficiency-oriented AI systems. His work centers on LLM compression methods, including structural pruning, sparsity, and quantization, to improve inference speed, reduce memory footprint, and enhance deployment efficiency on modern hardware.
+
+He received his M.S. degree in Computer Technology from the University of Science and Technology Beijing and his B.S. degree in Information Management and Information Systems from the Capital University of Economics and Business. His research interests span model compression, inference acceleration, and large-scale training systems.
+
+Guanchen has authored several peer-reviewed papers at top NLP/ML conferences, including NeurIPS, NAACL, COLING, and ICML.
