Commit 7fc12cc

Adding GRPO/ART example
Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>
1 parent fb4edf5 commit 7fc12cc

3 files changed

Lines changed: 661 additions & 0 deletions

examples/fine-tuning/README.md

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,7 @@ All examples are built primarily on top of **Training Hub** algorithms running o
77
- **SFT (Supervised Fine-Tuning)**
88
- **OSFT (Orthogonal Subspace Fine-Tuning)**
99
- **LoRA + SFT (Low-Rank Adaptation)**
10+
- **GRPO (Group Relative Policy Optimization)**
1011

1112
For detailed algorithm documentation and configuration options, see the upstream [Training Hub documentation](https://github.com/Red-Hat-AI-Innovation-Team/training_hub/tree/main).
1213

@@ -93,6 +94,7 @@ Training is offloaded to **dedicated training pods** managed by **Kubeflow Train
9394
- [SFT fine-tuning example](sft/README.md)
9495
- [OSFT fine-tuning example](osft/README.md)
9596
- [LoRA fine-tuning example](lora/README.md)
97+
- [GRPO fine-tuning example](grpo/README.md) (single-GPU TrainJob only)
9698

9799
---
98100

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
1+
# GRPO Fine-Tuning with Training Hub
2+
3+
This example provides an overview of Training Hub's [GRPO (Group Relative Policy Optimization)](https://github.com/Red-Hat-AI-Innovation-Team/training_hub?tab=readme-ov-file#grpo) capabilities and demonstrates how to use them with Red Hat OpenShift AI.
4+
5+
## What is GRPO?
6+
7+
GRPO is a reinforcement learning from verifiable rewards (RLVR) algorithm that improves a model's outputs by comparing groups of responses and reinforcing the better ones:
8+
9+
- Generates multiple candidate responses per prompt
10+
- Scores them with a reward function (e.g. tool-call correctness)
11+
- Uses each response's reward relative to the rest of its group to compute advantage signals
12+
- Updates LoRA adapter weights via policy gradient with group normalization
13+
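The group-relative advantage step listed above can be illustrated with a minimal sketch. It assumes the common GRPO formulation, in which each reward is normalized by the mean and standard deviation of its group. The function name `group_relative_advantages` and the `eps` term are illustrative and are not taken from Training Hub or ART, whose internals may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of candidate responses.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate responses for one prompt (group_size=4), each scored 0 or 1.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group average receive a positive advantage and the rest a negative one. When every response in a group scores the same, all advantages are zero, so that prompt contributes no learning signal.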
14+
Each training iteration has two phases:
15+
16+
1. **Rollout phase** — vLLM generates candidate responses and a reward function scores them
17+
2. **Train phase** — Unsloth updates the LoRA adapter weights using the advantage signals
18+
19+
The ART backend time-shares a single GPU between vLLM (inference) and Unsloth (training) via `gpu_memory_utilization`.
20+
21+
### Training Task: Tool-Call Verification
22+
23+
The example uses the [Agent-Ark/Toucan-1.5M](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) dataset, which contains tool-calling conversations. The reward function verifies that the model produces syntactically correct tool calls with the expected function name and arguments.
24+
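A reward of this kind could look like the sketch below. The JSON shape (`name` and `arguments` keys), the function name `tool_call_reward`, and the partial-credit value of 0.5 are assumptions made for illustration. The notebook's actual reward function and the dataset's schema may differ.

```python
import json


def tool_call_reward(model_output: str, expected_name: str, expected_args: dict) -> float:
    """Score one candidate tool call (hypothetical scoring scheme).

    0.0 -> output is not a JSON object with the expected function name
    0.5 -> function name matches but the arguments differ
    1.0 -> function name and arguments both match
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0  # not syntactically valid JSON
    if not isinstance(call, dict) or call.get("name") != expected_name:
        return 0.0
    if call.get("arguments") != expected_args:
        return 0.5
    return 1.0
```

Because the check is deterministic and needs no judge model, the same function can score every candidate in a group cheaply during the rollout phase.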
25+
## Execution mode
26+
27+
GRPO runs as a **single-GPU TrainJob** submitted via the Kubeflow SDK. ART is single-GPU by design and manages its own vLLM subprocess internally.
28+
29+
The notebook submits a `TrainJob` from a lightweight workbench, and the training runs on a dedicated GPU pod managed by Kubeflow Trainer.
30+
31+
To learn more about execution modes for other algorithms, see the [fine-tuning execution modes overview](../README.md#execution-modes).
32+
33+
## RHOAI compatibility
34+
35+
This example is compatible with RHOAI version 3.5.
36+
37+
## Requirements
38+
39+
- An OpenShift cluster with OpenShift AI (RHOAI 3.5) installed:
40+
- The `dashboard` and `workbenches` components enabled
41+
- The `trainer` component enabled
42+
- A worker node with an NVIDIA GPU (Ampere-based or newer, 40GB+ VRAM).
43+
- A dynamic storage provisioner supporting RWX PVC provisioning. Talk to your cluster administrator about RWX storage options.
44+
45+
## Hardware requirements
46+
47+
For the workbench image, the example was run on `Training | Jupyter | PyTorch | CUDA | Python` and `Training | Jupyter | PyTorch | CPU Python`.
48+
This single image serves as both the training runtime and the Jupyter notebook environment, and comes with the pre-installed dependencies required
49+
to seamlessly run fine-tuning jobs.
50+
51+
### Workbench Requirements
52+
53+
| Image Type | Use Case | GPU | CPU | Memory |
54+
|------------|----------|-----|-----|--------|
55+
| Training \| Jupyter \| PyTorch \| CPU Python | Job submission and monitoring | None | 2 cores | 8Gi |
56+
| Training \| Jupyter \| PyTorch \| CUDA \| Python | Job submission + model evaluation | 1× GPU | 2 cores | 8Gi |
57+
58+
> [!NOTE]
59+
>
60+
> - The workbench does not run the training itself — it submits a TrainJob and monitors progress.
61+
> - A GPU on the workbench is only needed if you want to load and test the fine-tuned LoRA adapter after training completes.
62+
63+
### Training Pod Requirements
64+
65+
| Component | GPU | GPU Type | CPU | Memory |
66+
|-----------|-----|----------|-----|--------|
67+
| Training Pod | 1× GPU | NVIDIA A100, H100, or L40S (40GB+ VRAM) | 8 cores | 64Gi |
68+
69+
> [!NOTE]
70+
>
71+
> - GRPO requires a single GPU with at least 40GB VRAM. The `gpu_memory_utilization` parameter (default `0.45`) controls how much GPU memory is reserved for vLLM inference, with the remainder available for Unsloth training.
72+
> - CPU and memory requirements scale with model size and group size. The above values suit the example configuration (Qwen3-4B, group_size=4).
73+
> - The training pod is configured from the `client.train()` call within the notebook.
74+
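As a rough illustration of the split described in the note above, the following sketch computes how a GPU's memory would be divided for a given `gpu_memory_utilization`. The helper name `split_gpu_memory` is hypothetical, and the figures ignore CUDA context and framework overhead, so treat them as upper bounds.

```python
def split_gpu_memory(total_gb: float, gpu_memory_utilization: float = 0.45):
    """Return (vllm_gb, remaining_gb) for a single shared GPU."""
    if not 0.0 < gpu_memory_utilization < 1.0:
        raise ValueError("gpu_memory_utilization must be between 0 and 1")
    vllm_gb = total_gb * gpu_memory_utilization
    return vllm_gb, total_gb - vllm_gb


# A 40GB GPU with the default value of 0.45.
vllm_gb, train_gb = split_gpu_memory(40)
```

With the default value on a 40GB GPU, about 18GB is reserved for vLLM and about 22GB remains for Unsloth.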
75+
### Storage Requirements
76+
77+
| Purpose | Size | Access Mode | Storage Class | Notes |
78+
|---------|------|-------------|---------------|-------|
79+
| Shared Storage (PVC) total | 50Gi (Example Default) | RWX | Dynamic provisioner required | Shared between workbench and training pod |
80+
81+
> [!NOTE]
82+
>
83+
> - Storage can be created in the `Create Workbench` view on the RHOAI platform; however, a dynamic RWX provisioner must be configured before you create shared file storage in RHOAI.
84+
> - Shared storage is required — the training pod writes checkpoints and metrics to the PVC, and the workbench reads them for inspection and plotting.
85+
86+
## GRPO-specific considerations
87+
88+
- **`/dev/shm` volume**: vLLM requires a memory-backed `/dev/shm` for inter-process communication. The notebook configures this automatically via a `PodSpecOverride` that mounts an `emptyDir` with `medium: Memory`.
89+
- **`gpu_memory_utilization`**: Controls the vLLM/Unsloth memory split on the single GPU. The default `0.45` reserves 45% for vLLM inference and leaves the rest for Unsloth training. Adjust based on your model size and available VRAM.
90+
- **Hugging Face token**: Not strictly required for public models (e.g. Qwen3-4B) but recommended to avoid rate limits. Set the `HF_TOKEN` environment variable in the job configuration if needed.
91+
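For reference, the `/dev/shm` override described above corresponds to the following Kubernetes pod spec fragment, written here as plain Python dictionaries. The volume name `dshm` and the container name `training-container` are placeholders. How the fragment is passed to the Kubeflow SDK is not shown because it depends on the SDK version; the notebook contains the actual `PodSpecOverride`.

```python
# Memory-backed emptyDir volume (Kubernetes `emptyDir.medium: Memory`).
shm_volume = {"name": "dshm", "emptyDir": {"medium": "Memory"}}

# Mount the volume at /dev/shm inside the training container.
shm_volume_mount = {"name": "dshm", "mountPath": "/dev/shm"}

# Placeholder container name; the real name comes from the training runtime.
pod_spec_fragment = {
    "volumes": [shm_volume],
    "containers": [
        {"name": "training-container", "volumeMounts": [shm_volume_mount]},
    ],
}
```

The volume and the mount are linked only by their shared `name`, so the two entries must use the same value.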
92+
## Setup
93+
94+
### Setup Workbench
95+
96+
**Step 1.** Access the OpenShift AI dashboard, for example from the top navigation bar menu:
97+
98+
![](../images/01.png)
99+
100+
**Step 2.** Log in, then go to **_Data Science Projects_** and create a project:
101+
102+
![](../images/02.png)
103+
104+
**Step 3.** Once the project is created, click on **_Create a workbench_**:
105+
106+
![](../images/03.png)
107+
108+
**Step 4.** Select the appropriate workbench image (see the options under Workbench Requirements above):
109+
110+
![](../images/04a.png)
111+
112+
**Step 5.** You may want to create a **Hardware Profile** with GPU support, similar to the one below:
113+
114+
![](../images/04b.png)
115+
116+
**Step 6.** Select the hardware profile you want to use:
117+
118+
![](../images/04c.png)
119+
120+
> [!NOTE]
121+
> A GPU on the workbench is only needed if you want to test the fine-tuned model after training. The workbench itself only submits and monitors the TrainJob.
122+
123+
**Step 7.** Create **shared storage** that will be shared between the workbench and the training pod. Make sure it uses a storage class with RWX capability:
124+
125+
![](../images/04d.png)
126+
127+
> [!NOTE]
128+
> Alternatively, you can attach existing shared storage if you already have it.
129+
130+
**Step 8.** Review the storage configuration and click "Create workbench":
131+
132+
![](../images/04e.png)
133+
134+
**Step 9.** From the "Workbenches" page, click on **_Open_** when the workbench you've just created becomes ready:
135+
136+
![](../images/05.png)
137+
138+
### Running the example notebook
139+
140+
- From the workbench, clone this repository: `https://github.com/red-hat-data-services/red-hat-ai-examples.git`
141+
- Navigate to the `examples/fine-tuning/grpo` directory and open the [`grpo_lora-kubeflow-trainjob.ipynb`](./grpo_lora-kubeflow-trainjob.ipynb) notebook.
142+
143+
> [!NOTE]
144+
>
145+
> - You will need a Hugging Face token if using gated models (e.g., Llama models).
146+
> Set the `HF_TOKEN` environment variable in your job configuration.
147+
> You can skip the token for non-gated models such as Qwen3-4B, which the example configuration uses.
148+
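One way to forward the token only when it is set is sketched below. The helper name `build_job_env` is hypothetical, and how the resulting dictionary is handed to the job depends on the notebook's `client.train()` call, which is not reproduced here.

```python
import os


def build_job_env(environ=os.environ) -> dict:
    """Collect environment variables to pass to the training job."""
    env = {}
    token = environ.get("HF_TOKEN")
    if token:  # forward the token only when it is set and non-empty
        env["HF_TOKEN"] = token
    return env


job_env = build_job_env()
```

Leaving the variable out entirely when it is unset avoids passing an empty token to the job.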
149+
You can now proceed with the instructions from the notebook. Enjoy!
