Alphabet Sort

Adapted from the PRIME-RL alphabet_sort example to use medarc_slurm for single-node SLURM submission.

This example trains Qwen3-4B-Instruct-2507 to sort names alphabetically using LoRA. The base model already understands the conversation format, so no SFT warmup is needed; training proceeds directly to multi-turn RL against the primeintellect/alphabet-sort environment.

Note: This example uses 8 GPUs on a single node: 2 for training (FSDP) and 6 for inference (data parallel).

Note: medarc_train and medarc_slurm accept arbitrary PRIME-RL config overrides as CLI flags. In these examples, we use that passthrough to set wandb.project and wandb.name.
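One way to read a dotted override flag is as a path into the config file: the part before the last dot names a table, the part after it names a key. The sketch below only prints the TOML fragment that such a flag would correspond to under that assumption. `describe_override` is a hypothetical helper written for this illustration; it is not part of medarc_slurm or PRIME-RL, and the real parsing is done by PRIME-RL and may differ.

```shell
# Sketch: show which config table and key a dotted override flag targets.
# describe_override is an illustrative helper, not part of medarc_slurm or PRIME-RL.
describe_override() {
    flag="$1"
    value="$2"
    path="${flag#--}"      # drop the leading dashes, e.g. wandb.project
    table="${path%.*}"     # everything before the last dot, e.g. wandb
    key="${path##*.}"      # everything after the last dot, e.g. project
    printf '[%s]\n%s = "%s"\n' "$table" "$key" "$value"
}

# The two overrides used in this example.
describe_override --wandb.project alphabet-sort
describe_override --wandb.name alphabet-sort-4b-example
```

Under this reading, both flags set keys in the same wandb table, which is why they can be appended to any of the commands in this README without editing rl.toml.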

Setup

Install the bundled PRIME-RL environment packages. The fa3 extra adds FlashAttention 3 for Ampere, Hopper, and Lovelace GPUs; drop it if you don't want FlashAttention 3:

uv sync --extra envs --extra fa3

Verify that the alphabet_sort environment package is importable:

uv run python -c "import alphabet_sort"

RL (8 GPUs: 2 train + 6 inference)

Submit an 8-GPU RL job:

medarc_slurm rl --config examples/alphabet_sort/rl.toml \
    --output-dir output/examples/alphabet-sort \
    --train-gpus 2 \
    --infer-gpus 6 \
    --auto-auth \
    --wandb.project alphabet-sort --wandb.name alphabet-sort-4b-example

Or preview without submitting:

medarc_slurm rl --config examples/alphabet_sort/rl.toml \
    --output-dir output/examples/alphabet-sort \
    --train-gpus 2 \
    --infer-gpus 6 \
    --auto-auth \
    --dry-run \
    --wandb.project alphabet-sort --wandb.name alphabet-sort-4b-example
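The submit and preview commands above differ only in --dry-run, so it can help to keep the shared flags in one place. The following is a minimal sketch, assuming a POSIX shell. `alphabet_sort_cmd` and its `preview`/`submit` argument are hypothetical names chosen for this illustration; every medarc_slurm flag is copied from the commands above, in the same order.

```shell
# Sketch: build the medarc_slurm command once; "preview" adds --dry-run.
# alphabet_sort_cmd is an illustrative helper name, not part of medarc_slurm.
alphabet_sort_cmd() {
    mode="${1:-preview}"   # default to preview so nothing is submitted by accident
    dry_run=""
    if [ "$mode" = "preview" ]; then
        dry_run="--dry-run"
    fi
    # $dry_run is deliberately unquoted so an empty value adds no argument.
    echo medarc_slurm rl --config examples/alphabet_sort/rl.toml \
        --output-dir output/examples/alphabet-sort \
        --train-gpus 2 \
        --infer-gpus 6 \
        --auto-auth \
        $dry_run \
        --wandb.project alphabet-sort --wandb.name alphabet-sort-4b-example
}

# Print the preview command; pass "submit" to print the real submission instead.
alphabet_sort_cmd preview
```

As written, the helper only prints the command. Removing the leading echo inside the function would run medarc_slurm directly.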

The base model gets ~0.26 average reward. After LoRA RL, expect ~0.8.