Adapted from the PRIME-RL alphabet_sort example to use medarc_slurm for single-node SLURM submission.
This example uses LoRA to train Qwen3-4B-Instruct-2507 to sort names alphabetically. The base model already understands the conversation format, so no SFT warmup is needed; training proceeds directly to multi-turn RL against the primeintellect/alphabet-sort environment.
Note: This example uses 8 GPUs on a single node: 2 for training (FSDP) and 6 for inference (data parallel).
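The 2 + 6 split above can be sanity-checked with a small sketch like the following. The helper name `split_gpus` and the device ordering (inference GPUs first, then training GPUs) are assumptions made for illustration; how medarc_slurm actually assigns devices is not described here.

```python
def split_gpus(total: int, train: int, infer: int) -> tuple[list[int], list[int]]:
    """Partition GPU indices into (train_ids, infer_ids).

    Hypothetical helper: the ordering (inference first) is an assumption,
    not necessarily what medarc_slurm or PRIME-RL does.
    """
    if train < 1 or infer < 1:
        raise ValueError("need at least one training and one inference GPU")
    if train + infer > total:
        raise ValueError(f"requested {train + infer} GPUs but only {total} available")
    infer_ids = list(range(infer))                  # e.g. 0..5
    train_ids = list(range(infer, infer + train))   # e.g. 6..7
    return train_ids, infer_ids


train_ids, infer_ids = split_gpus(total=8, train=2, infer=6)
print("train:", train_ids)  # train: [6, 7]
print("infer:", infer_ids)  # infer: [0, 1, 2, 3, 4, 5]
```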
Note: medarc_train and medarc_slurm accept arbitrary PRIME-RL config overrides as CLI flags. In these examples, we use that passthrough to set wandb.project and wandb.name.
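To illustrate what a dotted override such as `--wandb.project alphabet-sort` means, here is a minimal sketch that folds `--a.b value` pairs into a nested config dict. The function `apply_overrides` is hypothetical and is not the medarc or PRIME-RL implementation; it also assumes every flag takes exactly one value, so boolean flags are not handled.

```python
from typing import Any


def apply_overrides(config: dict[str, Any], argv: list[str]) -> dict[str, Any]:
    """Fold `--a.b value` pairs into a nested dict (illustrative only)."""
    it = iter(argv)
    for flag in it:
        if not flag.startswith("--"):
            raise ValueError(f"expected a --flag, got {flag!r}")
        value = next(it, None)
        if value is None:
            raise ValueError(f"missing value for {flag}")
        # "--wandb.project" -> parents ["wandb"], leaf "project"
        *parents, leaf = flag[2:].split(".")
        node = config
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return config


overrides = apply_overrides(
    {},
    ["--wandb.project", "alphabet-sort", "--wandb.name", "alphabet-sort-4b-example"],
)
print(overrides)
# {'wandb': {'project': 'alphabet-sort', 'name': 'alphabet-sort-4b-example'}}
```

In this sketch the values stay as strings; a real config loader would typically also validate and convert them against the config schema.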
Install the bundled PRIME-RL environment packages (assuming you want flash attention 3 for Ampere, Hopper, and Lovelace GPUs):
uv sync --extra envs --extra fa3

Verify it's installed:
uv run python -c "import alphabet_sort"

Submit an 8-GPU RL job:
medarc_slurm rl --config examples/alphabet_sort/rl.toml \
--output-dir output/examples/alphabet-sort \
--train-gpus 2 \
--infer-gpus 6 \
--auto-auth \
--wandb.project alphabet-sort --wandb.name alphabet-sort-4b-example

Or preview without submitting:
medarc_slurm rl --config examples/alphabet_sort/rl.toml \
--output-dir output/examples/alphabet-sort \
--train-gpus 2 \
--infer-gpus 6 \
--auto-auth \
--dry-run \
--wandb.project alphabet-sort --wandb.name alphabet-sort-4b-example

The base model gets ~0.26 average reward. After LoRA RL, expect ~0.8.
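If a run fails early, one thing worth ruling out is a missing environment package. The one-line import check from the install step can also be written as a small script that prints a clearer message. This is a sketch only: the helper name `is_installed` is hypothetical, and the suggested fix simply repeats the install command from this example.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    name = "alphabet_sort"
    if is_installed(name):
        print(f"{name} is importable")
    else:
        print(f"{name} not found; try: uv sync --extra envs --extra fa3")
```

Run it inside the project environment (for example with `uv run python`) so that it checks the same interpreter the job will use.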