Merged
Commits (40)
3f20131
x
hao-aaron Apr 28, 2026
e39d8ea
Merge remote-tracking branch 'upstream/main' into multi-lora
hao-aaron Apr 28, 2026
68ed142
[docs] Add Multi-LoRA Megatron Tinker design doc (v1)
erictang000 May 4, 2026
c0c3a58
[multi-lora] Add AdapterStore for per-worker LoRA slot bookkeeping
erictang000 May 4, 2026
e923894
[multi-lora] Wire AdapterStore into MegatronPolicyWorkerBase
erictang000 May 4, 2026
46c1658
[multi-lora] Add ensure_active_adapter + model_id threading to dispatch
erictang000 May 4, 2026
90dc178
[multi-lora] Allow multiple LoRA policy adapters in SkyRLTrainBackend
erictang000 May 4, 2026
8bb9157
[multi-lora] Add GPU-gated multi-LoRA integration test for Megatron
erictang000 May 4, 2026
301059b
[multi-lora] Add two-client smoke runbook
erictang000 May 4, 2026
b712bca
[multi-lora] Fix _lora_signature_from to not read non-existent target…
erictang000 May 4, 2026
d4a0a04
x
erictang000 May 4, 2026
3c0239e
[multi-lora] Swap grad buffers along with params + optimizer state
erictang000 May 4, 2026
f5ba5c9
[multi-lora-rl] Wire model_id through the LoRA sync + sampling path
erictang000 May 4, 2026
40b1ae4
[multi-lora-rl] Tolerate non-JSON error bodies in load/unload_lora_ad…
erictang000 May 4, 2026
8cff746
[multi-lora-rl] Update design doc + RL two-client smoke runbook
erictang000 May 4, 2026
9d1392d
Merge remote-tracking branch 'origin/main' into multi_lora_rl
erictang000 May 4, 2026
a46b587
[multi-lora-rl] Allow mixed model_ids in a single sample() batch
erictang000 May 5, 2026
edefc84
[multi-lora-rl] Pass load_inplace=True to vLLM load_lora_adapter
erictang000 May 5, 2026
fa3bfbc
Revert "[multi-lora-rl] Pass load_inplace=True to vLLM load_lora_adap…
erictang000 May 5, 2026
cb38614
x
erictang000 May 5, 2026
e18e82b
[docs] Design doc for non-colocated sample routing via EXTERNAL path
erictang000 May 6, 2026
57a474a
[smoke logs] Snapshot rl_loop / sl_loop runs from manual smoke tests
erictang000 May 6, 2026
cb88e88
Merge remote-tracking branch 'origin/main' into multi_lora_rl
erictang000 May 7, 2026
d178ca0
[multi-lora-rl] Reset PR #1579 test files to upstream main version
erictang000 May 7, 2026
243d8b5
[multi-lora-rl] Drop smoke_logs/ from PR diff
erictang000 May 7, 2026
0532d1b
[multi-lora-rl] Remove internal-development docs from PR
erictang000 May 7, 2026
01baa37
Merge remote-tracking branch 'origin/main' into multi_lora_rl
erictang000 May 8, 2026
36b4751
[multi-lora-rl] Remove old test path (moved to tests/tinker/skyrl_tra…
erictang000 May 8, 2026
8c1062d
[multi-lora-rl] Add per-adapter sample tests
erictang000 May 8, 2026
f42db90
run lint
erictang000 May 8, 2026
468d231
[multi-lora-rl] Clean up stale references in test docstrings
erictang000 May 8, 2026
f5e68d8
[multi-lora-rl] Address review feedback: path safety + legacy lora_name
erictang000 May 8, 2026
b12182a
[multi-lora-rl] Move per-tenant subdir cleanup into MegatronPolicyWorker
erictang000 May 8, 2026
27a4ddc
clean up comments and nits
erictang000 May 8, 2026
e037176
add docs for multi-tenancy and update old docs mentioning limitations
erictang000 May 9, 2026
a620667
add chunked logprobs none and plumb through ability to set it to none
erictang000 May 9, 2026
8ce91ae
[multi-lora-rl] Final review polish
erictang000 May 9, 2026
e2d783d
x
erictang000 May 9, 2026
c51b931
fix cpu test
erictang000 May 9, 2026
d0e5d99
lint
erictang000 May 9, 2026
8 changes: 5 additions & 3 deletions docs/content/docs/tinker/architecture.mdx
@@ -122,9 +122,11 @@ The Tinker SDK sends a `sampling_session_seq_id` field when using the ephemeral

Persistent saves can be very expensive because they write full model weights to disk on every call. In RL training loops that sync weights every batch, ephemeral mode avoids this overhead entirely. In typical RL loops (e.g., tinker-cookbook's `rl_loop`), every iteration uses ephemeral mode before sampling, and persistent saves are reserved for periodic checkpointing.

### Single Model Constraint
### Multiple LoRA tenants

SkyRL currently supports only one copy of sampling model weights at a time. This differs from Thinking Machines' hosted service that supports arbitrarily many sampling clients attached to various sampling model weights. In SkyRL, after a weight sync, all subsequent `sample()` calls automatically use the updated weights.
On the Megatron backend, SkyRL supports multiple LoRA adapters trained and sampled concurrently against a single server. Each tenant's adapter weights and optimizer state live in pinned-CPU slots; the live GPU adapter is swapped on demand at the top of every per-model dispatch entry point (forward, forward_backward, optim_step, save_weights_for_sampler). On the inference side, vLLM serves each tenant's adapter by `model_id` after `save_weights_for_sampler` registers it via `load_lora_adapter`. See [Multi-tenancy](./multi_tenancy) for the design and operator contract.
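
For example, two clients can register and sample their own adapters against one server. A minimal sketch, assuming the Tinker SDK's documented `ServiceClient`, `create_lora_training_client`, and `save_weights_and_get_sampling_client` methods (exact signatures may differ):

```python
import tinker

# One server, two tenants: each create_model registers its own adapter slot.
service = tinker.ServiceClient(base_url="http://localhost:8000")
tenant_a = service.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=32)
tenant_b = service.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=32)

# ... train each tenant independently ...

# Each export registers that tenant's adapter with vLLM under its model_id,
# so subsequent sample() calls route to the right weights.
sampler_a = tenant_a.save_weights_and_get_sampling_client(name="tenant-a")
sampler_b = tenant_b.save_weights_and_get_sampling_client(name="tenant-b")
```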

Full-parameter fine-tuning and the FSDP backend remain single-tenant — calling `create_model` a second time on those paths returns an error.

## Checkpointing

@@ -163,4 +165,4 @@ Tinker represents training data as `Datum` objects with a `ModelInput` (containi
- **Shifts** tokens: Tinker pre-shifts inputs/targets, but SkyRL-Train shifts internally, so the backend appends the last target token to reconstruct full sequences
- Builds `attention_mask`, `loss_mask`, and `response_mask` tensors from token weights

There is currently a limitation that batch size must be divisible by the data parallelism size (number of GPUs). The engine layer handles batching multiple client requests together before passing them to the backend.
The engine layer also batches multiple client requests together before passing them to the backend.
19 changes: 18 additions & 1 deletion docs/content/docs/tinker/configuration.mdx
@@ -54,7 +54,24 @@ python -m tinker_cookbook.recipes.sl_loop ... lora_rank=32
python -m tinker_cookbook.recipes.sl_loop ... lora_rank=0
```

No server-side configuration is needed to switch between LoRA and full-parameter fine-tuning.
No server-side configuration is needed to switch between single-tenant LoRA and full-parameter fine-tuning.

### Multi-tenant LoRA

Hosting multiple LoRA tenants concurrently against one server *does* require server-side configuration on the Megatron backend. At minimum:

```json
{
"trainer.placement.colocate_all": false,
"trainer.policy.megatron_config.lora_config.merge_lora": false,
"trainer.policy.model.lora.max_loras": <max concurrent adapters in a single batch>,
"trainer.policy.model.lora.max_cpu_loras": <total adapter capacity>
}
```

`merge_lora: false` is required so vLLM serves each tenant's adapter by name (with `merge_lora: true` vLLM only sees the merged base and per-tenant sampling returns the wrong weights). `max_cpu_loras` must be sized to the peak number of concurrent tenants — there is no on-demand reload, and if vLLM evicts an adapter the next `sample()` against it 404s. All adapters on one server must share the same `(rank, alpha, target_modules)` signature; mismatched signatures are hard-rejected at `create_model`.
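
As an illustration of the signature rule, a hypothetical sketch of the gate applied at `create_model` (the class and names here are illustrative, not SkyRL's actual internals):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LoraSignature:
    rank: int
    alpha: float
    target_modules: tuple[str, ...]

class SignatureGate:
    """Captures the first tenant's signature and hard-rejects mismatches."""

    def __init__(self) -> None:
        self._signature: LoraSignature | None = None

    def check(self, requested: LoraSignature) -> None:
        if self._signature is None:
            self._signature = requested  # captured from the first create_model
        elif requested != self._signature:
            raise ValueError(
                f"LoRA signature mismatch: got {requested}, server holds {self._signature}"
            )
```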

See [Multi-tenancy](./multi_tenancy) for the full operator contract and SFT/RL quickstarts.

## Full Config Reference

16 changes: 3 additions & 13 deletions docs/content/docs/tinker/limitations.mdx
@@ -6,21 +6,11 @@ The Tinker integration is under active development. This page documents current

## Current Limitations

### Single Model
### Multi-tenant LoRA: Megatron only

Only one training model and one set of sampling weights can be loaded at a time. Calling `create_model` when a model already exists will return an error. After a weight sync, all subsequent `sample()` calls use the updated weights; there is no support for maintaining multiple sampling snapshots concurrently. To switch models, restart the server.
Multi-tenant LoRA training and sampling are supported on the **Megatron** backend with vLLM serving per-tenant adapters by name. See [Multi-tenancy](./multi_tenancy) for the operator contract and SL/RL quickstarts. **FSDP2** support is pending, and full-parameter fine-tuning remains single-tenant on both backends; calling `create_model` with `lora_rank=0` while another model exists returns an error.

### Single-tenant LoRA
Related to the above limitation, even when training with LoRA adaptors, the SkyRL-Train backend only supports one training model and one set of sampling weights. We plan to support training and sampling on multiple LoRA adaptors concurrently in the future.

### Vision Language Models

Vision language models (VLMs) are supported through the Tinker integration. We have validated the path end-to-end on [Qwen3-VL](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) — see the [Vision Language cookbook recipe](./cookbook#vision-language-vlm_classifier) for a runnable example. We welcome contributions that extend coverage to additional VLM families.


### Batch Size Constraint

The batch size must be evenly divisible by the data parallelism size (number of GPUs). For example, with 4 GPUs you cannot use a batch size of 5.
All adapters registered against one server must share the same `(rank, alpha, target_modules)` signature; mismatched signatures are hard-rejected at `create_model`.

### No Prompt Logprobs

1 change: 1 addition & 0 deletions docs/content/docs/tinker/meta.json
@@ -4,6 +4,7 @@
"overview",
"quickstart",
"architecture",
"multi_tenancy",
"cookbook",
"configuration",
"limitations"
168 changes: 168 additions & 0 deletions docs/content/docs/tinker/multi_tenancy.mdx
@@ -0,0 +1,168 @@
---
title: "Multi-tenancy"
---

A single SkyRL Tinker server can host multiple LoRA adapters concurrently against a shared base model. Each adapter is its own Tinker `model_id` and its own client session — multiple `tinker-cookbook` recipes can train and sample in parallel without spinning up a separate server per workload.

This page describes the design, the operator contract, and quickstarts for SFT (`sl_loop.py`) and RL (`rl_loop.py`).

<Callout type="info">
Multi-tenancy is wired on the **Megatron** backend with vLLM serving per-tenant adapters. FSDP2 multi-tenancy and multi-tenant full-parameter fine-tuning are not yet supported — see [Limitations](./limitations).
</Callout>

## How it works

The base model is loaded once on the policy workers and shared across all tenants. Each tenant gets a per-adapter slot in pinned CPU memory holding its LoRA params, optimizer state, and step count; the live GPU adapter is swapped on demand at the top of every per-model dispatch entry point. Clients never reason about which adapter is currently resident — they just call the Tinker API with their `model_id`.
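
A minimal sketch of the slot bookkeeping, assuming pinned-CPU tensors swapped with host/device `copy_()` (the class and method names are illustrative; the real implementation may differ):

```python
import torch

class AdapterStore:
    """Illustrative per-worker LoRA slot bookkeeping, not SkyRL's exact code."""

    def __init__(self) -> None:
        self._slots: dict[str, dict[str, torch.Tensor]] = {}  # model_id -> pinned-CPU tensors
        self._active: str | None = None  # model_id currently resident on the GPU

    def register(self, model_id: str, init_params: dict[str, torch.Tensor]) -> None:
        # Called once per create_model: snapshot the freshly initialized adapter to pinned CPU.
        self._slots[model_id] = {
            name: torch.empty(t.shape, dtype=t.dtype, pin_memory=True).copy_(t)
            for name, t in init_params.items()
        }

    def ensure_active(self, model_id: str, gpu_params: dict[str, torch.Tensor]) -> None:
        # Called at the top of every per-model dispatch entry point.
        if self._active == model_id:
            return
        if self._active is not None:
            # Stash the live adapter back into its tenant's pinned-CPU slot.
            for name, t in gpu_params.items():
                self._slots[self._active][name].copy_(t, non_blocking=True)
        # Restore the requested tenant's weights with a host->device copy.
        for name, t in gpu_params.items():
            t.copy_(self._slots[model_id][name], non_blocking=True)
        torch.cuda.synchronize()
        self._active = model_id
```

The real worker also swaps optimizer state and grad buffers alongside the params, as described above.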

What this means for you:

- **GPU memory is bounded** by the base model plus a few small LoRA buffers, regardless of tenant count. The growth from adding a tenant is in *CPU* memory (one slot per adapter, on the order of `~3× lora_param_bytes_per_DP_shard` — tens of MB for Qwen3-0.6B at rank 32); a worked estimate follows this list.
- **Swap cost is small** relative to a forward pass — a host→device `tensor.copy_()` plus a DP-group barrier. You should not see noticeable per-call latency from tenant churn.
- **Per-tenant sampling on vLLM** is by `model_id`. The worker exports each tenant's adapter into `lora_sync_path/<model_id>/` on `save_weights_for_sampler` and registers it on vLLM via `load_lora_adapter`. Sampling uses `model=<model_id>` and vLLM routes to the right adapter.
- **Capacity is bounded by `max_cpu_loras`**, vLLM's CPU LRU cache. If you have more concurrent tenants than slots, vLLM evicts one and the next `sample()` against it 404s — there is no on-demand reload. Size for your peak.
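
As a back-of-envelope check on the CPU numbers above, the rule of thumb worked through with assumed sizes (the adapter parameter count and DP degree are illustrative, not measured):

```python
# Assumed adapter size for Qwen3-0.6B at rank 32; the real count depends on target_modules.
lora_params = 20_000_000
dp_size = 4  # assumed data-parallel degree
shard_params = lora_params // dp_size

# A slot holds the params plus the fp32 main copy plus Adam moments: ~3x fp32 bytes.
slot_bytes = 3 * shard_params * 4
print(f"~{slot_bytes / 2**20:.0f} MiB per adapter slot on each DP rank")  # ~57 MiB
```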

## Operator contract

Required `--backend-config` keys to run multi-tenant LoRA on Megatron:

```json
{
"trainer.placement.colocate_all": false,
"trainer.policy.megatron_config.lora_config.merge_lora": false,
"trainer.policy.model.lora.max_loras": <max concurrent adapters in a single batch>,
"trainer.policy.model.lora.max_cpu_loras": <total adapter capacity>
}
```

All adapters must share the same `(rank, alpha, target_modules)` signature. Mismatches are hard-rejected at `create_model` with a `LoRA signature mismatch …` error.

The first `create_model` on a fresh server triggers the policy build and bootstraps the per-tenant adapter slot infrastructure; subsequent `create_model` calls register additional adapter slots and complete in milliseconds. When the *last* registered model is unloaded the server tears down the Ray runtime via `ray.shutdown()`; the next `create_model` rebuilds it.

## Quickstart — Two SL clients

Run two `tinker-cookbook` `sl_loop` clients in parallel against one Megatron-backed Tinker server.

### 1. Start the server

```bash
uv run --extra tinker --extra megatron -m skyrl.tinker.api \
--host 0.0.0.0 \
--port 8000 \
--base-model Qwen/Qwen3-0.6B \
--backend megatron \
--backend-config '{
"strategy": "megatron",
"trainer.placement.policy_num_gpus_per_node": 1,
"trainer.placement.policy_num_nodes": 1,
"trainer.placement.colocate_all": false,
"trainer.policy.megatron_config.tensor_model_parallel_size": 1,
"trainer.policy.megatron_config.pipeline_model_parallel_size": 1,
"trainer.policy.megatron_config.lora_config.merge_lora": false,
"trainer.policy.model.lora.max_loras": 2,
"trainer.policy.model.lora.max_cpu_loras": 2,
"trainer.logprobs_chunk_size": null
}'
```

Wait for `init policy model done` after the first client connects.

### 2. Run two `sl_loop` clients

In two separate terminals (in the tinker-cookbook repo):

```bash
# Terminal 2 — client A
TINKER_API_KEY=tml-dummy uv run --with tinker --with tinker-cookbook --with datasets \
python -m tinker_cookbook.recipes.sl_loop \
base_url=http://localhost:8000 \
model_name="Qwen/Qwen3-0.6B" \
train_on_what=LAST_ASSISTANT_MESSAGE \
lora_rank=32 \
log_path=/tmp/sl_loop_a.log
```

```bash
# Terminal 3 — client B
TINKER_API_KEY=tml-dummy uv run --with tinker --with tinker-cookbook --with datasets \
python -m tinker_cookbook.recipes.sl_loop \
base_url=http://localhost:8000 \
model_name="Qwen/Qwen3-0.6B" \
train_on_what=LAST_ASSISTANT_MESSAGE \
lora_rank=32 \
log_path=/tmp/sl_loop_b.log
```

Stagger the launches by ~20s so the second client doesn't race the policy build. Both clients **must** use the same `lora_rank` and `model_name`.

You should see both clients converge on their respective tasks, with NLL trending independently downward in both `sl_loop_a.log` and `sl_loop_b.log`.
GPU memory will stay bounded even as the second client connects (single base model + N LoRA slots).

## Quickstart — Two RL clients

Two `rl_loop` clients each train and sample independently against one server. RL exercises the per-tenant `save_weights_for_sampler` + `sample(model=<model_id>)` path.
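
Sketched per-iteration ordering for one tenant, assuming the Tinker SDK's documented training-client methods (loss-function and parameter names follow the public tinker docs; treat exact signatures as assumptions):

```python
from tinker import types

# `training_client`, `batch`, `prompt`, and `num_steps` come from the surrounding recipe code.
for step in range(num_steps):
    # 1) Train this tenant's adapter.
    training_client.forward_backward(batch, loss_fn="importance_sampling").result()
    training_client.optim_step(types.AdamParams(learning_rate=1e-5)).result()

    # 2) Ephemeral export: registers the tenant's updated adapter with vLLM
    #    under its model_id via load_lora_adapter.
    sampler = training_client.save_weights_and_get_sampling_client(name=f"step-{step}")

    # 3) Sample: vLLM routes by model_id, so concurrent tenants don't collide.
    rollouts = sampler.sample(
        prompt=prompt,
        sampling_params=types.SamplingParams(max_tokens=128, temperature=1.0),
        num_samples=8,
    ).result()
```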

### 1. Start the server

```bash
uv run --extra tinker --extra megatron -m skyrl.tinker.api \
--host 0.0.0.0 \
--port 8000 \
--base-model Qwen/Qwen3-0.6B \
--backend megatron \
--backend-config '{
"strategy": "megatron",
"trainer.placement.policy_num_gpus_per_node": 4,
"trainer.placement.policy_num_nodes": 1,
"trainer.placement.colocate_all": false,
"trainer.policy.megatron_config.tensor_model_parallel_size": 1,
"trainer.policy.megatron_config.pipeline_model_parallel_size": 1,
"trainer.policy.megatron_config.lora_config.merge_lora": false,
"trainer.micro_train_batch_size_per_gpu": 64,
"trainer.micro_forward_batch_size_per_gpu": 64,
"generator.inference_engine.num_engines": 1,
"generator.inference_engine.tensor_parallel_size": 1,
"trainer.policy.model.lora.max_loras": 2,
"trainer.policy.model.lora.max_cpu_loras": 2,
"trainer.logprobs_chunk_size": null,
}'
```

Critical knobs vs the SL quickstart:
- `colocate_all: false` is required: for sampling and training to progress independently across client calls, inference engines and trainer workers must be placed on different GPUs.
- `merge_lora: false` is required. With `merge_lora: true`, vLLM serves the merged base model and `sample(model=<adapter>)` returns the wrong tenant's weights.
- `max_loras` ≥ number of adapters in a single batch (typically equal to the client count).
- `max_cpu_loras` must be ≥ the number of adapters you expect to serve concurrently. There is no on-demand reload — if vLLM evicts an adapter, its next `sample()` 404s.

### 2. Run two `rl_loop` clients

```bash
# Terminal 2 — client A
TINKER_API_KEY=tml-dummy uv run --with tinker --with tinker-cookbook --with datasets --with torch \
python -m tinker_cookbook.recipes.rl_loop \
base_url=http://localhost:8000 \
model_name="Qwen/Qwen3-0.6B" \
lora_rank=32 \
log_path=/tmp/rl_loop_a.log
```

```bash
# Terminal 3 — client B
TINKER_API_KEY=tml-dummy uv run --with tinker --with tinker-cookbook --with datasets --with torch \
python -m tinker_cookbook.recipes.rl_loop \
base_url=http://localhost:8000 \
model_name="Qwen/Qwen3-0.6B" \
lora_rank=32 \
log_path=/tmp/rl_loop_b.log
```

Stagger by ~20s. Both clients **must** use the same `lora_rank` and `model_name`.

You should see both clients' rewards trend upward independently in `rl_loop_a.log` and `rl_loop_b.log`, vLLM logs showing two distinct adapter names registered with `sample` requests routed to each, and GPU memory staying bounded (single base model, two LoRA adapters, CPU LRU holds the same two).

## Troubleshooting

- **`LoRA signature mismatch`** — clients passed different `(rank, alpha, target_modules)`. All adapters on one server share a signature, captured from the first `create_model`.
- **`sample()` 404 on `lora_name=…`** — either `save_sampler_checkpoint` wasn't called for that `model_id` before sampling, or `max_cpu_loras` is too low and vLLM evicted the adapter. Check the vLLM server log.
- **Server hangs on the second `create_model`** — the first policy build hasn't finished. Wait for `init policy model done` before starting subsequent clients.
- **CPU OOM on the Nth client** — each adapter slot holds LoRA params + fp32 main + Adam moments, roughly `~3× lora_param_bytes_per_DP_shard`. For Qwen3-0.6B at rank 32 this is on the order of tens of MB per slot; for larger models scale accordingly. Reduce concurrent adapters or move to a host with more RAM.
- **Sample returns the wrong tenant's output** — confirm `merge_lora: false` is set on the Megatron config; with merge enabled vLLM only sees the merged base.
8 changes: 5 additions & 3 deletions docs/content/docs/tinker/overview.mdx
@@ -40,15 +40,17 @@ SkyRL brings the Tinker API to your own hardware. By utilizing the fully Tinker
| FSDP2 strategy | Supported |
| Megatron strategy | Supported |
| Vision models | Supported |
| Multi-tenant LoRA | Not yet supported |
| Multi-model sampling | Not yet supported |
| Multi-model training | Not yet supported |
| Multi-tenant LoRA training (Megatron + vLLM) | Supported — see [Multi-tenancy](./multi_tenancy) |
| Multi-tenant LoRA sampling (Megatron + vLLM) | Supported — see [Multi-tenancy](./multi_tenancy) |
| Multi-tenant LoRA on FSDP2 | Not yet supported |
| Multi-tenant full-parameter fine-tuning | Not yet supported |

For more details, see the [Limitations & Roadmap](./limitations) page.

## Next Steps

- [Quickstart](./quickstart) - Start a SkyRL Tinker server and run your first training script
- [Architecture](./architecture) - Understand how SkyRL implements the Tinker API
- [Multi-tenancy](./multi_tenancy) - Run multiple LoRA tenants concurrently against one server
- [Cookbook Scripts](./cookbook) - Run the official tinker-cookbook recipes on SkyRL
- [Limitations & Roadmap](./limitations) - Known limitations and future plans
28 changes: 20 additions & 8 deletions skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py
@@ -308,10 +308,16 @@ async def init_weight_update_communicator(self, init_info: "WeightSyncInitInfo")
args=(pickled_init_info,),
)

async def _load_lora_from_disk(self, lora_path: str):
"""Load LoRA adapters from disk using vLLM's native add_lora method."""
async def _load_lora_from_disk(self, lora_path: str, lora_name: str = ""):
"""Load LoRA adapters from disk using vLLM's native add_lora method.

When ``lora_name`` is empty (legacy single-tenant), a numeric name is
generated. Multi-tenant callers pass ``lora_name`` so subsequent
``model=<lora_name>`` sampling routes to the right adapter.
"""
lora_id = int(time.time_ns() % 0x7FFFFFFF)
lora_request = LoRARequest(lora_name=f"{lora_id}", lora_int_id=lora_id, lora_path=lora_path)
name = lora_name or f"{lora_id}"
lora_request = LoRARequest(lora_name=name, lora_int_id=lora_id, lora_path=lora_path)
result = self.llm.llm_engine.add_lora(lora_request)
return result

@@ -320,7 +326,7 @@ async def update_named_weights(self, request: WeightUpdateRequest):

# Handle LoRA disk loading request
if isinstance(request, LoraLoadRequest):
return await self._load_lora_from_disk(request.lora_path)
return await self._load_lora_from_disk(request.lora_path, lora_name=request.lora_name)

if not len(request):
raise ValueError("Weight update request must not be empty")
@@ -453,10 +459,16 @@ def _create_ray_prometheus_stat_loggers(self):
)
return None

async def _load_lora_from_disk(self, lora_path: str):
"""Load LoRA adapters from disk using vLLM's native add_lora method."""
async def _load_lora_from_disk(self, lora_path: str, lora_name: str = ""):
"""Load LoRA adapters from disk using vLLM's native add_lora method.

When ``lora_name`` is empty (legacy single-tenant), a numeric name is
generated. Multi-tenant callers pass ``lora_name`` so subsequent
``model=<lora_name>`` sampling routes to the right adapter.
"""
lora_id = int(time.time_ns() % 0x7FFFFFFF)
lora_request = LoRARequest(lora_name=f"{lora_id}", lora_int_id=lora_id, lora_path=lora_path)
name = lora_name or f"{lora_id}"
lora_request = LoRARequest(lora_name=name, lora_int_id=lora_id, lora_path=lora_path)
result = await self.llm.add_lora(lora_request)
return result

@@ -539,7 +551,7 @@ async def update_named_weights(self, request: WeightUpdateRequest):

# Check for LoRA disk loading request
if isinstance(request, LoraLoadRequest):
return await self._load_lora_from_disk(request.lora_path)
return await self._load_lora_from_disk(request.lora_path, lora_name=request.lora_name)

if not len(request):
raise ValueError("Weight update request must not be empty")