---
title: "Multi-tenancy"
---

A single SkyRL Tinker server can host multiple LoRA adapters concurrently against a shared base model. Each adapter is its own Tinker `model_id` and its own client session — multiple `tinker-cookbook` recipes can train and sample in parallel without spinning up a separate server per workload.

This page describes the design, the operator contract, and quickstarts for SFT (`sl_loop.py`) and RL (`rl_loop.py`).

<Callout type="info">
Multi-tenancy is wired on the **Megatron** backend with vLLM serving per-tenant adapters. FSDP2 multi-tenancy and multi-tenant full-parameter fine-tuning are not yet supported — see [Limitations](./limitations).
</Callout>

## How it works

The base model is loaded once on the policy workers and shared across all tenants. Each tenant gets a per-adapter slot in pinned CPU memory holding its LoRA params, optimizer state, and step count; the live GPU adapter is swapped on demand at the top of every per-model dispatch entry point. Clients never reason about which adapter is currently resident — they just call the Tinker API with their `model_id` (sketched after the list below).

What this means for you:

- **GPU memory is bounded** by the base model plus a few small LoRA buffers, regardless of tenant count. The growth from adding a tenant is in *CPU* memory (one slot per adapter, on the order of `~3× lora_param_bytes_per_DP_shard` — tens of MB for Qwen3-0.6B at rank 32).
- **Swap cost is small** relative to a forward pass — a host→device `tensor.copy_()` plus a DP-group barrier. You should not see noticeable per-call latency from tenant churn.
- **Per-tenant sampling on vLLM** is by `model_id`. The worker exports each tenant's adapter into `lora_sync_path/<model_id>/` on `save_weights_for_sampler` and registers it on vLLM via `load_lora_adapter`. Sampling uses `model=<model_id>` and vLLM routes to the right adapter.
- **Capacity is bounded by `max_cpu_loras`**, vLLM's CPU LRU cache. If you have more concurrent tenants than slots, vLLM evicts one and the next `sample()` against it 404s — there is no on-demand reload. Size for your peak.
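
A minimal sketch of that client-side view, assuming the `tinker` Python SDK's `ServiceClient` and `create_lora_training_client` entry points (method names may differ across SDK versions):

```python
import tinker

# Assumes TINKER_API_KEY (e.g. tml-dummy) is set in the environment and the
# server from the quickstarts below is listening on localhost:8000.
service = tinker.ServiceClient(base_url="http://localhost:8000")

# Each call registers its own adapter slot (its own model_id) against the
# shared base model; both must use the same LoRA signature.
tenant_a = service.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=32)
tenant_b = service.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=32)

# Neither client ever names "the" resident GPU adapter -- the server swaps
# adapters per dispatched call based on each client's model_id.
```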

## Operator contract

Required `--backend-config` keys to run multi-tenant LoRA on Megatron:

```json
{
  "trainer.placement.colocate_all": false,
  "trainer.policy.megatron_config.lora_config.merge_lora": false,
  "trainer.policy.model.lora.max_loras": <max concurrent adapters in a single batch>,
  "trainer.policy.model.lora.max_cpu_loras": <total adapter capacity>
}
```

All adapters must share the same `(rank, alpha, target_modules)` signature. Mismatches are hard-rejected at `create_model` with a `LoRA signature mismatch …` error.
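For example, the second tenant below would be rejected (a hedged sketch under the same SDK assumptions as above; the mismatched `rank=16` is the only difference):

```python
import tinker

service = tinker.ServiceClient(base_url="http://localhost:8000")

# The first create_model captures the server-wide signature (rank 32 here).
first = service.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=32)

try:
    # Different rank => different signature => hard-rejected by the server.
    second = service.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=16)
except Exception as err:
    print(err)  # expect the "LoRA signature mismatch ..." error
```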

The first `create_model` on a fresh server triggers the policy build and bootstraps the per-tenant adapter slot infrastructure; subsequent `create_model` calls register additional adapter slots and complete in milliseconds. When the *last* registered model is unloaded the server tears down the Ray runtime via `ray.shutdown()`; the next `create_model` rebuilds it.

## Quickstart — Two SL clients

Run two `tinker-cookbook` `sl_loop` clients in parallel against one Megatron-backed Tinker server.

### 1. Start the server

```bash
uv run --extra tinker --extra megatron -m skyrl.tinker.api \
  --host 0.0.0.0 \
  --port 8000 \
  --base-model Qwen/Qwen3-0.6B \
  --backend megatron \
  --backend-config '{
    "strategy": "megatron",
    "trainer.placement.policy_num_gpus_per_node": 1,
    "trainer.placement.policy_num_nodes": 1,
    "trainer.placement.colocate_all": false,
    "trainer.policy.megatron_config.tensor_model_parallel_size": 1,
    "trainer.policy.megatron_config.pipeline_model_parallel_size": 1,
    "trainer.policy.megatron_config.lora_config.merge_lora": false,
    "trainer.policy.model.lora.max_loras": 2,
    "trainer.policy.model.lora.max_cpu_loras": 2,
    "trainer.logprobs_chunk_size": null
  }'
```

Wait for `init policy model done` after the first client connects.

### 2. Run two `sl_loop` clients

In two separate terminals (in the tinker-cookbook repo):

```bash
# Terminal 2 — client A
TINKER_API_KEY=tml-dummy uv run --with tinker --with tinker-cookbook --with datasets \
  python -m tinker_cookbook.recipes.sl_loop \
  base_url=http://localhost:8000 \
  model_name="Qwen/Qwen3-0.6B" \
  train_on_what=LAST_ASSISTANT_MESSAGE \
  lora_rank=32 \
  log_path=/tmp/sl_loop_a.log
```

```bash
# Terminal 3 — client B
TINKER_API_KEY=tml-dummy uv run --with tinker --with tinker-cookbook --with datasets \
  python -m tinker_cookbook.recipes.sl_loop \
  base_url=http://localhost:8000 \
  model_name="Qwen/Qwen3-0.6B" \
  train_on_what=LAST_ASSISTANT_MESSAGE \
  lora_rank=32 \
  log_path=/tmp/sl_loop_b.log
```

Stagger the launches by ~20 s so the second client doesn't race the policy build. Both clients **must** use the same `lora_rank` and `model_name`.

You should see both clients converge on their respective tasks, with NLL trending independently downward in both `sl_loop_a.log` and `sl_loop_b.log`.
GPU memory will stay bounded even as the second client connects (single base model + N LoRA slots).

## Quickstart — Two RL clients

Two `rl_loop` clients each train and sample independently against one server. RL exercises the per-tenant `save_weights_for_sampler` + `sample(model=<model_id>)` path, sketched below.
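Schematically, one iteration of that path looks like this. A hedged sketch against the Tinker SDK shape; `save_weights_and_get_sampling_client` and the step name are assumptions, and the recipes drive this loop for you:

```python
import tinker

service = tinker.ServiceClient(base_url="http://localhost:8000")
trainer = service.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=32)

# 1. Export this tenant's adapter for sampling. Server-side this writes
#    lora_sync_path/<model_id>/ and registers the adapter with vLLM.
sampler = trainer.save_weights_and_get_sampling_client(name="step-0")

# 2. sampler.sample(...) now routes by model_id, so two clients sampling
#    concurrently each hit their own adapter.
# 3. Score the rollouts, then forward_backward + optim_step update this
#    tenant's adapter slot only.
```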

### 1. Start the server

```bash
uv run --extra tinker --extra megatron -m skyrl.tinker.api \
  --host 0.0.0.0 \
  --port 8000 \
  --base-model Qwen/Qwen3-0.6B \
  --backend megatron \
  --backend-config '{
    "strategy": "megatron",
    "trainer.placement.policy_num_gpus_per_node": 4,
    "trainer.placement.policy_num_nodes": 1,
    "trainer.placement.colocate_all": false,
    "trainer.policy.megatron_config.tensor_model_parallel_size": 1,
    "trainer.policy.megatron_config.pipeline_model_parallel_size": 1,
    "trainer.policy.megatron_config.lora_config.merge_lora": false,
    "trainer.micro_train_batch_size_per_gpu": 64,
    "trainer.micro_forward_batch_size_per_gpu": 64,
    "generator.inference_engine.num_engines": 1,
    "generator.inference_engine.tensor_parallel_size": 1,
    "trainer.policy.model.lora.max_loras": 2,
    "trainer.policy.model.lora.max_cpu_loras": 2,
    "trainer.logprobs_chunk_size": null
  }'
```

Critical knobs vs the SL quickstart:
- `colocate_all: false` is required. Sampling and training can only progress independently across client calls if the inference engines and trainer workers are placed on different GPUs.
- `merge_lora: false` is required. With `merge_lora: true`, vLLM serves the merged base model and `sample(model=<adapter>)` returns the wrong tenant's weights.
- `max_loras` ≥ number of adapters in a single batch (typically equal to the client count).
- `max_cpu_loras` must be ≥ the number of adapters you expect to serve concurrently. There is no on-demand reload — if vLLM evicts an adapter, its next `sample()` 404s.

### 2. Run two `rl_loop` clients

```bash
# Terminal 2 — client A
TINKER_API_KEY=tml-dummy uv run --with tinker --with tinker-cookbook --with datasets --with torch \
  python -m tinker_cookbook.recipes.rl_loop \
  base_url=http://localhost:8000 \
  model_name="Qwen/Qwen3-0.6B" \
  lora_rank=32 \
  log_path=/tmp/rl_loop_a.log
```

```bash
# Terminal 3 — client B
TINKER_API_KEY=tml-dummy uv run --with tinker --with tinker-cookbook --with datasets --with torch \
  python -m tinker_cookbook.recipes.rl_loop \
  base_url=http://localhost:8000 \
  model_name="Qwen/Qwen3-0.6B" \
  lora_rank=32 \
  log_path=/tmp/rl_loop_b.log
```

Stagger by ~20 s. Both clients **must** use the same `lora_rank` and `model_name`.

You should see both clients' rewards trend upward independently in `rl_loop_a.log` and `rl_loop_b.log`, vLLM logs showing two distinct adapter names registered with `sample` requests routed to each, and GPU memory staying bounded (single base model, two LoRA adapters, with the CPU LRU holding the same two).

## Troubleshooting

- **`LoRA signature mismatch`** — clients passed different `(rank, alpha, target_modules)`. All adapters on one server share a signature, captured from the first `create_model`.
- **`sample()` 404 on `lora_name=…`** — either `save_sampler_checkpoint` wasn't called for that `model_id` before sampling, or `max_cpu_loras` is too low and vLLM evicted the adapter. Check the vLLM server log.
- **Server hangs on the second `create_model`** — the first policy build hasn't finished. Wait for `init policy model done` before starting subsequent clients.
- **CPU OOM on the Nth client** — each adapter slot holds LoRA params + fp32 main + Adam moments, roughly `~3× lora_param_bytes_per_DP_shard`. For Qwen3-0.6B at rank 32 this is on the order of tens of MB per slot; for larger models scale accordingly. Reduce concurrent adapters or move to a host with more RAM (see the sizing sketch after this list).
- **Sample returns the wrong tenant's output** — confirm `merge_lora: false` is set on the Megatron config; with merge enabled vLLM only sees the merged base.
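
To budget host RAM before adding tenants, here is a back-of-envelope calculator. The 3× multiplier is the slot contents listed above; the parameter count and fp32 width are placeholder assumptions, so substitute your adapter's real numbers:

```python
def adapter_slot_bytes(lora_params_per_dp_shard: int, bytes_per_param: int = 4) -> int:
    """LoRA params + fp32 master copy + Adam moments ~= 3x the shard's param bytes."""
    return 3 * lora_params_per_dp_shard * bytes_per_param

# Hypothetical adapter with 5M params on this DP shard, 8 concurrent tenants:
n_tenants = 8
per_slot = adapter_slot_bytes(lora_params_per_dp_shard=5_000_000)
print(f"per slot: {per_slot / 1e6:.0f} MB, "
      f"{n_tenants} tenants: {n_tenants * per_slot / 1e6:.0f} MB pinned CPU memory")
```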