---
title: "Multi-tenancy"
---

A single SkyRL Tinker server can host multiple LoRA adapters concurrently against a shared base model. Each adapter is its own Tinker `model_id` and its own client session — multiple `tinker-cookbook` recipes can train and sample in parallel without spinning up a separate server per workload.

This page describes the design, the operator contract, and quickstarts for SFT (`sl_loop.py`) and RL (`rl_loop.py`).

<Callout type="info">
Multi-tenancy is wired on the **Megatron** backend with vLLM serving per-tenant adapters. FSDP2 multi-tenancy and multi-tenant full-parameter fine-tuning are not yet supported — see [Limitations](./limitations).
</Callout>

## How it works

The base model is loaded once on the policy workers and shared across all tenants. Each tenant gets a per-adapter slot in pinned CPU memory holding its LoRA params, optimizer state, and step count; the live GPU adapter is swapped on demand at the top of every per-model dispatch entry point. Clients never reason about which adapter is currently resident — they just call the Tinker API with their `model_id` (sketched after the list below).

What this means for you:

- **GPU memory is bounded** by the base model plus a few small LoRA buffers, regardless of tenant count. The growth from adding a tenant is in *CPU* memory (one slot per adapter, on the order of `~3× lora_param_bytes_per_DP_shard` — tens of MB for Qwen3-0.6B at rank 32).
- **Swap cost is small** relative to a forward pass — a host→device `tensor.copy_()` plus a DP-group barrier. You should not see noticeable per-call latency from tenant churn.
- **Per-tenant sampling on vLLM** is by `model_id`. The worker exports each tenant's adapter into `lora_sync_path/<model_id>/` on `save_weights_for_sampler` and registers it on vLLM via `load_lora_adapter`. Sampling uses `model=<model_id>` and vLLM routes to the right adapter.
- **Capacity is bounded by `max_cpu_loras`**, vLLM's CPU LRU cache. If you have more concurrent tenants than slots, vLLM evicts one and the next `sample()` against it 404s — there is no on-demand reload. Size for your peak.
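
A minimal sketch of that client-side view, assuming the `tinker` Python SDK's `ServiceClient` and `create_lora_training_client` entry points (method names may differ across SDK versions):

```python
import tinker

# Assumes TINKER_API_KEY (e.g. tml-dummy) is set in the environment and the
# server from the quickstarts below is listening on localhost:8000.
service = tinker.ServiceClient(base_url="http://localhost:8000")

# Each call registers its own adapter slot (its own model_id) against the
# shared base model; both must use the same LoRA signature.
tenant_a = service.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=32)
tenant_b = service.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=32)

# Neither client ever names "the" resident GPU adapter -- the server swaps
# adapters per dispatched call based on each client's model_id.
```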

## Operator contract

Required `--backend-config` keys to run multi-tenant LoRA on Megatron:

```json
{
  "trainer.placement.colocate_all": false,
  "trainer.policy.megatron_config.lora_config.merge_lora": false,
  "trainer.policy.model.lora.max_loras": <max concurrent adapters in a single batch>,
  "trainer.policy.model.lora.max_cpu_loras": <total adapter capacity>
}
```

All adapters must share the same `(rank, alpha, target_modules)` signature. Mismatches are hard-rejected at `create_model` with a `LoRA signature mismatch …` error.
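For example, the second tenant below would be rejected (a hedged sketch under the same SDK assumptions as above; the mismatched `rank=16` is the only difference):

```python
import tinker

service = tinker.ServiceClient(base_url="http://localhost:8000")

# The first create_model captures the server-wide signature (rank 32 here).
first = service.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=32)

try:
    # Different rank => different signature => hard-rejected by the server.
    second = service.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=16)
except Exception as err:
    print(err)  # expect the "LoRA signature mismatch ..." error
```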

The first `create_model` on a fresh server triggers the policy build and bootstraps the per-tenant adapter slot infrastructure; subsequent `create_model` calls register additional adapter slots and complete in milliseconds. When the *last* registered model is unloaded the server tears down the Ray runtime via `ray.shutdown()`; the next `create_model` rebuilds it.

## Quickstart — Two SL clients

Run two `tinker-cookbook` `sl_loop` clients in parallel against one Megatron-backed Tinker server.

### 1. Start the server

```bash
uv run --extra tinker --extra megatron -m skyrl.tinker.api \
  --host 0.0.0.0 \
  --port 8000 \
  --base-model Qwen/Qwen3-0.6B \
  --backend megatron \
  --backend-config '{
    "strategy": "megatron",
    "trainer.placement.policy_num_gpus_per_node": 1,
    "trainer.placement.policy_num_nodes": 1,
    "trainer.placement.colocate_all": false,
    "trainer.policy.megatron_config.tensor_model_parallel_size": 1,
    "trainer.policy.megatron_config.pipeline_model_parallel_size": 1,
    "trainer.policy.megatron_config.lora_config.merge_lora": false,
    "trainer.policy.model.lora.max_loras": 2,
    "trainer.policy.model.lora.max_cpu_loras": 2,
    "trainer.logprobs_chunk_size": null
  }'
```

Wait for `init policy model done` after the first client connects.

### 2. Run two `sl_loop` clients

In two separate terminals (in the tinker-cookbook repo):

```bash
# Terminal 2 — client A
TINKER_API_KEY=tml-dummy uv run --with tinker --with tinker-cookbook --with datasets \
  python -m tinker_cookbook.recipes.sl_loop \
  base_url=http://localhost:8000 \
  model_name="Qwen/Qwen3-0.6B" \
  train_on_what=LAST_ASSISTANT_MESSAGE \
  lora_rank=32 \
  log_path=/tmp/sl_loop_a.log
```

```bash
# Terminal 3 — client B
TINKER_API_KEY=tml-dummy uv run --with tinker --with tinker-cookbook --with datasets \
  python -m tinker_cookbook.recipes.sl_loop \
  base_url=http://localhost:8000 \
  model_name="Qwen/Qwen3-0.6B" \
  train_on_what=LAST_ASSISTANT_MESSAGE \
  lora_rank=32 \
  log_path=/tmp/sl_loop_b.log
```

Stagger the launches by ~20 s so the second client doesn't race the policy build. Both clients **must** use the same `lora_rank` and `model_name`.

You should see both clients converge on their respective tasks, with NLL trending independently downward in both `sl_loop_a.log` and `sl_loop_b.log`.
GPU memory will stay bounded even as the second client connects (single base model + N LoRA slots).

## Quickstart — Two RL clients

Two `rl_loop` clients each train and sample independently against one server. RL exercises the per-tenant `save_weights_for_sampler` + `sample(model=<model_id>)` path, sketched below.
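Schematically, one iteration of that path looks like this. A hedged sketch against the Tinker SDK shape; `save_weights_and_get_sampling_client` and the step name are assumptions, and the recipes drive this loop for you:

```python
import tinker

service = tinker.ServiceClient(base_url="http://localhost:8000")
trainer = service.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=32)

# 1. Export this tenant's adapter for sampling. Server-side this writes
#    lora_sync_path/<model_id>/ and registers the adapter with vLLM.
sampler = trainer.save_weights_and_get_sampling_client(name="step-0")

# 2. sampler.sample(...) now routes by model_id, so two clients sampling
#    concurrently each hit their own adapter.
# 3. Score the rollouts, then forward_backward + optim_step update this
#    tenant's adapter slot only.
```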

### 1. Start the server

```bash
uv run --extra tinker --extra megatron -m skyrl.tinker.api \
  --host 0.0.0.0 \
  --port 8000 \
  --base-model Qwen/Qwen3-0.6B \
  --backend megatron \
  --backend-config '{
    "strategy": "megatron",
    "trainer.placement.policy_num_gpus_per_node": 4,
    "trainer.placement.policy_num_nodes": 1,
    "trainer.placement.colocate_all": false,
    "trainer.policy.megatron_config.tensor_model_parallel_size": 1,
    "trainer.policy.megatron_config.pipeline_model_parallel_size": 1,
    "trainer.policy.megatron_config.lora_config.merge_lora": false,
    "trainer.micro_train_batch_size_per_gpu": 64,
    "trainer.micro_forward_batch_size_per_gpu": 64,
    "generator.inference_engine.num_engines": 1,
    "generator.inference_engine.tensor_parallel_size": 1,
    "trainer.policy.model.lora.max_loras": 2,
    "trainer.policy.model.lora.max_cpu_loras": 2,
    "trainer.logprobs_chunk_size": null
  }'
```

Critical knobs vs the SL quickstart:
- `colocate_all: false` is required. Sampling and training can only progress independently across client calls if the inference engines and trainer workers are placed on different GPUs.
- `merge_lora: false` is required. With `merge_lora: true`, vLLM serves the merged base model and `sample(model=<adapter>)` returns the wrong tenant's weights.
- `max_loras` ≥ number of adapters in a single batch (typically equal to the client count).
- `max_cpu_loras` must be ≥ the number of adapters you expect to serve concurrently. There is no on-demand reload — if vLLM evicts an adapter, its next `sample()` 404s.

### 2. Run two `rl_loop` clients

```bash
# Terminal 2 — client A
TINKER_API_KEY=tml-dummy uv run --with tinker --with tinker-cookbook --with datasets --with torch \
  python -m tinker_cookbook.recipes.rl_loop \
  base_url=http://localhost:8000 \
  model_name="Qwen/Qwen3-0.6B" \
  lora_rank=32 \
  log_path=/tmp/rl_loop_a.log
```

```bash
# Terminal 3 — client B
TINKER_API_KEY=tml-dummy uv run --with tinker --with tinker-cookbook --with datasets --with torch \
  python -m tinker_cookbook.recipes.rl_loop \
  base_url=http://localhost:8000 \
  model_name="Qwen/Qwen3-0.6B" \
  lora_rank=32 \
  log_path=/tmp/rl_loop_b.log
```

Stagger by ~20 s. Both clients **must** use the same `lora_rank` and `model_name`.

You should see both clients' rewards trend upward independently in `rl_loop_a.log` and `rl_loop_b.log`, vLLM logs showing two distinct adapter names registered with `sample` requests routed to each, and GPU memory staying bounded (single base model, two LoRA adapters, with the CPU LRU holding the same two).

## Troubleshooting

- **`LoRA signature mismatch`** — clients passed different `(rank, alpha, target_modules)`. All adapters on one server share a signature, captured from the first `create_model`.
- **`sample()` 404 on `lora_name=…`** — either `save_sampler_checkpoint` wasn't called for that `model_id` before sampling, or `max_cpu_loras` is too low and vLLM evicted the adapter. Check the vLLM server log.
- **Server hangs on the second `create_model`** — the first policy build hasn't finished. Wait for `init policy model done` before starting subsequent clients.
- **CPU OOM on the Nth client** — each adapter slot holds LoRA params + fp32 main + Adam moments, roughly `~3× lora_param_bytes_per_DP_shard`. For Qwen3-0.6B at rank 32 this is on the order of tens of MB per slot; for larger models scale accordingly. Reduce concurrent adapters or move to a host with more RAM (see the sizing sketch after this list).
- **Sample returns the wrong tenant's output** — confirm `merge_lora: false` is set on the Megatron config; with merge enabled vLLM only sees the merged base.
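
To budget host RAM before adding tenants, here is a back-of-envelope calculator. The 3× multiplier is the slot contents listed above; the parameter count and fp32 width are placeholder assumptions, so substitute your adapter's real numbers:

```python
def adapter_slot_bytes(lora_params_per_dp_shard: int, bytes_per_param: int = 4) -> int:
    """LoRA params + fp32 master copy + Adam moments ~= 3x the shard's param bytes."""
    return 3 * lora_params_per_dp_shard * bytes_per_param

# Hypothetical adapter with 5M params on this DP shard, 8 concurrent tenants:
n_tenants = 8
per_slot = adapter_slot_bytes(lora_params_per_dp_shard=5_000_000)
print(f"per slot: {per_slot / 1e6:.0f} MB, "
      f"{n_tenants} tenants: {n_tenants * per_slot / 1e6:.0f} MB pinned CPU memory")
```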