NVIDIA
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
Lines changed: 1 addition & 0 deletions
diff --git a/examples/dataset/example_data_config.yaml b/examples/dataset/example_data_config.yaml
Lines changed: 6 additions & 3 deletions
diff --git a/examples/specdec_bench/specdec_bench/utils.py b/examples/specdec_bench/specdec_bench/utils.py
Lines changed: 2 additions & 0 deletions
diff --git a/examples/speculative_decoding/README.md b/examples/speculative_decoding/README.md
Lines changed: 6 additions & 0 deletions
diff --git a/examples/speculative_decoding/eagle_utils.py b/examples/speculative_decoding/eagle_utils.py
Lines changed: 8 additions & 6 deletions
diff --git a/examples/speculative_decoding/launch_train.sh b/examples/speculative_decoding/launch_train.sh
Lines changed: 21 additions & 13 deletions
diff --git a/examples/speculative_decoding/main.py b/examples/speculative_decoding/main.py
Lines changed: 4 additions & 8 deletions
diff --git a/modelopt/recipe/config.py b/modelopt/recipe/config.py
Lines changed: 17 additions & 1 deletion
diff --git a/modelopt/torch/speculative/config.py b/modelopt/torch/speculative/config.py
Lines changed: 16 additions & 2 deletions
diff --git a/modelopt/torch/speculative/eagle/utils.py b/modelopt/torch/speculative/eagle/utils.py
Lines changed: 22 additions & 4 deletions
@@ -7,6 +7,7 @@ Changelog
 **New Features**
 
 - Add the ``day0-release`` agent skill (``.agents/skills/day0-release/``), a deterministic end-to-end driver that chains the PTQ → evaluation → comparison skills (the evaluation stage deploys the checkpoint itself) with an enforced gate after each stage and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Ships three GPU-free, unit-tested gate scripts (``gate_ptq.py``, ``gate_run.py``, ``gate_compare.py``) that validate checkpoint coverage, evaluation-run completeness, and baseline-vs-candidate accuracy threshold. v1 reports and stops on regression; the recipe-search loop is deferred.
+- Add **streaming** speculative-decoding training (EAGLE3 / DFlash): the draft model trains on base-model hidden states that a co-located ``vllm serve`` produces on the fly (no disk dump) and transfers to the trainer over NIXL RDMA. The setup scales to multiple nodes (dedicated serve replicas + DDP trainers). Adds launcher examples for NVFP4 Kimi-K2.5 / K2.6 on GB200/aarch64 under ``tools/launcher/examples/moonshotai/``.
 
 0.45 (2026-06-xx)
 ^^^^^^^^^^^^^^^^^
 
@@ -6,15 +6,18 @@ outputs:
         splits:
           all: 0
       - name: "ultrachat"
+        # UltraChat's loader yields prompt-only turns (no assistant completions),
+        # which makes answer_only_loss=true mask nothing. Use daring-anteater below.
         splits:
-          train_gen: 25000
-          train_sft: 25000
+          train_gen: 0
+          train_sft: 0
       - name: "mtbench"
         splits:
           all: 0
       - name: "daring-anteater"
+        # Multi-turn SFT conversations WITH assistant completions (real split: train).
         splits:
-          all: 0
+          train: 50000
       - name: "magpie"
         splits:
           300k: 0
 
@@ -209,6 +209,8 @@ def _checkpoint_provenance(model_dir):
 
 
 def _is_sensitive_key(key):
+    # Engine configs can carry non-string dict keys (e.g. int layer ids in a
+    # serving_config); those are never sensitive field *names*, so skip them.
     if not isinstance(key, str):
         return False
     klow = key.lower()
 
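The guard in the hunk above can be exercised in isolation. The following is a minimal sketch, not the repository's implementation: the hunk does not show how the function matches names, so the marker tuple `_SENSITIVE_MARKERS` and the `redact` helper are hypothetical and exist only for illustration.

```python
# Hypothetical marker list; the real matching rules are not shown in the diff.
_SENSITIVE_MARKERS = ("token", "secret", "password", "api_key")


def is_sensitive_key(key):
    """Return True only for string keys that contain a sensitive marker."""
    # Non-string keys (e.g. int layer ids) are never field names, so return
    # early; calling .lower() on an int would raise AttributeError.
    if not isinstance(key, str):
        return False
    klow = key.lower()
    return any(marker in klow for marker in _SENSITIVE_MARKERS)


def redact(config):
    """Recursively replace values stored under sensitive keys (hypothetical helper)."""
    if isinstance(config, dict):
        return {
            k: "<redacted>" if is_sensitive_key(k) else redact(v)
            for k, v in config.items()
        }
    return config
```

With the early return in place, a config keyed by integers can be walked without special-casing the caller.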
@@ -18,6 +18,7 @@ This example focuses on training with Hugging Face. To train with Megatron‑LM,
 | Simplified Workflow | Train, evaluate, and export EAGLE model with one-line command | \[[Link](#getting-started-simplified-workflow)\] |
 | Online Training | Train draft model alongside base model in GPU memory | \[[Link](#training-draft-model-with-online-base-model)\] |
 | Offline Training | Train draft model using pre-computed hidden states | \[[Link](#training-draft-model-with-offline-base-model)\] |
+| Streaming Training | Train draft on hidden states streamed from a live vLLM serve (no disk dump) | \[[Link](#training-draft-model-with-streaming-base-model)\] |
 | After Training | Evaluation, export and deployment | \[[Link](#model-validation)\] |
 | Advanced Usage | Data synthesis, vocab compression, and configuration | \[[Link](#advanced-usage)\] |
 | Support Matrix | Supported models for speculative decoding training | \[[Link](#support-matrix)\] |
@@ -127,6 +128,10 @@ Once we finish dumping hidden states, launch offline training pointing to the hi
     training.output_dir=ckpts/llama-3.2-1b-offline
 ```
 
+## Training Draft Model with Streaming Base Model
+
+For large base models, you can stream hidden states from a live `vllm serve` instead of dumping them to disk: a co-located server produces the base-model hidden states on the fly and sends them to the trainer over NIXL RDMA, scaling to multiple nodes (dedicated serve replicas + DDP trainers). See the launcher examples, e.g. [Kimi-K2.5 streaming EAGLE3](../../tools/launcher/examples/moonshotai/Kimi-K2.5/hf_streaming_eagle3_multi_node.yaml) and [streaming DFlash](../../tools/launcher/examples/moonshotai/Kimi-K2.5/hf_streaming_dflash_multi_node.yaml).
+
 ## Model Validation
 
 For online training checkpoints, we can run in-framework evaluation on MT-bench:
@@ -334,6 +339,7 @@ See `main.py` for the full example including tokenizer setup, dataset loading, a
 | Mistral | ✅ | ✅ | ✅ |
 | Phi 3 | ✅ | ✅ | ✅ |
 | QWen 1.5,2,2.5,3 | ✅ | ✅ | ✅ |
+| Kimi-K2.5, K2.6 |  |  | ✅ |
 
 ## Speculation Module Checkpoints
 
 
@@ -59,7 +59,6 @@ def make_speculative_data_module(
     train_len=None,
     answer_only_loss=False,
     shift_labels=True,
-    seed: int = 0,
 ) -> dict:
     """Create data module for speculative decoding training.
 
@@ -88,14 +87,15 @@ def make_speculative_data_module(
         ds = load_dataset("json", data_files=data_args.data_path, split="train")
         if data_args.sample_size > 0:
             ds = ds.select(range(data_args.sample_size))
+        # Map-style dataset: each rank fetches its own DistributedSampler shard.
+        # Fetch concurrency comes from the DataLoader's num_workers, not a config knob;
+        # shuffling/order is the sampler's job (seeded by training_args.seed).
+        # ``server_urls`` accepts a comma-separated string for multi-server fan-out.
         streaming_cfg = EagleVllmStreamingConfig(
-            server_url=data_args.streaming_server_url,
+            server_urls=data_args.streaming_server_url,
             model=data_args.streaming_model_name,
-            shared_storage_root=data_args.streaming_shared_storage_path,
             max_seq_len=train_len,
             answer_only_loss=answer_only_loss,
-            prefetch=data_args.streaming_prefetch,
-            seed=seed,
         )
         train_dataset = EagleVllmStreamingDataset(
             entries=ds,
@@ -138,7 +138,9 @@ def make_speculative_data_module(
             raise ValueError("sample_size must be -1 (use all samples) or a positive integer")
         if data_args.sample_size > 0:
             dumped_files = dumped_files[: data_args.sample_size]
-        train_dataset = OfflineSupervisedDataset(dumped_files, answer_only_loss=answer_only_loss)
+        train_dataset = OfflineSupervisedDataset(
+            dumped_files, answer_only_loss=answer_only_loss, tokenizer=tokenizer
+        )
         data_collator = EagleOfflineDataCollator(train_len=train_len)
 
     return {
 
@@ -19,9 +19,8 @@
 #   Multi-node:   ./launch_train.sh --config ../../modelopt_recipes/general/speculative_decoding/eagle3.yaml --num_nodes 2 --head_node_ip <IP>
 #   With overrides: ./launch_train.sh --config my.yaml model.model_name_or_path=xxx training.output_dir=yyy
 #
-# Extra key=value args are forwarded as OmegaConf dotlist overrides to main.py.
-# All training config (model, data, hyperparams, eagle, fsdp) lives in the YAML file.
-# Only multi-node routing args are passed here; mixed_precision is fixed to bf16.
+# Extra key=value args are forwarded as OmegaConf dotlist overrides to main.py; all
+# training config lives in the YAML. mixed_precision is fixed to bf16.
 
 set -eo pipefail
 
@@ -30,12 +29,14 @@ SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
 CONFIG_FILE=""
 NUM_NODES=1
 HEAD_NODE_IP=""
+MACHINE_RANK=""
 EXTRA_ARGS=()
 while [ $# -gt 0 ]; do
   case "$1" in
     --config*)     if [[ "$1" != *=* ]]; then shift; fi; CONFIG_FILE="${1#*=}" ;;
     --num_nodes*)  if [[ "$1" != *=* ]]; then shift; fi; NUM_NODES="${1#*=}" ;;
     --head_node_ip*) if [[ "$1" != *=* ]]; then shift; fi; HEAD_NODE_IP="${1#*=}" ;;
+    --machine_rank*) if [[ "$1" != *=* ]]; then shift; fi; MACHINE_RANK="${1#*=}" ;;
     *) EXTRA_ARGS+=("$1") ;;
   esac
   shift
@@ -46,7 +47,6 @@ if [ -z "$CONFIG_FILE" ]; then
   exit 1
 fi
 
-# GPU count detection
 if [[ "$NUM_NODES" != "1" ]]; then
   GPU_PER_NODE=${GPU_PER_NODE:-$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)}
   TOTAL_GPU=$((NUM_NODES * GPU_PER_NODE))
@@ -56,20 +56,28 @@ else
   echo "Total GPUs: $TOTAL_GPU (single node)"
 fi
 
-# Multi-node routing args (accelerate only; training config comes from the YAML)
-MULTI_NODE_ARGS=""
+MULTI_NODE_ARGS=()
 if [[ "$NUM_NODES" != "1" ]]; then
-  MULTI_NODE_ARGS="--num_processes $TOTAL_GPU \
-                   --num_machines $NUM_NODES \
-                   --machine_rank $SLURM_PROCID \
-                   --rdzv_backend c10d \
-                   --main_process_ip $HEAD_NODE_IP \
-                   --main_process_port 29500"
+  # --multi_gpu is required even at 1 GPU/node, else accelerate won't form the DDP group.
+  # machine_rank defaults to $SLURM_PROCID; override --machine_rank if node 0 isn't a trainer.
+  MULTI_NODE_ARGS=(
+    --multi_gpu
+    --num_processes "$TOTAL_GPU"
+    --num_machines "$NUM_NODES"
+    --machine_rank "${MACHINE_RANK:-$SLURM_PROCID}"
+    --main_process_ip "$HEAD_NODE_IP"
+    --main_process_port 29500
+  )
 fi
 
 export TOKENIZERS_PARALLELISM=False
 
+# argv array, not `sh -c` (which would word-split overrides and run embedded substitutions).
+CMD=(accelerate launch --mixed_precision bf16
+     "${MULTI_NODE_ARGS[@]}"
+     "${SCRIPT_DIR}/main.py" --config "$CONFIG_FILE" "${EXTRA_ARGS[@]}")
+
 set -x
 start_time=$(date +%s)
-sh -c "accelerate launch --mixed_precision bf16 $MULTI_NODE_ARGS ${SCRIPT_DIR}/main.py --config $CONFIG_FILE ${EXTRA_ARGS[*]}"
+"${CMD[@]}"
 echo "Total time: $(( $(date +%s) - $start_time )) seconds"
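The reason for switching from a flattened `sh -c` string to an argv array can be sketched in Python. This is an illustration under assumptions: the override values are made up, and `shlex.split` only approximates POSIX word splitting (it does not run command substitutions, which a real `sh -c` would).

```python
import shlex

# Hypothetical overrides: one value contains a space, one a command substitution.
overrides = ["training.output_dir=my ckpts", "data.note=$(whoami)"]

# Old approach: flatten everything into one string and let a shell re-parse it.
flattened = " ".join(overrides)
reparsed = shlex.split(flattened)  # the space splits one override into two words

# New approach: keep one list element per argument, so nothing is re-parsed.
argv = ["accelerate", "launch", "main.py", *overrides]
```

Here `reparsed` has three elements while `overrides` has two, and the argv list passes both overrides through byte for byte.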
@@ -267,7 +267,6 @@ def train():
         train_len=training_args.training_seq_len,
         answer_only_loss=training_args.answer_only_loss,
         shift_labels=not is_dflash,
-        seed=training_args.seed,
     )
 
     callbacks = [EagleTrainingPlot(training_args.ar_validate_steps, training_args.estimate_ar)]
@@ -277,13 +276,10 @@ def train():
         and recipe.eagle.eagle_base_lora_warmup_steps > 0
     ):
         callbacks.append(LoRAWarmupCallback(recipe.eagle.eagle_base_lora_warmup_steps))
-    if recipe.data.mode == "streaming":
-        # Skip-on-resume happens inside the dataset (no re-fetch from server);
-        # disable HF Trainer's own data skip so the offset isn't applied twice.
-        from modelopt.torch.speculative.plugins.hf_streaming_dataset import StreamingResumeCallback
-
-        training_args.ignore_data_skip = True
-        callbacks.append(StreamingResumeCallback())
+    # Leave training_args.ignore_data_skip at its default (False). The dataset is
+    # map-style, so HF Trainer's resume skips consumed indices at the batch-sampler
+    # level (accelerate.skip_first_batches) without re-fetching them, landing at the
+    # exact data position. Setting it True would restart the data order from the top.
 
     trainer = EagleTrainerWithAccLog(
         model=model,
 
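The resume behaviour described in the comment above relies on the batch order being a pure function of the seed. A minimal, framework-free sketch of that idea follows; `batch_order` and `resume` are hypothetical stand-ins for the seeded sampler and for the Trainer's batch skipping, not the library code.

```python
import random


def batch_order(num_samples, batch_size, seed):
    """Deterministic shuffled batches of dataset indices (stand-in for a seeded sampler)."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i : i + batch_size] for i in range(0, num_samples, batch_size)]


def resume(num_samples, batch_size, seed, consumed_batches):
    """Drop already-consumed batches by index; skipped samples are never fetched."""
    return batch_order(num_samples, batch_size, seed)[consumed_batches:]


full = batch_order(num_samples=10, batch_size=2, seed=0)
resumed = resume(num_samples=10, batch_size=2, seed=0, consumed_batches=3)
```

Because only indices are skipped, a map-style dataset never contacts the server for the consumed samples, and the resumed run continues with exactly the batches the original run had left.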
@@ -31,6 +31,18 @@
     TrainingArguments as SpecTrainingArgs,
 )
 
+__all__ = [
+    "RECIPE_TYPE_TO_CLASS",
+    "ModelOptDFlashRecipe",
+    "ModelOptEagleRecipe",
+    "ModelOptMedusaRecipe",
+    "ModelOptPTQRecipe",
+    "ModelOptRecipeBase",
+    "ModelOptSpeculativeRecipeBase",
+    "RecipeMetadataConfig",
+    "RecipeType",
+]
+
 
 class RecipeType(str, Enum):
     """List of recipe types. See ``RECIPE_TYPE_TO_CLASS`` at the bottom for the schema mapping."""
@@ -178,7 +190,11 @@ class ModelOptDFlashRecipe(ModelOptSpeculativeRecipeBase):
 
     @model_validator(mode="after")
     def _derive_dflash_offline(self) -> ModelOptDFlashRecipe:
-        self.dflash.dflash_offline = self.data.offline_data_path is not None
+        # offline (dumped .pt) and streaming (hidden states via NIXL RDMA from a vLLM
+        # serve) both feed pre-computed base hidden states to the DFlash module, so
+        # both set dflash_offline. Only fully-online training runs the base model.
+        # Mirrors ModelOptEagleRecipe._derive_eagle_offline.
+        self.dflash.dflash_offline = self.data.mode != "online"
         return self
 
 
 
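The derivation rule in the validator above can be illustrated without pydantic. In this sketch the dataclasses are hypothetical stand-ins for the recipe classes, and the mode value `"offline"` is an assumption (the diff itself only names `"online"` and `"streaming"`).

```python
from dataclasses import dataclass, field


@dataclass
class DataCfg:  # hypothetical stand-in for the recipe's data section
    mode: str = "online"  # assumed values: "online", "offline", "streaming"


@dataclass
class DFlashCfg:  # hypothetical stand-in for DFlashConfig
    dflash_offline: bool = False


@dataclass
class Recipe:
    data: DataCfg = field(default_factory=DataCfg)
    dflash: DFlashCfg = field(default_factory=DFlashCfg)

    def __post_init__(self):
        # Derived, not user-configurable: any supplied value is overwritten.
        # Every mode except "online" feeds pre-computed hidden states.
        self.dflash.dflash_offline = self.data.mode != "online"
```

Keying the flag on `mode != "online"` rather than on the presence of an offline path means a newly added pre-computed mode is treated as offline by default.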
@@ -23,6 +23,18 @@
 
 from .eagle.default_config import default_eagle_config, default_kimik2_eagle_config
 
+__all__ = [
+    "DFLASH_DEFAULT_CFG",
+    "EAGLE3_DEFAULT_CFG",
+    "EAGLE_MTP_DEFAULT_CFG",
+    "DFlashConfig",
+    "EagleConfig",
+    "MedusaConfig",
+    "eagle3_default_config",
+    "eagle_mtp_default_config",
+    "kimik2_eagle_default_config",
+]
+
 kimik2_eagle_default_config = deepcopy(default_kimik2_eagle_config)
 
 eagle3_default_config = deepcopy(default_eagle_config)
@@ -68,8 +80,10 @@ class DFlashConfig(ModeloptBaseConfig):
     dflash_offline: bool = ModeloptField(
         default=False,
         description=(
-            "Whether to use detached DFlash (offline training from pre-computed hidden states). "
-            "Derived by ModelOptDFlashRecipe from data.offline_data_path; not user-configurable."
+            "Whether the DFlash module consumes pre-computed hidden states (offline from "
+            "dumped .pt files, or streaming via NIXL RDMA from a vLLM serve) instead of running "
+            "the base model. Derived by ModelOptDFlashRecipe from data.mode (True unless "
+            "online); not user-configurable."
         ),
     )
 
 
@@ -41,6 +41,8 @@
 from torch.utils.data import Dataset
 from transformers.trainer_pt_utils import LabelSmoother
 
+from modelopt.torch.utils.loss_mask import get_loss_mask_recovery
+
 IGNORE_TOKEN_ID = LabelSmoother.ignore_index
 
 
@@ -96,20 +98,27 @@ class OfflineSupervisedDataset(Dataset):
         dumped_files (list): A list of file paths to the dumped .pt files.
         answer_only_loss (bool): If True, use the ``loss_mask`` stored in each .pt
             file so that only assistant-produced tokens contribute to the loss.
-            Raises ``ValueError`` on ``__getitem__`` if the file lacks ``loss_mask``.
+            If a file lacks ``loss_mask`` and ``tokenizer`` has a registered
+            model-specific recovery (see ``modelopt.torch.utils.loss_mask``), the
+            mask is rebuilt from ``input_ids``; otherwise ``__getitem__`` raises
+            ``ValueError``.
             If False (default), a uniform all-ones mask is used regardless of what
             is stored in the file (backward compatible).
+        tokenizer: Optional tokenizer used to recover the assistant mask for dumps
+            that lack a stored ``loss_mask``.
     """
 
     def __init__(
         self,
         dumped_files,
         answer_only_loss: bool = False,
+        tokenizer=None,
     ):
         """Initialize with a list of .pt file paths."""
         super().__init__()
         self.dumped_files = dumped_files
         self.answer_only_loss = answer_only_loss
+        self.tokenizer = tokenizer
 
     def __len__(self):
         return len(self.dumped_files)
@@ -121,13 +130,22 @@ def __getitem__(self, i) -> dict[str, torch.Tensor]:
         labels[..., :-1] = offline_data["input_ids"][..., 1:]
 
         if self.answer_only_loss:
-            if "loss_mask" not in offline_data:
+            recovery = (
+                get_loss_mask_recovery(self.tokenizer) if self.tokenizer is not None else None
+            )
+            if "loss_mask" in offline_data:
+                loss_mask = offline_data["loss_mask"].to(offline_data["input_ids"].dtype)
+            elif recovery is not None:
+                # Dumps from tokenizers that cannot emit assistant masks carry no
+                # loss_mask; rebuild it from the token ids.
+                loss_mask = recovery.compute(self.tokenizer, offline_data["input_ids"]).to(
+                    offline_data["input_ids"].dtype
+                )
+            else:
                 raise ValueError(
                     f"answer_only_loss=True requires a 'loss_mask' entry in the offline "
                     f".pt file, but {self.dumped_files[i]} does not have one. Re-dump "
-                    f"with --answer-only-loss in compute_hidden_states_*.py."
+                    f"with --answer-only-loss in compute_hidden_states_*.py, or pass a "
+                    f"tokenizer with a registered loss-mask recovery."
                 )
-            loss_mask = offline_data["loss_mask"].to(offline_data["input_ids"].dtype)
         else:
             loss_mask = torch.ones_like(offline_data["input_ids"])
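The mask-selection order in the hunk above (stored mask, then recovery, then error) can be sketched with plain lists standing in for tensors. `resolve_loss_mask` and `toy_recover` are hypothetical names; the real recovery comes from `modelopt.torch.utils.loss_mask`, whose internals the diff does not show.

```python
def resolve_loss_mask(sample, answer_only_loss, recover=None):
    """Pick the loss mask for one sample; plain lists stand in for tensors."""
    input_ids = sample["input_ids"]
    if not answer_only_loss:
        return [1] * len(input_ids)  # uniform mask; any stored mask is ignored
    if "loss_mask" in sample:
        return list(sample["loss_mask"])  # a stored mask takes precedence
    if recover is not None:
        return recover(input_ids)  # rebuild the mask from the token ids
    raise ValueError("answer_only_loss=True needs a stored loss_mask or a recovery")


def toy_recover(input_ids, assistant_start=100):
    """Hypothetical recovery: tokens after the marker id count as assistant tokens."""
    mask, in_answer = [], False
    for tok in input_ids:
        mask.append(1 if in_answer else 0)
        if tok == assistant_start:
            in_answer = True
    return mask
```

For example, `resolve_loss_mask({"input_ids": [1, 2, 100, 7, 8]}, True, toy_recover)` masks everything up to and including the marker token and keeps the two tokens after it.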