Commit fd35140 ("update")

1 parent fc1b673

File tree: 11 files changed, +249 −155 lines

docs/source/Instruction/GKD.md

Lines changed: 11 additions & 14 deletions

@@ -196,39 +196,36 @@ swift rlhf \
 | `--gkd_logits_topk` | int | **Required** | Must be set when using an external API; corresponds to the number of top_logprobs returned by the API |

 **Supported backends**
-- `swift deploy` (vLLM backend)
-- Standalone vLLM server (`vllm serve`)
+- `vllm serve` (recommended)
+
+> **Note**: Only `vllm serve` is supported as the teacher server backend. The training code passes token IDs directly through the `/v1/completions` endpoint and uses the `prompt_logprobs` parameter to obtain log probabilities for the input tokens, which is a vLLM-native feature.

 **Step 1: Deploy the teacher model service**

 ```bash
-# Deploy the teacher model with swift deploy
-CUDA_VISIBLE_DEVICES=0,1 swift deploy \
-    --model Qwen/Qwen2-72B-Instruct \
-    --infer_backend vllm \
+# Deploy the teacher model with vllm serve
+CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-14B-Instruct \
     --port 8000 \
-    --vllm_engine_kwargs '{"max_logprobs": 64}'
-
-# Or use a standalone vLLM server
-vllm serve Qwen/Qwen2-72B-Instruct --max-logprobs 64 --port 8000
+    --max-logprobs 64 \
+    --gpu-memory-utilization 0.9
 ```

 **Step 2: Start GKD training**

 ```bash
 swift rlhf \
     --rlhf_type gkd \
-    --model Qwen/Qwen2-7B-Instruct \
+    --model Qwen/Qwen2.5-7B \
     --teacher_model_server http://localhost:8000 \
-    --gkd_logits_topk 20 \
+    --gkd_logits_topk 64 \
     --dataset your_dataset \
     --lmbda 1.0 \
-    --beta 0.5 \
+    --beta 1.0 \
     ...
 ```

 > **vLLM max_logprobs limit**
-> - vLLM defaults to `max_logprobs=20`; adjust it via `--vllm_engine_kwargs '{"max_logprobs": N}'`
+> - vLLM defaults to `max_logprobs=20`; adjust it via `--max-logprobs N`
 > - `gkd_logits_topk` must not exceed the server-side `max_logprobs` setting

 ## Sampling Acceleration
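To make the `gkd_logits_topk` contract above concrete: the teacher API returns a ragged set of top log-probabilities per input position, which the trainer must align into fixed-width top-k arrays. A minimal sketch of that alignment, assuming per-position `{token_id: logprob}` dicts as input (the helper name `to_topk_arrays` is hypothetical, not part of swift); padding with `-inf` mirrors how masked positions carry zero probability after softmax:

```python
import math

def to_topk_arrays(prompt_logprobs, topk):
    """Convert per-position logprob dicts {token_id: logprob} into fixed-width
    (logprobs, indices) lists of length `topk`, padded with -inf / 0."""
    logprobs, indices = [], []
    for entry in prompt_logprobs:
        # Keep the k highest-probability entries, most likely first.
        pairs = sorted(entry.items(), key=lambda kv: kv[1], reverse=True)[:topk]
        pad = topk - len(pairs)
        logprobs.append([lp for _, lp in pairs] + [-math.inf] * pad)
        indices.append([tid for tid, _ in pairs] + [0] * pad)
    return logprobs, indices

# Example: two positions with ragged top_logprobs from the teacher API.
lp, ix = to_topk_arrays([{5: -0.1, 9: -2.3}, {7: -0.5}], topk=3)
print(ix[0])  # [5, 9, 0]
```

Exceeding the server's `max_logprobs` makes `topk` unreachable, which is why the note above ties `gkd_logits_topk` to the server setting.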

docs/source/Megatron-SWIFT/GKD.md

Lines changed: 1 addition & 1 deletion

@@ -34,7 +34,7 @@ Megatron GKD currently supports the following features:
 | Parameter | Type | Default | Description |
 |------|------|--------|------|
 | `--teacher_model` | str | - | Teacher model path or model ID<br>*Can be omitted when `teacher_model_server` is used |
-| `--teacher_model_server` | str | None | Teacher model server URL, e.g. `http://localhost:8000` |
+| `--teacher_model_server` | str | None | Teacher model server URL (only `vllm serve` is supported), e.g. `http://localhost:8000` |
 | `--gkd_logits_topk` | int | None | Number of top-K logits; required when using an external teacher API |
 | `--beta` | float | 0.5 | JSD interpolation coefficient:<br>• 0.0: forward KL<br>• 0.5: symmetric JSD<br>• 1.0: reverse KL |
 | `--lmbda` | float | 0.5 | Probability of triggering on-policy learning:<br>• 0.0: pure off-policy<br>• 1.0: pure on-policy |

docs/source_en/Instruction/GKD.md

Lines changed: 11 additions & 11 deletions

@@ -197,36 +197,36 @@ When `gkd_logits_topk` is set, you can use an external teacher model API service
 | `--gkd_logits_topk` | int | **Required** | Must be set when using external API; corresponds to the top_logprobs returned by the API |

 **Supported Backends**:
-- `swift deploy` (vLLM backend)
-- Standalone vLLM server (`vllm serve`)
+- `vllm serve` (recommended)
+
+> **Note**: Only `vllm serve` is supported as the teacher server backend. The training code sends raw token IDs via the `prompt` field and uses the `prompt_logprobs` parameter in the `/v1/completions` API to obtain input token log-probabilities. This is a vLLM-native feature.

 **Step 1: Deploy Teacher Model Service**

 ```bash
-# Deploy teacher model with swift deploy (recommended)
-swift deploy \
-    --model Qwen/Qwen2.5-14B-Instruct \
-    --infer_backend vllm \
+# Deploy teacher model with vllm serve
+CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-14B-Instruct \
     --port 8000 \
-    --vllm_engine_kwargs '{"max_logprobs": 64}'
+    --max-logprobs 64 \
+    --gpu-memory-utilization 0.9
 ```

 **Step 2: Start GKD Training**

 ```bash
 swift rlhf \
     --rlhf_type gkd \
-    --model Qwen/Qwen2.5-7B-Instruct \
+    --model Qwen/Qwen2.5-7B \
     --teacher_model_server http://localhost:8000 \
-    --gkd_logits_topk 20 \
+    --gkd_logits_topk 64 \
     --dataset your_dataset \
     --lmbda 1.0 \
-    --beta 0.5 \
+    --beta 1.0 \
     ...
 ```

 > **vLLM max_logprobs Limitation**:
-> - vLLM default `max_logprobs=20`, adjustable via `--vllm_engine_kwargs '{"max_logprobs": N}'` parameter
+> - vLLM default `max_logprobs=20`, adjustable via `--max-logprobs N` parameter
 > - `gkd_logits_topk` cannot exceed the server's `max_logprobs` setting

 ## Sampling Acceleration
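The note in the hunk above describes the teacher request shape: raw token IDs in the `prompt` field plus the `prompt_logprobs` parameter of `/v1/completions`. A minimal sketch of building such a request body (no network call; the model name is a placeholder, and `max_tokens` set to 1 is an assumption about the cheapest way to trigger prompt scoring):

```python
import json

def build_teacher_request(token_ids, topk, model='Qwen/Qwen2.5-14B-Instruct'):
    """Build a JSON body for a vLLM /v1/completions call that returns
    log-probabilities for the *input* tokens (no real generation needed)."""
    return {
        'model': model,           # must match the model served by vllm serve
        'prompt': token_ids,      # vLLM accepts raw token IDs here
        'max_tokens': 1,          # we only need the prompt-side logprobs
        'prompt_logprobs': topk,  # top-k logprobs per input position
    }

body = build_teacher_request([151644, 872, 198], topk=64)
print(json.dumps(body))
```

The server rejects `prompt_logprobs` values above its `--max-logprobs` setting, which is the limitation called out in the note above.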

docs/source_en/Megatron-SWIFT/GKD.md

Lines changed: 1 addition & 1 deletion

@@ -34,7 +34,7 @@ Megatron GKD currently supports the following features:
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
 | `--teacher_model` | str | - | Path or model ID of the teacher model<br>*Can be omitted when using `teacher_model_server` |
-| `--teacher_model_server` | str | None | Teacher model service URL, e.g. `http://localhost:8000` |
+| `--teacher_model_server` | str | None | Teacher model service URL (`vllm serve` only), e.g. `http://localhost:8000` |
 | `--gkd_logits_topk` | int | None | Number of Top-K logits; required when using external API |
 | `--beta` | float | 0.5 | JSD divergence interpolation coefficient:<br>• 0.0: Forward KL<br>• 0.5: Symmetric JSD<br>• 1.0: Reverse KL |
 | `--lmbda` | float | 0.5 | On-Policy learning probability:<br>• 0.0: Pure Off-Policy<br>• 1.0: Pure On-Policy |
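The `--beta` semantics in the table above can be made concrete with the generalized JSD on two small distributions. A pure-Python sketch (the function names are illustrative, not swift's API); at β=0.5 the divergence is symmetric, while the β→0 and β→1 limits recover forward and reverse KL up to normalization:

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(p, q, beta):
    """JSD_beta(p || q) = beta*KL(p||m) + (1-beta)*KL(q||m),
    with the mixture m = beta*p + (1-beta)*q."""
    m = [beta * pi + (1 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1 - beta) * kl(q, m)

teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]
# Symmetric at beta=0.5: swapping arguments changes nothing.
print(abs(generalized_jsd(teacher, student, 0.5)
          - generalized_jsd(student, teacher, 0.5)) < 1e-12)
```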

examples/megatron/rlhf/gkd/teacher_server.sh

Lines changed: 5 additions & 0 deletions

@@ -1,3 +1,8 @@
+# GKD Training with External Teacher Model Server (Megatron)
+#
+# Start teacher server first (in a separate terminal / GPU):
+# CUDA_VISIBLE_DEVICES=4 vllm serve Qwen/Qwen3-8B --port 8000 --max-logprobs 64
+
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
 NPROC_PER_NODE=4 \
 PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
Lines changed: 77 additions & 0 deletions

@@ -0,0 +1,77 @@
+# GKD on GSM8K: Teacher Server Mode with Top-K Logits
+#
+# This script validates GKD effectiveness on mathematical reasoning using GSM8K.
+# Student: Qwen2.5-1.5B-Instruct, Teacher: Qwen2.5-7B-Instruct (via vllm serve)
+#
+# Expected outcome: GSM8K accuracy should improve after GKD training, as the student
+# learns the teacher's reasoning distribution on math problems.
+#
+# ===================== Step 1: Start Teacher Server =====================
+# Run in a separate terminal / GPU:
+#
+# CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-7B-Instruct \
+#     --port 8000 \
+#     --max-logprobs 64 \
+#     --gpu-memory-utilization 0.9
+#
+# Wait until the server is ready, then verify:
+#   curl http://localhost:8000/v1/models
+# ========================================================================
+#
+# ===================== Step 2: Prepare GSM8K Dataset =====================
+# The dataset uses the standard GSM8K train split from Hugging Face:
+#   openai/gsm8k (7473 training samples)
+# Swift will auto-download it via the HuggingFace dataset name.
+# ========================================================================
+#
+# ===================== Step 3: Evaluation =====================
+# After training, evaluate on the GSM8K test set:
+#
+# CUDA_VISIBLE_DEVICES=0 swift eval \
+#     --model <output_dir>/checkpoint-xxx \
+#     --eval_backend OpenCompass \
+#     --infer_backend vllm \
+#     --eval_dataset gsm8k
+#
+# Compare with the base model to verify improvement:
+# CUDA_VISIBLE_DEVICES=0 swift eval \
+#     --model Qwen/Qwen2.5-1.5B-Instruct \
+#     --eval_backend OpenCompass \
+#     --infer_backend vllm \
+#     --eval_dataset gsm8k
+# ========================================================================

+TEACHER_SERVER_URL=${TEACHER_SERVER_URL:-"http://localhost:8000"}
+GKD_LOGITS_TOPK=${GKD_LOGITS_TOPK:-64}

+CUDA_VISIBLE_DEVICES=1 \
+PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
+swift rlhf \
+    --rlhf_type gkd \
+    --model Qwen/Qwen2.5-1.5B-Instruct \
+    --teacher_model_server $TEACHER_SERVER_URL \
+    --gkd_logits_topk $GKD_LOGITS_TOPK \
+    --tuner_type lora \
+    --lora_rank 64 \
+    --lora_alpha 128 \
+    --dataset 'openai/gsm8k#train' \
+    --seq_kd false \
+    --lmbda 0 \
+    --beta 0.5 \
+    --torch_dtype bfloat16 \
+    --num_train_epochs 3 \
+    --per_device_train_batch_size 2 \
+    --per_device_eval_batch_size 2 \
+    --learning_rate 5e-5 \
+    --gradient_accumulation_steps 8 \
+    --eval_steps 200 \
+    --save_steps 200 \
+    --save_total_limit 3 \
+    --logging_steps 5 \
+    --max_length 1024 \
+    --warmup_ratio 0.05 \
+    --save_only_model true \
+    --dataloader_num_workers 4 \
+    --dataset_num_proc 4 \
+    --deepspeed zero2 \
+    --attn_impl flash_attn
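For orientation, the optimizer-step batch implied by the flags in the script above, as a back-of-envelope sketch (it ignores any dropped last batch; the single-GPU count follows from `CUDA_VISIBLE_DEVICES=1` exposing one device):

```python
# Values taken from the script above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 1  # CUDA_VISIBLE_DEVICES=1 exposes a single device

# Samples consumed per optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# GSM8K train size quoted in the script header: 7473 samples.
steps_per_epoch = 7473 // effective_batch
print(effective_batch, steps_per_epoch)  # 16 467
```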

examples/train/rlhf/gkd/teacher_server.sh

Lines changed: 21 additions & 10 deletions

@@ -1,21 +1,32 @@
-# GKD Training with External Teacher Model Server
+# GKD Training with External Teacher Model Server (vLLM)
 #
 # This script demonstrates using an external vLLM server as the teacher model
-# for knowledge distillation.
+# for knowledge distillation. The teacher server provides prompt_logprobs via
+# the /v1/completions endpoint, which requires native vLLM serving (vllm serve).
+#
+# NOTE: Only `vllm serve` is supported as the teacher server backend, because
+# the training code sends raw token IDs via the `prompt` field and uses the
+# `prompt_logprobs` parameter in the /v1/completions API. This is a vLLM-native
+# feature not available through swift deploy.

-# Teacher Server Setup (run on a separate GPU):
-# CUDA_VISIBLE_DEVICES=5 swift deploy \
-#     --model Qwen/Qwen2.5-14B-Instruct \
-#     --infer_backend vllm \
-#     --port 8000 \
-#     --vllm_engine_kwargs '{"max_logprobs": 64}'
+# ===================== Step 1: Start Teacher Server =====================
+# Run in a separate terminal / GPU:
+#
+# CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-14B-Instruct \
+#     --port 8000 \
+#     --max-logprobs 64 \
+#     --gpu-memory-utilization 0.9
+#
+# Wait until the server is ready (shows "Uvicorn running on ...").
+# Verify with: curl http://localhost:8000/v1/models
+# ========================================================================

-TEACHER_SERVER_URL=${TEACHER_SERVER_URL:-"http://localhost:8001"}
+TEACHER_SERVER_URL=${TEACHER_SERVER_URL:-"http://localhost:8000"}
 GKD_LOGITS_TOPK=${GKD_LOGITS_TOPK:-64}

 NPROC_PER_NODE=4 \
 PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
-CUDA_VISIBLE_DEVICES=0,1,2,3 \
+CUDA_VISIBLE_DEVICES=1,2,3,4 \
 swift rlhf \
     --rlhf_type gkd \
     --model Qwen/Qwen2.5-7B \

swift/megatron/trainers/gkd_trainer.py

Lines changed: 11 additions & 8 deletions

@@ -285,8 +285,7 @@ def _compute_teacher_logits_local(self, encoded_batches: List[Dict], vp_stage: O
             teacher_logits = teacher_logits.detach()

             if topk is not None and teacher_logits is not None:
-                scaled = teacher_logits / self.temperature
-                topk_logits, topk_indices = torch.topk(scaled, k=topk, dim=-1)
+                topk_logits, topk_indices = torch.topk(teacher_logits, k=topk, dim=-1)
                 encoded_batch['teacher_api_logprobs'] = topk_logits
                 encoded_batch['teacher_api_indices'] = topk_indices
                 encoded_batch['teacher_logits'] = None

@@ -295,12 +294,16 @@ def _compute_teacher_logits_local(self, encoded_batches: List[Dict], vp_stage: O
     def _compute_teacher_logits_from_api(self, encoded_batches: List[Dict]) -> None:
         """Fetch teacher logprobs from external API service."""
-        from swift.rlhf_trainers.teacher_api_client import fetch_teacher_logprobs
+        from swift.rlhf_trainers.gkd_trainer import fetch_teacher_logprobs
         topk = self.gkd_logits_topk
         for encoded_batch in encoded_batches:
             input_ids = encoded_batch['input_ids']
             teacher_logprobs, teacher_indices = fetch_teacher_logprobs(
                 self.teacher_model_server, input_ids.tolist(), topk=topk)
+            # fetch_teacher_logprobs returns [batch, seq_len-1, topk] (shifted).
+            # Pad last position with -inf to match student [batch, seq_len, topk].
+            teacher_logprobs = F.pad(teacher_logprobs, (0, 0, 0, 1), value=float('-inf'))
+            teacher_indices = F.pad(teacher_indices, (0, 0, 0, 1), value=0)
             encoded_batch['teacher_api_logprobs'] = teacher_logprobs.to(input_ids.device)
             encoded_batch['teacher_api_indices'] = teacher_indices.to(input_ids.device)
             encoded_batch['teacher_logits'] = None

@@ -474,14 +477,14 @@ def generalized_jsd_loss(
     def _jsd_topk(self, student_logits, teacher_topk_logprobs, teacher_topk_indices, mask, beta):
         """Compute JSD on teacher's top-k distribution.

-        Handles both local top-k (raw logits) and API top-k (raw logprobs) by
-        normalizing both teacher and student over the top-k subset via log_softmax.
+        Both local and API teacher are handled uniformly: gather student logits at
+        teacher's top-k indices, scale by 1/T, and log_softmax over the top-k subset.
+        By shift-invariance of log_softmax, this gives identical results whether
+        teacher_topk_logprobs contains raw logits (local) or raw logprobs (API).
         """
         s_scaled = student_logits / self.temperature
         s_topk = torch.gather(s_scaled, dim=-1, index=teacher_topk_indices)
-
-        # Normalize both over top-k subset (handles both raw logits and API logprobs)
-        t_log_p = F.log_softmax(teacher_topk_logprobs, dim=-1)
+        t_log_p = F.log_softmax(teacher_topk_logprobs / self.temperature, dim=-1)
         s_log_p = F.log_softmax(s_topk, dim=-1)
         t_p = torch.exp(t_log_p)

swift/rlhf_trainers/__init__.py

Lines changed: 0 additions & 2 deletions

@@ -15,7 +15,6 @@
     from .ppo_trainer import PPOTrainer
     from .reward_trainer import RewardTrainer
     from .rlhf_mixin import RLHFTrainerMixin
-    from .teacher_api_client import fetch_teacher_logprobs
     from .utils import _ForwardRedirection, patch_lora_merge, patch_lora_unmerge, round_robin
     from .vllm_client import VLLMClient
 else:

@@ -32,7 +31,6 @@
         'args_mixin': ['VllmArguments', 'GRPOArgumentsMixin'],
         'utils': ['patch_lora_merge', 'patch_lora_unmerge', 'round_robin', '_ForwardRedirection'],
         'vllm_client': ['VLLMClient'],
-        'teacher_api_client': ['fetch_teacher_logprobs'],
         'arguments':
         ['DPOConfig', 'CPOConfig', 'KTOConfig', 'ORPOConfig', 'PPOConfig', 'RewardConfig', 'GRPOConfig', 'GKDConfig']
     }
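The `else:` branch edited above maintains an `_import_structure` dict, which suggests the lazy-import pattern (PEP 562 module `__getattr__`): symbols are resolved on first attribute access rather than at import time, so removing `teacher_api_client` needs a change in both the eager and lazy paths. A toy sketch of the mechanism, with stand-in stdlib modules instead of swift's submodules:

```python
import importlib
import sys
import types

# Maps submodule name -> symbols it provides (stand-ins for the real entries).
_import_structure = {'math': ['sqrt'], 'json': ['dumps']}

lazy = types.ModuleType('lazy_demo')

def _lazy_getattr(name):
    """Resolve `name` from _import_structure on first access, then cache it."""
    for module_name, symbols in _import_structure.items():
        if name in symbols:
            value = getattr(importlib.import_module(module_name), name)
            setattr(lazy, name, value)  # cache so later lookups skip this hook
            return value
    raise AttributeError(name)

lazy.__getattr__ = _lazy_getattr  # PEP 562: module-level __getattr__
sys.modules['lazy_demo'] = lazy

import lazy_demo
print(lazy_demo.sqrt(9))  # → 3.0, resolved lazily from math
```

Deleting an entry from `_import_structure` (as this commit does for `fetch_teacher_logprobs`) makes the corresponding attribute raise `AttributeError` instead of importing the removed submodule.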
