Add conditional distillation support for GKD trainer #6542
Conversation
Summary of Changes

Hello @woshixiaobai2019, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the GKD (Generative Knowledge Distillation) trainer by introducing conditional distillation. This new capability allows for more flexible knowledge transfer by enabling the teacher and student models to process different input contexts or prompts while distilling knowledge. The feature is implemented through an extensible TeacherAdapter plugin system.
Code Review
This PR successfully adds conditional distillation to the GKD trainer, which is a very valuable new feature. The overall implementation is cleanly structured; in particular, the plugin-based TeacherAdapter design makes it easy to extend the teacher model's context-transformation logic, which is commendable. The core logic in gkd_trainer.py for handling different prompts and aligning logits when computing the loss is complex, but the implementation appears correct. The documentation updates are clear and help users understand and use the new feature. The PR also includes extensive global linting fixes for f-string quote style, which helps keep the code style consistent.
```python
if teacher_history and teacher_history[0]['role'] == 'system':
    teacher_history[0]['content'] += '\n\nYou are an expert with extensive knowledge.'
else:
    teacher_history.insert(0, {
        'role': 'system',
        'content': 'You are an expert with extensive knowledge.'
    })
```
The example code for MyTeacherAdapter in the documentation uses teacher_history[0]['content'] += ... to modify the system prompt. Since history.copy() performs a shallow copy, this approach will unintentionally modify the content of the original history object. While this might not cause issues in the current GKD workflow, it is a risky practice. It's recommended to update the example to a safer implementation by creating a new dictionary to replace teacher_history[0], similar to how MathTeacherAdapter is implemented in swift/plugin/teacher_adapter.py, to prevent potential side effects.
Suggested change:

```python
teacher_history = history.copy()
if teacher_history and teacher_history[0]['role'] == 'system':
    # More robust way: create a new dict to avoid side effects
    teacher_history[0] = {
        'role': 'system',
        'content': teacher_history[0]['content'] + '\n\nYou are an expert with extensive knowledge.'
    }
else:
    teacher_history.insert(0, {
        'role': 'system',
        'content': 'You are an expert with extensive knowledge.'
    })
```
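The shallow-copy hazard flagged in this review can be demonstrated in a few lines of plain Python (a standalone illustration, not code from the PR):

```python
# list.copy() duplicates the list itself, but the dicts inside it are shared.
history = [{'role': 'system', 'content': 'base prompt'}]
teacher_history = history.copy()

# In-place mutation of the copied list's first element...
teacher_history[0]['content'] += ' (teacher extras)'

# ...also changes the original history, because both lists hold the same dict.
print(history[0]['content'])  # prints "base prompt (teacher extras)"

# Replacing the element with a new dict leaves the original untouched.
history2 = [{'role': 'system', 'content': 'base prompt'}]
teacher_history2 = history2.copy()
teacher_history2[0] = {
    'role': 'system',
    'content': history2[0]['content'] + ' (teacher extras)',
}
print(history2[0]['content'])  # prints "base prompt"
```

This is exactly why the suggested change builds a new dict instead of using `+=` on the shared one.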
```python
if 'teacher_prompt_attention_mask' in inputs:
    teacher_prompts_len = inputs['teacher_prompt_attention_mask'].sum(dim=1)  # [batch_size]
else:
    teacher_prompts_len = torch.full((inputs['teacher_prompts'].shape[0],),
                                     inputs['teacher_prompts'].shape[1],
                                     device=inputs['teacher_prompts'].device)
```
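The bookkeeping above can be sketched in plain Python (lists instead of tensors; the token values are made up for illustration): summing each row of the attention mask recovers the unpadded prompt length, and slicing each sequence at its own prompt length leaves identical response spans to align between student and teacher.

```python
# Each row: 1 = real token, 0 = padding. Row sums give true prompt lengths.
teacher_mask = [
    [1, 1, 1, 1, 1],  # teacher prompt of length 5
    [1, 1, 1, 0, 0],  # teacher prompt of length 3, padded to 5
]
teacher_prompt_lens = [sum(row) for row in teacher_mask]  # [5, 3]

# Different prompts, shared response: slicing at each prompt's length
# yields the same response tokens for both models.
student_ids = [11, 12, 13, 7, 8, 9]          # student prompt (len 3) + response
teacher_ids = [11, 12, 13, 14, 15, 7, 8, 9]  # teacher prompt (len 5) + same response
response_student = student_ids[3:]
response_teacher = teacher_ids[teacher_prompt_lens[0]:]
print(response_student == response_teacher)  # prints True
```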
- Change shape_context() to accept the complete data dict instead of just messages
- Allow the adapter to access all fields (dataset, images, etc.) for flexible usage
- Follow GRPO's reward_model_plugin design pattern
PR type
PR information
Conditional Distillation

Conditional distillation allows the teacher and student models to be trained with different contexts or prompts, enabling more flexible knowledge-transfer strategies.
TeacherAdapter Plugin System

By implementing the TeacherAdapter interface, you can customize the context-transformation logic for the teacher model.

Built-in Adapters

SWIFT provides two built-in teacher adapters: default and example.

Usage
```shell
swift rlhf \
    --rlhf_type gkd \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --teacher_model Qwen/Qwen2.5-7B-Instruct \
    --teacher_adapter example \
    --dataset your_dataset.jsonl \
    ...
```

How It Works
In conditional distillation:

- the student model sees [prompt_student] + [response]
- the teacher model sees [prompt_teacher] + [response]

Here prompt_teacher is produced from prompt_student by teacher_adapter.shape_context(), while the response part remains unchanged. A reference training script can be found here.
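As a rough illustration of the flow described above, a custom adapter might look like the following. This is a hypothetical sketch: the class name, the absence of a base class, and the exact shape_context() signature are assumptions based on this PR's description (shape_context() receiving the complete data dict), not the final ms-swift API.

```python
class DomainExpertTeacherAdapter:
    """Hypothetical adapter that prepends a domain-expert system prompt
    for the teacher model while leaving the student's input untouched."""

    SYSTEM_SUFFIX = 'You are an expert with extensive knowledge.'

    def shape_context(self, data: dict) -> dict:
        # Per the PR, shape_context() receives the complete data dict
        # (messages, dataset, images, ...), not just the messages.
        data = dict(data)
        # Copy each message dict so in-place edits cannot leak back into
        # the original history (avoids the shallow-copy issue above).
        teacher_history = [dict(m) for m in data['messages']]
        if teacher_history and teacher_history[0]['role'] == 'system':
            teacher_history[0]['content'] += '\n\n' + self.SYSTEM_SUFFIX
        else:
            teacher_history.insert(0, {'role': 'system',
                                       'content': self.SYSTEM_SUFFIX})
        data['messages'] = teacher_history
        return data
```

Because every message dict is copied before mutation, the in-place `+=` here is safe, unlike the shallow `history.copy()` pattern flagged in the review.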
Experimental Results