Feature/ms swift custom#8222

Open
LEWISZZZcc wants to merge 6 commits into modelscope:main from LEWISZZZcc:feature/ms-swift-custom

Conversation

@LEWISZZZcc

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information that belongs to this PR.

Experiment results

Paste your experiment results here (if needed).

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the ms-swift framework's capabilities for advanced reinforcement learning from human feedback (RLHF) and agent development. It introduces new model output features, a sophisticated hierarchical loss function for better routing, and a comprehensive suite of agent templates and reward models. These additions enable more flexible and robust training of large language models for complex tasks involving tool use and structured responses, alongside improved infrastructure for remote inference and weight synchronization.

Highlights

  • Enhanced Embedding Output: Extended the model's embedding output to include pre-normalized and pre-projection hidden states, providing more granular access to intermediate representations for downstream tasks.
  • Hierarchical Route Classification Loss: Introduced a new RouteHybridInfonceLoss that combines InfoNCE with weighted hierarchical cross-entropy, supporting conditional softmax and path KL divergence for improved routing capabilities in embedding models.
  • Comprehensive Agent Template Support: Added a wide array of new agent templates, including Qwen (English/Chinese, parallel), React (English/Chinese), Hermes, GLM4, Llama3/4, and ToolBench, along with a fixed tool call accuracy reward plugin to handle diverse tool-calling formats.
  • New RLHF Training and Server Scripts: Provided multiple new training scripts for GRPO and PPO, demonstrating remote vLLM rollout, automatic value head configuration, and various reward model integrations, alongside corresponding server setup and monitoring scripts.
  • Extensible Reward Model Framework: Integrated a rich set of Outcome Reward Models (ORMs) and Process Reward Models (PRMs), including specialized models for React tasks, mathematical accuracy, format validation, cosine similarity, repetition penalty, and structured card output validation.
  • Structured Output Validation: Implemented a CardValidator class to rigorously check the format and content of structured card outputs generated by agent models, ensuring adherence to predefined XML-like schemas.
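The hybrid routing loss described above can be illustrated with a minimal sketch. This is not the PR's actual RouteHybridInfonceLoss; the function name, the flat (non-hierarchical) cross-entropy term, and the weighting are all simplifying assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def hybrid_infonce_route_loss(query, pos, route_logits, route_labels,
                              temperature=0.05, ce_weight=0.5):
    """Toy hybrid loss: InfoNCE over in-batch negatives plus a weighted
    cross-entropy over route-class logits (the hierarchical structure is
    collapsed to a flat classification here for brevity)."""
    # InfoNCE: row i of `pos` is the positive for row i of `query`;
    # every other row in the batch acts as a negative.
    sims = query @ pos.T / temperature              # (B, B) similarity matrix
    targets = torch.arange(query.size(0))
    infonce = F.cross_entropy(sims, targets)
    # Route-classification term over the auxiliary logits.
    route_ce = F.cross_entropy(route_logits, route_labels)
    return infonce + ce_weight * route_ce

q = F.normalize(torch.randn(4, 8), dim=-1)
p = F.normalize(torch.randn(4, 8), dim=-1)
logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
loss = hybrid_infonce_route_loss(q, p, logits, labels)
```

The real loss additionally supports conditional softmax over the hierarchy and a path KL divergence term, which this sketch omits.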

Changelog
  • examples/train/grpo/plugin/deepeyes/deepeyes_plugin.py
    • Added a TODO comment for future improvements.
  • swift/model/patcher.py
    • Extended the _output_embedding_hook to return pre_norm_last_hidden_state and pre_projection_last_hidden_state.
  • swift/trainers/seq2seq_trainer.py
    • Initialized compute_loss_func for route_hybrid_infonce loss type.
    • Modified compute_loss to extract and pass route_label_kwargs to the custom loss function.
  • workSpace/GRPO/AiShit/GRPO_4B_tool_call_fixed.sh
    • Added a new GRPO training script for Qwen models with external tool call reward functions.
  • workSpace/GRPO/AiShit/dataset_format_example.json
    • Added an example JSON dataset illustrating tool calling formats.
  • workSpace/GRPO/AiShit/plugin_fixed.py
    • Added a fixed tool call accuracy reward plugin (ToolCallAccReward) supporting Qwen, React, and JSON formats.
  • workSpace/GRPO/AiShit/start_remote_vllm_server_fixed.sh
    • Added a fixed script to start a remote vLLM server with agent template configuration.
  • workSpace/GRPO/AiShit/test_tool_call_parsing.py
    • Added a test script to verify agent template tool call parsing and the fixed plugin reward function.
  • workSpace/GRPO/GRPO_4B_format.sh
    • Added a GRPO training script with a focus on output format reward.
  • workSpace/GRPO/GRPO_4B_tool_call.sh
    • Added a GRPO training script specifically for tool call optimization.
  • workSpace/GRPO/debug_GRPO.sh
    • Added a debug script for GRPO training with simplified settings.
  • workSpace/GRPO/memo.md
    • Added notes and observations regarding WandB settings, completion formats, and multi-turn issues.
  • workSpace/GRPO/ms_grpo.sh
    • Added a GRPO training script demonstrating the use of external vLLM and reward models.
  • workSpace/GRPO/test_origin_weight.py
    • Added a script to test vLLM output with original model weights.
  • workSpace/GRPO/test_orm.py
    • Added a script to test Outcome Reward Models (ORM) with various completion formats and error cases.
  • workSpace/PPO/PPO_14B.sh
    • Added a PPO training script configured for remote vLLM rollout.
  • workSpace/PPO/PPO_14B_auto_value_head.sh
    • Added a PPO training script demonstrating automatic value head configuration for improved stability.
  • workSpace/PPO/PPO_Remote_Rollout_README.md
    • Added comprehensive documentation for deploying PPO with remote vLLM rollout, including weight synchronization mechanisms.
  • workSpace/PPO/debug_model_args.py
    • Added a script to debug RLHF argument parsing.
  • workSpace/PPO/expenation.md
    • Added detailed explanations for PPO implementation with multiple reward and value models.
  • workSpace/PPO/monitor_weight_sync.sh
    • Added a script to monitor PPO weight synchronization during training.
  • workSpace/PPO/safe_rlhf_ppo.sh
    • Added a safe RLHF PPO training script.
  • workSpace/PPO/setup.sh
    • Added a setup script for the PPO remote rollout environment.
  • workSpace/PPO/thoughts.md
    • Added detailed thoughts and theoretical considerations for implementing PPO with multiple reward and value models.
  • workSpace/dpo/full.sh
    • Added a DPO training script for full fine-tuning.
  • workSpace/dpo/lora.sh
    • Added a DPO training script utilizing LoRA for efficient fine-tuning.
  • workSpace/dpo/qwen3-4B-re-re.sh
    • Added a DPO training script for Qwen3-4B models.
  • workSpace/dpo/qwen3-4B-renforce.sh
    • Added another DPO training script for Qwen3-4B models.
  • workSpace/plugin/init.py
    • Updated plugin initialization to include new agent templates, ORMs, PRMs, and loss functions.
  • workSpace/plugin/agent_template/init.py
    • Updated agent template initialization to include new ReactGRPO, GLM4, Hermes, Llama, Qwen, React, and ToolBench templates.
  • workSpace/plugin/agent_template/base.py
    • Added a base class for agent templates, including React compatibility and tool parsing utilities.
  • workSpace/plugin/agent_template/extra.py
    • Added ReactGRPOAgentTemplate for specific GRPO React-style interactions.
  • workSpace/plugin/agent_template/glm4.py
    • Added GLM4AgentTemplate and GLM4_0414AgentTemplate for GLM-4 specific tool calling.
  • workSpace/plugin/agent_template/hermes.py
    • Added HermesAgentTemplate for Hermes-style tool calling.
  • workSpace/plugin/agent_template/llama.py
    • Added Llama3AgentTemplate and Llama4AgentTemplate for Llama-specific tool calling.
  • workSpace/plugin/agent_template/qwen.py
    • Added Qwen agent templates (QwenEnAgentTemplate, QwenZhAgentTemplate, QwenEnParallelAgentTemplate, QwenZhParallelAgentTemplate) for Qwen-specific tool calling.
  • workSpace/plugin/agent_template/react.py
    • Added React agent templates (ReactEnAgentTemplate, ReactZnAgentTemplate) for React-style tool calling.
  • workSpace/plugin/agent_template/toolbench.py
    • Added ToolBenchAgentTemplate for ToolBench-style tool calling.
  • workSpace/plugin/callback.py
    • Added EarlyStopCallback for early stopping training based on evaluation metrics.
  • workSpace/plugin/card_validate.py
    • Added CardValidator class for validating structured card outputs in agent responses.
  • workSpace/plugin/loss.py
    • Updated loss functions to include generative_reranker and listwise_reranker types.
  • workSpace/plugin/loss_dev.py
    • Added RouteHybridInfonceLoss and RouteClsHeads for hierarchical route classification in embedding models.
  • workSpace/plugin/loss_scale/init.py
    • Updated loss scale initialization to include new configurations.
  • workSpace/plugin/loss_scale/config/agentflan.json
    • Added agentflan loss scale configuration.
  • workSpace/plugin/loss_scale/config/alpha_umi.json
    • Added alpha_umi loss scale configuration.
  • workSpace/plugin/loss_scale/config/hermes.json
    • Added Hermes loss scale configuration.
  • workSpace/plugin/loss_scale/config/ignore_empty_think.json
    • Added ignore empty think loss scale configuration.
  • workSpace/plugin/loss_scale/config/qwen.json
    • Added Qwen loss scale configuration.
  • workSpace/plugin/loss_scale/config/react.json
    • Added React loss scale configuration.
  • workSpace/plugin/loss_scale/loss_scale.py
    • Updated loss scale logic to incorporate new agent-specific loss scales.
  • workSpace/plugin/loss_scale/utils.py
    • Added utility functions for calculating loss scale based on response patterns.
  • workSpace/plugin/metric.py
    • Added InferStats and MeanMetric classes for tracking inference statistics and mean values.
  • workSpace/plugin/multi_turn.py
    • Added MultiTurnScheduler base class and MathTipsScheduler for multi-turn interactions.
  • workSpace/plugin/optimizer.py
    • Added custom optimizers including galore, lorap, muon, and multimodal optimizers.
  • workSpace/plugin/orm.py
    • Added various Outcome Reward Models (ORMs) such as ReactORM, MathORM, Format, ReActFormat, CosineReward, RepetitionPenalty, SoftOverlong, AgentAccReward, and CombinedCosineReward.
  • workSpace/plugin/prm.py
    • Added Process Reward Models (PRMs) including QwenMaxPRM and ClientPRM.
  • workSpace/plugin/rm_plugin.py
    • Added DefaultRMPlugin and GenRMPlugin for reward model integration.
  • workSpace/plugin/route_cls_loss_dev_log.md
    • Added detailed documentation and implementation notes for route classification loss.
  • workSpace/plugin/route_cls_loss_plugin.py
    • Added a plugin to integrate RouteHybridInfonceLoss into the loss mapping.
  • workSpace/plugin/tuner.py
    • Added Tuner base class and PeftTuner with IA3 and DummyTuner for model tuning.
  • workSpace/plugin/xml_orm.py
    • Added XML-related ORM functions for extracting and evaluating XML-formatted responses.
  • workSpace/start_remote_vllm_server_single_round.sh
    • Added a script to start a remote vLLM server configured for single-round interactions.
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The code changes add a TODO comment in deepeyes_plugin.py, extend the output of _output_embedding_hook in patcher.py with pre_norm_last_hidden_state and pre_projection_last_hidden_state, and modify Seq2SeqTrainer.compute_loss to handle route_label_kwargs and initialize the loss function when loss_type is route_hybrid_infonce. Several new files were also added, including shell scripts for GRPO training and vLLM server setup, a JSON example of the dataset format, and Python scripts for tool-call parsing and testing. The review comments below flag hardcoded API keys and tokens (a security vulnerability), hardcoded absolute paths that hurt portability, a redundant assignment of the same value to two different keys, and a placeholder TODO that should be addressed or removed.

Note: Security Review did not run due to the size of the PR.

# Base configuration
model_name="/mnt/cfs/ssw/ljc/LLaMA-Factory/models/Qwen3-4B"
output_dir="./debug_output"
wandb_api_key="8b7eb3957d2cf7157ab46fcf3e5b602cf2e7b24e"

critical

Hardcoding the wandb_api_key is a security vulnerability. Please load it from an environment variable.
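One way to apply this fix (a sketch; WANDB_API_KEY is the conventional variable name wandb reads, but verify it against your deployment):

```shell
#!/bin/sh
# Read the secret from the environment instead of hardcoding it.
wandb_api_key="${WANDB_API_KEY:-}"
if [ -z "$wandb_api_key" ]; then
  echo "warning: WANDB_API_KEY is not set; wandb logging will be disabled" >&2
fi
```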

--offload_optimizer true \
--deepspeed zero2 \
--report_to swanlab \
--swanlab_token GFPjNmyR2K5Cog3C6N7uA \

critical

A hardcoded swanlab_token is present in this script. This is a critical security vulnerability and should be removed immediately. Use environment variables to handle secrets.

Comment on lines +10 to +11
wandb_api_key="8b7eb3957d2cf7157ab46fcf3e5b602cf2e7b24e"
swanlab_api_key="GFPjNmyR2K5Cog3C6N7uA"

critical

Hardcoding API keys (wandb_api_key, swanlab_api_key) in a script is a critical security vulnerability. These keys should be loaded from environment variables or a secure configuration file, not stored in version control.

Suggested change
wandb_api_key="8b7eb3957d2cf7157ab46fcf3e5b602cf2e7b24e"
swanlab_api_key="GFPjNmyR2K5Cog3C6N7uA"
wandb_api_key="${WANDB_API_KEY}"
swanlab_api_key="${SWANLAB_API_KEY}"

--deepspeed zero2 \
--report_to swanlab \
--swanlab_token GFPjNmyR2K5Cog3C6N7uA \
--swanlab_mode cloud \

critical

Hardcoding the swanlab_token is a critical security risk. This secret should be managed via environment variables or a secure vault, not committed to the repository.

--offload_optimizer true \
--deepspeed zero3 \
--report_to swanlab \
--swanlab_token GFPjNmyR2K5Cog3C6N7uA \

critical

The swanlab_token is hardcoded in this script. This is a critical security vulnerability. Please remove the token and load it from an environment variable.

Comment on lines +12 to +38
--reward_model /zhoupc/safe_alignment/models/safe_rlhf_v/rm_qwen2_5_vl
--cost_model /zhoupc/safe_alignment/models/safe_rlhf_v/cm_qwen2_5_vl #
--train_type full
--dataset /zhoupc/safe_alignment/datasets/converted_sample.jsonl
--torch_dtype bfloat16
--num_train_epochs 2
--per_device_train_batch_size 1
--per_device_eval_batch_size 1
--attn_impl flash_attn
--learning_rate 5e-7
--remove_unused_columns false #?
--warmup_ratio 0.03
--dataloader_num_workers 0
--deepspeed zero3_offload
--dataset_num_proc 8

--freeze_vit true

--gradient_accumulation_steps 4
--eval_steps 3000
--save_steps 10000
--save_total_limit 1
--logging_steps 5
--max_length 21000
## Saving settings
--save_only_model true
--output_dir /zhoupc/safe_alignment/checkpoints/safe_rlhf_v_ppo_qwen-7b

high

This script contains multiple hardcoded absolute paths (e.g., /zhoupc/safe_alignment/...). This makes the script non-portable and difficult for other users to run. Please replace these with variables or relative paths.
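A portable pattern for this (variable names and the SAFE_RLHF_ROOT override are illustrative, not from the PR):

```shell
#!/bin/sh
# Derive every location from one overridable root instead of
# hardcoding absolute /zhoupc/... paths.
ROOT_DIR="${SAFE_RLHF_ROOT:-$PWD}"
reward_model="$ROOT_DIR/models/safe_rlhf_v/rm_qwen2_5_vl"
dataset="$ROOT_DIR/datasets/converted_sample.jsonl"
output_dir="$ROOT_DIR/checkpoints/safe_rlhf_v_ppo_qwen-7b"
```

Other users can then run the script unchanged by exporting SAFE_RLHF_ROOT, or rely on the current-directory default.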


# hard settings
nproc_per_node=8 # number of GPUs to use; adjust to your hardware
# model_name="/mnt/cfs/ssw/ljc/LLaMA-Factory/saves/qwen3-4b/full/long1.0+plannner+format1.0" # model name

high

The model_name variable is commented out, but it is used later in the swift rlhf command on line 57. This will cause the script to fail. Please uncomment this line and provide a valid model path.
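A fail-fast guard makes this class of mistake surface immediately rather than deep inside the training command (the default path below is illustrative; taken from another script in this PR):

```shell
#!/bin/sh
# Fail fast if model_name was left commented out or unset, instead of
# letting the later swift rlhf call fail with a confusing error.
model_name="${model_name:-/mnt/cfs/ssw/ljc/LLaMA-Factory/models/Qwen3-4B}"
: "${model_name:?model_name must point to a valid model path}"
```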


import sys
import os
sys.path.insert(0, '/mnt/cfs/ssw/ljc/ms-swift')

high

Using sys.path.insert with a hardcoded absolute path makes this test script non-portable and dependent on a specific user's directory structure. It's better to use relative imports or configure the PYTHONPATH environment variable outside the script.
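A more portable alternative resolves the repository root relative to the test file itself (a sketch; it assumes the test script lives one level below the checkout root):

```python
import os
import sys

# Resolve the repository root from this file's location rather than a
# hardcoded absolute path, so the script runs in any checkout.
_here = globals().get("__file__", os.getcwd())
REPO_ROOT = os.path.dirname(os.path.abspath(_here))
if REPO_ROOT not in sys.path:
    sys.path.insert(0, REPO_ROOT)
```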

if count_tool_1 != count_tool_2:
is_format_error = True

# TODO: ?

medium

This TODO comment seems to be a placeholder. It should be either addressed with a proper implementation or removed if it's no longer relevant.

Comment on lines +98 to +99
'pre_norm_last_hidden_state': pre_norm_embeddings.contiguous(),
'pre_projection_last_hidden_state': pre_norm_embeddings.contiguous(),

medium

The keys 'pre_norm_last_hidden_state' and 'pre_projection_last_hidden_state' are both assigned the same value (pre_norm_embeddings.contiguous()). This seems redundant. If they are intended for different purposes, adding a comment to clarify would be helpful. Otherwise, consider removing one of the keys to avoid confusion.
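If the two keys really are meant to expose different hidden states, the fix would look roughly like the following. Note that `pre_projection_embeddings` is a hypothetical name: the actual hook would need to capture the hidden state before the projection layer runs, which this sketch only simulates with random tensors:

```python
import torch

# Illustrative fix for the duplicated value: give each output key its
# own tensor instead of assigning pre_norm_embeddings to both.
pre_norm_embeddings = torch.randn(2, 4, 8)
pre_projection_embeddings = torch.randn(2, 4, 8)  # hypothetical capture

outputs = {
    'pre_norm_last_hidden_state': pre_norm_embeddings.contiguous(),
    'pre_projection_last_hidden_state': pre_projection_embeddings.contiguous(),
}
```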
