Feature/ms swift custom#8222

Open
LEWISZZZcc wants to merge 6 commits into modelscope:main from LEWISZZZcc:feature/ms-swift-custom

Conversation

@LEWISZZZcc

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information that belongs to this PR.

Experiment results

Paste your experiment results here (if needed).

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the ms-swift framework's capabilities for advanced reinforcement learning from human feedback (RLHF) and agent development. It introduces new model output features, a sophisticated hierarchical loss function for better routing, and a comprehensive suite of agent templates and reward models. These additions enable more flexible and robust training of large language models for complex tasks involving tool use and structured responses, alongside improved infrastructure for remote inference and weight synchronization.

Highlights

  • Enhanced Embedding Output: Extended the model's embedding output to include pre-normalized and pre-projection hidden states, providing more granular access to intermediate representations for downstream tasks.
  • Hierarchical Route Classification Loss: Introduced a new RouteHybridInfonceLoss that combines InfoNCE with weighted hierarchical cross-entropy, supporting conditional softmax and path KL divergence for improved routing capabilities in embedding models.
  • Comprehensive Agent Template Support: Added a wide array of new agent templates, including Qwen (English/Chinese, parallel), React (English/Chinese), Hermes, GLM4, Llama3/4, and ToolBench, along with a fixed tool call accuracy reward plugin to handle diverse tool-calling formats.
  • New RLHF Training and Server Scripts: Provided multiple new training scripts for GRPO and PPO, demonstrating remote vLLM rollout, automatic value head configuration, and various reward model integrations, alongside corresponding server setup and monitoring scripts.
  • Extensible Reward Model Framework: Integrated a rich set of Outcome Reward Models (ORMs) and Process Reward Models (PRMs), including specialized models for React tasks, mathematical accuracy, format validation, cosine similarity, repetition penalty, and structured card output validation.
  • Structured Output Validation: Implemented a CardValidator class to rigorously check the format and content of structured card outputs generated by agent models, ensuring adherence to predefined XML-like schemas.
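The hybrid routing loss described above can be illustrated with a minimal sketch. This is not the PR's actual RouteHybridInfonceLoss; the function name, the flat (non-hierarchical) cross-entropy term, and the weighting are all simplifying assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def hybrid_infonce_route_loss(query, pos, route_logits, route_labels,
                              temperature=0.05, ce_weight=0.5):
    """Toy hybrid loss: InfoNCE over in-batch negatives plus a weighted
    cross-entropy over route-class logits (the hierarchical structure is
    collapsed to a flat classification here for brevity)."""
    # InfoNCE: row i of `pos` is the positive for row i of `query`;
    # every other row in the batch acts as a negative.
    sims = query @ pos.T / temperature              # (B, B) similarity matrix
    targets = torch.arange(query.size(0))
    infonce = F.cross_entropy(sims, targets)
    # Route-classification term over the auxiliary logits.
    route_ce = F.cross_entropy(route_logits, route_labels)
    return infonce + ce_weight * route_ce

q = F.normalize(torch.randn(4, 8), dim=-1)
p = F.normalize(torch.randn(4, 8), dim=-1)
logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
loss = hybrid_infonce_route_loss(q, p, logits, labels)
```

The real loss additionally supports conditional softmax over the hierarchy and a path KL divergence term, which this sketch omits.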

Changelog
  • examples/train/grpo/plugin/deepeyes/deepeyes_plugin.py
    • Added a TODO comment for future improvements.
  • swift/model/patcher.py
    • Extended the _output_embedding_hook to return pre_norm_last_hidden_state and pre_projection_last_hidden_state.
  • swift/trainers/seq2seq_trainer.py
    • Initialized compute_loss_func for route_hybrid_infonce loss type.
    • Modified compute_loss to extract and pass route_label_kwargs to the custom loss function.
  • workSpace/GRPO/AiShit/GRPO_4B_tool_call_fixed.sh
    • Added a new GRPO training script for Qwen models with external tool call reward functions.
  • workSpace/GRPO/AiShit/dataset_format_example.json
    • Added an example JSON dataset illustrating tool calling formats.
  • workSpace/GRPO/AiShit/plugin_fixed.py
    • Added a fixed tool call accuracy reward plugin (ToolCallAccReward) supporting Qwen, React, and JSON formats.
  • workSpace/GRPO/AiShit/start_remote_vllm_server_fixed.sh
    • Added a fixed script to start a remote vLLM server with agent template configuration.
  • workSpace/GRPO/AiShit/test_tool_call_parsing.py
    • Added a test script to verify agent template tool call parsing and the fixed plugin reward function.
  • workSpace/GRPO/GRPO_4B_format.sh
    • Added a GRPO training script with a focus on output format reward.
  • workSpace/GRPO/GRPO_4B_tool_call.sh
    • Added a GRPO training script specifically for tool call optimization.
  • workSpace/GRPO/debug_GRPO.sh
    • Added a debug script for GRPO training with simplified settings.
  • workSpace/GRPO/memo.md
    • Added notes and observations regarding WandB settings, completion formats, and multi-turn issues.
  • workSpace/GRPO/ms_grpo.sh
    • Added a GRPO training script demonstrating the use of external vLLM and reward models.
  • workSpace/GRPO/test_origin_weight.py
    • Added a script to test vLLM output with original model weights.
  • workSpace/GRPO/test_orm.py
    • Added a script to test Outcome Reward Models (ORM) with various completion formats and error cases.
  • workSpace/PPO/PPO_14B.sh
    • Added a PPO training script configured for remote vLLM rollout.
  • workSpace/PPO/PPO_14B_auto_value_head.sh
    • Added a PPO training script demonstrating automatic value head configuration for improved stability.
  • workSpace/PPO/PPO_Remote_Rollout_README.md
    • Added comprehensive documentation for deploying PPO with remote vLLM rollout, including weight synchronization mechanisms.
  • workSpace/PPO/debug_model_args.py
    • Added a script to debug RLHF argument parsing.
  • workSpace/PPO/expenation.md
    • Added detailed explanations for PPO implementation with multiple reward and value models.
  • workSpace/PPO/monitor_weight_sync.sh
    • Added a script to monitor PPO weight synchronization during training.
  • workSpace/PPO/safe_rlhf_ppo.sh
    • Added a safe RLHF PPO training script.
  • workSpace/PPO/setup.sh
    • Added a setup script for the PPO remote rollout environment.
  • workSpace/PPO/thoughts.md
    • Added detailed thoughts and theoretical considerations for implementing PPO with multiple reward and value models.
  • workSpace/dpo/full.sh
    • Added a DPO training script for full fine-tuning.
  • workSpace/dpo/lora.sh
    • Added a DPO training script utilizing LoRA for efficient fine-tuning.
  • workSpace/dpo/qwen3-4B-re-re.sh
    • Added a DPO training script for Qwen3-4B models.
  • workSpace/dpo/qwen3-4B-renforce.sh
    • Added another DPO training script for Qwen3-4B models.
  • workSpace/plugin/init.py
    • Updated plugin initialization to include new agent templates, ORMs, PRMs, and loss functions.
  • workSpace/plugin/agent_template/init.py
    • Updated agent template initialization to include new ReactGRPO, GLM4, Hermes, Llama, Qwen, React, and ToolBench templates.
  • workSpace/plugin/agent_template/base.py
    • Added a base class for agent templates, including React compatibility and tool parsing utilities.
  • workSpace/plugin/agent_template/extra.py
    • Added ReactGRPOAgentTemplate for specific GRPO React-style interactions.
  • workSpace/plugin/agent_template/glm4.py
    • Added GLM4AgentTemplate and GLM4_0414AgentTemplate for GLM-4 specific tool calling.
  • workSpace/plugin/agent_template/hermes.py
    • Added HermesAgentTemplate for Hermes-style tool calling.
  • workSpace/plugin/agent_template/llama.py
    • Added Llama3AgentTemplate and Llama4AgentTemplate for Llama-specific tool calling.
  • workSpace/plugin/agent_template/qwen.py
    • Added Qwen agent templates (QwenEnAgentTemplate, QwenZhAgentTemplate, QwenEnParallelAgentTemplate, QwenZhParallelAgentTemplate) for Qwen-specific tool calling.
  • workSpace/plugin/agent_template/react.py
    • Added React agent templates (ReactEnAgentTemplate, ReactZnAgentTemplate) for React-style tool calling.
  • workSpace/plugin/agent_template/toolbench.py
    • Added ToolBenchAgentTemplate for ToolBench-style tool calling.
  • workSpace/plugin/callback.py
    • Added EarlyStopCallback for early stopping training based on evaluation metrics.
  • workSpace/plugin/card_validate.py
    • Added CardValidator class for validating structured card outputs in agent responses.
  • workSpace/plugin/loss.py
    • Updated loss functions to include generative_reranker and listwise_reranker types.
  • workSpace/plugin/loss_dev.py
    • Added RouteHybridInfonceLoss and RouteClsHeads for hierarchical route classification in embedding models.
  • workSpace/plugin/loss_scale/init.py
    • Updated loss scale initialization to include new configurations.
  • workSpace/plugin/loss_scale/config/agentflan.json
    • Added agentflan loss scale configuration.
  • workSpace/plugin/loss_scale/config/alpha_umi.json
    • Added alpha_umi loss scale configuration.
  • workSpace/plugin/loss_scale/config/hermes.json
    • Added Hermes loss scale configuration.
  • workSpace/plugin/loss_scale/config/ignore_empty_think.json
    • Added ignore empty think loss scale configuration.
  • workSpace/plugin/loss_scale/config/qwen.json
    • Added Qwen loss scale configuration.
  • workSpace/plugin/loss_scale/config/react.json
    • Added React loss scale configuration.
  • workSpace/plugin/loss_scale/loss_scale.py
    • Updated loss scale logic to incorporate new agent-specific loss scales.
  • workSpace/plugin/loss_scale/utils.py
    • Added utility functions for calculating loss scale based on response patterns.
  • workSpace/plugin/metric.py
    • Added InferStats and MeanMetric classes for tracking inference statistics and mean values.
  • workSpace/plugin/multi_turn.py
    • Added MultiTurnScheduler base class and MathTipsScheduler for multi-turn interactions.
  • workSpace/plugin/optimizer.py
    • Added custom optimizers including galore, lorap, muon, and multimodal optimizers.
  • workSpace/plugin/orm.py
    • Added various Outcome Reward Models (ORMs) such as ReactORM, MathORM, Format, ReActFormat, CosineReward, RepetitionPenalty, SoftOverlong, AgentAccReward, and CombinedCosineReward.
  • workSpace/plugin/prm.py
    • Added Process Reward Models (PRMs) including QwenMaxPRM and ClientPRM.
  • workSpace/plugin/rm_plugin.py
    • Added DefaultRMPlugin and GenRMPlugin for reward model integration.
  • workSpace/plugin/route_cls_loss_dev_log.md
    • Added detailed documentation and implementation notes for route classification loss.
  • workSpace/plugin/route_cls_loss_plugin.py
    • Added a plugin to integrate RouteHybridInfonceLoss into the loss mapping.
  • workSpace/plugin/tuner.py
    • Added Tuner base class and PeftTuner with IA3 and DummyTuner for model tuning.
  • workSpace/plugin/xml_orm.py
    • Added XML-related ORM functions for extracting and evaluating XML-formatted responses.
  • workSpace/start_remote_vllm_server_single_round.sh
    • Added a script to start a remote vLLM server configured for single-round interactions.
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The code changes add a TODO comment in deepeyes_plugin.py, extend the output of _output_embedding_hook in patcher.py with pre_norm_last_hidden_state and pre_projection_last_hidden_state, and modify Seq2SeqTrainer.compute_loss to handle route_label_kwargs and initialize the loss function when loss_type is route_hybrid_infonce. Several new files were also added, including shell scripts for GRPO training and vLLM server setup, a JSON example of the dataset format, and Python scripts for tool-call parsing and testing. The review comments below flag hardcoded API keys and tokens (a security vulnerability), hardcoded absolute paths that hurt portability, a redundant assignment of the same value to two different keys, and a placeholder TODO that should be addressed or removed.

Note: Security Review did not run due to the size of the PR.

# Base configuration
model_name="/mnt/cfs/ssw/ljc/LLaMA-Factory/models/Qwen3-4B"
output_dir="./debug_output"
wandb_api_key="8b7eb3957d2cf7157ab46fcf3e5b602cf2e7b24e"

critical

Hardcoding the wandb_api_key is a security vulnerability. Please load it from an environment variable.
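One way to apply this fix (a sketch; WANDB_API_KEY is the conventional variable name wandb reads, but verify it against your deployment):

```shell
#!/bin/sh
# Read the secret from the environment instead of hardcoding it.
wandb_api_key="${WANDB_API_KEY:-}"
if [ -z "$wandb_api_key" ]; then
  echo "warning: WANDB_API_KEY is not set; wandb logging will be disabled" >&2
fi
```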

--offload_optimizer true \
--deepspeed zero2 \
--report_to swanlab \
--swanlab_token GFPjNmyR2K5Cog3C6N7uA \

critical

A hardcoded swanlab_token is present in this script. This is a critical security vulnerability and should be removed immediately. Use environment variables to handle secrets.

Comment on lines +10 to +11
wandb_api_key="8b7eb3957d2cf7157ab46fcf3e5b602cf2e7b24e"
swanlab_api_key="GFPjNmyR2K5Cog3C6N7uA"

critical

Hardcoding API keys (wandb_api_key, swanlab_api_key) in a script is a critical security vulnerability. These keys should be loaded from environment variables or a secure configuration file, not stored in version control.

Suggested change
wandb_api_key="8b7eb3957d2cf7157ab46fcf3e5b602cf2e7b24e"
swanlab_api_key="GFPjNmyR2K5Cog3C6N7uA"
wandb_api_key="${WANDB_API_KEY}"
swanlab_api_key="${SWANLAB_API_KEY}"

--deepspeed zero2 \
--report_to swanlab \
--swanlab_token GFPjNmyR2K5Cog3C6N7uA \
--swanlab_mode cloud \

critical

Hardcoding the swanlab_token is a critical security risk. This secret should be managed via environment variables or a secure vault, not committed to the repository.

--offload_optimizer true \
--deepspeed zero3 \
--report_to swanlab \
--swanlab_token GFPjNmyR2K5Cog3C6N7uA \

critical

The swanlab_token is hardcoded in this script. This is a critical security vulnerability. Please remove the token and load it from an environment variable.

Comment on lines +12 to +38
--reward_model /zhoupc/safe_alignment/models/safe_rlhf_v/rm_qwen2_5_vl
--cost_model /zhoupc/safe_alignment/models/safe_rlhf_v/cm_qwen2_5_vl #
--train_type full
--dataset /zhoupc/safe_alignment/datasets/converted_sample.jsonl
--torch_dtype bfloat16
--num_train_epochs 2
--per_device_train_batch_size 1
--per_device_eval_batch_size 1
--attn_impl flash_attn
--learning_rate 5e-7
--remove_unused_columns false #?
--warmup_ratio 0.03
--dataloader_num_workers 0
--deepspeed zero3_offload
--dataset_num_proc 8

--freeze_vit true

--gradient_accumulation_steps 4
--eval_steps 3000
--save_steps 10000
--save_total_limit 1
--logging_steps 5
--max_length 21000
## Saving settings
--save_only_model true
--output_dir /zhoupc/safe_alignment/checkpoints/safe_rlhf_v_ppo_qwen-7b

high

This script contains multiple hardcoded absolute paths (e.g., /zhoupc/safe_alignment/...). This makes the script non-portable and difficult for other users to run. Please replace these with variables or relative paths.
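A portable pattern for this (variable names and the SAFE_RLHF_ROOT override are illustrative, not from the PR):

```shell
#!/bin/sh
# Derive every location from one overridable root instead of
# hardcoding absolute /zhoupc/... paths.
ROOT_DIR="${SAFE_RLHF_ROOT:-$PWD}"
reward_model="$ROOT_DIR/models/safe_rlhf_v/rm_qwen2_5_vl"
dataset="$ROOT_DIR/datasets/converted_sample.jsonl"
output_dir="$ROOT_DIR/checkpoints/safe_rlhf_v_ppo_qwen-7b"
```

Other users can then run the script unchanged by exporting SAFE_RLHF_ROOT, or rely on the current-directory default.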


# hard settings
nproc_per_node=8 # number of GPUs to use; adjust to your hardware
# model_name="/mnt/cfs/ssw/ljc/LLaMA-Factory/saves/qwen3-4b/full/long1.0+plannner+format1.0" # model name

high

The model_name variable is commented out, but it is used later in the swift rlhf command on line 57. This will cause the script to fail. Please uncomment this line and provide a valid model path.
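A fail-fast guard makes this class of mistake surface immediately rather than deep inside the training command (the default path below is illustrative; taken from another script in this PR):

```shell
#!/bin/sh
# Fail fast if model_name was left commented out or unset, instead of
# letting the later swift rlhf call fail with a confusing error.
model_name="${model_name:-/mnt/cfs/ssw/ljc/LLaMA-Factory/models/Qwen3-4B}"
: "${model_name:?model_name must point to a valid model path}"
```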


import sys
import os
sys.path.insert(0, '/mnt/cfs/ssw/ljc/ms-swift')

high

Using sys.path.insert with a hardcoded absolute path makes this test script non-portable and dependent on a specific user's directory structure. It's better to use relative imports or configure the PYTHONPATH environment variable outside the script.
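A more portable alternative resolves the repository root relative to the test file itself (a sketch; it assumes the test script lives one level below the checkout root):

```python
import os
import sys

# Resolve the repository root from this file's location rather than a
# hardcoded absolute path, so the script runs in any checkout.
_here = globals().get("__file__", os.getcwd())
REPO_ROOT = os.path.dirname(os.path.abspath(_here))
if REPO_ROOT not in sys.path:
    sys.path.insert(0, REPO_ROOT)
```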

if count_tool_1 != count_tool_2:
is_format_error = True

# TODO: ?

medium

This TODO comment seems to be a placeholder. It should be either addressed with a proper implementation or removed if it's no longer relevant.

Comment on lines +98 to +99
'pre_norm_last_hidden_state': pre_norm_embeddings.contiguous(),
'pre_projection_last_hidden_state': pre_norm_embeddings.contiguous(),

medium

The keys 'pre_norm_last_hidden_state' and 'pre_projection_last_hidden_state' are both assigned the same value (pre_norm_embeddings.contiguous()). This seems redundant. If they are intended for different purposes, adding a comment to clarify would be helpful. Otherwise, consider removing one of the keys to avoid confusion.
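If the two keys really are meant to expose different hidden states, the fix would look roughly like the following. Note that `pre_projection_embeddings` is a hypothetical name: the actual hook would need to capture the hidden state before the projection layer runs, which this sketch only simulates with random tensors:

```python
import torch

# Illustrative fix for the duplicated value: give each output key its
# own tensor instead of assigning pre_norm_embeddings to both.
pre_norm_embeddings = torch.randn(2, 4, 8)
pre_projection_embeddings = torch.randn(2, 4, 8)  # hypothetical capture

outputs = {
    'pre_norm_last_hidden_state': pre_norm_embeddings.contiguous(),
    'pre_projection_last_hidden_state': pre_projection_embeddings.contiguous(),
}
```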
