Commit edd34aa

[RL] RL support varlen flash mask (#10490)
* update rl
* remove the optimizer timer
* rename rollout_continue_batching_batch_size to rollout_max_num_seqs and quant_type to rollout_quant_type
* update global_mini_batch_size
* fix entropy_coeff dtype error
* add einops requirements
* update documentation
* use remove padding so we do not need to pad_to_multiple_of the tp degree
* update rollout_max_num_seqs
* squeeze unsqueeze
* shuffle
* fix training bug
* fix missing pad
* add dataloader_shuffle args
1 parent d6182ff commit edd34aa

19 files changed: +630 -340 lines

llm/alignment/ppo/README.md (+21 -24)
@@ -36,26 +36,21 @@ python setup_cuda.py install
 
 ### Field descriptions
 
-- src (list(str)): the user's conversation turns; may contain markup such as [<search-res>]
-- tgt (list(str)): the system's replies for every turn except the last, arranged by conversation turn; may contain markup such as [<search>]; note: len(tgt)==len(src)-1
+- src (list(str)): the prompt input after processing with the chat_template
+- tgt (list(str)): the label content
 
 ### Data example
 
 ```json
 {
-    "src": [
-        "需要你帮我写几个有创意的广告语来打开市场。",
-        "目标用户是年轻人,追求时尚、个性和自我。"
-    ],
-    "tgt": [
-        "当然!我很乐意帮助你创作几个有创意的广告语来推广你的新洗发露。请告诉我一些关于你的产品的特点,目标受众以及你希望传达的核心信息,我会根据这些信息为你提供几个创意的广告语。"
-    ]
+    "src": ["<|im_start|>system\nYou are a helpful assistant. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>. Now the user asks you to solve a logical reasoning problem. After thinking, when you finally reach a conclusion, clearly state the identity of each character within <answer> </answer> tags. i.e., <answer> (1) Zoey is a knight\n(2) ... </answer>.\n<|im_end|>\n<|im_start|>user\nA very special island is inhabited only by knights and knaves. Knights always tell the truth, and knaves always lie. You meet 3 inhabitants: Michael, Zoey, and Ethan. Michael was heard saying, \"Ethan is a knight if and only if Michael is a knight\". \"Zoey is a knight or Ethan is a knight,\" Zoey mentioned. Ethan asserted: \"Michael is a knave if and only if Zoey is a knave\". So who is a knight and who is a knave?\n<|im_end|>\n<|im_start|>assistant\n<think>"],
+    "tgt": ["(1) Michael is a knight\n(2) Zoey is a knight\n(3) Ethan is a knight"]
 }
 ```
 
 
 ### PPO & GRPO data preparation
-
+We provide a version of the [KK dataset](https://hf-mirror.com/datasets/K-and-K/knights-and-knaves) preprocessed with the `chat template` of `Qwen/Qwen2.5-7B-Instruct-1M`.
 ```
 wget https://paddlenlp.bj.bcebos.com/datasets/examples/ppo-kk.tgz && tar zxf ppo-kk.tgz
 ```
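Note on the data layout: the `src`/`tgt` fields above are what the GRPO config later references via `prompt_key: "src"` and `response_key: "tgt"`. Below is a minimal, hedged sketch for sanity-checking records against that schema; the `ppo-kk/train.jsonl` path and the one-JSON-object-per-line layout are illustrative assumptions, since the commit does not show how the downloaded archive is organized.

```python
import json

# Hypothetical path: the exact layout inside ppo-kk.tgz is not shown in this commit.
DATA_PATH = "ppo-kk/train.jsonl"

def check_record(record: dict) -> None:
    """Verify one record matches the src/tgt schema described above."""
    assert isinstance(record["src"], list) and all(isinstance(s, str) for s in record["src"])
    assert isinstance(record["tgt"], list) and all(isinstance(t, str) for t in record["tgt"])

with open(DATA_PATH, "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            check_record(json.loads(line))
```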
@@ -66,9 +61,6 @@ wget https://paddlenlp.bj.bcebos.com/datasets/examples/ppo-kk.tgz && tar zxf ppo
 
 The configuration files we use are placed in `llm/config/llama/ppo_argument.json` and `llm/config/llama/grpo_argument.json`, and detailed parameter descriptions are provided below:
 
-- `train_task_config`: training data config; see `config/task_ppo.json` for an example
-- `eval_task_config`: evaluation data config; see `config/task_ppo.json` for an example
-- `ptx_task_config`: auxiliary SFT data; see `config/task_sft.json` for an example; defaults to ""
 - `actor_model_name_or_path`: local model path for the actor model and reference model in PPO
 - `reward_model_name_or_path`: local model path for the reward model and critic model in PPO
 - `use_fusemt`: whether to accelerate generation with FuseMT; defaults to True
@@ -95,8 +87,7 @@ wget https://paddlenlp.bj.bcebos.com/datasets/examples/ppo-kk.tgz && tar zxf ppo
 - `critic_weight_decay`: weight decay applied to all layers of the critic model except all bias and LayerNorm weights. (`float`, optional, defaults to 0.0)
 - `max_prompt_len`: maximum generation length when generating samples; increasing max_length increases generation time and memory usage. Note:
 max_dec_len + max_prompt_len should be smaller than max_seq_len.
-- `per_device_prompt_batch_size`: batch size for PPO sample generation, the same as the micro batch size, i.e. global_batch_size = dp (data parallel) * sharding * micro batch size. Increasing batch_size increases generation time and memory usage
-- `per_device_train_batch_size`: training batch size; currently set to 1 for performance reasons, please avoid changing it
+- `per_device_train_batch_size`: training batch size
 - `per_device_eval_batch_size`: evaluation batch size.
 - `max_steps`: total number of training steps
 - `eval_steps`: number of steps between model evaluations
@@ -109,13 +100,8 @@ max_dec_len + max_prompt_len should be smaller than max_seq_len.
 - `fp16`: use float16 precision for model training and inference.
 - `bf16`: use bfloat16 precision for model training and inference.
 - `fp16_opt_level`: float16 training mode; `O2` means pure float16 training
-
-
-<!-- ### PPO training command
-
-```shell
-python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_ppo.py llm/config/llama/ppo_argument.json
-``` -->
+- `balance_batch`: whether to balance the number of tokens within a batch across data-parallel devices. If set to True, the system tries to even out the token distribution across the parallel devices; if set to False (default), no such balancing is performed.
+- `use_remove_padding`: whether to strip the padding from the input during training. Enabling it (set to True) raises the proportion of valid tokens processed per step and thus improves training efficiency; if set to False (default), the padding in the input is kept.
 
 ### GRPO training command
 ```shell
@@ -130,8 +116,19 @@ python reward_server.py
 ```shell
 export PYTHONPATH=your_PaddleNLP_path/:$PYTHONPATH
 export PYTHONPATH=your_PaddleNLP_path/llm:$PYTHONPATH
-python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_ppo.py ../../config/qwen/grpo_argument.json
-# python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_ppo.py ../../config/llama/grpo_argument.json
+
+export FLAGS_set_to_1d=False
+export NVIDIA_TF32_OVERRIDE=0
+export FLAGS_dataloader_use_file_descriptor=False
+export HF_DATASETS_DOWNLOAD_TIMEOUT=1
+export FLAGS_gemm_use_half_precision_compute_type=False
+export FLAGS_force_cublaslt_no_reduced_precision_reduction=True
+
+export FLAGS_mla_use_tensorcore=0
+export FLAGS_cascade_attention_max_partition_size=2048
+
+python -u -m paddle.distributed.launch --devices "0,1,2,3" run_ppo.py ../../config/qwen/grpo_argument.yaml
+# python -u -m paddle.distributed.launch --devices "0,1,2,3" run_ppo.py ../../config/llama/grpo_argument.yaml
 ```
 
 ### Online monitoring
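The `use_remove_padding` option documented above is what the commit title refers to: with padding removed, attention runs in variable-length ("varlen") mode, where the non-pad tokens of a batch are packed into one flat sequence and a `cu_seqlens` offset array marks each sample's boundaries instead of a padding mask. The sketch below illustrates only the packing idea in NumPy; it is not PaddleNLP's implementation, and the helper name is made up.

```python
import numpy as np

def remove_padding(input_ids: np.ndarray, attention_mask: np.ndarray):
    """Pack the non-pad tokens of a [batch, seq_len] batch into one flat sequence.

    Returns the packed token ids plus cu_seqlens, the cumulative sequence-length
    offsets that a varlen flash-attention kernel consumes instead of a pad mask.
    """
    seq_lens = attention_mask.sum(axis=-1)                   # valid tokens per sample
    packed = input_ids[attention_mask.astype(bool)]          # flat [total_tokens] array
    cu_seqlens = np.concatenate([[0], np.cumsum(seq_lens)])  # sample boundaries
    return packed, cu_seqlens.astype(np.int32)

# Toy batch: three prompts of lengths 3, 2 and 4, right-padded with pad id 0.
ids = np.array([[11, 12, 13, 0], [21, 22, 0, 0], [31, 32, 33, 34]])
mask = (ids != 0).astype(np.int64)
packed, cu_seqlens = remove_padding(ids, mask)
print(packed)      # [11 12 13 21 22 31 32 33 34]
print(cu_seqlens)  # [0 3 5 9]
```

Each sample then attends only within its `[cu_seqlens[i], cu_seqlens[i+1])` span, so no compute is spent on pad tokens and, as the commit message notes, inputs no longer need to be padded to a multiple of the tensor-parallel degree. The next hunk in the commit adds a new file, the reward HTTP server referenced by `python reward_server.py` above.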
@@ -0,0 +1,112 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Launch Reward HTTP Server."""
+
+import argparse
+import json
+import logging
+import threading
+import traceback
+from typing import List
+
+import uvicorn
+from fastapi import FastAPI
+from pydantic import BaseModel
+
+
+class Request(BaseModel):
+    """The request for RM server."""
+
+    src: List[str]
+    tgt: List[str]
+    response: List[str]
+
+
+class Response(BaseModel):
+    """The response for RM server."""
+
+    error_code: int = 0
+    error_msg: str = "Success"
+    score: List[float] = None
+
+
+def compute_score(
+    solution_str: str, ground_truth: str, query=None, format_reward: int = 1, answer_reward: float = 1.0
+):
+    score = float(1.0)
+    print(
+        f"==============================================================={ground_truth}=========================================================================="
+    )
+    print(f"score {score}, solution_str\n", solution_str)
+    print(
+        "================================================================================================================================================="
+    )
+    return score
+
+
+def setup_args():
+    """Setup inference server arguments."""
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--port", type=int, default=8731)
+    parser.add_argument("--log_file", type=str, default="rm_server.log")
+    args = parser.parse_args()
+    return args
+
+
+def server(args):
+    """Launch RM server."""
+    app = FastAPI()
+    lock = threading.Lock()
+
+    logging.basicConfig(
+        level=logging.INFO,
+        filename=args.log_file,
+        filemode="w",
+        format="%(asctime)s - %(message)s",
+    )
+
+    @app.post("/")
+    async def _server(request: Request) -> Response:
+        lock.acquire()
+        logging.info(f"Request: {request}")
+        try:
+            all_result = []
+            if len(request.tgt) != len(request.response) or len(request.tgt) != len(request.src):
+                raise ValueError("The length of response, tgt, and src should be equal.")
+            for i in range(len(request.response)):
+                reward = compute_score(request.response[i], request.tgt[i], request.src[i])
+                all_result.append(reward)
+            output = {
+                "error_code": 0,
+                "error_msg": "Success",
+                "score": all_result,
+            }
+        except Exception as err:
+            logging.error(f"Server error: when process {request}\n{traceback.format_stack()}")
+            output = {
+                "error_code": 500,
+                "error_msg": f"{err}",
+                "score": [0] * len(request.tgt),
+            }
+        logging.info(f"Response: {json.dumps(output, indent=2, ensure_ascii=False)}")
+        lock.release()
+        return output
+
+    uvicorn.run(app, host="0.0.0.0", port=args.port)
+
+
+if __name__ == "__main__":
+    args = setup_args()
+    server(args)
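The new file above exposes a single POST endpoint that takes parallel `src`/`tgt`/`response` lists and returns one score per sample; its address is what the GRPO YAML configures as `reward_server`. The bundled `compute_score` is a placeholder that always returns 1.0, so in practice it is replaced with a task-specific rule. A small client-side sketch, assuming the server is running locally on its default port 8731:

```python
import requests  # third-party HTTP client, assumed available

payload = {
    "src": ["<|im_start|>user\nWho is the knight?<|im_end|>"],  # prompts (illustrative only)
    "tgt": ["(1) Zoey is a knight"],                            # ground-truth labels
    "response": ["<answer> (1) Zoey is a knight </answer>"],    # rollouts to score
}
resp = requests.post("http://127.0.0.1:8731/", json=payload, timeout=30)
result = resp.json()
print(result["error_code"], result["score"])  # 0 [1.0] with the placeholder compute_score
```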

llm/alignment/ppo/run_ppo.py (-3)
@@ -100,8 +100,6 @@ def create_actor_models(
     actor_model_config.set_attn_func = True
     actor_model_config.max_position_embeddings = data_args.max_length
     actor_model_config.use_sparse_head_and_loss_fn = False
-    actor_model_config.fused_linear = model_args.fused_linear
-    actor_model_config.use_fused_rms_norm = training_args.use_fused_rms_norm
     actor_model_config.seq_length = data_args.max_length
     actor_model_config.max_sequence_length = data_args.max_length
     print(f"Loading Actor model with config:\n\t{actor_model_config}\n")
@@ -172,7 +170,6 @@ def create_reward_models(
     LlmMetaConfig.set_llm_config(reward_model_config, training_args)
     reward_model_config.max_position_embeddings = data_args.max_length
     reward_model_config.use_sparse_head_and_loss_fn = False
-    reward_model_config.fused_linear = model_args.fused_linear
     print(f"Loading Reward model with config:\n\t{reward_model_config}\n")
 
     config = copy.deepcopy(reward_model_config)

llm/config/llama/grpo_argument.yaml (+19 -21)
@@ -11,7 +11,7 @@ reward_server: "http://127.0.0.1:8731" # The address of the reward model server
 logging_dir: grpo-logs # Directory for logging
 logging_steps: 1 # Number of steps between logging
 output_dir: "qwen2.5-7b-kk-dataset-grpo/checkpoints" # Directory for output ckpts
-report_to: "wandb" # Supported reporting options: "all", "wandb", "tensorboard", "visualdl"(default), "none"
+report_to: "visualdl" # Supported reporting options: "all", "wandb", "tensorboard", "visualdl"(default), "none"
 wandb_http_proxy: "http://127.0.0.1:8962" # HTTP proxy for wandb
 run_name: "qwen2.5-7b-kk-dataset-grpo" # Name of the run
 
@@ -22,12 +22,13 @@ prompt_key: "src" # Key for the prompt in the dataset
 response_key: "tgt" # Key for the response in the dataset
 dataloader_drop_last: true # Whether to drop the last incomplete batch in the DataLoader
 balance_batch: true # Whether to balance batch size across dataset_world_size
+use_remove_padding: true # Whether to remove padding tokens in the input
 
 # distributed training args
 tensor_parallel_degree: 2 # Degree of tensor parallelism
 sequence_parallel: true # Whether to enable sequence parallelism
-sharding_parallel_degree: 1 # Degree of sharding parallelism
-sharding: "stage2" # Sharding strategy, e.g., "stage1" or "stage2"
+sharding_parallel_degree: -1 # Degree of sharding parallelism
+sharding: "stage1" # Sharding strategy, e.g., "stage1" or "stage2"
 sharding_parallel_config: "enable_release_grads" # Configuration for sharding parallelism
 pipeline_parallel_degree: 1 # Degree of pipeline parallelism
 virtual_pp_degree: 1 # Degree of virtual pipeline parallelism
@@ -39,24 +40,23 @@ min_dec_len: 32 # Minimum length of the response
 top_p: 1.0 # Top-p sampling parameter
 temperature: 0.7 # Temperature parameter for sampling
 repetition_penalty: 1.0 # Repetition penalty parameter
-# rollout_use_dynamic_insert: 1 # Whether to use dynamic insert for rollout
-# rollout_continue_batching_batch_size: 32 # Base batch size for dynamic insert
-quant_type: "" # Quantization type, e.g., "weight_only_int8"
+rollout_max_num_seqs: 32 # The maximum number of sequences that can be processed in a single inference
+rollout_quant_type: "" # Quantization type, e.g., "weight_only_int8"
 
 # training args
 do_train: true # Whether to perform training
 seed: 42 # Random seed for reproducibility
-global_batch_size: 2 # Global batch size for training
-mini_batch_size: 2 # Mini-batch size for training
+global_batch_size: 4 # Global batch size for training
+global_gen_batch_size: -1 # Global generation batch size for dynamic sampling
+global_mini_batch_size: -1 # Mini-batch size for training
 rollout_n: 8 # Number of rollouts
 update_iters: 1 # Number of training iterations for rollout samples
-per_device_rollout_batch_size: 1 # Rollout batch size per device
 per_device_logprob_batch_size: 8 # Log probability batch size per device
 per_device_reward_batch_size: 8 # Reward batch size per device
 per_device_value_batch_size: 8 # Value batch size per device
 per_device_train_batch_size: 8 # Training batch size per device
 # gradient_accumulation_steps: 1 # Gradient accumulation steps (auto-calculated)
-num_train_epochs: 3 # Number of training epochs
+num_train_epochs: 6 # Number of training epochs
 max_length: 4608 # Maximum length for training, should be larger than max_prompt_len + max_dec_len
 learning_rate: 5e-7 # Learning rate for training
 lr_scheduler_type: "constant" # Learning rate scheduler type
@@ -65,15 +65,15 @@ adam_beta1: 0.9 # AdamW optimizer beta1
 adam_beta2: 0.999 # AdamW optimizer beta2
 adam_epsilon: 1e-8 # AdamW optimizer epsilon
 max_grad_norm: 1.0 # Maximum gradient norm for clipping
-max_steps: 3600 # Maximum number of training steps
+max_steps: -1 # Maximum number of training steps
 save_steps: 300 # Number of steps between model saves
 save_strategy: "steps" # Strategy for saving models
 ignore_save_lr_and_optim: true # Whether to ignore saving learning rate and optimizer state (leave empty if not specified)
 disable_tqdm: true # Whether to disable tqdm progress bar
 
 # RL args
 kl_coeff: 0.0 # KL coefficient
-kl_loss_coeff: 0.0 # KL loss coefficient
+kl_loss_coeff: 0.001 # KL loss coefficient
 pg_loss_coeff: 1.0 # Policy gradient loss coefficient
 entropy_coeff: 0.0 # Entropy coefficient
 clip_range_ratio: 0.2 # The clipping range for ratio between the old and new policy. (PPO algorithm)
@@ -84,12 +84,11 @@ enable_overlong_reward_buffer: false # Whether to enable overlong reward buffer
 overlong_reward_buffer: 256 # The length of the overlong reward buffer
 overlong_penalty_factor: 1.0 # The penalty factor for overlong reward buffer
 clip_range_value: 5.0 # The clipping range for the output of the value model. The value is clipped into [-clip_range_value, clip_range_value].
-normalize_reward: true # Whether to normalize reward
-normalize_advantage: true # Whether to normalize advantage
+normalize_reward: false # Whether to normalize reward
+normalize_advantage: false # Whether to normalize advantage
 dynamic_sampling: false # Whether to use dynamic sampling, which is introduced in the DAPO algorithm https://arxiv.org/abs/2503.14476
-per_device_sample_batch_size: 1 # Sample batch size per device for dynamic sampling
 max_gen_batches: 2 # Maximum number of generation batches for dynamic sampling
-use_fp32_compute: false # Whether to use fp32 to compute xx_log_prob, rewards, advantages and loss
+use_fp32_compute: true # Whether to use fp32 to compute xx_log_prob, rewards, advantages and loss
 
 # eval args
 do_eval: true # Whether to perform evaluation
@@ -99,11 +98,10 @@ eval_steps: 20 # Number of steps between evaluations
 
 # device memory optimization args
 use_flash_attention: true # Whether to use fused attention operations
-use_fused_rms_norm: true # Whether to use fused RMS norm operations
-use_fused_rope: true # Whether to use fused rope operations
-use_fused_head_and_loss_fn: false # Whether to use fused head and loss function
-use_fused_linear: false # Whether to use fused linear operations, which needs to install fused_ln in slm/model_zoo/gpt-3/external_ops
-fused_linear: false # Whether to use fused_gemm_epilogue
+use_fused_rms_norm: true # Whether to use fused RMS norm operations, which needs to install fused_ln in slm/model_zoo/gpt-3/external_ops
+use_fused_rope: false # Whether to use fused rope operations
+use_fused_head_and_loss_fn: true # Whether to use fused head and loss function
+use_fused_linear: true # Whether to use fused linear operations
 recompute: true # Whether to enable gradient checkpointing for memory optimization
 recompute_use_reentrant: true # Whether to use reentrant recompute
 recompute_granularity: "full" # Granularity of recompute

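One constraint worth checking before a run: both the README and the `max_length` comment above require `max_prompt_len + max_dec_len` to stay below `max_length`. A hedged sketch for validating that, assuming PyYAML is installed and that the two `*_len` keys are present in the file as the README's parameter list describes:

```python
import yaml  # PyYAML, assumed available

CONFIG_PATH = "llm/config/llama/grpo_argument.yaml"

with open(CONFIG_PATH, "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# README rule: max_dec_len + max_prompt_len should be smaller than max_length.
budget = cfg.get("max_prompt_len", 0) + cfg.get("max_dec_len", 0)
assert budget < cfg["max_length"], (
    f"max_prompt_len + max_dec_len = {budget} must be smaller than max_length = {cfg['max_length']}"
)
print("sequence-length budget OK:", budget, "<", cfg["max_length"])
```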