Dev#221

Merged
xming521 merged 4 commits into master from dev
May 31, 2026

Conversation

xming521 commented May 28, 2026

Owner

Summary by Sourcery

Add support for continued pre-training alongside existing SFT training, including configuration models, CLI entrypoint, and training runner integration.

New Features:

  • Introduce TrainPtArgs and WCTrainPtConfig models to configure continued pre-training runs.
  • Add a train-pt CLI command wired to a new training entrypoint that runs LlamaFactory-based continued pre-training on prepared text datasets.
  • Implement dataset resolution and validation for pre-training datasets to ensure correct schema and file existence.

Enhancements:

  • Extend SFT training configuration to support resuming from an existing LoRA adapter or creating a new one, while normalizing adapter-related fields during config processing.
  • Refine config construction logic to handle pre-training argument flattening and remove unused fields like quantization and include_type in final configs.

Chores:

  • Update gitignore entries related to WC-exp artifacts.

xming521 and others added 2 commits May 13, 2026 16:32
- Added `train_pt` command to continue pre-training using prepared text datasets.
- Introduced `TrainPtArgs` class for pre-training parameters, extending `TrainSftArgs`.
Copilot AI review requested due to automatic review settings May 28, 2026 03:16
sourcery-ai Bot commented May 28, 2026

Reviewer's Guide

Adds support for a new continued pre-training (PT) workflow alongside existing SFT, including new config models and CLI entrypoint, plus refined adapter handling and dataset validation logic for both SFT and PT flows.

Sequence diagram for the new train-pt CLI workflow

sequenceDiagram
    actor User
    participant TrainPtCommand
    participant TrainPtMain
    participant ConfigLoader
    participant WCTrainPtConfig
    participant LlamaFactoryTuner

    User ->> TrainPtCommand: train_pt
    TrainPtCommand ->> TrainPtMain: main

    TrainPtMain ->> ConfigLoader: load_config(train_pt)
    ConfigLoader ->> WCTrainPtConfig: WCTrainPtConfig.__init__
    WCTrainPtConfig ->> WCTrainPtConfig: process_config
    WCTrainPtConfig -->> ConfigLoader: WCTrainPtConfig
    ConfigLoader -->> TrainPtMain: WCTrainPtConfig

    TrainPtMain ->> TrainPtMain: _resolve_dataset_path
    TrainPtMain ->> LlamaFactoryTuner: run_exp
    LlamaFactoryTuner -->> TrainPtMain: training_completed
    TrainPtMain -->> User: PT finished
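
The sequence above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `weclone/train/train_pt.py`: the function `run_train_pt` and its three callable parameters are hypothetical stand-ins for the config loader, the dataset resolver, and LlamaFactory's `run_exp`, injected so the flow can be exercised without a GPU or a real config file.

```python
from typing import Any, Callable, Dict


def run_train_pt(
    load_config: Callable[[str], Dict[str, Any]],
    resolve_dataset_path: Callable[[Dict[str, Any]], str],
    run_exp: Callable[[Dict[str, Any]], None],
) -> str:
    """Follow the diagram: load config, resolve the dataset, then train.

    All three callables are hypothetical stand-ins for the real components.
    """
    config = load_config("train_pt")

    # The guide says the entrypoint validates stage='pt' before training.
    if config.get("stage") != "pt":
        raise ValueError(f"Expected stage='pt', got {config.get('stage')!r}")

    # Resolve (and thereby validate) the dataset before any training starts.
    dataset_path = resolve_dataset_path(config)

    run_exp(config)
    return dataset_path
```

Resolving the dataset before calling `run_exp` means a misconfigured dataset fails fast, before any model weights are loaded.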

File-Level Changes

Change: Extend SFT training configuration to support resuming from an existing LoRA adapter and to normalize output/adapter fields.
Details:
  • Add resume_adapter_name_or_path and create_new_adapter fields to the TrainSftArgs model.
  • Introduce an after-model_validator in WCTrainSftConfig to map adapter_name_or_path to output_dir and optionally override adapter_name_or_path with resume_adapter_name_or_path.
  • Normalize and clean the final SFT config by parsing dataset name and removing internal-only fields such as resume_adapter_name_or_path, quantization, and include_type.
Files:
  • weclone/utils/config_models.py

Change: Introduce a dedicated configuration model and config creation path for continued pre-training (PT).
Details:
  • Add TrainPtArgs model for PT-specific arguments, including stage='pt', dataset, output_dir, and packing flag.
  • Define WCTrainPtConfig with an after-model_validator that sets a default output_dir from adapter_name_or_path and strips internal fields like adapter_name_or_path and quantization.
  • Wire WcConfig to optionally carry train_pt_args and add a create_config_by_arg_type branch to construct WCTrainPtConfig using flattened quantization args.
Files:
  • weclone/utils/config_models.py
  • weclone/utils/config.py

Change: Expose a new CLI command and training entrypoint to run PT using LlamaFactory with strong dataset validation.
Details:
  • Add train-pt CLI command that applies common decorators and calls a new train_pt main function.
  • Implement weclone.train.train_pt with dataset_info.json validation, ensuring non-ShareGPT formatting, a defined prompt column, and an existing dataset file.
  • Load WCTrainPtConfig from config, validate stage='pt', warn on CPU device, log resolved dataset path and config, strip quantization from the runtime config, and call run_exp.
  • Update .gitignore / repo layout metadata entries as part of the new PT workflow.
Files:
  • weclone/cli.py
  • weclone/train/train_pt.py
  • .gitignore
  • WC-exp

sourcery-ai Bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • The new create_new_adapter flag on TrainSftArgs is never used in WCTrainSftConfig.process_config or the PT flow, so if it is meant to influence how adapters are created/resumed, consider wiring it into the config transformation or the training entrypoints so it has a concrete effect.
  • You are stripping quantization in multiple places (create_config_by_arg_type(train_pt), WCTrainPtConfig.process_config, and again before run_exp), which is redundant; consolidating this cleanup in a single layer would make the PT config flow easier to reason about.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The new `create_new_adapter` flag on `TrainSftArgs` is never used in `WCTrainSftConfig.process_config` or the PT flow, so if it is meant to influence how adapters are created/resumed, consider wiring it into the config transformation or the training entrypoints so it has a concrete effect.
- You are stripping `quantization` in multiple places (`create_config_by_arg_type(train_pt)`, `WCTrainPtConfig.process_config`, and again before `run_exp`), which is redundant; consolidating this cleanup in a single layer would make the PT config flow easier to reason about.

## Individual Comments

### Comment 1
<location path="weclone/utils/config_models.py" line_range="326-327" />
<code_context>
+            delattr(self, "adapter_name_or_path")
+
+        self.dataset = self._parse_dataset_name()
+        if hasattr(self, "resume_adapter_name_or_path"):
+            delattr(self, "resume_adapter_name_or_path")
+        if hasattr(self, "quantization"):
+            delattr(self, "quantization")
</code_context>
<issue_to_address>
**issue (bug_risk):** Consider similar cleanup for PT-only helper fields before passing config downstream

In SFT you strip `resume_adapter_name_or_path` so LlamaFactory only receives supported args. For PT, `TrainPtArgs` still carries `resume_adapter_name_or_path` and `create_new_adapter`, and `WCTrainPtConfig.process_config` doesn’t remove them, so they’ll end up in `train_config.model_dump()` and be passed to `run_exp`. If LlamaFactory doesn’t recognize these keys, that’s a runtime error risk. Please also delete or translate these PT-only helper fields in `WCTrainPtConfig.process_config` (or just before `run_exp`).
</issue_to_address>


Comment thread weclone/utils/config_models.py

Copilot AI left a comment

Contributor

Pull request overview

This PR adds first-class support for continued pre-training (PT) alongside existing SFT training by introducing PT-specific config models, wiring a new train-pt CLI entrypoint, and adding dataset schema/file validation for PT runs.

Changes:

  • Add TrainPtArgs / WCTrainPtConfig and WcConfig.train_pt_args to represent continued pre-training configuration.
  • Add train-pt CLI command and a new weclone/train/train_pt.py runner that validates PT dataset schema and invokes LlamaFactory run_exp.
  • Refine training config normalization (flatten quantization, normalize adapter/resume fields) and update .gitignore.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| weclone/utils/config.py | Adds config construction path for train_pt and flattens quantization args for PT. |
| weclone/utils/config_models.py | Introduces PT config models and updates SFT config post-processing/normalization. |
| weclone/train/train_pt.py | New PT training entrypoint with dataset_info validation and LlamaFactory invocation. |
| weclone/cli.py | Adds train-pt command wired to the new training entrypoint. |
| .gitignore | Ignores .claude/* artifacts. |

Comment on lines +314 to +324
        output_adapter_value = getattr(self, "adapter_name_or_path", None)
        resume_adapter_value = getattr(self, "resume_adapter_name_or_path", None)

        if output_adapter_value:
            self.output_dir = output_adapter_value

        if resume_adapter_value:
            self.adapter_name_or_path = resume_adapter_value
        elif hasattr(self, "adapter_name_or_path"):
            delattr(self, "adapter_name_or_path")
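
To make the effect of this snippet easier to see, here is the same mapping restated on plain dictionaries. It is a hedged sketch rather than the project's validator: `normalize_adapter_fields` is a hypothetical name, and unlike the snippet it also drops `resume_adapter_name_or_path` in the same step (the PR removes that field a few lines later).

```python
from typing import Any, Dict


def normalize_adapter_fields(config: Dict[str, Any]) -> Dict[str, Any]:
    """Map user-facing adapter fields onto the fields the trainer consumes.

    - `adapter_name_or_path` (where the adapter is written) becomes `output_dir`.
    - `resume_adapter_name_or_path`, if set, becomes `adapter_name_or_path`
      (the adapter to load); otherwise no adapter is loaded at all.
    """
    result = dict(config)  # work on a copy, leave the caller's dict alone
    output_adapter = result.get("adapter_name_or_path")
    resume_adapter = result.pop("resume_adapter_name_or_path", None)

    if output_adapter:
        result["output_dir"] = output_adapter

    if resume_adapter:
        result["adapter_name_or_path"] = resume_adapter
    else:
        result.pop("adapter_name_or_path", None)

    return result


# Fresh run: only an output location is set, so no adapter is loaded.
fresh = normalize_adapter_fields({"adapter_name_or_path": "./out/lora"})

# Resumed run: load the earlier adapter, write results to the new location.
resumed = normalize_adapter_fields(
    {"adapter_name_or_path": "./out/v2", "resume_adapter_name_or_path": "./out/v1"}
)
```

In the fresh case the result carries only `output_dir`; in the resumed case `adapter_name_or_path` points at the earlier adapter while `output_dir` points at the new one.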

Comment thread weclone/train/train_pt.py
Comment on lines +25 to +34
    if dataset_entry.get("formatting") == "sharegpt":
        raise ValueError(
            f"Dataset '{dataset_name}' is a ShareGPT dataset. "
            "LlamaFactory pre-training requires Alpaca-style data with columns.prompt mapped to text."
        )

    prompt_column = (dataset_entry.get("columns") or {}).get("prompt")
    if prompt_column is None:
        raise ValueError(f"Dataset '{dataset_name}' must define columns.prompt for pre-training.")
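
For context, the checks in this snippet fit into a dataset resolution step roughly like the one below. This is a self-contained sketch under stated assumptions, not the PR's `_resolve_dataset_path`: the function name `resolve_pt_dataset` is hypothetical, and it assumes `dataset_info.json` sits in the dataset directory and that each entry names its data file under `file_name`, as in LlamaFactory's dataset registry format.

```python
import json
from pathlib import Path
from typing import Any, Dict


def resolve_pt_dataset(dataset_dir: Path, dataset_name: str) -> Path:
    """Validate a pre-training dataset entry and return its data file path."""
    info_path = dataset_dir / "dataset_info.json"
    dataset_info: Dict[str, Any] = json.loads(info_path.read_text(encoding="utf-8"))

    entry = dataset_info.get(dataset_name)
    if entry is None:
        raise ValueError(f"Dataset '{dataset_name}' is not listed in {info_path}.")

    # Pre-training consumes plain text, so conversation-style data is rejected.
    if entry.get("formatting") == "sharegpt":
        raise ValueError(f"Dataset '{dataset_name}' is a ShareGPT dataset.")

    # `columns` may be missing or null, hence the `or {}` fallback.
    if (entry.get("columns") or {}).get("prompt") is None:
        raise ValueError(f"Dataset '{dataset_name}' must define columns.prompt.")

    file_name = entry.get("file_name")
    if not file_name:
        raise ValueError(f"Dataset '{dataset_name}' must define file_name.")

    data_file = dataset_dir / file_name
    if not data_file.is_file():
        raise FileNotFoundError(f"Dataset file not found: {data_file}")
    return data_file
```

Raising distinct errors for schema problems (`ValueError`) and a missing file (`FileNotFoundError`) lets the caller report which kind of fix is needed.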

Comment thread weclone/cli.py
Comment on lines +123 to +130
@cli.command("train-pt", help="Continue pre-training the model using prepared text datasets.")
@apply_common_decorators()
def train_pt():
    """Continue pre-training the model using prepared text datasets."""
    from weclone.train.train_pt import main as train_pt_main

    train_pt_main()

Comment on lines +325 to +332
        self.dataset = self._parse_dataset_name()
        if hasattr(self, "resume_adapter_name_or_path"):
            delattr(self, "resume_adapter_name_or_path")
        if hasattr(self, "quantization"):
            delattr(self, "quantization")
        if hasattr(self, "include_type"):
            delattr(self, "include_type")

        if hasattr(self, "include_type"):
            delattr(self, "include_type")
        if hasattr(self, "quantization"):
            delattr(self, "quantization")
xming521 and others added 2 commits May 28, 2026 12:21
- Updated subproject commit to indicate a dirty state.
- Removed the `create_new_adapter` field from `TrainSftArgs` class in `config_models.py` to streamline training configuration.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.

Comment on lines +242 to +246
class TrainPtArgs(TrainSftArgs):
    stage: str = Field("pt", description="Pre-training stage")
    dataset: str = Field(..., description="Pre-training dataset name")
    output_dir: Optional[str] = Field(None, description="PT output directory")
    packing: Optional[bool] = Field(
Comment thread weclone/train/train_pt.py
Comment on lines +32 to +33
    if prompt_column is None:
        raise ValueError(f"Dataset '{dataset_name}' must define columns.prompt for pre-training.")
Comment thread weclone/cli.py
Comment on lines +123 to +129
@cli.command("train-pt", help="Continue pre-training the model using prepared text datasets.")
@apply_common_decorators()
def train_pt():
    """Continue pre-training the model using prepared text datasets."""
    from weclone.train.train_pt import main as train_pt_main

    train_pt_main()
Comment on lines +335 to +336
output_dir: Optional[str] = Field(None)

@xming521 xming521 merged commit a20f9eb into master May 31, 2026
4 checks passed