Conversation
- Added `train_pt` command to continue pre-training using prepared text datasets. - Introduced `TrainPtArgs` class for pre-training parameters, extending `TrainSftArgs`.
Reviewer's Guide在现有的 SFT 工作流基础上,新增对持续预训练(PT)工作流的支持,包括新的配置模型和 CLI 入口点,并对 SFT 和 PT 流程的适配器处理以及数据集校验逻辑进行了改进。 新的 train-pt CLI 工作流时序图sequenceDiagram
actor User
participant TrainPtCommand
participant TrainPtMain
participant ConfigLoader
participant WCTrainPtConfig
participant LlamaFactoryTuner
User ->> TrainPtCommand: train_pt
TrainPtCommand ->> TrainPtMain: main
TrainPtMain ->> ConfigLoader: load_config(train_pt)
ConfigLoader ->> WCTrainPtConfig: WCTrainPtConfig.__init__
WCTrainPtConfig ->> WCTrainPtConfig: process_config
WCTrainPtConfig -->> ConfigLoader: WCTrainPtConfig
ConfigLoader -->> TrainPtMain: WCTrainPtConfig
TrainPtMain ->> TrainPtMain: _resolve_dataset_path
TrainPtMain ->> LlamaFactoryTuner: run_exp
LlamaFactoryTuner -->> TrainPtMain: training_completed
TrainPtMain -->> User: PT finished
文件级变更
Tips and commands与 Sourcery 交互
自定义你的使用体验打开你的 dashboard 以:
获取帮助Original review guide in EnglishReviewer's GuideAdds support for a new continued pre-training (PT) workflow alongside existing SFT, including new config models and CLI entrypoint, plus refined adapter handling and dataset validation logic for both SFT and PT flows. Sequence diagram for the new train-pt CLI workflowsequenceDiagram
actor User
participant TrainPtCommand
participant TrainPtMain
participant ConfigLoader
participant WCTrainPtConfig
participant LlamaFactoryTuner
User ->> TrainPtCommand: train_pt
TrainPtCommand ->> TrainPtMain: main
TrainPtMain ->> ConfigLoader: load_config(train_pt)
ConfigLoader ->> WCTrainPtConfig: WCTrainPtConfig.__init__
WCTrainPtConfig ->> WCTrainPtConfig: process_config
WCTrainPtConfig -->> ConfigLoader: WCTrainPtConfig
ConfigLoader -->> TrainPtMain: WCTrainPtConfig
TrainPtMain ->> TrainPtMain: _resolve_dataset_path
TrainPtMain ->> LlamaFactoryTuner: run_exp
LlamaFactoryTuner -->> TrainPtMain: training_completed
TrainPtMain -->> User: PT finished
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - 我发现了 1 个问题,并给出了一些整体层面的反馈:
- 新增的
TrainSftArgs上的create_new_adapter标志在WCTrainSftConfig.process_config或 PT 流程中从未被使用。如果它是要影响 adapter 的创建/恢复方式,建议把它接好 wiring 到配置变换逻辑或训练入口中,使其产生实际效果。 - 你在多个位置移除了
quantization(create_config_by_arg_type(train_pt)、WCTrainPtConfig.process_config,以及在run_exp之前又移除了一次),这有点重复;把这类清理集中在单一层进行会让 PT 配置流更容易理解和维护。
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `create_new_adapter` flag on `TrainSftArgs` is never used in `WCTrainSftConfig.process_config` or the PT flow, so if it is meant to influence how adapters are created/resumed, consider wiring it into the config transformation or the training entrypoints so it has a concrete effect.
- You are stripping `quantization` in multiple places (`create_config_by_arg_type(train_pt)`, `WCTrainPtConfig.process_config`, and again before `run_exp`), which is redundant; consolidating this cleanup in a single layer would make the PT config flow easier to reason about.
## Individual Comments
### Comment 1
<location path="weclone/utils/config_models.py" line_range="326-327" />
<code_context>
+ delattr(self, "adapter_name_or_path")
+
+ self.dataset = self._parse_dataset_name()
+ if hasattr(self, "resume_adapter_name_or_path"):
+ delattr(self, "resume_adapter_name_or_path")
+ if hasattr(self, "quantization"):
+ delattr(self, "quantization")
</code_context>
<issue_to_address>
**issue (bug_risk):** Consider similar cleanup for PT-only helper fields before passing config downstream
In SFT you strip `resume_adapter_name_or_path` so LlamaFactory only receives supported args. For PT, `TrainPtArgs` still carries `resume_adapter_name_or_path` and `create_new_adapter`, and `WCTrainPtConfig.process_config` doesn’t remove them, so they’ll end up in `train_config.model_dump()` and be passed to `run_exp`. If LlamaFactory doesn’t recognize these keys, that’s a runtime error risk. Please also delete or translate these PT-only helper fields in `WCTrainPtConfig.process_config` (or just before `run_exp`).
</issue_to_address>Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Original comment in English
Hey - I've found 1 issue, and left some high level feedback:
- The new
create_new_adapterflag onTrainSftArgsis never used inWCTrainSftConfig.process_configor the PT flow, so if it is meant to influence how adapters are created/resumed, consider wiring it into the config transformation or the training entrypoints so it has a concrete effect. - You are stripping
quantizationin multiple places (create_config_by_arg_type(train_pt),WCTrainPtConfig.process_config, and again beforerun_exp), which is redundant; consolidating this cleanup in a single layer would make the PT config flow easier to reason about.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `create_new_adapter` flag on `TrainSftArgs` is never used in `WCTrainSftConfig.process_config` or the PT flow, so if it is meant to influence how adapters are created/resumed, consider wiring it into the config transformation or the training entrypoints so it has a concrete effect.
- You are stripping `quantization` in multiple places (`create_config_by_arg_type(train_pt)`, `WCTrainPtConfig.process_config`, and again before `run_exp`), which is redundant; consolidating this cleanup in a single layer would make the PT config flow easier to reason about.
## Individual Comments
### Comment 1
<location path="weclone/utils/config_models.py" line_range="326-327" />
<code_context>
+ delattr(self, "adapter_name_or_path")
+
+ self.dataset = self._parse_dataset_name()
+ if hasattr(self, "resume_adapter_name_or_path"):
+ delattr(self, "resume_adapter_name_or_path")
+ if hasattr(self, "quantization"):
+ delattr(self, "quantization")
</code_context>
<issue_to_address>
**issue (bug_risk):** Consider similar cleanup for PT-only helper fields before passing config downstream
In SFT you strip `resume_adapter_name_or_path` so LlamaFactory only receives supported args. For PT, `TrainPtArgs` still carries `resume_adapter_name_or_path` and `create_new_adapter`, and `WCTrainPtConfig.process_config` doesn’t remove them, so they’ll end up in `train_config.model_dump()` and be passed to `run_exp`. If LlamaFactory doesn’t recognize these keys, that’s a runtime error risk. Please also delete or translate these PT-only helper fields in `WCTrainPtConfig.process_config` (or just before `run_exp`).
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Pull request overview
This PR adds first-class support for continued pre-training (PT) alongside existing SFT training by introducing PT-specific config models, wiring a new train-pt CLI entrypoint, and adding dataset schema/file validation for PT runs.
Changes:
- Add
TrainPtArgs/WCTrainPtConfigandWcConfig.train_pt_argsto represent continued pre-training configuration. - Add
train-ptCLI command and a newweclone/train/train_pt.pyrunner that validates PT dataset schema and invokes LlamaFactoryrun_exp. - Refine training config normalization (flatten quantization, normalize adapter/resume fields) and update
.gitignore.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
weclone/utils/config.py |
Adds config construction path for train_pt and flattens quantization args for PT. |
weclone/utils/config_models.py |
Introduces PT config models and updates SFT config post-processing/normalization. |
weclone/train/train_pt.py |
New PT training entrypoint with dataset_info validation and LlamaFactory invocation. |
weclone/cli.py |
Adds train-pt command wired to the new training entrypoint. |
.gitignore |
Ignores .claude/* artifacts. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| output_adapter_value = getattr(self, "adapter_name_or_path", None) | ||
| resume_adapter_value = getattr(self, "resume_adapter_name_or_path", None) | ||
|
|
||
| if output_adapter_value: | ||
| self.output_dir = output_adapter_value | ||
|
|
||
| if resume_adapter_value: | ||
| self.adapter_name_or_path = resume_adapter_value | ||
| elif hasattr(self, "adapter_name_or_path"): | ||
| delattr(self, "adapter_name_or_path") | ||
|
|
| if dataset_entry.get("formatting") == "sharegpt": | ||
| raise ValueError( | ||
| f"Dataset '{dataset_name}' is a ShareGPT dataset. " | ||
| "LlamaFactory pre-training requires Alpaca-style data with columns.prompt mapped to text." | ||
| ) | ||
|
|
||
| prompt_column = (dataset_entry.get("columns") or {}).get("prompt") | ||
| if prompt_column is None: | ||
| raise ValueError(f"Dataset '{dataset_name}' must define columns.prompt for pre-training.") | ||
|
|
| @cli.command("train-pt", help="Continue pre-training the model using prepared text datasets.") | ||
| @apply_common_decorators() | ||
| def train_pt(): | ||
| """Continue pre-training the model using prepared text datasets.""" | ||
| from weclone.train.train_pt import main as train_pt_main | ||
|
|
||
| train_pt_main() | ||
|
|
| self.dataset = self._parse_dataset_name() | ||
| if hasattr(self, "resume_adapter_name_or_path"): | ||
| delattr(self, "resume_adapter_name_or_path") | ||
| if hasattr(self, "quantization"): | ||
| delattr(self, "quantization") | ||
| if hasattr(self, "include_type"): | ||
| delattr(self, "include_type") | ||
|
|
| if hasattr(self, "include_type"): | ||
| delattr(self, "include_type") | ||
| if hasattr(self, "quantization"): | ||
| delattr(self, "quantization") |
- Updated subproject commit to indicate a dirty state. - Removed the `create_new_adapter` field from `TrainSftArgs` class in `config_models.py` to streamline training configuration.
| class TrainPtArgs(TrainSftArgs): | ||
| stage: str = Field("pt", description="Pre-training stage") | ||
| dataset: str = Field(..., description="Pre-training dataset name") | ||
| output_dir: Optional[str] = Field(None, description="PT output directory") | ||
| packing: Optional[bool] = Field( |
| if prompt_column is None: | ||
| raise ValueError(f"Dataset '{dataset_name}' must define columns.prompt for pre-training.") |
| @cli.command("train-pt", help="Continue pre-training the model using prepared text datasets.") | ||
| @apply_common_decorators() | ||
| def train_pt(): | ||
| """Continue pre-training the model using prepared text datasets.""" | ||
| from weclone.train.train_pt import main as train_pt_main | ||
|
|
||
| train_pt_main() |
| output_dir: Optional[str] = Field(None) | ||
|
|
Summary by Sourcery
在现有 SFT 训练的基础上增加对持续预训练(continued pre-training)的支持,包括配置模型、CLI 入口点以及训练 runner 的集成。
新功能:
TrainPtArgs和WCTrainPtConfig模型,用于配置持续预训练任务。train-ptCLI 命令,连接到新的训练入口点,在已准备好的文本数据集上运行基于 LlamaFactory 的持续预训练。增强改进:
include_type等未使用字段。杂项:
Original summary in English
Summary by Sourcery
Add support for continued pre-training alongside existing SFT training, including configuration models, CLI entrypoint, and training runner integration.
New Features:
Enhancements:
Chores: