Commit 1633285

OleehyO and zhangch9 authored
[refactor] Changes deepspeed + accelerate to FSDP (#33)
* [feat] Add DistributedLogger
* [sampler] Add distributed packing sampler
* [logger] Use filelock to prevent write conflicts
* [trainer] Refactor for FSDP training
* [tool] Add merge tool for dist checkpoint
* [deps] Update dependencies
* [docs] Update

---------

Co-authored-by: Chenhui Zhang <[email protected]>
1 parent 864f143 commit 1633285
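The `[logger]` item refers to the common pattern of serializing log writes from multiple ranks through a file lock. A minimal sketch of that pattern using the `filelock` package (an illustration only, not the commit's actual `DistributedLogger`):

```python
from filelock import FileLock

LOG_PATH = "train.log"  # hypothetical shared log file

def append_log(line: str) -> None:
    # A sidecar .lock file serializes appends from concurrent processes,
    # so interleaved writes from different ranks cannot corrupt the log.
    with FileLock(LOG_PATH + ".lock"):
        with open(LOG_PATH, "a", encoding="utf-8") as f:
            f.write(line + "\n")
```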

63 files changed: +1529 / -1969 lines

docs/01-Intro.md

Lines changed: 2 additions & 5 deletions
@@ -4,15 +4,12 @@ slug: /
 
 # Introduction
 
-CogKit is an open-source project that provides a user-friendly interface for researchers and developers to utilize ZhipuAI's [**CogView**](https://huggingface.co/collections/THUDM/cogview-67ac3f241eefad2af015669b) (image generation) and [**CogVideoX**](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) (video generation) models. It streamlines multimodal tasks such as **text-to-image (T2I)**, **text-to-video (T2V)**, and **image-to-video (I2V)**. Users must comply with legal and ethical guidelines to ensure responsible implementation.
+CogKit is an open-source project that provides a user-friendly interface for researchers and developers to utilize ZhipuAI's [CogView](https://huggingface.co/collections/THUDM/cogview-67ac3f241eefad2af015669b) (image generation) and [CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) (video generation) models. It streamlines multimodal tasks such as text-to-image(T2I), text-to-video(T2V), and image-to-video(I2V). Users must comply with legal and ethical guidelines to ensure responsible implementation.
 
 ## Supported Models
 
 Please refer to the [Model Card](./05-Model%20Card.mdx) for more details.
 
 ## Environment Testing
 
-This repository has been tested in environments with `1×A100` and `8×A100` GPUs, using `CUDA 12.4, Python 3.10.16`.
-
-- Cog series models typically do not support `FP16` precision (Only `CogVideoX-2B` support); GPUs like the `V100` cannot be fine-tuned properly (Will cause `loss=nan` for example). At a minimum, an `A100` or other GPUs supporting `BF16` precision should be used.
-- We have not yet systematically tested the minimum GPU memory requirements for each model. For `LORA(bs=1 with offload)`, a single `A100` GPU is sufficient. For `SFT`, our tests have passed in an `8×A100` environment.
+This repository has been tested in environments with 8×A100 GPUs, using CUDA 12.4, Python 3.10.16.
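The deleted note points out that Cog models need BF16-capable GPUs (FP16 fine-tuning leads to `loss=nan`); a quick way to check a machine for native BF16 support (generic PyTorch, not part of CogKit):

```python
import torch

assert torch.cuda.is_available(), "No CUDA device found"
major, _ = torch.cuda.get_device_capability()
print("GPU:", torch.cuda.get_device_name())
# Native BF16 needs compute capability >= 8.0 (Ampere, e.g. A100); V100 is 7.0.
print("Native BF16 support:", major >= 8)
```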

docs/04-Finetune/01-Prerequisites.mdx

Lines changed: 34 additions & 84 deletions
@@ -3,109 +3,69 @@
 
 # Prerequisites
 
-Before starting fine-tuning, please ensure your machine meets the minimum hardware requirements listed in the tables below. The tables show the minimum VRAM (GPU memory) requirements for different models under various configurations.
+Before starting fine-tuning, please ensure your machine meets the minimum hardware requirements listed in the tables below. The tables show the minimum VRAM requirements for different models under various configurations (test on 8xA100).
 
 ## CogVideo Series
 
 <table style={{ textAlign: "center" }}>
 <thead>
 <tr>
 <th style={{ textAlign: "center" }}>Model</th>
-<th style={{ textAlign: "center" }}>Training Type</th>
-<th style={{ textAlign: "center" }}>Distribution Strategy</th>
-<th style={{ textAlign: "center" }}>Training Resolution (FxHxW)</th>
+<th style={{ textAlign: "center" }}>Type</th>
+<th style={{ textAlign: "center" }}>Strategy</th>
+<th style={{ textAlign: "center" }}>Resolution <br /> (FxHxW)</th>
 <th style={{ textAlign: "center" }}>Requirement</th>
 </tr>
 </thead>
 <tbody>
 <tr>
-<td rowspan="6">cogvideox-t2v-2b</td>
+<td rowSpan="2">cogvideox-t2v-2b</td>
 <td>lora</td>
 <td>DDP</td>
 <td>49x480x720</td>
-<td>16GB VRAM</td>
+<td>1 GPU with <br /> 12GB VRAM</td>
 </tr>
 <tr>
-<td rowspan="5">sft</td>
+<td rowSpan="1">sft</td>
 <td>DDP</td>
 <td>49x480x720</td>
-<td>36GB VRAM</td>
+<td>1 GPU with <br /> 25GB VRAM</td>
 </tr>
 <tr>
-<td>1-GPU zero-2 + opt offload</td>
-<td>49x480x720</td>
-<td>17GB VRAM</td>
-</tr>
-<tr>
-<td>8-GPU zero-2</td>
-<td>49x480x720</td>
-<td>17GB VRAM</td>
-</tr>
-<tr>
-<td>8-GPU zero-3</td>
-<td>49x480x720</td>
-<td>19GB VRAM</td>
-</tr>
-<tr>
-<td>8-GPU zero-3 + opt and param offload</td>
-<td>49x480x720</td>
-<td>14GB VRAM</td>
-</tr>
-<tr>
-<td rowspan="5">cogvideox-\{t2v,i2v\}-5b</td>
+<td rowSpan="3">cogvideox-\{t2v,i2v\}-5b</td>
 <td>lora</td>
 <td>DDP</td>
 <td>49x480x720</td>
-<td>24GB VRAM</td>
-</tr>
-<tr>
-<td rowspan="4">sft</td>
-<td>1-GPU zero-2 + opt offload</td>
-<td>49x480x720</td>
-<td>42GB VRAM</td>
+<td>1 GPU with <br /> 24GB VRAM</td>
 </tr>
 <tr>
-<td>8-GPU zero-2</td>
+<td rowSpan="2">sft</td>
+<td>FSDP fullshard</td>
 <td>49x480x720</td>
-<td>42GB VRAM</td>
+<td>8 GPU with <br /> 20GB VRAM</td>
 </tr>
 <tr>
-<td>8-GPU zero-3</td>
+<td>FSDP fullshard + offload</td>
 <td>49x480x720</td>
-<td>43GB VRAM</td>
+<td>1 GPU with <br /> 16GB VRAM</td>
 </tr>
 <tr>
-<td>8-GPU zero-3 + opt and param offload</td>
-<td>49x480x720</td>
-<td>28GB VRAM</td>
-</tr>
-<tr>
-<td rowspan="5">cogvideox1.5-\{t2v,i2v\}-5b</td>
+<td rowSpan="3">cogvideox1.5-\{t2v,i2v\}-5b</td>
 <td>lora</td>
 <td>DDP</td>
 <td>81x768x1360</td>
-<td>35GB VRAM</td>
-</tr>
-<tr>
-<td rowspan="4">sft</td>
-<td>1-GPU zero-2 + opt offload</td>
-<td>81x768x1360</td>
-<td>56GB VRAM</td>
+<td>1 GPU with <br /> 32GB VRAM</td>
 </tr>
 <tr>
-<td>8-GPU zero-2</td>
+<td rowSpan="2">sft</td>
+<td>FSDP fullshard</td>
 <td>81x768x1360</td>
-<td>55GB VRAM</td>
+<td>8 GPUs with <br /> 31GB VRAM</td>
 </tr>
 <tr>
-<td>8-GPU zero-3</td>
+<td>FSDP fullshard + offload</td>
 <td>81x768x1360</td>
-<td>55GB VRAM</td>
-</tr>
-<tr>
-<td>8-GPU zero-3 + opt and param offload</td>
-<td>81x768x1360</td>
-<td>40GB VRAM</td>
+<td>8 GPUs with <br /> 27GB VRAM</td>
 </tr>
 </tbody>
 </table>
@@ -116,46 +76,36 @@ Before starting fine-tuning, please ensure your machine meets the minimum hardwa
 <thead>
 <tr>
 <th style={{ textAlign: "center" }}>Model</th>
-<th style={{ textAlign: "center" }}>Training Type</th>
-<th style={{ textAlign: "center" }}>Distribution Strategy</th>
-<th style={{ textAlign: "center" }}>Training Resolution (HxW)</th>
+<th style={{ textAlign: "center" }}>Type</th>
+<th style={{ textAlign: "center" }}>Strategy</th>
+<th style={{ textAlign: "center" }}>Resolution <br /> (HxW)</th>
 <th style={{ textAlign: "center" }}>Requirement</th>
 </tr>
 </thead>
 <tbody>
 <tr>
-<td rowspan="6">CogView4-6B</td>
-<td>qlora + param offload <br />(`--low_vram`)</td>
+<td rowSpan="4">CogView4-6B</td>
+<td>qlora + offload <br />(enable --low_vram)</td>
 <td>DDP</td>
 <td>1024x1024</td>
-<td>9GB VRAM</td>
+<td>1 GPU with <br /> 9GB VRAM</td>
 </tr>
 <tr>
 <td>lora</td>
 <td>DDP</td>
 <td>1024x1024</td>
-<td>30GB VRAM</td>
-</tr>
-<tr>
-<td rowspan="4">sft</td>
-<td>1-GPU zero-2 + opt offload</td>
-<td>1024x1024</td>
-<td>42GB VRAM</td>
-</tr>
-<tr>
-<td>8-GPU zero-2</td>
-<td>1024x1024</td>
-<td>50GB VRAM</td>
+<td>1 GPU with <br /> 20GB VRAM</td>
 </tr>
 <tr>
-<td>8-GPU zero-3</td>
+<td rowSpan="2">sft</td>
+<td>FSDP fullshard</td>
 <td>1024x1024</td>
-<td>47GB VRAM</td>
+<td>8 GPUs with <br /> 28GB VRAM</td>
 </tr>
 <tr>
-<td>8-GPU zero-3 + opt and param offload</td>
+<td>FSDP fullshard + offload</td>
 <td>1024x1024</td>
-<td>28GB VRAM</td>
+<td>8 GPUs with <br /> 22GB VRAM</td>
 </tr>
 </tbody>
 </table>
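In the updated tables, "FSDP fullshard" corresponds to PyTorch FSDP's FULL_SHARD sharding strategy and "+ offload" to CPU parameter offload. A minimal sketch of that wrapping in plain PyTorch (an illustration, not CogKit's actual trainer code; the model below is a toy stand-in for the diffusion transformer):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

# Launch with e.g.: torchrun --nproc_per_node=8 this_script.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(4096, 4096)  # toy stand-in for the real model

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
    cpu_offload=CPUOffload(offload_params=True),    # omit this line for the non-offload rows
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
)
```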

docs/04-Finetune/02-Quick Start.md

Lines changed: 14 additions & 15 deletions
@@ -27,36 +27,35 @@ We recommend that you read the corresponding [model card](../05-Model%20Card.mdx
 :::
 
 1. Navigate to the `CogKit/` directory after cloning the repository
+
 ```bash
 cd CogKit/
 ```
 
-2. Choose the appropriate training script from the `quickstart/scripts` directory based on your task type and distribution strategy. For example, `train_ddp_t2i.sh` corresponds to DDP strategy + text-to-image task
-
-3. Review and adjust the parameters in the selected training script (e.g., `--data_root`, `--output_dir`, etc.)
+2. Choose the appropriate subdirectory from the `quickstart/scripts` based on your task type and distribution strategy. For example, `t2i` corresponds to text-to-image task
 
-4. [Optional] If you are using ZeRO strategy, refer to `quickstart/configs/accelerate_config.yaml` to confirm your ZeRO config file and number of GPUs.
+3. Review and adjust the parameters in `config.yaml` in the selected training directory
 
-5. Run the script, for example:
+4. Run the script in the selected directory:
 
 ```bash
-cd quickstart/scripts
-bash train_ddp_t2i.sh
+bash start_train.sh
 ```
 
 ## Load Fine-tuned Model
 
-### LoRA
-
-After fine-tuning with LoRA, you can load your trained weights during inference using the `--lora_model_id_or_path` option or parameter. For more details, please refer to the inference guide.
+### Merge Checkpoint
 
-### ZeRO
-
-After fine-tuning with ZeRO strategy, you need to use the `zero_to_fp32.py` script provided in the `quickstart/tools/converters` directory to convert the ZeRO checkpoint weights into Diffusers format. For example:
+After fine-tuning, you need to use the `merge.py` script to merge the distributed checkpoint weights into a single checkpoint (**except for QLoRA fine-tuning**).
+The script can be found in the `quickstart/tools/converters` directory.
+For example:
 
 ```bash
 cd quickstart/tools/converters
-python zero2diffusers.py checkpoint_dir/ output_dir/ --bfloat16
+python merge.py --checkpoint_dir ckpt/ --output_dir output_dir/
+# Add --lora option if you are using LoRA fine-tuning
 ```
 
-During inference, pass the `output_dir/` to the `--transformer_path` option or parameter. For more details, please refer to the inference guide.
+### Load Checkpoint
+
+You can pass the `output_dir` to the `--lora_model_id_or_path` option if you are using LoRA fine-tuning, or to the `--transformer_path` option if you are using FSDP fine-tuning. For more details, please refer to the inference guide.
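For context, a rough illustration of what those two options map to when loading the fine-tuned weights with diffusers. The base model ID, the local paths, and the assumption that `merge.py` emits diffusers-loadable weights are all assumptions here; the inference guide is authoritative:

```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel

# FSDP/SFT case: load the merged transformer produced by merge.py (--transformer_path)
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "output_dir/", torch_dtype=torch.bfloat16
)
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", transformer=transformer, torch_dtype=torch.bfloat16
)

# LoRA case: keep the base transformer and attach the merged LoRA weights instead
# (--lora_model_id_or_path)
# pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
# pipe.load_lora_weights("output_dir/")
```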

pyproject.toml

Lines changed: 2 additions & 3 deletions
@@ -18,7 +18,6 @@ dependencies = [
 "pydantic~=2.10",
 "sentencepiece==0.2.0",
 "transformers~=4.49",
-"wandb~=0.19.8",
 "fastapi[standard]~=0.115.11",
 "fastapi_cli~=0.0.7",
 "openai~=1.67",
@@ -31,10 +30,10 @@ dependencies = [
 [project.optional-dependencies]
 finetune = [
 "datasets~=3.4",
-"deepspeed~=0.16.4",
+"wandb~=0.19.8",
 "av~=14.2.0",
 "bitsandbytes~=0.45.4",
-"tensorboard~=2.19",
+"pyyaml>=6.0.2",
 ]
 
 [project.urls]
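With this change `wandb` moves into the optional `finetune` extra (alongside the new `pyyaml` pin), so a training install would look roughly like this, assuming an editable install from the repository root:

```bash
# From the CogKit/ repository root; the [finetune] extra pulls in datasets, wandb,
# av, bitsandbytes and pyyaml on top of the base dependencies.
pip install -e ".[finetune]"
```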

quickstart/configs/accelerate_config.yaml

Lines changed: 0 additions & 26 deletions
This file was deleted.

quickstart/configs/zero/zero2.yaml

Lines changed: 0 additions & 38 deletions
This file was deleted.

quickstart/configs/zero/zero2_offload.yaml

Lines changed: 0 additions & 42 deletions
This file was deleted.
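These ZeRO configs and the old zero_to_fp32 conversion path are superseded by the FSDP flow, where the new `merge.py` collapses a sharded training checkpoint into a single file. As a rough sketch of that kind of merge, assuming checkpoints written with `torch.distributed.checkpoint` (this uses the generic PyTorch utility, not CogKit's actual script):

```python
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

# Collapse a sharded torch.distributed.checkpoint directory into one torch.save file.
dcp_to_torch_save("ckpt/", "merged_checkpoint.pt")
```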
