Merged
Changes from all commits (25 commits)
9c5021d
feat: implements finetune API
OleehyO Mar 17, 2025
4eaabc2
Merge branch 'dev-lhy' into 'dev'
Mar 17, 2025
23bd035
feat: adds unit testing
Mar 17, 2025
6fbdf61
Merge branch 'dev-hxw' into 'dev'
Mar 17, 2025
af861b9
Merge remote-tracking branch 'test/main' into refactor
zhangch9 Mar 18, 2025
c1ed583
nit: resolves conflicts
zhangch9 Mar 18, 2025
8e47011
nit: `CogKit` -> `cogkit`
zhangch9 Mar 18, 2025
02abb3c
nit: adds params checking
zhangch9 Mar 18, 2025
714feb4
fix: failed to import `LoraBaseMixin`
zhangch9 Mar 18, 2025
511323f
[fix] Correct logical error
OleehyO Mar 18, 2025
6817196
[feat] Add auto-inference for num_frames and fps with resolution vali…
OleehyO Mar 18, 2025
236050a
[docs] Align function docstring with CLI documentation
OleehyO Mar 18, 2025
80d6f14
[WIP][docs] Wait for fix
OleehyO Mar 18, 2025
5572ef4
Merge branch 'main' into refactor
zRzRzRzRzRzRzR Mar 18, 2025
7d4d302
feat: implements openai-compatible image generation API
zhangch9 Mar 19, 2025
183c021
add guess validate dimensions (#3)
sixsixcoder Mar 19, 2025
a454a2e
chore: cleanup
zhangch9 Mar 19, 2025
9ed13d9
add image generation api
Mar 19, 2025
421aed2
add image generation api
sixsixcoder Mar 19, 2025
11f5e0e
add image generation api
sixsixcoder Mar 19, 2025
b18db86
del route response_class
sixsixcoder Mar 19, 2025
f3b8576
chore: updates pyproject metadata
zhangch9 Mar 19, 2025
15089b5
Merge remote-tracking branch 'test/main' into refactor
zhangch9 Mar 19, 2025
634c835
nit: removes unused files
zhangch9 Mar 19, 2025
cf91ec8
nit: adds `video` deps
zhangch9 Mar 19, 2025
1 change: 1 addition & 0 deletions .env.template
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
COGVIEW4_PATH=THUDM/CogView4-6B
7 changes: 6 additions & 1 deletion .gitignore
@@ -241,7 +241,13 @@ Temporary Items

# End of https://www.toptal.com/developers/gitignore/api/macos

# * hatch-vcs
src/cogkit/_version.py

# * pdm
.pdm-python

# * a temporary directory to store files you do not wish to share.
tmp/

**/*.safetensor
@@ -254,7 +260,6 @@ tmp/
**/*foo*
**/train_result

**/uv.lock

webdoc/
**/wandb/
4 changes: 2 additions & 2 deletions README.md
@@ -2,9 +2,9 @@

## Introduction

**CogKit** is an open-source project that provides a user-friendly interface for researchers and developers to utilize ZhipuAI's [**CogView**](https://huggingface.co/collections/THUDM/cogview-67ac3f241eefad2af015669b) (image generation) and [**CogVideoX**](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) (video generation) models. It streamlines multimodal tasks such as **text-to-image (T2I)**, **text-to-video (T2V)**, and **image-to-video (I2V)**. Users must comply with legal and ethical guidelines to ensure responsible implementation.
**`cogkit`** is an open-source project that provides a user-friendly interface for researchers and developers to utilize ZhipuAI's [**CogView**](https://huggingface.co/collections/THUDM/cogview-67ac3f241eefad2af015669b) (image generation) and [**CogVideoX**](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) (video generation) models. It streamlines multimodal tasks such as **text-to-image (T2I)**, **text-to-video (T2V)**, and **image-to-video (I2V)**. Users must comply with legal and ethical guidelines to ensure responsible implementation.

Visit our [**Docs**](https://thudum.github.io/CogKit) to start.
Visit our [**Docs**](https://thudm.github.io/CogKit) to start.

## Features

5 changes: 3 additions & 2 deletions docker/dockerfile → docker/Dockerfile
@@ -12,11 +12,12 @@ RUN apt-get update && \


###### install cogkit ######
REPO_NAME=cogkit
WORKDIR /app

RUN git https://github.com/thudm/CogKit
# refactor url later (and maybe repo name)
RUN git clone https://github.com/THUDM/CogKit.git
WORKDIR CogKit

# TODO: use `pdm sync`
RUN pip install uv
RUN uv pip install . --system
8 changes: 1 addition & 7 deletions docs/01-Intro.md
@@ -4,13 +4,7 @@ slug: /

# Introduction

CogKit is a powerful framework for working with ZhipuAI Cog Series models, focusing on multimodal generation and fine-tuning capabilities.
It provides a unified interface for various AI tasks including text-to-image, text-to-video, and image-to-video generation.

## Key Features

- **Command-line Interface**: Easy-to-use CLI and Python API for both inference and fine-tuning
- **Fine-tuning Support**: With LoRA or full model fine-tuning support to customize models with your own data
`cogkit` is a powerful framework for working with cognitive AI models, focusing on multi-modal generation and fine-tuning capabilities. It provides a unified interface for various AI tasks including text-to-image, text-to-video, and image-to-video generation.

## Supported Models

27 changes: 9 additions & 18 deletions docs/02-Installation.md
@@ -3,37 +3,29 @@

# Installation

CogKit can be installed using pip. We recommend using a virtual environment to avoid conflicts with other packages.
`cogkit` can be installed using pip. We recommend using a virtual environment to avoid conflicts with other packages.

## Requirements

- Python 3.10 or higher
- CUDA-compatible GPU (for optimal performance)
- At least 8GB of GPU memory for inference, 16GB+ recommended for fine-tuning
- OpenCV and PyTorch

## Installation Steps

### Create a virtual environment (recommended)
### OpenCV

```bash
# Using venv
python -m venv cogkit-env
source cogkit-env/bin/activate

# Or using conda
conda create -n cogkit-env python=3.10
conda activate cogkit-env
```
Please refer to the [opencv-python installation guide](https://github.com/opencv/opencv-python?tab=readme-ov-file#installation-and-usage) for instructions on installing OpenCV according to your system.

### Install PyTorch
### PyTorch

Please refer to the [PyTorch installation guide](https://pytorch.org/get-started/locally/) for instructions on installing PyTorch according to your system.

### Install Cogkit
### CogKit

1. Install `cogkit`:

1. Install Cogkit:
```bash
pip install cogkit@git+https://github.com/thudm/cogkit.git
pip install cogkit@git+https://github.com/THUDM/cogkit.git
```

2. Optional: for video tasks (e.g. text-to-video), install additional dependencies:
@@ -42,7 +34,6 @@ Please refer to the [PyTorch installation guide](https://pytorch.org/get-started
pip install -e .[video]
```


### Verify installation

You can verify that cogkit is installed correctly by running:
55 changes: 16 additions & 39 deletions docs/03-Inference/01-CLI.md
@@ -4,7 +4,7 @@
<!-- TODO: check this doc -->
# Command-Line Interface

CogKit provides a powerful command-line interface (CLI) that allows you to perform various tasks without writing Python code. This guide covers the available commands and their usage.
`cogkit` provides a powerful command-line interface (CLI) that allows you to perform various tasks without writing Python code. This guide covers the available commands and their usage.

## Overview

@@ -15,68 +15,45 @@ cogkit [OPTIONS] COMMAND [ARGS]...
```

Available commands:

- `inference`: Generate images or videos using AI models
- `launch`: Launch a web UI for interactive use
- `launch`: Launch an API server

Global options:

- `-v, --verbose`: Increase verbosity (can be used multiple times)

## Inference Command

The `inference` command allows you to generate images and videos:

```bash
cogkit inference [OPTIONS] PROMPT MODEL_ID_OR_PATH
```
The `inference` command allows you to generate images or videos:

### Examples

```bash
# Generate an image from text
cogkit inference "a beautiful sunset over mountains" "THUDM/CogView4-6B"

# Generate a video from text
cogkit inference "a cat playing with a ball" "THUDM/CogVideoX1.5-5B"

```

## Fine-tuning Command

The `finetune` command allows you to fine-tune models with your own data:

```bash
cogkit finetune [OPTIONS]
```
<!-- TODO: Add example for i2v -->

> Note: The fine-tuning command is currently under development. Please check back for updates.
:::tip
See `cogkit inference --help` for more information.
:::

<!-- TODO: add docs for launch server -->
## Launch Command

The `launch` command starts a web UI for interactive use:
The `launch` command starts an API server:

<!-- FIXME: Add examples -->
```bash
cogkit launch [OPTIONS]
...
```

This launches a web interface where you can:
- Generate images and videos interactively
- Upload images for image-to-video generation
- Adjust generation parameters
- View and download results

### Options

| Option | Description |
|--------|-------------|
| `--host TEXT` | Host to bind the server to (default: 127.0.0.1) |
| `--port INTEGER` | Port to bind the server to (default: 7860) |
| `--share` | Create a public URL |
Please refer to [API](./02-API.md#api-server) for details on how to interact with the API server using client interfaces.

### Example

```bash
# Launch the web UI on the default port
cogkit launch

```
:::tip
See `cogkit launch --help` for more information.
:::
8 changes: 6 additions & 2 deletions docs/03-Inference/02-API.md
@@ -3,11 +3,11 @@

# API

CogKit provides a powerful inference API for generating images and videos using various AI models. This document covers both the Python API and API server.
`cogkit` provides a powerful inference API for generating images and videos using various AI models. This document covers both the Python API and API server.

## Python API

You can also use Cogkit programmatically in your Python code:
You can also use `cogkit` programmatically in your Python code:

```python
from cogkit.generation import generate_image, generate_video
@@ -32,6 +32,10 @@ video = generate_video(
)
video.save("cat_video.mp4")
```
<!-- TODO: add examples for i2v -->

<!-- FIXME: correct url -->
See function signatures in [generation.py](...) for more details.

## API Server

2 changes: 1 addition & 1 deletion docs/04-Finetune/02-Quick Start.md
@@ -30,11 +30,11 @@ We recommend that you read the corresponding [model card](../05-Model%20Card.mdx
4. If you are using ZeRO strategy, refer to `accelerate_config.yaml` to confirm your ZeRO level and number of GPUs

5. Run the script, for example:

```bash
bash scripts/train_ddp_t2i.sh
```


## Load Fine-tuned Model

### LoRA
10 changes: 5 additions & 5 deletions docs/04-Finetune/03-Data Format.md
@@ -7,6 +7,7 @@
`src/cogkit/finetune/data` directory contains various dataset templates for fine-tuning different models, please refer to the corresponding dataset template based on your task type:

## Text-to-Image Conversion Dataset (t2i)

- Each directory contains a set of image files (`.png`)
- The `metadata.jsonl` file contains text descriptions for each image

@@ -34,12 +35,11 @@
```

:::info
- Image files are optional; if not provided, the system will default to using the first frame of the video as the input image
- When image files are provided, they are associated with the video file of the same name through the id field
- Image files are optional; if not provided, the system will default to using the first frame of the video as the input image
- When image files are provided, they are associated with the video file of the same name through the id field
:::

## Notes

- Training sets (`train/`) are used for model training
- Test sets (`test/`) are used for evaluating model performance
- Each dataset will generate a `.cache/` directory during training, used to store preprocessed cache data. If the dataset changes, you need to **manually delete this directory** and retrain.
- Training sets (`train/`) are used for model training, test sets (`test/`) are used for evaluating model performance
- Each dataset will generate a `.cache/` directory during training, used to store preprocessed data. If the dataset changes, you need to **manually delete this directory** and retrain.
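The t2i layout described above (a directory of `.png` files plus a `metadata.jsonl` carrying one text description per image) can be sketched as follows. The field names `file_name` and `prompt` are assumptions for illustration only; the authoritative schema is in the templates under `src/cogkit/finetune/data`.

```python
import json
import tempfile
from pathlib import Path


def write_t2i_split(root: Path, records: list[dict]) -> Path:
    """Write a metadata.jsonl for a t2i split: one JSON object per line.

    The record schema here ("file_name", "prompt") is hypothetical --
    consult the dataset templates in src/cogkit/finetune/data.
    """
    root.mkdir(parents=True, exist_ok=True)
    meta = root / "metadata.jsonl"
    with meta.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return meta


# Build a throwaway train/ split next to where the .png files would live.
root = Path(tempfile.mkdtemp()) / "train"
meta = write_t2i_split(
    root,
    [{"file_name": "0001.png", "prompt": "a watercolor fox in a forest"}],
)
lines = meta.read_text(encoding="utf-8").splitlines()
```

Remember that after changing such a dataset, the generated `.cache/` directory must be deleted manually before retraining, as noted above.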
8 changes: 4 additions & 4 deletions docs/05-Model Card.mdx
@@ -3,7 +3,7 @@

# Model Card

Here is a detailed description of how CogKit supports models.
Here is a detailed description of how `cogkit` supports models.

All training requirements must be strictly followed as specified in the table below, including resolution, number of frames, prompt token limit, and video length requirements.

@@ -27,9 +27,9 @@ All training requirements must be strictly followed as specified in the table be
<td style={{ textAlign: "center" }}>September 19, 2024</td>
</tr>
<tr>
<td style={{ textAlign: "center" }}>Video Resolution</td>
<td style={{ textAlign: "center" }}>Video Resolution (W * H) </td>
<td colspan="1" style={{ textAlign: "center" }}>1360 * 768</td>
<td colspan="1" style={{ textAlign: "center" }}>Min(W, H) &#61 768 <br/> 768 ≤ Max(W, H) ≤ 1360 <br/> Max(W, H) % 16 &#61 0</td>
<td colspan="1" style={{ textAlign: "center" }}>Min(W, H) = 768 <br/> 768 ≤ Max(W, H) ≤ 1360 <br/> Max(W, H) % 16 = 0</td>
<td colspan="3" style={{ textAlign: "center" }}>720 * 480</td>
</tr>
<tr>
@@ -80,7 +80,7 @@ All training requirements must be strictly followed as specified in the table be
</tr>
<tr>
<td style={{ textAlign: "center" }}>Resolution</td>
<td style={{ textAlign: "center" }}>512 ≤ (W, H) ≤ 2048 <br/> H * W ≤ 2^{21} <br/> Max(W, H) % 32 &#61 0 </td>
<td style={{ textAlign: "center" }}>512 ≤ (W, H) ≤ 2048 <br/> H * W ≤ 2^{21} <br/> Max(W, H) % 32 = 0 </td>
</tr>
<tr>
<td style={{ textAlign: "center" }}>Prompt Language</td>
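The resolution constraints stated in the model card table above (CogVideoX1.5 video: Min(W, H) = 768, 768 ≤ Max(W, H) ≤ 1360, Max(W, H) % 16 = 0; CogView4 image: 512 ≤ W, H ≤ 2048, W * H ≤ 2^21, Max(W, H) % 32 = 0) translate directly into a small checker. These helpers are an illustrative sketch of the table's rules, not part of cogkit's API.

```python
def valid_cogvideox15_resolution(w: int, h: int) -> bool:
    """CogVideoX1.5: Min(W, H) = 768, 768 <= Max(W, H) <= 1360, Max(W, H) % 16 = 0."""
    lo, hi = min(w, h), max(w, h)
    return lo == 768 and 768 <= hi <= 1360 and hi % 16 == 0


def valid_cogview4_resolution(w: int, h: int) -> bool:
    """CogView4: 512 <= W, H <= 2048, W * H <= 2**21, Max(W, H) % 32 = 0."""
    return (
        512 <= w <= 2048
        and 512 <= h <= 2048
        and w * h <= 2 ** 21
        and max(w, h) % 32 == 0
    )
```

For instance, the table's 1360 * 768 default satisfies the CogVideoX1.5 rule, while the older 720 * 480 default does not (its shorter side is not 768).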
20 changes: 0 additions & 20 deletions hatch.toml

This file was deleted.
