We provide instructions to train or fine-tune PLM on a custom dataset.
We support both image and video conversation datasets in JSONL format. Each line of the JSONL file should be a JSON object in one of the following formats (pretty-printed here for readability):
{
"image": "<image path>",
"conversations": [
{
"from": "human",
"value": "human instruction"
},
{
"from": "assistant",
"value": "model response"
}
]
}
{
"video": "<video path>",
"conversations": [
{
"from": "human",
"value": "human instruction"
},
{
"from": "assistant",
"value": "model response"
}
]
}
Note that image entries must include the image key, while video entries must include the video key. The conversations key is common to both types.
Tip
The repo also supports text-only, multi-image, image-region, video-region-caption (RCap), video-region-temporal-localization (RTLoc) and video-region-dense-captioning (RDCap) tasks. Please download the provided dummy-datasets for an example of each dataset type.
Given the dataset jsonl file, we can register a new dataset by adding an entry in apps/plm/configs/datasets.yaml.
custom_dataset_name:
  annotation: path/to/the/jsonl/file.jsonl
  root_dir: path/to/the/image-or-video/root-dir
Please refer to apps/plm/configs/datasets.yaml for the dummy image, video and grounding datasets that are already registered.
Training PLM involves creating a .yaml configuration file that defines all model- and training-related parameters. Please refer to the provided plm_configs for details.
Tip
To run the following code, download the dummy-datasets and extract them to apps/plm/dummy_datasets.
Given a .yaml configuration file, please run the following command to launch the training on a single node with 8 GPUs.
torchrun --nproc-per-node 8 -m apps.plm.train config=apps/plm/configs/stage_3/plm_3b.yaml
To run inference or evaluation, first consolidate the checkpoints using the following command:
python apps/plm/consolidate.py --ckpt <path to the saved checkpoints>
After consolidating the checkpoints, you can run inference using the following command:
# Set --media_type to "video" to run inference on a video.
python apps/plm/generate.py \
--ckpt facebook/Perception-LM-3B \
--media_type image \
--media_path <path to image or video> \
--question <Question to be asked about the image or video>
For evaluation, please refer to evaluation.md.
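In the inference command, a question that contains spaces has to reach generate.py as a single argument. As a hedged sketch, the command can be assembled as an argument list in Python and quoted with the standard shlex module. The flags are the ones shown above. `build_generate_command`, the sample media path and the sample question are hypothetical, and restricting `media_type` to the two values mentioned above is an assumption.

```python
import shlex


def build_generate_command(ckpt, media_type, media_path, question):
    """Assemble the generate.py invocation shown above as an argument list."""
    # Assumption: only the two media types mentioned above are accepted.
    if media_type not in ("image", "video"):
        raise ValueError("media_type must be 'image' or 'video'")
    return [
        "python", "apps/plm/generate.py",
        "--ckpt", ckpt,
        "--media_type", media_type,
        "--media_path", media_path,
        "--question", question,
    ]


command = build_generate_command(
    ckpt="facebook/Perception-LM-3B",
    media_type="image",
    media_path="path/to/image.jpg",  # hypothetical path
    question="What is shown in this image?",  # hypothetical question
)
# shlex.join quotes arguments that contain spaces or shell metacharacters.
print(shlex.join(command))
```

The same list can be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting altogether.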
We also provide a script to launch distributed multi-node training on Slurm. Please use the provided utility named stool.py.
python -m core.stool script=apps.plm.train config=apps/plm/configs/stage_3/plm_8b.yaml qos=<QoS> nodes=<num_of_nodes>
We provide a step-by-step example of how to fine-tune PLM on a public dataset, which elaborates on each of the steps above. Please see finetune_example.md.