Standalone utilities for dataset preprocessing and checkpoint export.
- preprocessing.sh
  - Downloads the public preprocessing datasets.
  - Runs the conversion scripts under preprocessing/.
  - Optionally downloads the mixed metadata set used by the training pipeline.
- consolidate_checkpoint.py
  - Converts a NeMo DCP checkpoint directory into Hugging Face-compatible safetensors shards.
  - Optionally copies tokenizer and config artifacts into the output directory.
preprocessing.sh prepares Bard-VL training data from the public source datasets referenced by the repository.
Run the full pipeline:
bash tools/preprocessing.sh

Run in the background with log redirection:
bash tools/preprocessing.sh --background --log-file logs/preprocessing.log
tail -f logs/preprocessing.log

Run a single stage:
bash tools/preprocessing.sh --download-only
bash tools/preprocessing.sh --convert-only
bash tools/preprocessing.sh --metadata-only

Preview commands without executing them:
bash tools/preprocessing.sh --dry-run

Full preprocessing notes are in tools/preprocessing/README.md.
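The single-stage flags above can also be driven from Python when a job needs to stop at the first failing stage. The following is a minimal sketch, not part of the repository: the helper names `stage_commands` and `run_stages` are hypothetical, and it assumes that running the three stages in the order download, convert, metadata matches the full pipeline and that `--dry-run` may be combined with a stage flag. Check `--help` before relying on either assumption.

```python
import subprocess

SCRIPT = "tools/preprocessing.sh"
# Stage flags as documented above; the ordering here is an assumption.
STAGE_FLAGS = ("--download-only", "--convert-only", "--metadata-only")


def stage_commands(dry_run: bool = False) -> list[list[str]]:
    """Build one command per stage, optionally appending --dry-run."""
    commands = []
    for flag in STAGE_FLAGS:
        cmd = ["bash", SCRIPT, flag]
        if dry_run:
            # Assumption: --dry-run can be combined with a stage flag.
            cmd.append("--dry-run")
        commands.append(cmd)
    return commands


def run_stages(dry_run: bool = False) -> None:
    """Run each stage in turn; check=True raises on the first failure."""
    for cmd in stage_commands(dry_run):
        subprocess.run(cmd, check=True)
```

Calling `run_stages(dry_run=True)` first gives a preview before committing to the downloads.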
consolidate_checkpoint.py exports a training checkpoint saved in NeMo DCP format to a standard safetensors model directory.
- --dcp-dir
  - Directory containing the DCP model state, usually .../epoch_x_step_y/model
- --output-dir
  - Destination directory for the converted safetensors files
- --source-model-dir
  - Directory containing tokenizer/config artifacts such as *.json and *.txt
  - Required unless --skip-copy-artifacts is set
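The rule that `--source-model-dir` is required unless `--skip-copy-artifacts` is set can be sketched with `argparse`. This is a minimal illustration of the documented contract, not the script's actual source: the use of `argparse`, the required status of `--dcp-dir` and `--output-dir`, and the function names `build_parser` and `parse_args` are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Sketch of the documented consolidate_checkpoint.py options."
    )
    # Assumption: both directories are mandatory.
    parser.add_argument("--dcp-dir", required=True,
                        help="Directory containing the DCP model state")
    parser.add_argument("--output-dir", required=True,
                        help="Destination for the converted safetensors files")
    parser.add_argument("--source-model-dir", default=None,
                        help="Directory with tokenizer/config artifacts")
    parser.add_argument("--skip-copy-artifacts", action="store_true",
                        help="Export weights only")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Conditional requirement: artifacts need a source unless copying is skipped.
    if not args.skip_copy_artifacts and args.source_model_dir is None:
        parser.error(
            "--source-model-dir is required unless --skip-copy-artifacts is set"
        )
    return args
```

If the real script does use `argparse` with its default prefix matching, a shortened spelling such as `--skip-copy` would resolve to `--skip-copy-artifacts` as long as no other option shares that prefix.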
Convert a checkpoint and copy tokenizer/config files from a pretrained model directory:
python3 tools/consolidate_checkpoint.py \
--dcp-dir exps/bard_vl_b4_mask_4b_instruct/epoch_0_step_74999/model \
--output-dir exps/bard_vl_b4_mask_4b_instruct/epoch_0_step_74999/safetensors \
--source-model-dir pretrained_models/Qwen3-VL-Bard-4B-Instruct

Control shard size and target dtype:
python3 tools/consolidate_checkpoint.py \
--dcp-dir exps/bard_vl_b4_mask_4b_instruct/epoch_0_step_74999/model \
--output-dir exps/bard_vl_b4_mask_4b_instruct/epoch_0_step_74999/safetensors \
--source-model-dir pretrained_models/Qwen3-VL-Bard-4B-Instruct \
--max-shard-size-gb 2 \
--dtype bf16

Export weights only:
python3 tools/consolidate_checkpoint.py \
--dcp-dir exps/bard_vl_b4_mask_4b_instruct/epoch_0_step_74999/model \
--output-dir exps/bard_vl_b4_mask_4b_instruct/epoch_0_step_74999/safetensors \
--skip-copy

Run bash tools/preprocessing.sh --help or python3 tools/consolidate_checkpoint.py --help for the full option list.
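As a rough guide to how `--max-shard-size-gb` and `--dtype` interact, the number of output shards can be estimated from the parameter count and the element width. The sketch below is back-of-the-envelope arithmetic, not the exporter's logic. It assumes a model of about 4 billion parameters (inferred from "4b" in the example paths), treats 1 GB as 2^30 bytes, and uses the hypothetical helper `estimate_shards`; the dtype names other than `bf16` are illustrative and may not match the script's accepted values.

```python
import math

# Bytes per element; only "bf16" appears in the examples above,
# the other keys are illustrative assumptions.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}


def estimate_shards(num_params: int, dtype: str, max_shard_size_gb: float) -> int:
    """Lower-bound estimate of shard count for a given dtype and shard cap."""
    total_bytes = num_params * BYTES_PER_ELEMENT[dtype]
    shard_bytes = max_shard_size_gb * 1024**3  # assumption: GB means 2**30 bytes
    return max(1, math.ceil(total_bytes / shard_bytes))


print(estimate_shards(4_000_000_000, "bf16", 2))  # 4
```

The real shard count may be higher, since exporters typically keep each tensor whole within a single file rather than splitting it at the size limit.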