Download the code:

```shell
git clone https://github.com/bricksdont/multimodalhugs-pipelines
cd multimodalhugs-pipelines
```

Create a venv:

```shell
./scripts/environment/create_env.sh
```

Then install the required software:

```shell
./scripts/environment/install.sh
```
It is a good idea to set `dry_run="true"` the first time you run the code: this creates all files and executes all steps, but uses only a fraction of the training data, trains for very few steps, and so on, serving as a general sanity check. If you want to launch a real run after a dry run, you will need to manually delete the folders that the dry run created (e.g. a sub-folder of `models`), otherwise those steps will not be repeated.
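The cleanup before a real run might look like the following sketch (the sub-folder name `models/phoenix` is an assumption; use whatever folder your dry run actually created):

```shell
# Delete outputs created by a dry run so that a real run repeats all steps.
# "models/phoenix" is an assumed example name, not necessarily the real path.
DRY_RUN_DIR="models/phoenix"
mkdir -p "$DRY_RUN_DIR"       # stand-in for the folder the dry run created
rm -rf "$DRY_RUN_DIR"
test ! -d "$DRY_RUN_DIR" && echo "cleaned"
```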
Then, to train a basic model:

```shell
./scripts/running/run_basic.sh
```
This will first download and prepare the PHOENIX training data, and then train a basic MultimodalHugs model. All steps are submitted as SLURM jobs.
If the process is fully reproducible, this should result in a test set BLEU score of 10.691. This value is stored in the file `evaluations/phoenix/test_score.bleu`.
The following script will train approximately 50 models to search for good hyperparameters (each run will finish in roughly 2 hours):

```shell
./scripts/running/run_hyperparam_search.sh
```
To get a summary of the results, run:

```shell
./scripts/summaries/summarize.sh
```
To test whether the process is repeatable / reproducible, the following script trains three models with identical configurations and seeds:

```shell
./scripts/running/run_test_repeatability.sh
```
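As a toy illustration of what the repeatability check decides: a setup counts as bit-identical only if every repeat run yields exactly the same score. The scores here are the test BLEU values from the results table:

```python
# Toy check: identical configs + seeds should yield identical scores.
# Scores are the test BLEU values from the results table in this document.
scores = {"phoenix_1": 10.199, "phoenix_2": 10.217, "phoenix_3": 10.472}
reproducible = len(set(scores.values())) == 1
print("bit-identical" if reproducible else "not bit-identical")  # → not bit-identical
```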
## Results
| model | test BLEU | stopped training at epoch |
|---|---|---|
| phoenix_1 | 10.199 | 29.5051 |
| phoenix_2 | 10.217 | 22.5627 |
| phoenix_3 | 10.472 | 26.0339 |
## Investigating whether this is due to training arguments
Using only a single data worker:
| model | test BLEU | stopped training at epoch |
|---|---|---|
| phoenix_1_workers_1 | 10.324 | 23.4305 |
| phoenix_2_workers_1 | 10.244 | 32.1085 |
| phoenix_3_workers_1 | 10.189 | 21.261 |
fp32 instead of fp16:
| model | test BLEU | stopped training at epoch |
|---|---|---|
| phoenix_1_fp32 | 10.35 | 29.5051 |
| phoenix_2_fp32 | 10.5 | 22.5627 |
| phoenix_3_fp32 | 10.108 | 27.3356 |
fp32 and a single data worker:
| model | test BLEU | stopped training at epoch |
|---|---|---|
| phoenix_1_fp32_workers_1 | 10.379 | 27.7695 |
| phoenix_2_fp32_workers_1 | 9.546 | 21.261 |
| phoenix_3_fp32_workers_1 | 10.111 | 25.6 |
## Investigating whether this is due to weight initialization
To investigate whether the model weights at checkpoint zero (the setup model) are identical for two models, we compared them parameter by parameter. The following parameters are not identical:
| key | magnitude of difference |
|---|---|
| multimodal_mapper.mapping_layer.weight | 0.08649232983589172 |
| multimodal_mapper.mapping_layer.bias | 0.08635331690311432 |
| backbone.model.shared.weight | 0.0894961804151535 |
| backbone.model.encoder.embed_tokens.weight | 0.0894961804151535 |
| backbone.model.decoder.embed_tokens.weight | 0.0894961804151535 |
| backbone.lm_head.weight | 0.0894961804151535 |
These results are generated by `scripts/debugging/debug_reproducibility.py`.
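A minimal sketch of this kind of comparison, using numpy arrays as stand-ins for the real state dicts (which would be loaded with `torch.load`):

```python
import numpy as np

def diff_state_dicts(sd_a, sd_b, atol=0.0):
    """Return keys whose parameters differ, mapped to the max abs difference."""
    diffs = {}
    for key in sd_a:
        d = float(np.max(np.abs(sd_a[key] - sd_b[key])))
        if d > atol:
            diffs[key] = d
    return diffs

# Stand-ins: one parameter initialized without a fixed seed, one identical.
rng_a, rng_b = np.random.default_rng(0), np.random.default_rng(1)
sd_a = {"mapper.weight": rng_a.normal(size=4), "shared.weight": np.ones(4)}
sd_b = {"mapper.weight": rng_b.normal(size=4), "shared.weight": np.ones(4)}
print(sorted(diff_state_dicts(sd_a, sd_b)))  # → ['mapper.weight']
```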
The magnitude of the differences indicates that the weight initialization differs between runs, potentially because the seed is not set during setup, when the multimodal mapper weights are created. Specifically, the `build_model` method of `MultimodalEmbedderModel` runs without a fixed seed.
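The fix can be sketched as seeding all relevant RNGs immediately before setup, so that the mapper weights are created deterministically (the function name here is illustrative; the real pipeline would also seed torch via `torch.manual_seed`):

```python
import random
import numpy as np

def set_seed(seed: int) -> None:
    """Seed the RNGs used during model setup (torch would also be seeded)."""
    random.seed(seed)
    np.random.seed(seed)

# With the seed fixed before setup, two "initializations" are identical:
set_seed(42)
init_a = np.random.normal(size=3)
set_seed(42)
init_b = np.random.normal(size=3)
print(np.array_equal(init_a, init_b))  # → True
```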
## After fixing the random seed before setup
After applying this fix:
| model | test BLEU | stopped training at epoch |
|---|---|---|
| phoenix_1 | 9.982 | 24.2983 |
| phoenix_2 | 9.982 | 24.2983 |
| phoenix_3 | 9.982 | 24.2983 |
All initial weight parameters are now also identical between the models.