multimodalhugs-pipelines

Download the code:

git clone https://github.com/bricksdont/multimodalhugs-pipelines
cd multimodalhugs-pipelines

Basic setup

Create a venv:

./scripts/environment/create_env.sh

Then install required software:

./scripts/environment/install.sh

Run experiments

Single experiment

It is a good idea to set dry_run="true" the first time you run the code. A dry run creates all files and executes all steps as a general sanity check, but uses only a fraction of the training data, trains for very few steps, and so on. If you want to launch a real run after a dry run, you will need to manually delete the folders that the dry run created (e.g. a sub-folder of models); otherwise those steps will not be repeated.

Then to train a basic model:

./scripts/running/run_basic.sh

This will first download and prepare the PHOENIX training data, and then train a basic MultimodalHugs model. All steps are submitted as SLURM jobs.

If the process is fully reproducible, this should result in a test set BLEU score of 10.691. The score is stored in the file evaluations/phoenix/test_score.bleu.

Hyperparameter exploration

The following script will train approximately 50 models to search for good hyperparameters (each run will finish in roughly 2 hours):

./scripts/running/run_hyperparam_search.sh

To get a summary of the results, run:

./scripts/summaries/summarize.sh
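
For orientation, summarizing essentially means collecting the per-run BLEU files into one table. Below is a minimal sketch of that idea in Python; the directory layout (evaluations/<run_name>/test_score.bleu, generalizing the path shown above) and the file format (a single numeric score) are assumptions, and the actual summarize.sh may work differently:

from pathlib import Path

def collect_scores(evaluations_dir: str = "evaluations") -> dict[str, float]:
    """Collect test BLEU scores from all runs, assuming one
    test_score.bleu file per run directory (hypothetical layout)."""
    scores = {}
    for score_file in Path(evaluations_dir).glob("*/test_score.bleu"):
        run_name = score_file.parent.name
        # assumes the file contains a single numeric BLEU score
        scores[run_name] = float(score_file.read_text().strip())
    return scores

if __name__ == "__main__":
    # print runs sorted by BLEU, best first
    for run, bleu in sorted(collect_scores().items(), key=lambda kv: -kv[1]):
        print(f"{run}\t{bleu:.3f}")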

Check if training / generation is reproducible

./scripts/running/run_test_repeatability.sh

This will train three models with identical configurations and seeds, to test if the process is repeatable / reproducible.
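
Whether the runs agree can then be checked by comparing their final scores, as in this sketch (same assumed score-file layout as above):

from pathlib import Path

# run names match the tables below; the score-file layout is assumed
runs = ["phoenix_1", "phoenix_2", "phoenix_3"]
scores = [float(Path(f"evaluations/{run}/test_score.bleu").read_text().strip())
          for run in runs]
print("scores:", scores)
print("identical:", len(set(scores)) == 1)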

Results

model        test BLEU    stopped training at epoch
phoenix_1    10.199       29.5051
phoenix_2    10.217       22.5627
phoenix_3    10.472       26.0339

Investigating whether the variation is due to training arguments

Using only a single data worker:

model                 test BLEU    stopped training at epoch
phoenix_1_workers_1   10.324       23.4305
phoenix_2_workers_1   10.244       32.1085
phoenix_3_workers_1   10.189       21.261

fp32 instead of fp16:

model            test BLEU    stopped training at epoch
phoenix_1_fp32   10.35        29.5051
phoenix_2_fp32   10.5         22.5627
phoenix_3_fp32   10.108       27.3356

fp32 and a single data worker:

model                      test BLEU    stopped training at epoch
phoenix_1_fp32_workers_1   10.379       27.7695
phoenix_2_fp32_workers_1   9.546        21.261
phoenix_3_fp32_workers_1   10.111       25.6

Investigating whether the variation is due to weight initialization

Comparing the model weights at checkpoint zero (the setup model) between two models, these are the parameters that are not identical:

key                                         magnitude of difference
multimodal_mapper.mapping_layer.weight      0.08649232983589172
multimodal_mapper.mapping_layer.bias        0.08635331690311432
backbone.model.shared.weight                0.0894961804151535
backbone.model.encoder.embed_tokens.weight  0.0894961804151535
backbone.model.decoder.embed_tokens.weight  0.0894961804151535
backbone.lm_head.weight                     0.0894961804151535

These results are generated by scripts/debugging/debug_reproducibility.py.
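
For illustration, a minimal sketch of this kind of comparison, assuming the checkpoints are ordinary PyTorch state dicts and taking the mean absolute difference as the magnitude (the actual script may load checkpoints and measure differences differently):

import torch

def compare_state_dicts(path_a: str, path_b: str) -> None:
    """Print the parameters that differ between two checkpoints."""
    state_a = torch.load(path_a, map_location="cpu")
    state_b = torch.load(path_b, map_location="cpu")
    for key in state_a:
        diff = (state_a[key].float() - state_b[key].float()).abs().mean().item()
        if diff > 0.0:
            print(f"{key}\t{diff}")

# hypothetical checkpoint paths
compare_state_dicts("models/phoenix_1/checkpoint-0/pytorch_model.bin",
                    "models/phoenix_2/checkpoint-0/pytorch_model.bin")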

The magnitude of the differences suggests that the weight initialization differs between runs, potentially because the seed is not set during setup, when the multimodal mapper weights are created. Specifically, the build_model method of MultimodalEmbedderModel runs without a fixed seed.
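
The fix, described next, boils down to seeding all random number generators immediately before the weights are created. A self-contained illustration of the principle (not the actual MultimodalHugs code):

import torch
from transformers import set_seed

# without seeding, two independently created layers get different weights
a = torch.nn.Linear(4, 4).weight.detach()
b = torch.nn.Linear(4, 4).weight.detach()
print(torch.equal(a, b))  # False (almost surely)

# seeding immediately before each creation makes initialization identical
set_seed(42)
a = torch.nn.Linear(4, 4).weight.detach().clone()
set_seed(42)
b = torch.nn.Linear(4, 4).weight.detach().clone()
print(torch.equal(a, b))  # True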

After fixing the random seed before setup

After applying this fix:

model        test BLEU    stopped training at epoch
phoenix_1    9.982        24.2983
phoenix_2    9.982        24.2983
phoenix_3    9.982        24.2983

In addition, all initial weight parameters are now identical between models.
