Context
cluv only syncs code and results between clusters, but we also need datasets to train or validate models. Currently, users have to copy the data manually to each cluster.
Proposed solution
- Add a new command like
cluv datasets add <path_to_dataset> <destination_cluster>, where destination_cluster could be optional?
- We also need to add a new config field for where to save the datasets (with a symlink on a scratch), like
results_path, with an option to override it per cluster:
[tool.cluv]
results_path = "logs"
datasets_path = "data"
[tool.cluv.clusters.mila]
datasets_path = "mila_data"
Additional context
There are some special cases to check:
- Datasets from Hugging Face
- Datasets already saved on a cluster (e.g.
/datasets on Mila)
- Transfer options like Globus.