Description
I trained a ResNet model on the ImageNet dataset stored in Sky storage, using the following YAML (Task 1). After one epoch, I killed the training and tried to launch another YAML (Task 2) on another cluster, but reading or writing files in the bucket now fails with "Permission denied" (ls still works properly).
# Task 1
resources:
  cloud: aws
  accelerators: V100

file_mounts:
  /imagenet:
    name: sky-imagenet-data
    mode: MOUNT

setup: |
  git clone https://github.com/Michaelvll/examples.git
  cd ./examples/imagenet
  conda create -n imagenet python=3.9 -y
  conda activate imagenet
  conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -y
  pip install -r requirements.txt
  pip install wandb
  wandb login e2b8c727047ed1b4e28ac07e2c045faa416d6390

run: |
  conda activate imagenet
  mkdir -p /imagenet/checkpoints/ILSRC2012/imagenet/resnet18-s3
  cd ./examples/imagenet
  git pull
  python main.py -a resnet18 \
    --save-path /imagenet/checkpoints/ILSRC2012/imagenet/resnet18-s3 \
    --resume /imagenet/checkpoints/ILSRC2012/imagenet/resnet18-s3 \
    --workers 32 \
    /imagenet/datasets/ILSVRC2012/imagenet

# Task 2
resources:
  cloud: aws
  accelerators: V100

file_mounts:
  /imagenet:
    name: sky-imagenet-data
    mode: MOUNT

PS: The bucket behaves a bit strangely. When I tore down the cluster and re-launched it, or waited several hours and retried, the "Permission denied" problem persisted. However, after I removed public access from the bucket using the S3 console, the "Permission denied" error seems to have disappeared.
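
For anyone trying to reproduce this, here is a minimal sketch of a probe that separates "listing works" from "reading/writing works" on the mounted path. `probe_mount` is a hypothetical helper written for this report, not part of any library, and it assumes the bucket is mounted at a local directory such as /imagenet as in the YAML above. It only uses the Python standard library.

```python
import os
import uuid


def probe_mount(path):
    """Try to list, write, and read under `path`; report what succeeded.

    Hypothetical helper for reproducing the issue, not an existing API.
    """
    results = {"list": False, "write": False, "read": False}

    # Listing corresponds to what `ls` does on the mount.
    try:
        os.listdir(path)
        results["list"] = True
    except OSError as e:
        results["list_error"] = repr(e)

    # Write a small, uniquely named file so existing data is never touched.
    probe = os.path.join(path, f".probe-{uuid.uuid4().hex}")
    try:
        with open(probe, "w") as f:
            f.write("ok")
        results["write"] = True
    except OSError as e:
        results["write_error"] = repr(e)
        return results

    # Read the file back, then clean up regardless of the outcome.
    try:
        with open(probe) as f:
            results["read"] = f.read() == "ok"
    except OSError as e:
        results["read_error"] = repr(e)
    finally:
        try:
            os.remove(probe)
        except OSError:
            pass
    return results
```

Running `print(probe_mount("/imagenet"))` on the Task 2 cluster should show whether only the write/read steps fail while listing succeeds, which is the behavior described above.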