Sky storage becomes permission denied after training with imagenet #740

Open
@Michaelvll

I trained a ResNet model on the ImageNet dataset stored in Sky storage, using the YAML below (Task 1). After one epoch, I killed the training job and tried to launch another YAML (Task 2) on a different cluster, but reading and writing files in the bucket now fails with Permission denied (ls still works).

# Task 1
resources:
  cloud: aws
  accelerators: V100

file_mounts:
  /imagenet:
    name: sky-imagenet-data
    mode: MOUNT

setup: |
  git clone https://github.com/Michaelvll/examples.git
  cd ./examples/imagenet
  conda create -n imagenet python=3.9 -y
  conda activate imagenet
  conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -y
  pip install -r requirements.txt
  pip install wandb
  wandb login e2b8c727047ed1b4e28ac07e2c045faa416d6390

run: |
  conda activate imagenet
  mkdir -p /imagenet/checkpoints/ILSRC2012/imagenet/resnet18-s3

  cd ./examples/imagenet
  git pull
  python main.py -a resnet18 \
    --save-path /imagenet/checkpoints/ILSRC2012/imagenet/resnet18-s3 \
    --resume /imagenet/checkpoints/ILSRC2012/imagenet/resnet18-s3 \
    --workers 32 \
    /imagenet/datasets/ILSVRC2012/imagenet

# Task 2
resources:
  cloud: aws
  accelerators: V100

file_mounts:
  /imagenet:
    name: sky-imagenet-data
    mode: MOUNT

PS: The bucket's behavior is a bit odd. Tearing down the cluster and re-launching, or waiting several hours and retrying, did not fix the Permission denied error. However, after I removed the bucket's public access through the S3 console, the Permission denied error seemed to disappear.
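
To narrow down which operations fail on the mounted path, a small probe along these lines could be run on the second cluster. This is only a sketch, not part of the original report: the function name `probe_mount` and the scratch-file naming are hypothetical, and it assumes the bucket is mounted at `/imagenet` as in the YAML above. It tries listing, reading, and writing separately and records which of them raise an error.

```python
import os
import uuid


def probe_mount(path):
    """Try list, read, and write under `path`.

    Returns a dict mapping each operation to "ok", a skip note, or the
    error it raised. `probe_mount` is a hypothetical helper for this issue.
    """
    results = {}

    # 1. Listing (reported to keep working while read/write fail).
    try:
        entries = sorted(os.listdir(path))
        results["list"] = "ok"
    except OSError as exc:
        entries = []
        results["list"] = f"{type(exc).__name__}: {exc}"

    # 2. Reading: open the first regular file found at the top level.
    results["read"] = "skipped (no regular file at top level)"
    for name in entries:
        candidate = os.path.join(path, name)
        if os.path.isfile(candidate):
            try:
                with open(candidate, "rb") as f:
                    f.read(1)
                results["read"] = "ok"
            except OSError as exc:
                results["read"] = f"{type(exc).__name__}: {exc}"
            break

    # 3. Writing: create and then delete a uniquely named scratch file.
    scratch = os.path.join(path, f".probe-{uuid.uuid4().hex}")
    try:
        with open(scratch, "wb") as f:
            f.write(b"probe")
        os.remove(scratch)
        results["write"] = "ok"
    except OSError as exc:
        results["write"] = f"{type(exc).__name__}: {exc}"

    return results
```

Calling `print(probe_mount("/imagenet"))` on each cluster, before and after changing the bucket's public access setting, would show whether the failure is limited to reads, to writes, or affects both.
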
