[S3 storage_plugin] Seeing No credential issue at random intervals when saving / restoring snapshot from S3. #142

@hbikki

🐛 Describe the bug

When loading a snapshot from S3, we intermittently see a NoCredentialsError. The failure occurs at random intervals.
The issue looks very similar to aio-libs/aiobotocore#1006.
It did not occur when running 5 or fewer processes (an assumption based on tests with varying process counts), but the error is consistent when running more than 5 processes.

 Snapshot.take(path=str(save_dir), app_state=app_state)
  • Experimented with adding retries with exponential backoff when restoring the snapshot.
  • Tried different versions of aiobotocore.
  • Verified from the logs that the _credential value is present.
  • Verified from the logs that credentials are available:
    /0 [6]:[2023-05-14 00:49:02,211][aiobotocore.credentials][INFO] - Found credentials from IAM Role:
  • The issue doesn't happen when the credentials are set via ~/.aws/credentials file or environment variables.
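
The retry code used in the experiment above is not included in this report. For reference, here is a minimal sketch of what a retry wrapper with exponential backoff and jitter around an async read could look like. The helper name retry_with_backoff and its parameters are hypothetical and not part of torchsnapshot or aiobotocore. In the real plugin the retried exception would presumably be botocore.exceptions.NoCredentialsError, but the sketch takes the exception types as an argument so it runs with the standard library only.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    op: Callable[[], Awaitable[T]],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
) -> T:
    """Await op() until it succeeds or max_attempts is reached.

    Hypothetical helper: sleeps a random time in [0, cap] between
    attempts, where cap doubles each attempt up to max_delay.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await op()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the original error.
                raise
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(random.uniform(0.0, cap))
    raise AssertionError("unreachable")
```

A call site might look like `await retry_with_backoff(lambda: storage.read(read_io=read_io), (NoCredentialsError,))`, where storage and read_io stand for the objects seen in the traceback below. Retrying only hides the symptom and does not explain why the credentials lookup fails under higher process counts.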

NOTE:
I don't see the failure when I update the S3 storage_plugin to use a boto3 S3 client or botocore.session and test it that way.
Testing time was about 2 hours (~100 checkpoints).
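
The modified plugin is not attached to this report. As a rough sketch of the idea only: a synchronous boto3-style client can be driven from async code by running the blocking call in the default thread pool. The function names read_object and _blocking_get are hypothetical. The only assumption about the client is that it exposes boto3's get_object(Bucket=..., Key=..., Range=...) call and returns a response whose "Body" has a read() method.

```python
import asyncio
import functools
from typing import Any, Dict, Optional


def _blocking_get(
    client: Any, bucket: str, key: str, byte_range: Optional[str]
) -> bytes:
    """Fetch an object (or a byte range of it) with a synchronous client."""
    kwargs: Dict[str, str] = {"Bucket": bucket, "Key": key}
    if byte_range is not None:
        # HTTP Range header value, e.g. "bytes=0-1023".
        kwargs["Range"] = byte_range
    response = client.get_object(**kwargs)
    return response["Body"].read()


async def read_object(
    client: Any, bucket: str, key: str, byte_range: Optional[str] = None
) -> bytes:
    """Run the blocking download in the default executor so the event
    loop stays free for other read tasks."""
    loop = asyncio.get_running_loop()
    call = functools.partial(_blocking_get, client, bucket, key, byte_range)
    return await loop.run_in_executor(None, call)
```

With boto3 installed, the client would be created once per process, for example with `boto3.client("s3")`, and passed in. This goes through botocore's synchronous credential resolution rather than aiobotocore's async one, which may be why the failure was not observed in that configuration.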

Logs:

checkpointing_ddp/0 [3]:Traceback (most recent call last):
checkpointing_ddp/0 [3]:  File "/home/User/torchsnapshot/torchsnapshot/scheduler.py", line 369, in read_buffer
checkpointing_ddp/0 [3]:    await self.storage.read(read_io=read_io)
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,589][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-35' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155640>()]>>
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,590][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [3]:  File "/home/User/torchsnapshot/torchsnapshot/storage_plugins/s3.py", line 60, in read
checkpointing_ddp/0 [3]:    response = await client.get_object(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/client.py", line 354, in _make_api_call
checkpointing_ddp/0 [3]:    http, parsed_response = await self._make_request(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/client.py", line 379, in _make_request
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,610][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,589][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [3]:    return await self._endpoint.make_request(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/endpoint.py", line 96, in _send_request
checkpointing_ddp/0 [3]:    request = await self.create_request(request_dict, operation_model)
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/endpoint.py", line 84, in create_request
checkpointing_ddp/0 [0]:task: <Task pending name='Task-36' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155790>()]>>
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,634][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-37' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155550>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-38' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978007c10>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-39' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978007ac0>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-40' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f596ea95fa0>()]>>
checkpointing_ddp/0 [3]:    await self._event_emitter.emit(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/hooks.py", line 66, in _emit
checkpointing_ddp/0 [3]:    response = await resolve_awaitable(handler(**kwargs))
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/_helpers.py", line 15, in resolve_awaitable
checkpointing_ddp/0 [3]:    return await obj
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/signers.py", line 24, in handler
checkpointing_ddp/0 [3]:    return await self.sign(operation_name, request)
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/signers.py", line 82, in sign
checkpointing_ddp/0 [3]:    auth.add_auth(request)
checkpointing_ddp/0 [3]:  File "/opt/conda/envs/User/lib/python3.9/site-packages/botocore/auth.py", line 418, in add_auth
checkpointing_ddp/0 [3]:    raise NoCredentialsError()
checkpointing_ddp/0 [3]:botocore.exceptions.NoCredentialsError: Unable to locate credentials


Versions

pytorch = 2.0.0+cu117
torchx-nightly>=2023.3.15
torchsnapshot=0.1.0
