
Training time on a p3.2xlarge instance suddenly increased 10-fold! Very slow. #1921

@nectario

Description


When I trained my deep learning model on a p3.2xlarge instance, it used to take about 2 seconds per epoch. Now, all of a sudden, it takes around 39 seconds per epoch! Total training time used to be about 15 minutes, and now it can go up to 2 hours!

Please advise why this is happening.

See image attachment.

Thank you!

Nektarios

Describe the bug
Training is very slow on a p3.2xlarge instance; the GPU may not be in use (see the check sketched below).
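
To narrow this down, a quick check like the following could be added at the top of the training entry-point script (a minimal sketch; the actual script is not shown in this issue, so the placement is an assumption). It simply reports whether TensorFlow 2.3 can see the p3.2xlarge's GPU:

```python
# Hypothetical diagnostic for the training entry-point script:
# report whether TensorFlow can see the GPU on the p3.2xlarge host.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible to TensorFlow:", gpus)

if not gpus:
    # Falling back to CPU would roughly match the reported ~20x slowdown
    # (from ~2 s to ~39 s per epoch).
    print("WARNING: no GPU detected; training is running on CPU.")
```

If the list comes back empty in the training job's logs, the slowdown is consistent with the job falling back to CPU.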

To reproduce
Train the same deep learning model on a single p3.2xlarge instance with the TensorFlow 2.3 estimator; each epoch now takes around 39 seconds instead of about 2 seconds.

Expected behavior
Training should take roughly 2 seconds per epoch (about 15 minutes in total), as it did previously.

Screenshots or logs
See the attached screenshot (SageMaker_Slow_Training) showing the per-epoch timings.

System information
A description of your system; a minimal launch sketch using these values follows the list. Please provide:

  • SageMaker Python SDK version:
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): TensorFlow
  • Framework version: 2.3
  • Python version: 3.7
  • CPU or GPU: GPU
  • Custom Docker image (Y/N):
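
For reference, the values above would typically correspond to a launch along these lines with the SageMaker Python SDK (v2-style arguments). This is a minimal sketch, not the reporter's actual code; the entry-point script name, IAM role, and S3 path are placeholders:

```python
# Minimal sketch of launching single-instance TensorFlow 2.3 training on
# ml.p3.2xlarge with the SageMaker Python SDK. All names below are placeholders.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",         # hypothetical training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder IAM role
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # single-GPU (V100) instance
    framework_version="2.3",        # matches the reported framework version
    py_version="py37",              # matches the reported Python version
)

estimator.fit("s3://my-bucket/training-data/")  # placeholder S3 input
```

With a p3 instance type, the SDK should resolve to a GPU-enabled TensorFlow container image; if a custom (possibly CPU-only) image is supplied instead, that could also explain the slowdown, which is why the "Custom Docker image" field above matters.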

