Description
Training my deep learning model on a p3.2xlarge instance used to take about 2 seconds per epoch. Now, all of a sudden, it takes around 39 seconds per epoch. Total training time has gone from about 15 minutes to as long as 2 hours.
Please advise why this is happening.
See image attachment.
Thank you!
Nektarios
Describe the bug
Training is very slow on p3.2xlarge; the GPU may not be in use (see the sketch below for a quick check).
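
A minimal sketch (not from the original report) of how one might confirm from inside the training script/container that TensorFlow 2.3 actually sees the GPU, assuming the standard GPU training image:

```python
# Run inside the training script on the p3.2xlarge instance to confirm
# that TensorFlow can see the GPU at all.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

# Optionally log which device each op is placed on; very slow epochs combined
# with an empty GPU list usually mean training is falling back to the CPU.
tf.debugging.set_log_device_placement(True)
```

If the GPU list comes back empty, the slowdown is consistent with a CPU-only image or a CUDA/driver mismatch rather than a model change.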
To reproduce
A clear, step-by-step set of instructions to reproduce the bug.
Expected behavior
A clear and concise description of what you expected to happen.
Screenshots or logs
If applicable, add screenshots or logs to help explain your problem.

System information
A description of your system. Please provide:
- SageMaker Python SDK version:
- Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): TensorFlow
- Framework version: 2.3
- Python version: 3.7
- CPU or GPU: GPU
- Custom Docker image (Y/N):
Additional context
Add any other context about the problem here.
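
One possible cause worth ruling out (an assumption on my part, not confirmed from the report): if the estimator does not pin the framework and Python versions, the training image can change silently between runs. A minimal sketch using the SageMaker Python SDK v2 TensorFlow estimator; `train.py` and the role are placeholders:

```python
# Hypothetical estimator configuration; entry_point and role are placeholders.
# Pinning framework_version/py_version and using a GPU instance type rules out
# a silently changed training image as the cause of the slowdown.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",           # placeholder training script
    role="<execution-role-arn>",      # placeholder IAM execution role
    instance_count=1,
    instance_type="ml.p3.2xlarge",    # single V100 GPU
    framework_version="2.3",          # matches the version reported above
    py_version="py37",
)
```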