
Training time on a p3.2xlarge instance suddenly increased 10-fold! Very slow. #1921

@nectario

Description


When I trained my deep learning model on a p3.2xlarge instance, it used to take about 2 seconds per epoch. Now, all of a sudden, it takes around 39 seconds per epoch! Total training time used to be about 15 minutes, and now it can go up to 2 hours!

Please advise why this is happening.

See image attachment.

Thank you!

Nektarios

Describe the bug
Training is very slow on a p3.2xlarge instance; the GPU may not be in use (see the check sketched below).
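
To narrow this down, a quick check like the following could be added at the top of the training entry-point script (a minimal sketch; the actual script is not shown in this issue, so the placement is an assumption). It simply reports whether TensorFlow 2.3 can see the p3.2xlarge's GPU:

```python
# Hypothetical diagnostic for the training entry-point script:
# report whether TensorFlow can see the GPU on the p3.2xlarge host.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible to TensorFlow:", gpus)

if not gpus:
    # Falling back to CPU would roughly match the reported ~20x slowdown
    # (from ~2 s to ~39 s per epoch).
    print("WARNING: no GPU detected; training is running on CPU.")
```

If the list comes back empty in the training job's logs, the slowdown is consistent with the job falling back to CPU.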

To reproduce
Train the same deep learning model on a single p3.2xlarge instance with the TensorFlow 2.3 estimator; each epoch now takes around 39 seconds instead of about 2 seconds.

Expected behavior
Training should take roughly 2 seconds per epoch (about 15 minutes in total), as it did previously.

Screenshots or logs
See the attached screenshot (SageMaker_Slow_Training) showing the per-epoch timings.

System information
A description of your system; a minimal launch sketch using these values follows the list. Please provide:

  • SageMaker Python SDK version:
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): TensorFlow
  • Framework version: 2.3
  • Python version: 3.7
  • CPU or GPU: GPU
  • Custom Docker image (Y/N):
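
For reference, the values above would typically correspond to a launch along these lines with the SageMaker Python SDK (v2-style arguments). This is a minimal sketch, not the reporter's actual code; the entry-point script name, IAM role, and S3 path are placeholders:

```python
# Minimal sketch of launching single-instance TensorFlow 2.3 training on
# ml.p3.2xlarge with the SageMaker Python SDK. All names below are placeholders.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",         # hypothetical training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder IAM role
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # single-GPU (V100) instance
    framework_version="2.3",        # matches the reported framework version
    py_version="py37",              # matches the reported Python version
)

estimator.fit("s3://my-bucket/training-data/")  # placeholder S3 input
```

With a p3 instance type, the SDK should resolve to a GPU-enabled TensorFlow container image; if a custom (possibly CPU-only) image is supplied instead, that could also explain the slowdown, which is why the "Custom Docker image" field above matters.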

