Is your feature request related to a problem? Please describe.
There is currently no end-to-end support for TensorBoard or other dashboards that track training and evaluation loss.
Describe the solution you'd like
The solution should show real-time metrics for running training jobs:
- Basic metrics (training loss, validation loss)
- Job health (possibly with an email alert if the job fails)
- GPU memory usage
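
To make the request more concrete, here is a rough sketch of the kind of per-step record a dashboard could consume. It only illustrates the metrics listed above and is not a proposed implementation: `MetricRecord`, `append_record`, `read_records`, and all field names are hypothetical, and the JSON-lines file is just an assumed transport that uses only the Python standard library.

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class MetricRecord:
    # Hypothetical field names, chosen to mirror the metrics requested above.
    step: int
    train_loss: float
    val_loss: Optional[float] = None    # only set on evaluation steps
    job_status: str = "running"         # e.g. "running", "failed", "finished"
    gpu_mem_mb: Optional[float] = None  # filled in by whatever GPU probe is available
    timestamp: float = 0.0


def append_record(path, record):
    """Append one record as a JSON line so a dashboard can tail the file."""
    if not record.timestamp:
        record.timestamp = time.time()
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


def read_records(path):
    """Read all records back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A training loop would call `append_record` once per step (or every N steps), and the dashboard or an alerting hook would poll `read_records` and react when `job_status` changes to a failure state.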