google-cloud-mldiagnostics

Overview

The google-cloud-mldiagnostics library is a Python package that helps engineers and researchers monitor and diagnose machine learning training runs with Google Cloud's suite of diagnostic tools. It provides tools for tracking workload progress, collecting metrics, and profiling performance.

Supported frameworks

  • JAX
    • any version
  • Others in progress

How to install

Install

Install the package from PyPI:

pip install google-cloud-mldiagnostics

This package does not install libtpu, jax, or xprof; it expects them to be installed separately.
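
Because these dependencies are not pulled in automatically, it can help to check for them before launching a run. The snippet below is a minimal sketch that uses only the standard library. The helper name `missing_modules` is hypothetical, and the import names `jax`, `libtpu`, and `xprof` are assumed to match the package names.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the separately installed dependencies.
    missing = missing_modules(["jax", "libtpu", "xprof"])
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All expected dependencies are importable.")
```

Running this before the training script starts surfaces a missing dependency early instead of partway through a run.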

How to use

Monitor training

At the beginning of the training script, create a machine learning run:

from google_cloud_mldiagnostics import machinelearning_run

machinelearning_run(
  name="<run-name>",
  gcs_path="gs://<bucket>",
)

Monitor with on-demand profiling

from google_cloud_mldiagnostics import machinelearning_run

machinelearning_run(
  name="<run-name>",
  gcs_path="gs://<bucket>",
  on_demand_xprof=True
)

Monitor with programmatic profiling

from google_cloud_mldiagnostics import machinelearning_run
from google_cloud_mldiagnostics import xprof

machinelearning_run(
  name="<run-name>",
  gcs_path="gs://<bucket>",
)

# Use a separate variable name so the imported `xprof` is not overwritten.
profiler = xprof()
profiler.start()
# some code
profiler.stop()
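
If the profiled code raises an exception, `stop()` is never reached in the snippet above. One way to guard against that is a small context manager. This is a sketch, not part of the library: `profiled` and `FakeProfiler` are hypothetical names, and the only assumption about the real profiler object is that it exposes `start()` and `stop()` as shown above.

```python
import contextlib


@contextlib.contextmanager
def profiled(profiler):
    """Start the profiler on entry and always stop it on exit."""
    profiler.start()
    try:
        yield profiler
    finally:
        # Runs even if the wrapped code raises, so stop() is not skipped.
        profiler.stop()


class FakeProfiler:
    """Stand-in with the same start()/stop() shape, for local testing only."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")

    def stop(self):
        self.events.append("stop")


fake = FakeProfiler()
with profiled(fake):
    fake.events.append("work")  # placeholder for the code being profiled
```

With the library installed, the same wrapper should accept the object returned by `xprof()` in place of `FakeProfiler`, assuming its interface matches the example above.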

Monitor with predefined metrics

from google_cloud_mldiagnostics import machinelearning_run
from google_cloud_mldiagnostics import metrics
from google_cloud_mldiagnostics import metric_types

machinelearning_run(
  name="<run-name>",
  gcs_path="gs://<bucket>",
)

metrics.record(metric_types.MetricType.LOSS, <value>)

To pair the metric value with the current step:

metrics.record(metric_types.MetricType.LOSS, <value>, step=<step>)
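
To show where such a call might sit in a training loop, here is a minimal sketch. It does not import the library: the `record` argument is any callable with the `(key, value, step=...)` shape used above, and `train`, `list_recorder`, and the halving "loss" are hypothetical placeholders.

```python
def train(num_steps, record, loss_key):
    """Toy loop that reports the loss once per step via `record`."""
    loss = 1.0
    for step in range(num_steps):
        loss *= 0.5  # placeholder for a real training step
        record(loss_key, loss, step=step)
    return loss


recorded = []


def list_recorder(key, value, step=None):
    # Local stand-in so the sketch runs without the library installed.
    recorded.append((key, value, step))


final_loss = train(3, list_recorder, loss_key="loss")
```

In a real script, `metrics.record` would presumably be passed as `record` and the predefined loss metric type as `loss_key`.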

Monitor with custom metrics

from google_cloud_mldiagnostics import machinelearning_run
from google_cloud_mldiagnostics import metrics

machinelearning_run(
  name="<run-name>",
  gcs_path="gs://<bucket>",
)

metrics.record("<my-metric>", <value>)

To pair the metric value with the current step:

metrics.record("<my-metric>", <value>, step=<step>)
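
When several custom metrics are produced at the same step, a small helper can record them together. The sketch below is illustrative only: `record_all`, `capture`, and the metric names `tokens_per_sec` and `grad_norm` are hypothetical, and `record` is assumed to accept `(name, value, step=...)` as in the calls above.

```python
def record_all(record, values, step):
    """Record every entry of `values` against the same step."""
    for name in sorted(values):  # sorted for a deterministic call order
        record(name, values[name], step=step)


calls = []


def capture(name, value, step=None):
    # Local stand-in so the sketch runs without the library installed.
    calls.append((name, value, step))


record_all(capture, {"tokens_per_sec": 1200.0, "grad_norm": 0.8}, step=10)
```

Passing `metrics.record` instead of `capture` would presumably send the same values to the run.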
