Monitor Bandwidth Utilization of Nodes While Training #548

@orwa-te

Environment:

  • Python version [3.7.7]
  • Spark version [3.0.1]
  • TensorFlow version [2.3.0]
  • TensorFlowOnSpark version [2.2.1]
  • Cluster version [Standalone]

Question:
Is there a way to monitor the network utilization of the nodes while they communicate with each other to transfer gradients and update the model? I want to measure the amount of data sent from one node to another, both for a single batch and across all batches. As far as I can tell, TensorBoard does not support this.

Spark Submit Command Line:
spark-submit --master spark://master:7077 train_file.py --cluster_size 3 --epochs 10
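
To make the request concrete, here is a minimal sketch of the kind of measurement I have in mind. It is not something TensorFlowOnSpark provides. It assumes the workers run Linux and reads the kernel's per-interface byte counters from /proc/net/dev before and after a piece of work. The helper names (`parse_proc_net_dev`, `read_counters`, `traffic_delta`, `measure`) are hypothetical. The counters cover all traffic on an interface, so the result is only an upper bound on gradient traffic unless the node is otherwise idle.

```python
def parse_proc_net_dev(text):
    """Parse the contents of /proc/net/dev into {iface: (rx_bytes, tx_bytes)}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def read_counters(path="/proc/net/dev"):
    """Read the current counters on this node (Linux only)."""
    with open(path) as f:
        return parse_proc_net_dev(f.read())


def traffic_delta(before, after):
    """Return {iface: (rx_bytes, tx_bytes)} transferred between two snapshots."""
    return {
        iface: (after[iface][0] - before[iface][0],
                after[iface][1] - before[iface][1])
        for iface in after
        if iface in before
    }


def measure(fn, *args, **kwargs):
    """Run fn and return (result, per-interface byte deltas) for this node."""
    before = read_counters()
    result = fn(*args, **kwargs)
    return result, traffic_delta(before, read_counters())
```

Wrapping a single training step in `measure(...)` on each worker would give a rough per-batch figure, and summing the deltas over a run would give the total for all batches. It would still not say which peer the bytes went to.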
