
Tensorflow mnist-distributed python sdk v2 notebook throws error #1599

Open
@jyravi

Description

example: Tensorflow mnist-distributed python sdk v2 notebook

Description:

I tried running the notebook on a CPU cluster (as mentioned in the notebook): Standard_DS12_v2 (4 cores, 28 GB RAM, 56 GB disk), 4 nodes, as well as on a GPU cluster: Standard_NV6 (6 cores, 56 GB RAM, 380 GB disk), 2 nodes.

The job fails with the following user error:
{"NonCompliant":"Process 'worker 0' exited with code -2 and error message 'Execution failed. Process killed by signal with name SIGKILL. It was terminated by the runtime due to failure in other processes on the same node. Error: 2022-09-01 02:49:50.471398: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10\n2022-09-01 02:49:50.472092: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11\n2022-09-01 02:49:50.472347: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8\n2022-09-01 02:49:50.474244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0\n2022-09-01 02:49:50.474927: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\nTo enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n2022-09-01 02:49:50.476117: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set\n2022-09-01 02:49:50.477062: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: \npciBusID: f0a2:00:00.0 name: Tesla M60 computeCapability: 5.2\ncoreClock: 1.1775GHz coreCount: 16 deviceMemorySize: 7.94GiB deviceMemoryBandwidth: 149.31GiB/s\n2022-09-01 02:49:50.477097: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n2022-09-01 02:49:50.477127: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11\n2022-09-01 02:49:50.477151: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11\n2022-09-01 02:49:50.477174: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10\n2022-09-01 02:49:50.477196: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10\n2022-09-01 02:49:50.477220: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10\n2022-09-01 02:49:50.477242: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11\n2022-09-01 02:49:50.477265: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8\n2022-09-01 02:49:50.478934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0\n2022-09-01 02:49:50.478989: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n'. Please check the log file '

I checked the log files: the error seems to be in the parameter server process:

Traceback (most recent call last):
  File "main.py", line 124, in <module>
    main()
  File "main.py", line 99, in main
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 340, in new_func
    return func(*args, **kwargs)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 254, in __init__
    self).__init__(cluster_resolver, communication_options)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 188, in __init__
    communication_options=communication_options))
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 327, in __init__
    self._initialize_strategy(self._cluster_resolver)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 335, in _initialize_strategy
    self._initialize_multi_worker(cluster_resolver)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 419, in _initialize_multi_worker
    self._cluster_spec, self._task_type, self._task_id)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/multi_worker_util.py", line 227, in id_in_cluster
    raise ValueError("There is no id for task_type %r" % task_type)
ValueError: There is no id for task_type 'ps'
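
As far as I can tell from the traceback, the raise happens in multi_worker_util.id_in_cluster because the failing process's own TF_CONFIG task type is 'ps', and MultiWorkerMirroredStrategy only assigns cluster ids to chief/worker/evaluator tasks. Below is a minimal sketch that should reproduce the same ValueError outside AzureML; the hostnames, ports, and cluster layout are placeholders I made up, not values from the job:

import json
import os

# Fake a TF_CONFIG resembling what the parameter-server process appears to receive.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["host0:2222", "host1:2222"],
        "ps": ["host2:2222"],
    },
    "task": {"type": "ps", "index": 0},  # this process is the 'ps' task
})

import tensorflow as tf

# Expected to raise: ValueError: There is no id for task_type 'ps'
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

If that is what is happening, the question is why the job launches a 'ps' task at all, since MultiWorkerMirroredStrategy is an all-reduce strategy and does not use parameter servers.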

Additional context

I also tried changing the strategy to tf.distribute.MultiWorkerMirroredStrategy().
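
In case it helps with triage, here is a hypothetical diagnostic snippet (not part of the notebook's main.py) that could be added before the strategy is created, so the driver logs show which TF_CONFIG each process receives and whether a 'ps' entry is present:

import json
import os

# Hypothetical diagnostic, not in the original main.py: print the cluster layout
# and this process's task type/index as TensorFlow will see them.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
print("TF_CONFIG cluster:", tf_config.get("cluster"))
print("TF_CONFIG task:", tf_config.get("task"))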

Requesting help in resolving this issue.
