Description
Example: TensorFlow mnist-distributed Python SDK v2 notebook

I tried running the notebook on the CPU cluster mentioned in the notebook (Standard_DS12_v2: 4 cores, 28 GB RAM, 56 GB disk; 4 nodes) as well as on a GPU cluster (Standard_NV6: 6 cores, 56 GB RAM, 380 GB disk; 2 nodes).
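For reference, this is roughly how I submit the job with the Python SDK v2, following the notebook; a minimal sketch, where the compute name, environment label, and paths are placeholders for my setup and may differ slightly from the notebook:

```python
# Minimal sketch of the SDK v2 command job, following the notebook.
# Compute name, environment label, and paths below are placeholders.
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

job = command(
    code="./src",  # folder containing main.py
    command="python main.py --epochs ${{inputs.epochs}}",
    inputs={"epochs": 1},
    environment="AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu@latest",
    compute="cpu-cluster",   # Standard_DS12_v2; also tried a 2-node Standard_NV6 cluster
    instance_count=4,        # 4 nodes on the CPU cluster (2 on the GPU cluster)
    distribution={"type": "tensorflow", "worker_count": 4},
    display_name="tensorflow-mnist-distributed",
)
ml_client.create_or_update(job)
```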
The job fails with the following user error:
{"NonCompliant":"Process 'worker 0' exited with code -2 and error message 'Execution failed. Process killed by signal with name SIGKILL. It was terminated by the runtime due to failure in other processes on the same node. Error: 2022-09-01 02:49:50.471398: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10\n2022-09-01 02:49:50.472092: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11\n2022-09-01 02:49:50.472347: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8\n2022-09-01 02:49:50.474244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0\n2022-09-01 02:49:50.474927: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\nTo enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n2022-09-01 02:49:50.476117: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set\n2022-09-01 02:49:50.477062: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: \npciBusID: f0a2:00:00.0 name: Tesla M60 computeCapability: 5.2\ncoreClock: 1.1775GHz coreCount: 16 deviceMemorySize: 7.94GiB deviceMemoryBandwidth: 149.31GiB/s\n2022-09-01 02:49:50.477097: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n2022-09-01 02:49:50.477127: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11\n2022-09-01 02:49:50.477151: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11\n2022-09-01 02:49:50.477174: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10\n2022-09-01 02:49:50.477196: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10\n2022-09-01 02:49:50.477220: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10\n2022-09-01 02:49:50.477242: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11\n2022-09-01 02:49:50.477265: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8\n2022-09-01 02:49:50.478934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0\n2022-09-01 02:49:50.478989: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n'. Please check the log file '
Checked the log files: the error appears to originate in the parameter server process:
Traceback (most recent call last):
  File "main.py", line 124, in <module>
    main()
  File "main.py", line 99, in main
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 340, in new_func
    return func(*args, **kwargs)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 254, in __init__
    self).__init__(cluster_resolver, communication_options)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 188, in __init__
    communication_options=communication_options))
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 327, in __init__
    self._initialize_strategy(self._cluster_resolver)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 335, in _initialize_strategy
    self._initialize_multi_worker(cluster_resolver)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 419, in _initialize_multi_worker
    self._cluster_spec, self._task_type, self._task_id)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/multi_worker_util.py", line 227, in id_in_cluster
    raise ValueError("There is no id for task_type %r" % task_type)
ValueError: There is no id for task_type 'ps'
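As far as I can tell, this ValueError means the TF_CONFIG handed to this process marks it as a 'ps' (parameter server) task, which MultiWorkerMirroredStrategy does not assign an id to. A small diagnostic snippet (not part of the notebook) that could be dropped at the top of main.py to confirm what each process receives:

```python
# Diagnostic sketch (not from the notebook): print the cluster spec and task
# that this process receives from the runtime before the strategy is built.
# A 'ps' entry here would explain the "no id for task_type 'ps'" ValueError.
import json
import os

tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
print("cluster spec:", tf_config.get("cluster"))  # e.g. {'worker': [...], 'ps': [...]}
print("this task:   ", tf_config.get("task"))     # e.g. {'type': 'ps', 'index': 0}
```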
Additional context
I also tried changing the strategy to tf.distribute.MultiWorkerMirroredStrategy(), but that did not resolve the error (both variants are shown below).
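For completeness, the two constructions I tried, simplified from main.py (line 99 in the traceback above):

```python
# Simplified from main.py: the notebook's original strategy construction and
# the non-experimental variant I tried instead; neither resolved the error.
import tensorflow as tf

# Original (main.py line 99):
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Variant I also tried (stable API in TF 2.4):
# strategy = tf.distribute.MultiWorkerMirroredStrategy()
```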
Requesting help to resolve this issue.