
Tensorflow mnist-distributed python sdk v2 notebook throws error #1599

Open
@jyravi

Description

example: Tensorflow mnist-distributed python sdk v2 notebook

Description:

I tried running the notebook on a CPU cluster (as mentioned in the notebook): Standard_DS12_v2 (4 cores, 28 GB RAM, 56 GB disk), 4 nodes, as well as on a GPU cluster: Standard_NV6 (6 cores, 56 GB RAM, 380 GB disk), 2 nodes.

The job fails with the following user error:
{"NonCompliant":"Process 'worker 0' exited with code -2 and error message 'Execution failed. Process killed by signal with name SIGKILL. It was terminated by the runtime due to failure in other processes on the same node. Error: 2022-09-01 02:49:50.471398: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10\n2022-09-01 02:49:50.472092: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11\n2022-09-01 02:49:50.472347: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8\n2022-09-01 02:49:50.474244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0\n2022-09-01 02:49:50.474927: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\nTo enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n2022-09-01 02:49:50.476117: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set\n2022-09-01 02:49:50.477062: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: \npciBusID: f0a2:00:00.0 name: Tesla M60 computeCapability: 5.2\ncoreClock: 1.1775GHz coreCount: 16 deviceMemorySize: 7.94GiB deviceMemoryBandwidth: 149.31GiB/s\n2022-09-01 02:49:50.477097: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n2022-09-01 02:49:50.477127: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11\n2022-09-01 02:49:50.477151: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11\n2022-09-01 02:49:50.477174: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10\n2022-09-01 02:49:50.477196: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10\n2022-09-01 02:49:50.477220: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10\n2022-09-01 02:49:50.477242: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11\n2022-09-01 02:49:50.477265: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8\n2022-09-01 02:49:50.478934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0\n2022-09-01 02:49:50.478989: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n'. Please check the log file '

I checked the log files: the error seems to be in the parameter server process:

Traceback (most recent call last):
  File "main.py", line 124, in <module>
    main()
  File "main.py", line 99, in main
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 340, in new_func
    return func(*args, **kwargs)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 254, in __init__
    self).__init__(cluster_resolver, communication_options)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 188, in __init__
    communication_options=communication_options))
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 327, in __init__
    self._initialize_strategy(self._cluster_resolver)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 335, in _initialize_strategy
    self._initialize_multi_worker(cluster_resolver)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 419, in _initialize_multi_worker
    self._cluster_spec, self._task_type, self._task_id)
  File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/tensorflow/python/distribute/multi_worker_util.py", line 227, in id_in_cluster
    raise ValueError("There is no id for task_type %r" % task_type)
ValueError: There is no id for task_type 'ps'
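
As far as I can tell from the traceback, the raise happens in multi_worker_util.id_in_cluster because the failing process's own TF_CONFIG task type is 'ps', and MultiWorkerMirroredStrategy only assigns cluster ids to chief/worker/evaluator tasks. Below is a minimal sketch that should reproduce the same ValueError outside AzureML; the hostnames, ports, and cluster layout are placeholders I made up, not values from the job:

import json
import os

# Fake a TF_CONFIG resembling what the parameter-server process appears to receive.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["host0:2222", "host1:2222"],
        "ps": ["host2:2222"],
    },
    "task": {"type": "ps", "index": 0},  # this process is the 'ps' task
})

import tensorflow as tf

# Expected to raise: ValueError: There is no id for task_type 'ps'
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

If that is what is happening, the question is why the job launches a 'ps' task at all, since MultiWorkerMirroredStrategy is an all-reduce strategy and does not use parameter servers.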

Additional context

I also tried changing the strategy to tf.distribute.MultiWorkerMirroredStrategy().
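
In case it helps with triage, here is a hypothetical diagnostic snippet (not part of the notebook's main.py) that could be added before the strategy is created, so the driver logs show which TF_CONFIG each process receives and whether a 'ps' entry is present:

import json
import os

# Hypothetical diagnostic, not in the original main.py: print the cluster layout
# and this process's task type/index as TensorFlow will see them.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
print("TF_CONFIG cluster:", tf_config.get("cluster"))
print("TF_CONFIG task:", tf_config.get("task"))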

Requesting help in resolving this issue.
