Labels: question (Further information is requested)
Description
Hi there, I'm working on an update to the TonY installation script for GCP Dataproc. While I have been able to successfully update the TensorFlow example (locally), I cannot get the PyTorch example working. It fails on both 0.4 (the most recent version you explicitly mention supporting) and 1.7.1, the most recent release. I get the following error:
```
File "mnist_distributed.py", line 230, in <module>
  main()
File "mnist_distributed.py", line 225, in main
  init_process(args)
File "mnist_distributed.py", line 185, in init_process
  distributed.init_process_group(
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1607810694534_0006/container_1607810694534_0006_01_000003/venv/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 413, in init_process_group
  backend = Backend(backend)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1607810694534_0006/container_1607810694534_0006_01_000003/venv/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 76, in __new__
  raise ValueError("TCP backend has been deprecated. Please use "
ValueError: TCP backend has been deprecated. Please use Gloo or MPI backend for collective operations on CPU tensors.
```
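For context, the error suggests the example script is still passing `backend="tcp"`, a backend that modern PyTorch has removed. Below is a minimal sketch of the kind of call that works on PyTorch 1.7.1, using the `gloo` backend and assuming the rendezvous details (master address/port, world size, rank) come from environment variables; the actual `mnist_distributed.py` may wire these up differently:

```python
import os
import torch.distributed as distributed

def init_process_group_gloo():
    # On PyTorch >= 1.0, init_process_group must be given "gloo" or "mpi"
    # for CPU tensors (or "nccl" for GPU collectives); "tcp" raises the
    # ValueError shown above. Rendezvous settings are read from the
    # environment here purely for illustration.
    distributed.init_process_group(
        backend="gloo",
        init_method="env://",
        world_size=int(os.environ["WORLD_SIZE"]),
        rank=int(os.environ["RANK"]),
    )
```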
Latest attempt:
PyTorch 1.7.1
torchvision 0.8.2
TonY 0.4.0
Dataproc 2.0 (Hadoop 3.2.1)
Config:
```xml
<configuration>
  <property>
    <name>tony.application.name</name>
    <value>PyTorch</value>
  </property>
  <property>
    <name>tony.application.security.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>tony.worker.instances</name>
    <value>2</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.ps.instances</name>
    <value>1</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>2g</value>
  </property>
  <property>
    <name>tony.application.framework</name>
    <value>pytorch</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
</configuration>
```
The cluster has 1 master, 2 workers, and 2 NVIDIA Tesla T4 GPUs. However, every configuration combination I have tried so far results in the same error. Any advice would be greatly appreciated!