PyTorch Support #493

@bradmiro

Hi there, I'm working on an update to the TonY installation script for GCP Dataproc. I have been able to update TensorFlow successfully (locally), but I cannot get the PyTorch example working. It does not work on PyTorch 0.4 (the most recent version you explicitly mention supporting) or on 1.7.1, the most recent release. I get the following error:

  File "mnist_distributed.py", line 230, in <module>
    main()
  File "mnist_distributed.py", line 225, in main
    init_process(args)
  File "mnist_distributed.py", line 185, in init_process
    distributed.init_process_group(
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1607810694534_0006/container_1607810694534_0006_01_000003/venv/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 413, in init_process_group
    backend = Backend(backend)
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1607810694534_0006/container_1607810694534_0006_01_000003/venv/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 76, in __new__
    raise ValueError("TCP backend has been deprecated. Please use "
ValueError: TCP backend has been deprecated. Please use Gloo or MPI backend for collective operations on CPU tensors.
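
For reference, here is a minimal sketch of what the error message seems to be asking for: swapping the deprecated `tcp` backend for Gloo (CPU tensors) or NCCL (CUDA tensors) before calling `init_process_group`. The helper name `choose_backend` and the environment variable names `INIT_METHOD`, `RANK` and `WORLD` are assumptions made for illustration; I have not confirmed that this is how `mnist_distributed.py` or TonY passes these values.

```python
import os


def choose_backend(requested: str, cuda_available: bool) -> str:
    """Map a requested backend name to one that current PyTorch accepts.

    Hypothetical helper, not part of TonY or PyTorch.
    """
    name = requested.lower()
    if name == "tcp":
        # "tcp" is rejected by torch.distributed (see the ValueError above).
        # NCCL is the usual choice for CUDA tensors, Gloo for CPU tensors.
        return "nccl" if cuda_available else "gloo"
    return name


def init_process(requested_backend: str = "tcp") -> None:
    # Imported lazily so choose_backend can be used without torch installed.
    import torch
    import torch.distributed as dist

    backend = choose_backend(requested_backend, torch.cuda.is_available())
    # Assumption: the launcher exports these three variables in each container.
    dist.init_process_group(
        backend=backend,
        init_method=os.environ["INIT_METHOD"],
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD"]),
    )
```

With a change along these lines, a worker that previously requested `tcp` would initialize with `gloo` on a CPU-only container and `nccl` on one that can see a GPU.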

Latest attempt:
PyTorch 1.7.1
torchvision 0.8.2
TonY 0.4.0
Dataproc 2.0 (Hadoop 3.2.1)

Config:

<configuration>
 <property>
  <name>tony.application.name</name>
  <value>PyTorch</value>
 </property>
 <property>
  <name>tony.application.security.enabled</name>
  <value>false</value>
 </property>
 <property>
  <name>tony.worker.instances</name>
  <value>2</value>
 </property>
 <property>
  <name>tony.worker.memory</name>
  <value>4g</value>
 </property>
 <property>
  <name>tony.ps.instances</name>
  <value>1</value>
 </property>
 <property>
  <name>tony.ps.memory</name>
  <value>2g</value>
 </property>
 <property>
  <name>tony.application.framework</name>
  <value>pytorch</value>
 </property>
 <property>
  <name>tony.worker.gpus</name>
  <value>1</value>
 </property>
</configuration>

The cluster has 1 master, 2 workers and 2 NVIDIA Tesla T4s. Every configuration I have tried so far results in the same error. Any advice would be greatly appreciated!

Labels: question (Further information is requested)
