PyTorch Support

Hi there, I'm working on an update to the [TonY installation script](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/tony) for GCP Dataproc. I have successfully updated TensorFlow locally, but I cannot get the PyTorch example working. It fails on both 0.4 (the most recent version you explicitly mention [supporting](https://github.com/linkedin/TonY/tree/master/tony-examples/mnist-pytorch)) and 1.7.1, the most recent release. I get the following error:
```
  File "mnist_distributed.py", line 230, in <module>
    main()
  File "mnist_distributed.py", line 225, in main
    init_process(args)
  File "mnist_distributed.py", line 185, in init_process
    distributed.init_process_group(
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1607810694534_0006/container_1607810694534_0006_01_000003/venv/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 413, in init_process_group
    backend = Backend(backend)
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1607810694534_0006/container_1607810694534_0006_01_000003/venv/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 76, in __new__
    raise ValueError("TCP backend has been deprecated. Please use "
ValueError: TCP backend has been deprecated. Please use Gloo or MPI backend for collective operations on CPU tensors.
```
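
The traceback shows `init_process_group` rejecting the backend name itself, so the failure seems to come from the example script requesting the `tcp` backend rather than from the cluster setup. Below is a minimal sketch of what a replacement call might look like. The helper names `choose_backend` and `init_distributed` are hypothetical, and the environment variable names `INIT_METHOD`, `RANK`, and `WORLD` are assumptions about what the job exports, so they may need adjusting to match the actual script.

```python
import os


def choose_backend(use_cuda: bool) -> str:
    """Return a backend name that recent PyTorch releases accept.

    "tcp" is rejected (see the ValueError above). "gloo" handles
    collectives on CPU tensors; "nccl" is the usual choice for GPU tensors.
    """
    return "nccl" if use_cuda else "gloo"


def init_distributed(use_cuda: bool) -> None:
    """Initialize the default process group with a supported backend.

    Assumption: the launcher exports INIT_METHOD, RANK and WORLD.
    These names are illustrative and may differ in your setup.
    """
    # Imported lazily so choose_backend() can be used without torch installed.
    import torch.distributed as dist

    dist.init_process_group(
        backend=choose_backend(use_cuda),
        init_method=os.environ["INIT_METHOD"],
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD"]),
    )
```

Inside a worker this would be called as `init_distributed(torch.cuda.is_available())` before any collective operation. Keeping the backend choice in its own function makes it easy to fall back to `gloo` on CPU-only containers.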

Latest attempt:
PyTorch 1.7.1
torchvision 0.8.2
TonY 0.4.0
[Dataproc 2.0](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-preview) (Hadoop 3.2.1)

Config: 
```
<configuration>
 <property>
  <name>tony.application.name</name>
  <value>PyTorch</value>
 </property>
 <property>
  <name>tony.application.security.enabled</name>
  <value>false</value>
 </property>
 <property>
  <name>tony.worker.instances</name>
  <value>2</value>
 </property>
 <property>
  <name>tony.worker.memory</name>
  <value>4g</value>
 </property>
 <property>
  <name>tony.ps.instances</name>
  <value>1</value>
 </property>
 <property>
  <name>tony.ps.memory</name>
  <value>2g</value>
 </property>
 <property>
  <name>tony.application.framework</name>
  <value>pytorch</value>
 </property>
 <property>
  <name>tony.worker.gpus</name>
  <value>1</value>
 </property>
</configuration>
```

The cluster has 1 master, 2 workers, and 2 NVIDIA Tesla T4 GPUs. However, every configuration I have tried so far results in the same error. Any advice would be greatly appreciated!

PyTorch Support #493
