
Conversation

@typhoonzero
Contributor

What changes were proposed in this pull request?

Support running distributed TensorFlow training in KFP workflows.

Only TensorFlow/Keras distributed training with MultiWorkerMirroredStrategy is supported. PyTorch support will be added in other PRs. Distributed training using "parameter servers" will not be supported for now.
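
For reference, a minimal sketch of the kind of Keras training script this targets, assuming the generated workflow (or bootstrapper) sets TF_CONFIG on each worker pod; the model and data below are illustrative only:

```python
# Minimal sketch of a MultiWorkerMirroredStrategy training script (illustrative only).
# Assumes TF_CONFIG is set for each worker by the generated workflow.
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Variables must be created inside the strategy scope so they are mirrored.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Every worker loads the same data; the strategy shards batches across workers.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(64)

model.fit(dataset, epochs=2)
```

Each worker runs the same script; the strategy handles gradient synchronization across the workers listed in TF_CONFIG.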

TODO:

  • Add dependency steps before or after the distributed training step.
  • PyTorch support.

Usage:

  1. Set the number of workers for the training step:

[Screenshot 2022-11-07 11 05 24]

  2. Run the workflow:

[Screenshot 2022-11-07 11 06 18]

How was this pull request tested?

TBD.

@elyra-bot

elyra-bot bot commented Nov 7, 2022

Thanks for making a pull request to Elyra!

To try out this branch on Binder, follow this link: Binder

@akchinSTC added the status:Work in Progress, component:pipeline-editor, and platform: pipeline-Kubeflow labels on Nov 7, 2022
@akchinSTC
Member

Thanks @typhoonzero for another contribution!
A few thoughts:

  1. After building the PR, I can't seem to find the new fields in the UI (could just be my env, but I did a purge and a fresh build).
  2. Looks like Argo-specific metadata labels are hardcoded; we will need to support KFP Tekton as well. Thanks @ptitzler.
  3. Lastly, these changes are very specific to TF. Displaying them dynamically will probably require us to describe which libraries the images contain in the metadata, which opens up a big can of worms. Maybe have an env var as a flag instead to trigger displaying these extra options... this will definitely require more discussion.

@typhoonzero
Contributor Author

@akchinSTC Thanks for the information. Actually, I'm looking for a more generic implementation, like a workflow ParallelFor function, not only for distributed data-parallel training but also for parallel data processing.

When a node is set as a ParallelFor node (parallel count > 1), Elyra should pass the following environment variables to the user program:

  • rank
  • nranks
  • runtime pod IP address for each rank

Then the user program can set either TF_CONFIG for TensorFlow distributed training or MASTER_ADDR for PyTorch distributed training (see the sketch after the list below). In this case, the changes would include:

  1. A parallel-count property for each node.
  2. Connect inputs and outputs to dependency nodes.
  3. bootstrapper.py will set these environment variables at runtime.
  4. Some examples for TensorFlow, PyTorch, and general data processing.
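
To illustrate the idea (not the current implementation), here is a rough sketch of what a user program could do with such environment variables; the names RANK, NRANKS, and WORKER_HOSTS are hypothetical placeholders for whatever bootstrapper.py would actually export:

```python
# Hypothetical sketch only: RANK, NRANKS, and WORKER_HOSTS are placeholder names
# for whatever bootstrapper.py would export at runtime.
import json
import os

rank = int(os.environ["RANK"])                 # this worker's index
nranks = int(os.environ["NRANKS"])             # total number of workers
hosts = os.environ["WORKER_HOSTS"].split(",")  # one "host:port" entry per rank

# TensorFlow: build TF_CONFIG so MultiWorkerMirroredStrategy can form the cluster.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": hosts},
    "task": {"type": "worker", "index": rank},
})

# PyTorch: point torch.distributed at rank 0 as the rendezvous master.
os.environ["MASTER_ADDR"] = hosts[0].split(":")[0]
os.environ["MASTER_PORT"] = "29500"
os.environ["WORLD_SIZE"] = str(nranks)
os.environ["RANK"] = str(rank)
```

Keeping the runtime-provided variables framework-neutral like this would let the same mechanism serve TensorFlow, PyTorch, and plain data-processing steps.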

