Testing, Benchmarking, and Optimizing Horovod collectives on Jean-Zay #4

@EiffL

This issue tracks the development needed to finalize and validate our modified version of Horovod. It is an overarching goal that encompasses several smaller issues.

Goal

By the end of the hack week, have tested code with an associated pull request to https://github.com/horovod/horovod that fully supports our needs for Mesh TensorFlow.

Context

Together with @kimchitsigai and @mypey, we worked on modifications to Horovod that support multiple communicators. A description of what we did can be found here: DifferentiableUniverseInitiative/horovod#2
In parallel, a different approach to supporting multiple groups of devices was proposed here: horovod/horovod#2839
In the end, one of these two implementations will probably be merged, but we can try to determine which one works best for our purposes.
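
To make "multiple groups of devices" concrete, here is a minimal sketch in plain Python (no Horovod calls) of how the ranks of a 2D device mesh could be partitioned into per-axis groups, each of which would need its own communicator. The function name `mesh_groups` and the row-major rank layout are assumptions made for illustration; they are not taken from either implementation linked above.

```python
from typing import Dict, List, Tuple


def mesh_groups(shape: Tuple[int, int]) -> Dict[str, List[List[int]]]:
    """Partition the ranks of a row-major 2D mesh into per-axis groups.

    Hypothetical helper: rank = row * cols + col is an assumed layout.
    """
    rows, cols = shape
    # One group per row: all ranks that share the same row index.
    row_groups = [[r * cols + c for c in range(cols)] for r in range(rows)]
    # One group per column: all ranks that share the same column index.
    col_groups = [[r * cols + c for r in range(rows)] for c in range(cols)]
    return {"rows": row_groups, "cols": col_groups}


if __name__ == "__main__":
    groups = mesh_groups((2, 4))
    print(groups["rows"])  # [[0, 1, 2, 3], [4, 5, 6, 7]]
    print(groups["cols"])  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

With 8 ranks on a 2x4 mesh, every rank belongs to exactly one row group and one column group, so a collective along either mesh axis has to run on a subset of ranks rather than on the global communicator.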

Participants

The main participants in this task are:

Tasks

Labels

Hackathon Goal: High level goals for the hack week
Horovod: Issues related to the horovod backend
