Conversation

@michaelbenayoun
Member

What does this PR do?

Provide collective ops on arbitrary Python objects.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@tengomucho tengomucho left a comment

Can you provide more information about why this change is necessary and what it affects?
I do not see the training workflow running, and I do not really know what this changes for training.

"""
Broadcasts arbitrary objects across XLA-distributed processes.
Returns the object from the source rank on all ranks.
If `groups` is specified, broadcast is done separately in each group, and the `src` rank is relative to each group.
Collaborator

Consider mentioning that the object needs to be pickle-compatible.

Member Author

Done!
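
For readers following the thread, here is a minimal single-process sketch of the semantics the docstring describes: a `src` rank that is relative to each group, and the pickle-compatibility requirement raised in the review. It is not the PR's implementation and makes no XLA calls. `simulate_broadcast_object` is a hypothetical name, list index `i` stands in for rank `i`, and `pickle` is assumed as the serialization step only because the review mentions pickle compatibility.

```python
import pickle
from typing import Any, List, Optional


def simulate_broadcast_object(
    objects_per_rank: List[Any],
    src: int = 0,
    groups: Optional[List[List[int]]] = None,
) -> List[Any]:
    """Simulate an object broadcast in one process (hypothetical helper).

    objects_per_rank[i] is the object held by rank i before the broadcast.
    Returns the object each rank holds afterwards.
    """
    world_size = len(objects_per_rank)
    if groups is None:
        # No groups given: treat all ranks as a single group.
        groups = [list(range(world_size))]

    results = list(objects_per_rank)
    for group in groups:
        # `src` is an index into the group, not a global rank.
        source_rank = group[src]
        # Serialization fails here if the object is not pickle-compatible.
        payload = pickle.dumps(objects_per_rank[source_rank])
        for rank in group:
            # Every rank in the group deserializes its own copy.
            results[rank] = pickle.loads(payload)
    return results


# Two groups of two ranks; src=1 selects global rank 1 and global rank 3.
per_rank = [{"step": 1}, "b", [3], None]
out = simulate_broadcast_object(per_rank, src=1, groups=[[0, 1], [2, 3]])
```

In this sketch ranks 0 and 1 end up with rank 1's object and ranks 2 and 3 end up with rank 3's object, since `src=1` is resolved inside each group. An object that cannot be pickled (a lambda, for example) makes the `pickle.dumps` call raise before anything is sent.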

@michaelbenayoun
Member Author

Can you provide more information about why this change is necessary and what it affects? I do not see the training workflow running, and I do not really know what this changes for training.

It's a set of features required to transfer non-tensor values during training. It is needed for GRPO. I just broke down the GRPO PR (#1020) into smaller PRs.

4 participants