
@khatwanimohit khatwanimohit commented Oct 28, 2025

Description

All credit for this PR goes to @jonb377 and @ZacharyGarrett!

This PR introduces Distributed Low-Communication (DiLoCo) training, a technique to reduce communication overhead in
distributed model training. It achieves this by synchronizing model parameters periodically, rather than at every step,
improving efficiency for large models.
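The periodic-synchronization scheme described above can be sketched on a toy problem. This is a hypothetical illustration, not the PR's implementation (which lives in src/MaxText/diloco.py); the DiLoCo paper uses AdamW for the inner optimizer and Nesterov-momentum SGD for the outer one, but plain SGD is used in both loops here for brevity:

```python
def diloco_round(global_params, grads, inner_steps,
                 inner_lr=0.1, outer_lr=1.0):
    """One DiLoCo round (toy sketch, scalar parameters only).

    Each replica takes `inner_steps` local SGD steps starting from the
    shared global parameter; no communication happens during these
    steps. The replicas then average their parameter deltas once, and
    the outer optimizer applies the averaged delta.
    """
    deltas = []
    for grad_fn in grads:                        # one gradient fn per replica
        local = global_params
        for _ in range(inner_steps):
            local -= inner_lr * grad_fn(local)   # inner (local) update
        deltas.append(global_params - local)     # the "outer gradient"
    avg_delta = sum(deltas) / len(deltas)        # the only sync point
    return global_params - outer_lr * avg_delta  # outer SGD step
```

With `inner_steps` set to, say, 10, the replicas communicate ten times less often than standard data-parallel training, which is the efficiency gain the PR description refers to.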

Notice 1: Once all tests pass, the "pull ready" label will automatically be assigned.
This label is used for administrative purposes. Please do not add it manually.

Notice 2: For external contributions, our settings currently require an approval from a MaxText maintainer to trigger CI tests.

Tests

A unit test with a simple model; more tests with the trainer will come in an upcoming PR.
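A test in that style can be mimicked on a toy quadratic problem (the names below are hypothetical, not taken from the PR's diloco_test.py). One useful invariant to assert: with one inner step per round and an outer learning rate of 1.0, a DiLoCo round with plain SGD in both loops reduces to an ordinary synchronous data-parallel SGD step.

```python
def local_sgd_round(p, targets, inner_steps, inner_lr=0.1, outer_lr=1.0):
    # One DiLoCo-style round on per-replica quadratic losses
    # 0.5 * (p - t) ** 2, whose gradient is simply (p - t).
    deltas = []
    for t in targets:
        local = p
        for _ in range(inner_steps):
            local -= inner_lr * (local - t)
        deltas.append(p - local)
    return p - outer_lr * sum(deltas) / len(deltas)

def sync_sgd_step(p, targets, lr=0.1):
    # Ordinary synchronous data-parallel SGD: average gradients each step.
    return p - lr * sum(p - t for t in targets) / len(targets)

# With inner_steps=1 and outer_lr=1.0, one DiLoCo round equals one
# synchronous step, so the two trajectories should match exactly.
a = b = 10.0
for _ in range(5):
    a = local_sgd_round(a, [0.0, 2.0], inner_steps=1)
    b = sync_sgd_step(b, [0.0, 2.0])
```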

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@khatwanimohit khatwanimohit changed the title [Diloco] add diloco related utils [WIP][Diloco] add diloco related utils Oct 28, 2025
@khatwanimohit khatwanimohit force-pushed the mohit/diloco_utils branch 9 times, most recently from 534c6c9 to 6dc6108 Compare October 29, 2025 20:02
@github-actions

🤖 Hi @khatwanimohit, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@khatwanimohit khatwanimohit changed the title [WIP][Diloco] add diloco related utils [Diloco] add diloco related utils Oct 29, 2025

@github-actions github-actions bot left a comment


📋 Review Summary

This pull request introduces Distributed Low-Communication (DiLoCo) training, a technique for efficient distributed training of large models. The implementation looks solid and is accompanied by a comprehensive unit test. The changes to configuration and utility functions are appropriate for integrating this new feature.

🔍 General Feedback

  • The core logic in src/MaxText/diloco.py is well-structured and follows the principles outlined in the referenced papers.
  • The addition of a detailed unit test in tests/diloco_test.py is excellent and greatly helps in verifying the correctness of the implementation.
  • One potential issue was identified in the sharding logic, for which a suggestion has been provided.

Overall, this is a great contribution that adds a valuable feature to MaxText.

@khatwanimohit khatwanimohit force-pushed the mohit/diloco_utils branch 2 times, most recently from c5281c1 to 3aac2cc Compare November 4, 2025 19:11

github-actions bot commented Nov 4, 2025

@github-actions github-actions bot left a comment


📋 Review Summary

This Pull Request introduces Distributed Low-Communication (DiLoCo) training utilities and integrates them into the MaxText configuration. The implementation appears sound, and the accompanying unit tests provide good coverage for the core functionality.

🔍 General Feedback

  • The addition of drjax and the other dependency updates is appropriate for the new feature.
  • The configuration changes in base.yml and pyconfig.py correctly expose and handle the new DiLoCo parameters.
  • The new diloco.py module is well-structured and implements the DiLoCo algorithm effectively.
  • The diloco_test.py provides a thorough simulation of the DiLoCo training process, with clear explanations of expected values.
