Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
    let parameter_server =
        Worker::create(allocated_parameter_servers[0].clone(), network.clone()).await;
Guard parameter server creation against empty allocation results
The code dereferences allocated_parameter_servers[0] to create parameter_server before confirming that the allocation returned any items. When allocation fails or times out, the match arm returns an empty Vec, so indexing at [0] will panic before the subsequent allocated_parameter_servers.len() == 1 guard executes. This will crash the scheduler instead of allowing graceful handling of the allocation failure. Construct the Worker only after verifying that at least one parameter server was allocated or use safe indexing with early return.
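The suggested guard can be expressed as a minimal, self-contained sketch. The function name and the `String` item type are illustrative assumptions, not Hypha's actual types:

```rust
// Hypothetical sketch of the suggested guard: `slice::first()` returns
// `None` for an empty allocation result, so the scheduler can handle the
// failure gracefully instead of panicking on `allocated[0]`.
fn pick_parameter_server(allocated: &[String]) -> Option<String> {
    allocated.first().cloned()
}

fn main() {
    // Empty allocation (failure/timeout case): no panic, just None.
    let empty: Vec<String> = Vec::new();
    assert!(pick_parameter_server(&empty).is_none());

    // Exactly one parameter server allocated: the happy path.
    let one = vec!["ps-0".to_string()];
    assert_eq!(pick_parameter_server(&one).as_deref(), Some("ps-0"));
    println!("ok");
}
```

With this shape, the existing `allocated_parameter_servers.len() == 1` check can still run afterwards; the indexing itself can no longer crash the scheduler.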
l45k
left a comment
I think there are some places that need a bit of cleaning and polishing.
Force-pushed 22bdf86 to a7ef150
Force-pushed 687713f to 972d06e
l45k
left a comment
There is only a small rebase artifact. Otherwise it is good to go.
nfnt
left a comment
In my opinion, moving the responsibility of configuring executors to the users breaks the concepts currently provided by Hypha:
Hypha knows how to do DiLoCo because it bundles the logic for it in its scheduler and workers. The driver approach in the workers provides the necessary implementations for DiLoCo-specific parameter servers and workers. Admittedly, the worker driver currently depends on a complicated Python setup, making it hard to bundle and set up. Still, everything needed to run DiLoCo is provided by Hypha itself.
This PR changes this, so that part of a DiLoCo setup has to be done (and understood) by users for a rather complex setup step. IMO, Hypha itself should take care of this complicated step; this is a problem of how we bundle and set up Python-based drivers, also considering different hardware, e.g. ROCm vs CUDA.
In the long term this makes more sense to me, though: I can imagine Hypha providing different building blocks for all kinds of distributed ML tasks and enabling users to configure Hypha to their needs.
It's not just motivated by ROCm vs CUDA, but also by how we install and set up accelerate, or where caches are written to; some of that can be configured via environment variables. I'd argue that the current hard-coded setup is rather brittle and lacks the ability to align the executor with the hardware configuration.

That being said, I agree that this PR seemingly changes a few fundamental things, but one may also argue that in the current form one does need to understand the executor in detail to adjust it. This change just decouples the executor so that it is easy to define and configure. As the cherry on top, we solve the whole Python bundling and setup problem along with it.

To be clear though, Hypha will continue to ship official, supported executors as turnkey solutions, as they're an intricate set that needs to be well tuned to work together. For the future, and for advanced users, this change will allow for simple addition of different executor types, i.e. enabling different training regimes.
Bring artifact headers, progress messages, and scheduler trackers back to `u32` for count-based fields so they match the wire format and avoid serde/int mismatch issues, while keeping `u64` for time-based ones.
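As an illustration of the mismatch this guards against, consider a count encoded as a fixed-width little-endian integer (a hypothetical stand-in for the actual wire format, not Hypha's real codec): a `u32` writer and a `u64` reader disagree on byte width, so decoding fails instead of yielding a correct count.

```rust
// Hypothetical sketch: a count written as 4 little-endian bytes (u32)
// cannot be decoded by a reader that expects 8 bytes (u64).
fn encode_count_u32(n: u32) -> Vec<u8> {
    n.to_le_bytes().to_vec()
}

fn decode_count_u64(bytes: &[u8]) -> Option<u64> {
    // try_into succeeds only if the slice is exactly 8 bytes long.
    Some(u64::from_le_bytes(bytes.try_into().ok()?))
}

fn main() {
    let wire = encode_count_u32(7); // 4 bytes on the wire
    assert_eq!(wire.len(), 4);
    // The width mismatch surfaces as a decode failure, not a wrong value.
    assert!(decode_count_u64(&wire).is_none());
    println!("ok");
}
```

Keeping count fields at `u32` on both ends sidesteps this class of error, while time-based fields stay `u64` for range.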
Replace the hard-coded driver field with named executor descriptors, wiring workers to advertise executors, and overhaul the scheduler config to align it with the executor types. This allows for full control over the executor configuration, enabling users to customize the behavior according to their specific needs or environment requirements, such as supporting AMD instead of NVIDIA.

Closes #80

Co-Authored-By: ChatGPT <openai@users.noreply.github.com>
Force-pushed 972d06e to f2f88f2
Replace the hard-coded driver field with named executor descriptors, wiring workers to advertise executors, and overhaul the scheduler config to align it with the executor types.
This allows for full control over the executor configuration, enabling users to customize the behavior according to their specific needs or environment requirements, such as supporting AMD instead of NVIDIA.
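For illustration only, a named executor descriptor might carry data along these lines; every field name here is an assumption made for the sketch, not Hypha's actual schema:

```rust
// Hypothetical executor descriptor: a worker could advertise one of these
// per supported executor, and the scheduler matches jobs against the name.
#[derive(Debug, Clone, PartialEq)]
struct ExecutorDescriptor {
    name: String,               // e.g. "diloco-worker"
    command: Vec<String>,       // entry point the worker launches
    env: Vec<(String, String)>, // environment overrides (caches, ROCm/CUDA)
}

fn main() {
    let rocm = ExecutorDescriptor {
        name: "diloco-worker".into(),
        command: vec!["python".into(), "-m".into(), "worker".into()],
        env: vec![("HIP_VISIBLE_DEVICES".into(), "0".into())],
    };
    // A worker advertising this descriptor targets ROCm instead of CUDA
    // purely through configuration, without code changes.
    assert_eq!(rocm.name, "diloco-worker");
    println!("ok");
}
```

The point of the decoupling is that swapping NVIDIA for AMD becomes a matter of editing such a descriptor rather than modifying a hard-coded driver.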
N.B. This PR is based on #108 and #115.

Closes #80