feat: support executor configs#112

Merged
orlandohohmeier merged 2 commits into alpha from orlandohohmeier/driver-config
Nov 11, 2025

Conversation

@orlandohohmeier
Contributor

@orlandohohmeier orlandohohmeier commented Nov 4, 2025

Replace the hard-coded driver field with named executor descriptors, wiring workers to advertise executors, and overhaul the scheduler config to align it with the executor types.

This allows for full control over the executor configuration, enabling users to customize the behavior according to their specific needs or environment requirements, such as supporting AMD instead of NVIDIA.
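For illustration, a named executor descriptor could be modeled roughly like this. This is a hypothetical sketch only; the type name, fields, and example values below are assumptions, not the actual Hypha schema:

```rust
use std::collections::HashMap;

// Hypothetical descriptor a worker could advertise, replacing a single
// hard-coded `driver` field with a named, per-executor configuration.
#[derive(Clone, Debug)]
struct ExecutorDescriptor {
    name: String,                 // e.g. "torch-cuda" or "torch-rocm"
    env: HashMap<String, String>, // environment overrides, e.g. HF_HOME
    extras: Vec<String>,          // optional package extras to install
}

fn main() {
    let rocm = ExecutorDescriptor {
        name: "torch-rocm".to_string(),
        env: HashMap::from([("HF_HOME".to_string(), "/data/hf-cache".to_string())]),
        extras: vec!["rocm".to_string()],
    };
    // A worker would advertise descriptors like this so the scheduler can
    // match jobs against the executor types each worker supports.
    println!("advertising executor: {}", rocm.name);
}
```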

N.B. This PR is based on #108 #115

Closes #80


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +342 to +343
let parameter_server =
Worker::create(allocated_parameter_servers[0].clone(), network.clone()).await;


P1 Badge Guard parameter server creation against empty allocation results

The code dereferences allocated_parameter_servers[0] to create parameter_server before confirming that the allocation returned any items. When allocation fails or times out, the match arm returns an empty Vec, so indexing at [0] will panic before the subsequent allocated_parameter_servers.len() == 1 guard executes. This will crash the scheduler instead of allowing graceful handling of the allocation failure. Construct the Worker only after verifying that at least one parameter server was allocated or use safe indexing with early return.

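The suggested guard can be sketched as follows. This is a minimal, self-contained illustration with placeholder types; the real `Worker::create` is async and the allocation types in the scheduler differ:

```rust
// Placeholder for the scheduler's allocated parameter-server handle.
#[derive(Clone, Debug)]
struct AllocatedServer(String);

// Return the single allocated parameter server, or an error instead of
// panicking when the allocation came back empty or over-provisioned.
fn single_parameter_server(allocated: &[AllocatedServer]) -> Result<AllocatedServer, String> {
    match allocated {
        [only] => Ok(only.clone()),
        [] => Err("allocation returned no parameter servers".to_string()),
        _ => Err(format!("expected 1 parameter server, got {}", allocated.len())),
    }
}

fn main() {
    // Empty allocation is reported as an error rather than a panic at `[0]`.
    assert!(single_parameter_server(&[]).is_err());
    let ps = single_parameter_server(&[AllocatedServer("ps-0".to_string())]);
    assert_eq!(ps.unwrap().0, "ps-0");
    println!("guard behaves as expected");
}
```

With this shape, the length check and the access happen in one place, so `Worker::create` is only reached once exactly one server is known to exist.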

Contributor

@l45k l45k left a comment


I think there are some places that need a bit of cleaning and polishing.

@orlandohohmeier orlandohohmeier force-pushed the orlandohohmeier/driver-config branch 2 times, most recently from 22bdf86 to a7ef150 Compare November 6, 2025 13:39
@juliangieseke juliangieseke changed the base branch from main to alpha November 6, 2025 14:41
@orlandohohmeier orlandohohmeier force-pushed the orlandohohmeier/driver-config branch 2 times, most recently from 687713f to 972d06e Compare November 10, 2025 13:19
@orlandohohmeier orlandohohmeier requested a review from l45k November 11, 2025 08:22
Contributor

@l45k l45k left a comment


There is only a small rebase artifact. Otherwise it is good to go.

Contributor

@nfnt nfnt left a comment


In my opinion, moving the responsibility for configuring executors to users breaks the concepts currently provided by Hypha:
Hypha knows how to do DiLoCo because it bundles the logic for it in its scheduler and workers. The driver approach in the workers provides the necessary implementations for DiLoCo-specific parameter servers and workers. Admittedly, the worker driver currently depends on a complicated Python setup, making it hard to bundle and set up. Still, everything needed to run DiLoCo is provided by Hypha itself.
This PR changes this, so that part of a DiLoCo setup, a rather complex step, has to be done (and understood) by users. IMO, Hypha itself should take care of this complicated step; this is a problem of how we bundle and set up Python-based drivers, also considering different hardware, e.g. ROCm vs. CUDA.

In the long term, this makes more sense to me though: I can imagine Hypha providing different building blocks for all kinds of distributed ML tasks and enabling users to configure Hypha to their needs.

@orlandohohmeier
Contributor Author

orlandohohmeier commented Nov 11, 2025

It's not just motivated by ROCm vs. CUDA, but also by how we install and set up accelerate, or where caches are written to; some of that can be configured via env vars like HF_HOME, but other things, like which extras to install, would require changes to the hard-coded configuration.

I'd argue that the current hard-coded setup is rather brittle and lacks the ability to align the executor with the hardware configuration.

That being said, I agree that this PR seemingly changes a few fundamental things, but one may also argue that, in its current form, one needs to understand the executor in detail to adjust it. This change just decouples the executor so that it is easy to define and configure. As the cherry on top, we solve the whole Python bundling and setup problem along with it.

To be clear though, Hypha will continue to ship official, supported executors as turnkey solutions as they're an intricate set that needs to be well tuned to work together.

For the future and advanced users, this change will allow for the simple addition of different executor types, i.e. enabling different training regimens.

orlandohohmeier and others added 2 commits November 11, 2025 15:23
Bring artifact headers, progress messages, and scheduler trackers back to `u32` for count-based fields so they match the wire format and avoid serde/int mismatch issues, while keeping `u64` for time-based ones.
Replace the hard-coded driver field with named executor descriptors, wiring workers to advertise executors, and overhaul the scheduler config to align it with the executor types.

This allows for full control over the executor configuration, enabling users to customize the behavior according to their specific needs or environment requirements, such as supporting AMD instead of NVIDIA.

Closes #80

Co-Authored-By: ChatGPT <openai@users.noreply.github.com>
@orlandohohmeier orlandohohmeier force-pushed the orlandohohmeier/driver-config branch from 972d06e to f2f88f2 Compare November 11, 2025 14:26
@orlandohohmeier orlandohohmeier requested a review from l45k November 11, 2025 14:33
Contributor

@l45k l45k left a comment


LGTM.

@orlandohohmeier orlandohohmeier merged commit c23b2e4 into alpha Nov 11, 2025
8 of 9 checks passed
@nfnt nfnt mentioned this pull request Nov 13, 2025

Development

Successfully merging this pull request may close these issues.

Task: Adjust Worker to Support Driver Configuration

3 participants