Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
    let parameter_server =
        Worker::create(allocated_parameter_servers[0].clone(), network.clone()).await;
Guard parameter server creation against empty allocation results
The code dereferences allocated_parameter_servers[0] to create parameter_server before confirming that the allocation returned any items. When allocation fails or times out, the match arm returns an empty Vec, so indexing at [0] will panic before the subsequent allocated_parameter_servers.len() == 1 guard executes. This will crash the scheduler instead of allowing graceful handling of the allocation failure. Construct the Worker only after verifying that at least one parameter server was allocated or use safe indexing with early return.
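The suggested guard can be expressed as a minimal, self-contained sketch. The function name and the `String` item type are illustrative assumptions, not Hypha's actual types:

```rust
// Hypothetical sketch of the suggested guard: `slice::first()` returns
// `None` for an empty allocation result, so the scheduler can handle the
// failure gracefully instead of panicking on `allocated[0]`.
fn pick_parameter_server(allocated: &[String]) -> Option<String> {
    allocated.first().cloned()
}

fn main() {
    // Empty allocation (failure/timeout case): no panic, just None.
    let empty: Vec<String> = Vec::new();
    assert!(pick_parameter_server(&empty).is_none());

    // Exactly one parameter server allocated: the happy path.
    let one = vec!["ps-0".to_string()];
    assert_eq!(pick_parameter_server(&one).as_deref(), Some("ps-0"));
    println!("ok");
}
```

With this shape, the existing `allocated_parameter_servers.len() == 1` check can still run afterwards; the indexing itself can no longer crash the scheduler.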
l45k
left a comment
I think there are some places that need a bit of cleaning and polishing.
Force-pushed 22bdf86 to a7ef150
Force-pushed 687713f to 972d06e
l45k
left a comment
There is only a small rebase artifact. Otherwise it is good to go.
nfnt
left a comment
In my opinion, moving the responsibility of configuring executors to the users breaks the concepts currently provided by Hypha:
Hypha knows how to do DiLoCo because it bundles the logic for it in its scheduler and workers. The driver approach in the workers provides the necessary implementations for DiLoCo-specific parameter servers and workers. Admittedly, the worker driver currently depends on a complicated Python setup, making it hard to bundle and set up. Still, everything needed to run DiLoCo is provided by Hypha itself.
This PR changes this, so that part of a DiLoCo setup has to be done (and understood) by users for a rather complex setup step. IMO, Hypha itself should take care of this complicated step; this is a problem of how we bundle and set up Python-based drivers, also considering different hardware, e.g. ROCm vs CUDA.
In the long term this makes more sense to me, though: I can imagine Hypha providing different building blocks for all kinds of distributed ML tasks and enabling users to configure Hypha to their needs.
It's not just motivated by ROCm vs CUDA, but also by how we install and set up accelerate, or where caches are written to; some of that can be configured via environment variables. I'd argue that the current hard-coded setup is rather brittle and lacks the ability to align the executor with the hardware configuration.

That being said, I agree that this PR seemingly changes a few fundamental things, but one may also argue that in the current form one does need to understand the executor in detail to adjust it. This change just decouples the executor so that it is easy to define and configure. As the cherry on top, we solve the whole Python bundling and setup problem along with it.

To be clear though, Hypha will continue to ship official, supported executors as turnkey solutions, as they're an intricate set that needs to be well tuned to work together. For the future, and for advanced users, this change will allow for simple addition of different executor types, i.e. enabling different training regimes.
Bring artifact headers, progress messages, and scheduler trackers back to `u32` for count-based fields so they match the wire format and avoid serde/int mismatch issues, while keeping `u64` for time-based ones.
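As an illustration of the mismatch this guards against, consider a count encoded as a fixed-width little-endian integer (a hypothetical stand-in for the actual wire format, not Hypha's real codec): a `u32` writer and a `u64` reader disagree on byte width, so decoding fails instead of yielding a correct count.

```rust
// Hypothetical sketch: a count written as 4 little-endian bytes (u32)
// cannot be decoded by a reader that expects 8 bytes (u64).
fn encode_count_u32(n: u32) -> Vec<u8> {
    n.to_le_bytes().to_vec()
}

fn decode_count_u64(bytes: &[u8]) -> Option<u64> {
    // try_into succeeds only if the slice is exactly 8 bytes long.
    Some(u64::from_le_bytes(bytes.try_into().ok()?))
}

fn main() {
    let wire = encode_count_u32(7); // 4 bytes on the wire
    assert_eq!(wire.len(), 4);
    // The width mismatch surfaces as a decode failure, not a wrong value.
    assert!(decode_count_u64(&wire).is_none());
    println!("ok");
}
```

Keeping count fields at `u32` on both ends sidesteps this class of error, while time-based fields stay `u64` for range.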
Replace the hard-coded driver field with named executor descriptors, wiring workers to advertise executors, and overhaul the scheduler config to align it with the executor types. This allows for full control over the executor configuration, enabling users to customize the behavior according to their specific needs or environment requirements, such as supporting AMD instead of NVIDIA.

Closes #80

Co-Authored-By: ChatGPT <openai@users.noreply.github.com>
Force-pushed 972d06e to f2f88f2
Replace the hard-coded driver field with named executor descriptors, wiring workers to advertise executors, and overhaul the scheduler config to align it with the executor types.
This allows for full control over the executor configuration, enabling users to customize the behavior according to their specific needs or environment requirements, such as supporting AMD instead of NVIDIA.
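For illustration only, a named executor descriptor might carry data along these lines; every field name here is an assumption made for the sketch, not Hypha's actual schema:

```rust
// Hypothetical executor descriptor: a worker could advertise one of these
// per supported executor, and the scheduler matches jobs against the name.
#[derive(Debug, Clone, PartialEq)]
struct ExecutorDescriptor {
    name: String,               // e.g. "diloco-worker"
    command: Vec<String>,       // entry point the worker launches
    env: Vec<(String, String)>, // environment overrides (caches, ROCm/CUDA)
}

fn main() {
    let rocm = ExecutorDescriptor {
        name: "diloco-worker".into(),
        command: vec!["python".into(), "-m".into(), "worker".into()],
        env: vec![("HIP_VISIBLE_DEVICES".into(), "0".into())],
    };
    // A worker advertising this descriptor targets ROCm instead of CUDA
    // purely through configuration, without code changes.
    assert_eq!(rocm.name, "diloco-worker");
    println!("ok");
}
```

The point of the decoupling is that swapping NVIDIA for AMD becomes a matter of editing such a descriptor rather than modifying a hard-coded driver.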
N.B. This PR is based on #108 and #115.

Closes #80