Network Vtable #366

lightsighter · 2025-11-21T03:38:59Z

Proposed interface for a network vtable.

One open question is to define the thread-safety of the callbacks. Will Realm provide the thread safety or will the client need to do that?

Another open question: do we think we'll ever need variable-sized key value data? Currently the interface assumes we'll always know the resulting value size in advance before doing a get call.

codecov · 2025-11-21T03:46:17Z

❌ 1 Tests Failed:

Tests completed	Failed	Passed	Skipped
1440	1	1439	0

View the top 1 failed test(s) by shortest run time

example_vtable::example_vtable
Stack Traces | 0.543s run time
Failed

To view more test analytics, go to the Test Analytics Dashboard
_{📋 Got 3 mins? Take this short survey to help us improve Test Analytics.}

sbahirnv

I like the interface!

I will try to come up with mini-app which would use this interface to bootstrap Realm.

src/realm/runtime.h

sbahirnv · 2025-11-21T07:47:38Z

src/realm/runtime.h

+      bool (*get)(const void *key, size_t key_size, void *value,
+                  size_t value_size) = nullptr;
+      ////////////////////////
+      // OPTIONAL FUNCTIONS //


I am thinking if we would need init() and shutdown() allowing client to do realm specific setup if any.

I don't think you need that because I think anything you'd need to do would be done in the join and leave methods. If we do conclude we need additional methods we can add them later, but let's not speculate.

In general this vtable is what I had in mind but it's proposed differently by @lightsighter . The join/leave being the callbacks to react on membership changes - this is clear. That would be a watch(key, callback) interface in my vtable which is more generic and this is more specialized for bootstrap, that's fine.

Where is the method that the bootstrapping process suppose to call and block on until the membership service actually completed the admission protocol? I don't understand how this would be done without calling something like explicit vtable->init(args, local_ucp_worker_address)

Where is the method that the bootstrapping process suppose to call and block on until the membership service actually completed the admission protocol?

That shouldn't be necessary. The admission protocol should be implemented only using puts and gets, nothing else should be required (except maybe an atomic compare and swap, although I think this is optional). Puts and gets will do their thing. It is up to the admission protocol to be robust to processes joining and leaving. You can continuously poll on keys and values as necessary. The bootstrapping protocol does not need to be fast, it just needs to be correct and it should require clients to provide the bare minimum amount of functionality.

how are you going to implement a group join here? where you have a group of X processes that are elastically joining your existing job and for example the requirement will be you need to wait for all X processes to be admitted as a group?

how are you going to implement a group join here? where you have a group of X processes that are elastically joining your existing job and for example the requirement will be you need to wait for all X processes to be admitted as a group?

See the updated interface.

lightsighter · 2025-11-21T11:14:08Z

Another thing that I'm waffling on a bit is whether to pull the optional methods for join and leave into a separate interface that can be registered dynamically while execution is taking place and will support multiple "subscribers" for notifications of other processes joining and leaving. I think this is probably we're going to want so that different Realm clients can all subscribe without having to coordinate with each other. The bootstrap callbacks don't have this requirement obviously since there's only one client that is ultimately ever going to start Realm in a process.

src/realm/runtime.h

apryakhin · 2025-11-22T02:38:30Z

src/realm/runtime.h

+      // live longer than the function call. The function should return true if
+      // the put succeeds and false if it doesn't. If the call fails then it is
+      // likely that the network initialization might not succeed.
+      bool (*put)(const void *key, size_t key_size, const void *value,


That's a good definition. I'd like to discuss blocking vs non-blocking and if non-blocking then returning a way to know when this put completes.

I don't think any of these callbacks should be non-blocking. As I mentioned elsewhere, the callback interface should require the bare minimum number of calls necessary for the bootstrap protocol to be implemented correctly. All that should be required to do that are blocking put and get. The callbacks should block until the put/get finishes and then return the result. Realm must ensure forward progress as mentioned in the documentation, so we should expect that these calls can and will block for some amount of time.

sbahirnv · 2025-11-22T08:16:04Z

Another thing that I'm waffling on a bit is whether to pull the optional methods for join and leave into a separate interface that can be registered dynamically while execution is taking place and will support multiple "subscribers" for notifications of other processes joining and leaving. I think this is probably we're going to want so that different Realm clients can all subscribe without having to coordinate with each other. The bootstrap callbacks don't have this requirement obviously since there's only one client that is ultimately ever going to start Realm in a process.

If I understand correctly: one process links against librealm.so, but different parts of that process can register for join/leave notifications. Only one part actually calls init()/start() though. If so, I like the idea of separating join/leave into its own interface because different parts need different things. Like, a monitoring system might only care about tracking who joined and left, while the main app is using Realm to actually do work. Both can do their thing independently. It makes the registration a bit messier since multiple parts are registering, but with added benfits.

lightsighter · 2025-11-22T08:49:17Z

Another thing that I'm waffling on a bit is whether to pull the optional methods for join and leave into a separate interface that can be registered dynamically while execution is taking place and will support multiple "subscribers" for notifications of other processes joining and leaving. I think this is probably we're going to want so that different Realm clients can all subscribe without having to coordinate with each other. The bootstrap callbacks don't have this requirement obviously since there's only one client that is ultimately ever going to start Realm in a process.

If I understand correctly: one process links against librealm.so, but different parts of that process can register for join/leave notifications. Only one part actually calls init()/start() though. If so, I like the idea of separating join/leave into its own interface because different parts need different things. Like, a monitoring system might only care about tracking who joined and left, while the main app is using Realm to actually do work. Both can do their thing independently. It makes the registration a bit messier since multiple parts are registering, but with added benfits.

Yes, many things can be using Realm, although there is just one that needs to be responsible for starting Realm. Multiple clients of the same Realm (e.g. both Flash Cache and Dynamo) might want to subscribe to notifications about processes joining and leaving the Realm independently.

sbahirnv

Interface looks great to me!

Before merging though, we need the network_init() implementation or it won't link. Want me to implement it based on your interface design(I can fork from this branch and continue), or are you planning to do that?

I'm going to test it to bootstrap UCX backend.

src/realm/runtime.h

lightsighter · 2025-11-24T22:55:39Z

One other note on this is that I removed the join/leave callbacks. I don't actually think they are necessary once you look at the machine subscription interface. The machine subscription interface does need a bit of work, but I think we can handle that separately from this API.

This reverts commit b2590f6.

apryakhin · 2025-11-25T23:19:32Z

Another thing that I'm waffling on a bit is whether to pull the optional methods for join and leave into a separate interface that can be registered dynamically while execution is taking place and will support multiple "subscribers" for notifications of other processes joining and leaving. I think this is probably we're going to want so that different Realm clients can all subscribe without having to coordinate with each other. The bootstrap callbacks don't have this requirement obviously since there's only one client that is ultimately ever going to start Realm in a process.

First of all we probably don't need separate join and leave but just a single callback with an action type. Second is that with the "watch(prefix, callback)" interface you accomplish exactly that. The registration is dynamic...anywhere inside realm at any time you should be able to subscribe to notifications. The semantics of watch and to implement efficiently is not easy though.

However, I might have misunderstood the reasoning behind the need to remove that at all or rather move to another interface.

I think there are number of different ways to do the vtable in general and since you started thinking about it I see that it diverges somewhat from my original thinking (and that's okay). I still cannot understand by how much. Although, we maybe need to sync and discuss again the semantics of it to make sure we are on the same page here.

apryakhin · 2025-11-25T23:21:36Z

One other note on this is that I removed the join/leave callbacks. I don't actually think they are necessary once you look at the machine subscription interface. The machine subscription interface does need a bit of work, but I think we can handle that separately from this API.

What would be helpfull is a little pseudo-code snippet that would show-case how this vtable is going to be used. Perhaps that's not on you but since you are taking a stab at this interface anyways .

lightsighter · 2025-11-26T10:14:41Z

However, I might have misunderstood the reasoning behind the need to remove that at all or rather move to another interface.

The reasoning is that we actually already have an interface for this in the Machine class. It looks like it is not fully implemented, and it definitely needs some work, but I think I prefer that than making it part of the bootstrapping interface.

What would be helpfull is a little pseudo-code snippet that would show-case how this vtable is going to be used.

We can discuss this in the meeting tomorrow, but I may be willing to do the first-pass implementation of the UCX bootstrap using this interface.

apryakhin · 2025-11-26T15:02:38Z

src/realm/runtime.h

+      // size is less than or equal to the value_size passed in by Realm, then
+      // the value buffer should be populated and the value_size updated with
+      // the actual size of the value found. If the value is not found or is
+      // too large for the specified buffer then value_size should be set to


I don't understand the case when the actual value of the entry inside the kv-store is larger that provided buffer. How the caller is going to know about the right size?

We discussed this in the meeting, but just answering for completeness in case anyone else is observing: the caller in this case for both puts and gets will be Realm, so Realm will know what the types of the values are that it has associated with particular keys, so therefore it will also know the expected size. For example, the blobs of UCX data should all be the same size, so each process will know how big the blob of UCX data it put was and therefore will know how big each of the expected UCX blobs should be when it goes to retrieve them. Even if in the future we don't know the expected size of some value, we can use this interface to find out. We can do one call to get the key with a zero-sized buffer, the call will fail, and report back the expected size that is larger than zero. At that point, we can allocate a buffer of that size and do the get call again to extract the data.

Yes, returning the required size in case the provided buffer is larger is what I would do.

Just FYI:

the blobs of UCX data should all be the same size

This statement is not accurate based on what @SeyedMir told me.

apryakhin · 2025-11-26T21:49:11Z

PMIX:

Add PMIX #349

apryakhin · 2025-12-01T14:40:18Z

src/realm/runtime.h

+    // impacting forward progress.
+    struct NetworkVtable {
+      ////////////////////////
+      // REQUIRED FUNCTIONS //


We haven't yet refactored the comments in this public header but in general we use doxygen format. Just making a note here that probably should reformatted going forwards.

…le in the runtime singleton

apryakhin · 2025-12-10T16:57:39Z

src/realm/runtime_impl.cc

+  bool RuntimeImpl::network_init(int *argc, char ***argv,
+                                 const Runtime::NetworkVtable &vtable)
  {
+#if defined(REALM_USE_UCX) || defined(REALM_USE_MPI) || defined(REALM_USE_GASNET1) ||    \


my understanding is that we are testing only with the UCX. If UCX is enabled then we should expect the vtable and otherwise fail. All other case should fallback to legacy realm bootstrap.

Does our understanding here diverge on how this should work?

No, UCX can still bootstrap itself using other mechanisms that it knows about if no vtable is provided.

apryakhin · 2025-12-10T17:19:17Z

src/realm/ucx/ucp_internal.cc

+          bool success = false;
+          // Try this up to 100 times, if we don't succeed then
+          // we'll time out and fail to join
+          for(unsigned idx = 0; idx < 100; idx++) {


nit: under heavy contention the value might be too small... also move it maybe out and "constexpr define it"

I will pull it out into a constexpr, but it needs to be a finite value. I don't want it looping forever, there must be a timeout. Perhaps we can actually just time it with a wall-clock timer.

apryakhin · 2025-12-10T17:24:44Z

src/realm/ucx/ucp_internal.cc

+          // we'll time out and fail to join
+          for(unsigned idx = 0; idx < 100; idx++) {
+            if(runtime->network_vtable_cas(key.data(), key.size(), &offset, &offset_size,
+                                           &desired, sizeof(desired))) {


desired here is always an old value after the failure - should that be recomputed otherwise this CAS will never succeed ?

Indeed, I do recompute it later in the loop, see line 869.

apryakhin · 2025-12-10T17:27:02Z

src/realm/ucx/ucp_internal.cc

+        uint64_t offset = 0;
+        if(Network::my_node_id == 0) {
+          // Do the work to do the CAS to bump the number of processes
+          constexpr std::string_view key("realm_total_spaces");


the key suppose to be written as part of cas?

No, key's should be immutable.

The API reflects the fact that the implementer of the vtable API is not allowed to mutate keys.

apryakhin · 2025-12-10T17:34:35Z

src/realm/ucx/ucp_internal.cc

+        Network::max_node_id += offset;
+      }
+      // TODO: what to do about processes that have already left
+      Network::all_peers.add_range(0, Network::max_node_id);


the same range has to be deleted from the peers someplace elsewhere right? Is that something still a TODO for this change?

I'm not handling elasticity in this PR. I'm setting up for it, but not implementing it.

apryakhin · 2025-12-10T17:40:14Z

src/realm/ucx/bootstrap/bootstrap.cc

+          return 1;
+        }
+        // Synchronize to make sure everyone is done
+        if(!runtime->network_vtable_bar())


I recall that availability bar and cas is that how you decide whether we are static or elastic. Or is it only the presence of cas and bar must be defined at all times?

Read the comment here: https://github.com/StanfordLegion/realm/blob/mbauer-network-vtable/src/realm/runtime.h#L117-L129

apryakhin · 2025-12-10T17:43:50Z

src/realm/ucx/ucp_internal.cc

      Network::max_node_id = ucc_comm->get_world_size() - 1;
-      Network::all_peers.add_range(0, ucc_comm->get_world_size() - 1);
-      Network::all_peers.remove(ucc_comm->get_rank());
+      if(runtime->network_vtable_elastic()) {


NIT: The changes are missing to maintain endpoints - the UCCBcast I assume will only be group-local, given we maintain a legacy code path that populates the endpoint map. I know that PR is still WIP but just to make sure we are aligned, another words that it's on missing on purpose but just hasn't been done yet.

Again, I'm not implementing elasticity here. Even when we do implement elasticity though, each group of processes will still always have a UCC communicator between all the processes in the group (even processes that join in isolation will make a UCC communicator containing just themselves). We'll use this internally in the UCX module to aid in doing communication and synchronization for the processes in a group as part of them joining and leaving together.

apryakhin · 2025-12-10T17:48:20Z

src/realm/runtime_impl.cc

+    return static_cast<RuntimeImpl *>(impl)->network_init(argc, argv, vtable);
+  }

+  bool Runtime::network_init(const NetworkVtable &vtable)


NIT: void RuntimeImpl::start(void) will have to be refactored. Perhaps it would make sense to add a bunch of TODOs while you keep a working context in your brain on what needs to be done, so someone can pick it up later.

As I said in the Realm meeting today, I'm still writing up all the things that have to happen for elasticity. That is beyond the scope of this PR.

apryakhin · 2025-12-10T17:52:22Z

src/realm/runtime.h

+                  const void *desired, size_t desired_size, const void *vtable_data,
+                  size_t vtable_data_size) = nullptr;
+    };
+    bool network_init(const NetworkVtable &vtable);


How is the "default" implementation will be provided? For example whatever @sbahirnv is going to add for PMIX? Another words, when an appplication program is lazy, doesn't know what it wants and doesn't want to understand or deal with the vtable except to just pick something up by default.

They can pass in an empty vtable and it will fall-back to the existing bootstrap mechanisms that we already have for a given network module. If any of them succeeds then everything is good. If they all fail, then we die the same way we do today.

apryakhin · 2025-12-10T17:54:07Z

src/realm/ucx/ucp_internal.cc

      Network::max_node_id = ucc_comm->get_world_size() - 1;
-      Network::all_peers.add_range(0, ucc_comm->get_world_size() - 1);
-      Network::all_peers.remove(ucc_comm->get_rank());
+      if(runtime->network_vtable_elastic()) {


We need to cross-examine this, make sure it works with Kubernetes control plane.

I think that's outside the scope of Realm. We don't care how people implement the vtable. If you look at how network_vtable_elastic is implemented, all it does is check for the presences of the cas function. If we have such a function, then the client has given us all we need to be elastic.

eddy16112 · 2025-12-11T06:21:16Z

src/realm/runtime.h

+       * if the key exists or not. If the callback fails then the network
+       * initialization might not succeed.
+       */
+      bool (*get)(const void *key, size_t key_size, void *value, size_t *value_size,


PMIx_Get needs the target rank where to get the data. How do you plan to align it with your get API?

That's up to the client to handle on their implementation of the vtable interface. If they're going to use PMIx, then they need to get the parameters to pass to it by themselves without help from Realm.

Just to be explicitly clear about how this should work: the client will call PMIx_Init in the main thread before doing anything. They will get back a process ID that PMIx assigns. They then fill in the vtable with their callbacks. They store the process ID that PMIx gave them (along with any other data they want) in the vtable_data. Then they call Realm's Runtime::network_init and pass in the vtable. Realm makes a copy of the vtable (including the vtable_data buffer) and then goes about doing its initialization. Realm will do whatever callbacks it needs to through the vtable interface. Each callback gets passed the vtable_data buffer. The callbacks know how to fish out the process ID from the vtable_data buffer that was packed by the client in the first place. They then use that information to do calls into PMIx. Realm doesn't need to know about any of this and doesn't need to tell clients how to interpret their data.

Adds a MPI-based NetworkVtable example that demonstrates how to implement the bootstrap interface. The example builds and runs with `mpirun -n 2` successfully.

realm: proposed interface for a vtable for a network bootstrap

f4aa019

lightsighter requested review from apryakhin and sbahirnv November 21, 2025 03:47

realm: more documentation for network vtable interface

dd31837

sbahirnv reviewed Nov 21, 2025

View reviewed changes

realm: fix formatting

67673a9

apryakhin reviewed Nov 22, 2025

View reviewed changes

src/realm/runtime.h Show resolved Hide resolved

apryakhin reviewed Nov 22, 2025

View reviewed changes

sbahirnv approved these changes Nov 22, 2025

View reviewed changes

src/realm/runtime.h Show resolved Hide resolved

src/realm/runtime.h Outdated Show resolved Hide resolved

sbahirnv mentioned this pull request Nov 24, 2025

App realm bootstrap vtable #364

Closed

lightsighter added 3 commits November 24, 2025 13:47

realm: more proposed updates to the vtable for bootstrapping callbacks

09f1b55

realm: more updated documentation for network vtable

79f1ba4

realm: fix formatting

d5ada1d

lightsighter added 4 commits November 24, 2025 19:11

prealm: small prealm bug fix

b2590f6

Revert "prealm: small prealm bug fix"

ae97119

This reverts commit b2590f6.

realm: fix minor typo in vtable interface

dd9e53e

realm: fix formatting

05edc0b

apryakhin reviewed Nov 26, 2025

View reviewed changes

apryakhin reviewed Dec 1, 2025

View reviewed changes

realm: switch to doxygen comments

fd97d7b

apryakhin approved these changes Dec 1, 2025

View reviewed changes

lightsighter added 4 commits December 4, 2025 23:58

Merge branch 'main' into mbauer-network-vtable

803fe38

realm: initial implementation of general methods for the network vtab…

d9e8066

…le in the runtime singleton

realm: initial support for using the vtable to bootstrap ucx

b79e726

ucx: apparently the ucx bootstrap handle class must be pure C

f488a35

apryakhin reviewed Dec 10, 2025

View reviewed changes

eddy16112 reviewed Dec 11, 2025

View reviewed changes

Add MPI based vtable example (#377)

c87ee69

Adds a MPI-based NetworkVtable example that demonstrates how to implement the bootstrap interface. The example builds and runs with `mpirun -n 2` successfully.

Network Vtable #366

Are you sure you want to change the base?

Network Vtable #366

Uh oh!

Conversation

lightsighter commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

❌ 1 Tests Failed:

Uh oh!

sbahirnv left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

apryakhin Nov 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

apryakhin Nov 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lightsighter commented Nov 21, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sbahirnv commented Nov 22, 2025

Uh oh!

lightsighter commented Nov 22, 2025

Uh oh!

sbahirnv left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

lightsighter commented Nov 24, 2025

Uh oh!

apryakhin commented Nov 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

apryakhin commented Nov 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lightsighter commented Nov 26, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

apryakhin commented Nov 26, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lightsighter commented Nov 21, 2025 •

edited

Loading

codecov bot commented Nov 21, 2025 •

edited

Loading

apryakhin Nov 22, 2025 •

edited

Loading

apryakhin Nov 24, 2025 •

edited

Loading

apryakhin commented Nov 25, 2025 •

edited

Loading

apryakhin commented Nov 25, 2025 •

edited

Loading

apryakhin Dec 10, 2025 •

edited

Loading

apryakhin Dec 10, 2025 •

edited

Loading

apryakhin Dec 10, 2025 •

edited

Loading