Skip to content

Conversation

@lightsighter
Copy link
Contributor

@lightsighter lightsighter commented Nov 21, 2025

Proposed interface for a network vtable.

One open question is to define the thread-safety of the callbacks. Will Realm provide the thread safety or will the client need to do that?

Another open question: do we think we'll ever need variable-sized key value data? Currently the interface assumes we'll always know the resulting value size in advance before doing a get call.

@codecov
Copy link

codecov bot commented Nov 21, 2025

❌ 1 Tests Failed:

Tests completed Failed Passed Skipped
1440 1 1439 0
View the top 1 failed test(s) by shortest run time
example_vtable::example_vtable
Stack Traces | 0.543s run time
Failed

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

Copy link
Collaborator

@sbahirnv sbahirnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the interface!

I will try to come up with mini-app which would use this interface to bootstrap Realm.

bool (*get)(const void *key, size_t key_size, void *value,
size_t value_size) = nullptr;
////////////////////////
// OPTIONAL FUNCTIONS //
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking if we would need init() and shutdown() allowing client to do realm specific setup if any.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need that because I think anything you'd need to do would be done in the join and leave methods. If we do conclude we need additional methods we can add them later, but let's not speculate.

Copy link
Contributor

@apryakhin apryakhin Nov 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this vtable is what I had in mind but it's proposed differently by @lightsighter . The join/leave being the callbacks to react on membership changes - this is clear. That would be a watch(key, callback) interface in my vtable which is more generic and this is more specialized for bootstrap, that's fine.

Where is the method that the bootstrapping process suppose to call and block on until the membership service actually completed the admission protocol? I don't understand how this would be done without calling something like explicit vtable->init(args, local_ucp_worker_address)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the method that the bootstrapping process suppose to call and block on until the membership service actually completed the admission protocol?

That shouldn't be necessary. The admission protocol should be implemented only using puts and gets, nothing else should be required (except maybe an atomic compare and swap, although I think this is optional). Puts and gets will do their thing. It is up to the admission protocol to be robust to processes joining and leaving. You can continuously poll on keys and values as necessary. The bootstrapping protocol does not need to be fast, it just needs to be correct and it should require clients to provide the bare minimum amount of functionality.

Copy link
Contributor

@apryakhin apryakhin Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how are you going to implement a group join here? where you have a group of X processes that are elastically joining your existing job and for example the requirement will be you need to wait for all X processes to be admitted as a group?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how are you going to implement a group join here? where you have a group of X processes that are elastically joining your existing job and for example the requirement will be you need to wait for all X processes to be admitted as a group?

See the updated interface.

@lightsighter
Copy link
Contributor Author

Another thing that I'm waffling on a bit is whether to pull the optional methods for join and leave into a separate interface that can be registered dynamically while execution is taking place and will support multiple "subscribers" for notifications of other processes joining and leaving. I think this is probably we're going to want so that different Realm clients can all subscribe without having to coordinate with each other. The bootstrap callbacks don't have this requirement obviously since there's only one client that is ultimately ever going to start Realm in a process.

// live longer than the function call. The function should return true if
// the put succeeds and false if it doesn't. If the call fails then it is
// likely that the network initialization might not succeed.
bool (*put)(const void *key, size_t key_size, const void *value,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good definition. I'd like to discuss blocking vs non-blocking and if non-blocking then returning a way to know when this put completes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think any of these callbacks should be non-blocking. As I mentioned elsewhere, the callback interface should require the bare minimum number of calls necessary for the bootstrap protocol to be implemented correctly. All that should be required to do that are blocking put and get. The callbacks should block until the put/get finishes and then return the result. Realm must ensure forward progress as mentioned in the documentation, so we should expect that these calls can and will block for some amount of time.

@sbahirnv
Copy link
Collaborator

Another thing that I'm waffling on a bit is whether to pull the optional methods for join and leave into a separate interface that can be registered dynamically while execution is taking place and will support multiple "subscribers" for notifications of other processes joining and leaving. I think this is probably we're going to want so that different Realm clients can all subscribe without having to coordinate with each other. The bootstrap callbacks don't have this requirement obviously since there's only one client that is ultimately ever going to start Realm in a process.

If I understand correctly: one process links against librealm.so, but different parts of that process can register for join/leave notifications. Only one part actually calls init()/start() though. If so, I like the idea of separating join/leave into its own interface because different parts need different things. Like, a monitoring system might only care about tracking who joined and left, while the main app is using Realm to actually do work. Both can do their thing independently. It makes the registration a bit messier since multiple parts are registering, but with added benfits.

@lightsighter
Copy link
Contributor Author

Another thing that I'm waffling on a bit is whether to pull the optional methods for join and leave into a separate interface that can be registered dynamically while execution is taking place and will support multiple "subscribers" for notifications of other processes joining and leaving. I think this is probably we're going to want so that different Realm clients can all subscribe without having to coordinate with each other. The bootstrap callbacks don't have this requirement obviously since there's only one client that is ultimately ever going to start Realm in a process.

If I understand correctly: one process links against librealm.so, but different parts of that process can register for join/leave notifications. Only one part actually calls init()/start() though. If so, I like the idea of separating join/leave into its own interface because different parts need different things. Like, a monitoring system might only care about tracking who joined and left, while the main app is using Realm to actually do work. Both can do their thing independently. It makes the registration a bit messier since multiple parts are registering, but with added benfits.

Yes, many things can be using Realm, although there is just one that needs to be responsible for starting Realm. Multiple clients of the same Realm (e.g. both Flash Cache and Dynamo) might want to subscribe to notifications about processes joining and leaving the Realm independently.

Copy link
Collaborator

@sbahirnv sbahirnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interface looks great to me!

Before merging though, we need the network_init() implementation or it won't link. Want me to implement it based on your interface design(I can fork from this branch and continue), or are you planning to do that?

I'm going to test it to bootstrap UCX backend.

@lightsighter
Copy link
Contributor Author

One other note on this is that I removed the join/leave callbacks. I don't actually think they are necessary once you look at the machine subscription interface. The machine subscription interface does need a bit of work, but I think we can handle that separately from this API.

@apryakhin
Copy link
Contributor

apryakhin commented Nov 25, 2025

Another thing that I'm waffling on a bit is whether to pull the optional methods for join and leave into a separate interface that can be registered dynamically while execution is taking place and will support multiple "subscribers" for notifications of other processes joining and leaving. I think this is probably we're going to want so that different Realm clients can all subscribe without having to coordinate with each other. The bootstrap callbacks don't have this requirement obviously since there's only one client that is ultimately ever going to start Realm in a process.

First of all we probably don't need separate join and leave but just a single callback with an action type. Second is that with the "watch(prefix, callback)" interface you accomplish exactly that. The registration is dynamic...anywhere inside realm at any time you should be able to subscribe to notifications. The semantics of watch and to implement efficiently is not easy though.

However, I might have misunderstood the reasoning behind the need to remove that at all or rather move to another interface.

I think there are number of different ways to do the vtable in general and since you started thinking about it I see that it diverges somewhat from my original thinking (and that's okay). I still cannot understand by how much. Although, we maybe need to sync and discuss again the semantics of it to make sure we are on the same page here.

@apryakhin
Copy link
Contributor

apryakhin commented Nov 25, 2025

One other note on this is that I removed the join/leave callbacks. I don't actually think they are necessary once you look at the machine subscription interface. The machine subscription interface does need a bit of work, but I think we can handle that separately from this API.

What would be helpfull is a little pseudo-code snippet that would show-case how this vtable is going to be used. Perhaps that's not on you but since you are taking a stab at this interface anyways .

@lightsighter
Copy link
Contributor Author

However, I might have misunderstood the reasoning behind the need to remove that at all or rather move to another interface.

The reasoning is that we actually already have an interface for this in the Machine class. It looks like it is not fully implemented, and it definitely needs some work, but I think I prefer that than making it part of the bootstrapping interface.

What would be helpfull is a little pseudo-code snippet that would show-case how this vtable is going to be used.

We can discuss this in the meeting tomorrow, but I may be willing to do the first-pass implementation of the UCX bootstrap using this interface.

// size is less than or equal to the value_size passed in by Realm, then
// the value buffer should be populated and the value_size updated with
// the actual size of the value found. If the value is not found or is
// too large for the specified buffer then value_size should be set to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the case when the actual value of the entry inside the kv-store is larger that provided buffer. How the caller is going to know about the right size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this in the meeting, but just answering for completeness in case anyone else is observing: the caller in this case for both puts and gets will be Realm, so Realm will know what the types of the values are that it has associated with particular keys, so therefore it will also know the expected size. For example, the blobs of UCX data should all be the same size, so each process will know how big the blob of UCX data it put was and therefore will know how big each of the expected UCX blobs should be when it goes to retrieve them. Even if in the future we don't know the expected size of some value, we can use this interface to find out. We can do one call to get the key with a zero-sized buffer, the call will fail, and report back the expected size that is larger than zero. At that point, we can allocate a buffer of that size and do the get call again to extract the data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, returning the required size in case the provided buffer is larger is what I would do.

Just FYI:

the blobs of UCX data should all be the same size

This statement is not accurate based on what @SeyedMir told me.

@apryakhin
Copy link
Contributor

PMIX:

// impacting forward progress.
struct NetworkVtable {
////////////////////////
// REQUIRED FUNCTIONS //
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haven't yet refactored the comments in this public header but in general we use doxygen format. Just making a note here that probably should reformatted going forwards.

bool RuntimeImpl::network_init(int *argc, char ***argv,
const Runtime::NetworkVtable &vtable)
{
#if defined(REALM_USE_UCX) || defined(REALM_USE_MPI) || defined(REALM_USE_GASNET1) || \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my understanding is that we are testing only with the UCX. If UCX is enabled then we should expect the vtable and otherwise fail. All other case should fallback to legacy realm bootstrap.

Does our understanding here diverge on how this should work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, UCX can still bootstrap itself using other mechanisms that it knows about if no vtable is provided.

bool success = false;
// Try this up to 100 times, if we don't succeed then
// we'll time out and fail to join
for(unsigned idx = 0; idx < 100; idx++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: under heavy contention the value might be too small... also move it maybe out and "constexpr define it"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will pull it out into a constexpr, but it needs to be a finite value. I don't want it looping forever, there must be a timeout. Perhaps we can actually just time it with a wall-clock timer.

// we'll time out and fail to join
for(unsigned idx = 0; idx < 100; idx++) {
if(runtime->network_vtable_cas(key.data(), key.size(), &offset, &offset_size,
&desired, sizeof(desired))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

desired here is always an old value after the failure - should that be recomputed otherwise this CAS will never succeed ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I do recompute it later in the loop, see line 869.

uint64_t offset = 0;
if(Network::my_node_id == 0) {
// Do the work to do the CAS to bump the number of processes
constexpr std::string_view key("realm_total_spaces");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the key suppose to be written as part of cas?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, key's should be immutable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API reflects the fact that the implementer of the vtable API is not allowed to mutate keys.

Network::max_node_id += offset;
}
// TODO: what to do about processes that have already left
Network::all_peers.add_range(0, Network::max_node_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same range has to be deleted from the peers someplace elsewhere right? Is that something still a TODO for this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not handling elasticity in this PR. I'm setting up for it, but not implementing it.

return 1;
}
// Synchronize to make sure everyone is done
if(!runtime->network_vtable_bar())
Copy link
Contributor

@apryakhin apryakhin Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall that availability bar and cas is that how you decide whether we are static or elastic. Or is it only the presence of cas and bar must be defined at all times?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Network::max_node_id = ucc_comm->get_world_size() - 1;
Network::all_peers.add_range(0, ucc_comm->get_world_size() - 1);
Network::all_peers.remove(ucc_comm->get_rank());
if(runtime->network_vtable_elastic()) {
Copy link
Contributor

@apryakhin apryakhin Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: The changes are missing to maintain endpoints - the UCCBcast I assume will only be group-local, given we maintain a legacy code path that populates the endpoint map. I know that PR is still WIP but just to make sure we are aligned, another words that it's on missing on purpose but just hasn't been done yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I'm not implementing elasticity here. Even when we do implement elasticity though, each group of processes will still always have a UCC communicator between all the processes in the group (even processes that join in isolation will make a UCC communicator containing just themselves). We'll use this internally in the UCX module to aid in doing communication and synchronization for the processes in a group as part of them joining and leaving together.

return static_cast<RuntimeImpl *>(impl)->network_init(argc, argv, vtable);
}

bool Runtime::network_init(const NetworkVtable &vtable)
Copy link
Contributor

@apryakhin apryakhin Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: void RuntimeImpl::start(void) will have to be refactored. Perhaps it would make sense to add a bunch of TODOs while you keep a working context in your brain on what needs to be done, so someone can pick it up later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said in the Realm meeting today, I'm still writing up all the things that have to happen for elasticity. That is beyond the scope of this PR.

const void *desired, size_t desired_size, const void *vtable_data,
size_t vtable_data_size) = nullptr;
};
bool network_init(const NetworkVtable &vtable);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the "default" implementation will be provided? For example whatever @sbahirnv is going to add for PMIX? Another words, when an appplication program is lazy, doesn't know what it wants and doesn't want to understand or deal with the vtable except to just pick something up by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They can pass in an empty vtable and it will fall-back to the existing bootstrap mechanisms that we already have for a given network module. If any of them succeeds then everything is good. If they all fail, then we die the same way we do today.

Network::max_node_id = ucc_comm->get_world_size() - 1;
Network::all_peers.add_range(0, ucc_comm->get_world_size() - 1);
Network::all_peers.remove(ucc_comm->get_rank());
if(runtime->network_vtable_elastic()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to cross-examine this, make sure it works with Kubernetes control plane.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's outside the scope of Realm. We don't care how people implement the vtable. If you look at how network_vtable_elastic is implemented, all it does is check for the presences of the cas function. If we have such a function, then the client has given us all we need to be elastic.

* if the key exists or not. If the callback fails then the network
* initialization might not succeed.
*/
bool (*get)(const void *key, size_t key_size, void *value, size_t *value_size,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PMIx_Get needs the target rank where to get the data. How do you plan to align it with your get API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's up to the client to handle on their implementation of the vtable interface. If they're going to use PMIx, then they need to get the parameters to pass to it by themselves without help from Realm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be explicitly clear about how this should work: the client will call PMIx_Init in the main thread before doing anything. They will get back a process ID that PMIx assigns. They then fill in the vtable with their callbacks. They store the process ID that PMIx gave them (along with any other data they want) in the vtable_data. Then they call Realm's Runtime::network_init and pass in the vtable. Realm makes a copy of the vtable (including the vtable_data buffer) and then goes about doing its initialization. Realm will do whatever callbacks it needs to through the vtable interface. Each callback gets passed the vtable_data buffer. The callbacks know how to fish out the process ID from the vtable_data buffer that was packed by the client in the first place. They then use that information to do calls into PMIx. Realm doesn't need to know about any of this and doesn't need to tell clients how to interpret their data.

Adds a MPI-based NetworkVtable example that demonstrates how to
implement the bootstrap interface.

The example builds and runs with `mpirun -n 2` successfully.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants