-
Notifications
You must be signed in to change notification settings - Fork 15
Network Vtable #366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Network Vtable #366
Conversation
❌ 1 Tests Failed:
View the top 1 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
sbahirnv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the interface!
I will try to come up with mini-app which would use this interface to bootstrap Realm.
src/realm/runtime.h
Outdated
| bool (*get)(const void *key, size_t key_size, void *value, | ||
| size_t value_size) = nullptr; | ||
| //////////////////////// | ||
| // OPTIONAL FUNCTIONS // |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking if we would need init() and shutdown() allowing client to do realm specific setup if any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need that because I think anything you'd need to do would be done in the join and leave methods. If we do conclude we need additional methods we can add them later, but let's not speculate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this vtable is what I had in mind but it's proposed differently by @lightsighter . The join/leave being the callbacks to react on membership changes - this is clear. That would be a watch(key, callback) interface in my vtable which is more generic and this is more specialized for bootstrap, that's fine.
Where is the method that the bootstrapping process suppose to call and block on until the membership service actually completed the admission protocol? I don't understand how this would be done without calling something like explicit vtable->init(args, local_ucp_worker_address)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the method that the bootstrapping process suppose to call and block on until the membership service actually completed the admission protocol?
That shouldn't be necessary. The admission protocol should be implemented only using puts and gets, nothing else should be required (except maybe an atomic compare and swap, although I think this is optional). Puts and gets will do their thing. It is up to the admission protocol to be robust to processes joining and leaving. You can continuously poll on keys and values as necessary. The bootstrapping protocol does not need to be fast, it just needs to be correct and it should require clients to provide the bare minimum amount of functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how are you going to implement a group join here? where you have a group of X processes that are elastically joining your existing job and for example the requirement will be you need to wait for all X processes to be admitted as a group?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how are you going to implement a group join here? where you have a group of X processes that are elastically joining your existing job and for example the requirement will be you need to wait for all X processes to be admitted as a group?
See the updated interface.
|
Another thing that I'm waffling on a bit is whether to pull the optional methods for |
src/realm/runtime.h
Outdated
| // live longer than the function call. The function should return true if | ||
| // the put succeeds and false if it doesn't. If the call fails then it is | ||
| // likely that the network initialization might not succeed. | ||
| bool (*put)(const void *key, size_t key_size, const void *value, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good definition. I'd like to discuss blocking vs non-blocking and if non-blocking then returning a way to know when this put completes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think any of these callbacks should be non-blocking. As I mentioned elsewhere, the callback interface should require the bare minimum number of calls necessary for the bootstrap protocol to be implemented correctly. All that should be required to do that are blocking put and get. The callbacks should block until the put/get finishes and then return the result. Realm must ensure forward progress as mentioned in the documentation, so we should expect that these calls can and will block for some amount of time.
If I understand correctly: one process links against librealm.so, but different parts of that process can register for join/leave notifications. Only one part actually calls init()/start() though. If so, I like the idea of separating join/leave into its own interface because different parts need different things. Like, a monitoring system might only care about tracking who joined and left, while the main app is using Realm to actually do work. Both can do their thing independently. It makes the registration a bit messier since multiple parts are registering, but with added benfits. |
Yes, many things can be using Realm, although there is just one that needs to be responsible for starting Realm. Multiple clients of the same Realm (e.g. both Flash Cache and Dynamo) might want to subscribe to notifications about processes joining and leaving the Realm independently. |
sbahirnv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interface looks great to me!
Before merging though, we need the network_init() implementation or it won't link. Want me to implement it based on your interface design(I can fork from this branch and continue), or are you planning to do that?
I'm going to test it to bootstrap UCX backend.
|
One other note on this is that I removed the join/leave callbacks. I don't actually think they are necessary once you look at the machine subscription interface. The machine subscription interface does need a bit of work, but I think we can handle that separately from this API. |
First of all we probably don't need separate join and leave but just a single callback with an action type. Second is that with the "watch(prefix, callback)" interface you accomplish exactly that. The registration is dynamic...anywhere inside realm at any time you should be able to subscribe to notifications. The semantics of watch and to implement efficiently is not easy though. However, I might have misunderstood the reasoning behind the need to remove that at all or rather move to another interface. I think there are number of different ways to do the vtable in general and since you started thinking about it I see that it diverges somewhat from my original thinking (and that's okay). I still cannot understand by how much. Although, we maybe need to sync and discuss again the semantics of it to make sure we are on the same page here. |
What would be helpfull is a little pseudo-code snippet that would show-case how this vtable is going to be used. Perhaps that's not on you but since you are taking a stab at this interface anyways . |
The reasoning is that we actually already have an interface for this in the
We can discuss this in the meeting tomorrow, but I may be willing to do the first-pass implementation of the UCX bootstrap using this interface. |
src/realm/runtime.h
Outdated
| // size is less than or equal to the value_size passed in by Realm, then | ||
| // the value buffer should be populated and the value_size updated with | ||
| // the actual size of the value found. If the value is not found or is | ||
| // too large for the specified buffer then value_size should be set to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the case when the actual value of the entry inside the kv-store is larger that provided buffer. How the caller is going to know about the right size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this in the meeting, but just answering for completeness in case anyone else is observing: the caller in this case for both puts and gets will be Realm, so Realm will know what the types of the values are that it has associated with particular keys, so therefore it will also know the expected size. For example, the blobs of UCX data should all be the same size, so each process will know how big the blob of UCX data it put was and therefore will know how big each of the expected UCX blobs should be when it goes to retrieve them. Even if in the future we don't know the expected size of some value, we can use this interface to find out. We can do one call to get the key with a zero-sized buffer, the call will fail, and report back the expected size that is larger than zero. At that point, we can allocate a buffer of that size and do the get call again to extract the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, returning the required size in case the provided buffer is larger is what I would do.
Just FYI:
the blobs of UCX data should all be the same size
This statement is not accurate based on what @SeyedMir told me.
|
PMIX: |
| // impacting forward progress. | ||
| struct NetworkVtable { | ||
| //////////////////////// | ||
| // REQUIRED FUNCTIONS // |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We haven't yet refactored the comments in this public header but in general we use doxygen format. Just making a note here that probably should reformatted going forwards.
…le in the runtime singleton
| bool RuntimeImpl::network_init(int *argc, char ***argv, | ||
| const Runtime::NetworkVtable &vtable) | ||
| { | ||
| #if defined(REALM_USE_UCX) || defined(REALM_USE_MPI) || defined(REALM_USE_GASNET1) || \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my understanding is that we are testing only with the UCX. If UCX is enabled then we should expect the vtable and otherwise fail. All other case should fallback to legacy realm bootstrap.
Does our understanding here diverge on how this should work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, UCX can still bootstrap itself using other mechanisms that it knows about if no vtable is provided.
| bool success = false; | ||
| // Try this up to 100 times, if we don't succeed then | ||
| // we'll time out and fail to join | ||
| for(unsigned idx = 0; idx < 100; idx++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: under heavy contention the value might be too small... also move it maybe out and "constexpr define it"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will pull it out into a constexpr, but it needs to be a finite value. I don't want it looping forever, there must be a timeout. Perhaps we can actually just time it with a wall-clock timer.
| // we'll time out and fail to join | ||
| for(unsigned idx = 0; idx < 100; idx++) { | ||
| if(runtime->network_vtable_cas(key.data(), key.size(), &offset, &offset_size, | ||
| &desired, sizeof(desired))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
desired here is always an old value after the failure - should that be recomputed otherwise this CAS will never succeed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I do recompute it later in the loop, see line 869.
| uint64_t offset = 0; | ||
| if(Network::my_node_id == 0) { | ||
| // Do the work to do the CAS to bump the number of processes | ||
| constexpr std::string_view key("realm_total_spaces"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the key suppose to be written as part of cas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, key's should be immutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API reflects the fact that the implementer of the vtable API is not allowed to mutate keys.
| Network::max_node_id += offset; | ||
| } | ||
| // TODO: what to do about processes that have already left | ||
| Network::all_peers.add_range(0, Network::max_node_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same range has to be deleted from the peers someplace elsewhere right? Is that something still a TODO for this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not handling elasticity in this PR. I'm setting up for it, but not implementing it.
| return 1; | ||
| } | ||
| // Synchronize to make sure everyone is done | ||
| if(!runtime->network_vtable_bar()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recall that availability bar and cas is that how you decide whether we are static or elastic. Or is it only the presence of cas and bar must be defined at all times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Network::max_node_id = ucc_comm->get_world_size() - 1; | ||
| Network::all_peers.add_range(0, ucc_comm->get_world_size() - 1); | ||
| Network::all_peers.remove(ucc_comm->get_rank()); | ||
| if(runtime->network_vtable_elastic()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: The changes are missing to maintain endpoints - the UCCBcast I assume will only be group-local, given we maintain a legacy code path that populates the endpoint map. I know that PR is still WIP but just to make sure we are aligned, another words that it's on missing on purpose but just hasn't been done yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I'm not implementing elasticity here. Even when we do implement elasticity though, each group of processes will still always have a UCC communicator between all the processes in the group (even processes that join in isolation will make a UCC communicator containing just themselves). We'll use this internally in the UCX module to aid in doing communication and synchronization for the processes in a group as part of them joining and leaving together.
| return static_cast<RuntimeImpl *>(impl)->network_init(argc, argv, vtable); | ||
| } | ||
|
|
||
| bool Runtime::network_init(const NetworkVtable &vtable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: void RuntimeImpl::start(void) will have to be refactored. Perhaps it would make sense to add a bunch of TODOs while you keep a working context in your brain on what needs to be done, so someone can pick it up later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said in the Realm meeting today, I'm still writing up all the things that have to happen for elasticity. That is beyond the scope of this PR.
| const void *desired, size_t desired_size, const void *vtable_data, | ||
| size_t vtable_data_size) = nullptr; | ||
| }; | ||
| bool network_init(const NetworkVtable &vtable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the "default" implementation will be provided? For example whatever @sbahirnv is going to add for PMIX? Another words, when an appplication program is lazy, doesn't know what it wants and doesn't want to understand or deal with the vtable except to just pick something up by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They can pass in an empty vtable and it will fall-back to the existing bootstrap mechanisms that we already have for a given network module. If any of them succeeds then everything is good. If they all fail, then we die the same way we do today.
| Network::max_node_id = ucc_comm->get_world_size() - 1; | ||
| Network::all_peers.add_range(0, ucc_comm->get_world_size() - 1); | ||
| Network::all_peers.remove(ucc_comm->get_rank()); | ||
| if(runtime->network_vtable_elastic()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to cross-examine this, make sure it works with Kubernetes control plane.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's outside the scope of Realm. We don't care how people implement the vtable. If you look at how network_vtable_elastic is implemented, all it does is check for the presences of the cas function. If we have such a function, then the client has given us all we need to be elastic.
| * if the key exists or not. If the callback fails then the network | ||
| * initialization might not succeed. | ||
| */ | ||
| bool (*get)(const void *key, size_t key_size, void *value, size_t *value_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PMIx_Get needs the target rank where to get the data. How do you plan to align it with your get API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's up to the client to handle on their implementation of the vtable interface. If they're going to use PMIx, then they need to get the parameters to pass to it by themselves without help from Realm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be explicitly clear about how this should work: the client will call PMIx_Init in the main thread before doing anything. They will get back a process ID that PMIx assigns. They then fill in the vtable with their callbacks. They store the process ID that PMIx gave them (along with any other data they want) in the vtable_data. Then they call Realm's Runtime::network_init and pass in the vtable. Realm makes a copy of the vtable (including the vtable_data buffer) and then goes about doing its initialization. Realm will do whatever callbacks it needs to through the vtable interface. Each callback gets passed the vtable_data buffer. The callbacks know how to fish out the process ID from the vtable_data buffer that was packed by the client in the first place. They then use that information to do calls into PMIx. Realm doesn't need to know about any of this and doesn't need to tell clients how to interpret their data.
Adds a MPI-based NetworkVtable example that demonstrates how to implement the bootstrap interface. The example builds and runs with `mpirun -n 2` successfully.
Proposed interface for a network vtable.
One open question is to define the thread-safety of the callbacks. Will Realm provide the thread safety or will the client need to do that?
Another open question: do we think we'll ever need variable-sized key value data? Currently the interface assumes we'll always know the resulting value size in advance before doing a
getcall.