-
Notifications
You must be signed in to change notification settings - Fork 409
prov/verbs: Reload the list of interfaces on each call to fi_getinfo() #10886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Re-scanning the interfaces is expensive. Utility providers will force scanning multiple times. |
@shefty would it be possible to give the application control over when the interfaces should be re-scanned? |
4a490f0
to
209a93e
Compare
@sydidelot - Giving control to the app is the best idea I could come up with to handle this. Going that route, I'd define a flag to pass into fi_getinfo(). An alternative I could think of that wouldn't involve application changes is to make refresh timer based. I'm not thrilled with that option. An environment variable to set the fi_getinfo() flag seems better. I think that could allow the util providers to unset the flag after it first passed to the core provider. |
I don't follow this point. Can you please clarify it please? |
I was suggesting defining a flag that can be passed into fi_getinfo(). When set, a new scan is performed. The utility providers can always clear that flag, assuming that the core provider is called directly first, before any utility providers process fi_getinfo. (This is the case). In addition, an environment variable could be set which the libfabric core can check. If set, the libfabric core would add this flag when calling the core provider's getinfo handler. |
Thanks for the clarification @shefty! I didn't know that the core provider was called first. |
209a93e
to
b81f941
Compare
@shefty May I ask you to have a look at the new code when you have time? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not review the last patch. I'll need more time for that. See comments so far.
if (rxd_env.always_rescan) | ||
flags |= FI_RESCAN; | ||
else | ||
flags &= ~FI_RESCAN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user doesn't set any environment variable, this has the impact of removing the FI_RESCAN flag from their request. That's probably not what we want. I would leave the flags unchanged in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the code and always_rescan
is now an integer that can have the following values:
-1
: The user didn't set the environment variable0
: rescan explicitly disabled by the environment variable1
: rescan explicitly enabled by the environment variable
Note that the value is now retrieved with fi_param_get_bool()
but the underlying value is effectively an integer (and not a boolean).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe convert to an enum? It should be okay to use get_bool, as that will leave the value alone if the variable isn't defined. I'd rename to just 'rescan' if false means never.
It seems that this would be a common flow (unset, false, true), but I don't see anything similar in tree based on a quick search.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed the parameter to just 'rescan' as you suggested but I kept the boolean. Converting it to an enum will require some extra work and I think it is beyond the scope of this PR.
prov/verbs/src/verbs_info.c
Outdated
|
||
initialized = true; | ||
vrb_devs_free(verbs_devs); | ||
fi_freeinfo((void *)*all_infos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cast shouldn't be needed. Please update the function signature to remove the const.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to remove the "const" in this function, we will also have to change the definition of util_prov::info
and remove the "const" from there. Here is how the structure is defined right now:
struct util_prov {
const struct fi_provider *prov;
const struct fi_info *info; <--- here.
ofi_alter_info_t alter_defaults;
const int flags;
};
Is it what you want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... when that was defined, I think the assumption was that the info list was a static list of structures. But verbs needs to set this dynamically, including freeing it. Maybe that const should be removed. Can you tell if removing it causes an issue in a bunch of other places?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing the "const" only causes an issue to the verbs provider. I added a commit to remove it as a consequence.
The reason to add the FI_RESCAN flag is to only rescan if that flag is set. This maintains the current behavior and performance. Setting the environment variable can then be used to force rescan on every call into the verbs provider. |
b81f941
to
83f2396
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shefty Thanks for the review! I have addressed all your review comments.
if (rxd_env.always_rescan) | ||
flags |= FI_RESCAN; | ||
else | ||
flags &= ~FI_RESCAN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the code and always_rescan
is now an integer that can have the following values:
-1
: The user didn't set the environment variable0
: rescan explicitly disabled by the environment variable1
: rescan explicitly enabled by the environment variable
Note that the value is now retrieved with fi_param_get_bool()
but the underlying value is effectively an integer (and not a boolean).
@j-xiong May I ask you to check why the Intel integration tests are failing? Are the failures related to the PR? |
@sydidelot Please rebase to pick up github action changes. |
@j-xiong Rebased but it's still failing in Intel CI pipeline. Can you please let me know what tests are failing? I assume they are related to my changes. |
There were a lot of failures, probably issue with CI itself. Restarted the tests. |
@j-xiong My PR is still failing and I have no visibility into the tests results. |
@sydidelot We are having infrastructure issue with CI, still under investigation. |
Fabtests over rxm/verbs failed with the following error:
|
With the next changes, vrb_init_mutex will not only be used during initialization, but it will also be used for protecting updates to vrb_util_prov.info Rename it "vrb_info_mutex" so the new name better reflects the new role of this mutex. Signed-off-by: Sylvain Didelot <[email protected]>
With the next changes to allow dynamic reloads of the verbs interfaces, the fabric can no longer hold a reference to vrb_util_prov.info, otherwise it may cause a use-after-free if the interface is removed from that list. Signed-off-by: Sylvain Didelot <[email protected]>
Rework the provider to make vrb_get_verbs_info() static as the function is only called from verbs_domain.c Signed-off-by: Sylvain Didelot <[email protected]>
Because we will dynamically update the list vrb_util_prov.info at run time, we need to serialize all accesses to that list with vrb_info_mutex. Signed-off-by: Sylvain Didelot <[email protected]>
Careful, API change! This commit adds the new flag FI_RESCAN to indicate that the provider should rescan available network interfaces. This operation may be computationally expensive but required for paiplications that need to re-scan network interfaces at run time. Signed-off-by: Sylvain Didelot <[email protected]>
Signed-off-by: Sylvain Didelot <[email protected]>
Signed-off-by: Sylvain Didelot <[email protected]>
Double pointer is not needed and can be removed to simplify the code. Signed-off-by: Sylvain Didelot <[email protected]>
The verbs provider requires to set this list dynamically, so make it non const. Signed-off-by: Sylvain Didelot <[email protected]>
The verbs provider loads the list of verbs devices available on the node only once, on the first call to fi_getinfo(). If a network device is added (hot-plugged) after this initial call to fi_getinfo(), it won't be visible in libfabric. A subsidiary problem is that fi_getinfo() only returns network adapters with active links. If the link is initially inactive and becomes active after the first call to fi_getinfo(), this interface will not be visible in libfabric. This is a particularly a problem for long-running services where restarting a process to discover newly added network devices is not an option. With this patch, the list of verbs interfaces is reloaded on each call to fi_getinfo(). Fixes ofiwg#10881 Signed-off-by: Sylvain Didelot <[email protected]>
84ca6b6
to
4fe0c97
Compare
@sydidelot Adding tests for the new flag is a good idea. The question though, is how to create a scenario that the flag can show a difference. |
@shefty Do you have any more concerns with this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only change request is to clean up the text describing the env var. Thanks!
fi_param_define(&rxd_prov, "rescan", FI_PARAM_BOOL, | ||
"Enables dynamic network interface detection by rescanning " | ||
"system interfaces on each fi_getinfo() invocation. " | ||
"(default: -1)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd modify the text to something like: "Force or disable rescanning for network interface changes... (default: unset)". Replace -1 with unset, since -1 doesn't make much sense to the user reading this as a boolean value. Or simply drop the default text completely. I would make the text clear that setting this to true will force rescanning, while setting it to false will disable rescanning.
The verbs provider loads the list of verbs devices available on the node
only once, on the first call to fi_getinfo().
If a network device is added (hot-plugged) after this initial call to
fi_getinfo(), it won't be visible in libfabric.
A subsidiary problem is that fi_getinfo() only returns network
adapters with active links.
If the link is initially inactive and becomes active after the first call to
fi_getinfo(), this interface will not be visible in libfabric.
This is a particularly a problem for long-running services where restarting
a process to discover newly added network devices is not an option.
With this patch, the list of verbs interfaces is reloaded on each call
to fi_getinfo().
Fixes #10881
Signed-off-by: Sylvain Didelot [email protected]