Skip to content

Refresh aspired servables/versions following config update #1518

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: r1.15
Choose a base branch
from

Conversation

njhill
Copy link

@njhill njhill commented Dec 18, 2019

Currently when the configured model list is updated via a call to handleReloadConfigRequest, the request thread blocks until any newly added models become available.

Their availability however depends on the filesystem polling thread rescanning the filesystem at some periodic interval, meaning that there's an arbitrary delay before the requested changes actually take effect and the RPC returns.

This problem may not be very noticeable with the default polling interval of 1 second, but seems undesirable for longer intervals and in particular makes API-based dynamic reconfiguration incompatible with the --file_system_poll_wait_seconds=0 setting (in this case all handleReloadConfigRequest calls time-out and do not take effect).

Fixes #1519

Currently when the configured model list is updated via a call to
handleReloadConfigRequest, the request thread blocks until any newly
added models become available.

Their availability however depends on the filesystem polling thread
rescanning the filesystem at some periodic interval, meaning that
there's an arbitrary delay before the requested changes actually take
effect and the RPC returns.

This problem may not be very noticeable with the default polling
interval of 1 second, but seems undesirable for longer intervals and
in particular makes API-based dynamic reconfiguration incompatible with
the --file_system_poll_wait_seconds=0 setting (in this case all
handleReloadConfigRequest calls time-out and do not take effect).
@njhill
Copy link
Author

njhill commented Dec 18, 2019

I have opened this against 1.15 since that's the version we are using, but can rebase on a different branch if needed.

Also apologies in advance for the code, I am not very familiar with C++.

@njhill njhill requested review from nrobeR and christisg and removed request for nrobeR January 15, 2020 23:56
Copy link
Member

@christisg christisg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add unit test coverage?

@njhill
Copy link
Author

njhill commented Feb 12, 2020

Thanks @christisg, I've pushed a commit to address your logging comment. I will aim to add unit test coverage when I get a chance... it will take me a bit longer due to unfamiliarity with C++ and the codebase/test framework.

@astleychen
Copy link

@njhill thanks for reporting this bug. It's painful and took me more than 3 hrs to figure out REAL behavior when setting --file_system_poll_wait_seconds=0 to mitigate GCS bucket class A/B operation request calls in polling. Hope we can see your fixes soon. :)

@netfs
Copy link
Collaborator

netfs commented Nov 18, 2020

@njhill do you want wrap this PR by adding unit-test as requested by the reviewer?

thanks!

@njhill
Copy link
Author

njhill commented Nov 18, 2020

@netfs apologies for letting this lag. I am not sure when I will realistically have a chance to do this since I'm especially busy right now and not very familiar with C++ or the testing setup so it would take me a decent chunk of time to do.

Any help with that part would be appreciated!

@netfs
Copy link
Collaborator

netfs commented Nov 18, 2020

no worries @njhill.

@astleychen do you want to help here and add tests?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants