Skip to content

[RUBY-3496] Fix legacy read pool retry error #2878

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions lib/mongo/retryable/base_worker.rb
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,7 @@ def initialize(retryable)

private

# Indicate which exception classes that are generally retryable.
# Indicate which exception classes that are generally retryable.
#
# @return [ Array<Mongo:Error> ] Array of exception classes that are
# considered retryable.
Expand All @@ -58,7 +58,8 @@ def retryable_exceptions
Error::ConnectionPerished,
Error::ServerNotUsable,
Error::SocketError,
Error::SocketTimeoutError
Error::SocketTimeoutError,
Error::PoolError,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Error::PoolError,
Error::PoolClearedError,

According to the specification only the PoolClearedError should be retried, not all pool errors. So, having PoolError here seems to be too broad.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@comandeo-mongo Error::ServerNotUsable and Error::ConnectionPerished are listed here as retryable but not in the specification, could you clarify if they are retryable? Asking because we started seeing these errors after upgrading to the latest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@comandeo-mongo I've addressed your comment, please look at it when you have the time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @joelim41 , I am finally looking into it, sorry for the delay. Meanwhile, I would like to clarify - did you start seeing Error::ServerNotUsable and Error::ConnectionPerished only after the upgrade? It might be an indication of some other issues.

Copy link
Contributor

@comandeo-mongo comandeo-mongo Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joelim41 one more "side" question - is there something preventing you from using modern retries? Legacy retry mechanism is deprecates since 2.9.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@comandeo-mongo this happened after we upgraded from v2.16.0 to v2.20.0, when we experienced some scaling issues in one of our mongoS clusters. I see Error::ServerNotUsable was only introduced in v2.19.0 and Error::ConnectionPerished was made retryable also in v2.19.0 here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just started exploring what it takes to move to modern retries

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed that you are upgrading from 2.16. So yes, you weren't able to see those errors. I think it is OK if you see them when a mongos has issues.

As to this PR - I think we should separate the list of the retryable errors for modern and legacy retries. PoolError has a special treatment in modern retries, so I would like to avoid just adding it to retryable_exceptions. I'll make a commit to your PR with the proposed solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thank you!

].freeze
end

Expand Down
57 changes: 34 additions & 23 deletions spec/integration/retryable_reads_errors_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -73,31 +73,42 @@
client.subscribe(Mongo::Monitoring::CONNECTION_POOL, subscriber)
end

it "retries on PoolClearedError" do
# After the first find fails, the pool is paused and retry is triggered.
# Now, a race is started between the second find acquiring a connection,
# and the first retrying the read. Now, retry reads cause the cluster to
# be rescanned and the pool to be unpaused, allowing the second checkout
# to succeed (when it should fail). Therefore we want the second find's
# check out to win the race. This gives the check out a little head start.
allow_any_instance_of(Mongo::Server::ConnectionPool).to receive(:ready).and_wrap_original do |m, *args, &block|
::Utils.wait_for_condition(5) do
# check_out_results should contain:
# - find1 connection check out successful
# - pool cleared
# - find2 connection check out failed
# We wait here for the third event to happen before we ready the pool.
cmap_events.select do |e|
event_types.include?(e.class)
end.length >= 3
shared_examples_for 'retries on PoolClearedError' do
it "retries on PoolClearedError" do
# After the first find fails, the pool is paused and retry is triggered.
# Now, a race is started between the second find acquiring a connection,
# and the first retrying the read. Now, retry reads cause the cluster to
# be rescanned and the pool to be unpaused, allowing the second checkout
# to succeed (when it should fail). Therefore we want the second find's
# check out to win the race. This gives the check out a little head start.
allow_any_instance_of(Mongo::Server::ConnectionPool).to receive(:ready).and_wrap_original do |m, *args, &block|
::Utils.wait_for_condition(5) do
# check_out_results should contain:
# - find1 connection check out successful
# - pool cleared
# - find2 connection check out failed
# We wait here for the third event to happen before we ready the pool.
cmap_events.select do |e|
event_types.include?(e.class)
end.length >= 3
end
m.call(*args, &block)
end
m.call(*args, &block)
threads.map(&:join)
expect(check_out_results[0]).to be_a(Mongo::Monitoring::Event::Cmap::ConnectionCheckedOut)
expect(check_out_results[1]).to be_a(Mongo::Monitoring::Event::Cmap::PoolCleared)
expect(check_out_results[2]).to be_a(Mongo::Monitoring::Event::Cmap::ConnectionCheckOutFailed)
expect(find_events.length).to eq(3)
end
threads.map(&:join)
expect(check_out_results[0]).to be_a(Mongo::Monitoring::Event::Cmap::ConnectionCheckedOut)
expect(check_out_results[1]).to be_a(Mongo::Monitoring::Event::Cmap::PoolCleared)
expect(check_out_results[2]).to be_a(Mongo::Monitoring::Event::Cmap::ConnectionCheckOutFailed)
expect(find_events.length).to eq(3)
end

it_behaves_like 'retries on PoolClearedError'

context 'legacy read retries' do

let(:client) { authorized_client.with(options.merge(retry_reads: false, max_read_retries: 1)) }

it_behaves_like 'retries on PoolClearedError'
end

after do
Expand Down
Loading