Skip to content

CNDB-12466: Wait for OutboundConnection to notice disconnect when simulating disconnect in ConnectionTest.testMessageDeliveryOnReconnect #1507

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 5, 2025

Conversation

jkni
Copy link

@jkni jkni commented Jan 15, 2025

What is the issue

ConnectionTest.testMessageDeliveryOnReconnect is flaky, where sometimes an outbound message is not delivered to the handler.

What does this PR fix and why was it fixed

If the OutboundConnection does not see that its channel is closed until flushing the message in Delivery#doRun, the message will be handled via onFailedSerialize and dropped. This PR changes the test to ensure the OutboundConnection is disconnected before enqueuing a message, triggering a reconnect. This fix has been tested 200 times in CI and did not fail.

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits

…ulating disconnect in ConnectionTest.testMessageDeliveryOnReconnect
@jkni jkni requested a review from a team January 15, 2025 23:36
Copy link

@cassci-bot
Copy link

❌ Build ds-cassandra-pr-gate/PR-1507 rejected by Butler


2 new test failure(s) in 1 builds
See build details here


Found 2 new test failures

Test Explanation Branch history Upstream history
...id>>,ascii>>,wide=false,scenario=SSTABLE_QUERY] regression 🔴 🔵🔵🔵🔵🔵🔵🔵
...yIteratorTest.testTotalAndReadBytesManySSTables regression 🔴 🔵🔵🔵🔵🔵🔵🔵

Found 4 known test failures

@pkolaczk
Copy link

pkolaczk commented Jan 16, 2025

If the OutboundConnection does not see that its channel is closed until flushing the message in Delivery#doRun, the message will be handled via onFailedSerialize and dropped.

But is that the correct behavior?
If there exists an auto-reconnect mechanism, shouldn't it attempt to reconnect and resubmit the message in those cases as well instead of dropping the message when the timing is not right?

I just wonder if the flakiness of this test is not the result of the test problem, but the actual problem with the code under test and maybe we should improve the code.

@jkni
Copy link
Author

jkni commented Jan 16, 2025

It's a good question. My reading of the conversations around the messaging service rewrite is that this change was intentional, but let me open it up to a broader discussion.

@jkni
Copy link
Author

jkni commented Jan 16, 2025

After some further research and discussion, I do think this behavior is in line with the current design for eventually consistent messaging. The Messaging Service rewrite intentionally eliminated the concept of droppable verbs/retrying droppable verbs. Where retrying has been re-implemented for timeouts/disconnects of certain messages, it was implemented as a higher level https://issues.apache.org/jira/browse/CASSANDRA-18816. The prevailing view is that the messaging service at this level is in line with the eventually consistent design of most messages.

@jkni jkni added the core-team label Jan 21, 2025
@jkni jkni removed the request for review from a team January 21, 2025 16:18
@jkni jkni merged commit 46716b0 into main Feb 5, 2025
463 of 473 checks passed
@jkni jkni deleted the CNDB-12466 branch February 5, 2025 17:28
djatnieks pushed a commit that referenced this pull request Mar 11, 2025
…ulating disconnect in ConnectionTest.testMessageDeliveryOnReconnect (#1507)

ConnectionTest.testMessageDeliveryOnReconnect is flaky, where sometimes
an outbound message is not delivered to the handler.
If the OutboundConnection does not see that its channel is closed until
flushing the message in `Delivery#doRun`, the message will be handled
via `onFailedSerialize` and dropped. This PR changes the test to ensure
the OutboundConnection is disconnected before enqueuing a message,
triggering a reconnect.
djatnieks pushed a commit that referenced this pull request May 18, 2025
…ulating disconnect in ConnectionTest.testMessageDeliveryOnReconnect (#1507)

ConnectionTest.testMessageDeliveryOnReconnect is flaky, where sometimes
an outbound message is not delivered to the handler.
If the OutboundConnection does not see that its channel is closed until
flushing the message in `Delivery#doRun`, the message will be handled
via `onFailedSerialize` and dropped. This PR changes the test to ensure
the OutboundConnection is disconnected before enqueuing a message,
triggering a reconnect.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants