Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CNDB-12466: Wait for OutboundConnection to notice disconnect when simulating disconnect in ConnectionTest.testMessageDeliveryOnReconnect #1507

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jkni
Copy link

@jkni jkni commented Jan 15, 2025

What is the issue

ConnectionTest.testMessageDeliveryOnReconnect is flaky, where sometimes an outbound message is not delivered to the handler.

What does this PR fix and why was it fixed

If the OutboundConnection does not see that its channel is closed until flushing the message in Delivery#doRun, the message will be handled via onFailedSerialize and dropped. This PR changes the test to ensure the OutboundConnection is disconnected before enqueuing a message, triggering a reconnect. This fix has been tested 200 times in CI and did not fail.

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits

…ulating disconnect in ConnectionTest.testMessageDeliveryOnReconnect
@jkni jkni requested a review from a team January 15, 2025 23:36
@cassci-bot
Copy link

❌ Build ds-cassandra-pr-gate/PR-1507 rejected by Butler


2 new test failure(s) in 1 builds
See build details here


Found 2 new test failures

Test Explanation Branch history Upstream history
...id>>,ascii>>,wide=false,scenario=SSTABLE_QUERY] regression 🔴 🔵🔵🔵🔵🔵🔵🔵
...yIteratorTest.testTotalAndReadBytesManySSTables regression 🔴 🔵🔵🔵🔵🔵🔵🔵

Found 4 known test failures

@pkolaczk
Copy link

pkolaczk commented Jan 16, 2025

If the OutboundConnection does not see that its channel is closed until flushing the message in Delivery#doRun, the message will be handled via onFailedSerialize and dropped.

But is that the correct behavior?
If there exists an auto-reconnect mechanism, shouldn't it attempt to reconnect and resubmit the message in those cases as well instead of dropping the message when the timing is not right?

I just wonder if the flakiness of this test is not the result of the test problem, but the actual problem with the code under test and maybe we should improve the code.

@jkni
Copy link
Author

jkni commented Jan 16, 2025

It's a good question. My reading of the conversations around the messaging service rewrite is that this change was intentional, but let me open it up to a broader discussion.

@jkni
Copy link
Author

jkni commented Jan 16, 2025

After some further research and discussion, I do think this behavior is in line with the current design for eventually consistent messaging. The Messaging Service rewrite intentionally eliminated the concept of droppable verbs/retrying droppable verbs. Where retrying has been re-implemented for timeouts/disconnects of certain messages, it was implemented as a higher level https://issues.apache.org/jira/browse/CASSANDRA-18816. The prevailing view is that the messaging service at this level is in line with the eventually consistent design of most messages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants