Skip to content

CNDB-17110: Fix possible deadlock if flush fails#2275

Merged
pkolaczk merged 1 commit intomainfrom
c17110-flush-deadlock
Mar 24, 2026
Merged

CNDB-17110: Fix possible deadlock if flush fails#2275
pkolaczk merged 1 commit intomainfrom
c17110-flush-deadlock

Conversation

@pkolaczk
Copy link
Copy Markdown

JVMStabilityInspector.inspectThrowable(t) can throw an exception.
In that case the control flow never reaches
postFlush.latch.countDown() and threads waiting for flush to finish
never unlock and never get a chance to propagate the exception
up the call chain. This ends up in a bad lockup with no error logged
anywhere, system appearing to be performing a pending flush,
but no flush actually running. This scenario has been observed in tests
when the system hit the limit of open files.

This commit moves latch.countDown() to a finally block
so it can never be skipped.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 13, 2026

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@pkolaczk
Copy link
Copy Markdown
Author

A short explanation why there are no unit tests for this:
This is hard to reproduce in an isolated way; but it was triggered by #2269 tests before they were fixed to cleanup tables properly after each test. Triggering the deadlock required bringing the system to an unstable state by running out of file descriptors. I confirmed that moving the latch countdown to a finally block resolved the deadlock and logged the flush failure reason properly (which was: "Too many open files"). Considering the triviality of the fix, trying to reproduce the same thing in a unit test is a bad investment of our time.

@pkolaczk pkolaczk force-pushed the c17110-flush-deadlock branch from 888a0d1 to 5dddf6e Compare March 13, 2026 09:37
@pkolaczk pkolaczk changed the title CNDB-16394: Fix possible deadlock if flush fails CNDB-17110: Fix possible deadlock if flush fails Mar 13, 2026
@pkolaczk pkolaczk requested a review from jasonstack March 13, 2026 11:39
@pkolaczk
Copy link
Copy Markdown
Author

The tests that failed in CI pass locally.
The failures in CI unrelated.

Comment thread src/java/org/apache/cassandra/db/ColumnFamilyStore.java
Copy link
Copy Markdown

@jasonstack jasonstack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pkolaczk pkolaczk force-pushed the c17110-flush-deadlock branch from 3c42509 to 6635395 Compare March 17, 2026 10:34
@pkolaczk
Copy link
Copy Markdown
Author

test this please

`JVMStabilityInspector.inspectThrowable(t)` can throw an exception.
In that case the control flow never reaches
`postFlush.latch.countDown()` and threads waiting for flush to finish
never unlock and never get a chance to propagate the exception
up the call chain. This ends up in a bad lockup with no error logged
anywhere, system appearing to be performing a pending flush,
but no flush actually running. This scenario has been observed in tests
when the system hit the limit of open files.

This commit moves `latch.countDown()` to a finally block
so it can never be skipped.
@pkolaczk pkolaczk force-pushed the c17110-flush-deadlock branch from 6635395 to 73bc816 Compare March 18, 2026 16:05
@sonarqubecloud
Copy link
Copy Markdown

@pkolaczk pkolaczk merged commit b5bd312 into main Mar 24, 2026
2 of 4 checks passed
@pkolaczk pkolaczk deleted the c17110-flush-deadlock branch March 24, 2026 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants