@e673 (Collaborator) commented Dec 8, 2025

#1751

Current behavior:

  • Failed WriteData requests generated by WriteBackCache are retried indefinitely, without checking the error kind.
  • Client fsync requests hang until the data is flushed, which may never happen.

Expected behavior:

  • Only WriteData requests that failed with retriable errors should be retried.
  • Client fsync requests should fail with an error (for example, ENOSPC is expected here).
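
The expected behavior above hinges on separating retriable errors from fatal ones. A minimal sketch of such a classification, assuming plain errno values: `EFlushAction`, `ClassifyWriteDataError`, and the particular set of codes treated as retriable are hypothetical and not taken from this PR.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical outcome of a failed WriteData request (not from the PR).
enum class EFlushAction
{
    Retry,      // transient failure: schedule the request again
    FailFsync,  // fatal failure: surface the error to a waiting fsync
};

// Assumption: transient conditions are retriable, everything else is fatal.
// The exact set of retriable codes is illustrative only.
inline EFlushAction ClassifyWriteDataError(int err)
{
    assert(err != 0);  // only meaningful for failed requests
    switch (err) {
        case EAGAIN:
        case EINTR:
        case ETIMEDOUT:
        case EBUSY:
            return EFlushAction::Retry;
        default:
            // e.g. ENOSPC or EIO: retrying is not expected to help
            return EFlushAction::FailFsync;
    }
}
```

With this split, an ENOSPC from the backend would stop the retry loop and let the pending fsync complete with an error instead of hanging.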

Discussion needed:

  • How to choose the correct error code?
  • How to deal with fatal errors?

@e673 force-pushed the users/nasonov/issue-1751-handle-errors branch from 18c468c to 76ee2bf Compare December 8, 2025 13:07
@github-actions bot (Contributor) commented Dec 8, 2025

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 76ee2bf.

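| TESTS | PASSED | ERRORS | FAILED | FAILED BUILD | SKIPPED | MUTED? |
|-------|--------|--------|--------|--------------|---------|--------|
| 9680 | 9677 | 0 | 2 | 0 | 1 | 0 |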

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 76ee2bf.

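| TESTS | PASSED | ERRORS | FAILED | FAILED BUILD | SKIPPED | MUTED? |
|-------|--------|--------|--------|--------------|---------|--------|
| 4 | 2 | 0 | 2 | 0 | 0 | 0 |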

@e673 force-pushed the users/nasonov/issue-1751-handle-errors branch from 76ee2bf to a98bb37 Compare December 8, 2025 23:55
@github-actions bot (Contributor) commented Dec 9, 2025

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit a98bb37.

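| TESTS | PASSED | ERRORS | FAILED | FAILED BUILD | SKIPPED | MUTED? |
|-------|--------|--------|--------|--------------|---------|--------|
| 9681 | 9680 | 0 | 0 | 0 | 1 | 0 |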

@e673 force-pushed the users/nasonov/issue-1751-handle-errors branch from a98bb37 to 099ce7d Compare December 12, 2025 11:49
@e673 requested review from SvartMetal and qkrorlqr December 12, 2025 11:50
@e673 marked this pull request as ready for review December 12, 2025 11:50
}

if (WriteBackCache) {
    // TODO

Collaborator Author

Q: What should I do if FlushNodeData returns a non-retriable error?

Collaborator

I would say that fsync(fd) and sync() should return errors until the problematic fd is closed.
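
One way to read this suggestion is as a "sticky" error per handle. The following is a minimal sketch under that reading: `TStickyFlushErrors` and its methods are hypothetical names, handles are modelled as plain integers, and errors as errno values. Note that POSIX `sync()` returns `void`, so the `Sync` method below models a sync-like call that can report failure (as `syncfs()` does on Linux).

```cpp
#include <cerrno>
#include <cstdint>
#include <unordered_map>

// Hypothetical bookkeeping for handles whose data could not be flushed.
class TStickyFlushErrors
{
public:
    // Remember the first non-retriable error seen for the handle;
    // emplace() leaves an already recorded error untouched.
    void OnFlushFailed(uint64_t handle, int err)
    {
        Errors.emplace(handle, err);
    }

    // fsync(fd): returns 0 if the handle is healthy, otherwise the stored error.
    int Fsync(uint64_t handle) const
    {
        auto it = Errors.find(handle);
        return it == Errors.end() ? 0 : it->second;
    }

    // Sync-like call: fails while any handle still carries an error.
    // Which error is reported when several handles failed is unspecified here.
    int Sync() const
    {
        return Errors.empty() ? 0 : Errors.begin()->second;
    }

    // Closing the problematic handle clears its error state.
    void OnClose(uint64_t handle)
    {
        Errors.erase(handle);
    }

private:
    std::unordered_map<uint64_t, int> Errors;
};
```

After `OnFlushFailed(fd, ENOSPC)`, every `Fsync(fd)` reports ENOSPC until `OnClose(fd)` is called; other handles are unaffected.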

Stats->FlushCompleted();
CompleteFlush(nodeState);
if (nonRetriableError) {
    // TODO(#1751): handle non-retriable errors

Collaborator Author

Q: What should I do if it is not possible to flush?

Collaborator

Keep the problematic fd's data in the cache until this fd is closed, then just discard its data.
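
A minimal sketch of this policy, assuming cached writes are grouped by handle: `TPendingWrites` and all of its members are hypothetical and do not reflect the actual WriteBackCache data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical per-handle store of cached writes that were not flushed yet.
class TPendingWrites
{
public:
    void Add(uint64_t handle, std::string data)
    {
        Pending[handle].push_back(std::move(data));
    }

    // A non-retriable flush error: the data stays cached but is no longer retried.
    void MarkFailed(uint64_t handle)
    {
        assert(Pending.count(handle) != 0);  // only handles with cached data can fail
        Failed.insert(handle);
    }

    bool ShouldRetry(uint64_t handle) const
    {
        return Pending.count(handle) != 0 && Failed.count(handle) == 0;
    }

    size_t PendingCount(uint64_t handle) const
    {
        auto it = Pending.find(handle);
        return it == Pending.end() ? 0 : it->second.size();
    }

    // On close, discard the data of a failed handle and return how many
    // entries were dropped. Healthy handles are left for the regular flush.
    size_t OnClose(uint64_t handle)
    {
        if (Failed.erase(handle) == 0) {
            return 0;
        }
        const size_t discarded = PendingCount(handle);
        Pending.erase(handle);
        return discarded;
    }

private:
    std::unordered_map<uint64_t, std::vector<std::string>> Pending;
    std::unordered_set<uint64_t> Failed;
};
```

Keeping the entries until close means reads through the same handle can still be served from the cache, while the flush loop skips the handle via `ShouldRetry`.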

Collaborator

You can also experiment with ext4 and see what it does when it fails to flush (e.g. when the fs is out of space).
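
Such an experiment could be driven by a small probe that records which call reports the failure. This is only a sketch: `TProbeResult` and `WriteAndFsync` are hypothetical helpers, and the probe makes no claim about what ext4 actually returns. To exercise the out-of-space case, point it at a file on a small, nearly full ext4 mount.

```cpp
#include <algorithm>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

// errno observed at each step; 0 means the step succeeded (or was not reached).
struct TProbeResult
{
    int OpenErrno = 0;
    int WriteErrno = 0;
    int FsyncErrno = 0;
    int CloseErrno = 0;
};

// Writes totalBytes to path in 1 MiB chunks, then calls fsync() and close().
inline TProbeResult WriteAndFsync(const char* path, size_t totalBytes)
{
    TProbeResult result;
    const int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        result.OpenErrno = errno;
        return result;
    }

    const std::vector<char> chunk(1 << 20, 'x');
    size_t written = 0;
    while (written < totalBytes) {
        const size_t want = std::min(chunk.size(), totalBytes - written);
        const ssize_t n = ::write(fd, chunk.data(), want);
        if (n < 0) {
            if (errno == EINTR) {
                continue;  // interrupted before writing anything: try again
            }
            result.WriteErrno = errno;
            break;
        }
        if (n == 0) {
            break;  // no progress; stop instead of looping forever
        }
        written += static_cast<size_t>(n);
    }

    // fsync() is attempted even after a failed write to see what it reports.
    if (::fsync(fd) != 0) {
        result.FsyncErrno = errno;
    }
    if (::close(fd) != 0) {
        result.CloseErrno = errno;
    }
    return result;
}
```

Comparing `WriteErrno`, `FsyncErrno`, and `CloseErrno` on a full filesystem shows at which point the error surfaces, which is the behavior WriteBackCache would want to mirror.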
