[segment replication] Add async publish checkpoint task #17619

Open
guojialiang92 wants to merge 11 commits into main from dev/add_async_publish_checkpoint_task

Conversation

guojialiang92

Description

Added a test that reproduces the current behavior: if the primary shard fails to publish a checkpoint, the replica shard stops synchronizing with the primary shard.
Added an asynchronous task. When the primary shard detects that a replica has been behind for longer than a configured time threshold, the task triggers a publish checkpoint. With this task in place, the test above passes.
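
As a rough illustration of the asynchronous task described above, here is a minimal Java sketch. It is not the code in this PR: the class name `StaleReplicaCheckpointTask`, its methods, and the lag-tracking map are all hypothetical, and the actual change may be structured differently inside OpenSearch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; none of these names are taken from OpenSearch. */
final class StaleReplicaCheckpointTask {
    private final long thresholdMillis;
    private final Runnable publishCheckpoint;
    // replica id -> time (ms) at which the replica was first seen behind the primary
    private final Map<String, Long> behindSince = new ConcurrentHashMap<>();

    StaleReplicaCheckpointTask(long thresholdMillis, Runnable publishCheckpoint) {
        this.thresholdMillis = thresholdMillis;
        this.publishCheckpoint = publishCheckpoint;
    }

    /** Record that a replica is behind; keeps the earliest timestamp if already recorded. */
    void markBehind(String replicaId, long nowMillis) {
        behindSince.putIfAbsent(replicaId, nowMillis);
    }

    /** Record that a replica has caught up with the primary. */
    void markCaughtUp(String replicaId) {
        behindSince.remove(replicaId);
    }

    /** One pass of the task: publish again if any replica has lagged longer than the threshold. */
    boolean runOnce(long nowMillis) {
        boolean stale = behindSince.values().stream()
            .anyMatch(since -> nowMillis - since > thresholdMillis);
        if (stale) {
            publishCheckpoint.run();
        }
        return stale;
    }

    /** Run the check periodically on a background thread; the caller shuts the scheduler down. */
    ScheduledExecutorService start(long intervalMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(
            () -> runOnce(System.currentTimeMillis()),
            intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

In this sketch the periodic check is separate from the normal publish path, so a single failed publish is retried on a later pass instead of leaving the replica permanently behind.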

Related Issues

Resolves #17595

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added the bug and Indexing:Replication labels Mar 18, 2025
@guojialiang92 guojialiang92 changed the title from "Dev/add async publish checkpoint task" to "[segment replication] Add async publish checkpoint task" Mar 18, 2025

❌ Gradle check result for 2a272aa: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


❌ Gradle check result for 54945b2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@guojialiang92 guojialiang92 force-pushed the dev/add_async_publish_checkpoint_task branch from 54945b2 to 23c1b87 Compare March 22, 2025 03:08

❌ Gradle check result for 23c1b87: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@guojialiang92 guojialiang92 force-pushed the dev/add_async_publish_checkpoint_task branch from 23c1b87 to 4394239 Compare March 22, 2025 12:57

❌ Gradle check result for 4394239: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
@guojialiang92 guojialiang92 force-pushed the dev/add_async_publish_checkpoint_task branch from 4394239 to 9b5a236 Compare March 24, 2025 02:10

❕ Gradle check result for 9b5a236: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
      1 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


❌ Gradle check result for 546787d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
@guojialiang92 guojialiang92 force-pushed the dev/add_async_publish_checkpoint_task branch from 546787d to 5e09825 Compare March 24, 2025 09:53

❕ Gradle check result for 5e09825: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


❌ Gradle check result for af8670a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
@guojialiang92 guojialiang92 force-pushed the dev/add_async_publish_checkpoint_task branch from af8670a to e21129f Compare March 25, 2025 02:39

❌ Gradle check result for e21129f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>

Signed-off-by: guojialiang <[email protected]>

❕ Gradle check result for c01232b: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: guojialiang <[email protected]>

❌ Gradle check result for 89574de:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


❌ Gradle check result for 90f7337:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Labels
bug, Indexing:Replication
Development

Successfully merging this pull request may close these issues.

[BUG] segment replication stops when publish checkpoint fails
1 participant