
DAOS-19036 dtx: handle DTX race issues #18428

Merged
daltonbohning merged 2 commits into master from Nasf-Fan/DAOS-19036_1
Jun 18, 2026

Conversation

Nasf-Fan commented Jun 3, 2026
Contributor

Mainly including the following fixes:

  1. When the DTX leader switches, the old DTX leader may have started to abort a DTX but not completed the abort before its eviction. The new DTX leader may then re-execute the related modification successfully and try to commit that DTX. Without control, in-flight DTX ABORT RPCs from the old DTX leader may abort the DTX that the new DTX leader is about to commit, which breaks DTX semantics.

    The patch adds a @Version parameter to DTX abort: when the new DTX leader handles a resent RPC from the client, the related DTX version is refreshed if the DTX was already prepared by the old DTX leader; whenever a DTX is aborted locally, the logic compares the version from the ABORT request with the related DTX version and skips stale ABORT RPCs.

  2. vos_dtx_load_mbs() may be triggered before the related DTX is prepared locally. In that case, the related MBS information is empty. We need to handle this case to avoid a segmentation fault.

  3. Handle race between DTX resync and IO handler for resent RPC.

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions Bot commented Jun 3, 2026

Ticket title is 'Argonne Daos_user : Engine ranks 590, 593, and 596 entered Errored state unexpectedly'
Status is 'In Progress'
Labels: 'ALCF'
https://daosio.atlassian.net/browse/DAOS-19036

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-19036_1 branch from e0a5ca2 to 2243c14 Compare June 4, 2026 08:37

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-19036_1 branch from 2243c14 to bdab418 Compare June 4, 2026 08:41
@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18428/4/execution/node/1317/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18428/4/testReport/

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18428/4/execution/node/1399/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-19036_1 branch from bdab418 to 8a5489d Compare June 5, 2026 06:13
@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18428/5/execution/node/1383/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18428/5/execution/node/1331/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-19036_1 branch 2 times, most recently from 30fb5d3 to 389a433 Compare June 6, 2026 03:29
Signed-off-by: Fan Yong <fan.yong@hpe.com>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-19036_1 branch from 389a433 to 9a20cab Compare June 6, 2026 03:34
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18428/8/testReport/

@Nasf-Fan Nasf-Fan marked this pull request as ready for review June 9, 2026 04:58
@Nasf-Fan Nasf-Fan requested review from a team as code owners June 9, 2026 04:58
Nasf-Fan commented Jun 9, 2026
Contributor Author

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18428/8/testReport/

dfuse_daos_build_wb failed because of DAOS-19005, which is not related to this patch.

@Nasf-Fan
Contributor Author

Ping reviewers, thanks!

Comment thread src/dtx/dtx_resync.c
* commit/abort decision (against regular IO handler) by race.
*/
if ((ent->ie_dtx_ver > dra->resync_version) ||
(ent->ie_dtx_ver == dra->resync_version && !dra->for_all))

Contributor

Just to confirm:
L551 looks a little confusing. For example, suppose the pool map changes to version 10, an update DTX is issued at pm_ver 10, and then dtx_resync runs at pm_ver 10. Will the code above ignore the update DTX at pm_ver 10? If it is ignored, how are dtx_resync()'s semantics ensured?
Also, why is dtx_resync()'s "block" parameter true for rebuild and false for ds_cont_local_open()?

Contributor Author

For the update at pm_ver 10: since the latest pm_ver is also 10, the update is against the latest pool map version, so the DTX leader is alive and will be in charge of committing or aborting the related DTX entry.

The block parameter distinguishes rebuild from a regular container open. In the rebuild case, the (new) DTX leader will handle the related DTX entry. In the container open case, the related DTX leader ULT has already exited and nobody will initiate the commit or abort, so DTX resync needs to do that.

@Nasf-Fan Nasf-Fan requested a review from a team June 16, 2026 04:09
@daltonbohning
Contributor

I merged latest master since CI should be clean now.

@daltonbohning daltonbohning merged commit c61ae69 into master Jun 18, 2026
42 checks passed
@daltonbohning daltonbohning deleted the Nasf-Fan/DAOS-19036_1 branch June 18, 2026 17:20

5 participants