
DAOS-19016 test: Stale event pointer dereference in autotest kv_put/kv_get spin loops#18489

Open
knard38 wants to merge 5 commits into master from ckochhof/fix/master/daos-19016/patch-001

Conversation

@knard38 knard38 commented Jun 12, 2026

Description

Fixes a latent stale event pointer dereference in the kv_put() and kv_get() spin loops of src/utils/daos_autotest.c, and validates the fix with a fault-injection functional test.

Background

Commit 8f3ac4a5e1 (DAOS-16478) changed the event polling in daos_autotest.c from blocking (DAOS_EQ_WAIT) to non-blocking spin loops (DAOS_EQ_NOWAIT). The spin loops were not updated to handle the rc < 0 case from daos_eq_poll(). When daos_eq_poll() returns a negative error code (e.g., -DER_HG from a transient Mercury transport error), the events output array is not populated, yet the code unconditionally dereferences evp->ev_error on the next line. This can cause a SIGSEGV, event state corruption, or double submission.

This bug was discovered during the investigation of DAOS-18859, where it was ruled out as the root cause (the DAOS-18859 crash occurs before any kv_put()/kv_get() call). It is fixed here as an independent defect.

Fix — src/utils/daos_autotest.c

Three changes applied to both kv_put() and kv_get() spin loops, plus the drain loop in kv_put():

  • Initialize evp = NULL before each spin loop so that a stale pointer from a previous iteration is always detectable.
  • Break on rc < 0 immediately after the spin loop to prevent dereferencing evp after a poll failure.
  • Add D_ASSERT(evp != NULL) after each loop to catch future regressions during development and fault-injection runs.
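The three changes can be illustrated with a minimal, self-contained sketch. It does not reproduce the actual patch: `mock_event`, `mock_eq_poll()`, `wait_one()` and the `mock_*` control variables are hypothetical stand-ins, `assert()` stands in for `D_ASSERT()`, and the sketch returns on `rc < 0` where the real code is described as breaking out of its loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real DAOS types or API. */
struct mock_event {
	int ev_error;
};

#define MOCK_DER_HG 1020 /* DER_HG value as quoted in the description */

static struct mock_event mock_ev;
static int               mock_fail_poll;  /* non-zero: poll returns -MOCK_DER_HG */
static int               mock_idle_polls; /* polls returning 0 before a completion */

/* Mimics the described contract: when rc <= 0, *evp is left untouched. */
static int
mock_eq_poll(struct mock_event **evp)
{
	if (mock_fail_poll)
		return -MOCK_DER_HG;
	if (mock_idle_polls > 0) {
		mock_idle_polls--;
		return 0;
	}
	*evp = &mock_ev;
	return 1;
}

/* Wait for one completion; returns 0, the poll error, or the event error. */
static int
wait_one(void)
{
	struct mock_event *evp = NULL; /* change 1: no stale pointer survives */
	int                rc;

	do {
		rc = mock_eq_poll(&evp);
	} while (rc == 0); /* spin only while nothing has completed */

	if (rc < 0)
		return rc; /* change 2: never touch evp after a failed poll */

	assert(evp != NULL); /* change 3: stand-in for D_ASSERT(evp != NULL) */
	return evp->ev_error;
}
```

With `mock_fail_poll` set, `wait_one()` returns the poll error without reading `evp`; without the `rc < 0` check it would dereference a NULL (or, before change 1, stale) pointer.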

Additionally, the kv_put() drain loop is fixed to capture ev_error for I/O completions that arrive during a concurrent poll failure — previously those errors were silently dropped.
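As a rough sketch of the drain-loop idea, the code below records the first non-zero `ev_error` seen while draining instead of discarding it. How the actual patch records these errors is not shown on this page; the out-parameter `first_err`, the `drain()` helper and the mock poll function are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real DAOS types or API. */
struct mock_event {
	int ev_error;
};

#define MOCK_NR_EVENTS 3

static struct mock_event mock_inflight[MOCK_NR_EVENTS];
static int               mock_next; /* index of the next event to complete */

/* Hands out one completed event per call, then reports a poll failure. */
static int
mock_eq_poll(struct mock_event **evp)
{
	if (mock_next >= MOCK_NR_EVENTS)
		return -1020; /* DER_HG value as quoted in the description */
	*evp = &mock_inflight[mock_next++];
	return 1;
}

/*
 * Drain up to `inflight` completions. Returns how many were drained and
 * stores the first non-zero ev_error in *first_err, which the caller
 * initializes to 0.
 */
static int
drain(int inflight, int *first_err)
{
	int drained = 0;

	while (drained < inflight) {
		struct mock_event *evp = NULL;
		int                rc  = mock_eq_poll(&evp);

		if (rc < 0)
			break; /* poll failed: evp was not populated */
		if (rc == 0)
			continue; /* nothing completed yet (this mock never returns 0) */

		drained++;
		if (evp->ev_error != 0 && *first_err == 0)
			*first_err = evp->ev_error; /* captured instead of dropped */
	}
	return drained;
}
```

Keeping the event error separate from the poll return code means neither one overwrites the other, which is one way to avoid the silent loss described above.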

Fault injection point — src/client/api/event.c / src/include/daos/common.h

A new fault injection point DAOS_FAULT_EQ_POLL_FAIL (DAOS_FAIL_SYS_TEST_GROUP_LOC | 0x1000, decimal ID 135168) is added in daos_eq_poll(). When triggered, it returns -DER_HG before any event is dequeued, simulating a transient Mercury transport error without needing a real network failure. This makes the fix verifiable in CI.
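The decimal ID can be checked with a few lines of arithmetic. The group number (2) and shift (16) below are assumptions inferred from the quoted value 135168 (0x21000), not copied from src/include/daos/common.h, and the `SKETCH_*` names and helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, inferred from the decimal ID 135168 == 0x21000. */
#define SKETCH_FAIL_SYS_TEST_GROUP 2
#define SKETCH_FAIL_GROUP_SHIFT    16
#define SKETCH_FAIL_SYS_TEST_GROUP_LOC \
	((uint32_t)SKETCH_FAIL_SYS_TEST_GROUP << SKETCH_FAIL_GROUP_SHIFT)

/* Stand-in for DAOS_FAULT_EQ_POLL_FAIL as described above. */
#define SKETCH_FAULT_EQ_POLL_FAIL (SKETCH_FAIL_SYS_TEST_GROUP_LOC | 0x1000)

/* Upper bits of the ID: the fault group. */
static uint32_t
fault_group(uint32_t id)
{
	return id >> SKETCH_FAIL_GROUP_SHIFT;
}

/* Lower bits of the ID: the fault number within its group. */
static uint32_t
fault_local(uint32_t id)
{
	return id & ((1u << SKETCH_FAIL_GROUP_SHIFT) - 1);
}
```

Under these assumptions, 131072 (0x20000) plus 4096 (0x1000) gives 135168, matching the ID quoted in the description.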

Functional test — src/tests/ftest/pool/autotest_eq_poll_fi.py

New test class PoolAutotestEqPollFITest (Quick-Functional, hw/medium):

  1. Creates a pool.
  2. Runs daos pool autotest with DAOS_FAULT_EQ_POLL_FAIL active (enabled via the YAML fault_list section with max_faults: 5 and interval: 100).
  3. Asserts that autotest exits with rc == 1 (clean failure, no crash or hang).
  4. Asserts that DER_HG(-1020) appears in stderr, confirming the error propagated correctly without a stale-pointer dereference.
  5. Verifies the pool remains healthy after the expected autotest failure.

Test timeout rationale

The test timeout is set to 300 seconds. This value was derived from 5 consecutive CI runs (build pr_18489-build_001), which completed in:

| Run | Duration (s) |
|---|---|
| repeat001 | 119.395 |
| repeat002 | 119.840 |
| repeat003 | 119.256 |
| repeat004 | 119.782 |
| repeat005 | 119.758 |

All runs completed in ~120 s, with a spread of less than 0.6 s between the fastest and slowest run. The 300 s timeout applies a 2.5× safety factor over the observed maximum (~120 s), providing comfortable headroom for slower CI nodes or transient load.
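The timeout arithmetic can be reproduced from the five measurements. The helper names below are hypothetical; only the durations and the 300 s timeout come from the description.

```c
#include <assert.h>
#include <stddef.h>

/* Durations in seconds of the five CI runs listed above. */
static const double durations[] = {119.395, 119.840, 119.256, 119.782, 119.758};

#define NR_RUNS (sizeof(durations) / sizeof(durations[0]))

/* Slowest observed run. */
static double
run_max(void)
{
	double m = durations[0];
	size_t i;

	for (i = 1; i < NR_RUNS; i++)
		if (durations[i] > m)
			m = durations[i];
	return m;
}

/* Fastest observed run. */
static double
run_min(void)
{
	double m = durations[0];
	size_t i;

	for (i = 1; i < NR_RUNS; i++)
		if (durations[i] < m)
			m = durations[i];
	return m;
}

/* Difference between the slowest and fastest run. */
static double
run_spread(void)
{
	return run_max() - run_min();
}

/* Ratio of a candidate timeout to the slowest observed run. */
static double
safety_factor(double timeout)
{
	return timeout / run_max();
}
```

The slowest run is 119.840 s and the fastest 119.256 s, so the spread is about 0.58 s and a 300 s timeout is roughly 2.5 times the slowest run.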

Files changed

| File | Change |
|---|---|
| src/utils/daos_autotest.c | Fix: NULL init, rc < 0 break, D_ASSERT, drain-loop error capture |
| src/client/api/event.c | New DAOS_FAULT_EQ_POLL_FAIL fault injection point in daos_eq_poll() |
| src/include/daos/common.h | New DAOS_FAULT_EQ_POLL_FAIL constant |
| src/tests/ftest/pool/autotest_eq_poll_fi.py | New fault-injection functional test |
| src/tests/ftest/pool/autotest_eq_poll_fi.yaml | Test configuration (1 server, 1 client, 20G pool, fault list) |
| src/tests/ftest/util/fault_config_utils.py | Register DAOS_FAULT_EQ_POLL_FAIL in the fault config registry |

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions Bot commented Jun 12, 2026

Ticket title is 'Stale event pointer dereference in autotest kv_put/kv_get spin loops'
Status is 'In Review'
Errors are Title of PR is too long
https://daosio.atlassian.net/browse/DAOS-19016

…et loops

The kv_put() and kv_get() functions in src/utils/daos_autotest.c have a
latent bug: when daos_eq_poll() returns a negative error code the event
pointer evp is not populated, yet the code unconditionally dereferences
evp->ev_error on the next line.  This causes a SIGSEGV, event state
corruption, or double submission.

Fix:
- Initialize evp = NULL before each spin loop so that the stale-pointer
  condition is always detectable.
- Break out of the loop when rc < 0 so evp is never dereferenced after a
  poll failure.
- Add D_ASSERT(evp != NULL) after each loop to catch future regressions.
- In the kv_put() drain loop, capture ev_error for completions that arrive
  during a concurrent poll failure.

To facilitate testing, add fault injection point DAOS_FAULT_EQ_POLL_FAIL
(DAOS_FAIL_SYS_TEST_GROUP_LOC | 0x1000, decimal 135168) in daos_eq_poll().
When triggered it returns -DER_HG, simulating a transient Mercury transport
error without needing a real network failure.

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
@knard38 knard38 self-assigned this Jun 15, 2026
…ndling

Add a new pool functional test PoolAutotestEqPollFITest that verifies the
fix for the stale event pointer dereference in the kv_put() / kv_get()
spin loops of src/utils/daos_autotest.c (DAOS-19016).

The test enables fault injection point DAOS_FAULT_EQ_POLL_FAIL (ID 135168)
via the YAML fault_list section.  This causes daos_eq_poll() to return
-DER_HG, exercising the rc < 0 break added by the fix.

Verification:
  - daos pool autotest exits with rc == 1 (clean failure, no crash)
  - DER_HG(-1020) appears in the stderr output
  - the pool remains healthy after the expected autotest failure

Features: autotest
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
@knard38 knard38 force-pushed the ckochhof/fix/master/daos-19016/patch-001 branch from 992704b to 030bece Compare June 15, 2026 07:27
@knard38 knard38 marked this pull request as ready for review June 15, 2026 07:40
@knard38 knard38 requested review from a team as code owners June 15, 2026 07:40
@daosbuild3


@daltonbohning daltonbohning left a comment

Overall LGTM - just nits on test code

Comment thread src/tests/ftest/pool/autotest_eq_poll_fi.yaml Outdated
Comment thread src/tests/ftest/pool/autotest_eq_poll_fi.py Outdated
Comment thread src/tests/ftest/pool/autotest_eq_poll_fi.py Outdated
Comment thread src/tests/ftest/pool/autotest_eq_poll_fi.py Outdated
Comment thread src/tests/ftest/pool/autotest_eq_poll_fi.py Outdated
Comment thread src/client/api/event.c Outdated
epa.count = 0;

/* Fault injection: crt_progress failure BEFORE dequeue; caller's evp remains stale. */
fa = d_fault_attr_lookup(DAOS_FAULT_EQ_POLL_FAIL);

d_fault_attr_lookup is expensive (requires acquiring lock and lookup in the gurt hashtable). we should not put this FI in the hotpath of eq_poll which can be called in a loop by application with 0 timeout.

i would rather not do any FI at all in polling please.

Thanks for the feedback — I wasn't aware of that cost (first time writing a FI test in DAOS).

To avoid the expensive d_fault_attr_lookup() call in the hot path, would guarding it with d_fault_inject (the global on/off switch from gurt/fault_inject.h) be acceptable?

/* Fault injection: simulate crt_progress failure before dequeue; evp remains stale. */
if (unlikely(d_fault_inject)) {
    fa = d_fault_attr_lookup(DAOS_FAULT_EQ_POLL_FAIL);
    if (fa != NULL && D_SHOULD_FAIL(fa)) {
        daos_eq_putref(epa.eqx);
        return -DER_HG;
    }
}

When FI is globally disabled, the branch reduces to a comparison against zero — no lock, no hashtable lookup.
Would that be acceptable or is it not allowed at all to have FI code in such hot path of the code?

i would not say it is not allowed in the sense that there is no written rule on this.

but just speaking from a developer perspective or even CI, all tests are done with FI enabled (non-release build), so not just the FI stage. So this will be exercised in all use cases where apps or tests call poll in a loop, anywhere we use a non-release build.
so for this particular case, i do not really see enough benefit from this FI test case to justify incurring that cost in non-release builds.
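The trade-off discussed in this thread can be made concrete with a small counting mock. It assumes, as the reviewer states, that the global switch is on in non-release builds; every name below (`mock_fault_inject`, `mock_fault_attr_lookup()`, `mock_poll()`, `lookups_for()`) is a hypothetical stand-in and not the gurt API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real gurt API. */
struct mock_fault_attr {
	int unused;
};

static unsigned int  mock_fault_inject; /* stand-in for the global d_fault_inject */
static unsigned long mock_lookups;      /* counts the "expensive" lookups */

/* Stand-in for the lock + hash-table lookup; the fault is never configured. */
static struct mock_fault_attr *
mock_fault_attr_lookup(void)
{
	mock_lookups++;
	return NULL;
}

/* Poll with the proposed guard: the lookup runs only when FI is enabled. */
static int
mock_poll(void)
{
	if (mock_fault_inject) {
		struct mock_fault_attr *fa = mock_fault_attr_lookup();

		if (fa != NULL)
			return -1020; /* DER_HG value as quoted in the description */
	}
	return 0;
}

/* Number of lookups performed by `npolls` back-to-back polls. */
static unsigned long
lookups_for(unsigned int fi_enabled, unsigned long npolls)
{
	unsigned long i;

	mock_fault_inject = fi_enabled;
	mock_lookups      = 0;
	for (i = 0; i < npolls; i++)
		(void)mock_poll();
	return mock_lookups;
}
```

With the switch off, a thousand polls perform no lookups; with it on, every poll pays for one lookup even though the fault is never configured, which is the cost the reviewer objects to.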

@knard38 knard38 Jun 15, 2026

Fair enough, I will remove the FI point and the functional test associated with it.

  • Remove FI and functional test

Fixed with commit 2609b65

knard38 added 2 commits June 16, 2026 12:52
Fix reviewers comment:
- Remove FI and functional test

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
@knard38 knard38 requested review from daltonbohning and mchaarawi and removed request for daltonbohning and phender June 16, 2026 13:40