DAOS-19016 test: Stale event pointer dereference in autotest kv_put/kv_get spin loops#18489
DAOS-19016 test: Stale event pointer dereference in autotest kv_put/kv_get spin loops#18489knard38 wants to merge 5 commits into
Conversation
|
Ticket title is 'Stale event pointer dereference in autotest kv_put/kv_get spin loops' |
…et loops The kv_put() and kv_get() functions in src/utils/daos_autotest.c have a latent bug: when daos_eq_poll() returns a negative error code the event pointer evp is not populated, yet the code unconditionally dereferences evp->ev_error on the next line. This causes a SIGSEGV, event state corruption, or double submission. Fix: - Initialize evp = NULL before each spin loop so that the stale-pointer condition is always detectable. - Break out of the loop when rc < 0 so evp is never dereferenced after a poll failure. - Add D_ASSERT(evp != NULL) after each loop to catch future regressions. - In the kv_put() drain loop, capture ev_error for completions that arrive during a concurrent poll failure. To facilitate testing, add fault injection point DAOS_FAULT_EQ_POLL_FAIL (DAOS_FAIL_SYS_TEST_GROUP_LOC | 0x1000, decimal 135168) in daos_eq_poll(). When triggered it returns -DER_HG, simulating a transient Mercury transport error without needing a real network failure. Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
…ndling Add a new pool functional test PoolAutotestEqPollFITest that verifies the fix for the stale event pointer dereference in the kv_put() / kv_get() spin loops of src/utils/daos_autotest.c (DAOS-19016). The test enables fault injection point DAOS_FAULT_EQ_POLL_FAIL (ID 135168) via the YAML fault_list section. This causes daos_eq_poll() to return -DER_HG, exercising the rc < 0 break added by the fix. Verification: - daos pool autotest exits with rc == 1 (clean failure, no crash) - DER_HG(-1020) appears in the stderr output - the pool remains healthy after the expected autotest failure Features: autotest Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
992704b to
030bece
Compare
|
Test stage Test RPMs on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18489/2/execution/node/1011/log |
daltonbohning
left a comment
There was a problem hiding this comment.
Overall LGTM - just nits on test code
| epa.count = 0; | ||
|
|
||
| /* Fault injection: crt_progress failure BEFORE dequeue; caller's evp remains stale. */ | ||
| fa = d_fault_attr_lookup(DAOS_FAULT_EQ_POLL_FAIL); |
There was a problem hiding this comment.
d_fault_attr_lookup is expensive (requires acquiring lock and lookup in the gurt hashtable). we should not put this FI in the hotpath of eq_poll which can be called in a loop by application with 0 timeout.
i would rather not do any FI at all in polling please.
There was a problem hiding this comment.
Thanks for the feedback — I wasn't aware of that cost (first time writing a FI test in DAOS).
To avoid the expensive d_fault_attr_lookup() call in the hot path, would guarding it with d_fault_inject (the global on/off switch from gurt/fault_inject.h) be acceptable?
/* Fault injection: simulate crt_progress failure before dequeue; evp remains stale. */
if (unlikely(d_fault_inject)) {
fa = d_fault_attr_lookup(DAOS_FAULT_EQ_POLL_FAIL);
if (fa != NULL && D_SHOULD_FAIL(fa)) {
daos_eq_putref(epa.eqx);
return -DER_HG;
}
}When FI is globally disabled, the branch reduces to a comparison against zero — no lock, no hashtable lookup.
Would that be acceptable or is it not allowed at all to have FI code in such hot path of the code?
There was a problem hiding this comment.
i would not say it is not allowed in the sense that there is no written rule on this.
but just speaking from a developer perspective or even CI, all tests are done with FI enabled (non-release build), so not just FI stage. So this will be exercised in all user cases where apps or tests call poll in a loop anywhere we use a non-release build.
so for this particular case, i do not see really a big benefit of this FI test case to incur such an issue for non release builds.
There was a problem hiding this comment.
Fair enough, I will remove the FI test and the functional test associated to it.
- Remove FI and functional test
…/daos-19016/patch-001
Fix reviewers comment: - Remove FI and functional test Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
…/daos-19016/patch-001
Description
Fixes a latent stale event pointer dereference in the
kv_put()andkv_get()spin loops of src/utils/daos_autotest.c`, and validates the fix with a fault-injection functional test.Background
Commit
8f3ac4a5e1(DAOS-16478) changed the event polling indaos_autotest.cfrom blocking (DAOS_EQ_WAIT) to non-blocking spin loops (DAOS_EQ_NOWAIT). The spin loops were not updated to handle therc < 0case fromdaos_eq_poll(). Whendaos_eq_poll()returns a negative error code (e.g.,-DER_HGfrom a transient Mercury transport error), theeventsoutput array is not populated — yet the code unconditionally dereferencesevp->ev_erroron the next line. This causes aSIGSEGV, event state corruption, or double submission.This bug was discovered during the investigation of DAOS-18859, where it was ruled out as the root cause (the DAOS-18859 crash occurs before any
kv_put()/kv_get()call). It is fixed here as an independent defect.Fix —
src/utils/daos_autotest.cThree changes applied to both
kv_put()andkv_get()spin loops, plus the drain loop inkv_put():evp = NULLbefore each spin loop so that a stale pointer from a previous iteration is always detectable.rc < 0immediately after the spin loop to prevent dereferencingevpafter a poll failure.D_ASSERT(evp != NULL)after each loop to catch future regressions during development and fault-injection runs.Additionally, the
kv_put()drain loop is fixed to captureev_errorfor I/O completions that arrive during a concurrent poll failure — previously those errors were silently dropped.Fault injection point —
src/client/api/event.c/src/include/daos/common.hA new fault injection point
DAOS_FAULT_EQ_POLL_FAIL(DAOS_FAIL_SYS_TEST_GROUP_LOC | 0x1000, decimal ID 135168) is added indaos_eq_poll(). When triggered, it returns-DER_HGbefore any event is dequeued, simulating a transient Mercury transport error without needing a real network failure. This makes the fix verifiable in CI.Functional test —
src/tests/ftest/pool/autotest_eq_poll_fi.pyNew test class
PoolAutotestEqPollFITest(Quick-Functional,hw/medium):daos pool autotestwithDAOS_FAULT_EQ_POLL_FAILactive (enabled via the YAMLfault_listsection withmax_faults: 5andinterval: 100).rc == 1(clean failure, no crash or hang).DER_HG(-1020)appears in stderr, confirming the error propagated correctly without a stale-pointer dereference.Test timeout rationale
The test timeout is set to 300 seconds. This value was derived from 5 consecutive CI runs (build pr_18489-build_001), which completed in:
All runs completed in ~120 s with less than 0.6 s of variance. The 300 s timeout applies a 2.5× safety factor over the observed maximum (~120 s), providing comfortable headroom for slower CI nodes or transient load.
Files changed
src/utils/daos_autotest.crc < 0break,D_ASSERT, drain-loop error capturesrc/client/api/event.cDAOS_FAULT_EQ_POLL_FAILfault injection point indaos_eq_poll()src/include/daos/common.hDAOS_FAULT_EQ_POLL_FAILconstantsrc/tests/ftest/pool/autotest_eq_poll_fi.pysrc/tests/ftest/pool/autotest_eq_poll_fi.yamlsrc/tests/ftest/util/fault_config_utils.pyDAOS_FAULT_EQ_POLL_FAILin the fault config registrySteps for the author:
After all prior steps are complete: