test: fix API timing false positives in test_population_latency #5942
Merged
Manciukic merged 1 commit on Jun 10, 2026
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #5942 +/- ##
=======================================
Coverage 83.00% 83.00%
=======================================
Files 277 277
Lines 30106 30106
=======================================
Hits 24989 24989
Misses 5117 5117
Contributor
Author
Perf pipeline build to verify the fix to |
marco-marangoni previously approved these changes on Jun 9, 2026
Contributor
Discussed offline. It's strange that Get /machine-config causes page faults, but perhaps we can at least improve the commit message.
2f4a35d to 7883ea8
Revert the kernel-version-based exclusion for API response time validation (added in 03eb60c) and instead disable the check only where it's actually expected to be slow: in test_population_latency's restored VMs.

The population latency test restores a snapshot with the fault_all uffd handler and immediately issues API calls while all guest memory is being faulted in. On host kernels 6.1 and 6.18, Get /machine-config takes ~800-900ms in this window. The exact reason is unclear (the handler reads Firecracker's own struct, not guest memory), but it consistently exceeds MAX_API_CALL_DURATION_MS.

Rather than blanket-disabling the timing check per kernel version, scope the exclusion to the one test where it's known to be unreliable.

Signed-off-by: Riccardo Mancini <mancio@amazon.com>
7883ea8 to 036b8fe
JackThomson2 approved these changes on Jun 9, 2026
ilstam approved these changes on Jun 9, 2026
Problem

test_population_latency fails on host kernel 6.18 (and previously on 6.1) in the API response time check.

The test uses the fault_all uffd handler, which eagerly faults in all guest memory after snapshot restore. During this window, any API call (like Get /machine-config) competes with the faulting activity, causing response times to exceed the MAX_API_CALL_DURATION_MS threshold.

This was previously worked around by blanket-disabling time_api_requests on host kernel 6.1 (commit 03eb60c), but the same issue appeared on 6.18 after it was added to CI.

Fix

- Revert the kernel-version-based condition (… != "6.1") back to unconditional True, which re-enables the check on 6.1 for all other tests.
- Disable time_api_requests only for restored VMs in test_population_latency, where slow API responses during uffd faulting are expected by design.

Why this is correct

- _validate_api_response_times() is called at VM teardown (kill()), which happens after yield returns from build_n_from_snapshot.
- Setting microvm.time_api_requests = False inside the loop body (before any test logic) ensures the flag is already set when kill() eventually checks it.
- No memory_monitor is attached (monitor_memory=False), so Get /machine-config is not called during restore itself.

Testing

- ./tools/devtool checkstyle passes (20 passed, 4 skipped)
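
The teardown-ordering argument under "Why this is correct" can be illustrated with a small, self-contained sketch. It is not the Firecracker test framework: FakeMicrovm, record_api_call, SlowApiCallError and run_population_latency are hypothetical stand-ins, the threshold value is a placeholder, and the generator only mimics the described behaviour of build_n_from_snapshot (yield a VM, then kill it once the loop body has finished).

```python
from dataclasses import dataclass, field

# Placeholder threshold for illustration only; not taken from the test framework.
MAX_API_CALL_DURATION_MS = 700


class SlowApiCallError(AssertionError):
    """Hypothetical error raised when a recorded API call was too slow."""


@dataclass
class FakeMicrovm:
    """Hypothetical stand-in for a restored microVM."""

    time_api_requests: bool = True
    api_durations_ms: list = field(default_factory=list)

    def record_api_call(self, name, duration_ms):
        # Pretend an API call was issued and took duration_ms.
        self.api_durations_ms.append((name, duration_ms))

    def _validate_api_response_times(self):
        for name, duration_ms in self.api_durations_ms:
            if duration_ms > MAX_API_CALL_DURATION_MS:
                raise SlowApiCallError(f"{name} took {duration_ms} ms")

    def kill(self):
        # The timing check runs at teardown, and only if the flag is still set.
        if self.time_api_requests:
            self._validate_api_response_times()


def build_n_from_snapshot(n):
    """Mimics the described generator: teardown happens after yield returns."""
    for _ in range(n):
        vm = FakeMicrovm()
        yield vm
        vm.kill()  # runs only once the consumer's loop body has completed


def run_population_latency(disable_timing):
    restored = 0
    for microvm in build_n_from_snapshot(3):
        if disable_timing:
            # Set the flag first, before any test logic issues API calls.
            microvm.time_api_requests = False
        # Simulate a slow Get /machine-config during uffd faulting.
        microvm.record_api_call("Get /machine-config", 850)
        restored += 1
    return restored
```

In this sketch, run_population_latency(True) completes all three iterations, while run_population_latency(False) raises SlowApiCallError as soon as the generator resumes and tears down the first VM. Because the flag is read only inside kill(), it does not matter that it is set after the VM has been restored, as long as it is set before the loop body ends.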