test: fix API timing false positives in test_population_latency by Manciukic · Pull Request #5942 · firecracker-microvm/firecracker

Manciukic · 2026-06-09T12:13:38Z

Problem

test_population_latency fails on host kernel 6.18 (and previously on 6.1) with:

AssertionError: Get /machine-config API call exceeded maximum duration: 884 ms

The test uses the fault_all uffd handler which eagerly faults in all guest memory after snapshot restore. During this window, any API call (like Get /machine-config) competes with the faulting activity, causing response times to exceed the MAX_API_CALL_DURATION_MS threshold.

This was previously worked around by blanket-disabling time_api_requests on host kernel 6.1 (commit 03eb60c), but the same issue appeared on 6.18 after it was added to CI.

Fix

Revert the kernel-version-based exclusion (!= "6.1") back to unconditional True — re-enables the check on 6.1 for all other tests.
Disable time_api_requests only for restored VMs in test_population_latency, where slow API responses during uffd faulting are expected by design.

Why this is correct

_validate_api_response_times() is called at VM teardown (kill()), which happens after yield returns from build_n_from_snapshot
Setting microvm.time_api_requests = False inside the loop body (before any test logic) ensures the flag is already set when kill() eventually checks it
No memory_monitor is attached (monitor_memory=False), so Get /machine-config is not called during restore itself

Testing

./tools/devtool checkstyle passes (20 passed, 4 skipped)

codecov · 2026-06-09T12:24:30Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.00%. Comparing base (d300fd2) to head (036b8fe).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #5942   +/-   ##
=======================================
  Coverage   83.00%   83.00%           
=======================================
  Files         277      277           
  Lines       30106    30106           
=======================================
  Hits        24989    24989           
  Misses       5117     5117

Flag	Coverage Δ
5.10-m5n.metal	`83.31% <ø> (+<0.01%)`	⬆️
5.10-m6a.metal	`82.65% <ø> (+<0.01%)`	⬆️
5.10-m6g.metal	`79.94% <ø> (ø)`
5.10-m6i.metal	`83.30% <ø> (-0.01%)`	⬇️
5.10-m7a.metal-48xl	`82.64% <ø> (+<0.01%)`	⬆️
5.10-m7g.metal	`79.94% <ø> (ø)`
5.10-m7i.metal-24xl	`83.27% <ø> (-0.01%)`	⬇️
5.10-m7i.metal-48xl	`83.28% <ø> (+<0.01%)`	⬆️
5.10-m8g.metal-24xl	`79.94% <ø> (ø)`
5.10-m8g.metal-48xl	`79.94% <ø> (+<0.01%)`	⬆️
5.10-m8i.metal-48xl	`83.28% <ø> (+<0.01%)`	⬆️
5.10-m8i.metal-96xl	`83.28% <ø> (ø)`
6.1-m5n.metal	`83.33% <ø> (+<0.01%)`	⬆️
6.1-m6a.metal	`82.68% <ø> (+<0.01%)`	⬆️
6.1-m6g.metal	`79.94% <ø> (-0.01%)`	⬇️
6.1-m6i.metal	`83.33% <ø> (+<0.01%)`	⬆️
6.1-m7a.metal-48xl	`82.66% <ø> (ø)`
6.1-m7g.metal	`79.93% <ø> (-0.01%)`	⬇️
6.1-m7i.metal-24xl	`83.34% <ø> (ø)`
6.1-m7i.metal-48xl	`83.34% <ø> (ø)`
6.1-m8g.metal-24xl	`79.93% <ø> (-0.01%)`	⬇️
6.1-m8g.metal-48xl	`79.94% <ø> (ø)`
6.1-m8i.metal-48xl	`83.34% <ø> (+<0.01%)`	⬆️
6.1-m8i.metal-96xl	`83.34% <ø> (ø)`
6.18-m5n.metal	`83.33% <ø> (+<0.01%)`	⬆️
6.18-m6a.metal	`82.67% <ø> (-0.01%)`	⬇️
6.18-m6g.metal	`79.94% <ø> (ø)`
6.18-m6i.metal	`83.33% <ø> (+<0.01%)`	⬆️
6.18-m7a.metal-48xl	`82.66% <ø> (ø)`
6.18-m7g.metal	`79.94% <ø> (ø)`
6.18-m7i.metal-24xl	`83.34% <ø> (+<0.01%)`	⬆️
6.18-m7i.metal-48xl	`83.34% <ø> (ø)`
6.18-m8g.metal-24xl	`79.94% <ø> (ø)`
6.18-m8g.metal-48xl	`79.94% <ø> (ø)`
6.18-m8i.metal-48xl	`83.34% <ø> (ø)`
6.18-m8i.metal-96xl	`83.35% <ø> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Manciukic · 2026-06-09T13:02:27Z

Perf pipeline build to verify the fix to test_population_latency: https://buildkite.com/firecracker/performance-a-b-tests/builds/929/canvas

ilstam · 2026-06-09T14:39:17Z

Get /machine-config after uffd-based snapshot restore is slow while pages are being demand-faulted

Discussed offline, it's strange that Get /machine-config causes page faults, but perhaps we can at least improve the commit message.

Revert the kernel-version-based exclusion for API response time validation (added in 03eb60c) and instead disable the check only where it's actually expected to be slow: in test_population_latency's restored VMs. The population latency test restores a snapshot with the fault_all uffd handler and immediately issues API calls while all guest memory is being faulted in. On host kernels 6.1 and 6.18, Get /machine-config takes ~800-900ms in this window — the exact reason is unclear (the handler reads Firecracker's own struct, not guest memory), but it consistently exceeds MAX_API_CALL_DURATION_MS. Rather than blanket-disabling the timing check per kernel version, scope the exclusion to the one test where it's known to be unreliable. Signed-off-by: Riccardo Mancini <mancio@amazon.com>

Manciukic marked this pull request as ready for review June 9, 2026 14:07

Manciukic added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Jun 9, 2026

Manciukic enabled auto-merge (rebase) June 9, 2026 14:07

marco-marangoni previously approved these changes Jun 9, 2026

View reviewed changes

Manciukic force-pushed the fix-population-latency-api-timing branch from 2f4a35d to 7883ea8 Compare June 9, 2026 14:48

Manciukic dismissed marco-marangoni’s stale review via 036b8fe June 9, 2026 14:48

Manciukic force-pushed the fix-population-latency-api-timing branch from 7883ea8 to 036b8fe Compare June 9, 2026 14:48

JackThomson2 approved these changes Jun 9, 2026

View reviewed changes

ilstam approved these changes Jun 9, 2026

View reviewed changes

Manciukic merged commit 9700741 into firecracker-microvm:main Jun 10, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: fix API timing false positives in test_population_latency#5942

test: fix API timing false positives in test_population_latency#5942
Manciukic merged 1 commit into
firecracker-microvm:mainfrom
Manciukic:fix-population-latency-api-timing

Manciukic commented Jun 9, 2026

Uh oh!

codecov Bot commented Jun 9, 2026 •

edited

Loading

Uh oh!

Manciukic commented Jun 9, 2026

Uh oh!

ilstam commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

Manciukic commented Jun 9, 2026

Problem

Fix

Why this is correct

Testing

Uh oh!

codecov Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Manciukic commented Jun 9, 2026

Uh oh!

ilstam commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov Bot commented Jun 9, 2026 •

edited

Loading