
Mock out idle sleeps in orchestration tests (~9x faster) #5223

Closed
saitcakmak wants to merge 1 commit into facebook:main from saitcakmak:speedup-orchestration-tests

saitcakmak (Contributor) commented Jun 4, 2026

TLDR

Mocks out sleep in the orchestrator tests, cutting the ax/orchestration test runtime from ~400s to ~43s.

Summary

The ax/orchestration test suite was the single slowest test directory in Ax (~400s of wall time), but it spent almost all of that time sleeping rather than computing (~20s of CPU time out of ~400s of wall time). This PR removes the idle waits with a shared _mock_orchestrator_poll_sleep() helper that is called from the setUp of both test classes. Both need the call because the subclass, TestAxOrchestratorMultiTypeExperiment, calls TestCase.setUp directly rather than super().setUp(), so its parent's setUp never runs.
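The setUp detail above is easy to trip over, so here is a minimal sketch of the pattern. It is an illustration under assumptions, not the PR's code: orchestrator is a hypothetical stand-in namespace for a module with a module-level sleep name, and mock_poll_sleep_for_test, OrchestratorTest, and MultiTypeOrchestratorTest are hypothetical names. It shows why a subclass that calls TestCase.setUp directly has to call the helper itself.

```python
import time
import types
import unittest
from unittest import mock

# Hypothetical stand-in for ax.orchestration.orchestrator, which exposes a
# module-level `sleep` name (the PR patches ax.orchestration.orchestrator.sleep).
orchestrator = types.SimpleNamespace(sleep=time.sleep)


def mock_poll_sleep_for_test(test_case: unittest.TestCase) -> None:
    """Hypothetical helper: no-op the poll sleep for the duration of one test."""
    patcher = mock.patch.object(orchestrator, "sleep")
    patcher.start()
    # Restores the real sleep after the test, even if the test fails.
    test_case.addCleanup(patcher.stop)


class OrchestratorTest(unittest.TestCase):
    def setUp(self) -> None:
        super().setUp()
        mock_poll_sleep_for_test(self)

    def test_poll_sleep_is_mocked(self) -> None:
        self.assertIsInstance(orchestrator.sleep, mock.Mock)


class MultiTypeOrchestratorTest(OrchestratorTest):
    def setUp(self) -> None:
        # Calling TestCase.setUp directly skips OrchestratorTest.setUp, so the
        # patch started there is never applied for this class...
        unittest.TestCase.setUp(self)
        # ...which is why the helper has to be called here as well.
        mock_poll_sleep_for_test(self)
```

Running both classes (for example through unittest.TestLoader) passes. Dropping the helper call from MultiTypeOrchestratorTest.setUp makes its inherited test fail, because orchestrator.sleep is then the real time.sleep.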

Two sources of pure idle time are removed:

  1. Orchestrator polling loop (ax.orchestration.orchestrator.sleep). Many tests leave init_seconds_between_polls / min_seconds_before_poll at their non-zero defaults (1.0s), so the polling loop spends most of its time waiting. Loop termination depends on total_seconds_elapsed, which accumulates the configured interval whether or not sleep actually pauses, so removing the wait preserves behavior.

  2. Exponential backoff between DB-save retries in retry_on_exception (initial_wait_seconds=5, i.e. 5s + 10s = 15s of waiting in test_suppress_all_storage_errors). Only sleep is mocked: the module's time reference is replaced with a Mock that wraps the real time module (patch("ax.utils.common.executils.time", Mock(wraps=time))), so the retry count and every other time.* function are untouched. Trial-TTL tests that rely on the real time.sleep (e.g. test_generate_candidates_can_remove_stale_candidates) keep running against the real clock.
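Both patches can be sketched together in a small, self-contained form. This is an illustration under stated assumptions, not the PR's helper: orchestrator and executils are hypothetical stand-in namespaces, and poll_until_done, save_with_retries, idle_sleeps_mocked, and always_failing_save are hypothetical toy functions that only mimic the behavior described above (a loop that counts configured intervals, and a 5s-then-10s backoff). It shows why mocking sleep leaves poll and retry counts unchanged, and how Mock(wraps=time) with only sleep replaced keeps every other time.* call real.

```python
import contextlib
import time
import types
from typing import Callable, Iterator, Tuple
from unittest import mock

# Hypothetical stand-ins for the two patch targets named in the PR
# (ax.orchestration.orchestrator.sleep and ax.utils.common.executils.time).
orchestrator = types.SimpleNamespace(sleep=time.sleep)
executils = types.SimpleNamespace(time=time)


def poll_until_done(seconds_between_polls: float, timeout_seconds: float) -> int:
    """Toy polling loop whose exit condition counts configured intervals."""
    total_seconds_elapsed = 0.0
    polls = 0
    while total_seconds_elapsed < timeout_seconds:
        polls += 1
        orchestrator.sleep(seconds_between_polls)
        # Advances by the configured interval, not by measured wall time.
        total_seconds_elapsed += seconds_between_polls
    return polls


def save_with_retries(
    save: Callable[[], object], num_retries: int = 3, initial_wait_seconds: float = 5.0
) -> object:
    """Toy exponential backoff: waits 5s, then 10s, ... between failed attempts."""
    wait = initial_wait_seconds
    for attempt in range(num_retries):
        try:
            return save()
        except RuntimeError:
            if attempt == num_retries - 1:
                raise
            executils.time.sleep(wait)
            wait *= 2
    return None


@contextlib.contextmanager
def idle_sleeps_mocked() -> Iterator[Tuple[mock.Mock, mock.Mock]]:
    """Hypothetical helper: no-op both idle sleeps and nothing else."""
    time_mock = mock.Mock(wraps=time)  # other time.* calls reach the real module
    time_mock.sleep = mock.Mock()  # only sleep is replaced; it returns immediately
    with mock.patch.object(orchestrator, "sleep") as poll_sleep:
        with mock.patch.object(executils, "time", time_mock):
            yield poll_sleep, time_mock.sleep


def always_failing_save() -> None:
    raise RuntimeError("simulated storage error")
```

Inside idle_sleeps_mocked(), poll_until_done(1.0, 20.0) still performs 20 polls and save_with_retries(always_failing_save) still makes 3 attempts with recorded waits of 5.0 and 10.0 seconds, yet both return almost instantly, while executils.time.time() still reads the real clock.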

Result

Single-threaded, all 164 tests still pass:

Stage                 Wall time
Baseline              401.6s
+ poll-sleep mock     103.5s
+ retry-sleep mock     42.9s

~89% less wall time (401.6s → 42.9s, a ~9.4x speedup). The previously dominant tests (test_ignore_global_stopping ~20s, test_run_n_trials* ~11s, test_suppress_all_storage_errors ~15s) all drop to under a second. The TTL tests intentionally still take ~2.2s, confirming that real elapsed-time behavior is preserved.
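For reference, the ratios behind the headline numbers, computed only from the wall times in the table above:

```latex
\begin{aligned}
\frac{401.6\,\text{s}}{103.5\,\text{s}} &\approx 3.88\times && \text{(poll-sleep mock alone)}\\
\frac{401.6\,\text{s}}{42.9\,\text{s}} &\approx 9.36\times && \text{(both mocks)}\\
401.6\,\text{s} - 42.9\,\text{s} &= 358.7\,\text{s} && \text{(wall time saved)}\\
1 - \frac{42.9}{401.6} &\approx 0.893 && \text{(fraction of wall time removed)}
\end{aligned}
```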

GitHub CI results:

Before: 28m 5s https://github.com/facebook/Ax/actions/runs/26967821950/job/79575013374
After: 20m 46s https://github.com/facebook/Ax/actions/runs/26968208766/job/79576349702?pr=5223
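Converted to seconds, the two CI job durations give a rough end-to-end saving (one run each, so this is indicative rather than a benchmark):

```latex
\begin{aligned}
t_{\text{before}} &= 28\,\text{min}\ 5\,\text{s} = 1685\,\text{s}\\
t_{\text{after}} &= 20\,\text{min}\ 46\,\text{s} = 1246\,\text{s}\\
t_{\text{before}} - t_{\text{after}} &= 439\,\text{s} \approx 7.3\,\text{min}, \qquad \frac{439}{1685} \approx 0.26
\end{aligned}
```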

Test Plan

  • pytest ax/orchestration/ → 164 passed
  • ufmt + flake8 clean

meta-cla bot added the CLA Signed label Jun 4, 2026
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.49%. Comparing base (b3ddd71) to head (99046ec).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5223      +/-   ##
==========================================
- Coverage   96.50%   96.49%   -0.02%     
==========================================
  Files         617      617              
  Lines       69776    70012     +236     
==========================================
+ Hits        67339    67557     +218     
- Misses       2437     2455      +18     

meta-codesync bot commented Jun 4, 2026

@saitcakmak has imported this pull request. If you are a Meta employee, you can view this in D107551763.

meta-codesync bot commented Jun 4, 2026

@saitcakmak merged this pull request in 873933b.

Labels: CLA Signed, Merged