[PerfTest] Fix shutdown hooks: use wait=false with readyz polling (#2795) #2796
daviddahl wants to merge 6 commits into open-telemetry:main
Conversation
Pull request overview
This PR updates the pipeline perf-test Docker step templates to avoid suite-aborting shutdown timeouts by switching shutdown hooks from blocking wait=true requests to non-blocking wait=false plus follow-up polling.
Changes:
- Replace `.../shutdown?wait=true&timeout_secs=...` with `.../shutdown?wait=false` and add `on_error: { continue: true }` to prevent 504/404 responses from failing the suite.
- Add polling loops intended to wait for shutdown progress (via `/readyz` or an endpoint probe), with a 300s cap and a warning.
- Apply the pattern across the "docker", "docker-filtered", and "docker-otel" template variants.
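A minimal sketch of the hook shape this review describes. The `send_http_request` step name, `method: POST`, and `on_error: continue: true` are quoted from the PR; the surrounding keys (`hooks`, `url`, host/port layout) are assumptions for illustration only:

```yaml
# Sketch only -- field names beyond those quoted in the PR are assumed.
hooks:
  run:
    post:
      - send_http_request:
          url: http://localhost:8085/shutdown?wait=false  # returns 202; drain runs in background
          method: POST          # required by the pydantic validator
          on_error:
            continue: true      # 504/404 must not abort the suite
```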
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| tools/pipeline_perf_test/test_suites/integration/templates/test_steps/df-loadgen-steps-docker.yaml | Convert loadgen/engine/backend shutdown hooks to wait=false with polling to avoid blocking timeouts. |
| tools/pipeline_perf_test/test_suites/integration/templates/test_steps/df-loadgen-steps-docker-filtered.yaml | Same shutdown-hook changes as docker variant for the filtered suite. |
| tools/pipeline_perf_test/test_suites/integration/templates/test_steps/df-loadgen-steps-docker-otel.yaml | Update load-generator shutdown hook to wait=false with polling in the OTel variant. |
…en-telemetry#2795)

The `df-loadgen-steps-docker*.yaml` templates used `wait=true` on shutdown hooks, which blocks until all pipelines terminate. The admin server stays alive after pipeline shutdown by design (`run_forever` parks the main thread), so `wait=true` returns 504 when drain takes longer than `timeout_secs`. With `raise_for_status=true` (the default) and no `on_error` configured, this aborts the suite before the report step -- losing all metrics collected during the observation window.

Fix: switch to `wait=false` (initiates drain, returns 202 immediately) with a readyz polling loop that confirms termination without blocking. Add `on_error: continue: true` so 404/connection errors are non-fatal.

Files changed:
- df-loadgen-steps-docker.yaml: 3 shutdown hooks (8085, 8086, 8087)
- df-loadgen-steps-docker-filtered.yaml: 3 shutdown hooks
- df-loadgen-steps-docker-otel.yaml: 1 shutdown hook (8085)

Fixes open-telemetry#2795
- Restore timeout_secs={{drain_timeout_secs | default(60)}} on all shutdown
URLs: even with wait=false the server uses timeout_secs as the graceful
drain deadline, so dropping it would silently override suite configuration
- Fix backend shutdown polling to use /readyz (consistent with load-gen and
engine hooks) instead of /telemetry/metrics which is served process-wide
and never returns non-200 after pipeline shutdown
- Fix elapsed time in polling log messages: loop counter i increments once
per 5s sleep so elapsed seconds is i*5, not i
- Update log messages from 'stopped' to 'no longer ready' to accurately
reflect that readyz=503 means 'not ready' (e.g. Draining) not necessarily
'fully terminated'
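The readyz polling described in these commit notes could look roughly like the following `run_command` step. The exact step schema is an assumption for illustration; the loop shows the 300s cap and the `i*5` elapsed-time fix:

```yaml
# Hypothetical run_command step; schema assumed for illustration.
- run_command:
    command: |
      for i in $(seq 1 60); do    # 60 iterations x 5s sleep = 300s cap
        code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8085/readyz || echo 000);
        if [ "$code" != "200" ]; then
          echo "Engine no longer ready after $((i * 5))s";  # elapsed = i*5, not i
          exit 0;
        fi;
        sleep 5;
      done;
      echo "WARNING: still ready after 300s, proceeding anyway"
```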
Force-pushed b30fcbf to e6ecff8
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main    #2796   +/-   ##
=======================================
  Coverage   86.03%   86.03%
=======================================
  Files         704      704
  Lines      264591   264591
=======================================
  Hits       227654   227654
  Misses      36413    36413
  Partials      524      524
```
All feedback addressed in latest push (rebased on current main).

@cijothomas ready for re-review when you have a chance.
Conflict resolution during rebase dropped the `method: POST` field from the `send_http_request` shutdown hooks. The pydantic validator requires this field and rejected the config with a `ValidationError` at load time.

Fixed CI failure -- conflict resolution during rebase dropped the `method: POST` field.
```shell
fi;
sleep 5;
done;
echo "WARNING: Load generator still running after 300s, proceeding anyway";
```
I'm probably missing something, but could you explain how this is different from setting `raise_for_status: false` and bumping the drain timeout from 60s to 300s?
I'm also wondering under which circumstances is draining taking longer than a minute? That's a pretty long time to flush the queues when we know the df engine can process millions of records/sec 😜
I think the problem is on our side where we are testing with Kafka, etc
Is setting `raise_for_status: false` and bumping the timeout a valid solution to this problem, instead of writing this shell script in all the polling steps?
You're right that `raise_for_status: false` + a longer `timeout_secs` would also prevent the suite abort. The polling approach adds live log feedback and early exit, but it's more code.
The drain timing issue came up in downstream deployments with heavier exporter configurations, where flush times regularly exceeded the 60s default. The other common failure was a 404 from the API path rename when testing against historical commits.
Happy to simplify to `raise_for_status: false` if the team prefers -- the polling is nice-to-have for observability but not strictly necessary.
Drop the `run_command` readyz polling loops in favor of the simpler `raise_for_status: false` + `on_error: continue` approach. The shutdown hooks now stay as close to the original template as possible -- just two fields added to make non-2xx responses non-fatal.
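Per this commit message, the final hook keeps the original template shape with only two fields added. A sketch -- the `timeout_secs` Jinja default is taken from the earlier review notes, and the remaining keys are assumptions:

```yaml
# Sketch of the simplified final shape; only the two marked fields are new.
- send_http_request:
    url: http://localhost:8085/shutdown?wait=true&timeout_secs={{ drain_timeout_secs | default(60) }}
    method: POST
    raise_for_status: false   # new: non-2xx (e.g. 504 on slow drain) no longer fails the step
    on_error:
      continue: true          # new: 404/connection errors are non-fatal too
```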
Simplified -- dropped the polling; just `raise_for_status: false` + `on_error: continue` now.
JakeDern
left a comment
LGTM, thanks @daviddahl!
Summary
Fixes #2795.
The `df-loadgen-steps-docker*.yaml` test step templates used `wait=true` on shutdown hooks, which blocks until all pipelines terminate. The admin server stays alive after pipeline shutdown by design (`run_forever` parks the main thread), so `wait=true` returns 504 Gateway Timeout when drain takes longer than `timeout_secs`.

Since `send_http_request` defaults to `raise_for_status: true` with no `on_error` configured, this 504 aborts the entire suite before the report step -- losing all metrics collected during the observation window.

Changes
Switch from blocking `wait=true` to non-blocking `wait=false` with active readyz polling across all three affected template files:
- df-loadgen-steps-docker.yaml
- df-loadgen-steps-docker-filtered.yaml
- df-loadgen-steps-docker-otel.yaml

Pattern for each shutdown hook:
- `POST .../shutdown?wait=false` -- returns 202 immediately; drain runs in the background
- Poll `/readyz` every 5s until 503 (pipelines terminated) or the endpoint stops responding
- `on_error: continue: true` so 404/connection errors (service already exited) are non-fatal
- No suite abort on slow shutdown (containers are cleaned up by `destroy` regardless)

Testing
Verified in downstream deployments using the orchestrator framework across a range of workloads. The polling approach provides clear log output ("Engine pipelines stopped after Ns") and never aborts the suite due to shutdown timing.
/cc @cijothomas