
[PerfTest] Fix shutdown hooks: use wait=false with readyz polling (#2795)#2796

Open
daviddahl wants to merge 6 commits into open-telemetry:main from daviddahl:fix/perf-test-shutdown-hooks-2795

Conversation

@daviddahl
Contributor

Summary

Fixes #2795.

The df-loadgen-steps-docker*.yaml test step templates used wait=true on shutdown hooks, which blocks until all pipelines terminate. The admin server stays alive after pipeline shutdown by design (run_forever parks the main thread), so wait=true returns 504 Gateway Timeout when drain takes longer than timeout_secs.

Since send_http_request defaults to raise_for_status: true with no on_error configured, this 504 aborts the entire suite before the report step -- losing all metrics collected during the observation window.
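
For context, a minimal sketch of the failing hook as described above. The field names (send_http_request, method, timeout_secs) come from this PR; the surrounding step structure is an assumption, not copied from the templates:

```yaml
# Hypothetical reconstruction of the original (failing) shutdown hook.
# Only the fields named in this PR are real; the nesting is assumed.
- send_http_request:
    method: POST
    url: http://localhost:8085/shutdown?wait=true&timeout_secs={{ drain_timeout_secs | default(60) }}
    # raise_for_status defaults to true and no on_error is configured,
    # so the 504 returned when drain exceeds timeout_secs aborts the suite.
```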

Changes

Switch from blocking wait=true to non-blocking wait=false with active readyz polling across all three affected template files:

| File | Hooks fixed |
| --- | --- |
| df-loadgen-steps-docker.yaml | 3 (ports 8085, 8086, 8087) |
| df-loadgen-steps-docker-filtered.yaml | 3 (ports 8085, 8086, 8087) |
| df-loadgen-steps-docker-otel.yaml | 1 (port 8085) |

Pattern for each shutdown hook (see the sketch after this list):

  1. POST .../shutdown?wait=false -- returns 202 immediately, drain runs in background
  2. Poll /readyz every 5s until 503 (pipelines terminated) or endpoint stops responding
  3. on_error: continue: true so 404/connection errors (service already exited) are non-fatal
  4. Polling loop exits with a warning after 300s (container will be killed by destroy regardless)
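
A sketch of this pattern as it might appear in a step template. The send_http_request and run_command step names and their fields are taken from this thread; the exact nesting and the polling script are illustrative, not the actual template contents:

```yaml
# Hedged sketch: non-blocking shutdown request followed by a readyz poll.
- send_http_request:
    method: POST
    url: http://localhost:8085/shutdown?wait=false&timeout_secs={{ drain_timeout_secs | default(60) }}
    raise_for_status: false
    on_error:
      continue: true  # 404/connection errors (service already exited) are non-fatal
- run_command:
    command: |
      for i in $(seq 1 60); do
        code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8085/readyz || echo 000);
        if [ "$code" != "200" ]; then
          echo "Load generator stopped after $((i * 5))s";
          exit 0;
        fi;
        sleep 5;
      done;
      echo "WARNING: Load generator still running after 300s, proceeding anyway";
```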

Testing

Verified in downstream deployments using the orchestrator framework across a range of workloads. The polling approach provides clear log output ("Engine pipelines stopped after Ns") and never aborts the suite due to shutdown timing.

/cc @cijothomas

Copilot AI review requested due to automatic review settings April 30, 2026 19:45
@daviddahl requested a review from a team as a code owner on April 30, 2026 19:45
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the pipeline perf-test Docker step templates to avoid suite-aborting shutdown timeouts by switching shutdown hooks from blocking wait=true requests to non-blocking wait=false plus follow-up polling.

Changes:

  • Replace .../shutdown?wait=true&timeout_secs=... with .../shutdown?wait=false and add on_error: { continue: true } to prevent 504/404 from failing the suite.
  • Add polling loops intended to wait for shutdown progress (via /readyz or an endpoint probe) with a 300s cap and warning.
  • Apply the pattern across the “docker”, “docker-filtered”, and “docker-otel” template variants.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| tools/pipeline_perf_test/test_suites/integration/templates/test_steps/df-loadgen-steps-docker.yaml | Convert loadgen/engine/backend shutdown hooks to wait=false with polling to avoid blocking timeouts. |
| tools/pipeline_perf_test/test_suites/integration/templates/test_steps/df-loadgen-steps-docker-filtered.yaml | Same shutdown-hook changes as the docker variant, for the filtered suite. |
| tools/pipeline_perf_test/test_suites/integration/templates/test_steps/df-loadgen-steps-docker-otel.yaml | Update the load-generator shutdown hook to wait=false with polling in the OTel variant. |


@daviddahl marked this pull request as draft on April 30, 2026 20:09
daviddahl added 2 commits May 1, 2026 10:58
…en-telemetry#2795)

The df-loadgen-steps-docker*.yaml templates used wait=true on shutdown
hooks, which blocks until all pipelines terminate. The admin server
stays alive after pipeline shutdown by design (run_forever parks the
main thread), so wait=true returns 504 when drain takes longer than
timeout_secs. With raise_for_status=true (default) and no on_error
configured, this aborts the suite before the report step -- losing all
metrics collected during the observation window.

Fix: switch to wait=false (initiates drain, returns 202 immediately)
with a readyz polling loop that confirms termination without blocking.
Add on_error: continue: true so 404/connection errors are non-fatal.

Files changed:
- df-loadgen-steps-docker.yaml: 3 shutdown hooks (8085, 8086, 8087)
- df-loadgen-steps-docker-filtered.yaml: 3 shutdown hooks
- df-loadgen-steps-docker-otel.yaml: 1 shutdown hook (8085)

Fixes open-telemetry#2795

- Restore timeout_secs={{drain_timeout_secs | default(60)}} on all shutdown
  URLs: even with wait=false the server uses timeout_secs as the graceful
  drain deadline, so dropping it would silently override suite configuration
- Fix backend shutdown polling to use /readyz (consistent with load-gen and
  engine hooks) instead of /telemetry/metrics which is served process-wide
  and never returns non-200 after pipeline shutdown
- Fix elapsed time in polling log messages: loop counter i increments once
  per 5s sleep so elapsed seconds is i*5, not i
- Update log messages from 'stopped' to 'no longer ready' to accurately
  reflect that readyz=503 means 'not ready' (e.g. Draining) not necessarily
  'fully terminated'
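
To illustrate the /readyz vs /telemetry/metrics distinction from the second bullet, a hypothetical probe (the port matches the backend hook described earlier; the exact path in the templates may differ):

```sh
# Illustrative probe only. /readyz flips to 503 once pipelines are
# draining or terminated ("not ready"), while /telemetry/metrics is
# served process-wide and keeps returning 200 after pipeline shutdown.
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8087/readyz
```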
@daviddahl force-pushed the fix/perf-test-shutdown-hooks-2795 branch from b30fcbf to e6ecff8 on May 1, 2026 15:00
@codecov

codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.03%. Comparing base (f018901) to head (cf6d074).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2796   +/-   ##
=======================================
  Coverage   86.03%   86.03%           
=======================================
  Files         704      704           
  Lines      264591   264591           
=======================================
  Hits       227654   227654           
  Misses      36413    36413           
  Partials      524      524           
| Components | Coverage Δ |
| --- | --- |
| otap-dataflow | 86.98% <ø> (ø) |
| query_abstraction | 80.61% <ø> (ø) |
| query_engine | 90.76% <ø> (ø) |
| otel-arrow-go | 52.45% <ø> (ø) |
| quiver | 92.25% <ø> (ø) |

@daviddahl
Contributor Author

All feedback addressed in latest push (rebased on current main):

  1. timeout_secs restored on all URLs
  2. Backend polling uses /readyz (was /telemetry/metrics)
  3. Elapsed time: ${i}s → $((i * 5))s
  4. Log messages: "stopped" → "no longer ready"
  5. URLs updated to /api/v1/groups/ (post-rebase)

@cijothomas ready for re-review when you have a chance.

Conflict resolution during rebase dropped the method: POST field from
the send_http_request shutdown hooks. The pydantic validator requires
this field and rejected the config with ValidationError at load time.
@daviddahl
Copy link
Copy Markdown
Contributor Author

Fixed CI failure -- conflict resolution during rebase dropped method: POST from the send_http_request blocks. Restored in latest push.

fi;
sleep 5;
done;
echo "WARNING: Load generator still running after 300s, proceeding anyway";
Contributor


I'm probably missing something, but could you explain how this is different from setting raise_for_status: false and bumping the drain timeout from 60s to 300s?

I'm also wondering under which circumstances draining takes longer than a minute? That's a pretty long time to flush the queues when we know the df engine can process millions of records/sec 😜

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is on our side, where we are testing with Kafka, etc.

Contributor

@JakeDern May 1, 2026


Is setting raise_for_status: false and bumping the timeout a valid solution to this problem instead of writing this shell script in all the polling steps?

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that raise_for_status: false + a longer timeout_secs would also prevent the suite abort. The polling approach adds live log feedback and early exit, but it's more code.

The drain timing issue came up in downstream deployments with heavier exporter configurations where flush times regularly exceeded the 60s default. The other common failure was 404 from the API path rename when testing against historical commits.

Happy to simplify to raise_for_status: false if the team prefers -- the polling is nice-to-have for observability but not strictly necessary.

Drop the run_command readyz polling loops in favor of the simpler
raise_for_status: false + on_error: continue approach. The shutdown
hooks now stay as close to the original template as possible -- just
two fields added to make non-2xx responses non-fatal.
@daviddahl
Contributor Author

Simplified -- dropped the polling, just raise_for_status: false + on_error: { continue: true } on the existing hooks.
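
In sketch form, each hook now looks roughly like the following. The request shape mirrors the original template; whether the final version keeps wait=true or wait=false is not visible in this thread, and the post-rebase paths under /api/v1/groups/ are elided here:

```yaml
# Hedged sketch of the simplified hook: the original request plus the two
# added fields that make non-2xx responses and transport errors non-fatal.
- send_http_request:
    method: POST
    url: http://localhost:8085/shutdown?wait=true&timeout_secs={{ drain_timeout_secs | default(60) }}
    raise_for_status: false  # a 504 from a slow drain no longer raises
    on_error:
      continue: true  # 404/connection refused (already exited) is non-fatal
```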

@daviddahl marked this pull request as ready for review on May 1, 2026 18:46
Contributor

@JakeDern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @daviddahl!



Development

Successfully merging this pull request may close these issues.

[PerfTest] Shutdown hooks with wait=true cause fatal suite abort when pipelines don't drain in time

4 participants