
[PerfTest] Fix shutdown hooks: use wait=false with readyz polling (#2795)#2796

Open
daviddahl wants to merge 6 commits into open-telemetry:main from daviddahl:fix/perf-test-shutdown-hooks-2795

Conversation

@daviddahl
Contributor

Summary

Fixes #2795.

The df-loadgen-steps-docker*.yaml test step templates used wait=true on shutdown hooks, which blocks until all pipelines terminate. The admin server stays alive after pipeline shutdown by design (run_forever parks the main thread), so wait=true returns 504 Gateway Timeout when drain takes longer than timeout_secs.

Since send_http_request defaults to raise_for_status: true with no on_error configured, this 504 aborts the entire suite before the report step -- losing all metrics collected during the observation window.
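
For context, a minimal sketch of the failing hook as described above. The field names (send_http_request, method, timeout_secs) come from this PR; the surrounding step structure is an assumption, not copied from the templates:

```yaml
# Hypothetical reconstruction of the original (failing) shutdown hook.
# Only the fields named in this PR are real; the nesting is assumed.
- send_http_request:
    method: POST
    url: http://localhost:8085/shutdown?wait=true&timeout_secs={{ drain_timeout_secs | default(60) }}
    # raise_for_status defaults to true and no on_error is configured,
    # so the 504 returned when drain exceeds timeout_secs aborts the suite.
```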

Changes

Switch from blocking wait=true to non-blocking wait=false with active readyz polling across all three affected template files:

| File | Hooks fixed |
| --- | --- |
| df-loadgen-steps-docker.yaml | 3 (ports 8085, 8086, 8087) |
| df-loadgen-steps-docker-filtered.yaml | 3 (ports 8085, 8086, 8087) |
| df-loadgen-steps-docker-otel.yaml | 1 (port 8085) |

Pattern for each shutdown hook (see the sketch after this list):

  1. POST .../shutdown?wait=false -- returns 202 immediately, drain runs in background
  2. Poll /readyz every 5s until 503 (pipelines terminated) or endpoint stops responding
  3. on_error: continue: true so 404/connection errors (service already exited) are non-fatal
  4. Polling loop exits with a warning after 300s (container will be killed by destroy regardless)
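
A sketch of this pattern as it might appear in a step template. The send_http_request and run_command step names and their fields are taken from this thread; the exact nesting and the polling script are illustrative, not the actual template contents:

```yaml
# Hedged sketch: non-blocking shutdown request followed by a readyz poll.
- send_http_request:
    method: POST
    url: http://localhost:8085/shutdown?wait=false&timeout_secs={{ drain_timeout_secs | default(60) }}
    raise_for_status: false
    on_error:
      continue: true  # 404/connection errors (service already exited) are non-fatal
- run_command:
    command: |
      for i in $(seq 1 60); do
        code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8085/readyz || echo 000);
        if [ "$code" != "200" ]; then
          echo "Load generator stopped after $((i * 5))s";
          exit 0;
        fi;
        sleep 5;
      done;
      echo "WARNING: Load generator still running after 300s, proceeding anyway";
```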

Testing

Verified in downstream deployments using the orchestrator framework across a range of workloads. The polling approach provides clear log output ("Engine pipelines stopped after Ns") and never aborts the suite due to shutdown timing.

/cc @cijothomas

Copilot AI review requested due to automatic review settings April 30, 2026 19:45
@daviddahl requested a review from a team as a code owner on April 30, 2026 19:45
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the pipeline perf-test Docker step templates to avoid suite-aborting shutdown timeouts by switching shutdown hooks from blocking wait=true requests to non-blocking wait=false plus follow-up polling.

Changes:

  • Replace .../shutdown?wait=true&timeout_secs=... with .../shutdown?wait=false and add on_error: { continue: true } to prevent 504/404 from failing the suite.
  • Add polling loops intended to wait for shutdown progress (via /readyz or an endpoint probe) with a 300s cap and warning.
  • Apply the pattern across the “docker”, “docker-filtered”, and “docker-otel” template variants.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| tools/pipeline_perf_test/test_suites/integration/templates/test_steps/df-loadgen-steps-docker.yaml | Convert loadgen/engine/backend shutdown hooks to wait=false with polling to avoid blocking timeouts. |
| tools/pipeline_perf_test/test_suites/integration/templates/test_steps/df-loadgen-steps-docker-filtered.yaml | Same shutdown-hook changes as the docker variant, for the filtered suite. |
| tools/pipeline_perf_test/test_suites/integration/templates/test_steps/df-loadgen-steps-docker-otel.yaml | Update the load-generator shutdown hook to wait=false with polling in the OTel variant. |


@daviddahl marked this pull request as draft on April 30, 2026 20:09
daviddahl added 2 commits May 1, 2026 10:58
…en-telemetry#2795)

The df-loadgen-steps-docker*.yaml templates used wait=true on shutdown
hooks, which blocks until all pipelines terminate. The admin server
stays alive after pipeline shutdown by design (run_forever parks the
main thread), so wait=true returns 504 when drain takes longer than
timeout_secs. With raise_for_status=true (default) and no on_error
configured, this aborts the suite before the report step -- losing all
metrics collected during the observation window.

Fix: switch to wait=false (initiates drain, returns 202 immediately)
with a readyz polling loop that confirms termination without blocking.
Add on_error: continue: true so 404/connection errors are non-fatal.

Files changed:
- df-loadgen-steps-docker.yaml: 3 shutdown hooks (8085, 8086, 8087)
- df-loadgen-steps-docker-filtered.yaml: 3 shutdown hooks
- df-loadgen-steps-docker-otel.yaml: 1 shutdown hook (8085)

Fixes open-telemetry#2795

- Restore timeout_secs={{drain_timeout_secs | default(60)}} on all shutdown
  URLs: even with wait=false the server uses timeout_secs as the graceful
  drain deadline, so dropping it would silently override suite configuration
- Fix backend shutdown polling to use /readyz (consistent with load-gen and
  engine hooks) instead of /telemetry/metrics which is served process-wide
  and never returns non-200 after pipeline shutdown
- Fix elapsed time in polling log messages: loop counter i increments once
  per 5s sleep so elapsed seconds is i*5, not i
- Update log messages from 'stopped' to 'no longer ready' to accurately
  reflect that readyz=503 means 'not ready' (e.g. Draining) not necessarily
  'fully terminated'
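
To illustrate the /readyz vs /telemetry/metrics distinction from the second bullet, a hypothetical probe (the port matches the backend hook described earlier; the exact path in the templates may differ):

```sh
# Illustrative probe only. /readyz flips to 503 once pipelines are
# draining or terminated ("not ready"), while /telemetry/metrics is
# served process-wide and keeps returning 200 after pipeline shutdown.
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8087/readyz
```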
@daviddahl force-pushed the fix/perf-test-shutdown-hooks-2795 branch from b30fcbf to e6ecff8 on May 1, 2026 15:00
@codecov

codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.03%. Comparing base (f018901) to head (cf6d074).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2796   +/-   ##
=======================================
  Coverage   86.03%   86.03%           
=======================================
  Files         704      704           
  Lines      264591   264591           
=======================================
  Hits       227654   227654           
  Misses      36413    36413           
  Partials      524      524           
| Components | Coverage Δ |
| --- | --- |
| otap-dataflow | 86.98% <ø> (ø) |
| query_abstraction | 80.61% <ø> (ø) |
| query_engine | 90.76% <ø> (ø) |
| otel-arrow-go | 52.45% <ø> (ø) |
| quiver | 92.25% <ø> (ø) |

@daviddahl
Contributor Author

All feedback addressed in latest push (rebased on current main):

  1. timeout_secs restored on all URLs
  2. Backend polling uses /readyz (was /telemetry/metrics)
  3. Elapsed time: ${i}s → $((i * 5))s
  4. Log messages: "stopped" → "no longer ready"
  5. URLs updated to /api/v1/groups/ (post-rebase)

@cijothomas ready for re-review when you have a chance.

Conflict resolution during rebase dropped the method: POST field from
the send_http_request shutdown hooks. The pydantic validator requires
this field and rejected the config with ValidationError at load time.
@daviddahl
Copy link
Copy Markdown
Contributor Author

Fixed CI failure -- conflict resolution during rebase dropped method: POST from the send_http_request blocks. Restored in latest push.

fi;
sleep 5;
done;
echo "WARNING: Load generator still running after 300s, proceeding anyway";
Contributor


I'm probably missing something, but could you explain how this is different from setting raise_for_status: false and bumping the drain timeout from 60s to 300s?

I'm also wondering under which circumstances draining takes longer than a minute? That's a pretty long time to flush the queues when we know the df engine can process millions of records/sec 😜

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is on our side, where we are testing with Kafka, etc.

Contributor

@JakeDern May 1, 2026


Is setting raise_for_status: false and bumping the timeout a valid solution to this problem instead of writing this shell script in all the polling steps?

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that raise_for_status: false + a longer timeout_secs would also prevent the suite abort. The polling approach adds live log feedback and early exit, but it's more code.

The drain timing issue came up in downstream deployments with heavier exporter configurations where flush times regularly exceeded the 60s default. The other common failure was 404 from the API path rename when testing against historical commits.

Happy to simplify to raise_for_status: false if the team prefers -- the polling is nice-to-have for observability but not strictly necessary.

Drop the run_command readyz polling loops in favor of the simpler
raise_for_status: false + on_error: continue approach. The shutdown
hooks now stay as close to the original template as possible -- just
two fields added to make non-2xx responses non-fatal.
@daviddahl
Contributor Author

Simplified -- dropped the polling, just raise_for_status: false + on_error: { continue: true } on the existing hooks.
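
In sketch form, each hook now looks roughly like the following. The request shape mirrors the original template; whether the final version keeps wait=true or wait=false is not visible in this thread, and the post-rebase paths under /api/v1/groups/ are elided here:

```yaml
# Hedged sketch of the simplified hook: the original request plus the two
# added fields that make non-2xx responses and transport errors non-fatal.
- send_http_request:
    method: POST
    url: http://localhost:8085/shutdown?wait=true&timeout_secs={{ drain_timeout_secs | default(60) }}
    raise_for_status: false  # a 504 from a slow drain no longer raises
    on_error:
      continue: true  # 404/connection refused (already exited) is non-fatal
```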

@daviddahl marked this pull request as ready for review on May 1, 2026 18:46
Contributor

@JakeDern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @daviddahl!



Development

Successfully merging this pull request may close these issues.

[PerfTest] Shutdown hooks with wait=true cause fatal suite abort when pipelines don't drain in time

4 participants