
Add live pipeline reconfiguration and shutdown control to the admin API #2618

Merged
lquerel merged 54 commits into open-telemetry:main from lquerel:live-reconfig on Apr 24, 2026


Conversation

lquerel (Contributor) commented on Apr 9, 2026

Change Summary

This PR adds live pipeline lifecycle control to the OTAP Dataflow Engine admin API.

The controller now runs as a resident runtime manager and supports in-memory pipeline operations on a running engine:

  • create a new pipeline in an existing group
  • replace an existing pipeline with a health-gated serial rolling cutover
  • resize a pipeline when only the effective core allocation changes
  • detect identical updates and return a successful noop result
  • shut down an individual logical pipeline and track shutdown progress

The PR also adds rollout and shutdown status resources, extends runtime and observed-state tracking with deployment generations so overlapping instances remain distinguishable, updates the Rust admin SDK with typed live-control methods/outcomes, and documents the feature.

Design Decisions

This PR keeps live reconfiguration scoped to one logical pipeline at a time, keyed by (group, pipeline). Reconfiguration mutations go through one declarative endpoint, PUT /api/v1/groups/{group}/pipelines/{id}, while pipeline shutdown remains a separate operation through POST /api/v1/groups/{group}/pipelines/{id}/shutdown.
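For concreteness, here is a minimal sketch of how a client might drive those two endpoints, assuming a local admin address and using the reqwest crate (with its json feature); the request body is a placeholder, not the engine's actual pipeline schema:

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let base = "http://127.0.0.1:8080"; // hypothetical admin address

    // Declarative reconfiguration: the controller decides create/replace/
    // resize/noop from the submitted config; the body here is a placeholder.
    let resp = client
        .put(format!("{base}/api/v1/groups/default/pipelines/traces"))
        .json(&json!({ "nodes": {}, "edges": [] }))
        .send()?;
    println!("reconfigure: {}", resp.status());

    // Shutdown stays a separate, explicit operation.
    let resp = client
        .post(format!("{base}/api/v1/groups/default/pipelines/traces/shutdown"))
        .send()?;
    println!("shutdown: {}", resp.status());
    Ok(())
}
```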

The controller classifies each reconfiguration request as create, noop, resize, or replace instead of exposing separate start, scale, and update APIs.
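In sketch form, the classification looks like the following; the types and fields are hypothetical stand-ins for the controller's real config model:

```rust
/// Hypothetical stand-in for the controller's pipeline config model.
#[derive(PartialEq)]
struct PipelineConfig {
    topology_hash: u64, // stands in for nodes/edges plus per-node settings
    cores: usize,       // effective core allocation
}

enum ReconfigureAction {
    Create,  // no pipeline currently exists under (group, pipeline)
    Noop,    // submitted config is identical to the running one
    Resize,  // only the effective core allocation changed
    Replace, // topology or node configuration changed
}

fn classify(current: Option<&PipelineConfig>, submitted: &PipelineConfig) -> ReconfigureAction {
    match current {
        None => ReconfigureAction::Create,
        Some(cur) if cur == submitted => ReconfigureAction::Noop,
        Some(cur) if cur.topology_hash == submitted.topology_hash => ReconfigureAction::Resize,
        Some(_) => ReconfigureAction::Replace,
    }
}
```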

Topology and node-configuration changes use a serial rolling cutover with overlap: start the new instance on one core, wait for Admitted and Ready, then drain the old instance on that core. This preserves continuity while fitting the existing runtime model, without requiring a full second serving fleet or a separate traffic-switching layer.
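The control flow reduces to a per-core loop. The sketch below illustrates the stated sequence only; the handle type and callbacks are hypothetical:

```rust
/// Placeholder for a per-core runtime instance handle.
struct InstanceHandle;

fn serial_rolling_cutover(
    cores: &[usize],
    start_new: &mut dyn FnMut(usize) -> Result<InstanceHandle, String>,
    wait_admitted_then_ready: &dyn Fn(&InstanceHandle) -> Result<(), String>,
    drain_old: &mut dyn FnMut(usize) -> Result<(), String>,
) -> Result<(), String> {
    for &core in cores {
        // Overlap window: the old and new instances briefly coexist on this core.
        let new_instance = start_new(core)?;
        wait_admitted_then_ready(&new_instance)?;
        // Only after the new instance is serving do we drain the old one,
        // preserving continuity without a second full serving fleet.
        drain_old(core)?;
    }
    Ok(())
}
```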

Pure core-allocation changes use a dedicated internal resize path so scale up/down only starts or stops the delta cores and leaves unchanged cores running.
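The delta itself is simple set arithmetic; a sketch with illustrative names:

```rust
use std::collections::BTreeSet;

/// Returns (cores to start, cores to stop); unchanged cores are untouched.
fn resize_delta(
    current: &BTreeSet<usize>,
    desired: &BTreeSet<usize>,
) -> (Vec<usize>, Vec<usize>) {
    let to_start = desired.difference(current).copied().collect();
    let to_stop = current.difference(desired).copied().collect();
    (to_start, to_stop)
}
```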

Runtime mutations remain intentionally narrow. This PR keeps changes in memory only and rejects updates that would require topic-broker reconfiguration or broader engine/group-level policy mutation. Runtime config persistence remains out of scope.

Operations are explicit and observable through rollout and shutdown ids. Terminal operation history is intentionally bounded in memory, so old rollout/shutdown ids may eventually return 404 or Ok(None) from the SDK after retention pruning.
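A client polling loop therefore has to treat a missing id as a possible post-pruning success rather than an error. A hypothetical sketch, where the status type and lookup function are stand-ins and not the SDK's actual API:

```rust
use std::{thread, time::Duration};

enum RolloutStatus {
    InProgress,
    Succeeded,
    Failed(String),
}

/// `status_of` stands in for an SDK/status-endpoint lookup that returns
/// Ok(None) once a terminal operation id has been pruned from history.
fn wait_for_rollout(
    status_of: &dyn Fn(&str) -> Result<Option<RolloutStatus>, String>,
    rollout_id: &str,
) -> Result<(), String> {
    loop {
        match status_of(rollout_id)? {
            Some(RolloutStatus::Succeeded) => return Ok(()),
            Some(RolloutStatus::Failed(reason)) => return Err(reason),
            Some(RolloutStatus::InProgress) => thread::sleep(Duration::from_millis(250)),
            // Pruned from the bounded retention window: the operation reached
            // a terminal state long enough ago that history was discarded.
            None => return Ok(()),
        }
    }
}
```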

Observed status is deployment-generation-aware. During overlapping rollouts, /status.instances preserves old and new runtime instances, while aggregate readiness/liveness still uses the selected serving generation per core. After controller work completes, superseded observed instances are compacted so status memory does not grow unbounded across rollouts.
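A sketch of the aggregate-readiness rule this implies, with illustrative map shapes (the real observed-state model is richer):

```rust
use std::collections::HashMap;

/// Aggregate readiness counts only the selected serving generation per core,
/// even while /status.instances also lists overlapping old/new instances.
fn aggregate_ready(
    serving_generation: &HashMap<usize, u64>,     // core -> selected generation
    instance_ready: &HashMap<(usize, u64), bool>, // (core, generation) -> ready
) -> bool {
    serving_generation
        .iter()
        .all(|(core, generation)| *instance_ready.get(&(*core, *generation)).unwrap_or(&false))
}
```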

The controller also hardens lifecycle cleanup around live operations: worker panics are converted into terminal rollout/shutdown failure states, runtime exits are reported through supervised bookkeeping, and shutdown drain completion waits for upstream closure rather than queue emptiness alone.
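The panic-to-terminal-state conversion can be pictured as supervised join handling; a simplified sketch, with RolloutState as a hypothetical stand-in for the controller's bookkeeping:

```rust
enum RolloutState {
    Succeeded,
    Failed(String),
}

/// Joins a rollout worker and converts a panic into a terminal failure
/// state instead of leaving the operation dangling.
fn supervise_worker(work: impl FnOnce() + Send + 'static) -> RolloutState {
    let handle = std::thread::spawn(work);
    match handle.join() {
        Ok(()) => RolloutState::Succeeded,
        // join() returns Err exactly when the worker panicked; record it
        // so the rollout/shutdown id reports a definite outcome.
        Err(_) => RolloutState::Failed("worker panicked".into()),
    }
}
```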

What issue does this PR close?

How are these changes tested?

Commands run:

  • cargo xtask check

Automated coverage includes:

  • rollout planning for create, replace, resize, scale up/down, noop, topic-mutation rejection, and conflict handling
  • rollout execution success/failure paths, rollback behavior, worker panic cleanup, and bounded operation retention
  • per-pipeline shutdown planning, progress tracking, conflict handling, timeout behavior, and worker panic cleanup
  • observed-state generation selection, overlap-aware /status.instances, compaction of superseded generations, and shutdown-terminal status behavior
  • admin HTTP handlers for reconfigure, rollout status, shutdown, shutdown status, operation errors, wait timeouts, and missing retained operation ids
  • Rust admin SDK decoding for typed reconfigure/shutdown outcomes, operation rejection errors, pipeline details/status, and rollout/shutdown polling
  • engine shutdown-drain behavior where downstream nodes wait for upstream channel closure before completing shutdown drain
  • route/doc compatibility checks through the full workspace validation suite

Manual validation covered:

  • creating a new pipeline through the admin API
  • replacing a pipeline with a topology/config change
  • resizing a pipeline by changing only core allocation
  • submitting an identical config and observing a noop rollout
  • shutting down a pipeline and observing shutdown tracking/status
  • checking generation-aware status during overlapping rollout behavior

Are there any user-facing changes?

Yes.

  • The admin API now supports live pipeline create/update/resize/noop via PUT /api/v1/groups/{group}/pipelines/{id}.
  • Pipeline shutdown is available via POST /api/v1/groups/{group}/pipelines/{id}/shutdown.
  • Rollout and shutdown progress can be queried through dedicated status endpoints.
  • New pipeline-scoped admin routes are under /api/v1/groups/....
  • Pipeline status payloads now include deployment-generation-aware instance data and rollout metadata while preserving the existing aggregate status shape.
  • Terminal rollout and shutdown ids are retained only within a bounded in-memory window.
  • The Rust admin SDK now exposes typed clients, request options, operation outcomes, and error models for live reconfiguration and shutdown.
  • Operator documentation was added in rust/otap-dataflow/docs/admin/live-reconfiguration.md.
  • Support for the existing WebSocket log stream endpoint /api/v1/telemetry/logs/stream is preserved/restored in this branch.

github-actions bot added the rust label (Pull requests that update Rust code) on Apr 9, 2026
lquerel changed the title from "Add live pipeline reconfiguration and shutdown control to the admin API" to "[WIP] Add live pipeline reconfiguration and shutdown control to the admin API" on Apr 9, 2026
codecov bot commented on Apr 9, 2026

Codecov Report

❌ Patch coverage is 70.44681% with 1389 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.06%. Comparing base (da7688b) to head (44015e8).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2618      +/-   ##
==========================================
- Coverage   88.22%   88.06%   -0.17%     
==========================================
  Files         639      644       +5     
  Lines      242572   246795    +4223     
==========================================
+ Hits       214018   217334    +3316     
- Misses      28030    28937     +907     
  Partials      524      524              
Components Coverage Δ
otap-dataflow 89.64% <70.44%> (-0.25%) ⬇️
query_abstraction 80.61% <ø> (ø)
query_engine 90.75% <ø> (ø)
otel-arrow-go 52.45% <ø> (ø)
quiver 92.25% <ø> (ø)

lquerel (Contributor, Author) commented on Apr 23, 2026

@lalitb Thanks for all this very valuable feedback.

lalitb (Member) left a comment


LGTM. I left one lifecycle cleanup question around partial resize/replace launch failures, but otherwise the engine/control-plane changes look thoughtfully designed and well covered. Looking forward to using live pipeline reconfiguration :)

*core_id,
active_generation,
)
.map_err(|err| RolloutExecutionError::Failed(err.to_string()))?;
lalitb (Member) commented:

Shouldn't this go through rollback_resize_rollout instead of returning Failed directly? If an earlier core already started and became ready, it'll be left running when this fails.

Same pattern is there in run_replace_rollout too.

lquerel (Contributor, Author) replied:

Good catch. I updated both run_resize_rollout and run_replace_rollout so a launch failure after earlier rollout progress now goes through the existing rollback helpers instead of returning Failed directly. That means already started cores are cleaned up rather than being left running. I also added regression tests covering resize rollback cleanup and replace rollback cleanup for previously activated added cores.

Fixed in commit 791ba2b.
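A sketch of the resulting rollback-on-failure shape, where `launch` and `rollback` are hypothetical stand-ins for the existing per-core launch and rollback helpers mentioned above:

```rust
fn run_rollout_cores(
    cores: &[usize],
    launch: &mut dyn FnMut(usize) -> Result<(), String>,
    rollback: &mut dyn FnMut(&[usize]),
) -> Result<(), String> {
    let mut activated: Vec<usize> = Vec::new();
    for &core in cores {
        if let Err(err) = launch(core) {
            // Unwind cores that already started and became ready, instead of
            // returning Failed directly and leaving them running.
            rollback(&activated);
            return Err(err);
        }
        activated.push(core);
    }
    Ok(())
}
```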

lquerel added 2 commits April 23, 2026 10:51
lquerel enabled auto-merge on April 23, 2026 at 18:14
lquerel added this pull request to the merge queue on Apr 24, 2026
Merged via the queue into open-telemetry:main with commit 5507512 on Apr 24, 2026 (83 of 84 checks passed)
lquerel deleted the live-reconfig branch on April 24, 2026 at 06:24