
Add live pipeline reconfiguration and shutdown control to the admin API #2618

Merged
lquerel merged 54 commits into open-telemetry:main from lquerel:live-reconfig on Apr 24, 2026


Conversation

lquerel (Contributor) commented on Apr 9, 2026

Change Summary

This PR adds live pipeline lifecycle control to the OTAP Dataflow Engine admin API.

The controller now runs as a resident runtime manager and supports in-memory pipeline operations on a running engine:

  • create a new pipeline in an existing group
  • replace an existing pipeline with a health-gated serial rolling cutover
  • resize a pipeline when only the effective core allocation changes
  • detect identical updates and return a successful noop result
  • shut down an individual logical pipeline and track shutdown progress

The PR also adds rollout and shutdown status resources, extends runtime and observed-state tracking with deployment generations so overlapping instances remain distinguishable, updates the Rust admin SDK with typed live-control methods/outcomes, and documents the feature.

Design Decisions

This PR keeps live reconfiguration scoped to one logical pipeline at a time, keyed by (group, pipeline). Reconfiguration mutations go through one declarative endpoint, PUT /api/v1/groups/{group}/pipelines/{id}, while pipeline shutdown remains a separate operation through POST /api/v1/groups/{group}/pipelines/{id}/shutdown.
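For concreteness, here is a minimal sketch of how a client might drive those two endpoints, assuming a local admin address and using the reqwest crate (with its json feature); the request body is a placeholder, not the engine's actual pipeline schema:

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let base = "http://127.0.0.1:8080"; // hypothetical admin address

    // Declarative reconfiguration: the controller decides create/replace/
    // resize/noop from the submitted config; the body here is a placeholder.
    let resp = client
        .put(format!("{base}/api/v1/groups/default/pipelines/traces"))
        .json(&json!({ "nodes": {}, "edges": [] }))
        .send()?;
    println!("reconfigure: {}", resp.status());

    // Shutdown stays a separate, explicit operation.
    let resp = client
        .post(format!("{base}/api/v1/groups/default/pipelines/traces/shutdown"))
        .send()?;
    println!("shutdown: {}", resp.status());
    Ok(())
}
```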

The controller classifies each reconfiguration request as create, noop, resize, or replace instead of exposing separate start, scale, and update APIs.
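In sketch form, the classification looks like the following; the types and fields are hypothetical stand-ins for the controller's real config model:

```rust
/// Hypothetical stand-in for the controller's pipeline config model.
#[derive(PartialEq)]
struct PipelineConfig {
    topology_hash: u64, // stands in for nodes/edges plus per-node settings
    cores: usize,       // effective core allocation
}

enum ReconfigureAction {
    Create,  // no pipeline currently exists under (group, pipeline)
    Noop,    // submitted config is identical to the running one
    Resize,  // only the effective core allocation changed
    Replace, // topology or node configuration changed
}

fn classify(current: Option<&PipelineConfig>, submitted: &PipelineConfig) -> ReconfigureAction {
    match current {
        None => ReconfigureAction::Create,
        Some(cur) if cur == submitted => ReconfigureAction::Noop,
        Some(cur) if cur.topology_hash == submitted.topology_hash => ReconfigureAction::Resize,
        Some(_) => ReconfigureAction::Replace,
    }
}
```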

Topology and node-configuration changes use a serial rolling cutover with overlap: start the new instance on one core, wait for Admitted and Ready, then drain the old instance on that core. This preserves continuity while fitting the existing runtime model, without requiring a full second serving fleet or a separate traffic-switching layer.
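The control flow reduces to a per-core loop. The sketch below illustrates the stated sequence only; the handle type and callbacks are hypothetical:

```rust
/// Placeholder for a per-core runtime instance handle.
struct InstanceHandle;

fn serial_rolling_cutover(
    cores: &[usize],
    start_new: &mut dyn FnMut(usize) -> Result<InstanceHandle, String>,
    wait_admitted_then_ready: &dyn Fn(&InstanceHandle) -> Result<(), String>,
    drain_old: &mut dyn FnMut(usize) -> Result<(), String>,
) -> Result<(), String> {
    for &core in cores {
        // Overlap window: the old and new instances briefly coexist on this core.
        let new_instance = start_new(core)?;
        wait_admitted_then_ready(&new_instance)?;
        // Only after the new instance is serving do we drain the old one,
        // preserving continuity without a second full serving fleet.
        drain_old(core)?;
    }
    Ok(())
}
```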

Pure core-allocation changes use a dedicated internal resize path so scale up/down only starts or stops the delta cores and leaves unchanged cores running.
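The delta itself is simple set arithmetic; a sketch with illustrative names:

```rust
use std::collections::BTreeSet;

/// Returns (cores to start, cores to stop); unchanged cores are untouched.
fn resize_delta(
    current: &BTreeSet<usize>,
    desired: &BTreeSet<usize>,
) -> (Vec<usize>, Vec<usize>) {
    let to_start = desired.difference(current).copied().collect();
    let to_stop = current.difference(desired).copied().collect();
    (to_start, to_stop)
}
```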

Runtime mutations remain intentionally narrow. This PR keeps changes in memory only and rejects updates that would require topic-broker reconfiguration or broader engine/group-level policy mutation. Runtime config persistence remains out of scope.

Operations are explicit and observable through rollout and shutdown ids. Terminal operation history is intentionally bounded in memory, so old rollout/shutdown ids may eventually return 404 or Ok(None) from the SDK after retention pruning.
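A client polling loop therefore has to treat a missing id as a possible post-pruning success rather than an error. A hypothetical sketch, where the status type and lookup function are stand-ins and not the SDK's actual API:

```rust
use std::{thread, time::Duration};

enum RolloutStatus {
    InProgress,
    Succeeded,
    Failed(String),
}

/// `status_of` stands in for an SDK/status-endpoint lookup that returns
/// Ok(None) once a terminal operation id has been pruned from history.
fn wait_for_rollout(
    status_of: &dyn Fn(&str) -> Result<Option<RolloutStatus>, String>,
    rollout_id: &str,
) -> Result<(), String> {
    loop {
        match status_of(rollout_id)? {
            Some(RolloutStatus::Succeeded) => return Ok(()),
            Some(RolloutStatus::Failed(reason)) => return Err(reason),
            Some(RolloutStatus::InProgress) => thread::sleep(Duration::from_millis(250)),
            // Pruned from the bounded retention window: the operation reached
            // a terminal state long enough ago that history was discarded.
            None => return Ok(()),
        }
    }
}
```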

Observed status is deployment-generation-aware. During overlapping rollouts, /status.instances preserves old and new runtime instances, while aggregate readiness/liveness still uses the selected serving generation per core. After controller work completes, superseded observed instances are compacted so status memory does not grow unbounded across rollouts.
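A sketch of the aggregate-readiness rule this implies, with illustrative map shapes (the real observed-state model is richer):

```rust
use std::collections::HashMap;

/// Aggregate readiness counts only the selected serving generation per core,
/// even while /status.instances also lists overlapping old/new instances.
fn aggregate_ready(
    serving_generation: &HashMap<usize, u64>,     // core -> selected generation
    instance_ready: &HashMap<(usize, u64), bool>, // (core, generation) -> ready
) -> bool {
    serving_generation
        .iter()
        .all(|(core, generation)| *instance_ready.get(&(*core, *generation)).unwrap_or(&false))
}
```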

The controller also hardens lifecycle cleanup around live operations: worker panics are converted into terminal rollout/shutdown failure states, runtime exits are reported through supervised bookkeeping, and shutdown drain completion waits for upstream closure rather than queue emptiness alone.
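The panic-to-terminal-state conversion can be pictured as supervised join handling; a simplified sketch, with RolloutState as a hypothetical stand-in for the controller's bookkeeping:

```rust
enum RolloutState {
    Succeeded,
    Failed(String),
}

/// Joins a rollout worker and converts a panic into a terminal failure
/// state instead of leaving the operation dangling.
fn supervise_worker(work: impl FnOnce() + Send + 'static) -> RolloutState {
    let handle = std::thread::spawn(work);
    match handle.join() {
        Ok(()) => RolloutState::Succeeded,
        // join() returns Err exactly when the worker panicked; record it
        // so the rollout/shutdown id reports a definite outcome.
        Err(_) => RolloutState::Failed("worker panicked".into()),
    }
}
```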

What issue does this PR close?

How are these changes tested?

Commands run:

  • cargo xtask check

Automated coverage includes:

  • rollout planning for create, replace, resize, scale up/down, noop, topic-mutation rejection, and conflict handling
  • rollout execution success/failure paths, rollback behavior, worker panic cleanup, and bounded operation retention
  • per-pipeline shutdown planning, progress tracking, conflict handling, timeout behavior, and worker panic cleanup
  • observed-state generation selection, overlap-aware /status.instances, compaction of superseded generations, and shutdown-terminal status behavior
  • admin HTTP handlers for reconfigure, rollout status, shutdown, shutdown status, operation errors, wait timeouts, and missing retained operation ids
  • Rust admin SDK decoding for typed reconfigure/shutdown outcomes, operation rejection errors, pipeline details/status, and rollout/shutdown polling
  • engine shutdown-drain behavior where downstream nodes wait for upstream channel closure before completing shutdown drain
  • route/doc compatibility checks through the full workspace validation suite

Manual validation covered:

  • creating a new pipeline through the admin API
  • replacing a pipeline with a topology/config change
  • resizing a pipeline by changing only core allocation
  • submitting an identical config and observing a noop rollout
  • shutting down a pipeline and observing shutdown tracking/status
  • checking generation-aware status during overlapping rollout behavior

Are there any user-facing changes?

Yes.

  • The admin API now supports live pipeline create/update/resize/noop via PUT /api/v1/groups/{group}/pipelines/{id}.
  • Pipeline shutdown is available via POST /api/v1/groups/{group}/pipelines/{id}/shutdown.
  • Rollout and shutdown progress can be queried through dedicated status endpoints.
  • New pipeline-scoped admin routes are under /api/v1/groups/....
  • Pipeline status payloads now include deployment-generation-aware instance data and rollout metadata while preserving the existing aggregate status shape.
  • Terminal rollout and shutdown ids are retained only within a bounded in-memory window.
  • The Rust admin SDK now exposes typed clients, request options, operation outcomes, and error models for live reconfiguration and shutdown.
  • Operator documentation was added in rust/otap-dataflow/docs/admin/live-reconfiguration.md.
  • Support for the existing WebSocket log stream endpoint /api/v1/telemetry/logs/stream is preserved/restored in this branch.

github-actions bot added the rust label (Pull requests that update Rust code) on Apr 9, 2026
lquerel changed the title from "Add live pipeline reconfiguration and shutdown control to the admin API" to "[WIP] Add live pipeline reconfiguration and shutdown control to the admin API" on Apr 9, 2026
codecov bot commented on Apr 9, 2026

Codecov Report

❌ Patch coverage is 70.44681% with 1389 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.06%. Comparing base (da7688b) to head (44015e8).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2618      +/-   ##
==========================================
- Coverage   88.22%   88.06%   -0.17%     
==========================================
  Files         639      644       +5     
  Lines      242572   246795    +4223     
==========================================
+ Hits       214018   217334    +3316     
- Misses      28030    28937     +907     
  Partials      524      524              
Components Coverage Δ
otap-dataflow 89.64% <70.44%> (-0.25%) ⬇️
query_abstraction 80.61% <ø> (ø)
query_engine 90.75% <ø> (ø)
otel-arrow-go 52.45% <ø> (ø)
quiver 92.25% <ø> (ø)

lquerel (Contributor, Author) commented on Apr 23, 2026

@lalitb Thanks for all this very valuable feedback.

lalitb (Member) left a comment


LGTM. I left one lifecycle cleanup question around partial resize/replace launch failures, but otherwise the engine/control-plane changes look thoughtfully designed and well covered. Looking forward to using live pipeline reconfiguration :)

*core_id,
active_generation,
)
.map_err(|err| RolloutExecutionError::Failed(err.to_string()))?;
lalitb (Member) commented:

Shouldn't this go through rollback_resize_rollout instead of returning Failed directly? If an earlier core already started and became ready, it'll be left running when this fails.

Same pattern is there in run_replace_rollout too.

lquerel (Contributor, Author) replied:

Good catch. I updated both run_resize_rollout and run_replace_rollout so a launch failure after earlier rollout progress now goes through the existing rollback helpers instead of returning Failed directly. That means already started cores are cleaned up rather than being left running. I also added regression tests covering resize rollback cleanup and replace rollback cleanup for previously activated added cores.

Fixed in commit 791ba2b.
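A sketch of the resulting rollback-on-failure shape, where `launch` and `rollback` are hypothetical stand-ins for the existing per-core launch and rollback helpers mentioned above:

```rust
fn run_rollout_cores(
    cores: &[usize],
    launch: &mut dyn FnMut(usize) -> Result<(), String>,
    rollback: &mut dyn FnMut(&[usize]),
) -> Result<(), String> {
    let mut activated: Vec<usize> = Vec::new();
    for &core in cores {
        if let Err(err) = launch(core) {
            // Unwind cores that already started and became ready, instead of
            // returning Failed directly and leaving them running.
            rollback(&activated);
            return Err(err);
        }
        activated.push(core);
    }
    Ok(())
}
```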

lquerel added 2 commits April 23, 2026 10:51
lquerel enabled auto-merge on April 23, 2026 at 18:14
lquerel added this pull request to the merge queue on Apr 24, 2026
Merged via the queue into open-telemetry:main with commit 5507512 on Apr 24, 2026 (83 of 84 checks passed)
lquerel deleted the live-reconfig branch on April 24, 2026 at 06:24