Add live pipeline reconfiguration and shutdown control to the admin API#2618
lquerel merged 54 commits into open-telemetry:main from
Conversation
Codecov Report ❌

```
@@            Coverage Diff             @@
##             main    #2618      +/-   ##
==========================================
- Coverage   88.22%   88.06%   -0.17%
==========================================
  Files         639      644       +5
  Lines      242572   246795    +4223
==========================================
+ Hits       214018   217334    +3316
- Misses      28030    28937     +907
  Partials      524      524
```
@lalitb Thanks for all this very valuable feedback.
lalitb left a comment
LGTM. I left one lifecycle cleanup question around partial resize/replace launch failures, but otherwise the engine/control-plane changes look thoughtfully designed and well covered. Looking forward to using live pipeline reconfiguration :)
```rust
                *core_id,
                active_generation,
            )
            .map_err(|err| RolloutExecutionError::Failed(err.to_string()))?;
```
Shouldn't this go through `rollback_resize_rollout` instead of returning `Failed` directly? If an earlier core already started and became ready, it'll be left running when this fails.
The same pattern exists in `run_replace_rollout` too.
Good catch. I updated both `run_resize_rollout` and `run_replace_rollout` so a launch failure after earlier rollout progress now goes through the existing rollback helpers instead of returning `Failed` directly. That means already-started cores are cleaned up rather than left running. I also added regression tests covering resize rollback cleanup and replace rollback cleanup for previously activated added cores.
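A minimal, self-contained sketch of the resulting pattern; all types and method bodies here are illustrative stand-ins, and only the rollback-before-failing semantics come from this thread:

```rust
// Sketch: when launching a core fails partway through a rollout, roll
// back the cores that already started instead of returning Failed and
// leaving them running. Types and method names are assumptions.
#[derive(Debug)]
enum RolloutExecutionError {
    Failed(String),
}

struct Rollout {
    started_cores: Vec<usize>,
}

impl Rollout {
    fn launch_core(&mut self, core_id: usize) -> Result<(), String> {
        // Placeholder for the real launch + Admitted/Ready wait.
        if core_id == 2 {
            return Err(format!("core {core_id} failed to launch"));
        }
        self.started_cores.push(core_id);
        Ok(())
    }

    fn rollback(&mut self) {
        // Stop every core that already started, newest first.
        for core_id in self.started_cores.drain(..).rev() {
            println!("stopping core {core_id}");
        }
    }

    fn run(&mut self, cores: &[usize]) -> Result<(), RolloutExecutionError> {
        for &core_id in cores {
            if let Err(err) = self.launch_core(core_id) {
                // Key change: clean up prior progress before failing.
                self.rollback();
                return Err(RolloutExecutionError::Failed(err));
            }
        }
        Ok(())
    }
}
```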
Change Summary
This PR adds live pipeline lifecycle control to the OTAP Dataflow Engine admin API.
The controller now runs as a resident runtime manager and supports in-memory pipeline operations on a running engine: each reconfiguration request resolves to a `create`, `noop`, `resize`, or `replace` result (see Design Decisions below).
The PR also adds rollout and shutdown status resources, extends runtime and observed-state tracking with deployment generations so overlapping instances remain distinguishable, updates the Rust admin SDK with typed live-control methods and outcomes, and documents the feature.
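As a rough illustration of that classification, here is a hedged sketch; `PipelineSpec` and its fields are assumptions standing in for the real configuration types, and the actual spec comparison is richer than this:

```rust
// Hedged sketch of reconfiguration-request classification.
// PipelineSpec and its fields are illustrative assumptions.
#[derive(PartialEq)]
struct PipelineSpec {
    topology: String, // stand-in for nodes, edges, and node configs
    num_cores: usize, // requested core allocation
}

#[derive(Debug, PartialEq)]
enum ReconfigureAction {
    Create,  // nothing deployed yet under (group, pipeline)
    Noop,    // desired spec matches the running spec
    Resize,  // only the core allocation changed
    Replace, // topology or node configuration changed
}

fn classify(current: Option<&PipelineSpec>, desired: &PipelineSpec) -> ReconfigureAction {
    match current {
        None => ReconfigureAction::Create,
        Some(cur) if cur == desired => ReconfigureAction::Noop,
        Some(cur) if cur.topology == desired.topology => ReconfigureAction::Resize,
        Some(_) => ReconfigureAction::Replace,
    }
}
```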
Design Decisions
This PR keeps live reconfiguration scoped to one logical pipeline at a time, keyed by `(group, pipeline)`. Reconfiguration mutations go through one declarative endpoint, `PUT /api/v1/groups/{group}/pipelines/{id}`, while pipeline shutdown remains a separate operation through `POST /api/v1/groups/{group}/pipelines/{id}/shutdown`.

The controller classifies each reconfiguration request as `create`, `noop`, `resize`, or `replace` instead of exposing separate start, scale, and update APIs.

Topology and node-configuration changes use a serial rolling cutover with overlap, as sketched below: start the new instance on one core, wait for `Admitted` and `Ready`, then drain the old instance on that core. This preserves continuity while fitting the existing runtime model, without requiring a full second serving fleet or a separate traffic-switching layer.

Pure core-allocation changes use a dedicated internal resize path, so scaling up or down only starts or stops the delta cores and leaves unchanged cores running.
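A minimal sketch of that serial cutover loop; only the `Admitted`/`Ready` phases are named in the PR, and every other identifier here is an assumption:

```rust
// Illustrative per-core rolling cutover: start the new-generation
// instance, wait for Admitted then Ready, then drain the old instance.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Phase {
    Admitted,
    Ready,
}

fn start_new_instance(core: usize) -> Result<(), String> {
    println!("core {core}: starting new-generation instance");
    Ok(())
}

fn wait_for(core: usize, phase: Phase) -> Result<(), String> {
    println!("core {core}: reached {phase:?}");
    Ok(())
}

fn drain_old_instance(core: usize) -> Result<(), String> {
    println!("core {core}: draining old-generation instance");
    Ok(())
}

fn rolling_replace(cores: &[usize]) -> Result<(), String> {
    for &core in cores {
        start_new_instance(core)?;
        // The old instance keeps serving this core until the new one
        // is fully ready, so the core never goes dark.
        wait_for(core, Phase::Admitted)?;
        wait_for(core, Phase::Ready)?;
        drain_old_instance(core)?;
    }
    Ok(())
}
```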
Runtime mutations remain intentionally narrow. This PR keeps changes in memory only and rejects updates that would require topic-broker reconfiguration or broader engine/group-level policy mutation. Runtime config persistence remains out of scope.
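A minimal sketch of that rejection guard, under assumed names (the function and config fields are illustrative, not the PR's identifiers):

```rust
// Illustrative guard rejecting updates outside the supported live scope.
struct ScopedConfig {
    broker_topics: Vec<String>, // topic-broker wiring (not mutable live)
    group_policy: String,       // engine/group-level policy (not mutable live)
}

fn validate_update(current: &ScopedConfig, desired: &ScopedConfig) -> Result<(), String> {
    if current.broker_topics != desired.broker_topics {
        return Err("update requires topic-broker reconfiguration; not supported live".into());
    }
    if current.group_policy != desired.group_policy {
        return Err("engine/group-level policy mutation is not supported live".into());
    }
    Ok(())
}
```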
Operations are explicit and observable through rollout and shutdown ids. Terminal operation history is intentionally bounded in memory, so old rollout/shutdown ids may eventually return `404` or `Ok(None)` from the SDK after retention pruning.

Observed status is deployment-generation-aware. During overlapping rollouts, `/status.instances` preserves old and new runtime instances, while aggregate readiness/liveness still uses the selected serving generation per core. After controller work completes, superseded observed instances are compacted so status memory does not grow unbounded across rollouts.

The controller also hardens lifecycle cleanup around live operations: worker panics are converted into terminal rollout/shutdown failure states, runtime exits are reported through supervised bookkeeping, and shutdown drain completion waits for upstream closure rather than queue emptiness alone.
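To make the retention behavior concrete, a hedged sketch of how an SDK caller might treat a pruned id; `AdminClient` and `rollout_status` are assumed names, since the PR only states that pruned ids surface as `404` or `Ok(None)`:

```rust
// Stand-ins for the admin SDK types; real names/signatures may differ.
#[derive(Debug)]
struct RolloutStatus;

struct AdminClient;

impl AdminClient {
    fn rollout_status(&self, _rollout_id: &str) -> Result<Option<RolloutStatus>, String> {
        // Pretend this id has aged out of the bounded terminal history.
        Ok(None)
    }
}

fn main() -> Result<(), String> {
    let client = AdminClient;
    match client.rollout_status("rollout-42")? {
        Some(status) => println!("still retained: {status:?}"),
        // Ok(None) means the id was pruned, not that the rollout failed;
        // callers should treat it as "finished and aged out".
        None => println!("rollout pruned from retained history"),
    }
    Ok(())
}
```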
What issue does this PR close?
How are these changes tested?
Commands run:

- `cargo xtask check`

Automated coverage includes:

- `/status.instances`, compaction of superseded generations, and shutdown-terminal status behavior

Manual validation covered:
Are there any user-facing changes?
Yes.
- `PUT /api/v1/groups/{group}/pipelines/{id}` (declarative reconfiguration endpoint).
- `POST /api/v1/groups/{group}/pipelines/{id}/shutdown` (pipeline shutdown endpoint).
- `/api/v1/groups/...`
- `rust/otap-dataflow/docs/admin/live-reconfiguration.md` (feature documentation).
- `/api/v1/telemetry/logs/stream` is preserved/restored in this branch.
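For reference, a hedged sketch of calling the two endpoints over plain HTTP with `reqwest` (with the `blocking` and `json` features, plus `serde_json`); the base address, group/pipeline ids, and request body shape are assumptions, while the paths come from the PR description:

```rust
// Sketch: drive the live-control endpoints over HTTP.
fn main() -> Result<(), reqwest::Error> {
    let base = "http://127.0.0.1:8080"; // assumed admin address
    let client = reqwest::blocking::Client::new();

    // Declarative reconfiguration: PUT the full desired pipeline spec.
    let resp = client
        .put(format!("{base}/api/v1/groups/default/pipelines/otlp"))
        .json(&serde_json::json!({ "cores": 4 })) // illustrative body
        .send()?;
    println!("reconfigure: {}", resp.status());

    // Shutdown remains a separate operation.
    let resp = client
        .post(format!("{base}/api/v1/groups/default/pipelines/otlp/shutdown"))
        .send()?;
    println!("shutdown: {}", resp.status());
    Ok(())
}
```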