Merged
Changes from 42 commits
Commits
54 commits
8862c6e
Add live pipeline reconfiguration and shutdown control
lquerel Mar 14, 2026
5849f12
Add live-reconfiguration.md doc
lquerel Mar 15, 2026
52dd870
Refine live reconfiguration rollout docs
lquerel Mar 15, 2026
7413257
Add noop rollouts and simplify group routes
lquerel Mar 15, 2026
35747f2
Add live pipeline control to the public admin SDK
lquerel Apr 9, 2026
e599155
Revert core-nodes test-only changes
lquerel Apr 9, 2026
edfcb18
Improve admin SDK live control docs
lquerel Apr 9, 2026
adbc09b
Add admin SDK rustdoc examples
lquerel Apr 9, 2026
825f90d
Shorten pipeline operation status names
lquerel Apr 9, 2026
1414f34
Document live reconfig test scenarios
lquerel Apr 9, 2026
8833776
Extract live control from controller lib
lquerel Apr 10, 2026
101906d
Bound live control retention and document methods
lquerel Apr 10, 2026
fb4e60b
Improve receiver drain responsiveness
lquerel Apr 10, 2026
57faf2b
Fix topic shutdown under backpressure
lquerel Apr 10, 2026
69b7d1b
Optimize topic fast paths under backpressure
lquerel Apr 10, 2026
aa8412c
Move topic wait semantics into the topic runtime
lquerel Apr 10, 2026
00e426d
Optimize topic delivery lease storage
lquerel Apr 10, 2026
c4d1c28
Bound observed-state instance retention
lquerel Apr 10, 2026
562c095
Restore overlap-aware pipeline status instances
lquerel Apr 10, 2026
71c28bc
Handle rollout and shutdown worker panics
lquerel Apr 10, 2026
ffea85f
Improve controller panic diagnostics
lquerel Apr 10, 2026
24d3af4
Make mixed topic try_publish all-or-nothing
lquerel Apr 10, 2026
af4d9c1
Avoid mixed topic permit convoys
lquerel Apr 11, 2026
0b57117
Report pipeline exits without watcher threads
lquerel Apr 11, 2026
2bf981c
Fix resize-down pipeline status selection
lquerel Apr 12, 2026
8942d5f
Split live control into focused modules
lquerel Apr 12, 2026
cdc2b23
Fix shutdown drain to wait for upstream closure
lquerel Apr 12, 2026
79b79c0
Add topic review comments
lquerel Apr 14, 2026
cb17333
Clarify topic delivery permits
lquerel Apr 15, 2026
76ff6ea
Align live reconfig with current main APIs
lquerel Apr 16, 2026
c982984
Merge remote-tracking branch 'upstream/main' into live-reconfig
lquerel Apr 20, 2026
bcbb149
Restore telemetry log stream endpoint
lquerel Apr 22, 2026
447bffc
Document live control module
lquerel Apr 22, 2026
71d2a59
Merge branch 'main' into live-reconfig
lquerel Apr 22, 2026
2f7ac1a
Fix Clippy issue
lquerel Apr 22, 2026
9d4018f
Added back the conditional dhat start call
lquerel Apr 22, 2026
49ea270
Make shutdown-all dispatch best-effort
lquerel Apr 22, 2026
68f510c
Fox doc issue
lquerel Apr 22, 2026
acfe52d
Merge branch 'main' into live-reconfig
lquerel Apr 22, 2026
6d9ec31
Document live reconfiguration consistency model
lquerel Apr 22, 2026
b2f5907
Clean up partial create rollout launches
lquerel Apr 22, 2026
a76e150
Clean up rollout panic candidates
lquerel Apr 23, 2026
5014850
Clarify live reconfiguration terminology
lquerel Apr 23, 2026
1fb2db6
Merge branch 'main' into live-reconfig
lquerel Apr 23, 2026
7184319
Fix merge issue
lquerel Apr 23, 2026
321ebfb
Merge remote-tracking branch 'origin/live-reconfig' into live-reconfig
lquerel Apr 23, 2026
831a479
Fix fmt issue
lquerel Apr 23, 2026
791ba2b
Rollback partial resize and replace launches
lquerel Apr 23, 2026
d6a1906
Merge remote-tracking branch 'upstream/main' into live-reconfig
lquerel Apr 23, 2026
0e9cd12
Move rollout shape comparison into config types
lquerel Apr 24, 2026
bfcfc0f
Use condvar for instance exit waits
lquerel Apr 24, 2026
1fe6227
Merge branch 'main' into live-reconfig
lquerel Apr 24, 2026
e893462
Merge branch 'main' into live-reconfig
lquerel Apr 24, 2026
44015e8
Fix terminal retention TTL test on Windows
lquerel Apr 24, 2026
85 changes: 33 additions & 52 deletions rust/otap-dataflow/crates/admin-api/README.md
@@ -244,11 +244,16 @@ method and its operational purpose.
 | `GET /api/v1/status` | `engine().status()` | Full engine status snapshot across pipelines and cores. |
 | `GET /api/v1/livez` | `engine().livez()` | Engine liveness probe with structured failure details. |
 | `GET /api/v1/readyz` | `engine().readyz()` | Readiness probe for orchestration or traffic gating. |
-| `GET /api/v1/pipeline-groups/status` | `pipeline_groups().status()` | Fleet-style pipeline status view. |
-| `POST /api/v1/pipeline-groups/shutdown` | `pipeline_groups().shutdown(...)` | Coordinated shutdown request across running pipelines. |
-| `GET /api/v1/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}/status` | `pipelines().status(...)` | Detailed status for a single pipeline. |
-| `GET /api/v1/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}/livez` | `pipelines().livez(...)` | Semantic liveness probe result for a single pipeline. |
-| `GET /api/v1/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}/readyz` | `pipelines().readyz(...)` | Semantic readiness probe result for a single pipeline. |
+| `GET /api/v1/groups/status` | `groups().status()` | Fleet-style pipeline status view. |
+| `POST /api/v1/groups/shutdown` | `groups().shutdown(...)` | Coordinated shutdown request across running pipelines. |
+| `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}` | `pipelines().details(...)` | Live committed configuration and any active rollout summary for one logical pipeline. |
+| `PUT /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}` | `pipelines().reconfigure(...)` | Submit a live pipeline reconfiguration request and get an accepted, completed, failed, or timed-out operation outcome. |
+| `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/rollouts/{rollout_id}` | `pipelines().rollout_status(...)` | Detailed status for one rollout operation. |
+| `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/status` | `pipelines().status(...)` | Detailed status for a single pipeline. |
+| `POST /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/shutdown` | `pipelines().shutdown(...)` | Shut down one logical pipeline and get an accepted, completed, failed, or timed-out operation outcome. |
+| `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/shutdowns/{shutdown_id}` | `pipelines().shutdown_status(...)` | Detailed status for one pipeline shutdown operation. |
+| `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/livez` | `pipelines().livez(...)` | Semantic liveness probe result for a single pipeline. |
+| `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/readyz` | `pipelines().readyz(...)` | Semantic readiness probe result for a single pipeline. |
 | `GET /api/v1/telemetry/logs` | `telemetry().logs(...)` | Retained admin logs when log retention is enabled. |
 | `GET /api/v1/telemetry/metrics` | `telemetry().metrics(...)`, `telemetry().metrics_compact(...)` | Current engine metrics as structured JSON, using either the full or compact response shape. |

@@ -257,50 +262,26 @@ method and its operational purpose.
 canonical `telemetry().metrics(...)` and `telemetry().metrics_compact(...)`
 methods.

-## Future evolution: live reconfiguration
-
-Future live reconfiguration work is expected to extend the admin SDK from a
-status-and-observability client into a richer control-plane client for
-long-lived engine instances. The details are not stabilized yet, but the work
-in progress already helps frame the direction for advanced integrators building
-external controllers.
-
-Main capabilities expected from this area of the admin API:
-
-- read the live committed configuration for a single logical pipeline;
-- create, replace, resize, or accept a `noop` update for one logical pipeline;
-- track rollout progress through a dedicated rollout resource;
-- track per-pipeline shutdown progress through a dedicated shutdown resource;
-- expose generation-aware pipeline status during overlapping cutover.
-
-The current SDK is intentionally narrower, and the main future extensions for
-live reconfiguration are expected to center on:
-
-- resource model: adding live pipeline details, rollout status, and shutdown
-  status as first-class SDK resources instead of exposing only snapshots and
-  probes;
-- status shape: extending pipeline status with generation-aware fields such as
-  `activeGeneration`, `servingGenerations`, rollout summaries, and
-  per-generation instance views;
-- operation semantics: treating create, replace, resize, and shutdown as
-  long-running admin operations with both immediate-return and wait-or-poll
-  interaction patterns;
-- error and outcome modeling: representing rollout conflicts, validation
-  failures, and timeout outcomes as typed SDK results rather than leaving them
-  as transport-level concerns.
-
-The intended integration direction is to keep `AdminClient` as the stable
-entrypoint and absorb those changes behind typed client methods rather than
-exposing raw route strings as the public contract. In practice, that likely
-means:
-
-- keeping transport and route-version differences behind backend adapters;
-- adding job-oriented client methods for live pipeline read, update, rollout
-  status, and per-pipeline shutdown tracking;
-- supporting both immediate-return and wait-or-poll interaction patterns for
-  long-running admin operations;
-- continuing to treat experimental endpoints as opt-in additions only after
-  their semantics and wire format settle.
+## Live pipeline control
+
+The SDK exposes the live pipeline control surface behind typed methods:
+
+- `pipelines().details(...)` reads the committed pipeline config and active
+  rollout summary.
+- `pipelines().reconfigure(...)` submits create, `noop`, resize, and replace
+  operations and returns a typed outcome.
+- `pipelines().rollout_status(...)` polls a rollout by id.
+- `pipelines().shutdown(...)` requests shutdown for one logical pipeline and
+  returns a typed outcome.
+- `pipelines().shutdown_status(...)` polls a shutdown operation by id.
+
+Terminal rollout and shutdown ids are retained only within a bounded in-memory
+window. Older ids may return `Ok(None)` after the controller evicts historical
+operation snapshots.
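
Editor's sketch (not part of the diff): the bounded-retention behavior described in the added paragraph can be illustrated with a small standalone model. Everything here is an assumption made for illustration: `RetentionWindow` and `TerminalState` are hypothetical names, and the oldest-first, count-based eviction is a guess, since the diff does not say whether the controller bounds retention by count, by TTL, or both. The only detail taken from the README is that a lookup for an evicted id yields `Ok(None)` rather than an error.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical terminal states; stand-ins, not the SDK's real types.
#[derive(Debug, Clone, PartialEq)]
enum TerminalState {
    Completed,
    Failed,
    TimedOut,
}

/// Hypothetical bounded store of terminal operation snapshots.
/// Assumption: eviction is oldest-first once `capacity` is exceeded.
struct RetentionWindow {
    capacity: usize,
    order: VecDeque<String>,
    snapshots: HashMap<String, TerminalState>,
}

impl RetentionWindow {
    fn new(capacity: usize) -> Self {
        Self {
            capacity,
            order: VecDeque::new(),
            snapshots: HashMap::new(),
        }
    }

    /// Store a terminal snapshot, evicting the oldest ids beyond capacity.
    fn record(&mut self, id: &str, state: TerminalState) {
        // Only track insertion order the first time an id is seen.
        if self.snapshots.insert(id.to_string(), state).is_none() {
            self.order.push_back(id.to_string());
        }
        while self.order.len() > self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.snapshots.remove(&oldest);
            }
        }
    }

    /// Mirrors the `Ok(None)` shape: an unknown or evicted id is not an error.
    fn status(&self, id: &str) -> Result<Option<TerminalState>, String> {
        Ok(self.snapshots.get(id).cloned())
    }
}

fn main() {
    let mut window = RetentionWindow::new(2);
    window.record("rollout-1", TerminalState::Completed);
    window.record("rollout-2", TerminalState::Failed);
    window.record("rollout-3", TerminalState::TimedOut);

    // rollout-1 was evicted, so a poller should treat `None` as
    // "snapshot no longer retained" rather than as a failure.
    match window.status("rollout-1") {
        Ok(Some(state)) => println!("rollout-1: {state:?}"),
        Ok(None) => println!("rollout-1: snapshot no longer retained"),
        Err(err) => println!("lookup failed: {err}"),
    }
}
```

A caller that polls by id after a long delay would therefore want a distinct branch for `Ok(None)` instead of retrying indefinitely.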

+Waited operations return typed terminal outcomes instead of surfacing rollout
+or shutdown failures as transport-level errors. Request rejection remains a
+typed SDK error via `Error::AdminOperation`.
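
Editor's sketch (not part of the diff): the split between typed outcomes and typed errors described above might be handled on the caller side roughly as follows. The `Outcome` and `Error` enums are local stand-ins defined only for this example. Their variants follow the README's wording ("accepted, completed, failed, or timed-out" and `Error::AdminOperation`), but the field names, the payload types, and the `Transport` variant are assumptions, not the SDK's actual definitions.

```rust
/// Hypothetical outcome of a reconfigure or shutdown call.
/// Field names are assumptions for illustration only.
#[derive(Debug)]
enum Outcome {
    Accepted { id: String },
    Completed { id: String },
    Failed { id: String, reason: String },
    TimedOut { id: String },
}

/// Local stand-in for the SDK error type. Only the `AdminOperation`
/// variant name appears in the README; `Transport` is an assumption.
#[derive(Debug)]
enum Error {
    AdminOperation(String),
    Transport(String),
}

/// Rollout or shutdown failures arrive as `Ok(Outcome::Failed { .. })`,
/// while a rejected request arrives as `Err(Error::AdminOperation(..))`.
fn describe(result: Result<Outcome, Error>) -> String {
    match result {
        Ok(Outcome::Accepted { id }) => format!("{id}: accepted, poll for progress"),
        Ok(Outcome::Completed { id }) => format!("{id}: completed"),
        Ok(Outcome::Failed { id, reason }) => format!("{id}: failed ({reason})"),
        Ok(Outcome::TimedOut { id }) => format!("{id}: timed out"),
        Err(Error::AdminOperation(msg)) => format!("request rejected: {msg}"),
        Err(Error::Transport(msg)) => format!("transport error: {msg}"),
    }
}

fn main() {
    let results = vec![
        Ok(Outcome::Accepted { id: "rollout-7".to_string() }),
        Ok(Outcome::Failed {
            id: "rollout-8".to_string(),
            reason: "readiness check failed".to_string(),
        }),
        Err(Error::AdminOperation("conflicting rollout in progress".to_string())),
    ];
    for result in results {
        println!("{}", describe(result));
    }
}
```

Keeping operation failures in the `Ok` branch lets a caller reserve `?` for requests that never started, which appears to be the intent of the paragraph above.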

 ## Client API cookbook

@@ -327,7 +308,7 @@ println!("readyz={:?}", readyz.status);
 # }
 ```

-### Pipeline group status and coordinated shutdown
+### Group status and coordinated shutdown

 Use this when an operator or control plane needs a fleet view and a single
 engine-wide shutdown entrypoint.
@@ -343,11 +324,11 @@ let client = AdminClient::builder()
     .http(HttpAdminClientSettings::new(AdminEndpoint::http("127.0.0.1", 8080)))
     .build()?;

-let groups = client.pipeline_groups().status().await?;
+let groups = client.groups().status().await?;
 println!("pipelines={}", groups.pipelines.len());

 let shutdown = client
-    .pipeline_groups()
+    .groups()
     .shutdown(&OperationOptions {
         wait: true,
         timeout_secs: 30,