diff --git a/rust/otap-dataflow/crates/admin-api/README.md b/rust/otap-dataflow/crates/admin-api/README.md
index 763b7596a4..048c465018 100644
--- a/rust/otap-dataflow/crates/admin-api/README.md
+++ b/rust/otap-dataflow/crates/admin-api/README.md
@@ -244,11 +244,16 @@ method and its operational purpose.
 | `GET /api/v1/status` | `engine().status()` | Full engine status snapshot across pipelines and cores. |
 | `GET /api/v1/livez` | `engine().livez()` | Engine liveness probe with structured failure details. |
 | `GET /api/v1/readyz` | `engine().readyz()` | Readiness probe for orchestration or traffic gating. |
-| `GET /api/v1/pipeline-groups/status` | `pipeline_groups().status()` | Fleet-style pipeline status view. |
-| `POST /api/v1/pipeline-groups/shutdown` | `pipeline_groups().shutdown(...)` | Coordinated shutdown request across running pipelines. |
-| `GET /api/v1/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}/status` | `pipelines().status(...)` | Detailed status for a single pipeline. |
-| `GET /api/v1/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}/livez` | `pipelines().livez(...)` | Semantic liveness probe result for a single pipeline. |
-| `GET /api/v1/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}/readyz` | `pipelines().readyz(...)` | Semantic readiness probe result for a single pipeline. |
+| `GET /api/v1/groups/status` | `groups().status()` | Fleet-style pipeline status view. |
+| `POST /api/v1/groups/shutdown` | `groups().shutdown(...)` | Coordinated shutdown request across running pipelines. |
+| `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}` | `pipelines().details(...)` | Live committed configuration and any active rollout summary for one logical pipeline. |
+| `PUT /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}` | `pipelines().reconfigure(...)` | Submit a live pipeline reconfiguration request and get an accepted, completed, failed, or timed-out operation outcome. |
+| `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/rollouts/{rollout_id}` | `pipelines().rollout_status(...)` | Detailed status for one rollout operation. |
+| `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/status` | `pipelines().status(...)` | Detailed status for a single pipeline. |
+| `POST /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/shutdown` | `pipelines().shutdown(...)` | Shut down one logical pipeline and get an accepted, completed, failed, or timed-out operation outcome. |
+| `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/shutdowns/{shutdown_id}` | `pipelines().shutdown_status(...)` | Detailed status for one pipeline shutdown operation. |
+| `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/livez` | `pipelines().livez(...)` | Semantic liveness probe result for a single pipeline. |
+| `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/readyz` | `pipelines().readyz(...)` | Semantic readiness probe result for a single pipeline. |
 | `GET /api/v1/telemetry/logs` | `telemetry().logs(...)` | Retained admin logs when log retention is enabled. |
 | `GET /api/v1/telemetry/metrics` | `telemetry().metrics(...)`, `telemetry().metrics_compact(...)` | Current engine metrics as structured JSON, using either the full or compact response shape. |
@@ -257,50 +262,26 @@ method and its operational purpose.
 canonical `telemetry().metrics(...)` and `telemetry().metrics_compact(...)`
 methods.
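+
+For example, a minimal spot-check of the structured metrics surface (a sketch
+that assumes an engine listening on `127.0.0.1:8080`; all names come from the
+SDK surface described above):
+
+```rust
+# use otap_df_admin_api::{telemetry, AdminClient, AdminEndpoint, HttpAdminClientSettings};
+#
+# async fn example() -> Result<(), Box<dyn std::error::Error>> {
+let client = AdminClient::builder()
+    .http(HttpAdminClientSettings::new(AdminEndpoint::http("127.0.0.1", 8080)))
+    .build()?;
+
+let metrics = client
+    .telemetry()
+    .metrics(&telemetry::MetricsOptions::default())
+    .await?;
+
+// Each metric set carries descriptor metadata alongside the current values.
+for metric_set in &metrics.metric_sets {
+    println!("metric set with {} metrics", metric_set.metrics.len());
+}
+# Ok(())
+# }
+```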
-## Future evolution: live reconfiguration
-
-Future live reconfiguration work is expected to extend the admin SDK from a
-status-and-observability client into a richer control-plane client for
-long-lived engine instances. The details are not stabilized yet, but the work
-in progress already helps frame the direction for advanced integrators building
-external controllers.
-
-Main capabilities expected from this area of the admin API:
-
-- read the live committed configuration for a single logical pipeline;
-- create, replace, resize, or accept a `noop` update for one logical pipeline;
-- track rollout progress through a dedicated rollout resource;
-- track per-pipeline shutdown progress through a dedicated shutdown resource;
-- expose generation-aware pipeline status during overlapping cutover.
-
-The current SDK is intentionally narrower, and the main future extensions for
-live reconfiguration are expected to center on:
-
-- resource model: adding live pipeline details, rollout status, and shutdown
-  status as first-class SDK resources instead of exposing only snapshots and
-  probes;
-- status shape: extending pipeline status with generation-aware fields such as
-  `activeGeneration`, `servingGenerations`, rollout summaries, and
-  per-generation instance views;
-- operation semantics: treating create, replace, resize, and shutdown as
-  long-running admin operations with both immediate-return and wait-or-poll
-  interaction patterns;
-- error and outcome modeling: representing rollout conflicts, validation
-  failures, and timeout outcomes as typed SDK results rather than leaving them
-  as transport-level concerns.
-
-The intended integration direction is to keep `AdminClient` as the stable
-entrypoint and absorb those changes behind typed client methods rather than
-exposing raw route strings as the public contract. In practice, that likely
-means:
-
-- keeping transport and route-version differences behind backend adapters;
-- adding job-oriented client methods for live pipeline read, update, rollout
-  status, and per-pipeline shutdown tracking;
-- supporting both immediate-return and wait-or-poll interaction patterns for
-  long-running admin operations;
-- continuing to treat experimental endpoints as opt-in additions only after
-  their semantics and wire format settle.
+## Live pipeline control
+
+The SDK exposes the live pipeline control surface behind typed methods:
+
+- `pipelines().details(...)` reads the committed pipeline config and active
+  rollout summary.
+- `pipelines().reconfigure(...)` submits create, `noop`, resize, and replace
+  operations and returns a typed outcome.
+- `pipelines().rollout_status(...)` polls a rollout by id.
+- `pipelines().shutdown(...)` requests shutdown for one logical pipeline and
+  returns a typed outcome.
+- `pipelines().shutdown_status(...)` polls a shutdown operation by id.
+
+Terminal rollout and shutdown ids are retained only within a bounded in-memory
+window. Older ids may return `Ok(None)` after the controller evicts historical
+operation snapshots.
+
+Waited operations return typed terminal outcomes instead of surfacing rollout
+or shutdown failures as transport-level errors. Request rejection remains a
+typed SDK error via `Error::AdminOperation`.
 
 ## Client API cookbook
 
@@ -327,7 +308,7 @@
 println!("readyz={:?}", readyz.status);
 # }
 ```
 
-### Pipeline group status and coordinated shutdown
+### Group status and coordinated shutdown
 
 Use this when an operator or control plane needs a fleet view and a single
 engine-wide shutdown entrypoint.
@@ -343,11 +324,11 @@
 let client = AdminClient::builder()
     .http(HttpAdminClientSettings::new(AdminEndpoint::http("127.0.0.1", 8080)))
     .build()?;
 
-let groups = client.pipeline_groups().status().await?;
+let groups = client.groups().status().await?;
 println!("pipelines={}", groups.pipelines.len());
 
 let shutdown = client
-    .pipeline_groups()
+    .groups()
     .shutdown(&OperationOptions {
         wait: true,
         timeout_secs: 30,
diff --git a/rust/otap-dataflow/crates/admin-api/src/client.rs b/rust/otap-dataflow/crates/admin-api/src/client.rs
index 6888145b4f..db0ebfd056 100644
--- a/rust/otap-dataflow/crates/admin-api/src/client.rs
+++ b/rust/otap-dataflow/crates/admin-api/src/client.rs
@@ -5,7 +5,7 @@
 use crate::endpoint::{AdminAuth, AdminEndpoint};
 use crate::http_backend::HttpBackend;
-use crate::{Error, engine, operations, pipeline_groups, pipelines, telemetry};
+use crate::{Error, engine, groups, operations, pipelines, telemetry};
 use async_trait::async_trait;
 use std::sync::Arc;
 use std::time::Duration;
@@ -38,7 +38,10 @@ pub struct HttpAdminClientSettings {
 }
 
 impl HttpAdminClientSettings {
-    /// Creates new HTTP client settings.
+    /// Creates HTTP client settings with the SDK defaults for connection behavior.
+    ///
+    /// Use the builder-style `with_*` methods to override auth, timeout,
+    /// keepalive, or TLS behavior.
     #[must_use]
     pub fn new(endpoint: AdminEndpoint) -> Self {
         Self {
@@ -53,49 +56,53 @@ impl HttpAdminClientSettings {
         }
     }
 
-    /// Sets the auth mode.
+    /// Sets the authentication mode for requests sent by this client.
     #[must_use]
     pub fn with_auth(mut self, auth: AdminAuth) -> Self {
         self.auth = auth;
         self
     }
 
-    /// Sets the TCP connect timeout.
+    /// Sets the TCP connect timeout for establishing new connections.
     #[must_use]
     pub fn with_connect_timeout(mut self, connect_timeout: Duration) -> Self {
         self.connect_timeout = connect_timeout;
         self
     }
 
-    /// Sets the request timeout.
+    /// Sets a per-request timeout for admin calls.
+    ///
+    /// This is separate from [`operations::OperationOptions::timeout_secs`],
+    /// which controls how long the server should wait on long-running
+    /// operations such as reconfigure or shutdown.
     #[must_use]
     pub fn with_timeout(mut self, timeout: Duration) -> Self {
         self.timeout = Some(timeout);
         self
     }
 
-    /// Clears any request timeout.
+    /// Disables the client-side per-request timeout.
     #[must_use]
     pub fn without_timeout(mut self) -> Self {
         self.timeout = None;
         self
     }
 
-    /// Sets whether to enable `TCP_NODELAY`.
+    /// Sets whether outbound TCP sockets should use `TCP_NODELAY`.
     #[must_use]
     pub fn with_tcp_nodelay(mut self, tcp_nodelay: bool) -> Self {
         self.tcp_nodelay = tcp_nodelay;
         self
     }
 
-    /// Sets the TCP keepalive timeout.
+    /// Sets the TCP keepalive timeout for outbound connections.
     #[must_use]
     pub fn with_tcp_keepalive(mut self, tcp_keepalive: Option<Duration>) -> Self {
         self.tcp_keepalive = tcp_keepalive;
         self
     }
 
-    /// Sets the TCP keepalive probe interval.
+    /// Sets the interval between TCP keepalive probes when keepalive is enabled.
     #[must_use]
     pub fn with_tcp_keepalive_interval(mut self, tcp_keepalive_interval: Option<Duration>) -> Self {
         self.tcp_keepalive_interval = tcp_keepalive_interval;
@@ -103,6 +110,10 @@ impl HttpAdminClientSettings {
     }
 
     /// Sets the TLS or mTLS configuration for HTTPS endpoints.
+    ///
+    /// This is ignored for plaintext HTTP endpoints and required only when the
+    /// target endpoint needs custom CA trust, client certificates, or other TLS
+    /// overrides.
     #[must_use]
     pub fn with_tls(mut self, tls: TlsClientConfig) -> Self {
         self.tls = Some(tls);
@@ -121,13 +132,13 @@ pub struct AdminClientBuilder {
 }
 
 impl AdminClientBuilder {
-    /// Creates a new builder.
+    /// Creates a new admin client builder with no backend configured yet.
     #[must_use]
     pub fn new() -> Self {
         Self::default()
     }
 
-    /// Configures the client to use the HTTP admin backend.
+    /// Configures the client to use the HTTP admin transport.
     #[must_use]
     pub fn http(mut self, settings: HttpAdminClientSettings) -> Self {
         self.backend = Some(BackendConfig::Http(settings));
@@ -135,6 +146,9 @@ impl AdminClientBuilder {
     }
 
     /// Builds the configured admin client.
+    ///
+    /// Returns an error when no backend has been configured or when the HTTP
+    /// transport settings are invalid.
     pub fn build(self) -> Result<AdminClient, Error> {
         let backend = match self.backend {
             Some(BackendConfig::Http(settings)) => {
@@ -158,13 +172,30 @@ pub struct AdminClient {
 }
 
 impl AdminClient {
-    /// Creates a new client builder.
+    /// Creates a builder for constructing an [`AdminClient`].
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # fn example() -> Result<(), otap_df_admin_api::Error> {
+    /// let client = AdminClient::builder()
+    ///     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    ///         "engine-a.internal.example",
+    ///         8080,
+    ///     )))
+    ///     .build()?;
+    ///
+    /// # let _ = client;
+    /// # Ok(())
+    /// # }
+    /// ```
     #[must_use]
     pub fn builder() -> AdminClientBuilder {
         AdminClientBuilder::new()
     }
 
-    /// Returns the engine-scoped resource client.
+    /// Returns the engine-scoped resource client for engine-wide status and probes.
     #[must_use]
     pub fn engine(&self) -> EngineClient<'_> {
         EngineClient {
@@ -172,15 +203,15 @@ impl AdminClient {
         }
     }
 
-    /// Returns the pipeline-group-scoped resource client.
+    /// Returns the group-scoped resource client for fleet-style status and shutdown operations.
     #[must_use]
-    pub fn pipeline_groups(&self) -> PipelineGroupsClient<'_> {
-        PipelineGroupsClient {
+    pub fn groups(&self) -> GroupsClient<'_> {
+        GroupsClient {
             backend: self.backend.as_ref(),
         }
     }
 
-    /// Returns the pipeline-scoped resource client.
+    /// Returns the pipeline-scoped resource client for per-pipeline status and live control.
     #[must_use]
     pub fn pipelines(&self) -> PipelinesClient<'_> {
         PipelinesClient {
@@ -188,7 +219,7 @@ impl AdminClient {
         }
     }
 
-    /// Returns the telemetry-scoped resource client.
+    /// Returns the telemetry-scoped resource client for logs and structured metrics.
     #[must_use]
     pub fn telemetry(&self) -> TelemetryClient<'_> {
         TelemetryClient {
@@ -204,40 +235,154 @@ pub struct EngineClient<'a> {
 }
 
 impl EngineClient<'_> {
-    /// Returns global pipeline status.
+    /// Returns the current engine-wide status snapshot.
+    ///
+    /// Use this when you need a cross-pipeline view of the running engine.
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// let status = client.engine().status().await?;
+    /// println!("pipelines={}", status.pipelines.len());
+    /// # Ok(())
+    /// # }
+    /// ```
    pub async fn status(&self) -> Result<engine::Status, Error> {
         self.backend.engine_status().await
     }
 
-    /// Returns the global liveness probe response.
+    /// Returns the engine liveness probe result.
+    ///
+    /// This is the SDK equivalent of checking whether the engine process is
+    /// live enough to keep serving admin traffic.
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// let probe = client.engine().livez().await?;
+    /// println!("livez={:?}", probe.status);
+    /// # Ok(())
+    /// # }
+    /// ```
     pub async fn livez(&self) -> Result<engine::ProbeResponse, Error> {
         self.backend.engine_livez().await
     }
 
-    /// Returns the global readiness probe response.
+    /// Returns the engine readiness probe result.
+    ///
+    /// Use this when orchestration or callers need to know whether the engine
+    /// currently considers itself ready.
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// let probe = client.engine().readyz().await?;
+    /// println!("readyz={:?}", probe.status);
+    /// # Ok(())
+    /// # }
+    /// ```
     pub async fn readyz(&self) -> Result<engine::ProbeResponse, Error> {
         self.backend.engine_readyz().await
     }
 }
 
-/// Pipeline-group-scoped admin client.
+/// Group-scoped admin client.
 #[derive(Clone, Copy)]
-pub struct PipelineGroupsClient<'a> {
+pub struct GroupsClient<'a> {
     backend: &'a dyn AdminBackend,
 }
 
-impl PipelineGroupsClient<'_> {
-    /// Returns pipeline-group status.
-    pub async fn status(&self) -> Result<pipeline_groups::Status, Error> {
-        self.backend.pipeline_groups_status().await
+impl GroupsClient<'_> {
+    /// Returns a group-wide status snapshot across logical pipelines.
+    ///
+    /// Use this as a fleet-style overview when you do not need full
+    /// engine-wide detail from [`EngineClient::status`].
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// let status = client.groups().status().await?;
+    /// println!("pipelines={}", status.pipelines.len());
+    /// # Ok(())
+    /// # }
+    /// ```
+    pub async fn status(&self) -> Result<groups::Status, Error> {
+        self.backend.groups_status().await
     }
 
-    /// Requests shutdown for all pipelines.
+    /// Requests coordinated shutdown for all running logical pipelines.
+    ///
+    /// Use `options.wait` to choose whether the call should return immediately
+    /// with the server's current shutdown response or wait up to
+    /// `options.timeout_secs` for a terminal shutdown result.
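+    ///
+    /// To stop a single logical pipeline instead of the whole fleet, see
+    /// [`PipelinesClient::shutdown`].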
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{
+    /// #     groups, operations, AdminClient, AdminEndpoint, HttpAdminClientSettings,
+    /// # };
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// let response = client
+    ///     .groups()
+    ///     .shutdown(&operations::OperationOptions {
+    ///         wait: true,
+    ///         timeout_secs: 30,
+    ///     })
+    ///     .await?;
+    ///
+    /// if matches!(
+    ///     response.status,
+    ///     groups::ShutdownStatus::Failed | groups::ShutdownStatus::Timeout
+    /// ) {
+    ///     eprintln!("shutdown issues: {:?}", response.errors);
+    /// }
+    /// # Ok(())
+    /// # }
+    /// ```
     pub async fn shutdown(
         &self,
         options: &operations::OperationOptions,
-    ) -> Result<pipeline_groups::ShutdownResponse, Error> {
-        self.backend.pipeline_groups_shutdown(options).await
+    ) -> Result<groups::ShutdownResponse, Error> {
+        self.backend.groups_shutdown(options).await
     }
 }
 
@@ -248,7 +393,202 @@ pub struct PipelinesClient<'a> {
 }
 
 impl PipelinesClient<'_> {
-    /// Returns status for one pipeline.
+    /// Returns the committed live configuration for one logical pipeline.
+    ///
+    /// Use this when you need the configuration that the controller currently
+    /// treats as active. This does not include per-core runtime progress or
+    /// overlapping-instance state; use [`Self::status`] for runtime status.
+    ///
+    /// Returns `Ok(None)` when the logical pipeline is not found.
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// if let Some(details) = client
+    ///     .pipelines()
+    ///     .details("tenant-a", "ingest")
+    ///     .await?
+    /// {
+    ///     println!("active_generation={:?}", details.active_generation);
+    /// }
+    /// # Ok(())
+    /// # }
+    /// ```
+    pub async fn details(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+    ) -> Result<Option<pipelines::PipelineDetails>, Error> {
+        self.backend
+            .pipeline_details(pipeline_group_id, pipeline_id)
+            .await
+    }
+
+    /// Submits a live reconfiguration request for one logical pipeline.
+    ///
+    /// The controller may treat the request as a create, resize, replace, or
+    /// no-op depending on how the submitted configuration differs from the
+    /// current committed pipeline.
+    ///
+    /// With `options.wait = false`, this returns as soon as the request has
+    /// either been accepted for background execution or already completed,
+    /// yielding [`pipelines::ReconfigureOutcome::Accepted`] or
+    /// [`pipelines::ReconfigureOutcome::Completed`].
+    ///
+    /// With `options.wait = true`, this waits up to `options.timeout_secs` for
+    /// a terminal result and returns the latest rollout snapshot as
+    /// [`pipelines::ReconfigureOutcome::Completed`],
+    /// [`pipelines::ReconfigureOutcome::Failed`], or
+    /// [`pipelines::ReconfigureOutcome::TimedOut`].
+    ///
+    /// If the server rejects the request before a rollout starts, this returns
+    /// [`Error::AdminOperation`].
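+    ///
+    /// After an [`pipelines::ReconfigureOutcome::Accepted`] outcome, use
+    /// [`Self::rollout_status`] with the returned `rollout_id` to poll the
+    /// rollout to a terminal state.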
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{
+    /// #     config::pipeline::{PipelineConfigBuilder, PipelineType},
+    /// #     operations, pipelines, AdminClient, AdminEndpoint, HttpAdminClientSettings,
+    /// # };
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// # let request = pipelines::ReconfigureRequest {
+    /// #     pipeline: PipelineConfigBuilder::new()
+    /// #         .add_receiver("ingress", "receiver:otlp", None)
+    /// #         .add_exporter("egress", "exporter:debug", None)
+    /// #         .to("ingress", "egress")
+    /// #         .build(PipelineType::Otap, "tenant-a", "ingest")?,
+    /// #     step_timeout_secs: 60,
+    /// #     drain_timeout_secs: 60,
+    /// # };
+    /// let outcome = client
+    ///     .pipelines()
+    ///     .reconfigure(
+    ///         "tenant-a",
+    ///         "ingest",
+    ///         &request,
+    ///         &operations::OperationOptions {
+    ///             wait: true,
+    ///             timeout_secs: 120,
+    ///         },
+    ///     )
+    ///     .await?;
+    ///
+    /// match outcome {
+    ///     pipelines::ReconfigureOutcome::Completed(status) => {
+    ///         println!("rolled out generation {}", status.target_generation);
+    ///     }
+    ///     pipelines::ReconfigureOutcome::Accepted(status) => {
+    ///         println!("poll rollout {}", status.rollout_id);
+    ///     }
+    ///     pipelines::ReconfigureOutcome::Failed(status)
+    ///     | pipelines::ReconfigureOutcome::TimedOut(status) => {
+    ///         eprintln!("rollout state: {:?}", status.state);
+    ///     }
+    /// }
+    /// # Ok(())
+    /// # }
+    /// ```
+    pub async fn reconfigure(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        request: &pipelines::ReconfigureRequest,
+        options: &operations::OperationOptions,
+    ) -> Result<pipelines::ReconfigureOutcome, Error> {
+        self.backend
+            .pipeline_reconfigure(pipeline_group_id, pipeline_id, request, options)
+            .await
+    }
+
+    /// Returns the latest known status for one previously created rollout.
+    ///
+    /// Use the `rollout_id` returned from [`Self::reconfigure`] to poll an
+    /// asynchronous reconfiguration operation after an
+    /// [`pipelines::ReconfigureOutcome::Accepted`] result.
+    ///
+    /// Returns `Ok(None)` when the requested rollout status resource is not
+    /// found. Terminal rollout history is retained only within a bounded
+    /// in-memory window, so older rollout ids may also return `Ok(None)` after
+    /// eviction.
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// let rollout_id = "rollout-42";
+    ///
+    /// if let Some(status) = client
+    ///     .pipelines()
+    ///     .rollout_status("tenant-a", "ingest", rollout_id)
+    ///     .await?
+    /// {
+    ///     println!("rollout_state={:?}", status.state);
+    /// }
+    /// # Ok(())
+    /// # }
+    /// ```
+    pub async fn rollout_status(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        rollout_id: &str,
+    ) -> Result<Option<pipelines::PipelineRolloutStatus>, Error> {
+        self.backend
+            .pipeline_rollout_status(pipeline_group_id, pipeline_id, rollout_id)
+            .await
+    }
+
+    /// Returns the current runtime status for one logical pipeline.
+    ///
+    /// Use this when you need per-core phase, overlapping-instance state,
+    /// rollout summaries, or other runtime progress. Use [`Self::details`] when
+    /// you need the committed live configuration instead.
+    ///
+    /// Returns `Ok(None)` when the logical pipeline is not found.
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// if let Some(status) = client
+    ///     .pipelines()
+    ///     .status("tenant-a", "ingest")
+    ///     .await?
+    /// {
+    ///     println!("running_cores={}", status.running_cores);
+    /// }
+    /// # Ok(())
+    /// # }
+    /// ```
     pub async fn status(
         &self,
         pipeline_group_id: &str,
@@ -259,7 +599,137 @@ impl PipelinesClient<'_> {
             .await
     }
 
-    /// Returns the liveness probe for one pipeline.
+    /// Requests shutdown of the currently running instances for one logical pipeline.
+    ///
+    /// With `options.wait = false`, this returns as soon as the shutdown has
+    /// either been accepted for background execution or already completed,
+    /// yielding [`pipelines::ShutdownOutcome::Accepted`] or
+    /// [`pipelines::ShutdownOutcome::Completed`].
+    ///
+    /// With `options.wait = true`, this waits up to `options.timeout_secs` for
+    /// a terminal result and returns the latest shutdown snapshot as
+    /// [`pipelines::ShutdownOutcome::Completed`],
+    /// [`pipelines::ShutdownOutcome::Failed`], or
+    /// [`pipelines::ShutdownOutcome::TimedOut`].
+    ///
+    /// If the server rejects the request before shutdown work starts, this
+    /// returns [`Error::AdminOperation`].
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{operations, pipelines, AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// let outcome = client
+    ///     .pipelines()
+    ///     .shutdown(
+    ///         "tenant-a",
+    ///         "ingest",
+    ///         &operations::OperationOptions {
+    ///             wait: true,
+    ///             timeout_secs: 60,
+    ///         },
+    ///     )
+    ///     .await?;
+    ///
+    /// match outcome {
+    ///     pipelines::ShutdownOutcome::Completed(status) => {
+    ///         println!("shutdown completed: {}", status.shutdown_id);
+    ///     }
+    ///     pipelines::ShutdownOutcome::Accepted(status) => {
+    ///         println!("poll shutdown {}", status.shutdown_id);
+    ///     }
+    ///     pipelines::ShutdownOutcome::Failed(status)
+    ///     | pipelines::ShutdownOutcome::TimedOut(status) => {
+    ///         eprintln!("shutdown state: {}", status.state);
+    ///     }
+    /// }
+    /// # Ok(())
+    /// # }
+    /// ```
+    pub async fn shutdown(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        options: &operations::OperationOptions,
+    ) -> Result<pipelines::ShutdownOutcome, Error> {
+        self.backend
+            .pipeline_shutdown(pipeline_group_id, pipeline_id, options)
+            .await
+    }
+
+    /// Returns the latest known status for one previously created shutdown operation.
+    ///
+    /// Use the `shutdown_id` returned from [`Self::shutdown`] to poll an
+    /// asynchronous shutdown after an
+    /// [`pipelines::ShutdownOutcome::Accepted`] result.
+    ///
+    /// Returns `Ok(None)` when the requested shutdown status resource is not
+    /// found. Terminal shutdown history is retained only within a bounded
+    /// in-memory window, so older shutdown ids may also return `Ok(None)` after
+    /// eviction.
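+    ///
+    /// Because of that eviction window, long-delayed pollers should treat
+    /// `Ok(None)` as "no longer known" rather than "never existed".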
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// let shutdown_id = "shutdown-42";
+    ///
+    /// if let Some(status) = client
+    ///     .pipelines()
+    ///     .shutdown_status("tenant-a", "ingest", shutdown_id)
+    ///     .await?
+    /// {
+    ///     println!("shutdown_state={}", status.state);
+    /// }
+    /// # Ok(())
+    /// # }
+    /// ```
+    pub async fn shutdown_status(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        shutdown_id: &str,
+    ) -> Result<Option<pipelines::PipelineShutdownStatus>, Error> {
+        self.backend
+            .pipeline_shutdown_status(pipeline_group_id, pipeline_id, shutdown_id)
+            .await
+    }
+
+    /// Returns the liveness probe result for one logical pipeline.
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{pipelines, AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// let probe = client.pipelines().livez("tenant-a", "ingest").await?;
+    ///
+    /// if probe.status == pipelines::ProbeStatus::Failed {
+    ///     eprintln!("pipeline is not live: {:?}", probe.message);
+    /// }
+    /// # Ok(())
+    /// # }
+    /// ```
     pub async fn livez(
         &self,
         pipeline_group_id: &str,
@@ -270,7 +740,27 @@ impl PipelinesClient<'_> {
             .await
     }
 
-    /// Returns the readiness probe for one pipeline.
+    /// Returns the readiness probe result for one logical pipeline.
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{pipelines, AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// let probe = client.pipelines().readyz("tenant-a", "ingest").await?;
+    ///
+    /// if probe.status == pipelines::ProbeStatus::Failed {
+    ///     eprintln!("pipeline is not ready: {:?}", probe.message);
+    /// }
+    /// # Ok(())
+    /// # }
+    /// ```
     pub async fn readyz(
         &self,
         pipeline_group_id: &str,
@@ -289,7 +779,39 @@ pub struct TelemetryClient<'a> {
 }
 
 impl TelemetryClient<'_> {
-    /// Returns retained logs or `None` when the logs endpoint is unavailable.
+    /// Returns retained admin logs.
+    ///
+    /// Use [`telemetry::LogsQuery`] to request only entries newer than a known
+    /// sequence number or to cap the number of returned entries.
+    ///
+    /// Returns `Ok(None)` when retained logs are not available on the target
+    /// engine.
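+    ///
+    /// To tail logs incrementally, callers can typically feed the `next_seq`
+    /// value from the previous response back in as the next query's `after`
+    /// value.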
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{telemetry, AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// let logs = client
+    ///     .telemetry()
+    ///     .logs(&telemetry::LogsQuery {
+    ///         after: Some(1_000),
+    ///         limit: Some(200),
+    ///     })
+    ///     .await?;
+    ///
+    /// if let Some(logs) = logs {
+    ///     println!("next_seq={}", logs.next_seq);
+    /// }
+    /// # Ok(())
+    /// # }
+    /// ```
     pub async fn logs(
         &self,
         query: &telemetry::LogsQuery,
@@ -297,7 +819,35 @@ impl TelemetryClient<'_> {
         self.backend.telemetry_logs(query).await
     }
 
-    /// Returns full structured metrics.
+    /// Returns structured metrics with descriptor metadata for each metric field.
+    ///
+    /// Use this form when callers need metric names, units, instrument kinds,
+    /// or temporality alongside metric values.
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{telemetry, AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// let metrics = client
+    ///     .telemetry()
+    ///     .metrics(&telemetry::MetricsOptions::default())
+    ///     .await?;
+    ///
+    /// if let Some(metric_set) = metrics.metric_sets.first() {
+    ///     for point in &metric_set.metrics {
+    ///         println!("{} {}", point.metadata.name, point.metadata.unit);
+    ///     }
+    /// }
+    /// # Ok(())
+    /// # }
+    /// ```
     pub async fn metrics(
         &self,
         options: &telemetry::MetricsOptions,
@@ -305,7 +855,33 @@ impl TelemetryClient<'_> {
         self.backend.telemetry_metrics(options).await
     }
 
-    /// Returns compact structured metrics.
+    /// Returns structured metrics without per-field descriptor metadata.
+    ///
+    /// Use this form when callers only need current metric values and want a
+    /// smaller response payload than [`Self::metrics`].
+    ///
+    /// # Examples
+    ///
+    /// ```rust
+    /// # use otap_df_admin_api::{telemetry, AdminClient, AdminEndpoint, HttpAdminClientSettings};
+    /// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
+    /// # let client = AdminClient::builder()
+    /// #     .http(HttpAdminClientSettings::new(AdminEndpoint::http(
+    /// #         "engine-a.internal.example",
+    /// #         8080,
+    /// #     )))
+    /// #     .build()?;
+    /// let metrics = client
+    ///     .telemetry()
+    ///     .metrics_compact(&telemetry::MetricsOptions::default())
+    ///     .await?;
+    ///
+    /// if let Some(metric_set) = metrics.metric_sets.first() {
+    ///     println!("value_count={}", metric_set.metrics.len());
+    /// }
+    /// # Ok(())
+    /// # }
+    /// ```
     pub async fn metrics_compact(
         &self,
         options: &telemetry::MetricsOptions,
@@ -320,17 +896,35 @@ pub(crate) trait AdminBackend: Send + Sync {
     async fn engine_livez(&self) -> Result<engine::ProbeResponse, Error>;
     async fn engine_readyz(&self) -> Result<engine::ProbeResponse, Error>;
 
-    async fn pipeline_groups_status(&self) -> Result<pipeline_groups::Status, Error>;
-    async fn pipeline_groups_shutdown(
+    async fn groups_status(&self) -> Result<groups::Status, Error>;
+    async fn groups_shutdown(
         &self,
         options: &operations::OperationOptions,
-    ) -> Result<pipeline_groups::ShutdownResponse, Error>;
+    ) -> Result<groups::ShutdownResponse, Error>;
 
     async fn pipeline_status(
         &self,
         pipeline_group_id: &str,
         pipeline_id: &str,
     ) -> Result<Option<pipelines::Status>, Error>;
+    async fn pipeline_details(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+    ) -> Result<Option<pipelines::PipelineDetails>, Error>;
+    async fn pipeline_reconfigure(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        request: &pipelines::ReconfigureRequest,
+        options: &operations::OperationOptions,
+    ) -> Result<pipelines::ReconfigureOutcome, Error>;
+    async fn pipeline_rollout_status(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        rollout_id: &str,
+    ) -> Result<Option<pipelines::PipelineRolloutStatus>, Error>;
     async fn pipeline_livez(
         &self,
         pipeline_group_id: &str,
@@ -341,6 +935,18 @@ pub(crate) trait AdminBackend: Send + Sync {
         pipeline_group_id: &str,
         pipeline_id: &str,
     ) -> Result<pipelines::ProbeResponse, Error>;
+    async fn pipeline_shutdown(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        options: &operations::OperationOptions,
+    ) -> Result<pipelines::ShutdownOutcome, Error>;
+    async fn pipeline_shutdown_status(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        shutdown_id: &str,
+    ) -> Result<Option<pipelines::PipelineShutdownStatus>, Error>;
 
     async fn telemetry_logs(
         &self,
diff --git a/rust/otap-dataflow/crates/admin-api/src/endpoint.rs b/rust/otap-dataflow/crates/admin-api/src/endpoint.rs
index 455dd1d728..87e31bf5ce 100644
--- a/rust/otap-dataflow/crates/admin-api/src/endpoint.rs
+++ b/rust/otap-dataflow/crates/admin-api/src/endpoint.rs
@@ -17,7 +17,7 @@ pub enum AdminScheme {
 }
 
 impl AdminScheme {
-    /// Returns the URL scheme string.
+    /// Returns the URL scheme string used when building admin endpoint URLs.
     #[must_use]
     pub const fn as_str(self) -> &'static str {
         match self {
@@ -49,7 +49,9 @@ pub struct AdminEndpoint {
 }
 
 impl AdminEndpoint {
-    /// Creates a new endpoint.
+    /// Creates an endpoint from explicit scheme, host, and port components.
+    ///
+    /// This validates the endpoint fields before returning.
     pub fn new(
         scheme: AdminScheme,
         host: impl Into<String>,
@@ -65,7 +67,10 @@ impl AdminEndpoint {
         Ok(endpoint)
     }
 
-    /// Creates an HTTP endpoint.
+    /// Creates an HTTP endpoint for direct plaintext admin access.
+    ///
+    /// This constructor does not fail; validation happens later when the client
+    /// is built or when the endpoint is used to construct URLs.
     #[must_use]
     pub fn http(host: impl Into<String>, port: u16) -> Self {
         Self {
@@ -76,7 +81,10 @@ impl AdminEndpoint {
         }
     }
 
-    /// Creates an HTTPS endpoint.
+    /// Creates an HTTPS endpoint for admin access over TLS.
+    ///
+    /// Pair this with [`HttpAdminClientSettings::with_tls`](crate::HttpAdminClientSettings::with_tls)
+    /// when the server or upstream gateway requires custom CA trust or mTLS.
     #[must_use]
     pub fn https(host: impl Into<String>, port: u16) -> Self {
         Self {
@@ -87,13 +95,22 @@ impl AdminEndpoint {
         }
     }
 
-    /// Creates an endpoint from a socket address using HTTP.
+    /// Creates an HTTP endpoint from a socket address.
+    ///
+    /// This is mainly useful for local engines discovered as a concrete bind
+    /// address.
     #[must_use]
    pub fn from_socket_addr(addr: SocketAddr) -> Self {
         Self::http(addr.ip().to_string(), addr.port())
     }
 
     /// Creates an endpoint from a full base URL.
+    ///
+    /// Use this when the admin API is exposed behind a gateway or reverse proxy
+    /// and you want to preserve the URL prefix in `base_path`.
+    ///
+    /// Query strings and fragments are rejected because SDK routes are built by
+    /// appending `/api/v1/...` path segments to this base URL.
     pub fn from_url(url: &str) -> Result<Self, EndpointError> {
         let parsed = Url::parse(url).map_err(|err| EndpointError::UrlParse {
             url: url.to_string(),
@@ -136,14 +153,20 @@ impl AdminEndpoint {
         Ok(endpoint)
     }
 
-    /// Sets the base path used for URL construction.
+    /// Sets the URL path prefix used when building admin request URLs.
+    ///
+    /// This is useful when the engine is published behind a path-prefixed
+    /// gateway such as `/engine-a`.
     pub fn with_base_path(mut self, base_path: impl Into<String>) -> Result<Self, EndpointError> {
         self.base_path = Some(base_path.into());
         self.validate()?;
         Ok(self)
     }
 
-    /// Validates the endpoint fields.
+    /// Validates the endpoint fields without building a client.
+    ///
+    /// Most callers do not need to call this directly because client creation
+    /// and URL construction validate automatically.
     pub fn validate(&self) -> Result<(), EndpointError> {
         if self.host.trim().is_empty() {
             return Err(EndpointError::EmptyHost);
@@ -159,7 +182,11 @@ impl AdminEndpoint {
         Ok(())
     }
 
-    /// Builds a URL for the provided path segments.
+    /// Builds a concrete URL by appending path segments to this endpoint.
+    ///
+    /// Most SDK callers do not need this directly because the built-in HTTP
+    /// transport uses it internally. It is mainly useful for custom transports,
+    /// tests, or diagnostics.
    pub fn url_for_segments<'a, I>(&self, segments: I) -> Result<Url, EndpointError>
     where
         I: IntoIterator<Item = &'a str>,
diff --git a/rust/otap-dataflow/crates/admin-api/src/error.rs b/rust/otap-dataflow/crates/admin-api/src/error.rs
index 41947b57e3..02a4f3b0f0 100644
--- a/rust/otap-dataflow/crates/admin-api/src/error.rs
+++ b/rust/otap-dataflow/crates/admin-api/src/error.rs
@@ -3,6 +3,7 @@
 
 //! Error types for the public admin SDK.
 
+use crate::operations::OperationError;
 use thiserror::Error;
 
 /// Endpoint validation and URL construction errors.
@@ -89,6 +90,19 @@ pub enum Error {
         details: String,
     },
 
+    /// The server rejected a live admin operation request before work started.
+    ///
+    /// This wraps a typed [`OperationError`] for request-level rejections such
+    /// as not found, conflict, or invalid request. Use the operation outcome
+    /// enums for requests that were accepted and later failed or timed out.
+    #[error("admin operation rejected with status {status}: {error:?}")]
+    AdminOperation {
+        /// HTTP status code.
+        status: u16,
+        /// Typed control-plane rejection details.
+        error: OperationError,
+    },
+
     /// Remote endpoint returned an unexpected HTTP status.
     #[error("admin endpoint returned unexpected status {status} for {method} {url}")]
     RemoteStatus {
diff --git a/rust/otap-dataflow/crates/admin-api/src/http_backend.rs b/rust/otap-dataflow/crates/admin-api/src/http_backend.rs
index fa7d370d70..c7487a9c8a 100644
--- a/rust/otap-dataflow/crates/admin-api/src/http_backend.rs
+++ b/rust/otap-dataflow/crates/admin-api/src/http_backend.rs
@@ -5,7 +5,7 @@
 
 use crate::client::{AdminBackend, HttpAdminClientSettings};
 use crate::endpoint::{AdminAuth, AdminEndpoint, AdminScheme};
-use crate::{Error, engine, operations, pipeline_groups, pipelines, telemetry};
+use crate::{Error, engine, groups, operations, pipelines, telemetry};
 use async_trait::async_trait;
 use reqwest::{Certificate, ClientBuilder, Identity, Method, Url};
 use serde::de::DeserializeOwned;
@@ -16,6 +16,7 @@ use std::sync::OnceLock;
 struct RawRequest {
     method: Method,
     url: Url,
+    body: Option<Vec<u8>>,
 }
 
 struct RawResponse {
@@ -50,7 +51,7 @@ impl HttpBackend {
         expected_statuses: &[u16],
     ) -> Result<(u16, T), Error> {
         let (status, body) = self
-            .request_raw(method, segments, query, expected_statuses)
+            .request_raw(method, segments, query, None, expected_statuses)
             .await?;
         Ok((status, self.decode_json(&body)?))
     }
@@ -63,7 +64,7 @@ impl HttpBackend {
         expected_statuses: &[u16],
     ) -> Result<pipelines::ProbeResponse, Error> {
         let (status_code, body) = self
-            .request_raw(method, segments, query, expected_statuses)
+            .request_raw(method, segments, query, None, expected_statuses)
             .await?;
         let status = match status_code {
             200 => pipelines::ProbeStatus::Ok,
@@ -81,6 +82,7 @@ impl HttpBackend {
         method: Method,
         segments: &[&str],
         query: &[(&str, String)],
+        body: Option<Vec<u8>>,
         expected_statuses: &[u16],
     ) -> Result<(u16, Vec<u8>), Error> {
         let mut url = self.endpoint.url_for_segments(segments.iter().copied())?;
@@ -95,6 +97,7 @@ impl HttpBackend {
             .send(RawRequest {
                 method: method.clone(),
                 url,
+                body,
             })
             .await?;
 
@@ -111,13 +114,19 @@ impl HttpBackend {
     }
 
     async fn send(&self, request: RawRequest) -> Result<RawResponse, Error> {
-        let RawRequest { method, url } = request;
-        let builder = self.client.request(method, url.clone());
+        let RawRequest { method, url, body } = request;
+        let mut builder = self.client.request(method, url.clone());
 
         match self.auth {
             AdminAuth::None => {}
         }
 
+        if let Some(body) = body {
+            builder = builder
+                .header(reqwest::header::CONTENT_TYPE, "application/json")
+                .body(body);
+        }
+
         let response = builder.send().await.map_err(|err| Error::Transport {
             details: err.to_string(),
         })?;
@@ -145,6 +154,11 @@ impl HttpBackend {
             details: err.to_string(),
         })
     }
+
+    fn decode_operation_error(&self, status: u16, body: &[u8]) -> Result<Error, Error> {
+        let error = self.decode_json::<operations::OperationError>(body)?;
+        Ok(Error::AdminOperation { status, error })
+    }
 }
 
 #[async_trait]
impl AdminBackend for HttpBackend {
@@ -167,25 +181,20 @@ impl AdminBackend for HttpBackend {
             .map(|(_, body)| body)
     }
 
-    async fn pipeline_groups_status(&self) -> Result<pipeline_groups::Status, Error> {
-        self.request_json(
-            Method::GET,
-            &["api", "v1", "pipeline-groups", "status"],
-            &[],
-            &[200],
-        )
-        .await
-        .map(|(_, body)| body)
+    async fn groups_status(&self) -> Result<groups::Status, Error> {
+        self.request_json(Method::GET, &["api", "v1", "groups", "status"], &[], &[200])
+            .await
+            .map(|(_, body)| body)
     }
 
-    async fn pipeline_groups_shutdown(
+    async fn groups_shutdown(
         &self,
         options: &operations::OperationOptions,
-    ) -> Result<pipeline_groups::ShutdownResponse, Error> {
+    ) -> Result<groups::ShutdownResponse, Error> {
         let query = options.to_query_pairs();
         self.request_json(
             Method::POST,
-            &["api", "v1", "pipeline-groups", "shutdown"],
+            &["api", "v1", "groups", "shutdown"],
             &query,
             &[200, 202, 500, 504],
         )
@@ -193,6 +202,111 @@ impl AdminBackend for HttpBackend {
         .await
         .map(|(_, body)| body)
     }
 
+    async fn pipeline_details(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+    ) -> Result<Option<pipelines::PipelineDetails>, Error> {
+        let (status, body) = self
+            .request_raw(
+                Method::GET,
+                &[
+                    "api",
+                    "v1",
+                    "groups",
+                    pipeline_group_id,
+                    "pipelines",
+                    pipeline_id,
+                ],
+                &[],
+                None,
+                &[200, 404],
+            )
+            .await?;
+        if status == 404 {
+            return Ok(None);
+        }
+        self.decode_json(&body).map(Some)
+    }
+
+    async fn pipeline_reconfigure(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        request: &pipelines::ReconfigureRequest,
+        options: &operations::OperationOptions,
+    ) -> Result<pipelines::ReconfigureOutcome, Error> {
+        let query = options.to_query_pairs();
+        let (status, body) = self
+            .request_raw(
+                Method::PUT,
+                &[
+                    "api",
+                    "v1",
+                    "groups",
+                    pipeline_group_id,
+                    "pipelines",
+                    pipeline_id,
+                ],
+                &query,
+                Some(
+                    serde_json::to_vec(request).map_err(|err| Error::ClientConfig {
+                        details: format!("failed to encode reconfigure request: {err}"),
+                    })?,
+                ),
+                &[200, 202, 404, 409, 422, 500, 504],
+            )
+            .await?;
+
+        match status {
+            200 => self
+                .decode_json(&body)
+                .map(pipelines::ReconfigureOutcome::Completed),
+            202 => self
+                .decode_json(&body)
+                .map(pipelines::ReconfigureOutcome::Accepted),
+            409 => match self.decode_json::<pipelines::PipelineRolloutStatus>(&body) {
+                Ok(status) => Ok(pipelines::ReconfigureOutcome::Failed(status)),
+                Err(_) => Err(self.decode_operation_error(status, &body)?),
+            },
+            504 => self
+                .decode_json(&body)
+                .map(pipelines::ReconfigureOutcome::TimedOut),
+            404 | 422 | 500 => Err(self.decode_operation_error(status, &body)?),
+            _ => unreachable!("request_raw should have filtered unexpected statuses"),
+        }
+    }
+
+    async fn pipeline_rollout_status(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        rollout_id: &str,
+    ) -> Result<Option<pipelines::PipelineRolloutStatus>, Error> {
+        let (status, body) = self
+            .request_raw(
+                Method::GET,
+                &[
+                    "api",
+                    "v1",
+                    "groups",
+                    pipeline_group_id,
+                    "pipelines",
+                    pipeline_id,
+                    "rollouts",
+                    rollout_id,
+                ],
+                &[],
+                None,
+                &[200, 404],
+            )
+            .await?;
+        if status == 404 {
+            return Ok(None);
+        }
+        self.decode_json(&body).map(Some)
+    }
+
     async fn pipeline_status(
         &self,
         pipeline_group_id: &str,
@@ -203,7 +317,7 @@ impl AdminBackend for HttpBackend {
             &[
                 "api",
                 "v1",
-                "pipeline-groups",
+                "groups",
                 pipeline_group_id,
                 "pipelines",
                 pipeline_id,
@@ -226,7 +340,7 @@ impl AdminBackend for HttpBackend {
             &[
                 "api",
                 "v1",
-                "pipeline-groups",
+                "groups",
                 pipeline_group_id,
                 "pipelines",
                 pipeline_id,
@@ -248,7 +362,7 @@ impl AdminBackend for HttpBackend {
             &[
                 "api",
                 "v1",
-                "pipeline-groups",
+                "groups",
                 pipeline_group_id,
                 "pipelines",
                 pipeline_id,
@@ -260,6 +374,80 @@ impl AdminBackend for HttpBackend {
             .await
     }
 
+    async fn pipeline_shutdown(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        options: &operations::OperationOptions,
+    ) -> Result<pipelines::ShutdownOutcome, Error> {
+        let query = options.to_query_pairs();
+        let (status, body) = self
+            .request_raw(
+                Method::POST,
+                &[
+                    "api",
+                    "v1",
+                    "groups",
+                    pipeline_group_id,
+                    "pipelines",
+                    pipeline_id,
+                    "shutdown",
+                ],
+                &query,
+                None,
+                &[200, 202, 404, 409, 422, 500, 504],
+            )
+            .await?;
+
+        match status {
+            200 => self
+                .decode_json(&body)
+                .map(pipelines::ShutdownOutcome::Completed),
+            202 => self
+                .decode_json(&body)
+                .map(pipelines::ShutdownOutcome::Accepted),
+            409 => match self.decode_json::<pipelines::PipelineShutdownStatus>(&body) {
+                Ok(status) => Ok(pipelines::ShutdownOutcome::Failed(status)),
+                Err(_) => Err(self.decode_operation_error(status, &body)?),
+            },
+            504 => self
+                .decode_json(&body)
+                .map(pipelines::ShutdownOutcome::TimedOut),
+            404 | 422 | 500 => Err(self.decode_operation_error(status, &body)?),
+            _ => unreachable!("request_raw should have filtered unexpected statuses"),
+        }
+    }
+
+    async fn pipeline_shutdown_status(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        shutdown_id: &str,
+    ) -> Result<Option<pipelines::PipelineShutdownStatus>, Error> {
+        let (status, body) = self
+            .request_raw(
+                Method::GET,
+                &[
+                    "api",
+                    "v1",
+                    "groups",
+                    pipeline_group_id,
+                    "pipelines",
+                    pipeline_id,
+                    "shutdowns",
+                    shutdown_id,
+                ],
+                &[],
+                None,
+                &[200, 404],
+            )
+            .await?;
+        if status == 404 {
+            return Ok(None);
+        }
+        self.decode_json(&body).map(Some)
+    }
+
     async fn telemetry_logs(
         &self,
         query: &telemetry::LogsQuery,
@@ -270,6 +458,7 @@ impl AdminBackend for HttpBackend {
                 Method::GET,
                 &["api", "v1", "telemetry", "logs"],
                 &query_pairs,
+                None,
                 &[200, 404],
             )
             .await?;
@@ -533,15 +722,16 @@ fn ensure_crypto_provider() -> Result<(), Error> {
 mod tests {
     use super::*;
     use crate::config::tls::{TlsClientConfig, TlsConfig};
-    use crate::{AdminClient, engine, operations, pipeline_groups, pipelines, telemetry};
+    use crate::{AdminClient, engine, groups, operations, pipelines, telemetry};
     use otap_test_tls_certs::{ExtendedKeyUsage, generate_ca};
     use rustls_pki_types::{CertificateDer, PrivateKeyDer, pem::PemObject};
+    use serde_json::json;
     use std::sync::Arc;
     use tempfile::tempdir;
     use tokio::io::{AsyncReadExt, AsyncWriteExt};
     use tokio::net::TcpListener;
     use tokio_rustls::TlsAcceptor;
-    use wiremock::matchers::{method, path, query_param};
+    use wiremock::matchers::{body_json, method, path, query_param};
     use wiremock::{Mock, MockServer, ResponseTemplate};
 
    fn client(server: &MockServer) -> AdminClient {
@@ -552,6 +742,18 @@ mod tests {
             .expect("client should build")
     }
 
+    fn minimal_pipeline_json() -> serde_json::Value {
+        json!({
+            "type": "otap",
+            "nodes": {
+                "recv": {
+                    "type": "receiver:fake",
+                    "config": {}
+                }
+            }
+        })
+    }
+
     async fn start_https_json_server(
         server_cert_pem: &str,
         server_key_pem: &str,
@@ -682,25 +884,29 @@ mod tests {
         assert_eq!(response.status, engine::ProbeStatus::Failed);
     }
 
+    /// Scenario: the SDK calls the group shutdown endpoint with wait/query
+    /// options and the server returns a non-200 success body.
+    /// Guarantees: the HTTP backend targets `/api/v1/groups/shutdown`,
+    /// forwards the query parameters, and still decodes the accepted response.
     #[tokio::test]
-    async fn pipeline_groups_shutdown_accepts_query_and_non_200_success_shapes() {
+    async fn groups_shutdown_accepts_query_and_non_200_success_shapes() {
         let server = MockServer::start().await;
         Mock::given(method("POST"))
-            .and(path("/api/v1/pipeline-groups/shutdown"))
+            .and(path("/api/v1/groups/shutdown"))
             .and(query_param("wait", "true"))
             .and(query_param("timeout_secs", "30"))
-            .respond_with(ResponseTemplate::new(202).set_body_json(
-                pipeline_groups::ShutdownResponse {
-                    status: pipeline_groups::ShutdownStatus::Accepted,
-                    errors: None,
-                    duration_ms: None,
-                },
-            ))
+            .respond_with(
+                ResponseTemplate::new(202).set_body_json(groups::ShutdownResponse {
+                    status: groups::ShutdownStatus::Accepted,
+                    errors: None,
+                    duration_ms: None,
+                }),
+            )
             .mount(&server)
             .await;
 
         let response = client(&server)
-            .pipeline_groups()
+            .groups()
             .shutdown(&operations::OperationOptions {
                 wait: true,
                 timeout_secs: 30,
@@ -708,22 +914,47 @@ mod tests {
             })
             .await
             .expect("shutdown should decode");
 
-        assert_eq!(response.status, pipeline_groups::ShutdownStatus::Accepted);
+        assert_eq!(response.status, groups::ShutdownStatus::Accepted);
+    }
+
+    /// Scenario: a caller requests group status through the public SDK.
+    /// Guarantees: the HTTP backend uses the `/api/v1/groups/status` route
+    /// instead of the older pipeline-groups path and decodes the payload.
+    #[tokio::test]
+    async fn groups_status_uses_groups_route() {
+        let server = MockServer::start().await;
+        Mock::given(method("GET"))
+            .and(path("/api/v1/groups/status"))
+            .respond_with(ResponseTemplate::new(200).set_body_json(groups::Status {
+                generated_at: "2026-01-01T00:00:00Z".to_string(),
+                pipelines: Default::default(),
+            }))
+            .mount(&server)
+            .await;
+
+        let response = client(&server)
+            .groups()
+            .status()
+            .await
+            .expect("group status should decode");
+        assert_eq!(response.generated_at, "2026-01-01T00:00:00Z");
+    }
 
     #[tokio::test]
     async fn pipeline_status_decodes_optional_payload() {
         let server = MockServer::start().await;
         Mock::given(method("GET"))
-            .and(path(
-                "/api/v1/pipeline-groups/default/pipelines/main/status",
-            ))
+            .and(path("/api/v1/groups/default/pipelines/main/status"))
             .respond_with(
                 ResponseTemplate::new(200).set_body_json(Some(pipelines::Status {
                     conditions: vec![],
                     total_cores: 1,
                     running_cores: 1,
                     cores: Default::default(),
+                    instances: None,
+                    active_generation: None,
+                    serving_generations: None,
+                    rollout: None,
                 })),
             )
             .mount(&server)
@@ -738,11 +969,307 @@ mod tests {
         assert!(response.is_some());
     }
 
+    /// Scenario: the server returns a committed pipeline details payload for an
+    /// existing logical pipeline.
+    /// Guarantees: the SDK surfaces that payload as `Some(...)` rather than
+    /// treating it as an optional or missing resource.
+    #[tokio::test]
+    async fn pipeline_details_returns_some_on_200() {
+        let server = MockServer::start().await;
+        Mock::given(method("GET"))
+            .and(path("/api/v1/groups/default/pipelines/main"))
+            .respond_with(ResponseTemplate::new(200).set_body_json(json!({
+                "pipelineGroupId": "default",
+                "pipelineId": "main",
+                "activeGeneration": 3,
+                "pipeline": minimal_pipeline_json(),
+                "rollout": {
+                    "rolloutId": "rollout-3",
+                    "state": "running",
+                    "targetGeneration": 3,
+                    "startedAt": "2026-01-01T00:00:00Z",
+                    "updatedAt": "2026-01-01T00:00:01Z"
+                }
+            })))
+            .mount(&server)
+            .await;
+
+        let response = client(&server)
+            .pipelines()
+            .details("default", "main")
+            .await
+            .expect("pipeline details should decode");
+
+        assert!(response.is_some());
+    }
+
+    /// Scenario: a caller submits an asynchronous reconfigure request through
+    /// the public SDK.
+    /// Guarantees: the backend serializes the request body and query options
+    /// correctly and maps an accepted rollout response to `Accepted`.
+    #[tokio::test]
+    async fn pipeline_reconfigure_encodes_request_and_decodes_accepted() {
+        let server = MockServer::start().await;
+        let request = pipelines::ReconfigureRequest {
+            pipeline: serde_json::from_value(minimal_pipeline_json())
+                .expect("fixture pipeline should deserialize"),
+            step_timeout_secs: 45,
+            drain_timeout_secs: 30,
+        };
+        Mock::given(method("PUT"))
+            .and(path("/api/v1/groups/default/pipelines/main"))
+            .and(query_param("wait", "false"))
+            .and(query_param("timeout_secs", "120"))
+            .and(body_json(
+                serde_json::to_value(&request).expect("request should serialize"),
+            ))
+            .respond_with(ResponseTemplate::new(202).set_body_json(json!({
+                "rolloutId": "rollout-3",
+                "pipelineGroupId": "default",
+                "pipelineId": "main",
+                "action": "replace",
+                "state": "running",
+                "targetGeneration": 3,
+                "previousGeneration": 2,
+                "startedAt": "2026-01-01T00:00:00Z",
+                "updatedAt": "2026-01-01T00:00:01Z",
+                "cores": []
+            })))
+            .mount(&server)
+            .await;
+
+        let response = client(&server)
+            .pipelines()
+            .reconfigure(
+                "default",
+                "main",
+                &request,
+                &operations::OperationOptions {
+                    wait: false,
+                    timeout_secs: 120,
+                },
+            )
+            .await
+            .expect("reconfigure should decode");
+
+        match response {
+            pipelines::ReconfigureOutcome::Accepted(status) => {
+                assert_eq!(status.rollout_id, "rollout-3");
+                assert_eq!(status.state, pipelines::PipelineRolloutState::Running);
+            }
+            other => panic!("unexpected outcome: {other:?}"),
+        }
+    }
+
+    /// Scenario: a waited reconfigure request reaches a terminal failed rollout
+    /// and the server reports that state with a 409 status body.
+    /// Guarantees: the backend treats this as an operation outcome, not a typed
+    /// request rejection, and returns `ReconfigureOutcome::Failed`.
+    #[tokio::test]
+    async fn pipeline_reconfigure_decodes_failed_outcome_from_409_status_body() {
+        let server = MockServer::start().await;
+        Mock::given(method("PUT"))
+            .and(path("/api/v1/groups/default/pipelines/main"))
+            .and(query_param("wait", "true"))
+            .and(query_param("timeout_secs", "60"))
+            .respond_with(ResponseTemplate::new(409).set_body_json(json!({
+                "rolloutId": "rollout-4",
+                "pipelineGroupId": "default",
+                "pipelineId": "main",
+                "action": "replace",
+                "state": "failed",
+                "targetGeneration": 4,
+                "previousGeneration": 3,
+                "startedAt": "2026-01-01T00:00:00Z",
+                "updatedAt": "2026-01-01T00:00:10Z",
+                "failureReason": "candidate failed admission",
+                "cores": []
+            })))
+            .mount(&server)
+            .await;
+
+        let request = pipelines::ReconfigureRequest {
+            pipeline: serde_json::from_value(minimal_pipeline_json())
+                .expect("fixture pipeline should deserialize"),
+            step_timeout_secs: 60,
+            drain_timeout_secs: 60,
+        };
+
+        let response = client(&server)
+            .pipelines()
+            .reconfigure(
+                "default",
+                "main",
+                &request,
+                &operations::OperationOptions {
+                    wait: true,
+                    timeout_secs: 60,
+                },
+            )
+            .await
+            .expect("failed outcome should decode");
+
+        match response {
+            pipelines::ReconfigureOutcome::Failed(status) => {
+                assert_eq!(status.rollout_id, "rollout-4");
+                assert_eq!(status.state, pipelines::PipelineRolloutState::Failed);
+            }
+            other => panic!("unexpected outcome: {other:?}"),
+        }
+    }
+
+    /// Scenario: the server rejects a reconfigure request before any rollout
+    /// work starts and returns a structured operation error body.
+    /// Guarantees: the backend preserves that rejection as
+    /// `Error::AdminOperation` so callers can distinguish it from transport
+    /// failures and terminal rollout outcomes.
+    #[tokio::test]
+    async fn pipeline_reconfigure_decodes_admin_operation_error() {
+        let server = MockServer::start().await;
+        Mock::given(method("PUT"))
+            .and(path("/api/v1/groups/default/pipelines/main"))
+            .respond_with(ResponseTemplate::new(422).set_body_json(json!({
+                "kind": "invalid_request",
+                "message": "topic runtime mutation is not supported"
+            })))
+            .mount(&server)
+            .await;
+
+        let request = pipelines::ReconfigureRequest {
+            pipeline: serde_json::from_value(minimal_pipeline_json())
+                .expect("fixture pipeline should deserialize"),
+            step_timeout_secs: 60,
+            drain_timeout_secs: 60,
+        };
+
+        let err = client(&server)
+            .pipelines()
+            .reconfigure(
+                "default",
+                "main",
+                &request,
+                &operations::OperationOptions::default(),
+            )
+            .await
+            .expect_err("request rejection should be typed");
+
+        match err {
+            Error::AdminOperation { status, error } => {
+                assert_eq!(status, 422);
+                assert_eq!(error.kind, operations::OperationErrorKind::InvalidRequest);
+                assert_eq!(
+                    error.message.as_deref(),
+                    Some("topic runtime mutation is not supported")
+                );
+            }
+            other => panic!("unexpected error: {other}"),
+        }
+    }
+
+    /// Scenario: a caller polls a rollout id that no longer exists or was never
+    /// created.
+    /// Guarantees: the backend maps HTTP 404 to `Ok(None)` for rollout status
+    /// lookups instead of treating it as an SDK error.
+    #[tokio::test]
+    async fn pipeline_rollout_status_returns_none_on_404() {
+        let server = MockServer::start().await;
+        Mock::given(method("GET"))
+            .and(path(
+                "/api/v1/groups/default/pipelines/main/rollouts/rollout-9",
+            ))
+            .respond_with(ResponseTemplate::new(404))
+            .mount(&server)
+            .await;
+
+        let response = client(&server)
+            .pipelines()
+            .rollout_status("default", "main", "rollout-9")
+            .await
+            .expect("rollout status should decode");
+
+        assert!(response.is_none());
+    }
+
+    /// Scenario: a caller waits on pipeline shutdown and the server times out
+    /// the wait while returning the latest shutdown snapshot.
+    /// Guarantees: the backend decodes that response as
+    /// `ShutdownOutcome::TimedOut` and preserves the embedded status.
+    #[tokio::test]
+    async fn pipeline_shutdown_decodes_timed_out_outcome() {
+        let server = MockServer::start().await;
+        Mock::given(method("POST"))
+            .and(path("/api/v1/groups/default/pipelines/main/shutdown"))
+            .and(query_param("wait", "true"))
+            .and(query_param("timeout_secs", "30"))
+            .respond_with(ResponseTemplate::new(504).set_body_json(json!({
+                "shutdownId": "shutdown-2",
+                "pipelineGroupId": "default",
+                "pipelineId": "main",
+                "state": "running",
+                "startedAt": "2026-01-01T00:00:00Z",
+                "updatedAt": "2026-01-01T00:00:30Z",
+                "cores": []
+            })))
+            .mount(&server)
+            .await;
+
+        let response = client(&server)
+            .pipelines()
+            .shutdown(
+                "default",
+                "main",
+                &operations::OperationOptions {
+                    wait: true,
+                    timeout_secs: 30,
+                },
+            )
+            .await
+            .expect("shutdown outcome should decode");
+
+        match response {
+            pipelines::ShutdownOutcome::TimedOut(status) => {
+                assert_eq!(status.shutdown_id, "shutdown-2");
+            }
+            other => panic!("unexpected outcome: {other:?}"),
+        }
+    }
+
+    /// Scenario: a caller polls a known pipeline shutdown operation by id.
+    /// Guarantees: the backend decodes the returned shutdown snapshot and
+    /// surfaces it as `Some(...)`.
+    #[tokio::test]
+    async fn pipeline_shutdown_status_returns_some_on_200() {
+        let server = MockServer::start().await;
+        Mock::given(method("GET"))
+            .and(path(
+                "/api/v1/groups/default/pipelines/main/shutdowns/shutdown-2",
+            ))
+            .respond_with(ResponseTemplate::new(200).set_body_json(json!({
+                "shutdownId": "shutdown-2",
+                "pipelineGroupId": "default",
+                "pipelineId": "main",
+                "state": "succeeded",
+                "startedAt": "2026-01-01T00:00:00Z",
+                "updatedAt": "2026-01-01T00:00:05Z",
+                "cores": []
+            })))
+            .mount(&server)
+            .await;
+
+        let response = client(&server)
+            .pipelines()
+            .shutdown_status("default", "main", "shutdown-2")
+            .await
+            .expect("shutdown status should decode");
+
+        assert!(response.is_some());
+    }
+
     #[tokio::test]
     async fn pipeline_livez_maps_failed_probe_and_message() {
         let server = MockServer::start().await;
         Mock::given(method("GET"))
-            .and(path("/api/v1/pipeline-groups/default/pipelines/main/livez"))
+            .and(path("/api/v1/groups/default/pipelines/main/livez"))
             .respond_with(ResponseTemplate::new(500).set_body_string("NOT OK"))
             .mount(&server)
             .await;
@@ -760,7 +1287,7 @@ mod tests {
     async fn pipeline_livez_maps_ok_probe_without_message() {
         let server = MockServer::start().await;
         Mock::given(method("GET"))
-            .and(path("/api/v1/pipeline-groups/default/pipelines/main/livez"))
+            .and(path("/api/v1/groups/default/pipelines/main/livez"))
             .respond_with(ResponseTemplate::new(200).set_body_string(""))
             .mount(&server)
             .await;
@@ -779,9 +1306,7 @@ mod tests {
     async fn pipeline_readyz_maps_service_unavailable_to_failed_probe() {
         let server = MockServer::start().await;
         Mock::given(method("GET"))
-            .and(path(
-                "/api/v1/pipeline-groups/default/pipelines/main/readyz",
-            ))
+            .and(path("/api/v1/groups/default/pipelines/main/readyz"))
             .respond_with(ResponseTemplate::new(503).set_body_string("NOT OK"))
             .mount(&server)
             .await;
@@ -800,7 +1325,7 @@ mod tests {
     async fn pipeline_probe_unexpected_status_is_remote_status() {
         let server = MockServer::start().await;
         Mock::given(method("GET"))
-            .and(path("/api/v1/pipeline-groups/default/pipelines/main/livez"))
+            .and(path("/api/v1/groups/default/pipelines/main/livez"))
             .respond_with(ResponseTemplate::new(418).set_body_string("teapot"))
             .mount(&server)
             .await;
diff --git a/rust/otap-dataflow/crates/admin-api/src/lib.rs b/rust/otap-dataflow/crates/admin-api/src/lib.rs
index 8cb7e635f0..79162b0607 100644
--- a/rust/otap-dataflow/crates/admin-api/src/lib.rs
+++ b/rust/otap-dataflow/crates/admin-api/src/lib.rs
@@ -11,12 +11,12 @@ mod client;
 #[cfg(feature = "http-client")]
 mod http_backend;
 
-pub use otap_df_admin_types::{engine, operations, pipeline_groups, pipelines, telemetry};
+pub use otap_df_admin_types::{engine, groups, operations, pipelines, telemetry};
 pub use otap_df_config as config;
 
 #[cfg(feature = "http-client")]
 pub use crate::client::{
-    AdminClient, AdminClientBuilder, EngineClient, HttpAdminClientSettings, PipelineGroupsClient,
+    AdminClient, AdminClientBuilder, EngineClient, GroupsClient, HttpAdminClientSettings,
     PipelinesClient, TelemetryClient,
 };
 pub use crate::endpoint::{AdminAuth, AdminEndpoint, AdminScheme};
diff --git a/rust/otap-dataflow/crates/admin-types/Cargo.toml b/rust/otap-dataflow/crates/admin-types/Cargo.toml
index 08681242f7..453ed40cbb 100644
--- a/rust/otap-dataflow/crates/admin-types/Cargo.toml
+++ b/rust/otap-dataflow/crates/admin-types/Cargo.toml
@@ -13,5 +13,7 @@ rust-version.workspace = true
 workspace = true
 
 [dependencies]
+otap-df-config = { workspace = true }
+
 serde = { workspace = true, features = ["derive"] }
 serde_json = { workspace = true }
diff --git a/rust/otap-dataflow/crates/admin-types/src/pipeline_groups.rs b/rust/otap-dataflow/crates/admin-types/src/groups.rs
similarity index 96%
rename from rust/otap-dataflow/crates/admin-types/src/pipeline_groups.rs
rename to rust/otap-dataflow/crates/admin-types/src/groups.rs
index 61558ae8e1..3da80352e4 100644
--- a/rust/otap-dataflow/crates/admin-types/src/pipeline_groups.rs
+++ b/rust/otap-dataflow/crates/admin-types/src/groups.rs
@@ -1,13 +1,13 @@
 // Copyright The OpenTelemetry Authors
 // SPDX-License-Identifier: Apache-2.0
 
-//! Shared pipeline-group-scoped admin models.
+//! Shared group-scoped admin models.
 
 use crate::pipelines::Status as PipelineStatus;
 use serde::{Deserialize, Serialize};
 use std::collections::BTreeMap;
 
-/// Pipeline-group status response.
+/// Group status response.
 #[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
 #[serde(rename_all = "camelCase")]
 pub struct Status {
diff --git a/rust/otap-dataflow/crates/admin-types/src/lib.rs b/rust/otap-dataflow/crates/admin-types/src/lib.rs
index 7524741dc5..9000275476 100644
--- a/rust/otap-dataflow/crates/admin-types/src/lib.rs
+++ b/rust/otap-dataflow/crates/admin-types/src/lib.rs
@@ -4,7 +4,7 @@
 //! Shared admin request, response, query, and model types.
 
 pub mod engine;
+pub mod groups;
 pub mod operations;
-pub mod pipeline_groups;
 pub mod pipelines;
 pub mod telemetry;
diff --git a/rust/otap-dataflow/crates/admin-types/src/operations.rs b/rust/otap-dataflow/crates/admin-types/src/operations.rs
index 76468d6e97..d4bcc6afad 100644
--- a/rust/otap-dataflow/crates/admin-types/src/operations.rs
+++ b/rust/otap-dataflow/crates/admin-types/src/operations.rs
@@ -5,13 +5,17 @@
 
 use serde::{Deserialize, Serialize};
 
-/// Generic options for long-running admin operations.
+/// Wait behavior for long-running admin operations such as reconfigure and shutdown.
+///
+/// By default operations are asynchronous: the SDK returns as soon as the
+/// request has been accepted for background execution or has already completed.
+/// Set `wait = true` to wait up to `timeout_secs` for a terminal result.
 #[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
 pub struct OperationOptions {
-    /// Whether to wait for completion.
+    /// Whether the SDK should wait for the operation to reach a terminal result.
     #[serde(default)]
     pub wait: bool,
-    /// Wait timeout in seconds.
+    /// Maximum number of seconds to wait when `wait` is `true`.
     #[serde(default = "default_timeout_secs")]
     pub timeout_secs: u64,
 }
@@ -30,7 +34,10 @@ const fn default_timeout_secs() -> u64 {
 }
 
 impl OperationOptions {
-    /// Converts this request into URL query pairs.
+    /// Converts these options into URL query pairs for SDK transports.
+    ///
+    /// Most callers do not need this directly because the built-in HTTP
+    /// transport uses it automatically.
     #[must_use]
     pub fn to_query_pairs(&self) -> Vec<(&'static str, String)> {
         vec![
@@ -39,3 +46,80 @@ impl OperationOptions {
         ]
     }
 }
+
+/// Typed request rejection for live admin operations.
+///
+/// This is returned when the server refuses to start the requested operation at
+/// all. It is different from an accepted operation that later reports
+/// `Failed(...)` or `TimedOut(...)`.
+#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct OperationError {
+    /// Machine-readable rejection kind.
+    pub kind: OperationErrorKind,
+    /// Optional human-readable detail.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub message: Option<String>,
+}
+
+impl OperationError {
+    /// Creates a typed operation rejection without a human-readable message.
+    #[must_use]
+    pub const fn new(kind: OperationErrorKind) -> Self {
+        Self {
+            kind,
+            message: None,
+        }
+    }
+
+    /// Attaches a human-readable detail message to the rejection.
+    #[must_use]
+    pub fn with_message(mut self, message: impl Into<String>) -> Self {
+        self.message = Some(message.into());
+        self
+    }
+}
+
+/// Machine-readable rejection kinds for live admin operations.
+#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
+#[serde(rename_all = "snake_case")]
+pub enum OperationErrorKind {
+    /// The requested pipeline group does not exist.
+    GroupNotFound,
+    /// The requested pipeline does not exist.
+    PipelineNotFound,
+    /// The requested rollout does not exist.
+    RolloutNotFound,
+    /// The requested shutdown does not exist.
+    ShutdownNotFound,
+    /// Another incompatible live operation is active in the server's consistency scope.
+    Conflict,
+    /// The request was rejected as invalid.
+    InvalidRequest,
+    /// The server failed while processing the request.
+    Internal,
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use serde_json::json;
+
+    /// Scenario: the server returns a structured admin operation rejection in
+    /// the shared public wire format.
+    /// Guarantees: the SDK-owned `OperationError` model round-trips without
+    /// renaming fields or changing enum values.
+    #[test]
+    fn operation_error_roundtrips() {
+        let value = json!({
+            "kind": "invalid_request",
+            "message": "core allocation change is not supported"
+        });
+        let parsed: OperationError =
+            serde_json::from_value(value.clone()).expect("fixture should deserialize");
+        assert_eq!(
+            serde_json::to_value(parsed).expect("model should serialize"),
+            value
+        );
+    }
+}
diff --git a/rust/otap-dataflow/crates/admin-types/src/pipelines.rs b/rust/otap-dataflow/crates/admin-types/src/pipelines.rs
index 2ad1e7ea53..1140757c32 100644
--- a/rust/otap-dataflow/crates/admin-types/src/pipelines.rs
+++ b/rust/otap-dataflow/crates/admin-types/src/pipelines.rs
@@ -3,10 +3,64 @@
 
 //! Shared pipeline-scoped admin models.
 
+use otap_df_config::{PipelineGroupId, PipelineId, pipeline::PipelineConfig};
 use serde::{Deserialize, Deserializer, Serialize, Serializer};
 use serde_json::Value;
 use std::collections::BTreeMap;
 
+const fn default_rollout_timeout_secs() -> u64 {
+    60
+}
+
+/// Rollout state summary exposed on pipeline status snapshots and rollout resources.
+#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
+#[serde(rename_all = "snake_case")]
+pub enum PipelineRolloutState {
+    /// Rollout has been accepted but work has not started yet.
+    Pending,
+    /// Rollout is actively applying changes.
+    Running,
+    /// Rollout completed successfully and the target generation is serving.
+    Succeeded,
+    /// Rollout failed before completion.
+    Failed,
+    /// Automatic rollback is in progress.
+    RollingBack,
+    /// Rollback could not restore a fully healthy serving set.
+    RollbackFailed,
+}
+
+/// Lightweight rollout summary embedded into pipeline status payloads.
+#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct PipelineRolloutSummary {
+    /// Controller-assigned rollout identifier.
+    pub rollout_id: String,
+    /// Current rollout lifecycle state.
+    pub state: PipelineRolloutState,
+    /// Candidate generation being rolled out.
+    pub target_generation: u64,
+    /// RFC3339 timestamp for rollout creation.
+    pub started_at: String,
+    /// RFC3339 timestamp for the latest rollout state transition.
+    pub updated_at: String,
+    /// Human-readable failure or rollback reason when present.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub failure_reason: Option<String>,
+}
+
+/// Per-instance runtime status entry for generation-aware pipeline status payloads.
+#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct RuntimeInstanceStatus {
+    /// CPU core hosting this runtime instance.
+    pub core_id: usize,
+    /// Deployment generation for this runtime instance.
+    pub deployment_generation: u64,
+    /// Runtime status for this instance.
+    pub status: CoreStatus,
+}
+
 /// Pipeline status across all cores.
 #[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
 #[serde(rename_all = "camelCase")]
@@ -19,6 +73,198 @@ pub struct Status {
     pub running_cores: usize,
     /// Per-core details.
     pub cores: BTreeMap<usize, CoreStatus>,
+    /// Per-instance details when overlapping generations are present.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub instances: Option<Vec<RuntimeInstanceStatus>>,
+    /// Last committed active generation, if known.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub active_generation: Option<u64>,
+    /// Serving generation selected per core by the controller during rollout.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub serving_generations: Option<BTreeMap<usize, u64>>,
+    /// Optional rollout summary mirrored into `/status`.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub rollout: Option<PipelineRolloutSummary>,
+}
+
+/// Committed live definition of one logical pipeline.
+///
+/// This is the configuration that the controller currently treats as active for
+/// the logical pipeline. It is not a runtime status snapshot; use [`Status`]
+/// when you need per-core progress or overlapping-instance state.
+#[derive(Debug, Clone, Serialize, Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct PipelineDetails {
+    /// Logical pipeline group id.
+    pub pipeline_group_id: PipelineGroupId,
+    /// Logical pipeline id.
+    pub pipeline_id: PipelineId,
+    /// Last committed active generation, if known.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub active_generation: Option<u64>,
+    /// Current live pipeline configuration.
+    pub pipeline: PipelineConfig,
+    /// Optional rollout summary mirrored into `/status`.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub rollout: Option<PipelineRolloutSummary>,
+}
+
+/// Desired pipeline definition and timing options for a live reconfiguration request.
+#[derive(Debug, Clone, Serialize, Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct ReconfigureRequest {
+    /// Candidate pipeline configuration to create or roll out.
+    pub pipeline: PipelineConfig,
+    /// Per-core admission/ready timeout in seconds.
+    #[serde(default = "default_rollout_timeout_secs")]
+    pub step_timeout_secs: u64,
+    /// Graceful drain timeout in seconds when shutting down the old generation.
+    #[serde(default = "default_rollout_timeout_secs")]
+    pub drain_timeout_secs: u64,
+}
+
+/// Detailed per-core rollout progress.
+#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct RolloutCoreStatus {
+    /// Target core for this step.
+    pub core_id: usize,
+    /// Previously serving generation on this core, if any.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub previous_generation: Option<u64>,
+    /// Candidate generation being launched for this core.
+    pub target_generation: u64,
+    /// Current lifecycle state for this core step.
+    pub state: String,
+    /// RFC3339 timestamp for the latest step transition.
+    pub updated_at: String,
+    /// Optional human-readable detail for failures or waits.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub detail: Option<String>,
+}
+
+/// Snapshot of one live reconfiguration operation.
+///
+/// This describes the current state of a specific rollout id. It is operation
+/// status, not a stable pipeline definition. These snapshots are retained in
+/// controller memory only for a bounded window.
+#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct RolloutStatus {
+    /// Controller-assigned rollout identifier.
+    pub rollout_id: String,
+    /// Logical target pipeline group id.
+    pub pipeline_group_id: PipelineGroupId,
+    /// Logical target pipeline id.
+    pub pipeline_id: PipelineId,
+    /// `create`, `noop`, `replace`, or `resize`.
+    pub action: String,
+    /// Current rollout lifecycle state.
+    pub state: PipelineRolloutState,
+    /// Candidate generation targeted by this rollout.
+    pub target_generation: u64,
+    /// Previously committed generation, if any.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub previous_generation: Option<u64>,
+    /// RFC3339 timestamp for rollout creation.
+    pub started_at: String,
+    /// RFC3339 timestamp for the latest rollout transition.
+    pub updated_at: String,
+    /// Optional failure or rollback reason.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub failure_reason: Option<String>,
+    /// Per-core rollout progress entries.
+    pub cores: Vec<RolloutCoreStatus>,
+}
+
+/// Detailed per-core shutdown progress.
+#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct ShutdownCoreStatus {
+    /// Target core being drained.
+    pub core_id: usize,
+    /// Deployment generation targeted for shutdown on this core.
+    pub deployment_generation: u64,
+    /// Current lifecycle state for this core shutdown step.
+    pub state: String,
+    /// RFC3339 timestamp for the latest step transition.
+    pub updated_at: String,
+    /// Optional human-readable detail for failures or waits.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub detail: Option<String>,
+}
+
+/// Snapshot of one pipeline shutdown operation.
+///
+/// This describes the current state of a specific shutdown id. It is operation
+/// status, not a stable pipeline definition. These snapshots are retained in
+/// controller memory only for a bounded window.
+#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct ShutdownStatus {
+    /// Controller-assigned shutdown identifier.
+    pub shutdown_id: String,
+    /// Logical target pipeline group id.
+    pub pipeline_group_id: PipelineGroupId,
+    /// Logical target pipeline id.
+    pub pipeline_id: PipelineId,
+    /// Current shutdown lifecycle state.
+    pub state: String,
+    /// RFC3339 timestamp for shutdown creation.
+    pub started_at: String,
+    /// RFC3339 timestamp for the latest shutdown transition.
+    pub updated_at: String,
+    /// Optional failure reason when shutdown does not complete cleanly.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub failure_reason: Option<String>,
+    /// Per-core shutdown progress entries.
+    pub cores: Vec<ShutdownCoreStatus>,
+}
+
+/// Caller-facing outcome of a live reconfiguration request.
+///
+/// The variant tells you whether the request was only accepted, reached a
+/// terminal state within the requested wait window, or outlived that wait
+/// window.
+#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
+pub enum ReconfigureOutcome {
+    /// The request was accepted and the rollout continues asynchronously.
+    ///
+    /// Poll [`RolloutStatus`] later if you need progress or a terminal
+    /// result.
+    Accepted(RolloutStatus),
+    /// The rollout reached a successful terminal state within the requested wait window.
+    Completed(RolloutStatus),
+    /// The rollout reached a failed terminal state within the requested wait window.
+    Failed(RolloutStatus),
+    /// The requested wait window elapsed before the rollout reached a terminal state.
+    ///
+    /// The included snapshot is the latest known rollout status. The rollout
+    /// may still continue running in the engine.
+    TimedOut(RolloutStatus),
+}
+
+/// Caller-facing outcome of a pipeline shutdown request.
+///
+/// The variant tells you whether the request was only accepted, reached a
+/// terminal state within the requested wait window, or outlived that wait
+/// window.
+#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
+pub enum ShutdownOutcome {
+    /// The request was accepted and the shutdown continues asynchronously.
+    ///
+    /// Poll [`ShutdownStatus`] later if you need progress or a terminal
+    /// result.
+    Accepted(ShutdownStatus),
+    /// The shutdown reached a successful terminal state within the requested wait window.
+    Completed(ShutdownStatus),
+    /// The shutdown reached a failed terminal state within the requested wait window.
+    Failed(ShutdownStatus),
+    /// The requested wait window elapsed before the shutdown reached a terminal state.
+    ///
+    /// The included snapshot is the latest known shutdown status. The shutdown
+    /// may still continue running in the engine.
+    TimedOut(ShutdownStatus),
 }
 
 /// Per-core pipeline status.
@@ -573,6 +819,172 @@ mod tests {
                     }
                 ]
             }
+            },
+            "instances": [
+                {
+                    "coreId": 0,
+                    "deploymentGeneration": 7,
+                    "status": {
+                        "phase": "running",
+                        "lastHeartbeatTime": "2026-01-01T00:00:00Z",
+                        "conditions": [],
+                        "deletePending": false
+                    }
+                }
+            ],
+            "activeGeneration": 7,
+            "servingGenerations": {
+                "0": 7
+            },
+            "rollout": {
+                "rolloutId": "rollout-7",
+                "state": "running",
+                "targetGeneration": 7,
+                "startedAt": "2026-01-01T00:00:00Z",
+                "updatedAt": "2026-01-01T00:00:01Z"
+            }
+        }));
+    }
+
+    /// Scenario: a caller serializes or deserializes the public live
+    /// reconfiguration request body.
+    /// Guarantees: the shared SDK model preserves the committed camelCase wire
+    /// shape for pipeline config and timeout fields.
+    #[test]
+    fn reconfigure_request_roundtrips_current_wire_shape() {
+        assert_roundtrip::<ReconfigureRequest>(json!({
+            "pipeline": {
+                "type": "otap",
+                "nodes": {
+                    "recv": {
+                        "type": "urn:otel:receiver:fake",
+                        "config": {}
+                    }
+                }
+            },
+            "stepTimeoutSecs": 45,
+            "drainTimeoutSecs": 30
+        }));
+    }
+
+    /// Scenario: a caller reads the committed pipeline-details resource through
+    /// the public SDK.
+    /// Guarantees: the shared model preserves the current wire shape for the
+    /// committed config, active generation, and embedded rollout summary.
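+    /// The fixture uses a deliberately minimal `otap` pipeline config; any
+    /// richer config accepted by `PipelineConfig` should round-trip the same
+    /// way.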
+    #[test]
+    fn pipeline_details_roundtrips_current_wire_shape() {
+        assert_roundtrip::<PipelineDetails>(json!({
+            "pipelineGroupId": "default",
+            "pipelineId": "main",
+            "activeGeneration": 2,
+            "pipeline": {
+                "type": "otap",
+                "nodes": {
+                    "recv": {
+                        "type": "urn:otel:receiver:fake",
+                        "config": {}
+                    }
+                }
+            },
+            "rollout": {
+                "rolloutId": "rollout-2",
+                "state": "succeeded",
+                "targetGeneration": 2,
+                "startedAt": "2026-01-01T00:00:00Z",
+                "updatedAt": "2026-01-01T00:00:05Z"
+            }
+        }));
+    }
+
+    /// Scenario: a caller polls the status of one rollout operation by id.
+    /// Guarantees: the shared rollout snapshot model round-trips the current
+    /// wire shape, including action, lifecycle state, and per-core progress.
+    #[test]
+    fn pipeline_rollout_status_roundtrips_current_wire_shape() {
+        assert_roundtrip::<RolloutStatus>(json!({
+            "rolloutId": "rollout-2",
+            "pipelineGroupId": "default",
+            "pipelineId": "main",
+            "action": "replace",
+            "state": "rolling_back",
+            "targetGeneration": 2,
+            "previousGeneration": 1,
+            "startedAt": "2026-01-01T00:00:00Z",
+            "updatedAt": "2026-01-01T00:00:05Z",
+            "failureReason": "candidate failed admission",
+            "cores": [
+                {
+                    "coreId": 0,
+                    "previousGeneration": 1,
+                    "targetGeneration": 2,
+                    "state": "waiting_ready",
+                    "updatedAt": "2026-01-01T00:00:03Z"
+                }
+            ]
+        }));
+    }
+
+    /// Scenario: a caller polls the status of one pipeline shutdown operation
+    /// by id.
+    /// Guarantees: the shared shutdown snapshot model round-trips the current
+    /// wire shape, including failure detail and per-core progress.
+    #[test]
+    fn pipeline_shutdown_status_roundtrips_current_wire_shape() {
+        assert_roundtrip::<ShutdownStatus>(json!({
+            "shutdownId": "shutdown-1",
+            "pipelineGroupId": "default",
+            "pipelineId": "main",
+            "state": "failed",
+            "startedAt": "2026-01-01T00:00:00Z",
+            "updatedAt": "2026-01-01T00:00:05Z",
+            "failureReason": "drain deadline exceeded",
+            "cores": [
+                {
+                    "coreId": 0,
+                    "deploymentGeneration": 2,
+                    "state": "draining",
+                    "updatedAt": "2026-01-01T00:00:05Z"
+                }
+            ]
+        }));
+    }
+
+    /// Scenario: the SDK receives a waited reconfigure result that completed
+    /// within the requested wait window.
+    /// Guarantees: the caller-facing outcome enum preserves the external wire
+    /// encoding for completed rollout results.
+    #[test]
+    fn reconfigure_outcome_roundtrips() {
+        assert_roundtrip::<ReconfigureOutcome>(json!({
+            "Completed": {
+                "rolloutId": "rollout-2",
+                "pipelineGroupId": "default",
+                "pipelineId": "main",
+                "action": "noop",
+                "state": "succeeded",
+                "targetGeneration": 2,
+                "startedAt": "2026-01-01T00:00:00Z",
+                "updatedAt": "2026-01-01T00:00:00Z",
+                "cores": []
+            }
+        }));
+    }
+
+    /// Scenario: the SDK receives a waited shutdown result whose wait window
+    /// expired before the operation finished.
+    /// Guarantees: the caller-facing outcome enum preserves the external wire
+    /// encoding for timed-out shutdown results.
+    #[test]
+    fn shutdown_outcome_roundtrips() {
+        assert_roundtrip::<ShutdownOutcome>(json!({
+            "TimedOut": {
+                "shutdownId": "shutdown-1",
+                "pipelineGroupId": "default",
+                "pipelineId": "main",
+                "state": "running",
+                "startedAt": "2026-01-01T00:00:00Z",
+                "updatedAt": "2026-01-01T00:00:05Z",
+                "cores": []
             }
         }));
     }
diff --git a/rust/otap-dataflow/crates/admin-types/src/telemetry.rs b/rust/otap-dataflow/crates/admin-types/src/telemetry.rs
index cc9fd01444..3a49c7bb8e 100644
--- a/rust/otap-dataflow/crates/admin-types/src/telemetry.rs
+++ b/rust/otap-dataflow/crates/admin-types/src/telemetry.rs
@@ -18,7 +18,10 @@ pub struct MetricsOptions {
 }
 
 impl MetricsOptions {
-    /// Converts these options into URL query pairs.
+ /// Converts these options into URL query pairs for SDK transports. + /// + /// Most callers do not need this directly because the built-in HTTP + /// transport uses it automatically. #[must_use] pub fn to_query_pairs(&self) -> Vec<(&'static str, String)> { vec![ @@ -189,7 +192,10 @@ pub struct LogsQuery { } impl LogsQuery { - /// Converts this request into URL query pairs. + /// Converts this query into URL query pairs for SDK transports. + /// + /// Most callers do not need this directly because the built-in HTTP + /// transport uses it automatically. #[must_use] pub fn to_query_pairs(&self) -> Vec<(&'static str, String)> { let mut pairs = Vec::new(); diff --git a/rust/otap-dataflow/crates/admin/README.md b/rust/otap-dataflow/crates/admin/README.md index 7cd40c4472..c121e83f5e 100644 --- a/rust/otap-dataflow/crates/admin/README.md +++ b/rust/otap-dataflow/crates/admin/README.md @@ -3,12 +3,16 @@ `otap-df-admin` provides: - admin, health, status, and telemetry HTTP endpoints; +- live pipeline mutation endpoints for create, replace, resize, rollout + tracking, and shutdown tracking; - an embedded single-page UI served from the same process and origin. For architecture and runtime behavior details, see [`docs/admin/architecture.md`](../../docs/admin/architecture.md). For the admin docs landing page, see [`docs/admin/README.md`](../../docs/admin/README.md). +For the operator guide to live pipeline mutation, see +[`docs/admin/live-reconfiguration.md`](../../docs/admin/live-reconfiguration.md). ## Main routes @@ -30,11 +34,19 @@ For the admin docs landing page, see - `GET /api/v1/status` - `GET /api/v1/livez` - `GET /api/v1/readyz` -- `GET /api/v1/pipeline-groups/status` -- `GET /api/v1/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}/status` -- `GET /api/v1/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}/livez` -- `GET /api/v1/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}/readyz` -- `POST /api/v1/pipeline-groups/shutdown` +- `GET /api/v1/groups/status` +- `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}` +- `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/status` +- `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/rollouts/{rollout_id}` +- `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/shutdowns/{shutdown_id}` +- `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/livez` +- `GET /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/readyz` + +### Pipeline lifecycle + +- `PUT /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}` +- `POST /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/shutdown` +- `POST /api/v1/groups/shutdown` ## Embedded UI layout (crate-relative) @@ -86,8 +98,11 @@ guidance, see [`docs/admin/architecture.md`](../../docs/admin/architecture.md). through an enforced integration layer). - [ ] Add TLS support in-process or enforce TLS at a mandatory front proxy boundary. -- [ ] Protect `POST /pipeline-groups/shutdown` with stricter access controls - than read-only endpoints. +- [ ] Protect mutating endpoints such as + `PUT /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}`, + `POST /api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/shutdown`, + and `POST /api/v1/groups/shutdown` with stricter access controls than + read-only endpoints. - [ ] Apply the same hardened response headers to API endpoints (`/api/v1/status`, `/api/v1/livez`, `/api/v1/readyz`, `/api/v1/telemetry/*`, `/api/v1/metrics`), not only UI/static. 
@@ -105,4 +120,6 @@ guidance, see [`docs/admin/architecture.md`](../../docs/admin/architecture.md). - strong authentication/authorization - network ACLs / source allow-listing - route-level restrictions for mutating endpoints such as - `/api/v1/pipeline-groups/shutdown` + `/api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}`, + `/api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/shutdown`, + and `/api/v1/groups/shutdown` diff --git a/rust/otap-dataflow/crates/admin/src/convert.rs b/rust/otap-dataflow/crates/admin/src/convert.rs index 73aca5a920..e7043b289a 100644 --- a/rust/otap-dataflow/crates/admin/src/convert.rs +++ b/rust/otap-dataflow/crates/admin/src/convert.rs @@ -3,8 +3,6 @@ //! Conversion helpers from internal admin/server types to public SDK models. -use otap_df_admin_types::telemetry as api; -use otap_df_telemetry::attributes::AttributeValue; use serde::Serialize; use serde::de::DeserializeOwned; @@ -18,19 +16,3 @@ where ) .expect("public admin model should deserialize from the current wire shape") } - -/// Convert an engine `AttributeValue` to the public admin API representation. -pub(crate) fn convert_attribute_value(value: &AttributeValue) -> api::AttributeValue { - match value { - AttributeValue::String(s) => api::AttributeValue::String(s.clone()), - AttributeValue::Int(v) => api::AttributeValue::Int(*v), - AttributeValue::UInt(v) => api::AttributeValue::UInt(*v), - AttributeValue::Double(v) => api::AttributeValue::Double(*v), - AttributeValue::Boolean(v) => api::AttributeValue::Boolean(*v), - AttributeValue::Map(m) => api::AttributeValue::Map( - m.iter() - .map(|(k, v)| (k.clone(), convert_attribute_value(v))) - .collect(), - ), - } -} diff --git a/rust/otap-dataflow/crates/admin/src/lib.rs b/rust/otap-dataflow/crates/admin/src/lib.rs index 0b20e1c668..f99600a8ce 100644 --- a/rust/otap-dataflow/crates/admin/src/lib.rs +++ b/rust/otap-dataflow/crates/admin/src/lib.rs @@ -12,22 +12,118 @@ mod pipeline_group; mod telemetry; use axum::Router; +use otap_df_admin_types::operations::{OperationError, OperationErrorKind}; +pub use otap_df_admin_types::pipelines::{ + PipelineDetails, PipelineRolloutState, PipelineRolloutSummary, ReconfigureRequest, + RolloutCoreStatus, RolloutStatus, ShutdownCoreStatus, ShutdownStatus, +}; +use serde::Serialize; use std::net::SocketAddr; use std::sync::Arc; use tokio::net::TcpListener; -use tokio::sync::Mutex; use tokio_util::sync::CancellationToken; use tower::ServiceBuilder; use crate::error::Error; use otap_df_config::engine::HttpAdminSettings; -use otap_df_engine::control::PipelineAdminSender; use otap_df_engine::memory_limiter::MemoryPressureState; use otap_df_state::store::ObservedStateHandle; use otap_df_telemetry::log_tap::InternalLogTapHandle; use otap_df_telemetry::registry::TelemetryRegistryHandle; use otap_df_telemetry::{otel_info, otel_warn}; +/// Control-plane error surfaced to admin handlers. +#[derive(Debug, Clone, Serialize, PartialEq, Eq)] +#[serde(tag = "kind", rename_all = "snake_case")] +pub enum ControlPlaneError { + /// The requested pipeline group does not exist. + GroupNotFound, + /// The requested pipeline does not exist. + PipelineNotFound, + /// Another incompatible live operation is active in the current consistency scope. + RolloutConflict, + /// Submitted pipeline configuration failed validation or violated a runtime boundary. + InvalidRequest { + /// Human-readable validation failure detail. + message: String, + }, + /// The requested rollout could not be found. 
+    RolloutNotFound,
+    /// The requested shutdown could not be found.
+    ShutdownNotFound,
+    /// Unexpected internal failure while processing the request.
+    Internal {
+        /// Human-readable internal failure detail.
+        message: String,
+    },
+}
+
+impl ControlPlaneError {
+    /// Converts a control-plane error into the public operation rejection model.
+    #[must_use]
+    pub fn as_operation_error(&self) -> OperationError {
+        match self {
+            Self::GroupNotFound => OperationError::new(OperationErrorKind::GroupNotFound),
+            Self::PipelineNotFound => OperationError::new(OperationErrorKind::PipelineNotFound),
+            Self::RolloutConflict => OperationError::new(OperationErrorKind::Conflict),
+            Self::InvalidRequest { message } => {
+                OperationError::new(OperationErrorKind::InvalidRequest)
+                    .with_message(message.clone())
+            }
+            Self::RolloutNotFound => OperationError::new(OperationErrorKind::RolloutNotFound),
+            Self::ShutdownNotFound => OperationError::new(OperationErrorKind::ShutdownNotFound),
+            Self::Internal { message } => {
+                OperationError::new(OperationErrorKind::Internal).with_message(message.clone())
+            }
+        }
+    }
+}
+
+/// Control-plane interface implemented by the controller runtime.
+pub trait ControlPlane: Send + Sync {
+    /// Requests shutdown of all currently running runtime instances.
+    fn shutdown_all(&self, timeout_secs: u64) -> Result<(), ControlPlaneError>;
+
+    /// Requests shutdown of all currently running runtime instances for one logical pipeline.
+    fn shutdown_pipeline(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        timeout_secs: u64,
+    ) -> Result<ShutdownStatus, ControlPlaneError>;
+
+    /// Reconfigures a logical pipeline and returns the rollout job snapshot.
+    fn reconfigure_pipeline(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        request: ReconfigureRequest,
+    ) -> Result<RolloutStatus, ControlPlaneError>;
+
+    /// Returns the live active config for a logical pipeline.
+    fn pipeline_details(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+    ) -> Result<Option<PipelineDetails>, ControlPlaneError>;
+
+    /// Returns the detailed status for a rollout job.
+    fn rollout_status(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        rollout_id: &str,
+    ) -> Result<Option<RolloutStatus>, ControlPlaneError>;
+
+    /// Returns the detailed status for a shutdown job.
+    fn shutdown_status(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        shutdown_id: &str,
+    ) -> Result<Option<ShutdownStatus>, ControlPlaneError>;
+}
+
 /// Shared state for the HTTP admin server.
 #[derive(Clone)]
 struct AppState {
@@ -37,12 +133,12 @@ struct AppState {
     /// The metrics registry for querying current metrics.
     metrics_registry: TelemetryRegistryHandle,
 
+    /// Resident controller control plane for runtime mutations.
+    controller: Arc<dyn ControlPlane>,
+
     /// Optional internal log tap for querying retained internal logs.
     log_tap: Option<InternalLogTapHandle>,
 
-    /// The control message senders for controlling pipelines.
-    ctrl_msg_senders: Arc<Mutex<Vec<PipelineAdminSender>>>,
-
     /// Shared process-wide memory pressure state.
     memory_pressure_state: MemoryPressureState,
 }
 
@@ -51,7 +147,7 @@
 pub async fn run(
     config: HttpAdminSettings,
     observed_store: ObservedStateHandle,
-    ctrl_msg_senders: Vec<PipelineAdminSender>,
+    controller: Arc<dyn ControlPlane>,
     metrics_registry: TelemetryRegistryHandle,
     memory_pressure_state: MemoryPressureState,
     log_tap: Option<InternalLogTapHandle>,
@@ -60,8 +156,8 @@
     let app_state = AppState {
         observed_state_store: observed_store,
         metrics_registry,
+        controller,
         log_tap,
-        ctrl_msg_senders: Arc::new(Mutex::new(ctrl_msg_senders)),
        memory_pressure_state,
     };
diff --git a/rust/otap-dataflow/crates/admin/src/pipeline.rs b/rust/otap-dataflow/crates/admin/src/pipeline.rs
index b86d9bd3a1..a9a6dec221 100644
--- a/rust/otap-dataflow/crates/admin/src/pipeline.rs
+++ b/rust/otap-dataflow/crates/admin/src/pipeline.rs
@@ -2,49 +2,376 @@
 // SPDX-License-Identifier: Apache-2.0
 
 //! Pipeline endpoints.
-//! Status: Not implemented.
 //!
-//! - GET `/api/v1/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}`
+//! - GET `/api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}`
 //!   Get the configuration of the specified pipeline.
-//! - GET `/api/v1/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}/status`
+//! - GET `/api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/status`
 //!   Get the status of the specified pipeline.
-//! - POST `/api/v1/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}/shutdown`
-//!   Shutdown a specific pipeline
+//! - GET `/api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/rollouts/{rollout_id}`
+//!   Get the status of a specific rollout job for the logical pipeline.
+//!   Older rollout ids may return `404 Not Found` after bounded in-memory
+//!   retention evicts terminal history.
+//! - GET `/api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/shutdowns/{shutdown_id}`
+//!   Get the status of a specific shutdown job for the logical pipeline.
+//!   Older shutdown ids may return `404 Not Found` after bounded in-memory
+//!   retention evicts terminal history.
+//! - PUT `/api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}`
+//!   Create or replace a pipeline and return a rollout job status snapshot.
+//! - POST `/api/v1/groups/{pipeline_group_id}/pipelines/{pipeline_id}/shutdown`
+//!   Shutdown a specific logical pipeline and return a shutdown job status snapshot.
+//!   - Query parameters:
+//!     - `wait` (bool, default: false) - if true, block until the pipeline stops
+//!     - `timeout_secs` (u64, default: 60) - maximum seconds to wait when `wait=true`
+//!   - 200 OK if `wait=true` and the pipeline stopped successfully
 //!   - 202 Accepted if the stop request was accepted and is being processed (async operation)
 //!   - 400 Bad Request if the pipeline is already stopped
-//!   - 404 Not Found if the pipeline does not exist
+//!   - 404 Not Found if the group or pipeline does not exist
+//!   - 409 Conflict if a rollout or shutdown is active for the pipeline, or if a waited
+//!     shutdown fails
 //!   - 500 Internal Server Error if the stop request could not be processed
+//!   - 504 Gateway Timeout if `wait=true` and the pipeline did not stop within timeout
 //!
 //! ToDo Alternative -> avoid verb-y subpaths and support PATCH /.../pipelines/{pipelineId} with a body like {"status":"stopped"}. Use 409 if already stopping/stopped.
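+//!
+//! Example (start a rollout, then poll it; ids and the config file name are
+//! illustrative):
+//! ```sh
+//! curl -X PUT "http://localhost:8080/api/v1/groups/default/pipelines/main" \
+//!   -H "Content-Type: application/json" --data @pipeline.json
+//! ```
+//! A `202 Accepted` response carries a `rolloutId`; poll it until the state is
+//! terminal, treating a later `404 Not Found` as possible eviction of retained
+//! history (the rollout summary mirrored into `/status` remains available):
+//! ```sh
+//! curl "http://localhost:8080/api/v1/groups/default/pipelines/main/rollouts/rollout-1"
+//! ```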
 use crate::AppState;
 use crate::convert::json_shape;
-use axum::extract::{Path, State};
+use axum::extract::{Path, Query, State};
 use axum::http::StatusCode;
-use axum::response::IntoResponse;
-use axum::routing::get;
+use axum::response::{IntoResponse, Response};
+use axum::routing::{get, post};
 use axum::{Json, Router};
-use otap_df_admin_types::pipelines::Status as ApiPipelineStatus;
+use otap_df_admin_types::pipelines::{PipelineRolloutState, Status as ApiPipelineStatus};
 use otap_df_config::PipelineKey;
+use otap_df_telemetry::otel_info;
+use serde::Deserialize;
+use std::time::{Duration, Instant};
 
 /// All the routes for pipelines.
 pub(crate) fn routes() -> Router<AppState> {
     Router::new()
+        .route(
+            "/groups/{pipeline_group_id}/pipelines/{pipeline_id}",
+            get(show_pipeline).put(put_pipeline),
+        )
         // Returns the status of a specific pipeline.
         .route(
-            "/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}/status",
+            "/groups/{pipeline_group_id}/pipelines/{pipeline_id}/status",
             get(show_status),
         )
+        .route(
+            "/groups/{pipeline_group_id}/pipelines/{pipeline_id}/rollouts/{rollout_id}",
+            get(show_rollout),
+        )
+        .route(
+            "/groups/{pipeline_group_id}/pipelines/{pipeline_id}/shutdowns/{shutdown_id}",
+            get(show_shutdown),
+        )
+        .route(
+            "/groups/{pipeline_group_id}/pipelines/{pipeline_id}/shutdown",
+            post(shutdown_pipeline),
+        )
         // liveness and readiness probes.
         .route(
-            "/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}/livez",
+            "/groups/{pipeline_group_id}/pipelines/{pipeline_id}/livez",
            get(liveness),
         )
         .route(
-            "/pipeline-groups/{pipeline_group_id}/pipelines/{pipeline_id}/readyz",
+            "/groups/{pipeline_group_id}/pipelines/{pipeline_id}/readyz",
             get(readiness),
         )
 }
 
+#[derive(Deserialize)]
+pub(crate) struct WaitParams {
+    #[serde(default)]
+    wait: bool,
+    #[serde(default = "default_timeout_secs")]
+    timeout_secs: u64,
+}
+
+const fn default_timeout_secs() -> u64 {
+    60
+}
+
+/// Converts a typed control-plane rejection into the shared HTTP error shape.
+fn operation_error_response(status: StatusCode, error: crate::ControlPlaneError) -> Response {
+    (status, Json(error.as_operation_error())).into_response()
+}
+
+/// Returns whether a rollout status is already in a terminal state.
+fn rollout_is_terminal(state: PipelineRolloutState) -> bool {
+    matches!(
+        state,
+        PipelineRolloutState::Succeeded
+            | PipelineRolloutState::Failed
+            | PipelineRolloutState::RollbackFailed
+    )
+}
+
+/// Returns whether a terminal rollout finished successfully.
+fn rollout_is_success(state: PipelineRolloutState) -> bool {
+    state == PipelineRolloutState::Succeeded
+}
+
+/// Returns whether a shutdown status string represents a terminal state.
+fn shutdown_is_terminal(state: &str) -> bool {
+    matches!(state, "succeeded" | "failed")
+}
+
+/// Returns whether a terminal shutdown finished successfully.
+fn shutdown_is_success(state: &str) -> bool {
+    state == "succeeded"
+}
+
+/// Returns committed configuration details for one logical pipeline.
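+///
+/// Unknown groups or pipelines map to `404 Not Found`; any other control-plane
+/// failure maps to `500 Internal Server Error`.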
+pub async fn show_pipeline(
+    Path((pipeline_group_id, pipeline_id)): Path<(String, String)>,
+    State(state): State<AppState>,
+) -> Result<Json<PipelineDetails>, StatusCode> {
+    match state
+        .controller
+        .pipeline_details(&pipeline_group_id, &pipeline_id)
+    {
+        Ok(Some(details)) => Ok(Json(details)),
+        Ok(None) => Err(StatusCode::NOT_FOUND),
+        Err(
+            crate::ControlPlaneError::PipelineNotFound | crate::ControlPlaneError::GroupNotFound,
+        ) => Err(StatusCode::NOT_FOUND),
+        Err(_) => Err(StatusCode::INTERNAL_SERVER_ERROR),
+    }
+}
+
+/// Starts a pipeline reconfiguration and optionally waits for its terminal result.
+pub async fn put_pipeline(
+    Path((pipeline_group_id, pipeline_id)): Path<(String, String)>,
+    Query(params): Query<WaitParams>,
+    State(state): State<AppState>,
+    Json(request): Json<ReconfigureRequest>,
+) -> impl IntoResponse {
+    let rollout =
+        match state
+            .controller
+            .reconfigure_pipeline(&pipeline_group_id, &pipeline_id, request)
+        {
+            Ok(rollout) => rollout,
+            Err(crate::ControlPlaneError::GroupNotFound) => {
+                return operation_error_response(
+                    StatusCode::NOT_FOUND,
+                    crate::ControlPlaneError::GroupNotFound,
+                );
+            }
+            Err(crate::ControlPlaneError::RolloutConflict) => {
+                return operation_error_response(
+                    StatusCode::CONFLICT,
+                    crate::ControlPlaneError::RolloutConflict,
+                );
+            }
+            Err(crate::ControlPlaneError::InvalidRequest { message }) => {
+                return operation_error_response(
+                    StatusCode::UNPROCESSABLE_ENTITY,
+                    crate::ControlPlaneError::InvalidRequest { message },
+                );
+            }
+            Err(other) => {
+                return operation_error_response(StatusCode::INTERNAL_SERVER_ERROR, other);
+            }
+        };
+
+    if !params.wait {
+        let status = if rollout_is_terminal(rollout.state) {
+            if rollout_is_success(rollout.state) {
+                StatusCode::OK
+            } else {
+                StatusCode::CONFLICT
+            }
+        } else {
+            StatusCode::ACCEPTED
+        };
+        return (status, Json(rollout)).into_response();
+    }
+
+    let deadline = Instant::now() + Duration::from_secs(params.timeout_secs);
+    let mut last_status = Some(rollout);
+    loop {
+        let Some(rollout_id) = last_status.as_ref().map(|status| status.rollout_id.clone()) else {
+            return operation_error_response(
+                StatusCode::INTERNAL_SERVER_ERROR,
+                crate::ControlPlaneError::Internal {
+                    message: "initial rollout status disappeared while waiting".to_string(),
+                },
+            );
+        };
+        match state
+            .controller
+            .rollout_status(&pipeline_group_id, &pipeline_id, &rollout_id)
+        {
+            Ok(Some(current)) if rollout_is_terminal(current.state) => {
+                let status = if rollout_is_success(current.state) {
+                    StatusCode::OK
+                } else {
+                    StatusCode::CONFLICT
+                };
+                return (status, Json(current)).into_response();
+            }
+            Ok(Some(current)) => {
+                last_status = Some(current);
+            }
+            Ok(None) | Err(crate::ControlPlaneError::RolloutNotFound) => {
+                return operation_error_response(
+                    StatusCode::NOT_FOUND,
+                    crate::ControlPlaneError::RolloutNotFound,
+                );
+            }
+            Err(other) => {
+                return operation_error_response(StatusCode::INTERNAL_SERVER_ERROR, other);
+            }
+        }
+
+        if Instant::now() >= deadline {
+            return match last_status {
+                Some(status) => (StatusCode::GATEWAY_TIMEOUT, Json(status)).into_response(),
+                None => operation_error_response(
+                    StatusCode::INTERNAL_SERVER_ERROR,
+                    crate::ControlPlaneError::Internal {
+                        message: "rollout status disappeared before timeout response".to_string(),
+                    },
+                ),
+            };
+        }
+        tokio::time::sleep(Duration::from_millis(100)).await;
+    }
+}
+
+/// Returns the latest snapshot for one rollout operation id.
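+///
+/// Unknown or evicted rollout ids map to `404 Not Found`.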
+pub async fn show_rollout(
+    Path((pipeline_group_id, pipeline_id, rollout_id)): Path<(String, String, String)>,
+    State(state): State<AppState>,
+) -> Result<Json<RolloutStatus>, StatusCode> {
+    match state
+        .controller
+        .rollout_status(&pipeline_group_id, &pipeline_id, &rollout_id)
+    {
+        Ok(Some(status)) => Ok(Json(status)),
+        Ok(None) => Err(StatusCode::NOT_FOUND),
+        Err(crate::ControlPlaneError::RolloutNotFound) => Err(StatusCode::NOT_FOUND),
+        Err(_) => Err(StatusCode::INTERNAL_SERVER_ERROR),
+    }
+}
+
+/// Returns the latest snapshot for one shutdown operation id.
+pub async fn show_shutdown(
+    Path((pipeline_group_id, pipeline_id, shutdown_id)): Path<(String, String, String)>,
+    State(state): State<AppState>,
+) -> Result<Json<ShutdownStatus>, StatusCode> {
+    match state
+        .controller
+        .shutdown_status(&pipeline_group_id, &pipeline_id, &shutdown_id)
+    {
+        Ok(Some(status)) => Ok(Json(status)),
+        Ok(None) => Err(StatusCode::NOT_FOUND),
+        Err(crate::ControlPlaneError::ShutdownNotFound) => Err(StatusCode::NOT_FOUND),
+        Err(_) => Err(StatusCode::INTERNAL_SERVER_ERROR),
+    }
+}
+
+/// Starts a tracked shutdown for one logical pipeline and optionally waits.
+pub async fn shutdown_pipeline(
+    Path((pipeline_group_id, pipeline_id)): Path<(String, String)>,
+    Query(params): Query<WaitParams>,
+    State(state): State<AppState>,
+) -> impl IntoResponse {
+    otel_info!(
+        "pipeline.shutdown.requested",
+        pipeline_group_id = pipeline_group_id.as_str(),
+        pipeline_id = pipeline_id.as_str(),
+        wait = params.wait,
+        timeout_secs = params.timeout_secs
+    );
+
+    match state
+        .controller
+        .shutdown_pipeline(&pipeline_group_id, &pipeline_id, params.timeout_secs)
+    {
+        Ok(shutdown) => {
+            if !params.wait {
+                return (StatusCode::ACCEPTED, Json(shutdown)).into_response();
+            }
+
+            let deadline = Instant::now() + Duration::from_secs(params.timeout_secs);
+            let mut last_status = Some(shutdown);
+            loop {
+                let Some(shutdown_id) = last_status
+                    .as_ref()
+                    .map(|status| status.shutdown_id.clone())
+                else {
+                    return operation_error_response(
+                        StatusCode::INTERNAL_SERVER_ERROR,
+                        crate::ControlPlaneError::Internal {
+                            message: "initial shutdown status disappeared while waiting"
+                                .to_string(),
+                        },
+                    );
+                };
+                match state.controller.shutdown_status(
+                    &pipeline_group_id,
+                    &pipeline_id,
+                    &shutdown_id,
+                ) {
+                    Ok(Some(current)) if shutdown_is_terminal(&current.state) => {
+                        let status = if shutdown_is_success(&current.state) {
+                            StatusCode::OK
+                        } else {
+                            StatusCode::CONFLICT
+                        };
+                        return (status, Json(current)).into_response();
+                    }
+                    Ok(Some(current)) => {
+                        last_status = Some(current);
+                    }
+                    Ok(None) | Err(crate::ControlPlaneError::ShutdownNotFound) => {
+                        return operation_error_response(
+                            StatusCode::NOT_FOUND,
+                            crate::ControlPlaneError::ShutdownNotFound,
+                        );
+                    }
+                    Err(other) => {
+                        return operation_error_response(StatusCode::INTERNAL_SERVER_ERROR, other);
+                    }
+                }
+
+                if Instant::now() >= deadline {
+                    return match last_status {
+                        Some(status) => (StatusCode::GATEWAY_TIMEOUT, Json(status)).into_response(),
+                        None => operation_error_response(
+                            StatusCode::INTERNAL_SERVER_ERROR,
+                            crate::ControlPlaneError::Internal {
+                                message: "shutdown status disappeared before timeout response"
+                                    .to_string(),
+                            },
+                        ),
+                    };
+                }
+
+                tokio::time::sleep(Duration::from_millis(100)).await;
+            }
+        }
+        Err(error @ crate::ControlPlaneError::GroupNotFound)
+        | Err(error @ crate::ControlPlaneError::PipelineNotFound) => {
+            operation_error_response(StatusCode::NOT_FOUND, error)
+        }
+        Err(crate::ControlPlaneError::RolloutConflict) => operation_error_response(
+            StatusCode::CONFLICT,
+            crate::ControlPlaneError::RolloutConflict,
+        ),
+        Err(crate::ControlPlaneError::InvalidRequest { message }) => operation_error_response(
+            StatusCode::UNPROCESSABLE_ENTITY,
+            crate::ControlPlaneError::InvalidRequest { message },
+        ),
+        Err(other) => operation_error_response(StatusCode::INTERNAL_SERVER_ERROR, other),
+    }
+}
+
+/// Returns aggregated runtime status for one logical pipeline.
 pub async fn show_status(
     Path((pipeline_group_id, pipeline_id)): Path<(String, String)>,
     State(state): State<AppState>,
@@ -66,6 +393,8 @@ pub async fn show_status(
 /// - Should be cheap and internal (not dependent on external systems).
 ///
 /// ToDo Implement heartbeat checks.
+///
+/// Serves the liveness probe for one logical pipeline.
 async fn liveness(
     Path((pipeline_group_id, pipeline_id)): Path<(String, String)>,
     State(state): State<AppState>,
@@ -86,6 +415,8 @@ async fn liveness(
 /// - Gate traffic until startup work is done (pipeline deployed and running).
 /// - Temporarily remove the Pod from load balancing when it can't serve correctly.
 /// - Can check key dependencies, but avoid making it too fragile.
+///
+/// Serves the readiness probe for one logical pipeline.
 async fn readiness(
     Path((pipeline_group_id, pipeline_id)): Path<(String, String)>,
     State(state): State<AppState>,
@@ -98,3 +429,317 @@ async fn readiness(
         (StatusCode::SERVICE_UNAVAILABLE, "NOT OK")
     }
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use crate::{ControlPlane, ControlPlaneError, PipelineDetails, RolloutStatus, ShutdownStatus};
+    use axum::body::to_bytes;
+    use otap_df_admin_types::operations::{OperationError, OperationErrorKind};
+    use otap_df_config::observed_state::ObservedStateSettings;
+    use otap_df_engine::memory_limiter::MemoryPressureState;
+    use otap_df_state::store::ObservedStateStore;
+    use otap_df_telemetry::registry::TelemetryRegistryHandle;
+    use serde_json::json;
+    use std::sync::Arc;
+
+    #[derive(Clone)]
+    struct StubControlPlane {
+        replace_result: Result<RolloutStatus, ControlPlaneError>,
+        rollout_status_result: Result<Option<RolloutStatus>, ControlPlaneError>,
+        shutdown_result: Result<ShutdownStatus, ControlPlaneError>,
+        shutdown_status_result: Result<Option<ShutdownStatus>, ControlPlaneError>,
+    }
+
+    impl ControlPlane for StubControlPlane {
+        fn shutdown_all(&self, _timeout_secs: u64) -> Result<(), ControlPlaneError> {
+            Ok(())
+        }
+
+        fn shutdown_pipeline(
+            &self,
+            _pipeline_group_id: &str,
+            _pipeline_id: &str,
+            _timeout_secs: u64,
+        ) -> Result<ShutdownStatus, ControlPlaneError> {
+            self.shutdown_result.clone()
+        }
+
+        fn reconfigure_pipeline(
+            &self,
+            _pipeline_group_id: &str,
+            _pipeline_id: &str,
+            _request: crate::ReconfigureRequest,
+        ) -> Result<RolloutStatus, ControlPlaneError> {
+            self.replace_result.clone()
+        }
+
+        fn pipeline_details(
+            &self,
+            _pipeline_group_id: &str,
+            _pipeline_id: &str,
+        ) -> Result<Option<PipelineDetails>, ControlPlaneError> {
+            Ok(None)
+        }
+
+        fn rollout_status(
+            &self,
+            _pipeline_group_id: &str,
+            _pipeline_id: &str,
+            _rollout_id: &str,
+        ) -> Result<Option<RolloutStatus>, ControlPlaneError> {
+            self.rollout_status_result.clone()
+        }
+
+        fn shutdown_status(
+            &self,
+            _pipeline_group_id: &str,
+            _pipeline_id: &str,
+            _shutdown_id: &str,
+        ) -> Result<Option<ShutdownStatus>, ControlPlaneError> {
+            self.shutdown_status_result.clone()
+        }
+    }
+
+    fn test_app_state(controller: Arc<dyn ControlPlane>) -> AppState {
+        let metrics_registry = TelemetryRegistryHandle::new();
+        let observed_state_store =
+            ObservedStateStore::new(&ObservedStateSettings::default(), metrics_registry.clone());
+
+        AppState {
+            observed_state_store: observed_state_store.handle(),
+            metrics_registry,
+            controller,
+            log_tap: None,
+            memory_pressure_state: MemoryPressureState::default(),
+        }
+    }
+
+    fn request() -> crate::ReconfigureRequest {
+        crate::ReconfigureRequest {
+            pipeline: serde_json::from_value(json!({
+                "type": "otap",
+                "nodes": {
+                    "recv": {
+                        "type": "receiver:fake",
+                        "config": {}
+                    }
+                }
+            }))
+            .expect("fixture pipeline should deserialize"),
+            step_timeout_secs: 60,
+            drain_timeout_secs: 60,
+        }
+    }
+
+    fn rollout_status(state: PipelineRolloutState) -> RolloutStatus {
+        serde_json::from_value(json!({
+            "rolloutId": "rollout-1",
+            "pipelineGroupId": "default",
+            "pipelineId": "main",
+            "action": "replace",
+            "state": state,
+            "targetGeneration": 1,
+            "previousGeneration": 0,
+            "startedAt": "2026-01-01T00:00:00Z",
+            "updatedAt": "2026-01-01T00:00:01Z",
+            "cores": []
+        }))
+        .expect("fixture rollout status should deserialize")
+    }
+
+    fn shutdown_status(state: &str) -> ShutdownStatus {
+        serde_json::from_value(json!({
+            "shutdownId": "shutdown-1",
+            "pipelineGroupId": "default",
+            "pipelineId": "main",
+            "state": state,
+            "startedAt": "2026-01-01T00:00:00Z",
+            "updatedAt": "2026-01-01T00:00:01Z",
+            "cores": []
+        }))
+        .expect("fixture shutdown status should deserialize")
+    }
+
+    /// Scenario: the control plane rejects a pipeline reconfigure request
+    /// before rollout work starts.
+    /// Guarantees: the admin handler converts that rejection into a structured
+    /// operation-error body with the expected HTTP status.
+    #[tokio::test]
+    async fn put_pipeline_returns_operation_error_body_on_invalid_request() {
+        let response = put_pipeline(
+            Path(("default".to_string(), "main".to_string())),
+            Query(WaitParams {
+                wait: false,
+                timeout_secs: 60,
+            }),
+            State(test_app_state(Arc::new(StubControlPlane {
+                replace_result: Err(ControlPlaneError::InvalidRequest {
+                    message: "invalid candidate".to_string(),
+                }),
+                rollout_status_result: Ok(None),
+                shutdown_result: Ok(shutdown_status("succeeded")),
+                shutdown_status_result: Ok(None),
+            }))),
+            Json(request()),
+        )
+        .await
+        .into_response();
+
+        assert_eq!(response.status(), StatusCode::UNPROCESSABLE_ENTITY);
+        let body = to_bytes(response.into_body(), usize::MAX)
+            .await
+            .expect("body should collect");
+        let error: OperationError =
+            serde_json::from_slice(&body).expect("error body should deserialize");
+        assert_eq!(error.kind, OperationErrorKind::InvalidRequest);
+        assert_eq!(error.message.as_deref(), Some("invalid candidate"));
+    }
+
+    /// Scenario: a waited pipeline reconfigure request times out and the
+    /// control plane can still report the latest rollout snapshot.
+    /// Guarantees: the admin handler returns HTTP 504 with that rollout status
+    /// body instead of dropping the operation context.
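+    /// A `timeout_secs` of zero makes the wait deadline expire on its first
+    /// check, so the test is deterministic and never sleeps.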
+ #[tokio::test] + async fn put_pipeline_timeout_returns_latest_rollout_status_snapshot() { + let response = put_pipeline( + Path(("default".to_string(), "main".to_string())), + Query(WaitParams { + wait: true, + timeout_secs: 0, + }), + State(test_app_state(Arc::new(StubControlPlane { + replace_result: Ok(rollout_status(PipelineRolloutState::Running)), + rollout_status_result: Ok(Some(rollout_status(PipelineRolloutState::Running))), + shutdown_result: Ok(shutdown_status("succeeded")), + shutdown_status_result: Ok(None), + }))), + Json(request()), + ) + .await + .into_response(); + + assert_eq!(response.status(), StatusCode::GATEWAY_TIMEOUT); + let body = to_bytes(response.into_body(), usize::MAX) + .await + .expect("body should collect"); + let status: RolloutStatus = + serde_json::from_slice(&body).expect("timeout body should deserialize"); + assert_eq!(status.rollout_id, "rollout-1"); + assert_eq!(status.state, PipelineRolloutState::Running); + } + + /// Scenario: a pipeline shutdown request collides with an active rollout + /// for the same logical pipeline. + /// Guarantees: the admin handler returns a typed conflict body so callers + /// can distinguish request rejection from shutdown progress. + #[tokio::test] + async fn shutdown_pipeline_returns_operation_error_body_on_conflict() { + let response = shutdown_pipeline( + Path(("default".to_string(), "main".to_string())), + Query(WaitParams { + wait: false, + timeout_secs: 60, + }), + State(test_app_state(Arc::new(StubControlPlane { + replace_result: Ok(rollout_status(PipelineRolloutState::Succeeded)), + rollout_status_result: Ok(None), + shutdown_result: Err(ControlPlaneError::RolloutConflict), + shutdown_status_result: Ok(None), + }))), + ) + .await + .into_response(); + + assert_eq!(response.status(), StatusCode::CONFLICT); + let body = to_bytes(response.into_body(), usize::MAX) + .await + .expect("body should collect"); + let error: OperationError = + serde_json::from_slice(&body).expect("error body should deserialize"); + assert_eq!(error.kind, OperationErrorKind::Conflict); + assert_eq!(error.message, None); + } + + /// Scenario: a waited pipeline shutdown request times out while the control + /// plane still has a current shutdown snapshot. + /// Guarantees: the admin handler responds with HTTP 504 and the latest + /// shutdown status body for follow-up polling. + #[tokio::test] + async fn shutdown_pipeline_timeout_returns_latest_status_snapshot() { + let response = shutdown_pipeline( + Path(("default".to_string(), "main".to_string())), + Query(WaitParams { + wait: true, + timeout_secs: 0, + }), + State(test_app_state(Arc::new(StubControlPlane { + replace_result: Ok(rollout_status(PipelineRolloutState::Succeeded)), + rollout_status_result: Ok(None), + shutdown_result: Ok(shutdown_status("running")), + shutdown_status_result: Ok(Some(shutdown_status("running"))), + }))), + ) + .await + .into_response(); + + assert_eq!(response.status(), StatusCode::GATEWAY_TIMEOUT); + let body = to_bytes(response.into_body(), usize::MAX) + .await + .expect("body should collect"); + let status: ShutdownStatus = + serde_json::from_slice(&body).expect("timeout body should deserialize"); + assert_eq!(status.shutdown_id, "shutdown-1"); + assert_eq!(status.state, "running"); + } + + /// Scenario: a caller asks for a rollout status id that is no longer + /// available from the control plane. + /// Guarantees: the admin handler returns HTTP 404 so evicted rollout + /// history is observable as not found. 
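+    /// The stub reports `Ok(None)` to model bounded-retention eviction rather
+    /// than a control-plane failure.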
+ #[tokio::test] + async fn show_rollout_returns_not_found_when_status_is_missing() { + let response = show_rollout( + Path(( + "default".to_string(), + "main".to_string(), + "rollout-1".to_string(), + )), + State(test_app_state(Arc::new(StubControlPlane { + replace_result: Ok(rollout_status(PipelineRolloutState::Succeeded)), + rollout_status_result: Ok(None), + shutdown_result: Ok(shutdown_status("succeeded")), + shutdown_status_result: Ok(None), + }))), + ) + .await + .into_response(); + + assert_eq!(response.status(), StatusCode::NOT_FOUND); + } + + /// Scenario: a caller asks for a shutdown status id that is no longer + /// available from the control plane. + /// Guarantees: the admin handler returns HTTP 404 so evicted shutdown + /// history is observable as not found. + #[tokio::test] + async fn show_shutdown_returns_not_found_when_status_is_missing() { + let response = show_shutdown( + Path(( + "default".to_string(), + "main".to_string(), + "shutdown-1".to_string(), + )), + State(test_app_state(Arc::new(StubControlPlane { + replace_result: Ok(rollout_status(PipelineRolloutState::Succeeded)), + rollout_status_result: Ok(None), + shutdown_result: Ok(shutdown_status("succeeded")), + shutdown_status_result: Ok(None), + }))), + ) + .await + .into_response(); + + assert_eq!(response.status(), StatusCode::NOT_FOUND); + } +} diff --git a/rust/otap-dataflow/crates/admin/src/pipeline_group.rs b/rust/otap-dataflow/crates/admin/src/pipeline_group.rs index e98efc3dcb..4a9de0909f 100644 --- a/rust/otap-dataflow/crates/admin/src/pipeline_group.rs +++ b/rust/otap-dataflow/crates/admin/src/pipeline_group.rs @@ -3,19 +3,19 @@ //! Pipeline group endpoints. //! -//! - GET `/api/v1/pipeline-groups/:id/pipelines` - list active pipelines and their status (ToDo) -//! - POST `/api/v1/pipeline-groups/shutdown` - shutdown all pipelines in all groups +//! - GET `/api/v1/groups/:id/pipelines` - list active pipelines and their status (ToDo) +//! - POST `/api/v1/groups/shutdown` - shutdown all pipelines in all groups //! - Query parameters: //! - `wait` (bool, default: false) - if true, block until all pipelines have stopped //! - `timeout_secs` (u64, default: 60) - maximum seconds to wait when `wait=true` //! //! Example (fire-and-forget): //! ```sh -//! curl -X POST http://localhost:8080/api/v1/pipeline-groups/shutdown +//! curl -X POST http://localhost:8080/api/v1/groups/shutdown //! ``` //! Example (wait for graceful shutdown with 30s timeout): //! ```sh -//! curl -X POST "http://localhost:8080/api/v1/pipeline-groups/shutdown?wait=true&timeout_secs=30" +//! curl -X POST "http://localhost:8080/api/v1/groups/shutdown?wait=true&timeout_secs=30" //! ``` //! //! - 200 OK if `wait=true` and all pipelines stopped successfully @@ -37,8 +37,8 @@ use axum::routing::{get, post}; use axum::{Json, Router}; use chrono::Utc; use otap_df_admin_types::{ + groups::{ShutdownResponse, ShutdownStatus, Status as GroupsStatus}, operations::OperationOptions, - pipeline_groups::{ShutdownResponse, ShutdownStatus, Status as PipelineGroupsStatus}, }; use otap_df_telemetry::otel_info; use std::time::{Duration, Instant}; @@ -47,16 +47,14 @@ use std::time::{Duration, Instant}; pub(crate) fn routes() -> Router { Router::new() // Returns a summary of all pipelines and their statuses. - .route("/pipeline-groups/status", get(show_status)) + .route("/groups/status", get(show_status)) // Shutdown all pipelines in all groups. 
- .route("/pipeline-groups/shutdown", post(shutdown_all_pipelines)) + .route("/groups/shutdown", post(shutdown_all_pipelines)) // ToDo Global liveness and readiness probes. } -pub async fn show_status( - State(state): State, -) -> Result, StatusCode> { - Ok(Json(PipelineGroupsStatus { +pub async fn show_status(State(state): State) -> Result, StatusCode> { + Ok(Json(GroupsStatus { generated_at: Utc::now().to_rfc3339(), pipelines: json_shape(&state.observed_state_store.snapshot()), })) @@ -74,31 +72,13 @@ async fn shutdown_all_pipelines( timeout_secs = params.timeout_secs ); - // Send shutdown message to all pipelines - let errors: Vec<_> = (*state.ctrl_msg_senders.lock().await) - .drain(..) - .filter_map(|sender| { - // Use the timeout from params for the shutdown deadline - let deadline = Instant::now() + Duration::from_secs(params.timeout_secs); - sender - .try_send_shutdown( - deadline, - "Shutdown requested via the `/api/v1/pipeline-groups/shutdown` endpoint." - .to_owned(), - ) - .err() - }) - .map(|e| e.to_string()) - .collect(); - - // If there were errors sending shutdown messages, return immediately - if !errors.is_empty() { - otel_info!("shutdown.failed", error_count = errors.len()); + if let Err(err) = state.controller.shutdown_all(params.timeout_secs) { + otel_info!("shutdown.failed", error = ?err); return ( StatusCode::INTERNAL_SERVER_ERROR, Json(ShutdownResponse { status: ShutdownStatus::Failed, - errors: Some(errors), + errors: Some(vec![format!("{err:?}")]), duration_ms: Some(start_time.elapsed().as_millis() as u64), }), ); diff --git a/rust/otap-dataflow/crates/admin/src/telemetry.rs b/rust/otap-dataflow/crates/admin/src/telemetry.rs index 8e014a9fa7..87b1748d35 100644 --- a/rust/otap-dataflow/crates/admin/src/telemetry.rs +++ b/rust/otap-dataflow/crates/admin/src/telemetry.rs @@ -5,11 +5,12 @@ //! //! - /api/v1/telemetry/live-schema - current semantic conventions registry //! - /api/v1/telemetry/logs - retained internal logs from the in-memory log tap +//! - /api/v1/telemetry/logs/stream - live internal log stream over WebSocket //! - /api/v1/telemetry/metrics - current aggregated metrics in JSON, line protocol, or Prometheus text format //! 
 //! - /api/v1/telemetry/metrics/aggregate - aggregated metrics grouped by metric set name and optional attributes
 
 use crate::AppState;
-use crate::convert::{convert_attribute_value, json_shape};
+use crate::convert::json_shape;
 use axum::extract::ws::{Message, WebSocket, WebSocketUpgrade};
 use axum::extract::{Query, State};
 use axum::http::{StatusCode, header};
@@ -27,7 +28,7 @@ use otap_df_telemetry::self_tracing::format_log_record_to_string;
 use otap_df_telemetry::semconv::SemConvRegistry;
 use serde::{Deserialize, Serialize};
 use std::collections::hash_map::Entry;
-use std::collections::{BTreeMap, HashMap, HashSet};
+use std::collections::{HashMap, HashSet};
 use std::fmt::Write as _;
 use std::sync::Arc;
 use tokio::sync::broadcast;
@@ -147,8 +148,40 @@ struct AggregateGroup {
     metrics: HashMap,
 }
 
-fn logs_response(registry: &TelemetryRegistryHandle, result: LogQueryResult) -> api::LogsResponse {
-    api::LogsResponse {
+#[derive(Serialize)]
+pub(crate) struct LogsResponse {
+    oldest_seq: Option<u64>,
+    newest_seq: Option<u64>,
+    next_seq: u64,
+    truncated_before_seq: Option<u64>,
+    dropped_on_ingest: u64,
+    dropped_on_retention: u64,
+    retained_bytes: usize,
+    logs: Vec<LogEntry>,
+}
+
+#[derive(Serialize)]
+struct LogEntry {
+    seq: u64,
+    timestamp: String,
+    level: String,
+    target: String,
+    event_name: String,
+    file: Option<String>,
+    line: Option<u32>,
+    rendered: String,
+    contexts: Vec<ResolvedLogContext>,
+}
+
+#[derive(Serialize)]
+struct ResolvedLogContext {
+    entity_key: String,
+    schema_name: Option<String>,
+    attributes: HashMap,
+}
+
+fn logs_response(registry: &TelemetryRegistryHandle, result: LogQueryResult) -> LogsResponse {
+    LogsResponse {
         oldest_seq: result.oldest_seq,
         newest_seq: result.newest_seq,
         next_seq: result.next_seq,
@@ -164,9 +197,9 @@ fn logs_response(registry: &TelemetryRegistryHandle, result: LogQueryResult) ->
     }
 }
 
-fn render_log_entry(registry: &TelemetryRegistryHandle, entry: &RetainedLogEvent) -> api::LogEntry {
+fn render_log_entry(registry: &TelemetryRegistryHandle, entry: &RetainedLogEvent) -> LogEntry {
     let callsite = entry.event.record.callsite();
-    api::LogEntry {
+    LogEntry {
         seq: entry.seq,
         timestamp: chrono::DateTime::<chrono::Utc>::from(entry.event.time).to_rfc3339(),
         level: callsite.level().to_string(),
@@ -186,25 +219,25 @@ fn render_log_message(event: &LogEvent) -> String {
 fn resolve_log_contexts(
     registry: &TelemetryRegistryHandle,
     event: &LogEvent,
-) -> Vec<api::ResolvedLogContext> {
+) -> Vec<ResolvedLogContext> {
     event
         .record
         .context
         .iter()
         .map(|entity_key| {
             registry
-                .visit_entity(*entity_key, |attrs| api::ResolvedLogContext {
+                .visit_entity(*entity_key, |attrs| ResolvedLogContext {
                     entity_key: format!("{entity_key:?}"),
                     schema_name: Some(attrs.schema_name().to_string()),
                     attributes: attrs
                         .iter_attributes()
-                        .map(|(key, value)| (key.to_string(), convert_attribute_value(value)))
+                        .map(|(key, value)| (key.to_string(), value.clone()))
                        .collect(),
                 })
-                .unwrap_or_else(|| api::ResolvedLogContext {
+                .unwrap_or_else(|| ResolvedLogContext {
                     entity_key: format!("{entity_key:?}"),
                     schema_name: None,
-                    attributes: BTreeMap::new(),
+                    attributes: HashMap::new(),
                 })
         })
         .collect()
@@ -233,7 +266,10 @@ pub async fn get_logs(
         after: q.after,
         limit,
     });
-    Ok(Json(logs_response(&state.metrics_registry, result)))
+    Ok(Json(json_shape(&logs_response(
+        &state.metrics_registry,
+        result,
+    ))))
 }
 
 /// Handler for the `/api/v1/telemetry/metrics` endpoint.
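For orientation, here is a hypothetical response for `GET /api/v1/telemetry/logs` assembled from the `LogsResponse` fields above. The camelCase key style assumes `json_shape` applies the same key shaping used by the other admin endpoints; every value shown is illustrative:

```json
{
  "oldestSeq": 12,
  "newestSeq": 57,
  "nextSeq": 58,
  "truncatedBeforeSeq": 12,
  "droppedOnIngest": 0,
  "droppedOnRetention": 3,
  "retainedBytes": 8192,
  "logs": [
    {
      "seq": 57,
      "timestamp": "2026-01-01T00:00:00+00:00",
      "level": "INFO",
      "target": "otap_df_engine",
      "eventName": "pipeline.started",
      "file": null,
      "line": null,
      "rendered": "pipeline.started core=0",
      "contexts": []
    }
  ]
}
```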
@@ -1242,7 +1278,7 @@ fn escape_prom_help(s: &str) -> String {
 // WebSocket live log stream (/api/v1/telemetry/logs/stream)
 // ---------------------------------------------------------------------------
 
-/// Map a level string to a numeric severity (TRACE=0 … ERROR=4).
+/// Map a level string to a numeric severity (TRACE=0 through ERROR=4).
 /// Unknown levels are treated as TRACE (lowest severity).
 ///
 /// Uses ASCII-only comparison to avoid allocating a temporary uppercase string.
@@ -1289,7 +1325,7 @@ struct LogFilter {
 
 impl LogFilter {
     /// Returns `true` when the rendered log entry passes all active criteria.
-    fn matches(&self, entry: &api::LogEntry) -> bool {
+    fn matches(&self, entry: &LogEntry) -> bool {
         if let Some(min_ts) = &self.minimum_timestamp {
             if let Ok(ts) = chrono::DateTime::parse_from_rfc3339(&entry.timestamp) {
                 if ts.with_timezone(&chrono::Utc) < *min_ts {
@@ -1331,7 +1367,7 @@ impl LogFilter {
     ///
     /// Checks `minimum_level` and `minimum_timestamp` without rendering the
     /// entry, so we can skip the more expensive `render_log_entry()` call for
-    /// events that would be rejected anyway.  `search_query` is intentionally
+    /// events that would be rejected anyway. `search_query` is intentionally
     /// not checked here because it operates on the rendered text.
     fn prefilter_raw(&self, event: &RetainedLogEvent) -> bool {
         if let Some(min_ts) = &self.minimum_timestamp {
@@ -1373,16 +1409,16 @@ impl LogFilter {
     }
 }
 
-/// Client → server WebSocket messages.
+/// Client to server WebSocket messages.
 #[derive(Deserialize)]
 #[serde(tag = "type", rename_all = "camelCase")]
 enum WsClientMsg {
-    /// Begin streaming.  Sends an initial retained-log snapshot, then follows
+    /// Begin streaming. Sends an initial retained-log snapshot, then follows
     /// with live events.
     Subscribe {
         /// Cursor: only include retained entries strictly newer than this seq.
         after: Option<u64>,
-        /// Maximum retained entries in the initial snapshot (clamped 1–1000).
+        /// Maximum retained entries in the initial snapshot (clamped 1-1000).
         limit: Option<usize>,
         /// Case-insensitive text filter (applied server-side).
         #[serde(rename = "searchQuery")]
@@ -1419,7 +1455,7 @@ enum WsClientMsg {
     Ping,
 }
 
-/// Server → client WebSocket messages.
+/// Server to client WebSocket messages.
 #[derive(Serialize)]
 #[serde(tag = "type", rename_all = "snake_case")]
 enum WsServerMsg {
@@ -1432,12 +1468,12 @@ enum WsServerMsg {
         dropped_on_ingest: u64,
         dropped_on_retention: u64,
         retained_bytes: usize,
-        logs: Vec<api::LogEntry>,
+        logs: Vec<LogEntry>,
     },
     /// Single live log entry pushed by the server.
     Log {
         #[serde(flatten)]
-        entry: api::LogEntry,
+        entry: LogEntry,
     },
     /// Current pause state and cursor position.
     State { paused: bool, next_seq: u64 },
@@ -1501,10 +1537,10 @@ async fn ws_send_snapshot(
 /// 2. The server sends the initial retained-log snapshot, then streams live
 ///    events via `log` messages.
 /// 3. `pause` / `resume` toggle server-side forwarding without closing the
-///    socket.  While paused the server still drains the broadcast channel so
+///    socket. While paused the server still drains the broadcast channel so
 ///    that the producer is never slowed by this client.
 /// 4. On `backfill` the server re-queries the retained ring buffer and sends a
-///    `snapshot`.  The cursor is updated so subsequent live events do not
+///    `snapshot`. The cursor is updated so subsequent live events do not
 ///    duplicate.
 /// 5.
If the client falls more than `SUBSCRIBER_CHANNEL_CAPACITY` events /// behind, the broadcast channel drops the overflow; the server notifies the @@ -1529,8 +1565,8 @@ async fn handle_ws_logs(mut ws: WebSocket, state: AppState) { let mut paused = false; let mut filter = LogFilter::default(); // Tracks the sequence number of the last event we acknowledged (sent or - // deliberately skipped while paused). Used in `state` replies so the - // client knows where the live cursor stands. + // deliberately skipped while paused). Used in `state` replies so the client + // knows where the live cursor stands. let mut cursor: u64 = 0; loop { @@ -1562,11 +1598,9 @@ async fn handle_ws_logs(mut ws: WebSocket, state: AppState) { Ok(WsClientMsg::Backfill { after, limit }) => { let limit = limit.unwrap_or(100).clamp(1, 1000); let result = log_tap.query(LogQuery { after, limit }); - // Only advance cursor — never move it backward. A - // client may request an older `after` (e.g. a lag - // gap backfill) while the live stream has already - // moved the cursor forward; preserving the maximum - // keeps the dedup guard in the live event arm sound. + // Only advance cursor; never move it backward. A client may + // request an older `after` (e.g. a lag gap backfill) while the + // live stream has already moved the cursor forward. cursor = cursor.max(result.next_seq); if !ws_send_snapshot(&mut ws, registry, result, &filter).await { break; @@ -1586,7 +1620,7 @@ async fn handle_ws_logs(mut ws: WebSocket, state: AppState) { } } Some(Ok(Message::Close(_))) | None => break, - Some(Ok(_)) => {} // binary / ping frames — ignore + Some(Ok(_)) => {} // binary / ping frames; ignore Some(Err(_)) => break, } } @@ -1600,7 +1634,7 @@ async fn handle_ws_logs(mut ws: WebSocket, state: AppState) { // were already delivered in the most recent snapshot // or backfill (the subscribe-before-query race window). if entry_seq <= cursor { - // Discard silently — already in the snapshot. + // Discard silently; already in the snapshot. } else { // Advance cursor so `state` replies are accurate // even when paused or filtered. @@ -1627,8 +1661,8 @@ async fn handle_ws_logs(mut ws: WebSocket, state: AppState) { } Err(broadcast::error::RecvError::Lagged(n)) => { // The client was too slow; events were dropped from its - // receiver slot. `cursor` here is the last seq we - // successfully delivered — the client can use it as + // receiver slot. `cursor` here is the last seq we + // successfully delivered; the client can use it as // the `after` param for a backfill to recover the gap. let msg = WsServerMsg::Error { message: format!( @@ -1661,7 +1695,7 @@ async fn handle_ws_logs(mut ws: WebSocket, state: AppState) { { // Subscribe to the broadcast channel BEFORE querying // retained logs so we do not miss events recorded between - // the query and the first receive. Live events with + // the query and the first receive. Live events with // seq <= cursor (set from snapshot.next_seq below) are // silently discarded in the live_event arm to prevent // duplicates for that race window. 
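To make the stream protocol concrete, here is a hypothetical exchange implied by the `WsClientMsg` and `WsServerMsg` definitions above: client messages carry camelCase `type` tags, server messages carry snake_case tags, payload values are illustrative, and fields elided here are marked with `...`:

```text
client -> server
  {"type": "subscribe", "after": null, "limit": 100, "searchQuery": "shutdown"}
  {"type": "pause"}
  {"type": "resume"}
  {"type": "backfill", "after": 42, "limit": 100}
  {"type": "ping"}

server -> client
  {"type": "snapshot", "next_seq": 58, "retained_bytes": 8192, "logs": [...], ...}
  {"type": "log", "seq": 58, "level": "INFO", "rendered": "...", ...}
  {"type": "state", "paused": false, "next_seq": 58}
  {"type": "error", "message": "..."}
```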
@@ -1703,13 +1737,75 @@ async fn handle_ws_logs(mut ws: WebSocket, state: AppState) {
 #[cfg(test)]
 mod tests {
     use super::*;
-    use axum::body::to_bytes;
+    use crate::{
+        ControlPlane, ControlPlaneError, PipelineDetails, ReconfigureRequest, RolloutStatus,
+        ShutdownStatus,
+    };
+    use axum::body::{Body, to_bytes};
     use otap_df_config::observed_state::ObservedStateSettings;
     use otap_df_engine::memory_limiter::MemoryPressureState;
     use otap_df_state::store::ObservedStateStore;
     use otap_df_telemetry::descriptor::{Instrument, MetricsField, Temporality};
     use std::sync::Arc;
-    use tokio::sync::Mutex;
+    use tower::ServiceExt;
+
+    struct NoopControlPlane;
+
+    impl ControlPlane for NoopControlPlane {
+        fn shutdown_all(&self, _timeout_secs: u64) -> Result<(), ControlPlaneError> {
+            Err(ControlPlaneError::Internal {
+                message: "not used in telemetry tests".to_string(),
+            })
+        }
+
+        fn shutdown_pipeline(
+            &self,
+            _pipeline_group_id: &str,
+            _pipeline_id: &str,
+            _timeout_secs: u64,
+        ) -> Result<ShutdownStatus, ControlPlaneError> {
+            Err(ControlPlaneError::Internal {
+                message: "not used in telemetry tests".to_string(),
+            })
+        }
+
+        fn reconfigure_pipeline(
+            &self,
+            _pipeline_group_id: &str,
+            _pipeline_id: &str,
+            _request: ReconfigureRequest,
+        ) -> Result<RolloutStatus, ControlPlaneError> {
+            Err(ControlPlaneError::Internal {
+                message: "not used in telemetry tests".to_string(),
+            })
+        }
+
+        fn pipeline_details(
+            &self,
+            _pipeline_group_id: &str,
+            _pipeline_id: &str,
+        ) -> Result<Option<PipelineDetails>, ControlPlaneError> {
+            Ok(None)
+        }
+
+        fn rollout_status(
+            &self,
+            _pipeline_group_id: &str,
+            _pipeline_id: &str,
+            _rollout_id: &str,
+        ) -> Result<Option<RolloutStatus>, ControlPlaneError> {
+            Ok(None)
+        }
+
+        fn shutdown_status(
+            &self,
+            _pipeline_group_id: &str,
+            _pipeline_id: &str,
+            _shutdown_id: &str,
+        ) -> Result<Option<ShutdownStatus>, ControlPlaneError> {
+            Ok(None)
+        }
+    }
 
     static TEST_METRICS_DESCRIPTOR: MetricsDescriptor = MetricsDescriptor {
         name: "test_metrics",
@@ -1753,8 +1849,8 @@ mod tests {
         AppState {
             observed_state_store: observed_state_store.handle(),
             metrics_registry,
+            controller: Arc::new(NoopControlPlane),
             log_tap: None,
-            ctrl_msg_senders: Arc::new(Mutex::new(Vec::new())),
             memory_pressure_state: MemoryPressureState::default(),
         }
     }
@@ -1789,6 +1885,22 @@ mod tests {
         );
     }
 
+    #[tokio::test]
+    async fn telemetry_routes_include_logs_stream_websocket_endpoint() {
+        let response = routes()
+            .with_state(test_app_state())
+            .oneshot(
+                axum::http::Request::builder()
+                    .uri("/telemetry/logs/stream")
+                    .body(Body::empty())
+                    .expect("request should build"),
+            )
+            .await
+            .expect("route should respond");
+
+        assert_ne!(response.status(), StatusCode::NOT_FOUND);
+    }
+
     /// Ensures aggregate group ordering is deterministic: metric-set name first,
     /// then metric count when names are equal.
     #[test]
@@ -2218,8 +2330,8 @@ mod tests {
     // LogFilter unit tests
     // ---------------------------------------------------------------------------
 
-    fn make_log_entry(rendered: &str, level: &str, target: &str, timestamp: &str) -> api::LogEntry {
-        api::LogEntry {
+    fn make_log_entry(rendered: &str, level: &str, target: &str, timestamp: &str) -> LogEntry {
+        LogEntry {
             seq: 1,
             timestamp: timestamp.to_string(),
             level: level.to_string(),
@@ -2299,11 +2411,9 @@ mod tests {
             "2026-01-01T00:00:01Z",
         );
 
-        // The filter must pass the matching entry and reject the other.
         assert!(filter.matches(&match_entry));
         assert!(!filter.matches(&no_match_entry));
 
-        // Simulate the retain() call used in ws_send_snapshot.
let mut logs = vec![match_entry, no_match_entry]; logs.retain(|e| filter.matches(e)); assert_eq!(logs.len(), 1); @@ -2312,7 +2422,6 @@ mod tests { #[test] fn level_severity_ordering_is_correct() { - // TRACE < DEBUG < INFO < WARN < ERROR assert!(level_severity("TRACE") < level_severity("DEBUG")); assert!(level_severity("DEBUG") < level_severity("INFO")); assert!(level_severity("INFO") < level_severity("WARN")); @@ -2361,7 +2470,6 @@ mod tests { #[test] fn log_filter_minimum_level_and_search_query_combine() { - // Both constraints must pass. let filter = LogFilter::from_params( Some("critical".to_string()), None, @@ -2376,28 +2484,35 @@ mod tests { } // --------------------------------------------------------------------------- - // WebSocket ↔ HTTP schema alignment tests + // WebSocket / HTTP schema alignment tests // --------------------------------------------------------------------------- #[test] fn ws_log_msg_serializes_same_fields_as_api_log_entry() { let entry = make_log_entry("hello", "INFO", "admin", "2026-01-01T00:00:00Z"); - let msg = WsServerMsg::Log { - entry: entry.clone(), - }; + let expected_seq = entry.seq; + let expected_timestamp = entry.timestamp.clone(); + let expected_level = entry.level.clone(); + let expected_target = entry.target.clone(); + let expected_event_name = entry.event_name.clone(); + let expected_rendered = entry.rendered.clone(); + let msg = WsServerMsg::Log { entry }; let json: serde_json::Value = serde_json::to_value(&msg).unwrap(); let obj = json.as_object().unwrap(); - // The flattened entry must carry the same fields as api::LogEntry - // plus the discriminator tag. assert_eq!(obj.get("type").unwrap(), "log"); - assert_eq!(obj.get("seq").unwrap(), entry.seq); - assert_eq!(obj.get("timestamp").unwrap(), &entry.timestamp); - assert_eq!(obj.get("level").unwrap(), &entry.level); - assert_eq!(obj.get("target").unwrap(), &entry.target); - assert_eq!(obj.get("event_name").unwrap(), &entry.event_name); - assert_eq!(obj.get("rendered").unwrap(), &entry.rendered); + assert_eq!(obj.get("seq").unwrap(), expected_seq); + assert_eq!(obj.get("timestamp").unwrap(), &expected_timestamp); + assert_eq!(obj.get("level").unwrap(), &expected_level); + assert_eq!(obj.get("target").unwrap(), &expected_target); + assert_eq!(obj.get("event_name").unwrap(), &expected_event_name); + assert_eq!(obj.get("rendered").unwrap(), &expected_rendered); assert!(obj.contains_key("contexts")); + + let roundtrip: api::LogEntry = + serde_json::from_value(json).expect("log message should match api::LogEntry shape"); + assert_eq!(roundtrip.seq, 1); + assert_eq!(roundtrip.rendered, "hello"); } #[test] @@ -2417,7 +2532,6 @@ mod tests { let logs = json.get("logs").unwrap().as_array().unwrap(); assert_eq!(logs.len(), 1); - // Each log in the snapshot must deserialize as a valid api::LogEntry. let roundtrip: api::LogEntry = serde_json::from_value(logs[0].clone()) .expect("snapshot log should match api::LogEntry"); assert_eq!(roundtrip.seq, 1); diff --git a/rust/otap-dataflow/crates/channel/src/mpmc.rs b/rust/otap-dataflow/crates/channel/src/mpmc.rs index c6456b8079..cd9a841d1d 100644 --- a/rust/otap-dataflow/crates/channel/src/mpmc.rs +++ b/rust/otap-dataflow/crates/channel/src/mpmc.rs @@ -350,6 +350,14 @@ impl Receiver { let state = self.channel.state.borrow(); state.buffer.is_empty() } + + /// Checks whether the channel has been closed and will accept no further + /// sends. 
+ #[must_use] + pub fn is_closed(&self) -> bool { + let state = self.channel.state.borrow(); + state.is_closed + } } struct SendFuture { diff --git a/rust/otap-dataflow/crates/channel/src/mpsc.rs b/rust/otap-dataflow/crates/channel/src/mpsc.rs index e5f3b558e4..86c7faea55 100644 --- a/rust/otap-dataflow/crates/channel/src/mpsc.rs +++ b/rust/otap-dataflow/crates/channel/src/mpsc.rs @@ -342,6 +342,14 @@ impl Receiver { let state = self.channel.state.borrow(); state.buffer.is_empty() } + + /// Checks whether the channel has been closed and will accept no further + /// sends. + #[must_use] + pub fn is_closed(&self) -> bool { + let state = self.channel.state.borrow(); + state.is_closed + } } struct SendFuture { diff --git a/rust/otap-dataflow/crates/config/src/engine/resolve.rs b/rust/otap-dataflow/crates/config/src/engine/resolve.rs index 8865c99896..70200e33ef 100644 --- a/rust/otap-dataflow/crates/config/src/engine/resolve.rs +++ b/rust/otap-dataflow/crates/config/src/engine/resolve.rs @@ -76,6 +76,59 @@ pub struct ResolvedPipelineConfig { pub role: ResolvedPipelineRole, } +impl ResolvedPipelineConfig { + /// Compares two resolved pipelines for exact runtime equivalence. + /// + /// Logical identity is intentionally ignored here; callers compare two + /// candidate snapshots for the same logical pipeline and only care whether + /// runtime-relevant config and resolved policies match. + #[must_use] + pub fn runtime_matches(&self, other: &Self) -> bool { + let Self { + pipeline_group_id: _, + pipeline_id: _, + pipeline: self_pipeline, + policies: self_policies, + role: self_role, + } = self; + let Self { + pipeline_group_id: _, + pipeline_id: _, + pipeline: other_pipeline, + policies: other_policies, + role: other_role, + } = other; + + self_role == other_role + && self_pipeline == other_pipeline + && self_policies == other_policies + } + + /// Compares two resolved pipelines while ignoring resource-only policy + /// differences used by resize classification. + #[must_use] + pub fn runtime_shape_matches_ignoring_resources(&self, other: &Self) -> bool { + let Self { + pipeline_group_id: _, + pipeline_id: _, + pipeline: self_pipeline, + policies: self_policies, + role: self_role, + } = self; + let Self { + pipeline_group_id: _, + pipeline_id: _, + pipeline: other_pipeline, + policies: other_policies, + role: other_role, + } = other; + + self_role == other_role + && self_pipeline.eq_ignoring_policies(other_pipeline) + && self_policies.eq_ignoring_resources(other_policies) + } +} + impl OtelDataflowSpec { /// Resolves and materializes policies once for all pipelines. 
/// @@ -173,3 +226,127 @@ impl OtelDataflowSpec { self.topics.get(topic_name).cloned() } } + +#[cfg(test)] +mod tests { + use super::{ResolvedPipelineConfig, ResolvedPipelineRole}; + use crate::pipeline::PipelineConfig; + use crate::policy::{CoreAllocation, ResolvedPolicies, ResourcesPolicy, TelemetryPolicy}; + + #[test] + fn runtime_shape_matches_ignoring_resources_ignores_resource_only_changes() { + let current = ResolvedPipelineConfig { + pipeline_group_id: "g1".into(), + pipeline_id: "p1".into(), + pipeline: PipelineConfig::from_yaml( + "g1".into(), + "p1".into(), + r#" +policies: + resources: + core_allocation: + type: core_count + count: 1 +nodes: + receiver: + type: "urn:test:receiver:example" + config: null + exporter: + type: "urn:test:exporter:example" + config: null +connections: + - from: receiver + to: exporter +"#, + ) + .expect("current pipeline should parse"), + policies: ResolvedPolicies { + resources: ResourcesPolicy { + core_allocation: CoreAllocation::core_count(1), + memory_limiter: None, + }, + ..ResolvedPolicies::default() + }, + role: ResolvedPipelineRole::Regular, + }; + let candidate = ResolvedPipelineConfig { + pipeline_group_id: "g1".into(), + pipeline_id: "p1".into(), + pipeline: PipelineConfig::from_yaml( + "g1".into(), + "p1".into(), + r#" +policies: + resources: + core_allocation: + type: core_count + count: 2 +nodes: + receiver: + type: "urn:test:receiver:example" + config: null + exporter: + type: "urn:test:exporter:example" + config: null +connections: + - from: receiver + to: exporter +"#, + ) + .expect("candidate pipeline should parse"), + policies: ResolvedPolicies { + resources: ResourcesPolicy { + core_allocation: CoreAllocation::core_count(2), + memory_limiter: None, + }, + ..ResolvedPolicies::default() + }, + role: ResolvedPipelineRole::Regular, + }; + + assert!(!current.runtime_matches(&candidate)); + assert!(current.runtime_shape_matches_ignoring_resources(&candidate)); + } + + #[test] + fn runtime_shape_matches_ignoring_resources_detects_runtime_policy_change() { + let current = ResolvedPipelineConfig { + pipeline_group_id: "g1".into(), + pipeline_id: "p1".into(), + pipeline: PipelineConfig::from_yaml( + "g1".into(), + "p1".into(), + r#" +nodes: + receiver: + type: "urn:test:receiver:example" + config: null + exporter: + type: "urn:test:exporter:example" + config: null +connections: + - from: receiver + to: exporter +"#, + ) + .expect("current pipeline should parse"), + policies: ResolvedPolicies::default(), + role: ResolvedPipelineRole::Regular, + }; + let candidate = ResolvedPipelineConfig { + pipeline_group_id: "g1".into(), + pipeline_id: "p1".into(), + pipeline: current.pipeline.clone(), + policies: ResolvedPolicies { + telemetry: TelemetryPolicy { + pipeline_metrics: false, + ..TelemetryPolicy::default() + }, + ..ResolvedPolicies::default() + }, + role: ResolvedPipelineRole::Regular, + }; + + assert!(!current.runtime_shape_matches_ignoring_resources(&candidate)); + } +} diff --git a/rust/otap-dataflow/crates/config/src/lib.rs b/rust/otap-dataflow/crates/config/src/lib.rs index 107fdee33c..55711bc009 100644 --- a/rust/otap-dataflow/crates/config/src/lib.rs +++ b/rust/otap-dataflow/crates/config/src/lib.rs @@ -153,7 +153,7 @@ impl Serialize for PipelineKey { } /// Unique key for identifying a pipeline running on a specific core. -#[derive(Debug, Clone, Serialize)] +#[derive(Debug, Clone, Serialize, PartialEq, Eq, Hash)] pub struct DeployedPipelineKey { /// The unique ID of the pipeline group the pipeline belongs to. 
pub pipeline_group_id: PipelineGroupId, @@ -163,4 +163,11 @@ pub struct DeployedPipelineKey { /// The CPU core ID the pipeline is pinned to. pub core_id: CoreId, + + /// Monotonic deployment generation for this logical pipeline. + /// + /// Generation `0` is the initial startup deployment. Higher generations are + /// created by live reconfiguration rollouts. + #[serde(default)] + pub deployment_generation: u64, } diff --git a/rust/otap-dataflow/crates/config/src/pipeline.rs b/rust/otap-dataflow/crates/config/src/pipeline.rs index 065bfe2971..2e56d2977b 100644 --- a/rust/otap-dataflow/crates/config/src/pipeline.rs +++ b/rust/otap-dataflow/crates/config/src/pipeline.rs @@ -510,6 +510,34 @@ impl FromIterator<(NodeId, Arc)> for PipelineNodes { } impl PipelineConfig { + /// Compares two pipeline configs while intentionally ignoring the optional + /// pipeline-level policies block. + /// + /// This keeps the "what is pipeline shape vs. what is policy" decision next + /// to the struct definition so new fields require an explicit choice. + #[must_use] + pub fn eq_ignoring_policies(&self, other: &Self) -> bool { + let Self { + r#type: self_type, + policies: _, + nodes: self_nodes, + extensions: self_extensions, + connections: self_connections, + } = self; + let Self { + r#type: other_type, + policies: _, + nodes: other_nodes, + extensions: other_extensions, + connections: other_connections, + } = other; + + self_type == other_type + && self_nodes == other_nodes + && self_extensions == other_extensions + && self_connections == other_connections + } + /// Create a new [`PipelineConfig`] from a JSON string. pub fn from_json( pipeline_group_id: PipelineGroupId, @@ -1379,6 +1407,99 @@ mod tests { use crate::pipeline::{PipelineConfigBuilder, PipelineType}; use serde_json::json; + #[test] + fn eq_ignoring_policies_ignores_policy_only_changes() { + let current = super::PipelineConfig::from_yaml( + "g1".into(), + "p1".into(), + r#" +policies: + resources: + core_allocation: + type: core_count + count: 1 +nodes: + receiver: + type: "urn:test:receiver:example" + config: null + exporter: + type: "urn:test:exporter:example" + config: null +connections: + - from: receiver + to: exporter +"#, + ) + .expect("current pipeline should parse"); + let candidate = super::PipelineConfig::from_yaml( + "g1".into(), + "p1".into(), + r#" +policies: + resources: + core_allocation: + type: core_count + count: 2 + telemetry: + pipeline_metrics: false +nodes: + receiver: + type: "urn:test:receiver:example" + config: null + exporter: + type: "urn:test:exporter:example" + config: null +connections: + - from: receiver + to: exporter +"#, + ) + .expect("candidate pipeline should parse"); + + assert_ne!(current, candidate); + assert!(current.eq_ignoring_policies(&candidate)); + } + + #[test] + fn eq_ignoring_policies_detects_topology_change() { + let current = super::PipelineConfig::from_yaml( + "g1".into(), + "p1".into(), + r#" +nodes: + receiver: + type: "urn:test:receiver:example" + config: null + exporter: + type: "urn:test:exporter:example" + config: null +connections: + - from: receiver + to: exporter +"#, + ) + .expect("current pipeline should parse"); + let candidate = super::PipelineConfig::from_yaml( + "g1".into(), + "p1".into(), + r#" +nodes: + input: + type: "urn:test:receiver:example" + config: null + exporter: + type: "urn:test:exporter:example" + config: null +connections: + - from: input + to: exporter +"#, + ) + .expect("candidate pipeline should parse"); + + assert!(!current.eq_ignoring_policies(&candidate)); + } 
+ #[test] fn test_duplicate_node_errors() { let result = PipelineConfigBuilder::new() diff --git a/rust/otap-dataflow/crates/config/src/policy.rs b/rust/otap-dataflow/crates/config/src/policy.rs index d5309cb598..c3092d7052 100644 --- a/rust/otap-dataflow/crates/config/src/policy.rs +++ b/rust/otap-dataflow/crates/config/src/policy.rs @@ -206,6 +206,33 @@ pub struct ResolvedPolicies { /// (opt-in only -- no headers are captured or propagated by default). pub transport_headers: Option, } + +impl ResolvedPolicies { + /// Compares resolved policies while intentionally ignoring the resources + /// policy, which controls placement and scaling rather than runtime shape. + #[must_use] + pub fn eq_ignoring_resources(&self, other: &Self) -> bool { + let Self { + channel_capacity: self_channel_capacity, + health: self_health, + telemetry: self_telemetry, + resources: _, + transport_headers: self_transport_headers, + } = self; + let Self { + channel_capacity: other_channel_capacity, + health: other_health, + telemetry: other_telemetry, + resources: _, + transport_headers: other_transport_headers, + } = other; + + self_channel_capacity == other_channel_capacity + && self_health == other_health + && self_telemetry == other_telemetry + && self_transport_headers == other_transport_headers + } +} /// instrumentation overhead. #[derive( Clone, Copy, Debug, Default, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize, JsonSchema, @@ -595,6 +622,41 @@ mod tests { use super::{MemoryLimiterMode, MemoryLimiterPolicy, MemoryLimiterSource, Policies}; use std::time::Duration; + #[test] + fn resolved_policies_eq_ignoring_resources_ignores_resource_only_changes() { + let current = super::ResolvedPolicies { + resources: super::ResourcesPolicy { + core_allocation: super::CoreAllocation::core_count(1), + memory_limiter: None, + }, + ..super::ResolvedPolicies::default() + }; + let candidate = super::ResolvedPolicies { + resources: super::ResourcesPolicy { + core_allocation: super::CoreAllocation::core_count(2), + memory_limiter: None, + }, + ..super::ResolvedPolicies::default() + }; + + assert_ne!(current, candidate); + assert!(current.eq_ignoring_resources(&candidate)); + } + + #[test] + fn resolved_policies_eq_ignoring_resources_detects_runtime_policy_change() { + let current = super::ResolvedPolicies::default(); + let candidate = super::ResolvedPolicies { + telemetry: super::TelemetryPolicy { + pipeline_metrics: false, + ..super::TelemetryPolicy::default() + }, + ..super::ResolvedPolicies::default() + }; + + assert!(!current.eq_ignoring_resources(&candidate)); + } + #[test] fn defaults_match_expected_values() { let defaults = Policies::resolve([&Policies::default()]); diff --git a/rust/otap-dataflow/crates/controller/Cargo.toml b/rust/otap-dataflow/crates/controller/Cargo.toml index 2b0b198ddd..66cc251c10 100644 --- a/rust/otap-dataflow/crates/controller/Cargo.toml +++ b/rust/otap-dataflow/crates/controller/Cargo.toml @@ -30,3 +30,5 @@ tokio = { workspace = true } tokio-util = { workspace = true } flume = { workspace = true } smallvec = { workspace = true } +chrono = { workspace = true } +serde_json = { workspace = true } diff --git a/rust/otap-dataflow/crates/controller/src/lib.rs b/rust/otap-dataflow/crates/controller/src/lib.rs index 547ac54b37..ee635562f5 100644 --- a/rust/otap-dataflow/crates/controller/src/lib.rs +++ b/rust/otap-dataflow/crates/controller/src/lib.rs @@ -68,13 +68,14 @@ use otap_df_engine::ReceivedAtNode; use otap_df_engine::Unwindable; use otap_df_engine::context::{ControllerContext, 
PipelineContext};
 use otap_df_engine::control::{
-    PipelineCompletionMsgReceiver, PipelineCompletionMsgSender, RuntimeCtrlMsgReceiver,
-    RuntimeCtrlMsgSender, pipeline_completion_msg_channel, runtime_ctrl_msg_channel,
+    PipelineAdminSender, PipelineCompletionMsgReceiver, PipelineCompletionMsgSender,
+    RuntimeCtrlMsgReceiver, RuntimeCtrlMsgSender, pipeline_completion_msg_channel,
+    runtime_ctrl_msg_channel,
 };
 use otap_df_engine::entity_context::{
     node_entity_key, pipeline_entity_key, set_pipeline_entity_key,
 };
-use otap_df_engine::error::{Error as EngineError, error_summary_from};
+use otap_df_engine::error::Error as EngineError;
 use otap_df_engine::memory_limiter::{
     EffectiveMemoryLimiter, MemoryLimiterTick, MemoryPressureBehaviorConfig, MemoryPressureChanged,
     MemoryPressureLevel,
@@ -93,17 +94,25 @@ use otap_df_telemetry::{
 };
 use smallvec::smallvec;
 use std::collections::{HashMap, HashSet};
+use std::panic::{AssertUnwindSafe, catch_unwind};
 use std::sync::Arc;
 use std::sync::mpsc as std_mpsc;
 use std::thread;
+use std::time::Duration;
 
 /// Error types and helpers for the controller module.
 pub mod error;
+mod live_control;
 /// Reusable startup helpers (validation, CLI overrides, system info).
 pub mod startup;
 /// Utilities to spawn async tasks on dedicated threads with graceful shutdown.
 pub mod thread_task;
 
+use live_control::{
+    ControllerRuntime, LaunchedPipelineThread, PanicReport, RuntimeInstanceError,
+    RuntimeInstanceExit,
+};
+
 /// Controller for managing pipelines in a thread-per-core model.
 ///
 /// # Thread Safety
@@ -272,6 +281,103 @@ impl<PData> Controller<PData> {
+    fn validate_pipeline_components_with_factory(
+        pipeline_factory: &'static PipelineFactory<PData>,
+        pipeline_group_id: &PipelineGroupId,
+        pipeline_id: &PipelineId,
+        pipeline_cfg: &PipelineConfig,
+    ) -> Result<(), String> {
+        for (node_id, node_cfg) in pipeline_cfg.node_iter() {
+            let urn_str = node_cfg.r#type.as_str();
+            let validate_config_fn = match node_cfg.kind() {
+                NodeKind::Receiver => pipeline_factory
+                    .get_receiver_factory_map()
+                    .get(urn_str)
+                    .map(|factory| factory.validate_config),
+                NodeKind::Processor | NodeKind::ProcessorChain => pipeline_factory
+                    .get_processor_factory_map()
+                    .get(urn_str)
+                    .map(|factory| factory.validate_config),
+                NodeKind::Exporter => pipeline_factory
+                    .get_exporter_factory_map()
+                    .get(urn_str)
+                    .map(|factory| factory.validate_config),
+                NodeKind::Extension => {
+                    // Extensions are not yet validated here because PipelineFactory
+                    // does not expose an extension factory registry.
+                    continue;
+                }
+            };
+
+            let Some(validate_fn) = validate_config_fn else {
+                let kind_name = match node_cfg.kind() {
+                    NodeKind::Receiver => "receiver",
+                    NodeKind::Processor | NodeKind::ProcessorChain => "processor",
+                    NodeKind::Exporter => "exporter",
+                    NodeKind::Extension => unreachable!("handled above"),
+                };
+                return Err(format!(
+                    "Unknown {} component `{}` in pipeline_group={} pipeline={} node={}",
+                    kind_name,
+                    urn_str,
+                    pipeline_group_id.as_ref(),
+                    pipeline_id.as_ref(),
+                    node_id.as_ref()
+                ));
+            };
+
+            validate_fn(&node_cfg.config).map_err(|err| {
+                format!(
+                    "Invalid config for component `{}` in pipeline_group={} pipeline={} node={}: {}",
+                    urn_str,
+                    pipeline_group_id.as_ref(),
+                    pipeline_id.as_ref(),
+                    node_id.as_ref(),
+                    err
+                )
+            })?;
+        }
+        Ok(())
+    }
+
+    /// Validates every configured pipeline and observability pipeline against registered components.
+ fn validate_engine_components_with_factory( + pipeline_factory: &'static PipelineFactory, + engine_cfg: &OtelDataflowSpec, + ) -> Result<(), String> { + for (pipeline_group_id, pipeline_group) in &engine_cfg.groups { + for (pipeline_id, pipeline_cfg) in &pipeline_group.pipelines { + Self::validate_pipeline_components_with_factory( + pipeline_factory, + pipeline_group_id, + pipeline_id, + pipeline_cfg, + )?; + } + } + + if let Some(obs_pipeline) = &engine_cfg.engine.observability.pipeline { + let obs_group_id: PipelineGroupId = SYSTEM_PIPELINE_GROUP_ID.into(); + let obs_pipeline_id: PipelineId = SYSTEM_OBSERVABILITY_PIPELINE_ID.into(); + let obs_pipeline_config = obs_pipeline.clone().into_pipeline_config(); + Self::validate_pipeline_components_with_factory( + pipeline_factory, + &obs_group_id, + &obs_pipeline_id, + &obs_pipeline_config, + )?; + } + + Ok(()) + } + + /// Validates that every configured node resolves to a registered component and that the + /// static component-specific configuration validates. + pub fn validate_engine_components(&self, engine_cfg: &OtelDataflowSpec) -> Result<(), String> { + Self::validate_engine_components_with_factory(self.pipeline_factory, engine_cfg) + } + /// Starts the controller with the given engine configurations. pub fn run_forever(&self, engine_config: OtelDataflowSpec) -> Result<(), Error> { self.run_with_mode( @@ -448,10 +554,13 @@ impl, group_names: &HashMap<(PipelineGroupId, TopicName), TopicName>, - ) -> ( - HashMap, - Vec, - ) { + ) -> Result< + ( + HashMap, + Vec, + ), + Error, + > { let mut usage_by_declared_topic = HashMap::::new(); for declared_name in global_names.values().chain(group_names.values()) { _ = usage_by_declared_topic.insert(declared_name.clone(), TopicUsageSummary::default()); @@ -518,16 +627,13 @@ impl = usage_by_declared_topic.keys().cloned().collect(); - declared_topics.sort_by(|left, right| left.as_ref().cmp(right.as_ref())); + let mut declared_topics: Vec<_> = usage_by_declared_topic.into_iter().collect(); + declared_topics.sort_by(|(left, _), (right, _)| left.as_ref().cmp(right.as_ref())); - for declared_topic in declared_topics { - let summary = usage_by_declared_topic - .get(&declared_topic) - .expect("declared topic must have a usage summary"); - let topology_mode = Self::infer_topic_mode(summary); + let mut inferred_modes = HashMap::with_capacity(declared_topics.len()); + let mut inferred_mode_reports = Vec::with_capacity(declared_topics.len()); + for (declared_topic, summary) in declared_topics { + let topology_mode = Self::infer_topic_mode(&summary); inferred_mode_reports.push(InferredTopicModeReport { topic: declared_topic.clone(), topology_mode, @@ -542,7 +648,7 @@ impl>(); group_ids.sort_by(|left, right| left.as_ref().cmp(right.as_ref())); for group_id in group_ids { - let group_cfg = config - .groups - .get(&group_id) - .expect("group collected from config must still exist"); - let mut pipeline_ids = group_cfg.pipelines.keys().cloned().collect::>(); - pipeline_ids.sort_by(|left, right| left.as_ref().cmp(right.as_ref())); - for pipeline_id in pipeline_ids { - let pipeline_cfg = group_cfg - .pipelines - .get(&pipeline_id) - .expect("pipeline collected from config must still exist"); + let Some(group_cfg) = config.groups.get(&group_id) else { + return Err(Error::PipelineRuntimeError { + source: Box::new(EngineError::InternalError { + message: format!( + "group `{}` disappeared while validating topic wiring", + group_id.as_ref() + ), + }), + }); + }; + let mut pipelines = 
group_cfg.pipelines.iter().collect::>(); + pipelines.sort_by(|(left, _), (right, _)| left.as_ref().cmp(right.as_ref())); + for (pipeline_id, pipeline_cfg) in pipelines { Self::collect_topic_wiring_edges_for_pipeline( &mut adjacency, &group_id, - &pipeline_id, + pipeline_id, pipeline_cfg, global_names, group_names, @@ -887,13 +995,20 @@ impl> = - ctrl_msg_senders - .into_iter() - .map(|sender| { - Arc::new(sender) - as Arc - }) - .collect(); - otap_df_admin::run( admin_settings, obs_state_handle, - admin_senders, + control_plane, telemetry_registry, memory_pressure_state, log_tap_handle, @@ -1461,48 +1529,8 @@ impl> = Vec::with_capacity(threads.len()); - for (thread_name, thread_id, pipeline_key, handle) in threads { - match handle.join() { - Ok(Ok(_)) => { - engine_evt_reporter.report(EngineEvent::drained(pipeline_key, None)); - } - Ok(Err(e)) => { - let err_summary: ErrorSummary = error_summary_from_gen(&e); - engine_evt_reporter.report(EngineEvent::pipeline_runtime_error( - pipeline_key.clone(), - "Pipeline encountered a runtime error.", - err_summary, - )); - results.push(Err(e)); - } - Err(e) => { - let err_summary = ErrorSummary::Pipeline { - error_kind: "panic".into(), - message: "The pipeline panicked during execution.".into(), - source: Some(format!("{e:?}")), - }; - engine_evt_reporter.report(EngineEvent::pipeline_runtime_error( - pipeline_key.clone(), - "The pipeline panicked during execution.", - err_summary, - )); - // Thread join failed, handle the error - let core_id = pipeline_key.core_id; - return Err(Error::ThreadPanic { - thread_name, - thread_id, - core_id, - panic_message: format!("{e:?}"), - }); - } - } - } - - // Check if any pipeline threads returned an error - if let Some(err) = results.into_iter().find_map(Result::err) { - return Err(err); + if run_mode == RunMode::ShutdownWhenDone { + runtime.wait_until_all_instances_exit(); } // In standard engine mode we keep the main thread parked after startup. 
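The internals behind `runtime.wait_until_all_instances_exit()` and `note_instance_exit()` are not part of this hunk; one plausible shape, using the `Hash`/`Eq` derives this change adds to `DeployedPipelineKey`, is a mutex-guarded exit map plus a condvar. Everything below except the `DeployedPipelineKey` name is a hypothetical sketch, not the module's actual code:

```rust
use std::collections::HashMap;
use std::sync::{Condvar, Mutex};

// Hypothetical stand-ins for types this diff only references.
#[derive(Clone, PartialEq, Eq, Hash)]
struct DeployedPipelineKey { /* group, pipeline, core, generation */ }

enum RuntimeInstanceExit {
    Success,
    Error(String),
}

struct ExitTracker {
    // Keyed by the deployed instance; `None` means the thread is still running.
    instances: Mutex<HashMap<DeployedPipelineKey, Option<RuntimeInstanceExit>>>,
    all_exited: Condvar,
}

impl ExitTracker {
    /// Called from a pipeline thread's unwind guard when it exits.
    fn note_instance_exit(&self, key: DeployedPipelineKey, exit: RuntimeInstanceExit) {
        let mut instances = self.instances.lock().expect("tracker poisoned");
        let _ = instances.insert(key, Some(exit));
        self.all_exited.notify_all();
    }

    /// Blocks until every registered instance has reported an exit.
    fn wait_until_all_instances_exit(&self) {
        let mut instances = self.instances.lock().expect("tracker poisoned");
        while instances.values().any(|exit| exit.is_none()) {
            instances = self.all_exited.wait(instances).expect("tracker poisoned");
        }
    }
}
```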
@@ -1523,6 +1551,10 @@ impl Ok(available_core_ids), - CoreAllocationStrategy::CoreCount => { - match core_allocation.count { - Some(count) => { - if count == 0 { - Ok(available_core_ids) - } else if count > num_cores { - Err(Error::InvalidCoreAllocation { + CoreAllocationStrategy::CoreCount => match core_allocation.count { + Some(count) => { + if count == 0 { + Ok(available_core_ids) + } else if count > num_cores { + Err(Error::InvalidCoreAllocation { + alloc: core_allocation.clone(), + message: format!( + "Requested {} cores but only {} cores available on this system", + count, num_cores + ), + available: available_core_ids.iter().map(|c| c.id).collect(), + }) + } else { + Ok(available_core_ids.into_iter().take(count).collect()) + } + } + None => Ok(available_core_ids), + }, + CoreAllocationStrategy::CoreSet => match &core_allocation.set { + Some(set) => { + for r in set.iter() { + if r.start > r.end { + return Err(Error::InvalidCoreAllocation { alloc: core_allocation.clone(), message: format!( - "Requested {} cores but only {} cores available on this system", - count, num_cores + "Invalid core range: start ({}) is greater than end ({})", + r.start, r.end ), available: available_core_ids.iter().map(|c| c.id).collect(), - }) - } else { - Ok(available_core_ids.into_iter().take(count).collect()) + }); + } + if r.start > max_core_id { + return Err(Error::InvalidCoreAllocation { + alloc: core_allocation.clone(), + message: format!( + "Core ID {} exceeds available cores (system has cores 0-{})", + r.start, max_core_id + ), + available: available_core_ids.iter().map(|c| c.id).collect(), + }); + } + if r.end > max_core_id { + return Err(Error::InvalidCoreAllocation { + alloc: core_allocation.clone(), + message: format!( + "Core ID {} exceeds available cores (system has cores 0-{})", + r.end, max_core_id + ), + available: available_core_ids.iter().map(|c| c.id).collect(), + }); } } - None => { - // Treat no count supplied the same as count: 0 - Ok(available_core_ids) - } - } - } - CoreAllocationStrategy::CoreSet => { - match &core_allocation.set { - Some(set) => { - // Validate all ranges first - for r in set.iter() { - if r.start > r.end { - return Err(Error::InvalidCoreAllocation { - alloc: core_allocation.clone(), - message: format!( - "Invalid core range: start ({}) is greater than end ({})", - r.start, r.end - ), - available: available_core_ids.iter().map(|c| c.id).collect(), - }); - } - if r.start > max_core_id { - return Err(Error::InvalidCoreAllocation { - alloc: core_allocation.clone(), - message: format!( - "Core ID {} exceeds available cores (system has cores 0-{})", - r.start, max_core_id - ), - available: available_core_ids.iter().map(|c| c.id).collect(), - }); - } - if r.end > max_core_id { + + for (i, r1) in set.iter().enumerate() { + for r2 in set.iter().skip(i + 1) { + if r1.start <= r2.end && r2.start <= r1.end { + let overlap_start = r1.start.max(r2.start); + let overlap_end = r1.end.min(r2.end); return Err(Error::InvalidCoreAllocation { alloc: core_allocation.clone(), message: format!( - "Core ID {} exceeds available cores (system has cores 0-{})", - r.end, max_core_id + "Core ranges overlap: {}-{} and {}-{} share cores {}-{}", + r1.start, + r1.end, + r2.start, + r2.end, + overlap_start, + overlap_end ), available: available_core_ids.iter().map(|c| c.id).collect(), }); } } + } - // Check for overlapping ranges - for (i, r1) in set.iter().enumerate() { - for r2 in set.iter().skip(i + 1) { - // Two ranges overlap if they share any common cores - if r1.start <= r2.end && r2.start 
<= r1.end {
-                            let overlap_start = r1.start.max(r2.start);
-                            let overlap_end = r1.end.min(r2.end);
-                            return Err(Error::InvalidCoreAllocation {
-                                alloc: core_allocation.clone(),
-                                message: format!(
-                                    "Core ranges overlap: {}-{} and {}-{} share cores {}-{}",
-                                    r1.start,
-                                    r1.end,
-                                    r2.start,
-                                    r2.end,
-                                    overlap_start,
-                                    overlap_end
-                                ),
-                                available: available_core_ids
-                                    .iter()
-                                    .map(|c| c.id)
-                                    .collect(),
-                            });
-                        }
-                    }
-                }
-
-                // Filter cores in range
-                let selected: Vec<_> = available_core_ids
-                    .into_iter()
-                    // Naively check if each interval contains the point
-                    // This problem is known as the "Interval Stabbing Problem"
-                    // and has more efficient but more complex solutions
-                    .filter(|c| set.iter().any(|r| r.start <= c.id && c.id <= r.end))
-                    .collect();
-
-                if selected.is_empty() {
-                    return Err(Error::InvalidCoreAllocation {
-                        alloc: core_allocation.clone(),
-                        message: "No available cores in the specified ranges".to_owned(),
-                        available: core_affinity::get_core_ids()
-                            .unwrap_or_default()
-                            .iter()
-                            .map(|c| c.id)
-                            .collect(),
-                    });
-                }
+                    let selected: Vec<_> = available_core_ids
+                        .into_iter()
+                        .filter(|c| set.iter().any(|r| r.start <= c.id && c.id <= r.end))
+                        .collect();
 
-                Ok(selected)
+                    if selected.is_empty() {
+                        return Err(Error::InvalidCoreAllocation {
+                            alloc: core_allocation.clone(),
+                            message: "No available cores in the specified ranges".to_owned(),
+                            available: core_affinity::get_core_ids()
+                                .unwrap_or_default()
+                                .iter()
+                                .map(|c| c.id)
+                                .collect(),
+                        });
+                    }
-            None => Err(Error::InvalidCoreAllocation {
-                alloc: core_allocation.clone(),
-                message: "No range of cores supplied for allocation".to_owned(),
-                available: core_affinity::get_core_ids()
-                    .unwrap_or_default()
-                    .iter()
-                    .map(|c| c.id)
-                    .collect(),
-            }),
+
+                    Ok(selected)
                 }
-        }
+                None => Ok(Vec::new()),
+            },
         }
     }
 
@@ -1755,29 +1762,150 @@ impl<PData> Controller<PData> {
+    fn launch_pipeline_thread(
+        pipeline_factory: &'static PipelineFactory<PData>,
+        pipeline_key: DeployedPipelineKey,
+        core_id: CoreId,
+        num_cores: usize,
+        pipeline_config: PipelineConfig,
+        channel_capacity_policy: ChannelCapacityPolicy,
+        telemetry_policy: TelemetryPolicy,
+        transport_headers_policy: Option,
+        controller_ctx: ControllerContext,
+        metrics_reporter: MetricsReporter,
+        engine_evt_reporter: ObservedEventReporter,
+        tracing_setup: TracingSetup,
+        telemetry_reporting_interval: Duration,
+        memory_pressure_tx: tokio::sync::watch::Sender,
+        config: &OtelDataflowSpec,
+        declared_topics: &DeclaredTopics,
+        runtime: std::sync::Weak<ControllerRuntime<PData>>,
+        thread_id: usize,
+        internal_telemetry: Option<(
+            InternalTelemetrySettings,
+            std_mpsc::SyncSender<Result<(), Error>>,
+        )>,
+    ) -> Result<LaunchedPipelineThread<PData>, Error> {
+        let mut pipeline_ctx = controller_ctx.pipeline_context_with_generation(
+            pipeline_key.pipeline_group_id.clone(),
+            pipeline_key.pipeline_id.clone(),
+            pipeline_key.core_id,
+            num_cores,
+            thread_id,
+            pipeline_key.deployment_generation,
+        );
+        let topic_set = Self::build_pipeline_topic_set(
+            config,
+            declared_topics,
+            &pipeline_key.pipeline_group_id,
+            &pipeline_key.pipeline_id,
+            pipeline_key.core_id,
+        )?;
+        pipeline_ctx.set_topic_set(topic_set);
+        let (runtime_ctrl_msg_tx, runtime_ctrl_msg_rx) =
+            runtime_ctrl_msg_channel(channel_capacity_policy.control.pipeline);
+        let (pipeline_completion_msg_tx, pipeline_completion_msg_rx) =
+            pipeline_completion_msg_channel(channel_capacity_policy.control.completion);
+        let control_sender: Arc<dyn PipelineAdminSender> = Arc::new(runtime_ctrl_msg_tx.clone());
+        let memory_pressure_rx = memory_pressure_tx.subscribe();
+        let thread_name = format!(
+            "pipeline-{}-{}-core-{}-gen-{}",
+            pipeline_key.pipeline_group_id.as_ref(),
+            pipeline_key.pipeline_id.as_ref(),
pipeline_key.core_id, + pipeline_key.deployment_generation + ); + let run_key = pipeline_key.clone(); + let runtime_key = pipeline_key.clone(); + let runtime_thread_name = thread_name.clone(); + let _handle = thread::Builder::new() + .name(thread_name.clone()) + .spawn(move || { + let exit = match catch_unwind(AssertUnwindSafe(|| { + Self::run_pipeline_thread( + run_key, + core_id, + pipeline_config, + channel_capacity_policy, + telemetry_policy, + transport_headers_policy, + telemetry_reporting_interval, + pipeline_factory, + pipeline_ctx, + engine_evt_reporter, + metrics_reporter, + runtime_ctrl_msg_tx, + runtime_ctrl_msg_rx, + pipeline_completion_msg_tx, + pipeline_completion_msg_rx, + memory_pressure_rx, + tracing_setup, + internal_telemetry, + ) + })) { + Ok(Ok(_)) => RuntimeInstanceExit::Success, + Ok(Err(err)) => { + RuntimeInstanceExit::Error(RuntimeInstanceError::runtime(err.to_string())) + } + Err(panic) => RuntimeInstanceExit::Error(RuntimeInstanceError::from_panic( + PanicReport::capture( + "runtime thread", + panic, + Some(runtime_thread_name), + Some(thread_id), + Some(runtime_key.core_id), + ), + )), + }; + if let Some(runtime) = runtime.upgrade() { + runtime.note_instance_exit(runtime_key, exit); + } + // The controller runtime may already be gone during teardown. In that case there + // is nothing left to update, so late exit reporting is intentionally best-effort. + }) + .map_err(|e| Error::ThreadSpawnError { + thread_name: thread_name.clone(), + source: e, + })?; + + Ok(LaunchedPipelineThread { + pipeline_key, + control_sender, + _marker: std::marker::PhantomData, + }) + } + /// Spawns the internal telemetry pipeline if engine observability config provides one. /// /// Returns the thread handle if an internal pipeline was spawned /// and waits for it to start, or None. 
#[allow(clippy::too_many_arguments)] fn spawn_internal_pipeline_if_configured( + runtime: std::sync::Weak>, its_key: DeployedPipelineKey, its_core: CoreId, observability_pipeline: Option, config: &OtelDataflowSpec, - declared_topics: &DeclaredTopics, telemetry_system: &InternalTelemetrySystem, pipeline_factory: &'static PipelineFactory, controller_ctx: &ControllerContext, engine_evt_reporter: &ObservedEventReporter, metrics_reporter: &MetricsReporter, - telemetry_reporting_interval: std::time::Duration, + telemetry_reporting_interval: Duration, memory_pressure_tx: &tokio::sync::watch::Sender, tracing_setup: TracingSetup, - ) -> Result, Error>>)>, Error> { + ) -> Result>, Error> { let (internal_config, channel_capacity_policy, telemetry_policy): ( PipelineConfig, ChannelCapacityPolicy, @@ -1809,66 +1937,32 @@ impl its_settings, }; - let mut internal_pipeline_ctx = controller_ctx.pipeline_context_with( - its_key.pipeline_group_id.clone(), - its_key.pipeline_id.clone(), - its_key.core_id, - 1, // Internal telemetry pipeline runs on a single core - 0, // TODO: we do not have a thread_id - ); - let topic_set = Self::build_pipeline_topic_set( - config, - declared_topics, - &its_key.pipeline_group_id, - &its_key.pipeline_id, - its_key.core_id, - )?; - internal_pipeline_ctx.set_topic_set(topic_set); - - // Create control message channel for internal pipeline - let (internal_ctrl_tx, internal_ctrl_rx) = - runtime_ctrl_msg_channel(channel_capacity_policy.control.pipeline); - let (internal_return_tx, internal_return_rx) = - pipeline_completion_msg_channel(channel_capacity_policy.control.completion); - // Create a channel to signal startup success/failure let (startup_tx, startup_rx) = std_mpsc::sync_channel::>(1); - - let thread_name = "internal-pipeline".to_string(); - let internal_evt_reporter = engine_evt_reporter.clone(); - let internal_metrics_reporter = metrics_reporter.clone(); - let internal_channel_capacity_policy = channel_capacity_policy; - let internal_telemetry_policy = telemetry_policy; - let internal_memory_pressure_rx = memory_pressure_tx.subscribe(); - - let handle = thread::Builder::new() - .name(thread_name.clone()) - .spawn(move || { - Self::run_pipeline_thread( - its_key, - its_core, - internal_config, - internal_channel_capacity_policy, - internal_telemetry_policy, - None, // no transport headers for the internal observability pipeline - telemetry_reporting_interval, - pipeline_factory, - internal_pipeline_ctx, - internal_evt_reporter, - internal_metrics_reporter, - internal_ctrl_tx, - internal_ctrl_rx, - internal_return_tx, - internal_return_rx, - internal_memory_pressure_rx, - tracing_setup, - Some((its_settings, startup_tx)), - ) - }) - .map_err(|e| Error::ThreadSpawnError { - thread_name: thread_name.clone(), - source: e, - })?; + let launched = Self::launch_pipeline_thread( + pipeline_factory, + its_key, + its_core, + 1, + internal_config, + channel_capacity_policy, + telemetry_policy, + None, + controller_ctx.clone(), + metrics_reporter.clone(), + engine_evt_reporter.clone(), + tracing_setup, + telemetry_reporting_interval, + memory_pressure_tx.clone(), + config, + runtime + .upgrade() + .expect("controller runtime should exist while spawning internal pipeline") + .declared_topics(), + runtime, + 0, + Some((its_settings, startup_tx)), + )?; // Wait for the internal pipeline to signal successful startup match startup_rx.recv() { @@ -1892,7 +1986,7 @@ impl, - telemetry_reporting_interval: std::time::Duration, + telemetry_reporting_interval: Duration, pipeline_factory: 
&'static PipelineFactory<PData>,
         pipeline_context: PipelineContext,
         obs_evt_reporter: ObservedEventReporter,
@@ -2017,27 +2111,6 @@ impl<PData> Controller<PData> {
-
-fn error_summary_from_gen(error: &Error) -> ErrorSummary {
-    match error {
-        Error::PipelineRuntimeError { source } => {
-            if let Some(engine_error) = source.downcast_ref::<EngineError>() {
-                error_summary_from(engine_error)
-            } else {
-                ErrorSummary::Pipeline {
-                    error_kind: "runtime".into(),
-                    message: source.to_string(),
-                    source: None,
-                }
-            }
-        }
-        _ => ErrorSummary::Pipeline {
-            error_kind: "runtime".into(),
-            message: error.to_string(),
-            source: None,
-        },
-    }
-}
-
 #[cfg(test)]
 mod tests {
     use super::*;
@@ -2305,7 +2378,6 @@ connections:
 
     #[test]
     fn select_with_adjacent_ranges_succeeds() {
-        // Adjacent but non-overlapping ranges should work
         let core_allocation = CoreAllocation::core_set(vec![
             CoreRange { start: 2, end: 3 },
             CoreRange { start: 4, end: 5 },
@@ -3122,7 +3194,7 @@ groups:
         );
         assert_eq!(
             local_block.default_publish_outcome_config().timeout,
-            std::time::Duration::from_secs(46)
+            Duration::from_secs(46)
         );
 
         // group-local declaration must override global policy for same local name
@@ -3147,7 +3219,7 @@ groups:
         );
         assert_eq!(
             overridden.default_publish_outcome_config().timeout,
-            std::time::Duration::from_secs(47)
+            Duration::from_secs(47)
         );
     }
 
diff --git a/rust/otap-dataflow/crates/controller/src/live_control/README.md b/rust/otap-dataflow/crates/controller/src/live_control/README.md
new file mode 100644
index 0000000000..584d669657
--- /dev/null
+++ b/rust/otap-dataflow/crates/controller/src/live_control/README.md
@@ -0,0 +1,109 @@
+# Controller Live Control
+
+`live_control` owns the in-process runtime model used by the admin control
+plane to reconfigure and shut down logical pipelines while the engine is
+running. It is deliberately internal to the controller: public admin API
+shapes live in `otap-df-admin` and `otap-df-admin-types`, while this module
+tracks the mutable controller state required to execute those API requests.
+
+## Goals
+
+- Accept per-pipeline rollout and shutdown requests without restarting the
+  whole engine.
+- Keep controller state consistent across asynchronous pipeline-thread exits,
+  rollout workers, shutdown workers, and observed-state updates.
+- Preserve useful recent operation snapshots while bounding in-memory
+  retention.
+- Keep old runtime instances visible only while active controller work still
+  needs generation-specific status.
+
+## Architecture
+
+The module is split by responsibility:
+
+- `mod.rs` is the facade. It defines `ControllerRuntime`, the control-plane
+  adapter, startup registration, shared pruning helpers, and the
+  `ControlPlane` implementation.
+- `state.rs` defines the in-memory state model: rollout/shutdown records,
+  runtime-instance records, candidate plans, panic/error reports, and retention
+  constants.
+- `planning.rs` validates requests, classifies rollout actions, prepares
+  candidate rollout/shutdown plans, records accepted operations, updates status
+  snapshots, and spawns background workers.
+- `execution.rs` runs rollout and shutdown workers. It handles create, resize,
+  replace, rollback, panic cleanup, and per-core progress updates.
+- `runtime.rs` launches pipeline threads, registers instances, records exits,
+  sends shutdown requests, waits for readiness/exit, and exposes global runtime
+  shutdown/error helpers.
+
+`ControllerRuntime` is the shared owner. It is held behind an `Arc` by the
+admin control-plane adapter and by detached rollout/shutdown workers.
+Pipeline threads receive a `Weak<ControllerRuntime>` so they can report exits
+without extending the controller lifetime.
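+
+For orientation, the control-plane surface this module implements looks
+roughly like the sketch below. It is reconstructed from the admin crate's
+`NoopControlPlane` test stub rather than copied from the real definition, so
+treat the exact trait bounds as approximate:
+
+```rust
+// Sketch only: reconstructed from the admin test stub. `Ok(None)` from the
+// status getters means the id was never known here or has already been
+// evicted from the bounded in-memory operation history.
+pub trait ControlPlane: Send + Sync {
+    fn shutdown_all(&self, timeout_secs: u64) -> Result<(), ControlPlaneError>;
+    fn shutdown_pipeline(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        timeout_secs: u64,
+    ) -> Result<ShutdownStatus, ControlPlaneError>;
+    fn reconfigure_pipeline(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        request: ReconfigureRequest,
+    ) -> Result<RolloutStatus, ControlPlaneError>;
+    fn pipeline_details(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+    ) -> Result<Option<PipelineDetails>, ControlPlaneError>;
+    fn rollout_status(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        rollout_id: &str,
+    ) -> Result<Option<RolloutStatus>, ControlPlaneError>;
+    fn shutdown_status(
+        &self,
+        pipeline_group_id: &str,
+        pipeline_id: &str,
+        shutdown_id: &str,
+    ) -> Result<Option<ShutdownStatus>, ControlPlaneError>;
+}
+```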
Pipeline
+threads receive a `Weak<ControllerRuntime>` so they can report exits without
+extending the controller lifetime.
+
+## Lifecycle Model
+
+Live control separates four related concepts:
+
+- A logical pipeline is identified by `(pipeline_group_id, pipeline_id)` and
+  points at the committed resolved pipeline plus its active generation.
+- A pipeline group is the config hierarchy that contains related pipelines,
+  group-local topics, and group-level policies. Current live-control operations
+  target one logical pipeline inside that group.
+- A deployed runtime instance is identified by `(pipeline_group_id,
+  pipeline_id, core_id, deployment_generation)` and tracks whether that thread
+  is still active or has exited.
+- A controller operation is a rollout or shutdown record with public progress
+  state and per-core details.
+
+Rollouts are classified before execution:
+
+- `create` launches a logical pipeline that did not exist.
+- `noop` commits an identical request without launching a worker.
+- `resize` changes core placement without changing the runtime shape.
+- `replace` launches a new generation, waits for readiness, then drains the
+  previous generation.
+
+Shutdown targets the currently active deployed instances for one logical
+pipeline. Global shutdown bypasses operation history and broadcasts shutdown to
+all active instances.
+
+## Design Decisions
+
+- The controller is the authority for when old generations can be retired.
+  Observed-state compaction is invoked only after active rollout/shutdown work
+  no longer needs generation-specific entries.
+- The current consistency scope is one logical pipeline. Planning validates a
+  candidate against a cloned full config snapshot, but commit patches only that
+  pipeline into the latest live config. This intentionally does not provide
+  whole-config serializability across concurrent operations on different
+  logical pipelines.
+- Terminal rollout and shutdown records are retained in memory with both a
+  per-logical-pipeline cap and a TTL. This keeps recent admin lookups useful
+  without unbounded history growth.
+- Runtime exit reporting is race-tolerant. A pipeline thread can exit before
+  `register_launched_instance()` publishes it as active; such exits are parked
+  in `pending_instance_exits` and reconciled during registration.
+- Worker panic handling is unwind-safe. Panic cleanup records terminal failure,
+  clears active-operation conflict state, and reports concise public failure
+  reasons plus detailed internal diagnostics.
+- Topic broker runtime shape is not mutable through live reconfiguration.
+  Rollout planning rejects requests that would require changing declared topic
+  backend, policy, or selected implementation mode.
+
+## Current Limits
+
+- Rollout and shutdown workers are detached OS threads. They are supervised by
+  panic cleanup, but there is no bounded worker pool or join-handle supervisor.
+- Topic declaration changes are intentionally rejected. Supporting them would
+  require a separate broker migration model.
+- Operation history is in-memory only. It is bounded and useful for recent
+  lookups, but it is not durable across controller restarts.
+- Full group shutdown is orchestrated above this module by issuing
+  per-pipeline/global control-plane calls; this module tracks per-pipeline
+  live-control state.
+- Future group-level reconfiguration can widen the active-operation conflict
+  scope from logical pipeline to pipeline group without changing the existing
+  per-pipeline endpoint shape.
+- Rollbacks are best effort.
If rollback itself fails, the operation records + `rollback_failed` and preserves diagnostics for operators. diff --git a/rust/otap-dataflow/crates/controller/src/live_control/execution.rs b/rust/otap-dataflow/crates/controller/src/live_control/execution.rs new file mode 100644 index 0000000000..30f1ca69b4 --- /dev/null +++ b/rust/otap-dataflow/crates/controller/src/live_control/execution.rs @@ -0,0 +1,906 @@ +// Copyright The OpenTelemetry Authors +// SPDX-License-Identifier: Apache-2.0 + +//! Background rollout and shutdown execution. +//! +//! The planning module records accepted operations and spawns workers; this +//! module contains the worker bodies. Each worker updates per-core progress, +//! drives runtime instance launch/shutdown, commits successful generations, and +//! performs best-effort rollback when a multi-step rollout fails. + +use super::*; + +impl + ControllerRuntime +{ + /// Emits the internal telemetry event for a rollout/shutdown worker panic. + pub(super) fn report_controller_worker_panic( + &self, + pipeline_key: &PipelineKey, + operation_kind: &'static str, + operation_id: &str, + report: &PanicReport, + ) { + let ErrorSummary::Pipeline { + error_kind, + message, + source, + } = report.error_summary() + else { + unreachable!("panic reports are always pipeline-level summaries"); + }; + + otel_error!( + "controller.worker_panic", + pipeline_group_id = %pipeline_key.pipeline_group_id(), + pipeline_id = %pipeline_key.pipeline_id(), + operation_kind = operation_kind, + operation_id = operation_id, + error_kind = error_kind.as_str(), + message = message.as_str(), + source = source.as_deref().unwrap_or(""), + ); + } + + /// Forces rollout terminal cleanup when the detached rollout worker panics. + pub(super) fn handle_rollout_worker_panic( + &self, + pipeline_key: &PipelineKey, + rollout_id: &str, + thread_name: String, + panic: Box, + ) { + let report = PanicReport::capture("rollout worker", panic, Some(thread_name), None, None); + let failure_reason = report.summary_message(); + self.request_rollout_panic_candidate_cleanup(pipeline_key, rollout_id); + self.update_rollout(pipeline_key, rollout_id, |rollout| { + rollout.state = RolloutLifecycleState::Failed; + rollout.failure_reason = Some(failure_reason.clone()); + }); + self.report_controller_worker_panic(pipeline_key, "rollout", rollout_id, &report); + self.finish_rollout(pipeline_key, rollout_id); + } + + /// Sends shutdown to candidate instances left behind by a panicking rollout worker. + fn request_rollout_panic_candidate_cleanup( + &self, + pipeline_key: &PipelineKey, + rollout_id: &str, + ) { + let (mut candidates, timeout_secs) = { + let state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + let Some(rollout) = state.rollouts.get(rollout_id) else { + return; + }; + let target_generation = rollout.target_generation; + let timeout_secs = rollout.drain_timeout_secs; + + // Only an uncommitted target generation is safe to clean up here. + // Resize rollouts use the committed generation, and a post-commit + // panic means the target generation is already the serving one. 
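+            // `noop` rollouts also reuse the committed generation, so this
+            // guard makes their panic cleanup a no-op as well.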
+ if state + .logical_pipelines + .get(pipeline_key) + .is_some_and(|record| record.active_generation == target_generation) + { + return; + } + + let candidates = state + .runtime_instances + .iter() + .filter_map(|(deployed_key, instance)| { + if deployed_key.pipeline_group_id == *pipeline_key.pipeline_group_id() + && deployed_key.pipeline_id == *pipeline_key.pipeline_id() + && deployed_key.deployment_generation == target_generation + && matches!(instance.lifecycle, RuntimeInstanceLifecycle::Active) + && instance.control_sender.is_some() + { + Some(deployed_key.clone()) + } else { + None + } + }) + .collect::>(); + (candidates, timeout_secs) + }; + candidates.sort_by_key(|deployed_key| deployed_key.core_id); + + for deployed_key in candidates { + if let Err(message) = + self.request_instance_shutdown(&deployed_key, timeout_secs, "rollout panic cleanup") + { + otel_error!( + "controller.rollout_panic_cleanup_failed", + pipeline_group_id = %deployed_key.pipeline_group_id.as_ref(), + pipeline_id = %deployed_key.pipeline_id.as_ref(), + core_id = deployed_key.core_id, + deployment_generation = deployed_key.deployment_generation, + rollout_id = rollout_id, + error = message.as_str(), + ); + } + } + } + + /// Forces shutdown terminal cleanup when the detached shutdown worker panics. + pub(super) fn handle_shutdown_worker_panic( + &self, + pipeline_key: &PipelineKey, + shutdown_id: &str, + thread_name: String, + panic: Box, + ) { + let report = PanicReport::capture("shutdown worker", panic, Some(thread_name), None, None); + let failure_reason = report.summary_message(); + self.update_shutdown(pipeline_key, shutdown_id, |shutdown| { + shutdown.state = ShutdownLifecycleState::Failed; + shutdown.failure_reason = Some(failure_reason.clone()); + }); + self.report_controller_worker_panic(pipeline_key, "shutdown", shutdown_id, &report); + } + + /// Runs one accepted rollout plan and records its terminal state. + pub(super) fn run_rollout(self: Arc, plan: CandidateRolloutPlan) { + self.update_rollout(&plan.pipeline_key, &plan.rollout.rollout_id, |rollout| { + rollout.state = RolloutLifecycleState::Running; + }); + + let result = match plan.action { + RolloutAction::Create => self.run_create_rollout(&plan), + RolloutAction::NoOp => Ok(()), + RolloutAction::Replace => self.run_replace_rollout(&plan), + RolloutAction::Resize => self.run_resize_rollout(&plan), + }; + + match result { + Ok(()) => { + self.update_rollout(&plan.pipeline_key, &plan.rollout.rollout_id, |rollout| { + rollout.state = RolloutLifecycleState::Succeeded; + rollout.failure_reason = None; + }); + } + Err(RolloutExecutionError::Failed(reason)) => { + self.update_rollout(&plan.pipeline_key, &plan.rollout.rollout_id, |rollout| { + rollout.state = RolloutLifecycleState::Failed; + rollout.failure_reason = Some(reason); + }); + } + Err(RolloutExecutionError::RollbackFailed(reason)) => { + self.update_rollout(&plan.pipeline_key, &plan.rollout.rollout_id, |rollout| { + rollout.state = RolloutLifecycleState::RollbackFailed; + rollout.failure_reason = Some(reason); + }); + } + } + + self.finish_rollout(&plan.pipeline_key, &plan.rollout.rollout_id); + } + + /// Drives one pipeline shutdown operation to completion or failure. 
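+    ///
+    /// The worker requests shutdown from every target instance up front, then
+    /// polls instance exits on a 50 ms cadence until all targets drain, one
+    /// exits with an error, or the overall deadline passes.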
+ pub(super) fn run_shutdown(self: Arc, plan: CandidateShutdownPlan) { + self.update_shutdown(&plan.pipeline_key, &plan.shutdown.shutdown_id, |shutdown| { + shutdown.state = ShutdownLifecycleState::Running; + }); + + for deployed_key in &plan.target_instances { + if let Err(message) = + self.request_instance_shutdown(deployed_key, plan.timeout_secs, "pipeline shutdown") + { + self.update_shutdown(&plan.pipeline_key, &plan.shutdown.shutdown_id, |shutdown| { + shutdown.state = ShutdownLifecycleState::Failed; + shutdown.failure_reason = Some(message.clone()); + if let Some(core) = shutdown.cores.iter_mut().find(|core| { + core.core_id == deployed_key.core_id + && core.deployment_generation == deployed_key.deployment_generation + }) { + core.state = "failed".to_owned(); + core.updated_at = timestamp_now(); + core.detail = Some(message.clone()); + } + }); + return; + } + + self.update_shutdown(&plan.pipeline_key, &plan.shutdown.shutdown_id, |shutdown| { + if let Some(core) = shutdown.cores.iter_mut().find(|core| { + core.core_id == deployed_key.core_id + && core.deployment_generation == deployed_key.deployment_generation + }) { + core.state = "shutdown_requested".to_owned(); + core.updated_at = timestamp_now(); + } + }); + } + + let deadline = Instant::now() + Duration::from_secs(plan.timeout_secs); + let mut remaining: HashSet<_> = plan.target_instances.iter().cloned().collect(); + while !remaining.is_empty() { + let mut completed = Vec::new(); + for deployed_key in &remaining { + match self.instance_exit(deployed_key) { + Some(RuntimeInstanceExit::Success) => { + completed.push(deployed_key.clone()); + } + Some(RuntimeInstanceExit::Error(error)) => { + self.update_shutdown( + &plan.pipeline_key, + &plan.shutdown.shutdown_id, + |shutdown| { + shutdown.state = ShutdownLifecycleState::Failed; + shutdown.failure_reason = Some(error.message.clone()); + if let Some(core) = shutdown.cores.iter_mut().find(|core| { + core.core_id == deployed_key.core_id + && core.deployment_generation + == deployed_key.deployment_generation + }) { + core.state = "failed".to_owned(); + core.updated_at = timestamp_now(); + core.detail = Some(error.message.clone()); + } + }, + ); + return; + } + None => {} + } + } + + for deployed_key in completed { + let _ = remaining.remove(&deployed_key); + self.update_shutdown(&plan.pipeline_key, &plan.shutdown.shutdown_id, |shutdown| { + if let Some(core) = shutdown.cores.iter_mut().find(|core| { + core.core_id == deployed_key.core_id + && core.deployment_generation == deployed_key.deployment_generation + }) { + core.state = "exited".to_owned(); + core.updated_at = timestamp_now(); + } + }); + } + + if remaining.is_empty() { + break; + } + + if Instant::now() >= deadline { + let failure_reason = remaining + .iter() + .next() + .map(|deployed_key| { + format!( + "timed out waiting for pipeline {}:{} core={} generation={} to drain", + deployed_key.pipeline_group_id.as_ref(), + deployed_key.pipeline_id.as_ref(), + deployed_key.core_id, + deployed_key.deployment_generation + ) + }) + .unwrap_or_else(|| "shutdown timed out".to_owned()); + self.update_shutdown(&plan.pipeline_key, &plan.shutdown.shutdown_id, |shutdown| { + shutdown.state = ShutdownLifecycleState::Failed; + shutdown.failure_reason = Some(failure_reason.clone()); + for deployed_key in &remaining { + if let Some(core) = shutdown.cores.iter_mut().find(|core| { + core.core_id == deployed_key.core_id + && core.deployment_generation == deployed_key.deployment_generation + }) { + core.state = "failed".to_owned(); + core.updated_at 
= timestamp_now(); + core.detail = Some(failure_reason.clone()); + } + } + }); + return; + } + + thread::sleep(Duration::from_millis(50)); + } + + self.update_shutdown(&plan.pipeline_key, &plan.shutdown.shutdown_id, |shutdown| { + shutdown.state = ShutdownLifecycleState::Succeeded; + }); + } + + /// Creates a brand-new logical pipeline by launching all target instances. + pub(super) fn run_create_rollout( + self: &Arc, + plan: &CandidateRolloutPlan, + ) -> Result<(), RolloutExecutionError> { + let mut launched = Vec::new(); + let deadline = Instant::now() + Duration::from_secs(plan.step_timeout_secs); + for core_id in &plan.target_assigned_cores { + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "starting", + None, + ); + let deployed_key = match self.launch_regular_pipeline_instance( + &plan.resolved_pipeline, + *core_id, + plan.target_generation, + ) { + Ok(deployed_key) => deployed_key, + Err(err) => { + let reason = err.to_string(); + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "failed", + Some(reason.clone()), + ); + // Create rollouts have no previous generation to restore, so a launch + // failure must tear down any candidate instances that were already started. + let _ = self.shutdown_instances(&launched, plan.drain_timeout_secs); + return Err(RolloutExecutionError::Failed(reason)); + } + }; + launched.push(deployed_key); + } + + for deployed_key in &launched { + self.wait_for_pipeline_ready(deployed_key, deadline) + .map_err(|reason| { + let _ = self.shutdown_instances(&launched, plan.drain_timeout_secs); + RolloutExecutionError::Failed(reason) + })?; + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + deployed_key.core_id, + "ready", + None, + ); + } + + self.commit_pipeline_record(plan, plan.target_generation); + Ok(()) + } + + /// Resizes a pipeline when only core allocation changed and common cores stay untouched. 
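+    ///
+    /// New cores launch at the committed generation and must report ready
+    /// before any removed core is drained; a failure on either side hands the
+    /// cores touched so far to `rollback_resize_rollout`.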
+ pub(super) fn run_resize_rollout( + self: &Arc, + plan: &CandidateRolloutPlan, + ) -> Result<(), RolloutExecutionError> { + let Some(current_record) = plan.current_record.as_ref() else { + return Err(RolloutExecutionError::Failed( + "internal error: resize rollout missing current record".to_owned(), + )); + }; + let active_generation = current_record.active_generation; + let mut started_cores = Vec::new(); + let mut retired_cores = Vec::new(); + + for core_id in &plan.resize_start_cores { + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "starting", + None, + ); + + let new_key = match self.launch_regular_pipeline_instance( + &plan.resolved_pipeline, + *core_id, + active_generation, + ) { + Ok(new_key) => new_key, + Err(err) => { + let reason = err.to_string(); + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "failed", + Some(reason.clone()), + ); + return self.rollback_resize_rollout( + plan, + &started_cores, + &retired_cores, + reason, + ); + } + }; + let ready_deadline = Instant::now() + Duration::from_secs(plan.step_timeout_secs); + if let Err(reason) = self.wait_for_pipeline_ready(&new_key, ready_deadline) { + let _ = self.shutdown_instances(&[new_key], plan.drain_timeout_secs); + return self.rollback_resize_rollout(plan, &started_cores, &retired_cores, reason); + } + + started_cores.push(*core_id); + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "started", + None, + ); + } + + for core_id in &plan.resize_stop_cores { + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "draining_old", + None, + ); + + let old_key = DeployedPipelineKey { + pipeline_group_id: plan.pipeline_group_id.clone(), + pipeline_id: plan.pipeline_id.clone(), + core_id: *core_id, + deployment_generation: active_generation, + }; + if let Err(reason) = + self.shutdown_instance(&old_key, plan.drain_timeout_secs, "resize rollout drain") + { + return self.rollback_resize_rollout(plan, &started_cores, &retired_cores, reason); + } + + retired_cores.push(*core_id); + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "retired", + None, + ); + } + + self.commit_pipeline_record(plan, active_generation); + self.clear_pipeline_serving_generations( + &plan.pipeline_key, + plan.current_assigned_cores + .iter() + .chain(plan.target_assigned_cores.iter()) + .copied(), + ); + Ok(()) + } + + /// Performs the serial rolling cutover used for topology or config changes. 
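+    ///
+    /// Added cores start the target generation first; each common core then
+    /// starts the new generation and drains its old instance before the next
+    /// core is touched; removed cores are drained last. Serving-generation
+    /// markers are updated per core so status stays accurate mid-cutover.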
+ pub(super) fn run_replace_rollout( + self: &Arc, + plan: &CandidateRolloutPlan, + ) -> Result<(), RolloutExecutionError> { + let Some(previous) = plan.current_record.as_ref() else { + return Err(RolloutExecutionError::Failed( + "internal error: replace rollout missing current record".to_owned(), + )); + }; + let previous_generation = previous.active_generation; + for core_id in &plan.current_assigned_cores { + self.observed_state_store.set_pipeline_serving_generation( + plan.pipeline_key.clone(), + *core_id, + previous_generation, + ); + } + + let mut activated_added_cores = Vec::new(); + let mut switched_common_cores = Vec::new(); + let mut retired_removed_cores = Vec::new(); + + for core_id in &plan.added_assigned_cores { + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "starting", + None, + ); + + let new_key = match self.launch_regular_pipeline_instance( + &plan.resolved_pipeline, + *core_id, + plan.target_generation, + ) { + Ok(new_key) => new_key, + Err(err) => { + let reason = err.to_string(); + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "failed", + Some(reason.clone()), + ); + return self.rollback_replace_rollout( + plan, + &switched_common_cores, + &activated_added_cores, + &retired_removed_cores, + reason, + ); + } + }; + let ready_deadline = Instant::now() + Duration::from_secs(plan.step_timeout_secs); + if let Err(reason) = self.wait_for_pipeline_ready(&new_key, ready_deadline) { + let _ = self.shutdown_instances(&[new_key], plan.drain_timeout_secs); + return self.rollback_replace_rollout( + plan, + &switched_common_cores, + &activated_added_cores, + &retired_removed_cores, + reason, + ); + } + + self.observed_state_store.set_pipeline_serving_generation( + plan.pipeline_key.clone(), + *core_id, + plan.target_generation, + ); + activated_added_cores.push(*core_id); + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "switched", + None, + ); + } + + for core_id in &plan.common_assigned_cores { + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "starting", + None, + ); + + let new_key = match self.launch_regular_pipeline_instance( + &plan.resolved_pipeline, + *core_id, + plan.target_generation, + ) { + Ok(new_key) => new_key, + Err(err) => { + let reason = err.to_string(); + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "failed", + Some(reason.clone()), + ); + return self.rollback_replace_rollout( + plan, + &switched_common_cores, + &activated_added_cores, + &retired_removed_cores, + reason, + ); + } + }; + let ready_deadline = Instant::now() + Duration::from_secs(plan.step_timeout_secs); + if let Err(reason) = self.wait_for_pipeline_ready(&new_key, ready_deadline) { + let _ = self.shutdown_instances(&[new_key], plan.drain_timeout_secs); + return self.rollback_replace_rollout( + plan, + &switched_common_cores, + &activated_added_cores, + &retired_removed_cores, + reason, + ); + } + + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "draining_old", + None, + ); + + let old_key = DeployedPipelineKey { + pipeline_group_id: plan.pipeline_group_id.clone(), + pipeline_id: plan.pipeline_id.clone(), + core_id: *core_id, + deployment_generation: previous_generation, + }; + if let Err(reason) = + self.shutdown_instance(&old_key, plan.drain_timeout_secs, "rolling cutover drain") + { + let _ = 
self.shutdown_instances(&[new_key], plan.drain_timeout_secs); + return self.rollback_replace_rollout( + plan, + &switched_common_cores, + &activated_added_cores, + &retired_removed_cores, + reason, + ); + } + + self.observed_state_store.set_pipeline_serving_generation( + plan.pipeline_key.clone(), + *core_id, + plan.target_generation, + ); + switched_common_cores.push(*core_id); + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "switched", + None, + ); + } + + for core_id in &plan.removed_assigned_cores { + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "draining_old", + None, + ); + + let old_key = DeployedPipelineKey { + pipeline_group_id: plan.pipeline_group_id.clone(), + pipeline_id: plan.pipeline_id.clone(), + core_id: *core_id, + deployment_generation: previous_generation, + }; + if let Err(reason) = self.shutdown_instance( + &old_key, + plan.drain_timeout_secs, + "resource policy rollout drain", + ) { + return self.rollback_replace_rollout( + plan, + &switched_common_cores, + &activated_added_cores, + &retired_removed_cores, + reason, + ); + } + + self.observed_state_store + .clear_pipeline_serving_generation(plan.pipeline_key.clone(), *core_id); + retired_removed_cores.push(*core_id); + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "retired", + None, + ); + } + + self.commit_pipeline_record(plan, plan.target_generation); + self.clear_pipeline_serving_generations( + &plan.pipeline_key, + plan.current_assigned_cores + .iter() + .chain(plan.target_assigned_cores.iter()) + .copied(), + ); + Ok(()) + } + + /// Restores the prior core footprint after a resize rollout fails. + pub(super) fn rollback_resize_rollout( + self: &Arc, + plan: &CandidateRolloutPlan, + started_cores: &[usize], + retired_cores: &[usize], + failure_reason: String, + ) -> Result<(), RolloutExecutionError> { + self.update_rollout(&plan.pipeline_key, &plan.rollout.rollout_id, |rollout| { + rollout.state = RolloutLifecycleState::RollingBack; + rollout.failure_reason = Some(failure_reason.clone()); + }); + let Some(previous) = plan.current_record.as_ref() else { + return Err(RolloutExecutionError::RollbackFailed( + "internal error: resize rollback missing current record".to_owned(), + )); + }; + let previous_generation = previous.active_generation; + + for core_id in retired_cores.iter().rev() { + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "rollback_starting", + None, + ); + + let old_key = self + .launch_regular_pipeline_instance(&previous.resolved, *core_id, previous_generation) + .map_err(|err| RolloutExecutionError::RollbackFailed(err.to_string()))?; + let ready_deadline = Instant::now() + Duration::from_secs(plan.step_timeout_secs); + self.wait_for_pipeline_ready(&old_key, ready_deadline) + .map_err(RolloutExecutionError::RollbackFailed)?; + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "rolled_back", + None, + ); + } + + for core_id in started_cores.iter().rev() { + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "rollback_starting", + None, + ); + + let new_key = DeployedPipelineKey { + pipeline_group_id: plan.pipeline_group_id.clone(), + pipeline_id: plan.pipeline_id.clone(), + core_id: *core_id, + deployment_generation: previous_generation, + }; + self.shutdown_instance(&new_key, plan.drain_timeout_secs, "rollback 
cleanup") + .map_err(RolloutExecutionError::RollbackFailed)?; + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "rolled_back", + None, + ); + } + + self.clear_pipeline_serving_generations( + &plan.pipeline_key, + plan.current_assigned_cores + .iter() + .chain(plan.target_assigned_cores.iter()) + .copied(), + ); + Err(RolloutExecutionError::Failed(failure_reason)) + } + + /// Restores the previous serving generation after a replace rollout fails. + pub(super) fn rollback_replace_rollout( + self: &Arc, + plan: &CandidateRolloutPlan, + switched_common_cores: &[usize], + activated_added_cores: &[usize], + retired_removed_cores: &[usize], + failure_reason: String, + ) -> Result<(), RolloutExecutionError> { + self.update_rollout(&plan.pipeline_key, &plan.rollout.rollout_id, |rollout| { + rollout.state = RolloutLifecycleState::RollingBack; + rollout.failure_reason = Some(failure_reason.clone()); + }); + let Some(previous) = plan.current_record.as_ref() else { + return Err(RolloutExecutionError::RollbackFailed( + "internal error: replace rollback missing current record".to_owned(), + )); + }; + let previous_generation = previous.active_generation; + + for core_id in retired_removed_cores.iter().rev() { + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "rollback_starting", + None, + ); + + let old_key = self + .launch_regular_pipeline_instance(&previous.resolved, *core_id, previous_generation) + .map_err(|err| RolloutExecutionError::RollbackFailed(err.to_string()))?; + let ready_deadline = Instant::now() + Duration::from_secs(plan.step_timeout_secs); + self.wait_for_pipeline_ready(&old_key, ready_deadline) + .map_err(RolloutExecutionError::RollbackFailed)?; + self.observed_state_store.set_pipeline_serving_generation( + plan.pipeline_key.clone(), + *core_id, + previous_generation, + ); + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "rolled_back", + None, + ); + } + + for core_id in switched_common_cores.iter().rev() { + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "rollback_starting", + None, + ); + + let old_key = self + .launch_regular_pipeline_instance(&previous.resolved, *core_id, previous_generation) + .map_err(|err| RolloutExecutionError::RollbackFailed(err.to_string()))?; + let ready_deadline = Instant::now() + Duration::from_secs(plan.step_timeout_secs); + self.wait_for_pipeline_ready(&old_key, ready_deadline) + .map_err(RolloutExecutionError::RollbackFailed)?; + + let new_key = DeployedPipelineKey { + pipeline_group_id: plan.pipeline_group_id.clone(), + pipeline_id: plan.pipeline_id.clone(), + core_id: *core_id, + deployment_generation: plan.target_generation, + }; + self.shutdown_instance(&new_key, plan.drain_timeout_secs, "rollback drain") + .map_err(RolloutExecutionError::RollbackFailed)?; + self.observed_state_store.set_pipeline_serving_generation( + plan.pipeline_key.clone(), + *core_id, + previous_generation, + ); + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "rolled_back", + None, + ); + } + + for core_id in activated_added_cores.iter().rev() { + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "rollback_starting", + None, + ); + + let new_key = DeployedPipelineKey { + pipeline_group_id: plan.pipeline_group_id.clone(), + pipeline_id: plan.pipeline_id.clone(), + core_id: *core_id, + 
deployment_generation: plan.target_generation, + }; + self.shutdown_instance(&new_key, plan.drain_timeout_secs, "rollback cleanup") + .map_err(RolloutExecutionError::RollbackFailed)?; + self.observed_state_store + .clear_pipeline_serving_generation(plan.pipeline_key.clone(), *core_id); + self.update_rollout_core_state( + &plan.pipeline_key, + &plan.rollout.rollout_id, + *core_id, + "rolled_back", + None, + ); + } + self.clear_pipeline_serving_generations( + &plan.pipeline_key, + plan.current_assigned_cores + .iter() + .chain(plan.target_assigned_cores.iter()) + .copied(), + ); + Err(RolloutExecutionError::Failed(failure_reason)) + } + + /// Best-effort cleanup helper for batches of launched candidate instances. + pub(super) fn shutdown_instances( + self: &Arc, + keys: &[DeployedPipelineKey], + timeout_secs: u64, + ) -> Result<(), String> { + for key in keys { + self.shutdown_instance(key, timeout_secs, "candidate cleanup")?; + } + Ok(()) + } +} diff --git a/rust/otap-dataflow/crates/controller/src/live_control/mod.rs b/rust/otap-dataflow/crates/controller/src/live_control/mod.rs new file mode 100644 index 0000000000..ba106f66a5 --- /dev/null +++ b/rust/otap-dataflow/crates/controller/src/live_control/mod.rs @@ -0,0 +1,529 @@ +// Copyright The OpenTelemetry Authors +// SPDX-License-Identifier: Apache-2.0 + +//! Live reconfiguration runtime for controller-owned pipelines. +//! +//! This module is the controller's internal state machine for live pipeline +//! rollout and shutdown. It translates admin control-plane requests into +//! concrete runtime-instance changes, tracks active and terminal operations, +//! reconciles pipeline-thread exits with controller bookkeeping, and compacts +//! observed state once old generations are no longer needed for coordination. +//! +//! The submodules intentionally split the lifecycle by concern: +//! planning validates requests and records accepted operations, execution runs +//! rollout/shutdown workers, runtime owns per-instance launch and exit +//! reporting, and state contains the shared in-memory model. + +use super::*; +use chrono::Utc; +use otap_df_admin::{ + ControlPlane, ControlPlaneError, PipelineDetails, + PipelineRolloutState as ApiPipelineRolloutState, + PipelineRolloutSummary as ApiPipelineRolloutSummary, ReconfigureRequest, RolloutCoreStatus, + RolloutStatus, ShutdownCoreStatus, ShutdownStatus, +}; +use otap_df_state::conditions::ConditionStatus; +use otap_df_state::phase::PipelinePhase; +use otap_df_state::pipeline_status::{PipelineRolloutState, PipelineRolloutSummary}; +use std::any::Any; +use std::backtrace::Backtrace; +use std::collections::VecDeque; +use std::io; +use std::panic::{AssertUnwindSafe, catch_unwind}; +use std::sync::{Condvar, Mutex}; +use std::time::{Duration, Instant}; + +mod execution; +mod planning; +mod runtime; +mod state; + +#[cfg(test)] +use self::state::TERMINAL_OPERATION_RETENTION_TTL; +use self::state::{ + ActiveRuntimeCoreState, CandidateRolloutPlan, CandidateShutdownPlan, ControllerRuntimeState, + LogicalPipelineRecord, RolloutAction, RolloutCoreProgress, RolloutExecutionError, + RolloutLifecycleState, RolloutRecord, RuntimeInstanceLifecycle, RuntimeInstanceRecord, + ShutdownCoreProgress, ShutdownLifecycleState, ShutdownRecord, TERMINAL_ROLLOUT_RETENTION_LIMIT, + TERMINAL_SHUTDOWN_RETENTION_LIMIT, TopicRuntimeProfile, is_expired, timestamp_now, +}; +pub(crate) use self::state::{PanicReport, RuntimeInstanceError, RuntimeInstanceExit}; + +/// Shared live-control runtime used by the admin control plane and workers. 
+/// +/// `ControllerRuntime` is the synchronization point for logical pipeline +/// records, deployed runtime instances, rollout/shutdown histories, observed +/// state updates, and topic/runtime registries. All mutable controller state is +/// kept behind `state`; pipeline execution threads report back through a +/// `Weak>` so they do not keep the controller alive during +/// teardown. +pub(super) struct ControllerRuntime { + /// Factory used to build runtime pipelines for new instances. + pipeline_factory: &'static PipelineFactory, + /// Static controller context cloned into launched pipeline threads. + controller_context: ControllerContext, + /// Mutable observed-state store used for compaction and status updates. + observed_state_store: ObservedStateStore, + /// Read handle used by wait paths to observe readiness and phase changes. + observed_state_handle: ObservedStateHandle, + /// Reporter used for lifecycle and runtime-error events. + engine_event_reporter: ObservedEventReporter, + /// Metrics reporter cloned into launched runtime instances. + metrics_reporter: MetricsReporter, + /// Topic registry shared by all runtime instances. + declared_topics: DeclaredTopics, + /// Controller-wide core ids available for policy-based allocation. + available_core_ids: Vec, + /// Tracing setup cloned into launched runtime threads. + engine_tracing_setup: TracingSetup, + /// Runtime telemetry reporting cadence. + telemetry_reporting_interval: Duration, + /// Memory-pressure signal fanout shared with pipeline runtimes. + memory_pressure_tx: tokio::sync::watch::Sender, + /// All mutable live-control state protected by a single mutex. + state: Mutex, + /// Wakes global shutdown waiters when runtime instance liveness changes. + state_changed: Condvar, +} + +/// Thin adapter that exposes `ControllerRuntime` through the admin trait. +struct ControllerControlPlane { + runtime: Arc>, +} + +/// Result of launching one pipeline runtime thread. +/// +/// The controller stores the `control_sender` while the instance is active and +/// drops it after shutdown is requested so the pipeline can observe control +/// channel closure once node tasks finish. +pub(super) struct LaunchedPipelineThread { + /// Concrete deployed instance key for the launched runtime thread. + pub(super) pipeline_key: DeployedPipelineKey, + /// Admin sender used by live control to send shutdown to the instance. + pub(super) control_sender: Arc, + /// Keeps the launch result tied to the pipeline data type. + pub(super) _marker: std::marker::PhantomData, +} + +impl + ControllerRuntime +{ + #[allow(clippy::too_many_arguments)] + /// Builds the resident controller runtime used by live reconfiguration. 
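+    ///
+    /// The operation maps and instance registry start empty; only
+    /// `live_config` is seeded here, and pipelines already committed at
+    /// startup are registered afterwards via `register_committed_pipeline`.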
+ pub(super) fn new( + pipeline_factory: &'static PipelineFactory, + controller_context: ControllerContext, + observed_state_store: ObservedStateStore, + observed_state_handle: ObservedStateHandle, + engine_event_reporter: ObservedEventReporter, + metrics_reporter: MetricsReporter, + declared_topics: DeclaredTopics, + available_core_ids: Vec, + engine_tracing_setup: TracingSetup, + telemetry_reporting_interval: Duration, + memory_pressure_tx: tokio::sync::watch::Sender, + live_config: OtelDataflowSpec, + ) -> Self { + Self { + pipeline_factory, + controller_context, + observed_state_store, + observed_state_handle, + engine_event_reporter, + metrics_reporter, + declared_topics, + available_core_ids, + engine_tracing_setup, + telemetry_reporting_interval, + memory_pressure_tx, + state: Mutex::new(ControllerRuntimeState { + live_config, + logical_pipelines: HashMap::new(), + runtime_instances: HashMap::new(), + pending_instance_exits: HashMap::new(), + rollouts: HashMap::new(), + active_rollouts: HashMap::new(), + terminal_rollouts: HashMap::new(), + shutdowns: HashMap::new(), + active_shutdowns: HashMap::new(), + terminal_shutdowns: HashMap::new(), + generation_counters: HashMap::new(), + active_instances: 0, + next_rollout_id: 0, + next_shutdown_id: 0, + next_thread_id: 1, + first_error: None, + }), + state_changed: Condvar::new(), + } + } + + /// Seeds the runtime registry with a pipeline already committed at startup. + pub(super) fn register_committed_pipeline( + &self, + resolved: ResolvedPipelineConfig, + generation: u64, + ) { + let pipeline_key = PipelineKey::new( + resolved.pipeline_group_id.clone(), + resolved.pipeline_id.clone(), + ); + if let Ok(active_cores) = self.assigned_cores_for_resolved(&resolved) { + self.observed_state_store + .set_pipeline_active_cores(pipeline_key.clone(), active_cores); + } + self.observed_state_store + .set_pipeline_active_generation(pipeline_key.clone(), generation); + + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + _ = state + .generation_counters + .insert(pipeline_key.clone(), generation + 1); + _ = state.logical_pipelines.insert( + pipeline_key, + LogicalPipelineRecord { + resolved, + active_generation: generation, + }, + ); + } + + /// Allocates the next controller-local logical thread identifier. + pub(super) fn next_thread_id(&self) -> usize { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + let thread_id = state.next_thread_id; + state.next_thread_id += 1; + thread_id + } + + /// Returns the declared-topic registry shared with launched pipelines. + pub(super) fn declared_topics(&self) -> &DeclaredTopics { + &self.declared_topics + } + + /// Exposes the runtime as the admin control-plane trait object. + pub(super) fn control_plane(self: &Arc) -> Arc { + Arc::new(ControllerControlPlane { + runtime: Arc::clone(self), + }) + } + + /// Checks whether a logical pipeline still has an active rollout or shutdown. + fn pipeline_has_active_operation_locked( + state: &ControllerRuntimeState, + pipeline_key: &PipelineKey, + ) -> bool { + state.active_rollouts.contains_key(pipeline_key) + || state.active_shutdowns.contains_key(pipeline_key) + } + + /// Applies a terminal instance exit to controller state after the instance + /// has been registered as active. 
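+    ///
+    /// It records the exit on the instance, decrements the active-instance
+    /// count, captures the first runtime error, and returns whether exited
+    /// instances for the logical pipeline could be pruned immediately.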
+ fn apply_instance_exit_locked( + state: &mut ControllerRuntimeState, + pipeline_key: &DeployedPipelineKey, + exit: &RuntimeInstanceExit, + ) -> bool { + if let Some(instance) = state.runtime_instances.get_mut(pipeline_key) { + instance.lifecycle = RuntimeInstanceLifecycle::Exited(exit.clone()); + } + state.active_instances = state.active_instances.saturating_sub(1); + if let RuntimeInstanceExit::Error(error) = exit { + if state.first_error.is_none() { + state.first_error = Some(error.message.clone()); + } + } + let logical_pipeline_key = PipelineKey::new( + pipeline_key.pipeline_group_id.clone(), + pipeline_key.pipeline_id.clone(), + ); + Self::prune_exited_runtime_instances_for_pipeline_locked(state, &logical_pipeline_key) + } + + /// Marks a rollout terminal and enqueues it for bounded retention. + fn record_terminal_rollout_locked( + state: &mut ControllerRuntimeState, + pipeline_key: &PipelineKey, + rollout_id: &str, + now: Instant, + ) { + let mut enqueue = false; + if let Some(rollout) = state.rollouts.get_mut(rollout_id) { + if rollout.state.is_terminal() && rollout.completed_at.is_none() { + rollout.completed_at = Some(now); + enqueue = true; + } + } + if enqueue { + state + .terminal_rollouts + .entry(pipeline_key.clone()) + .or_default() + .push_back(rollout_id.to_owned()); + } + Self::prune_terminal_rollout_queue_locked(state, pipeline_key, now); + } + + /// Evicts expired or over-cap terminal rollout snapshots for one pipeline. + fn prune_terminal_rollout_queue_locked( + state: &mut ControllerRuntimeState, + pipeline_key: &PipelineKey, + now: Instant, + ) { + while let Some((rollout_id, queue_len)) = + state.terminal_rollouts.get(pipeline_key).and_then(|queue| { + queue + .front() + .cloned() + .map(|rollout_id| (rollout_id, queue.len())) + }) + { + let should_evict = queue_len > TERMINAL_ROLLOUT_RETENTION_LIMIT + || state + .rollouts + .get(&rollout_id) + .is_none_or(|rollout| is_expired(rollout.completed_at, now)); + if !should_evict { + break; + } + + if let Some(evicted_id) = state + .terminal_rollouts + .get_mut(pipeline_key) + .and_then(VecDeque::pop_front) + { + _ = state.rollouts.remove(&evicted_id); + } + } + + if state + .terminal_rollouts + .get(pipeline_key) + .is_some_and(VecDeque::is_empty) + { + _ = state.terminal_rollouts.remove(pipeline_key); + } + } + + /// Marks a shutdown terminal and enqueues it for bounded retention. + fn record_terminal_shutdown_locked( + state: &mut ControllerRuntimeState, + pipeline_key: &PipelineKey, + shutdown_id: &str, + now: Instant, + ) { + let mut enqueue = false; + if let Some(shutdown) = state.shutdowns.get_mut(shutdown_id) { + if shutdown.state.is_terminal() && shutdown.completed_at.is_none() { + shutdown.completed_at = Some(now); + enqueue = true; + } + } + if enqueue { + state + .terminal_shutdowns + .entry(pipeline_key.clone()) + .or_default() + .push_back(shutdown_id.to_owned()); + } + Self::prune_terminal_shutdown_queue_locked(state, pipeline_key, now); + } + + /// Evicts expired or over-cap terminal shutdown snapshots for one pipeline. 
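+    ///
+    /// Mirrors the rollout variant: entries pop from the queue front while the
+    /// queue exceeds its retention cap or the front record's TTL has expired,
+    /// and an emptied queue is removed outright.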
+ fn prune_terminal_shutdown_queue_locked( + state: &mut ControllerRuntimeState, + pipeline_key: &PipelineKey, + now: Instant, + ) { + while let Some((shutdown_id, queue_len)) = state + .terminal_shutdowns + .get(pipeline_key) + .and_then(|queue| { + queue + .front() + .cloned() + .map(|shutdown_id| (shutdown_id, queue.len())) + }) + { + let should_evict = queue_len > TERMINAL_SHUTDOWN_RETENTION_LIMIT + || state + .shutdowns + .get(&shutdown_id) + .is_none_or(|shutdown| is_expired(shutdown.completed_at, now)); + if !should_evict { + break; + } + + if let Some(evicted_id) = state + .terminal_shutdowns + .get_mut(pipeline_key) + .and_then(VecDeque::pop_front) + { + _ = state.shutdowns.remove(&evicted_id); + } + } + + if state + .terminal_shutdowns + .get(pipeline_key) + .is_some_and(VecDeque::is_empty) + { + _ = state.terminal_shutdowns.remove(pipeline_key); + } + } + + /// Runs TTL/cap eviction across all retained terminal operation history. + fn prune_terminal_operation_history_locked(state: &mut ControllerRuntimeState, now: Instant) { + let rollout_keys: Vec<_> = state.terminal_rollouts.keys().cloned().collect(); + for pipeline_key in rollout_keys { + Self::prune_terminal_rollout_queue_locked(state, &pipeline_key, now); + } + + let shutdown_keys: Vec<_> = state.terminal_shutdowns.keys().cloned().collect(); + for pipeline_key in shutdown_keys { + Self::prune_terminal_shutdown_queue_locked(state, &pipeline_key, now); + } + } + + /// Drops exited runtime instances once no active controller work still needs them. + fn prune_exited_runtime_instances_for_pipeline_locked( + state: &mut ControllerRuntimeState, + pipeline_key: &PipelineKey, + ) -> bool { + if Self::pipeline_has_active_operation_locked(state, pipeline_key) { + return false; + } + + state.runtime_instances.retain(|deployed_key, instance| { + if deployed_key.pipeline_group_id != *pipeline_key.pipeline_group_id() + || deployed_key.pipeline_id != *pipeline_key.pipeline_id() + { + return true; + } + + matches!(instance.lifecycle, RuntimeInstanceLifecycle::Active) + }); + true + } + + /// Opportunistically trims retained rollout and shutdown history. + fn prune_retained_operation_history(&self) { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + Self::prune_terminal_operation_history_locked(&mut state, Instant::now()); + } + + /// Trims exited instances and terminal history for one logical pipeline. 
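+    ///
+    /// When no rollout or shutdown is still active for the pipeline, exited
+    /// instances are dropped and observed state is compacted so stale
+    /// per-generation entries leave the status views.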
+ fn prune_pipeline_runtime_and_history(&self, pipeline_key: &PipelineKey) { + let should_compact = { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + let should_compact = + Self::prune_exited_runtime_instances_for_pipeline_locked(&mut state, pipeline_key); + Self::prune_terminal_rollout_queue_locked(&mut state, pipeline_key, Instant::now()); + Self::prune_terminal_shutdown_queue_locked(&mut state, pipeline_key, Instant::now()); + should_compact + }; + if should_compact { + self.observed_state_store + .compact_pipeline_instances(pipeline_key); + } + } +} + +impl + ControlPlane for ControllerControlPlane +{ + fn shutdown_all(&self, timeout_secs: u64) -> Result<(), ControlPlaneError> { + self.runtime.request_shutdown_all(timeout_secs) + } + + fn shutdown_pipeline( + &self, + pipeline_group_id: &str, + pipeline_id: &str, + timeout_secs: u64, + ) -> Result { + self.runtime + .request_shutdown_pipeline(pipeline_group_id, pipeline_id, timeout_secs) + } + + fn reconfigure_pipeline( + &self, + pipeline_group_id: &str, + pipeline_id: &str, + request: ReconfigureRequest, + ) -> Result { + let plan = self + .runtime + .prepare_rollout_plan(pipeline_group_id, pipeline_id, &request)?; + self.runtime.spawn_rollout(plan) + } + + fn pipeline_details( + &self, + pipeline_group_id: &str, + pipeline_id: &str, + ) -> Result, ControlPlaneError> { + self.runtime.pipeline_details_snapshot(&PipelineKey::new( + pipeline_group_id.to_owned().into(), + pipeline_id.to_owned().into(), + )) + } + + fn rollout_status( + &self, + pipeline_group_id: &str, + pipeline_id: &str, + rollout_id: &str, + ) -> Result, ControlPlaneError> { + let expected_key = PipelineKey::new( + pipeline_group_id.to_owned().into(), + pipeline_id.to_owned().into(), + ); + let Some(status) = self.runtime.rollout_status_snapshot(rollout_id) else { + return Ok(None); + }; + let actual_key = + PipelineKey::new(status.pipeline_group_id.clone(), status.pipeline_id.clone()); + if actual_key != expected_key { + return Err(ControlPlaneError::RolloutNotFound); + } + Ok(Some(status)) + } + + fn shutdown_status( + &self, + pipeline_group_id: &str, + pipeline_id: &str, + shutdown_id: &str, + ) -> Result, ControlPlaneError> { + let expected_key = PipelineKey::new( + pipeline_group_id.to_owned().into(), + pipeline_id.to_owned().into(), + ); + let Some(status) = self.runtime.shutdown_status_snapshot(shutdown_id) else { + return Ok(None); + }; + let actual_key = + PipelineKey::new(status.pipeline_group_id.clone(), status.pipeline_id.clone()); + if actual_key != expected_key { + return Err(ControlPlaneError::ShutdownNotFound); + } + Ok(Some(status)) + } +} + +#[cfg(test)] +#[path = "../live_control_tests.rs"] +mod tests; diff --git a/rust/otap-dataflow/crates/controller/src/live_control/planning.rs b/rust/otap-dataflow/crates/controller/src/live_control/planning.rs new file mode 100644 index 0000000000..725c702343 --- /dev/null +++ b/rust/otap-dataflow/crates/controller/src/live_control/planning.rs @@ -0,0 +1,883 @@ +// Copyright The OpenTelemetry Authors +// SPDX-License-Identifier: Apache-2.0 + +//! Request planning, operation recording, and worker spawning. +//! +//! Planning converts admin requests into explicit candidate plans while holding +//! no long-running runtime resources. It also owns operation-record insertion +//! and status snapshot materialization because those steps are tightly coupled +//! to conflict detection and bounded history retention. 
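+//!
+//! A sketch of the accept path, hedged: `prepare_rollout_plan` and
+//! `spawn_rollout` are the controller methods used by the admin control
+//! plane; the other bindings below are placeholders.
+//!
+//! ```ignore
+//! // Validate the request against a cloned config snapshot and classify it.
+//! let plan = runtime.prepare_rollout_plan(group_id, pipeline_id, &request)?;
+//! // Record the accepted rollout and hand the plan to a detached worker.
+//! runtime.spawn_rollout(plan)
+//! ```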
+ +use super::*; + +impl + ControllerRuntime +{ + /// Resolves the concrete core ids selected by a pipeline resource policy. + pub(super) fn assigned_cores_for_resolved( + &self, + resolved_pipeline: &ResolvedPipelineConfig, + ) -> Result, ControlPlaneError> { + Controller::::select_cores_for_allocation( + self.available_core_ids.clone(), + &resolved_pipeline.policies.resources.core_allocation, + ) + .map(|cores| cores.into_iter().map(|core| core.id).collect()) + .map_err(|err| ControlPlaneError::InvalidRequest { + message: err.to_string(), + }) + } + + /// Reports which active cores still belong to the current committed generation. + pub(super) fn active_runtime_core_state( + &self, + pipeline_key: &PipelineKey, + active_generation: u64, + ) -> ActiveRuntimeCoreState { + let state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + let mut current_generation_cores = Vec::new(); + let mut has_foreign_active_generations = false; + + for (deployed_key, instance) in &state.runtime_instances { + if deployed_key.pipeline_group_id != *pipeline_key.pipeline_group_id() + || deployed_key.pipeline_id != *pipeline_key.pipeline_id() + || !matches!(instance.lifecycle, RuntimeInstanceLifecycle::Active) + { + continue; + } + + if deployed_key.deployment_generation == active_generation { + current_generation_cores.push(deployed_key.core_id); + } else { + has_foreign_active_generations = true; + } + } + + current_generation_cores.sort_unstable(); + ActiveRuntimeCoreState { + current_generation_cores, + has_foreign_active_generations, + } + } + + /// Builds the effective runtime topic profile map used to reject broker mutations. + pub(super) fn pipeline_topic_profiles( + config: &OtelDataflowSpec, + ) -> Result, ControlPlaneError> { + let (global_names, group_names) = + Controller::::build_declared_topic_name_maps(config).map_err(|err| { + ControlPlaneError::InvalidRequest { + message: err.to_string(), + } + })?; + Controller::::validate_topic_wiring_acyclic(config, &global_names, &group_names) + .map_err(|err| ControlPlaneError::InvalidRequest { + message: err.to_string(), + })?; + let (inferred_modes, _) = + Controller::::infer_topic_modes(config, &global_names, &group_names).map_err( + |err| ControlPlaneError::InvalidRequest { + message: err.to_string(), + }, + )?; + let default_selection_policy = config.engine.topics.impl_selection; + + let mut profiles = HashMap::new(); + for (topic_name, spec) in &config.topics { + let declared_name = global_names + .get(topic_name) + .ok_or_else(|| ControlPlaneError::Internal { + message: format!( + "missing declared topic name for global topic `{}` while building runtime profiles", + topic_name.as_ref() + ), + })? 
+ .clone(); + let topology_mode = inferred_modes + .get(&declared_name) + .copied() + .unwrap_or(InferredTopicMode::Mixed); + let selection_policy = spec.impl_selection.unwrap_or(default_selection_policy); + let selected_mode = Controller::::apply_topic_impl_selection_policy( + topology_mode, + selection_policy, + ); + _ = profiles.insert( + declared_name, + TopicRuntimeProfile { + backend: spec.backend, + policies: spec.policies.clone(), + selected_mode, + }, + ); + } + + for (group_id, group_cfg) in &config.groups { + for (topic_name, spec) in &group_cfg.topics { + let declared_name = group_names + .get(&(group_id.clone(), topic_name.clone())) + .ok_or_else(|| ControlPlaneError::Internal { + message: format!( + "missing declared topic name for group `{}` topic `{}` while building runtime profiles", + group_id.as_ref(), + topic_name.as_ref() + ), + })? + .clone(); + let topology_mode = inferred_modes + .get(&declared_name) + .copied() + .unwrap_or(InferredTopicMode::Mixed); + let selection_policy = spec.impl_selection.unwrap_or(default_selection_policy); + let selected_mode = Controller::::apply_topic_impl_selection_policy( + topology_mode, + selection_policy, + ); + _ = profiles.insert( + declared_name, + TopicRuntimeProfile { + backend: spec.backend, + policies: spec.policies.clone(), + selected_mode, + }, + ); + } + } + + Ok(profiles) + } + + /// Classifies a reconfigure request and prepares the rollout state machine inputs. + pub(super) fn prepare_rollout_plan( + &self, + pipeline_group_id: &str, + pipeline_id: &str, + request: &ReconfigureRequest, + ) -> Result { + let pipeline_group_id: PipelineGroupId = pipeline_group_id.to_owned().into(); + let pipeline_id: PipelineId = pipeline_id.to_owned().into(); + let pipeline_key = PipelineKey::new(pipeline_group_id.clone(), pipeline_id.clone()); + + let (live_config, current_record) = { + let state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + if !state.live_config.groups.contains_key(&pipeline_group_id) { + return Err(ControlPlaneError::GroupNotFound); + } + if state.active_rollouts.contains_key(&pipeline_key) + || state.active_shutdowns.contains_key(&pipeline_key) + { + return Err(ControlPlaneError::RolloutConflict); + } + ( + state.live_config.clone(), + state.logical_pipelines.get(&pipeline_key).cloned(), + ) + }; + + let candidate_pipeline = request.pipeline.clone(); + candidate_pipeline + .validate(&pipeline_group_id, &pipeline_id) + .map_err(|err| ControlPlaneError::InvalidRequest { + message: err.to_string(), + })?; + + let mut candidate_config = live_config.clone(); + let group_cfg = candidate_config + .groups + .get_mut(&pipeline_group_id) + .ok_or_else(|| ControlPlaneError::Internal { + message: format!( + "group `{}` disappeared while preparing rollout plan", + pipeline_group_id.as_ref() + ), + })?; + _ = group_cfg + .pipelines + .insert(pipeline_id.clone(), candidate_pipeline.clone()); + + candidate_config + .validate() + .map_err(|err| ControlPlaneError::InvalidRequest { + message: err.to_string(), + })?; + Controller::::validate_engine_components_with_factory( + self.pipeline_factory, + &candidate_config, + ) + .map_err(|message| ControlPlaneError::InvalidRequest { message })?; + + let current_profiles = Self::pipeline_topic_profiles(&live_config)?; + let candidate_profiles = Self::pipeline_topic_profiles(&candidate_config)?; + if current_profiles != candidate_profiles { + return Err(ControlPlaneError::InvalidRequest { + message: "request would require runtime topic broker 
mutation".to_owned(), + }); + } + + let resolved_pipeline = candidate_config + .resolve() + .pipelines + .into_iter() + .find(|pipeline| { + pipeline.role == ResolvedPipelineRole::Regular + && pipeline.pipeline_group_id == pipeline_group_id + && pipeline.pipeline_id == pipeline_id + }) + .ok_or_else(|| ControlPlaneError::Internal { + message: "candidate pipeline disappeared during resolution".to_owned(), + })?; + let current_assigned_cores = if let Some(record) = current_record.as_ref() { + self.assigned_cores_for_resolved(&record.resolved)? + } else { + Vec::new() + }; + let target_assigned_cores = self.assigned_cores_for_resolved(&resolved_pipeline)?; + let current_core_set: HashSet<_> = current_assigned_cores.iter().copied().collect(); + let target_core_set: HashSet<_> = target_assigned_cores.iter().copied().collect(); + let active_runtime_state = current_record + .as_ref() + .map(|record| self.active_runtime_core_state(&pipeline_key, record.active_generation)) + .unwrap_or(ActiveRuntimeCoreState { + current_generation_cores: Vec::new(), + has_foreign_active_generations: false, + }); + let active_core_set: HashSet<_> = active_runtime_state + .current_generation_cores + .iter() + .copied() + .collect(); + let common_assigned_cores: Vec<_> = target_assigned_cores + .iter() + .copied() + .filter(|core_id| current_core_set.contains(core_id)) + .collect(); + let added_assigned_cores: Vec<_> = target_assigned_cores + .iter() + .copied() + .filter(|core_id| !current_core_set.contains(core_id)) + .collect(); + let removed_assigned_cores: Vec<_> = current_assigned_cores + .iter() + .copied() + .filter(|core_id| !target_core_set.contains(core_id)) + .collect(); + let resize_start_cores: Vec<_> = target_assigned_cores + .iter() + .copied() + .filter(|core_id| !active_core_set.contains(core_id)) + .collect(); + let resize_stop_cores: Vec<_> = active_runtime_state + .current_generation_cores + .iter() + .copied() + .filter(|core_id| !target_core_set.contains(core_id)) + .collect(); + let action = if let Some(record) = current_record.as_ref() { + let identical_update = current_assigned_cores == target_assigned_cores + && active_runtime_state.current_generation_cores == target_assigned_cores + && !active_runtime_state.has_foreign_active_generations + && record.resolved.runtime_matches(&resolved_pipeline); + let resize_only = current_assigned_cores != target_assigned_cores + && !active_runtime_state.has_foreign_active_generations + && record + .resolved + .runtime_shape_matches_ignoring_resources(&resolved_pipeline); + if identical_update { + RolloutAction::NoOp + } else if resize_only { + RolloutAction::Resize + } else { + RolloutAction::Replace + } + } else { + RolloutAction::Create + }; + let (resize_start_cores, resize_stop_cores) = match action { + RolloutAction::Resize => (resize_start_cores, resize_stop_cores), + RolloutAction::Create | RolloutAction::NoOp | RolloutAction::Replace => { + (Vec::new(), Vec::new()) + } + }; + let previous_generation = current_record + .as_ref() + .map(|record| record.active_generation); + + let (rollout_id, target_generation) = { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + if state.active_rollouts.contains_key(&pipeline_key) + || state.active_shutdowns.contains_key(&pipeline_key) + { + return Err(ControlPlaneError::RolloutConflict); + } + let rollout_id = format!("rollout-{}", state.next_rollout_id); + state.next_rollout_id += 1; + let target_generation = match action { + RolloutAction::NoOp | 
RolloutAction::Resize => { + previous_generation.ok_or_else(|| ControlPlaneError::Internal { + message: format!( + "rollout planner produced {:?} for {}:{} without a current generation", + action, + pipeline_key.pipeline_group_id().as_ref(), + pipeline_key.pipeline_id().as_ref() + ), + })? + } + RolloutAction::Create | RolloutAction::Replace => { + let generation_counter = state + .generation_counters + .entry(pipeline_key.clone()) + .or_insert(0); + let target_generation = *generation_counter; + *generation_counter += 1; + target_generation + } + }; + (rollout_id, target_generation) + }; + + let rollout_core_ids = match action { + RolloutAction::NoOp => Vec::new(), + RolloutAction::Resize => { + let mut ids = resize_start_cores.clone(); + let additional_stop_cores: Vec<_> = resize_stop_cores + .iter() + .copied() + .filter(|core_id| !ids.contains(core_id)) + .collect(); + ids.extend(additional_stop_cores); + ids + } + RolloutAction::Create | RolloutAction::Replace => { + let mut ids = target_assigned_cores.clone(); + ids.extend(removed_assigned_cores.iter().copied()); + ids + } + }; + let cores = rollout_core_ids + .into_iter() + .map(|core_id| RolloutCoreProgress { + core_id, + previous_generation: match action { + RolloutAction::Create => None, + RolloutAction::NoOp => active_core_set + .contains(&core_id) + .then_some(previous_generation) + .flatten(), + RolloutAction::Replace => current_core_set + .contains(&core_id) + .then_some(previous_generation) + .flatten(), + RolloutAction::Resize => active_core_set + .contains(&core_id) + .then_some(previous_generation) + .flatten(), + }, + target_generation, + state: "pending".to_owned(), + updated_at: timestamp_now(), + detail: None, + }) + .collect(); + let step_timeout_secs = request.step_timeout_secs.max(1); + let drain_timeout_secs = request.drain_timeout_secs.max(1); + let rollout = RolloutRecord::new( + rollout_id, + pipeline_group_id.clone(), + pipeline_id.clone(), + action, + target_generation, + current_record + .as_ref() + .map(|record| record.active_generation), + drain_timeout_secs, + cores, + ); + + Ok(CandidateRolloutPlan { + pipeline_key, + pipeline_group_id, + pipeline_id, + action, + resolved_pipeline, + current_record, + current_assigned_cores, + target_assigned_cores, + common_assigned_cores, + added_assigned_cores, + removed_assigned_cores, + resize_start_cores, + resize_stop_cores, + target_generation, + rollout, + step_timeout_secs, + drain_timeout_secs, + }) + } + + /// Registers a newly accepted rollout and publishes its initial summary. + pub(super) fn insert_rollout( + &self, + pipeline_key: &PipelineKey, + rollout: RolloutRecord, + ) -> Result<(), ControlPlaneError> { + self.prune_retained_operation_history(); + { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + if state.active_rollouts.contains_key(pipeline_key) + || state.active_shutdowns.contains_key(pipeline_key) + { + return Err(ControlPlaneError::RolloutConflict); + } + _ = state + .active_rollouts + .insert(pipeline_key.clone(), rollout.rollout_id.clone()); + _ = state + .rollouts + .insert(rollout.rollout_id.clone(), rollout.clone()); + } + self.observed_state_store + .set_pipeline_rollout_summary(pipeline_key.clone(), rollout.summary()); + Ok(()) + } + + /// Applies an in-place update to a rollout record and refreshes observed state. 
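+    ///
+    /// A minimal sketch of the closure-based update pattern; the rollout id
+    /// and failure message are hypothetical:
+    ///
+    /// ```ignore
+    /// // Mark the rollout failed; `update_rollout` stamps `updated_at`,
+    /// // records the terminal snapshot, and republishes the summary.
+    /// runtime.update_rollout(&pipeline_key, "rollout-7", |rollout| {
+    ///     rollout.state = RolloutLifecycleState::Failed;
+    ///     rollout.failure_reason = Some("core 3 never became ready".to_owned());
+    /// });
+    /// ```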
+ pub(super) fn update_rollout(&self, pipeline_key: &PipelineKey, rollout_id: &str, update: F) + where + F: FnOnce(&mut RolloutRecord), + { + let summary = { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + let Some(rollout) = state.rollouts.get_mut(rollout_id) else { + return; + }; + update(rollout); + rollout.updated_at = timestamp_now(); + let is_terminal = rollout.state.is_terminal(); + let summary = rollout.summary(); + if is_terminal { + Self::record_terminal_rollout_locked( + &mut state, + pipeline_key, + rollout_id, + Instant::now(), + ); + } + summary + }; + self.observed_state_store + .set_pipeline_rollout_summary(pipeline_key.clone(), summary); + } + + /// Updates the per-core progress entry for a rollout. + pub(super) fn update_rollout_core_state( + &self, + pipeline_key: &PipelineKey, + rollout_id: &str, + core_id: usize, + state: &str, + detail: Option, + ) { + self.update_rollout(pipeline_key, rollout_id, |rollout| { + if let Some(core) = rollout + .cores + .iter_mut() + .find(|core| core.core_id == core_id) + { + core.state = state.to_owned(); + core.updated_at = timestamp_now(); + core.detail = detail; + } + }); + } + + /// Marks a rollout inactive and prunes any no-longer-needed retained state. + pub(super) fn finish_rollout(&self, pipeline_key: &PipelineKey, rollout_id: &str) { + { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + if state + .active_rollouts + .get(pipeline_key) + .is_some_and(|id| id == rollout_id) + { + let _ = state.active_rollouts.remove(pipeline_key); + } + } + self.prune_pipeline_runtime_and_history(pipeline_key); + } + + /// Returns the latest rollout snapshot, evicting expired history first. + pub(super) fn rollout_status_snapshot(&self, rollout_id: &str) -> Option { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + Self::prune_terminal_operation_history_locked(&mut state, Instant::now()); + state.rollouts.get(rollout_id).map(RolloutRecord::status) + } + + /// Clears temporary serving-generation overrides after a rollout settles. + pub(super) fn clear_pipeline_serving_generations( + &self, + pipeline_key: &PipelineKey, + core_ids: I, + ) where + I: IntoIterator, + { + for core_id in core_ids { + self.observed_state_store + .clear_pipeline_serving_generation(pipeline_key.clone(), core_id); + } + } + + /// Commits the winning pipeline config and active generation into runtime state. + pub(super) fn commit_pipeline_record( + &self, + plan: &CandidateRolloutPlan, + active_generation: u64, + ) { + { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + if let Some(group_cfg) = state.live_config.groups.get_mut(&plan.pipeline_group_id) { + _ = group_cfg.pipelines.insert( + plan.pipeline_id.clone(), + plan.resolved_pipeline.pipeline.clone(), + ); + } + _ = state.logical_pipelines.insert( + plan.pipeline_key.clone(), + LogicalPipelineRecord { + resolved: plan.resolved_pipeline.clone(), + active_generation, + }, + ); + } + self.observed_state_store.set_pipeline_active_cores( + plan.pipeline_key.clone(), + plan.target_assigned_cores.iter().copied(), + ); + self.observed_state_store + .set_pipeline_active_generation(plan.pipeline_key.clone(), active_generation); + } + + /// Selects the active instances targeted by a per-pipeline shutdown request. 
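+    ///
+    /// Hypothetical caller sketch (group/pipeline ids and timeout are
+    /// illustrative): planning only selects targets and reserves the
+    /// operation id; nothing stops until the plan is spawned.
+    ///
+    /// ```ignore
+    /// let plan = runtime.prepare_shutdown_plan("g1", "p1", 30)?;
+    /// assert!(!plan.target_instances.is_empty());
+    /// let status = runtime.spawn_shutdown(plan)?; // returns the initial snapshot
+    /// ```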
+ pub(super) fn prepare_shutdown_plan( + &self, + pipeline_group_id: &str, + pipeline_id: &str, + timeout_secs: u64, + ) -> Result { + let pipeline_group_id: PipelineGroupId = pipeline_group_id.to_owned().into(); + let pipeline_id: PipelineId = pipeline_id.to_owned().into(); + let pipeline_key = PipelineKey::new(pipeline_group_id.clone(), pipeline_id.clone()); + + let target_instances = { + let state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + if !state.live_config.groups.contains_key(&pipeline_group_id) { + return Err(ControlPlaneError::GroupNotFound); + } + if !state.logical_pipelines.contains_key(&pipeline_key) { + return Err(ControlPlaneError::PipelineNotFound); + } + if state.active_rollouts.contains_key(&pipeline_key) + || state.active_shutdowns.contains_key(&pipeline_key) + { + return Err(ControlPlaneError::RolloutConflict); + } + + let targets: Vec<_> = state + .runtime_instances + .iter() + .filter_map(|(deployed_key, instance)| { + if deployed_key.pipeline_group_id == pipeline_group_id + && deployed_key.pipeline_id == pipeline_id + && matches!(instance.lifecycle, RuntimeInstanceLifecycle::Active) + { + Some(deployed_key.clone()) + } else { + None + } + }) + .collect(); + if targets.is_empty() { + return Err(ControlPlaneError::InvalidRequest { + message: format!( + "pipeline {}:{} is already stopped", + pipeline_group_id.as_ref(), + pipeline_id.as_ref() + ), + }); + } + targets + }; + + let shutdown_id = { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + if state.active_rollouts.contains_key(&pipeline_key) + || state.active_shutdowns.contains_key(&pipeline_key) + { + return Err(ControlPlaneError::RolloutConflict); + } + let shutdown_id = format!("shutdown-{}", state.next_shutdown_id); + state.next_shutdown_id += 1; + shutdown_id + }; + + let shutdown = ShutdownRecord::new( + shutdown_id, + pipeline_group_id, + pipeline_id, + target_instances + .iter() + .map(|instance| ShutdownCoreProgress { + core_id: instance.core_id, + deployment_generation: instance.deployment_generation, + state: "pending".to_owned(), + updated_at: timestamp_now(), + detail: None, + }) + .collect(), + ); + + Ok(CandidateShutdownPlan { + pipeline_key, + shutdown, + target_instances, + timeout_secs: timeout_secs.max(1), + }) + } + + /// Registers a newly accepted shutdown operation. + pub(super) fn insert_shutdown( + &self, + pipeline_key: &PipelineKey, + shutdown: ShutdownRecord, + ) -> Result<(), ControlPlaneError> { + self.prune_retained_operation_history(); + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + if state.active_rollouts.contains_key(pipeline_key) + || state.active_shutdowns.contains_key(pipeline_key) + { + return Err(ControlPlaneError::RolloutConflict); + } + _ = state + .active_shutdowns + .insert(pipeline_key.clone(), shutdown.shutdown_id.clone()); + _ = state + .shutdowns + .insert(shutdown.shutdown_id.clone(), shutdown); + Ok(()) + } + + /// Applies an in-place update to a shutdown record and prunes on completion. 
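+    ///
+    /// Sketch of the same closure pattern used for shutdown records (the
+    /// shutdown id is hypothetical):
+    ///
+    /// ```ignore
+    /// runtime.update_shutdown(&pipeline_key, "shutdown-2", |shutdown| {
+    ///     shutdown.state = ShutdownLifecycleState::Succeeded;
+    ///     shutdown.failure_reason = None;
+    /// });
+    /// // Terminal states also clear `active_shutdowns` and prune retained state.
+    /// ```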
+ pub(super) fn update_shutdown( + &self, + pipeline_key: &PipelineKey, + shutdown_id: &str, + update: F, + ) where + F: FnOnce(&mut ShutdownRecord), + { + let should_prune = { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + let Some(shutdown) = state.shutdowns.get_mut(shutdown_id) else { + return; + }; + update(shutdown); + shutdown.updated_at = timestamp_now(); + let is_terminal = shutdown.state.is_terminal(); + if is_terminal { + Self::record_terminal_shutdown_locked( + &mut state, + pipeline_key, + shutdown_id, + Instant::now(), + ); + let _ = state.active_shutdowns.remove(pipeline_key); + true + } else { + false + } + }; + + if should_prune { + self.prune_pipeline_runtime_and_history(pipeline_key); + } + } + + /// Returns the latest shutdown snapshot, evicting expired history first. + pub(super) fn shutdown_status_snapshot(&self, shutdown_id: &str) -> Option { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + Self::prune_terminal_operation_history_locked(&mut state, Instant::now()); + state.shutdowns.get(shutdown_id).map(ShutdownRecord::status) + } + + /// Returns committed pipeline details plus any active rollout summary. + pub(super) fn pipeline_details_snapshot( + &self, + pipeline_key: &PipelineKey, + ) -> Result, ControlPlaneError> { + let state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + let Some(record) = state.logical_pipelines.get(pipeline_key) else { + if !state + .live_config + .groups + .contains_key(pipeline_key.pipeline_group_id()) + { + return Err(ControlPlaneError::GroupNotFound); + } + return Ok(None); + }; + let rollout = state + .active_rollouts + .get(pipeline_key) + .and_then(|rollout_id| state.rollouts.get(rollout_id)) + .map(RolloutRecord::api_summary); + Ok(Some(PipelineDetails { + pipeline_group_id: pipeline_key.pipeline_group_id().clone(), + pipeline_id: pipeline_key.pipeline_id().clone(), + active_generation: Some(record.active_generation), + pipeline: record.resolved.pipeline.clone(), + rollout, + })) + } + + /// Records a rollout and launches its background execution worker. 
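+    ///
+    /// End-to-end sketch of the accept path used by the admin API (request
+    /// construction elided; error handling is illustrative):
+    ///
+    /// ```ignore
+    /// let plan = runtime.prepare_rollout_plan("g1", "p1", &request)?;
+    /// let initial = runtime.spawn_rollout(plan)?;
+    /// // Poll until the background worker reaches a terminal state.
+    /// let status = runtime.rollout_status_snapshot(&initial.rollout_id);
+    /// ```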
+ pub(super) fn spawn_rollout( + self: &Arc, + plan: CandidateRolloutPlan, + ) -> Result { + let rollout_id = plan.rollout.rollout_id.clone(); + let pipeline_key = plan.pipeline_key.clone(); + self.insert_rollout(&pipeline_key, plan.rollout.clone())?; + if matches!(plan.action, RolloutAction::NoOp) { + self.commit_pipeline_record(&plan, plan.target_generation); + self.update_rollout(&pipeline_key, &rollout_id, |rollout| { + rollout.state = RolloutLifecycleState::Succeeded; + rollout.failure_reason = None; + }); + self.finish_rollout(&pipeline_key, &rollout_id); + return self.rollout_status_snapshot(&rollout_id).ok_or_else(|| { + ControlPlaneError::Internal { + message: format!("rollout {rollout_id} disappeared before response"), + } + }); + } + + let initial_status = plan.rollout.status(); + let runtime = Arc::clone(self); + let rollout_runtime = Arc::clone(&runtime); + let rollout_cleanup_runtime = Arc::clone(&runtime); + let worker_pipeline_key = pipeline_key.clone(); + let worker_rollout_id = rollout_id.clone(); + let worker_thread_name = format!( + "rollout-{}-{}", + pipeline_key.pipeline_group_id().as_ref(), + pipeline_key.pipeline_id().as_ref() + ); + let _rollout_handle = thread::Builder::new() + .name(worker_thread_name.clone()) + .spawn(move || { + if let Err(panic) = + catch_unwind(AssertUnwindSafe(|| rollout_runtime.run_rollout(plan))) + { + rollout_cleanup_runtime.handle_rollout_worker_panic( + &worker_pipeline_key, + &worker_rollout_id, + worker_thread_name, + panic, + ); + } + }) + .map_err(|err| { + runtime.finish_rollout(&pipeline_key, &rollout_id); + ControlPlaneError::Internal { + message: err.to_string(), + } + })?; + Ok(initial_status) + } + + /// Records a shutdown and launches its background execution worker. + pub(super) fn spawn_shutdown( + self: &Arc, + plan: CandidateShutdownPlan, + ) -> Result { + let shutdown_id = plan.shutdown.shutdown_id.clone(); + let pipeline_key = plan.pipeline_key.clone(); + let initial_status = plan.shutdown.status(); + self.insert_shutdown(&pipeline_key, plan.shutdown.clone())?; + let runtime = Arc::clone(self); + let shutdown_runtime = Arc::clone(&runtime); + let shutdown_cleanup_runtime = Arc::clone(&runtime); + let worker_pipeline_key = pipeline_key.clone(); + let worker_shutdown_id = shutdown_id.clone(); + let worker_thread_name = format!( + "shutdown-{}-{}", + pipeline_key.pipeline_group_id().as_ref(), + pipeline_key.pipeline_id().as_ref() + ); + let _shutdown_handle = thread::Builder::new() + .name(worker_thread_name.clone()) + .spawn(move || { + if let Err(panic) = + catch_unwind(AssertUnwindSafe(|| shutdown_runtime.run_shutdown(plan))) + { + shutdown_cleanup_runtime.handle_shutdown_worker_panic( + &worker_pipeline_key, + &worker_shutdown_id, + worker_thread_name, + panic, + ); + } + }) + .map_err(|err| { + runtime.update_shutdown(&pipeline_key, &shutdown_id, |shutdown| { + shutdown.state = ShutdownLifecycleState::Failed; + shutdown.failure_reason = Some(err.to_string()); + }); + ControlPlaneError::Internal { + message: err.to_string(), + } + })?; + Ok(initial_status) + } +} diff --git a/rust/otap-dataflow/crates/controller/src/live_control/runtime.rs b/rust/otap-dataflow/crates/controller/src/live_control/runtime.rs new file mode 100644 index 0000000000..2b21eff1b5 --- /dev/null +++ b/rust/otap-dataflow/crates/controller/src/live_control/runtime.rs @@ -0,0 +1,497 @@ +// Copyright The OpenTelemetry Authors +// SPDX-License-Identifier: Apache-2.0 + +//! Runtime-instance launch, shutdown, and exit reporting. +//! +//! 
This module owns the boundary between controller state and actual pipeline +//! threads. It registers launched instances, reconciles early exits, sends +//! shutdown control messages, waits for readiness/exit transitions, and exposes +//! global runtime shutdown/error helpers used by controller teardown. + +use super::*; + +/// Formats a deployed instance compactly for aggregated operator errors. +fn deployed_instance_label(deployed_key: &DeployedPipelineKey) -> String { + format!( + "{}:{} core={} generation={}", + deployed_key.pipeline_group_id.as_ref(), + deployed_key.pipeline_id.as_ref(), + deployed_key.core_id, + deployed_key.deployment_generation + ) +} + +impl + ControllerRuntime +{ + /// Launches one regular pipeline instance on a specific core and generation. + pub(super) fn launch_regular_pipeline_instance( + self: &Arc, + resolved_pipeline: &ResolvedPipelineConfig, + core_id: usize, + deployment_generation: u64, + ) -> Result { + let thread_id = self.next_thread_id(); + let num_cores = self + .assigned_cores_for_resolved(resolved_pipeline) + .map_err(|err| Error::PipelineRuntimeError { + source: Box::new(io::Error::other(format!("{err:?}"))), + })? + .len(); + let deployed_key = DeployedPipelineKey { + pipeline_group_id: resolved_pipeline.pipeline_group_id.clone(), + pipeline_id: resolved_pipeline.pipeline_id.clone(), + core_id, + deployment_generation, + }; + let launched = Controller::::launch_pipeline_thread( + self.pipeline_factory, + deployed_key.clone(), + CoreId { id: core_id }, + num_cores, + resolved_pipeline.pipeline.clone(), + resolved_pipeline.policies.channel_capacity.clone(), + resolved_pipeline.policies.telemetry.clone(), + resolved_pipeline.policies.transport_headers.clone(), + self.controller_context.clone(), + self.metrics_reporter.clone(), + self.engine_event_reporter.clone(), + self.engine_tracing_setup.clone(), + self.telemetry_reporting_interval, + self.memory_pressure_tx.clone(), + &self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()) + .live_config, + &self.declared_topics, + Arc::downgrade(self), + thread_id, + None, + )?; + self.register_launched_instance(launched); + Ok(deployed_key) + } + + /// Registers a launched instance and reconciles the race where the thread exited first. + /// + /// The launch path inserts the instance as Active here, while the runtime thread reports its + /// terminal exit independently through note_instance_exit(). If that exit arrived first, it + /// was parked in pending_instance_exits and is applied immediately during registration. 
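+    ///
+    /// Both orderings converge on the same terminal record (sketch; the key
+    /// and launched handle are assumed to come from the launch path):
+    ///
+    /// ```ignore
+    /// // Exit report lands first and is parked:
+    /// runtime.note_instance_exit(key.clone(), RuntimeInstanceExit::Success);
+    /// // Registration then applies the parked exit immediately:
+    /// runtime.register_launched_instance(launched);
+    /// // The instance is observed as Exited(Success), never as stale Active.
+    /// ```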
+ pub(crate) fn register_launched_instance( + self: &Arc, + launched: LaunchedPipelineThread, + ) { + let should_compact = { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + _ = state.runtime_instances.insert( + launched.pipeline_key.clone(), + RuntimeInstanceRecord { + control_sender: Some(launched.control_sender.clone()), + lifecycle: RuntimeInstanceLifecycle::Active, + }, + ); + state.active_instances += 1; + let pending_exit = state.pending_instance_exits.remove(&launched.pipeline_key); + let should_compact = if let Some(exit) = pending_exit.as_ref() { + Self::apply_instance_exit_locked(&mut state, &launched.pipeline_key, exit) + } else { + false + }; + self.state_changed.notify_all(); + should_compact + }; + + if should_compact { + let logical_pipeline_key = PipelineKey::new( + launched.pipeline_key.pipeline_group_id.clone(), + launched.pipeline_key.pipeline_id.clone(), + ); + self.observed_state_store + .compact_pipeline_instances(&logical_pipeline_key); + } + } + + /// Records a pipeline instance exit and closes the registration-before/after-exit race. + /// + /// If the instance is already visible in runtime_instances, the exit is applied immediately. + /// Otherwise we store it in pending_instance_exits so register_launched_instance() can + /// reconcile it as soon as registration becomes visible. + pub(crate) fn note_instance_exit( + &self, + pipeline_key: DeployedPipelineKey, + exit: RuntimeInstanceExit, + ) { + match &exit { + RuntimeInstanceExit::Success => { + self.engine_event_reporter + .report(EngineEvent::drained(pipeline_key.clone(), None)); + } + RuntimeInstanceExit::Error(error) => { + self.engine_event_reporter + .report(EngineEvent::pipeline_runtime_error( + pipeline_key.clone(), + "Pipeline encountered a runtime error.", + error.error_summary(), + )); + } + } + + let should_compact = { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + if state.runtime_instances.contains_key(&pipeline_key) { + Self::apply_instance_exit_locked(&mut state, &pipeline_key, &exit) + } else { + _ = state + .pending_instance_exits + .insert(pipeline_key.clone(), exit.clone()); + false + } + }; + if should_compact { + let logical_pipeline_key = PipelineKey::new( + pipeline_key.pipeline_group_id.clone(), + pipeline_key.pipeline_id.clone(), + ); + self.observed_state_store + .compact_pipeline_instances(&logical_pipeline_key); + } + self.state_changed.notify_all(); + } + + /// Waits for a specific deployed instance to report admitted plus ready. 
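+    ///
+    /// Illustrative use inside a rollout step (deadline arithmetic assumed):
+    ///
+    /// ```ignore
+    /// let deadline = Instant::now() + Duration::from_secs(plan.step_timeout_secs);
+    /// runtime
+    ///     .wait_for_pipeline_ready(&deployed_key, deadline)
+    ///     .map_err(RolloutExecutionError::Failed)?;
+    /// ```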
+ pub(super) fn wait_for_pipeline_ready( + &self, + deployed_key: &DeployedPipelineKey, + deadline: Instant, + ) -> Result<(), String> { + let pipeline_key = PipelineKey::new( + deployed_key.pipeline_group_id.clone(), + deployed_key.pipeline_id.clone(), + ); + loop { + if let Some(status) = self.observed_state_handle.pipeline_status(&pipeline_key) { + if let Some(instance) = + status.instance_status(deployed_key.core_id, deployed_key.deployment_generation) + { + let accepted = instance.accepted_condition().status == ConditionStatus::True; + let ready = instance.ready_condition().status == ConditionStatus::True; + if accepted && ready { + return Ok(()); + } + match instance.phase() { + PipelinePhase::Failed(_) + | PipelinePhase::Rejected(_) + | PipelinePhase::Deleted + | PipelinePhase::Stopped => { + return Err(format!( + "pipeline failed to become ready on core {} (generation {})", + deployed_key.core_id, deployed_key.deployment_generation + )); + } + _ => {} + } + } + } + + if let Some(exit) = self.instance_exit(deployed_key) { + return match exit { + RuntimeInstanceExit::Success => Err(format!( + "pipeline exited before reporting ready on core {} (generation {})", + deployed_key.core_id, deployed_key.deployment_generation + )), + RuntimeInstanceExit::Error(error) => Err(error.message), + }; + } + + if Instant::now() >= deadline { + return Err(format!( + "timed out waiting for admitted+ready on core {} (generation {})", + deployed_key.core_id, deployed_key.deployment_generation + )); + } + thread::sleep(Duration::from_millis(50)); + } + } + + /// Returns the terminal exit result for one deployed instance, if any. + pub(super) fn instance_exit( + &self, + deployed_key: &DeployedPipelineKey, + ) -> Option { + let state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + state + .runtime_instances + .get(deployed_key) + .and_then(|instance| match &instance.lifecycle { + RuntimeInstanceLifecycle::Active => None, + RuntimeInstanceLifecycle::Exited(exit) => Some(exit.clone()), + }) + } + + /// Sends shutdown to one instance and releases the retained control sender. + pub(super) fn request_instance_shutdown( + &self, + deployed_key: &DeployedPipelineKey, + timeout_secs: u64, + reason: &str, + ) -> Result<(), String> { + let sender = { + let state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + let Some(instance) = state.runtime_instances.get(deployed_key) else { + return Err(format!( + "pipeline instance {}:{} core={} generation={} is not registered", + deployed_key.pipeline_group_id.as_ref(), + deployed_key.pipeline_id.as_ref(), + deployed_key.core_id, + deployed_key.deployment_generation + )); + }; + + match &instance.lifecycle { + RuntimeInstanceLifecycle::Exited(RuntimeInstanceExit::Success) => return Ok(()), + RuntimeInstanceLifecycle::Exited(RuntimeInstanceExit::Error(error)) => { + return Err(error.message.clone()); + } + RuntimeInstanceLifecycle::Active => {} + } + + instance.control_sender.clone().ok_or_else(|| { + format!( + "shutdown already requested for pipeline {}:{} core={} generation={}", + deployed_key.pipeline_group_id.as_ref(), + deployed_key.pipeline_id.as_ref(), + deployed_key.core_id, + deployed_key.deployment_generation + ) + })? 
+ }; + + if let Err(err) = sender.try_send_shutdown( + Instant::now() + Duration::from_secs(timeout_secs.max(1)), + reason.to_owned(), + ) { + return match self.instance_exit(deployed_key) { + Some(RuntimeInstanceExit::Success) => Ok(()), + Some(RuntimeInstanceExit::Error(error)) => Err(error.message), + None => Err(err.to_string()), + }; + } + self.release_instance_control_sender(deployed_key); + Ok(()) + } + + /// Waits until a specific deployed instance exits or the deadline expires. + pub(super) fn wait_for_instance_exit( + &self, + deployed_key: &DeployedPipelineKey, + deadline: Instant, + ) -> Result<(), String> { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + loop { + if let Some(instance) = state.runtime_instances.get(deployed_key) { + match &instance.lifecycle { + RuntimeInstanceLifecycle::Active => {} + RuntimeInstanceLifecycle::Exited(RuntimeInstanceExit::Success) => { + return Ok(()); + } + RuntimeInstanceLifecycle::Exited(RuntimeInstanceExit::Error(error)) => { + return Err(error.message.clone()); + } + } + } + + let Some(remaining) = deadline.checked_duration_since(Instant::now()) else { + return Err(format!( + "timed out waiting for pipeline {}:{} core={} generation={} to drain", + deployed_key.pipeline_group_id.as_ref(), + deployed_key.pipeline_id.as_ref(), + deployed_key.core_id, + deployed_key.deployment_generation + )); + }; + + // Runtime registration and exit reporting both publish through this + // mutex/condvar pair, so exit waits can sleep until real controller + // state changes instead of polling every 50ms. + let (next_state, _) = self + .state_changed + .wait_timeout(state, remaining) + .unwrap_or_else(|poisoned| poisoned.into_inner()); + state = next_state; + } + } + + /// Requests shutdown for one instance and waits until it exits. + pub(super) fn shutdown_instance( + &self, + deployed_key: &DeployedPipelineKey, + timeout_secs: u64, + reason: &str, + ) -> Result<(), String> { + self.request_instance_shutdown(deployed_key, timeout_secs, reason)?; + self.wait_for_instance_exit( + deployed_key, + Instant::now() + Duration::from_secs(timeout_secs.max(1)), + ) + } + + /// Drops the retained admin sender after shutdown has been accepted. + /// + /// The retained sender is the controller's "not yet signaled" marker for + /// an active instance. Releasing it makes shutdown dispatch idempotent for + /// that instance and lets the pipeline control loop observe channel closure + /// once node tasks have exited. + pub(super) fn release_instance_control_sender(&self, deployed_key: &DeployedPipelineKey) { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + if let Some(instance) = state.runtime_instances.get_mut(deployed_key) { + instance.control_sender = None; + } + } + + /// Broadcasts shutdown to every currently active runtime instance. + /// + /// This is best-effort across the snapshot: one failed send must not prevent + /// later instances from receiving shutdown. It is also idempotent at the + /// dispatch boundary: instances that already accepted shutdown have released + /// their retained control sender and are skipped by later calls. + pub(super) fn request_shutdown_all(&self, timeout_secs: u64) -> Result<(), ControlPlaneError> { + // Snapshot under the state lock, then send outside the lock so runtime + // callbacks can report exits while shutdown dispatch is in progress. 
+ // Only active instances with a retained sender are eligible; a missing + // sender means shutdown was already accepted by a previous request. + let mut senders: Vec<_> = { + let state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + state + .runtime_instances + .iter() + .filter_map(|(deployed_key, instance)| match instance.lifecycle { + RuntimeInstanceLifecycle::Active => instance + .control_sender + .as_ref() + .map(|sender| (deployed_key.clone(), sender.clone())), + RuntimeInstanceLifecycle::Exited(_) => None, + }) + .collect() + }; + // Stabilize both test assertions and the aggregated error message. + senders.sort_by_key(|(deployed_key, _)| { + ( + deployed_key.pipeline_group_id.as_ref().to_owned(), + deployed_key.pipeline_id.as_ref().to_owned(), + deployed_key.core_id, + deployed_key.deployment_generation, + ) + }); + + let mut failures = Vec::new(); + for (deployed_key, sender) in senders { + if let Err(err) = sender.try_send_shutdown( + Instant::now() + Duration::from_secs(timeout_secs.max(1)), + "global shutdown".to_owned(), + ) { + // A failed send can race with the runtime thread exiting after + // the snapshot was taken. Treat clean exit as success, report a + // terminal runtime error if one was recorded, and otherwise keep + // the retained sender so a later shutdown-all can retry it. + match self.instance_exit(&deployed_key) { + Some(RuntimeInstanceExit::Success) => { + self.release_instance_control_sender(&deployed_key); + } + Some(RuntimeInstanceExit::Error(error)) => { + failures.push(format!( + "{}: {}", + deployed_instance_label(&deployed_key), + error.message + )); + } + None => { + failures.push(format!( + "{}: {}", + deployed_instance_label(&deployed_key), + err + )); + } + } + } else { + // After a successful send, the controller should not send + // another shutdown message to this same active instance. + self.release_instance_control_sender(&deployed_key); + } + } + + if failures.is_empty() { + Ok(()) + } else { + // Report all failures together after every eligible instance has + // been attempted, preserving best-effort shutdown semantics. + Err(ControlPlaneError::Internal { + message: format!( + "failed to send global shutdown to {} runtime instance(s): {}", + failures.len(), + failures.join("; ") + ), + }) + } + } + + /// Starts a tracked shutdown operation for one logical pipeline. + pub(super) fn request_shutdown_pipeline( + self: &Arc, + pipeline_group_id: &str, + pipeline_id: &str, + timeout_secs: u64, + ) -> Result { + let plan = self.prepare_shutdown_plan(pipeline_group_id, pipeline_id, timeout_secs)?; + self.spawn_shutdown(plan) + } + + /// Blocks until all active runtime instances have exited. + pub(crate) fn wait_until_all_instances_exit(&self) { + let mut state = self + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + while state.active_instances > 0 { + state = self + .state_changed + .wait(state) + .unwrap_or_else(|poisoned| poisoned.into_inner()); + } + } + + /// Returns the first runtime error observed by any watched pipeline thread. 
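+    ///
+    /// Teardown sketch: drain everything, then surface the first recorded
+    /// failure, if any (the caller's error plumbing is assumed):
+    ///
+    /// ```ignore
+    /// runtime.request_shutdown_all(30)?;
+    /// runtime.wait_until_all_instances_exit();
+    /// if let Some(error) = runtime.take_runtime_error() {
+    ///     return Err(error);
+    /// }
+    /// ```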
+    pub(crate) fn take_runtime_error(&self) -> Option<Error> {
+        let state = self
+            .state
+            .lock()
+            .unwrap_or_else(|poisoned| poisoned.into_inner());
+        state
+            .first_error
+            .as_ref()
+            .map(|message| Error::PipelineRuntimeError {
+                source: Box::new(io::Error::other(message.clone())),
+            })
+    }
+}
diff --git a/rust/otap-dataflow/crates/controller/src/live_control/state.rs b/rust/otap-dataflow/crates/controller/src/live_control/state.rs
new file mode 100644
index 0000000000..bd71c231f9
--- /dev/null
+++ b/rust/otap-dataflow/crates/controller/src/live_control/state.rs
@@ -0,0 +1,569 @@
+// Copyright The OpenTelemetry Authors
+// SPDX-License-Identifier: Apache-2.0
+
+//! Shared state types for the live-control runtime.
+//!
+//! This module intentionally contains mostly data and small conversion helpers.
+//! Planning, execution, and runtime-instance management all mutate these
+//! records through `ControllerRuntime` while holding the runtime mutex.
+
+use super::*;
+
+/// Maximum terminal rollout records retained per logical pipeline.
+pub(super) const TERMINAL_ROLLOUT_RETENTION_LIMIT: usize = 32;
+/// Maximum terminal shutdown records retained per logical pipeline.
+pub(super) const TERMINAL_SHUTDOWN_RETENTION_LIMIT: usize = 32;
+/// Maximum age for terminal rollout/shutdown records kept in memory.
+pub(super) const TERMINAL_OPERATION_RETENTION_TTL: Duration = Duration::from_secs(24 * 60 * 60);
+
+fn panic_payload_message(payload: &(dyn Any + Send)) -> String {
+    if let Some(message) = payload.downcast_ref::<&str>() {
+        (*message).to_owned()
+    } else if let Some(message) = payload.downcast_ref::<String>() {
+        message.clone()
+    } else {
+        "non-string panic payload".to_owned()
+    }
+}
+
+#[derive(Debug, Clone)]
+/// Structured panic capture with public and diagnostic renderings.
+///
+/// Rollout/shutdown workers and runtime thread watchers use this type to keep
+/// operator-visible failure messages concise while preserving thread context
+/// and a forced backtrace for internal telemetry.
+pub(crate) struct PanicReport {
+    pub(super) kind: &'static str,
+    pub(super) payload_message: String,
+    pub(super) thread_name: Option<String>,
+    pub(super) thread_id: Option<usize>,
+    pub(super) core_id: Option<usize>,
+    pub(super) backtrace: String,
+}
+
+impl PanicReport {
+    /// Captures a panic payload plus best-effort worker/thread context.
+    pub(crate) fn capture(
+        kind: &'static str,
+        panic: Box<dyn Any + Send>,
+        thread_name: Option<String>,
+        thread_id: Option<usize>,
+        core_id: Option<usize>,
+    ) -> Self {
+        Self {
+            kind,
+            payload_message: panic_payload_message(&*panic),
+            thread_name,
+            thread_id,
+            core_id,
+            backtrace: Backtrace::force_capture().to_string(),
+        }
+    }
+
+    /// Returns the short message stored in public rollout/shutdown status.
+    pub(super) fn summary_message(&self) -> String {
+        format!("{} panicked: {}", self.kind, self.payload_message)
+    }
+
+    /// Returns the diagnostic message used as internal error source detail.
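+    ///
+    /// Sketch of the rendered shape (kind and values are illustrative):
+    ///
+    /// ```text
+    /// rollout worker panicked: boom
+    /// context: thread_name=rollout-g1-p1, core_id=3
+    /// backtrace:
+    /// <forced capture>
+    /// ```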
+    pub(super) fn detail_message(&self) -> String {
+        let mut context = Vec::new();
+        if let Some(thread_name) = &self.thread_name {
+            context.push(format!("thread_name={thread_name}"));
+        }
+        if let Some(thread_id) = self.thread_id {
+            context.push(format!("thread_id={thread_id}"));
+        }
+        if let Some(core_id) = self.core_id {
+            context.push(format!("core_id={core_id}"));
+        }
+
+        let mut detail = self.summary_message();
+        if !context.is_empty() {
+            detail.push_str("\ncontext: ");
+            detail.push_str(&context.join(", "));
+        }
+        detail.push_str("\nbacktrace:\n");
+        detail.push_str(&self.backtrace);
+        detail
+    }
+
+    /// Converts the panic report into the observed-state error payload.
+    pub(super) fn error_summary(&self) -> ErrorSummary {
+        ErrorSummary::Pipeline {
+            error_kind: "panic".into(),
+            message: self.summary_message(),
+            source: Some(self.detail_message()),
+        }
+    }
+}
+
+#[derive(Debug, Clone)]
+/// Error recorded when a deployed runtime instance exits unsuccessfully.
+pub(crate) struct RuntimeInstanceError {
+    pub(super) error_kind: String,
+    pub(super) message: String,
+    pub(super) detail: Option<String>,
+}
+
+impl RuntimeInstanceError {
+    /// Builds a plain runtime error without panic diagnostics.
+    pub(crate) fn runtime(message: String) -> Self {
+        Self {
+            error_kind: "runtime".into(),
+            message,
+            detail: None,
+        }
+    }
+
+    /// Builds a runtime error from structured panic diagnostics.
+    pub(crate) fn from_panic(report: PanicReport) -> Self {
+        Self {
+            error_kind: "panic".into(),
+            message: report.summary_message(),
+            detail: Some(report.detail_message()),
+        }
+    }
+
+    /// Converts the runtime error into the observed-state error payload.
+    pub(super) fn error_summary(&self) -> ErrorSummary {
+        ErrorSummary::Pipeline {
+            error_kind: self.error_kind.clone(),
+            message: self.message.clone(),
+            source: self.detail.clone(),
+        }
+    }
+}
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+/// Execution strategy selected for a rollout request.
+pub(super) enum RolloutAction {
+    Create,
+    NoOp,
+    Replace,
+    Resize,
+}
+
+impl RolloutAction {
+    const fn as_str(self) -> &'static str {
+        match self {
+            Self::Create => "create",
+            Self::NoOp => "noop",
+            Self::Replace => "replace",
+            Self::Resize => "resize",
+        }
+    }
+}
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+/// Internal lifecycle for one rollout operation.
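+///
+/// Intended progression, as driven by the rollout worker (a sketch inferred
+/// from the terminal-state set; this type does not itself enforce
+/// transitions):
+///
+/// ```text
+/// Pending -> Running -> Succeeded
+///                    -> Failed
+///                    -> RollingBack -> Failed (previous shape restored)
+///                                   -> RollbackFailed
+/// ```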
+pub(super) enum RolloutLifecycleState {
+    Pending,
+    Running,
+    Succeeded,
+    Failed,
+    RollingBack,
+    RollbackFailed,
+}
+
+impl RolloutLifecycleState {
+    const fn as_pipeline_rollout_state(self) -> PipelineRolloutState {
+        match self {
+            Self::Pending => PipelineRolloutState::Pending,
+            Self::Running => PipelineRolloutState::Running,
+            Self::Succeeded => PipelineRolloutState::Succeeded,
+            Self::Failed => PipelineRolloutState::Failed,
+            Self::RollingBack => PipelineRolloutState::RollingBack,
+            Self::RollbackFailed => PipelineRolloutState::RollbackFailed,
+        }
+    }
+
+    const fn as_api_pipeline_rollout_state(self) -> ApiPipelineRolloutState {
+        match self {
+            Self::Pending => ApiPipelineRolloutState::Pending,
+            Self::Running => ApiPipelineRolloutState::Running,
+            Self::Succeeded => ApiPipelineRolloutState::Succeeded,
+            Self::Failed => ApiPipelineRolloutState::Failed,
+            Self::RollingBack => ApiPipelineRolloutState::RollingBack,
+            Self::RollbackFailed => ApiPipelineRolloutState::RollbackFailed,
+        }
+    }
+
+    pub(super) const fn is_terminal(self) -> bool {
+        matches!(self, Self::Succeeded | Self::Failed | Self::RollbackFailed)
+    }
+}
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+/// Internal lifecycle for one pipeline shutdown operation.
+pub(super) enum ShutdownLifecycleState {
+    Pending,
+    Running,
+    Succeeded,
+    Failed,
+}
+
+impl ShutdownLifecycleState {
+    const fn as_str(self) -> &'static str {
+        match self {
+            Self::Pending => "pending",
+            Self::Running => "running",
+            Self::Succeeded => "succeeded",
+            Self::Failed => "failed",
+        }
+    }
+
+    pub(super) const fn is_terminal(self) -> bool {
+        matches!(self, Self::Succeeded | Self::Failed)
+    }
+}
+
+#[derive(Debug, Clone)]
+/// Per-core progress row within a rollout operation.
+pub(super) struct RolloutCoreProgress {
+    pub(super) core_id: usize,
+    pub(super) previous_generation: Option<u64>,
+    pub(super) target_generation: u64,
+    pub(super) state: String,
+    pub(super) updated_at: String,
+    pub(super) detail: Option<String>,
+}
+
+#[derive(Debug, Clone)]
+/// In-memory rollout record retained for active and recent terminal lookups.
+pub(super) struct RolloutRecord {
+    pub(super) rollout_id: String,
+    pub(super) pipeline_group_id: PipelineGroupId,
+    pub(super) pipeline_id: PipelineId,
+    pub(super) action: RolloutAction,
+    pub(super) state: RolloutLifecycleState,
+    pub(super) target_generation: u64,
+    pub(super) previous_generation: Option<u64>,
+    /// Drain timeout requested with the rollout, reused for panic cleanup.
+    pub(super) drain_timeout_secs: u64,
+    pub(super) started_at: String,
+    pub(super) updated_at: String,
+    pub(super) failure_reason: Option<String>,
+    pub(super) cores: Vec<RolloutCoreProgress>,
+    pub(super) completed_at: Option<Instant>,
+}
+
+impl RolloutRecord {
+    /// Creates the initial in-memory record for a rollout operation.
+    pub(super) fn new(
+        rollout_id: String,
+        pipeline_group_id: PipelineGroupId,
+        pipeline_id: PipelineId,
+        action: RolloutAction,
+        target_generation: u64,
+        previous_generation: Option<u64>,
+        drain_timeout_secs: u64,
+        cores: Vec<RolloutCoreProgress>,
+    ) -> Self {
+        let now = timestamp_now();
+        Self {
+            rollout_id,
+            pipeline_group_id,
+            pipeline_id,
+            action,
+            state: RolloutLifecycleState::Pending,
+            target_generation,
+            previous_generation,
+            drain_timeout_secs,
+            started_at: now.clone(),
+            updated_at: now,
+            failure_reason: None,
+            cores,
+            completed_at: None,
+        }
+    }
+
+    /// Builds the compact rollout summary exposed through observed state.
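+    ///
+    /// The summary deliberately omits per-core rows; callers needing
+    /// core-level detail use `RolloutRecord::status`. Sketch:
+    ///
+    /// ```ignore
+    /// let summary = rollout.summary();
+    /// assert_eq!(summary.rollout_id, rollout.rollout_id);
+    /// assert_eq!(summary.target_generation, rollout.target_generation);
+    /// ```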
+    pub(super) fn summary(&self) -> PipelineRolloutSummary {
+        PipelineRolloutSummary {
+            rollout_id: self.rollout_id.clone(),
+            state: self.state.as_pipeline_rollout_state(),
+            target_generation: self.target_generation,
+            started_at: self.started_at.clone(),
+            updated_at: self.updated_at.clone(),
+            failure_reason: self.failure_reason.clone(),
+        }
+    }
+
+    /// Builds the admin-facing rollout summary embedded in pipeline details.
+    pub(super) fn api_summary(&self) -> ApiPipelineRolloutSummary {
+        ApiPipelineRolloutSummary {
+            rollout_id: self.rollout_id.clone(),
+            state: self.state.as_api_pipeline_rollout_state(),
+            target_generation: self.target_generation,
+            started_at: self.started_at.clone(),
+            updated_at: self.updated_at.clone(),
+            failure_reason: self.failure_reason.clone(),
+        }
+    }
+
+    /// Materializes the full rollout status returned by the control plane.
+    pub(super) fn status(&self) -> RolloutStatus {
+        RolloutStatus {
+            rollout_id: self.rollout_id.clone(),
+            pipeline_group_id: self.pipeline_group_id.clone(),
+            pipeline_id: self.pipeline_id.clone(),
+            action: self.action.as_str().to_owned(),
+            state: self.state.as_api_pipeline_rollout_state(),
+            target_generation: self.target_generation,
+            previous_generation: self.previous_generation,
+            started_at: self.started_at.clone(),
+            updated_at: self.updated_at.clone(),
+            failure_reason: self.failure_reason.clone(),
+            cores: self
+                .cores
+                .iter()
+                .map(|core| RolloutCoreStatus {
+                    core_id: core.core_id,
+                    previous_generation: core.previous_generation,
+                    target_generation: core.target_generation,
+                    state: core.state.clone(),
+                    updated_at: core.updated_at.clone(),
+                    detail: core.detail.clone(),
+                })
+                .collect(),
+        }
+    }
+}
+
+#[derive(Debug, Clone)]
+/// Per-instance progress row within a shutdown operation.
+pub(super) struct ShutdownCoreProgress {
+    pub(super) core_id: usize,
+    pub(super) deployment_generation: u64,
+    pub(super) state: String,
+    pub(super) updated_at: String,
+    pub(super) detail: Option<String>,
+}
+
+#[derive(Debug, Clone)]
+/// In-memory shutdown record retained for active and recent terminal lookups.
+pub(super) struct ShutdownRecord {
+    pub(super) shutdown_id: String,
+    pub(super) pipeline_group_id: PipelineGroupId,
+    pub(super) pipeline_id: PipelineId,
+    pub(super) state: ShutdownLifecycleState,
+    pub(super) started_at: String,
+    pub(super) updated_at: String,
+    pub(super) failure_reason: Option<String>,
+    pub(super) cores: Vec<ShutdownCoreProgress>,
+    pub(super) completed_at: Option<Instant>,
+}
+
+impl ShutdownRecord {
+    /// Creates the initial in-memory record for a pipeline shutdown operation.
+    pub(super) fn new(
+        shutdown_id: String,
+        pipeline_group_id: PipelineGroupId,
+        pipeline_id: PipelineId,
+        cores: Vec<ShutdownCoreProgress>,
+    ) -> Self {
+        let now = timestamp_now();
+        Self {
+            shutdown_id,
+            pipeline_group_id,
+            pipeline_id,
+            state: ShutdownLifecycleState::Pending,
+            started_at: now.clone(),
+            updated_at: now,
+            failure_reason: None,
+            cores,
+            completed_at: None,
+        }
+    }
+
+    /// Materializes the full shutdown status returned by the control plane.
+    pub(super) fn status(&self) -> ShutdownStatus {
+        ShutdownStatus {
+            shutdown_id: self.shutdown_id.clone(),
+            pipeline_group_id: self.pipeline_group_id.clone(),
+            pipeline_id: self.pipeline_id.clone(),
+            state: self.state.as_str().to_owned(),
+            started_at: self.started_at.clone(),
+            updated_at: self.updated_at.clone(),
+            failure_reason: self.failure_reason.clone(),
+            cores: self
+                .cores
+                .iter()
+                .map(|core| ShutdownCoreStatus {
+                    core_id: core.core_id,
+                    deployment_generation: core.deployment_generation,
+                    state: core.state.clone(),
+                    updated_at: core.updated_at.clone(),
+                    detail: core.detail.clone(),
+                })
+                .collect(),
+        }
+    }
+}
+
+/// Controller-owned record for one deployed runtime instance.
+pub(super) struct RuntimeInstanceRecord {
+    // The controller drops this sender once shutdown is requested so the
+    // pipeline control loop can observe channel closure after node tasks exit.
+    pub(super) control_sender: Option<Arc<dyn PipelineAdminSender>>,
+    pub(super) lifecycle: RuntimeInstanceLifecycle,
+}
+
+#[derive(Debug, Clone)]
+/// Runtime-instance liveness as understood by the controller.
+pub(super) enum RuntimeInstanceLifecycle {
+    /// The pipeline thread is still expected to be running.
+    Active,
+    /// The pipeline thread reported a terminal exit.
+    Exited(RuntimeInstanceExit),
+}
+
+#[derive(Debug, Clone)]
+/// Terminal result reported by a deployed pipeline runtime thread.
+pub(crate) enum RuntimeInstanceExit {
+    /// The runtime exited normally after drain/shutdown.
+    Success,
+    /// The runtime exited due to a pipeline error or panic.
+    Error(RuntimeInstanceError),
+}
+
+#[derive(Debug, Clone)]
+/// Committed logical pipeline config plus the active deployment generation.
+pub(super) struct LogicalPipelineRecord {
+    pub(super) resolved: ResolvedPipelineConfig,
+    pub(super) active_generation: u64,
+}
+
+#[derive(Debug, Clone, PartialEq, Eq)]
+/// Topic runtime properties that cannot be mutated by live rollout.
+pub(super) struct TopicRuntimeProfile {
+    pub(super) backend: TopicBackendKind,
+    pub(super) policies: otap_df_config::topic::TopicPolicies,
+    pub(super) selected_mode: InferredTopicMode,
+}
+
+/// Complete mutable state protected by `ControllerRuntime::state`.
+///
+/// Keep this type as plain data: methods that enforce lifecycle invariants
+/// should live on `ControllerRuntime` so mutations can update observed state
+/// and wake condition variables consistently.
+pub(super) struct ControllerRuntimeState {
+    /// Latest accepted full engine config, including committed live changes.
+    pub(super) live_config: OtelDataflowSpec,
+    /// Committed logical pipelines keyed by group/pipeline id.
+    pub(super) logical_pipelines: HashMap<PipelineKey, LogicalPipelineRecord>,
+    /// Deployed runtime instances keyed by group/pipeline/core/generation.
+    pub(super) runtime_instances: HashMap<DeployedPipelineKey, RuntimeInstanceRecord>,
+    // A pipeline thread can finish before register_launched_instance() publishes it as Active.
+    // We park that exit here and reconcile it during registration instead of leaving stale
+    // liveness behind.
+    pub(super) pending_instance_exits: HashMap<DeployedPipelineKey, RuntimeInstanceExit>,
+    /// Rollout snapshots retained for active and recent terminal lookups.
+    pub(super) rollouts: HashMap<String, RolloutRecord>,
+    /// Active rollout id per logical pipeline; presence causes operation conflict.
+    pub(super) active_rollouts: HashMap<PipelineKey, String>,
+    /// FIFO terminal rollout ids per logical pipeline for cap/TTL eviction.
+    pub(super) terminal_rollouts: HashMap<PipelineKey, VecDeque<String>>,
+    /// Shutdown snapshots retained for active and recent terminal lookups.
+    pub(super) shutdowns: HashMap<String, ShutdownRecord>,
+    /// Active shutdown id per logical pipeline; presence causes operation conflict.
+    pub(super) active_shutdowns: HashMap<PipelineKey, String>,
+    /// FIFO terminal shutdown ids per logical pipeline for cap/TTL eviction.
+    pub(super) terminal_shutdowns: HashMap<PipelineKey, VecDeque<String>>,
+    /// Next deployment generation to assign for each logical pipeline.
+    pub(super) generation_counters: HashMap<PipelineKey, u64>,
+    /// Count of runtime instances still considered active by the controller.
+    pub(super) active_instances: usize,
+    /// Monotonic rollout id suffix.
+    pub(super) next_rollout_id: u64,
+    /// Monotonic shutdown id suffix.
+    pub(super) next_shutdown_id: u64,
+    /// Monotonic logical runtime-thread id used for diagnostics.
+    pub(super) next_thread_id: usize,
+    /// First runtime failure surfaced to global controller shutdown handling.
+    pub(super) first_error: Option<String>,
+}
+
+#[derive(Debug)]
+/// Fully validated rollout plan ready for background execution.
+///
+/// The planner precomputes generation ids, target core sets, resize deltas,
+/// operation records, and timeouts so the worker can execute without
+/// reinterpreting the admin request.
+pub(super) struct CandidateRolloutPlan {
+    /// Logical pipeline targeted by the rollout.
+    pub(super) pipeline_key: PipelineKey,
+    pub(super) pipeline_group_id: PipelineGroupId,
+    pub(super) pipeline_id: PipelineId,
+    /// Execution strategy selected by request classification.
+    pub(super) action: RolloutAction,
+    /// Resolved target pipeline config after applying the request.
+    pub(super) resolved_pipeline: ResolvedPipelineConfig,
+    /// Current committed record, absent for create rollouts.
+    pub(super) current_record: Option<LogicalPipelineRecord>,
+    /// Core allocation from the committed record.
+    pub(super) current_assigned_cores: Vec<usize>,
+    /// Core allocation requested by the candidate config.
+    pub(super) target_assigned_cores: Vec<usize>,
+    /// Cores present in both current and target assignments.
+    pub(super) common_assigned_cores: Vec<usize>,
+    /// Cores present only in the target assignment.
+    pub(super) added_assigned_cores: Vec<usize>,
+    /// Cores present only in the current assignment.
+    pub(super) removed_assigned_cores: Vec<usize>,
+    /// Cores to launch for resize-only rollouts.
+    pub(super) resize_start_cores: Vec<usize>,
+    /// Cores to drain for resize-only rollouts.
+    pub(super) resize_stop_cores: Vec<usize>,
+    /// Deployment generation assigned to the target runtime instances.
+    pub(super) target_generation: u64,
+    /// Initial rollout status record to insert before spawning a worker.
+    pub(super) rollout: RolloutRecord,
+    /// Per-step readiness timeout in seconds.
+    pub(super) step_timeout_secs: u64,
+    /// Drain timeout in seconds for old instances.
+    pub(super) drain_timeout_secs: u64,
+}
+
+#[derive(Debug)]
+/// Fully validated shutdown plan ready for background execution.
+pub(super) struct CandidateShutdownPlan {
+    /// Logical pipeline targeted by the shutdown.
+    pub(super) pipeline_key: PipelineKey,
+    /// Initial shutdown status record to insert before spawning a worker.
+    pub(super) shutdown: ShutdownRecord,
+    /// Active deployed instances that must exit for shutdown success.
+    pub(super) target_instances: Vec<DeployedPipelineKey>,
+    /// Per-instance shutdown timeout in seconds.
+    pub(super) timeout_secs: u64,
+}
+
+/// Snapshot of active cores for the current committed generation.
+pub(super) struct ActiveRuntimeCoreState {
+    /// Active cores still running the committed generation.
+    pub(super) current_generation_cores: Vec<usize>,
+    /// Whether another active generation exists for the same logical pipeline.
+ pub(super) has_foreign_active_generations: bool, +} + +/// Returns a fresh RFC3339 timestamp for externally visible status updates. +pub(super) fn timestamp_now() -> String { + Utc::now().to_rfc3339() +} + +/// Returns whether a terminal operation snapshot has exceeded retention TTL. +pub(super) fn is_expired(completed_at: Option, now: Instant) -> bool { + completed_at + .and_then(|completed_at| now.checked_duration_since(completed_at)) + .is_some_and(|age| age >= TERMINAL_OPERATION_RETENTION_TTL) +} + +#[derive(Debug)] +/// Rollout worker failure category used to distinguish rollback failures. +pub(super) enum RolloutExecutionError { + /// The rollout failed before or outside rollback handling. + Failed(String), + /// Rollback was attempted but did not restore the previous runtime shape. + RollbackFailed(String), +} diff --git a/rust/otap-dataflow/crates/controller/src/live_control_tests.rs b/rust/otap-dataflow/crates/controller/src/live_control_tests.rs new file mode 100644 index 0000000000..4955e9b794 --- /dev/null +++ b/rust/otap-dataflow/crates/controller/src/live_control_tests.rs @@ -0,0 +1,2707 @@ +// Copyright The OpenTelemetry Authors +// SPDX-License-Identifier: Apache-2.0 + +use super::*; +use otap_df_config::engine::ResolvedPipelineRole; +use otap_df_config::observed_state::ObservedStateSettings; +use otap_df_config::settings::telemetry::logs::LogLevel; +use otap_df_engine::ExporterFactory; +use otap_df_engine::ReceiverFactory; +use otap_df_engine::config::{ExporterConfig, ReceiverConfig}; +use otap_df_engine::control::{ + RuntimeControlMsg, RuntimeCtrlMsgReceiver, runtime_ctrl_msg_channel, +}; +use otap_df_engine::error::Error as EngineError; +use otap_df_engine::exporter::ExporterWrapper; +use otap_df_engine::receiver::ReceiverWrapper; +use otap_df_engine::wiring_contract::WiringContract; +use otap_df_state::pipeline_status::PipelineStatus; +use otap_df_telemetry::TracingSetup; +use otap_df_telemetry::event::EngineEvent; +use otap_df_telemetry::tracing_init::ProviderSetup; +use tokio_util::sync::CancellationToken; + +fn available_core_ids() -> Vec { + vec![ + CoreId { id: 0 }, + CoreId { id: 1 }, + CoreId { id: 2 }, + CoreId { id: 3 }, + CoreId { id: 4 }, + CoreId { id: 5 }, + CoreId { id: 6 }, + CoreId { id: 7 }, + ] +} + +fn test_validate_config(_config: &serde_json::Value) -> Result<(), otap_df_config::error::Error> { + Ok(()) +} + +fn test_receiver_create( + _pipeline_ctx: PipelineContext, + _node: otap_df_engine::node::NodeId, + _node_config: Arc, + _receiver_config: &ReceiverConfig, +) -> Result, otap_df_config::error::Error> { + panic!("test receiver factory should not be constructed") +} + +fn test_exporter_create( + _pipeline_ctx: PipelineContext, + _node: otap_df_engine::node::NodeId, + _node_config: Arc, + _exporter_config: &ExporterConfig, +) -> Result, otap_df_config::error::Error> { + panic!("test exporter factory should not be constructed") +} + +static TEST_RECEIVER_FACTORIES: &[ReceiverFactory<()>] = &[ + ReceiverFactory { + name: "urn:test:receiver:example", + create: test_receiver_create, + wiring_contract: WiringContract::UNRESTRICTED, + validate_config: test_validate_config, + }, + ReceiverFactory { + name: "urn:otel:receiver:topic", + create: test_receiver_create, + wiring_contract: WiringContract::UNRESTRICTED, + validate_config: test_validate_config, + }, +]; + +static TEST_EXPORTER_FACTORIES: &[ExporterFactory<()>] = &[ + ExporterFactory { + name: "urn:test:exporter:example", + create: test_exporter_create, + wiring_contract: 
WiringContract::UNRESTRICTED, + validate_config: test_validate_config, + }, + ExporterFactory { + name: "urn:otel:exporter:topic", + create: test_exporter_create, + wiring_contract: WiringContract::UNRESTRICTED, + validate_config: test_validate_config, + }, +]; + +static TEST_PIPELINE_FACTORY: PipelineFactory<()> = + PipelineFactory::new(TEST_RECEIVER_FACTORIES, &[], TEST_EXPORTER_FACTORIES, &[]); + +fn test_runtime(config: &OtelDataflowSpec) -> Arc> { + let registry = TelemetryRegistryHandle::new(); + let observed_state_store = + ObservedStateStore::new(&ObservedStateSettings::default(), registry.clone()); + let observed_state_handle = observed_state_store.handle(); + let engine_event_reporter = observed_state_store.reporter(Default::default()); + let (_metrics_rx, metrics_reporter) = MetricsReporter::create_new_and_receiver(8); + let declared_topics = + Controller::<()>::declare_topics(config).expect("declared topics should be valid"); + let (memory_pressure_tx, _memory_pressure_rx) = + tokio::sync::watch::channel(MemoryPressureChanged::initial()); + + Arc::new(ControllerRuntime::new( + &TEST_PIPELINE_FACTORY, + ControllerContext::new(registry), + observed_state_store, + observed_state_handle, + engine_event_reporter, + metrics_reporter, + declared_topics, + available_core_ids(), + TracingSetup::new(ProviderSetup::Noop, LogLevel::default(), engine_context), + Duration::from_secs(1), + memory_pressure_tx, + config.clone(), + )) +} + +struct ObservedStateRunner { + cancel: CancellationToken, + join: Option>, +} + +impl ObservedStateRunner { + fn start(runtime: &ControllerRuntime<()>) -> Self { + let cancel = CancellationToken::new(); + let store = runtime.observed_state_store.clone(); + let cancel_clone = cancel.clone(); + let join = thread::spawn(move || { + let runtime = tokio::runtime::Builder::new_current_thread() + .enable_all() + .build() + .expect("observed-state test runtime should build"); + runtime + .block_on(store.run(cancel_clone)) + .expect("observed-state consumer should exit cleanly"); + }); + Self { + cancel, + join: Some(join), + } + } +} + +impl Drop for ObservedStateRunner { + fn drop(&mut self) { + self.cancel.cancel(); + if let Some(join) = self.join.take() { + join.join() + .expect("observed-state consumer thread should join cleanly"); + } + } +} + +fn deployed_key( + pipeline_group_id: &str, + pipeline_id: &str, + core_id: usize, + generation: u64, +) -> DeployedPipelineKey { + DeployedPipelineKey { + pipeline_group_id: pipeline_group_id.to_owned().into(), + pipeline_id: pipeline_id.to_owned().into(), + core_id, + deployment_generation: generation, + } +} + +fn report_ready(runtime: &ControllerRuntime<()>, key: DeployedPipelineKey) { + runtime + .engine_event_reporter + .report(EngineEvent::admitted(key.clone(), None)); + runtime + .engine_event_reporter + .report(EngineEvent::ready(key, None)); +} + +fn report_stopped(runtime: &ControllerRuntime<()>, key: DeployedPipelineKey) { + runtime + .engine_event_reporter + .report(EngineEvent::admitted(key.clone(), None)); + runtime + .engine_event_reporter + .report(EngineEvent::ready(key.clone(), None)); + runtime + .engine_event_reporter + .report(EngineEvent::shutdown_requested(key.clone(), None)); + runtime + .engine_event_reporter + .report(EngineEvent::drained(key, None)); +} + +fn wait_for_observed_status( + runtime: &ControllerRuntime<()>, + pipeline_key: &PipelineKey, + predicate: F, +) -> PipelineStatus +where + F: Fn(&PipelineStatus) -> bool, +{ + let deadline = Instant::now() + Duration::from_secs(5); + loop { 
+ if let Some(status) = runtime.observed_state_handle.pipeline_status(pipeline_key) { + if predicate(&status) { + return status; + } + } + assert!( + Instant::now() < deadline, + "timed out waiting for observed status predicate on {}:{}", + pipeline_key.pipeline_group_id(), + pipeline_key.pipeline_id() + ); + thread::sleep(Duration::from_millis(25)); + } +} + +fn engine_config_with_pipeline(pipeline_yaml: &str) -> OtelDataflowSpec { + OtelDataflowSpec::from_yaml(&format!( + r#" +version: otel_dataflow/v1 +groups: + g1: + pipelines: + p1: +{pipeline_yaml} +"# + )) + .expect("engine config should parse") +} + +fn simple_pipeline_yaml() -> &'static str { + r#" + nodes: + receiver: + type: "urn:test:receiver:example" + config: null + exporter: + type: "urn:test:exporter:example" + config: null + connections: + - from: receiver + to: exporter +"# +} + +fn register_existing_pipeline(runtime: &ControllerRuntime<()>, config: &OtelDataflowSpec) { + register_pipeline(runtime, config, "g1", "p1"); +} + +fn register_pipeline( + runtime: &ControllerRuntime<()>, + config: &OtelDataflowSpec, + group_id: &str, + pipeline_id: &str, +) { + let resolved = config + .resolve() + .pipelines + .into_iter() + .find(|pipeline| { + pipeline.role == ResolvedPipelineRole::Regular + && pipeline.pipeline_group_id.as_ref() == group_id + && pipeline.pipeline_id.as_ref() == pipeline_id + }) + .expect("resolved pipeline should exist"); + runtime.register_committed_pipeline(resolved, 0); +} + +fn register_runtime_instance( + runtime: &ControllerRuntime<()>, + pipeline_group_id: &str, + pipeline_id: &str, + core_id: usize, + generation: u64, + lifecycle: RuntimeInstanceLifecycle, +) -> RuntimeCtrlMsgReceiver<()> { + let (tx, rx) = runtime_ctrl_msg_channel::<()>(4); + let control_sender: Arc = Arc::new(tx.clone()); + let is_active = matches!(&lifecycle, RuntimeInstanceLifecycle::Active); + let mut state = runtime + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + _ = state.runtime_instances.insert( + DeployedPipelineKey { + pipeline_group_id: pipeline_group_id.to_owned().into(), + pipeline_id: pipeline_id.to_owned().into(), + core_id, + deployment_generation: generation, + }, + RuntimeInstanceRecord { + control_sender: Some(control_sender), + lifecycle, + }, + ); + if is_active { + state.active_instances += 1; + } + rx +} + +fn register_runtime_instance_with_sender( + runtime: &ControllerRuntime<()>, + pipeline_key: DeployedPipelineKey, + control_sender: Arc, + lifecycle: RuntimeInstanceLifecycle, +) { + let is_active = matches!(&lifecycle, RuntimeInstanceLifecycle::Active); + let mut state = runtime + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + _ = state.runtime_instances.insert( + pipeline_key, + RuntimeInstanceRecord { + control_sender: Some(control_sender), + lifecycle, + }, + ); + if is_active { + state.active_instances += 1; + } +} + +struct RecordingPipelineAdminSender { + calls: Arc>>, + failure: Option, +} + +impl PipelineAdminSender for RecordingPipelineAdminSender { + fn try_send_shutdown(&self, _deadline: Instant, reason: String) -> Result<(), EngineError> { + self.calls + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()) + .push(reason); + if let Some(failure) = &self.failure { + Err(EngineError::RuntimeMsgError { + error: failure.clone(), + }) + } else { + Ok(()) + } + } +} + +fn recording_admin_sender( + failure: Option<&str>, +) -> (Arc, Arc>>) { + let calls = Arc::new(Mutex::new(Vec::new())); + let sender = Arc::new(RecordingPipelineAdminSender { + 
+        calls: Arc::clone(&calls),
+        failure: failure.map(ToOwned::to_owned),
+    });
+    (sender, calls)
+}
+
+fn launched_runtime_instance(
+    pipeline_group_id: &str,
+    pipeline_id: &str,
+    core_id: usize,
+    generation: u64,
+) -> LaunchedPipelineThread<()> {
+    let (tx, _rx) = runtime_ctrl_msg_channel::<()>(4);
+    let control_sender: Arc<dyn PipelineAdminSender> = Arc::new(tx);
+    LaunchedPipelineThread {
+        pipeline_key: DeployedPipelineKey {
+            pipeline_group_id: pipeline_group_id.to_owned().into(),
+            pipeline_id: pipeline_id.to_owned().into(),
+            core_id,
+            deployment_generation: generation,
+        },
+        control_sender,
+        _marker: std::marker::PhantomData,
+    }
+}
+
+fn wait_for_shutdown_state(
+    runtime: &ControllerRuntime<()>,
+    shutdown_id: &str,
+    expected_state: &str,
+) -> ShutdownStatus {
+    let deadline = Instant::now() + Duration::from_secs(5);
+    loop {
+        let status = runtime
+            .shutdown_status_snapshot(shutdown_id)
+            .expect("shutdown should exist");
+        if status.state == expected_state {
+            return status;
+        }
+        assert!(
+            Instant::now() < deadline,
+            "timed out waiting for shutdown {shutdown_id} to reach state {expected_state}, current state: {}",
+            status.state
+        );
+        thread::sleep(Duration::from_millis(25));
+    }
+}
+
+fn wait_for_shutdown_message(receiver: &mut RuntimeCtrlMsgReceiver<()>) -> RuntimeControlMsg<()> {
+    let deadline = Instant::now() + Duration::from_secs(2);
+    loop {
+        if let Ok(message) = receiver.try_recv() {
+            return message;
+        }
+        assert!(
+            Instant::now() < deadline,
+            "timed out waiting for shutdown control message"
+        );
+        thread::sleep(Duration::from_millis(25));
+    }
+}
+
+fn complete_instance_exit_on_shutdown(
+    runtime: Arc<ControllerRuntime<()>>,
+    mut receiver: RuntimeCtrlMsgReceiver<()>,
+    deployed_key: DeployedPipelineKey,
+    expected_reason: &'static str,
+) -> thread::JoinHandle<()> {
+    thread::spawn(move || {
+        assert!(matches!(
+            wait_for_shutdown_message(&mut receiver),
+            RuntimeControlMsg::Shutdown { reason, .. } if reason == expected_reason
+        ));
+        runtime.note_instance_exit(deployed_key, RuntimeInstanceExit::Success);
+    })
+}
+
+fn terminal_rollout_record(
+    pipeline_group_id: &str,
+    pipeline_id: &str,
+    rollout_id: &str,
+) -> RolloutRecord {
+    let mut rollout = RolloutRecord::new(
+        rollout_id.to_owned(),
+        pipeline_group_id.to_owned().into(),
+        pipeline_id.to_owned().into(),
+        RolloutAction::Replace,
+        1,
+        Some(0),
+        60,
+        Vec::new(),
+    );
+    rollout.state = RolloutLifecycleState::Succeeded;
+    rollout
+}
+
+fn terminal_shutdown_record(
+    pipeline_group_id: &str,
+    pipeline_id: &str,
+    shutdown_id: &str,
+) -> ShutdownRecord {
+    let mut shutdown = ShutdownRecord::new(
+        shutdown_id.to_owned(),
+        pipeline_group_id.to_owned().into(),
+        pipeline_id.to_owned().into(),
+        Vec::new(),
+    );
+    shutdown.state = ShutdownLifecycleState::Succeeded;
+    shutdown
+}
+
+/// Scenario: a reconfigure request changes only the effective core
+/// allocation from one assigned core to two.
+/// Guarantees: rollout planning classifies the change as a resize, starts
+/// only the added core, and keeps the current generation unchanged.
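+///
+/// A resize keeps `target_generation` unchanged because the committed graph
+/// itself is untouched; compare the replace-path test below, which bumps the
+/// generation to 1. The planner reports the core delta roughly as:
+///
+/// ```text
+/// current [0] -> target [0, 1]  =>  resize_start_cores = [1]
+/// ```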
+#[test]
+fn prepare_rollout_plan_accepts_core_allocation_scale_up() {
+    let config = engine_config_with_pipeline(
+        r#"
+        policies:
+          resources:
+            core_allocation:
+              type: core_count
+              count: 1
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+    let _receiver =
+        register_runtime_instance(&runtime, "g1", "p1", 0, 0, RuntimeInstanceLifecycle::Active);
+
+    let replacement = PipelineConfig::from_yaml(
+        "g1".into(),
+        "p1".into(),
+        r#"
+policies:
+  resources:
+    core_allocation:
+      type: core_count
+      count: 2
+nodes:
+  receiver:
+    type: "urn:test:receiver:example"
+    config: null
+  exporter:
+    type: "urn:test:exporter:example"
+    config: null
+connections:
+  - from: receiver
+    to: exporter
+"#,
+    )
+    .expect("replacement should parse");
+
+    let plan = runtime
+        .prepare_rollout_plan(
+            "g1",
+            "p1",
+            &ReconfigureRequest {
+                pipeline: replacement,
+                step_timeout_secs: 60,
+                drain_timeout_secs: 60,
+            },
+        )
+        .expect("core allocation changes should be planned");
+
+    assert_eq!(plan.action, RolloutAction::Resize);
+    assert_eq!(plan.current_assigned_cores, vec![0]);
+    assert_eq!(plan.target_assigned_cores, vec![0, 1]);
+    assert_eq!(plan.common_assigned_cores, vec![0]);
+    assert_eq!(plan.added_assigned_cores, vec![1]);
+    assert!(plan.removed_assigned_cores.is_empty());
+    assert_eq!(plan.resize_start_cores, vec![1]);
+    assert!(plan.resize_stop_cores.is_empty());
+    assert_eq!(plan.target_generation, 0);
+    assert_eq!(
+        plan.rollout
+            .cores
+            .iter()
+            .map(|core| core.core_id)
+            .collect::<Vec<_>>(),
+        vec![1]
+    );
+}
+
+/// Scenario: a reconfigure request changes only the effective core
+/// allocation from two assigned cores to one.
+/// Guarantees: rollout planning classifies the change as a resize, stops
+/// only the removed core, and keeps the current generation unchanged.
+#[test]
+fn prepare_rollout_plan_accepts_core_allocation_scale_down() {
+    let config = engine_config_with_pipeline(
+        r#"
+        policies:
+          resources:
+            core_allocation:
+              type: core_count
+              count: 2
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+    let _receiver0 =
+        register_runtime_instance(&runtime, "g1", "p1", 0, 0, RuntimeInstanceLifecycle::Active);
+    let _receiver1 =
+        register_runtime_instance(&runtime, "g1", "p1", 1, 0, RuntimeInstanceLifecycle::Active);
+
+    let replacement = PipelineConfig::from_yaml(
+        "g1".into(),
+        "p1".into(),
+        r#"
+policies:
+  resources:
+    core_allocation:
+      type: core_count
+      count: 1
+nodes:
+  receiver:
+    type: "urn:test:receiver:example"
+    config: null
+  exporter:
+    type: "urn:test:exporter:example"
+    config: null
+connections:
+  - from: receiver
+    to: exporter
+"#,
+    )
+    .expect("replacement should parse");
+
+    let plan = runtime
+        .prepare_rollout_plan(
+            "g1",
+            "p1",
+            &ReconfigureRequest {
+                pipeline: replacement,
+                step_timeout_secs: 60,
+                drain_timeout_secs: 60,
+            },
+        )
+        .expect("core allocation changes should be planned");
+
+    assert_eq!(plan.action, RolloutAction::Resize);
+    assert_eq!(plan.current_assigned_cores, vec![0, 1]);
+    assert_eq!(plan.target_assigned_cores, vec![0]);
+    assert_eq!(plan.common_assigned_cores, vec![0]);
+    assert!(plan.added_assigned_cores.is_empty());
+    assert_eq!(plan.removed_assigned_cores, vec![1]);
+    assert!(plan.resize_start_cores.is_empty());
+    assert_eq!(plan.resize_stop_cores, vec![1]);
+    assert_eq!(plan.target_generation, 0);
+    assert_eq!(
+        plan.rollout
+            .cores
+            .iter()
+            .map(|core| core.core_id)
+            .collect::<Vec<_>>(),
+        vec![1]
+    );
+}
+
+/// Scenario: the submitted pipeline config is effectively identical to the
+/// committed active pipeline and serving footprint.
+/// Guarantees: rollout planning short-circuits to `NoOp` rather than
+/// scheduling a replace or resize operation.
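+///
+/// Planning identical submissions as `NoOp` makes reconfigure idempotent at
+/// the planning layer: callers receive a succeeded `noop` outcome (see
+/// `spawn_rollout_returns_immediate_success_for_noop` below) without any
+/// runtime instance being started or stopped.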
+#[test]
+fn prepare_rollout_plan_returns_noop_for_identical_active_pipeline() {
+    let config = engine_config_with_pipeline(
+        r#"
+        policies:
+          resources:
+            core_allocation:
+              type: core_count
+              count: 1
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+    let _receiver =
+        register_runtime_instance(&runtime, "g1", "p1", 0, 0, RuntimeInstanceLifecycle::Active);
+
+    let replacement = PipelineConfig::from_yaml(
+        "g1".into(),
+        "p1".into(),
+        r#"
+policies:
+  resources:
+    core_allocation:
+      type: core_count
+      count: 1
+nodes:
+  receiver:
+    type: "urn:test:receiver:example"
+    config: null
+  exporter:
+    type: "urn:test:exporter:example"
+    config: null
+connections:
+  - from: receiver
+    to: exporter
+"#,
+    )
+    .expect("replacement should parse");
+
+    let plan = runtime
+        .prepare_rollout_plan(
+            "g1",
+            "p1",
+            &ReconfigureRequest {
+                pipeline: replacement,
+                step_timeout_secs: 60,
+                drain_timeout_secs: 60,
+            },
+        )
+        .expect("identical updates should be planned");
+
+    assert_eq!(plan.action, RolloutAction::NoOp);
+    assert_eq!(plan.target_generation, 0);
+    assert!(plan.rollout.cores.is_empty());
+    assert!(plan.resize_start_cores.is_empty());
+    assert!(plan.resize_stop_cores.is_empty());
+}
+
+/// Scenario: the controller executes a rollout plan that has already been
+/// classified as `NoOp`.
+/// Guarantees: the controller returns an immediate successful rollout
+/// snapshot, preserves the committed generation, and leaves no in-flight
+/// rollout summary behind.
+#[test]
+fn spawn_rollout_returns_immediate_success_for_noop() {
+    let config = engine_config_with_pipeline(
+        r#"
+        policies:
+          resources:
+            core_allocation:
+              type: core_count
+              count: 1
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+    let _receiver =
+        register_runtime_instance(&runtime, "g1", "p1", 0, 0, RuntimeInstanceLifecycle::Active);
+
+    let replacement = PipelineConfig::from_yaml(
+        "g1".into(),
+        "p1".into(),
+        r#"
+policies:
+  resources:
+    core_allocation:
+      type: core_count
+      count: 1
+nodes:
+  receiver:
+    type: "urn:test:receiver:example"
+    config: null
+  exporter:
+    type: "urn:test:exporter:example"
+    config: null
+connections:
+  - from: receiver
+    to: exporter
+"#,
+    )
+    .expect("replacement should parse");
+
+    let plan = runtime
+        .prepare_rollout_plan(
+            "g1",
+            "p1",
+            &ReconfigureRequest {
+                pipeline: replacement,
+                step_timeout_secs: 60,
+                drain_timeout_secs: 60,
+            },
+        )
+        .expect("identical updates should be planned");
+
+    let status = runtime
+        .spawn_rollout(plan)
+        .expect("noop rollout should succeed");
+
+    assert_eq!(status.action, "noop");
+    assert_eq!(status.state, ApiPipelineRolloutState::Succeeded);
+    assert_eq!(status.target_generation, 0);
+    assert!(status.cores.is_empty());
+
+    let pipeline_key = PipelineKey::new("g1".into(), "p1".into());
+    let details = runtime
+        .pipeline_details_snapshot(&pipeline_key)
+        .expect("group should exist")
+        .expect("pipeline should exist");
+    assert_eq!(details.active_generation, Some(0));
+    assert!(details.rollout.is_none());
+
+    let rollout = runtime
+        .rollout_status_snapshot(&status.rollout_id)
.expect("completed rollout should remain queryable"); + assert_eq!(rollout.state, ApiPipelineRolloutState::Succeeded); +} + +/// Scenario: a reconfigure request changes the runtime graph shape while +/// also changing the resource footprint. +/// Guarantees: planning keeps the safer replace path instead of collapsing +/// the update into a resource-only resize. +#[test] +fn prepare_rollout_plan_keeps_replace_when_runtime_shape_changes() { + let config = engine_config_with_pipeline( + r#" + policies: + resources: + core_allocation: + type: core_count + count: 1 + nodes: + receiver: + type: "urn:test:receiver:example" + config: null + exporter: + type: "urn:test:exporter:example" + config: null + connections: + - from: receiver + to: exporter +"#, + ); + let runtime = test_runtime(&config); + register_existing_pipeline(&runtime, &config); + let _receiver = + register_runtime_instance(&runtime, "g1", "p1", 0, 0, RuntimeInstanceLifecycle::Active); + + let replacement = PipelineConfig::from_yaml( + "g1".into(), + "p1".into(), + r#" +policies: + resources: + core_allocation: + type: core_count + count: 2 +nodes: + input: + type: "urn:test:receiver:example" + config: null + output: + type: "urn:test:exporter:example" + config: null +connections: + - from: input + to: output +"#, + ) + .expect("replacement should parse"); + + let plan = runtime + .prepare_rollout_plan( + "g1", + "p1", + &ReconfigureRequest { + pipeline: replacement, + step_timeout_secs: 60, + drain_timeout_secs: 60, + }, + ) + .expect("runtime shape changes should still be planned"); + + assert_eq!(plan.action, RolloutAction::Replace); + assert_eq!(plan.target_generation, 1); + assert_eq!(plan.common_assigned_cores, vec![0]); + assert_eq!(plan.added_assigned_cores, vec![1]); + assert!(plan.resize_start_cores.is_empty()); + assert!(plan.resize_stop_cores.is_empty()); + assert_eq!( + plan.rollout + .cores + .iter() + .map(|core| core.core_id) + .collect::>(), + vec![0, 1] + ); +} + +/// Scenario: a reconfigure request would require a runtime topic-broker +/// mutation for an existing logical pipeline. +/// Guarantees: planning rejects the request before rollout starts and +/// surfaces an invalid-request error to the caller. 
+#[test]
+fn prepare_rollout_plan_rejects_topic_runtime_mutation() {
+    let config = OtelDataflowSpec::from_yaml(
+        r#"
+version: otel_dataflow/v1
+topics:
+  shared: {}
+groups:
+  g1:
+    pipelines:
+      p1:
+        policies:
+          resources:
+            core_allocation:
+              type: core_count
+              count: 1
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          to_topic:
+            type: "urn:otel:exporter:topic"
+            config:
+              topic: shared
+        connections:
+          - from: receiver
+            to: to_topic
+"#,
+    )
+    .expect("config should parse");
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+
+    let replacement = PipelineConfig::from_yaml(
+        "g1".into(),
+        "p1".into(),
+        r#"
+policies:
+  resources:
+    core_allocation:
+      type: core_count
+      count: 1
+nodes:
+  from_topic:
+    type: "urn:otel:receiver:topic"
+    config:
+      topic: shared
+      subscription:
+        mode: balanced
+        group: workers
+  exporter:
+    type: "urn:test:exporter:example"
+    config: null
+connections:
+  - from: from_topic
+    to: exporter
+"#,
+    )
+    .expect("replacement should parse");
+
+    let err = runtime
+        .prepare_rollout_plan(
+            "g1",
+            "p1",
+            &ReconfigureRequest {
+                pipeline: replacement,
+                step_timeout_secs: 60,
+                drain_timeout_secs: 60,
+            },
+        )
+        .expect_err("topic runtime changes should be rejected");
+
+    match err {
+        ControlPlaneError::InvalidRequest { message } => {
+            assert!(message.contains("topic broker mutation"));
+        }
+        other => panic!("unexpected error: {other:?}"),
+    }
+}
+
+/// Scenario: a second rollout is requested for a logical pipeline that
+/// already has an active rollout record.
+/// Guarantees: planning rejects the new request with a rollout conflict
+/// instead of interleaving two rollout state machines.
+#[test]
+fn prepare_rollout_plan_rejects_concurrent_rollout_for_same_pipeline() {
+    let config = engine_config_with_pipeline(
+        r#"
+        policies:
+          resources:
+            core_allocation:
+              type: core_count
+              count: 1
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+
+    let replacement = PipelineConfig::from_yaml(
+        "g1".into(),
+        "p1".into(),
+        r#"
+policies:
+  resources:
+    core_allocation:
+      type: core_count
+      count: 1
+nodes:
+  input:
+    type: "urn:test:receiver:example"
+    config: null
+  output:
+    type: "urn:test:exporter:example"
+    config: null
+connections:
+  - from: input
+    to: output
+"#,
+    )
+    .expect("replacement should parse");
+    let plan = runtime
+        .prepare_rollout_plan(
+            "g1",
+            "p1",
+            &ReconfigureRequest {
+                pipeline: replacement.clone(),
+                step_timeout_secs: 60,
+                drain_timeout_secs: 60,
+            },
+        )
+        .expect("first rollout plan should be accepted");
+    runtime
+        .insert_rollout(&plan.pipeline_key, plan.rollout.clone())
+        .expect("rollout should register");
+
+    let err = runtime
+        .prepare_rollout_plan(
+            "g1",
+            "p1",
+            &ReconfigureRequest {
+                pipeline: replacement,
+                step_timeout_secs: 60,
+                drain_timeout_secs: 60,
+            },
+        )
+        .expect_err("second rollout should conflict");
+
+    assert_eq!(err, ControlPlaneError::RolloutConflict);
+}
+
+/// Scenario: a new rollout has been registered for a logical pipeline but
+/// has not yet committed its candidate config.
+/// Guarantees: pipeline details still return the committed config while
+/// exposing the pending rollout summary separately.
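+///
+/// Callers can treat `details.pipeline` as what is serving right now and
+/// `details.rollout` as the in-flight intent: the candidate config becomes
+/// the committed one only after the rollout commits its target generation.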
+#[test]
+fn pipeline_details_returns_committed_config_while_rollout_is_pending() {
+    let config = engine_config_with_pipeline(
+        r#"
+        policies:
+          resources:
+            core_allocation:
+              type: core_count
+              count: 1
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+
+    let replacement = PipelineConfig::from_yaml(
+        "g1".into(),
+        "p1".into(),
+        r#"
+policies:
+  resources:
+    core_allocation:
+      type: core_count
+      count: 1
+nodes:
+  input:
+    type: "urn:test:receiver:example"
+    config: null
+  output:
+    type: "urn:test:exporter:example"
+    config: null
+connections:
+  - from: input
+    to: output
+"#,
+    )
+    .expect("replacement should parse");
+    let plan = runtime
+        .prepare_rollout_plan(
+            "g1",
+            "p1",
+            &ReconfigureRequest {
+                pipeline: replacement.clone(),
+                step_timeout_secs: 60,
+                drain_timeout_secs: 60,
+            },
+        )
+        .expect("rollout plan should be accepted");
+    runtime
+        .insert_rollout(&plan.pipeline_key, plan.rollout.clone())
+        .expect("rollout should register");
+
+    let details = runtime
+        .pipeline_details_snapshot(&PipelineKey::new("g1".into(), "p1".into()))
+        .expect("group should exist")
+        .expect("pipeline details should exist");
+
+    let mut committed_nodes = details
+        .pipeline
+        .node_iter()
+        .map(|(node_id, _)| node_id.as_ref().to_owned())
+        .collect::<Vec<_>>();
+    committed_nodes.sort();
+    assert_eq!(
+        committed_nodes,
+        vec!["exporter".to_owned(), "receiver".to_owned()]
+    );
+    assert_eq!(details.active_generation, Some(0));
+    assert_eq!(
+        details
+            .rollout
+            .expect("pending rollout summary should be present")
+            .target_generation,
+        1
+    );
+}
+
+/// Scenario: panic diagnostics are captured for a worker panic with explicit
+/// thread metadata.
+/// Guarantees: the short summary stays operator-friendly while the detailed
+/// form includes thread context and a captured backtrace.
+#[test]
+fn panic_report_formats_summary_and_detail() {
+    let report = PanicReport::capture(
+        "rollout worker",
+        Box::new("boom"),
+        Some("rollout-g1-p1".to_owned()),
+        Some(17),
+        Some(3),
+    );
+
+    assert_eq!(report.summary_message(), "rollout worker panicked: boom");
+    let detail = report.detail_message();
+    assert!(detail.contains("rollout worker panicked: boom"));
+    assert!(detail.contains("thread_name=rollout-g1-p1"));
+    assert!(detail.contains("thread_id=17"));
+    assert!(detail.contains("core_id=3"));
+    assert!(detail.contains("backtrace:"));
+}
+
+/// Scenario: a panic is raised with a non-string payload.
+/// Guarantees: the captured panic summary stays readable and avoids the older
+/// generic placeholder text.
+#[test]
+fn panic_report_non_string_payload_has_useful_fallback() {
+    let report = PanicReport::capture("shutdown worker", Box::new(7usize), None, None, None);
+
+    assert_eq!(
+        report.summary_message(),
+        "shutdown worker panicked: non-string panic payload"
+    );
+    assert!(
+        !report
+            .summary_message()
+            .contains("panic payload was not a string")
+    );
+}
+
+/// Scenario: a detached rollout worker panics before it reaches the normal
+/// terminal-state bookkeeping path.
+/// Guarantees: the rollout is forced into a failed terminal state and the
+/// logical pipeline no longer stays blocked by a stale active-rollout entry.
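+///
+/// The recorded `failure_reason` carries the short `PanicReport` summary
+/// ("rollout worker panicked: boom") rather than the multi-line detail form,
+/// so API consumers are not handed an embedded backtrace.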
+#[test]
+fn rollout_worker_panic_marks_failed_and_clears_conflict() {
+    let config = engine_config_with_pipeline(simple_pipeline_yaml());
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+
+    let replacement = PipelineConfig::from_yaml(
+        "g1".into(),
+        "p1".into(),
+        r#"
+nodes:
+  input:
+    type: "urn:test:receiver:example"
+    config: null
+  output:
+    type: "urn:test:exporter:example"
+    config: null
+connections:
+  - from: input
+    to: output
+"#,
+    )
+    .expect("replacement should parse");
+    let plan = runtime
+        .prepare_rollout_plan(
+            "g1",
+            "p1",
+            &ReconfigureRequest {
+                pipeline: replacement.clone(),
+                step_timeout_secs: 60,
+                drain_timeout_secs: 60,
+            },
+        )
+        .expect("rollout plan should be accepted");
+    runtime
+        .insert_rollout(&plan.pipeline_key, plan.rollout.clone())
+        .expect("rollout should register");
+
+    runtime.handle_rollout_worker_panic(
+        &plan.pipeline_key,
+        &plan.rollout.rollout_id,
+        "rollout-g1-p1".to_owned(),
+        Box::new("boom"),
+    );
+
+    let status = runtime
+        .rollout_status_snapshot(&plan.rollout.rollout_id)
+        .expect("rollout should remain queryable");
+    assert_eq!(status.state, ApiPipelineRolloutState::Failed);
+    assert!(
+        status
+            .failure_reason
+            .as_deref()
+            .is_some_and(|message| message.contains("rollout worker panicked: boom"))
+    );
+    assert!(
+        status
+            .failure_reason
+            .as_deref()
+            .is_some_and(|message| !message.contains("backtrace:"))
+    );
+
+    let state = runtime
+        .state
+        .lock()
+        .unwrap_or_else(|poisoned| poisoned.into_inner());
+    assert!(!state.active_rollouts.contains_key(&plan.pipeline_key));
+    drop(state);
+
+    let _next_plan = runtime
+        .prepare_rollout_plan(
+            "g1",
+            "p1",
+            &ReconfigureRequest {
+                pipeline: replacement,
+                step_timeout_secs: 60,
+                drain_timeout_secs: 60,
+            },
+        )
+        .expect("rollout conflict should be cleared after panic cleanup");
+}
+
+/// Scenario: a rollout worker panics after launching an uncommitted candidate
+/// generation.
+/// Guarantees: panic cleanup requests shutdown for the candidate generation
+/// before clearing the active rollout, avoiding active orphan instances.
+#[test]
+fn rollout_worker_panic_requests_shutdown_for_uncommitted_candidate_generation() {
+    let config = engine_config_with_pipeline(simple_pipeline_yaml());
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+
+    let replacement = PipelineConfig::from_yaml(
+        "g1".into(),
+        "p1".into(),
+        r#"
+nodes:
+  input:
+    type: "urn:test:receiver:example"
+    config: null
+  output:
+    type: "urn:test:exporter:example"
+    config: null
+connections:
+  - from: input
+    to: output
+"#,
+    )
+    .expect("replacement should parse");
+    let plan = runtime
+        .prepare_rollout_plan(
+            "g1",
+            "p1",
+            &ReconfigureRequest {
+                pipeline: replacement,
+                step_timeout_secs: 60,
+                drain_timeout_secs: 60,
+            },
+        )
+        .expect("rollout plan should be accepted");
+    runtime
+        .insert_rollout(&plan.pipeline_key, plan.rollout.clone())
+        .expect("rollout should register");
+
+    let candidate_key = deployed_key("g1", "p1", 0, plan.target_generation);
+    let mut candidate_rx = register_runtime_instance(
+        &runtime,
+        "g1",
+        "p1",
+        0,
+        plan.target_generation,
+        RuntimeInstanceLifecycle::Active,
+    );
+
+    runtime.handle_rollout_worker_panic(
+        &plan.pipeline_key,
+        &plan.rollout.rollout_id,
+        "rollout-g1-p1".to_owned(),
+        Box::new("boom"),
+    );
+
+    assert!(matches!(
+        wait_for_shutdown_message(&mut candidate_rx),
+        RuntimeControlMsg::Shutdown { reason, ..
+        } if reason == "rollout panic cleanup"
+    ));
+
+    let state = runtime
+        .state
+        .lock()
+        .unwrap_or_else(|poisoned| poisoned.into_inner());
+    assert!(
+        state
+            .runtime_instances
+            .get(&candidate_key)
+            .expect("candidate instance should still be tracked until exit")
+            .control_sender
+            .is_none(),
+        "panic cleanup should release the retained sender after shutdown dispatch"
+    );
+    assert!(!state.active_rollouts.contains_key(&plan.pipeline_key));
+}
+
+/// Scenario: a rollout worker panics after the target generation was already
+/// committed as serving.
+/// Guarantees: panic cleanup does not shut down the committed generation,
+/// which would turn a late bookkeeping panic into a runtime outage.
+#[test]
+fn rollout_worker_panic_does_not_shutdown_committed_target_generation() {
+    let config = engine_config_with_pipeline(simple_pipeline_yaml());
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+
+    let replacement = PipelineConfig::from_yaml(
+        "g1".into(),
+        "p1".into(),
+        r#"
+nodes:
+  input:
+    type: "urn:test:receiver:example"
+    config: null
+  output:
+    type: "urn:test:exporter:example"
+    config: null
+connections:
+  - from: input
+    to: output
+"#,
+    )
+    .expect("replacement should parse");
+    let plan = runtime
+        .prepare_rollout_plan(
+            "g1",
+            "p1",
+            &ReconfigureRequest {
+                pipeline: replacement,
+                step_timeout_secs: 60,
+                drain_timeout_secs: 60,
+            },
+        )
+        .expect("rollout plan should be accepted");
+    runtime
+        .insert_rollout(&plan.pipeline_key, plan.rollout.clone())
+        .expect("rollout should register");
+    runtime.commit_pipeline_record(&plan, plan.target_generation);
+
+    let mut candidate_rx = register_runtime_instance(
+        &runtime,
+        "g1",
+        "p1",
+        0,
+        plan.target_generation,
+        RuntimeInstanceLifecycle::Active,
+    );
+
+    runtime.handle_rollout_worker_panic(
+        &plan.pipeline_key,
+        &plan.rollout.rollout_id,
+        "rollout-g1-p1".to_owned(),
+        Box::new("boom"),
+    );
+
+    assert!(
+        candidate_rx.try_recv().is_err(),
+        "committed target generation must not receive panic-cleanup shutdown"
+    );
+}
+
+/// Scenario: a resize rollback must clean up cores that were already started
+/// before a later step fails.
+/// Guarantees: rollback sends shutdown to those started cores instead of
+/// leaving them running after the rollout fails.
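+///
+/// The test stands in for the real runtime thread with
+/// `complete_instance_exit_on_shutdown`, which acknowledges the rollback's
+/// shutdown control message and reports a successful instance exit so the
+/// rollback bookkeeping can complete.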
+#[test]
+fn rollback_resize_rollout_cleans_up_started_cores() {
+    let config = engine_config_with_pipeline(
+        r#"
+        policies:
+          resources:
+            core_allocation:
+              type: core_count
+              count: 1
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+    let _existing =
+        register_runtime_instance(&runtime, "g1", "p1", 0, 0, RuntimeInstanceLifecycle::Active);
+
+    let replacement = PipelineConfig::from_yaml(
+        "g1".into(),
+        "p1".into(),
+        r#"
+policies:
+  resources:
+    core_allocation:
+      type: core_count
+      count: 2
+nodes:
+  receiver:
+    type: "urn:test:receiver:example"
+    config: null
+  exporter:
+    type: "urn:test:exporter:example"
+    config: null
+connections:
+  - from: receiver
+    to: exporter
+"#,
+    )
+    .expect("replacement should parse");
+    let plan = runtime
+        .prepare_rollout_plan(
+            "g1",
+            "p1",
+            &ReconfigureRequest {
+                pipeline: replacement,
+                step_timeout_secs: 60,
+                drain_timeout_secs: 60,
+            },
+        )
+        .expect("resize rollout plan should be accepted");
+    runtime
+        .insert_rollout(&plan.pipeline_key, plan.rollout.clone())
+        .expect("rollout should register");
+
+    let started_key = deployed_key("g1", "p1", 1, plan.target_generation);
+    let started_rx = register_runtime_instance(
+        &runtime,
+        "g1",
+        "p1",
+        1,
+        plan.target_generation,
+        RuntimeInstanceLifecycle::Active,
+    );
+    let exit_thread = complete_instance_exit_on_shutdown(
+        Arc::clone(&runtime),
+        started_rx,
+        started_key.clone(),
+        "rollback cleanup",
+    );
+
+    let result = runtime.rollback_resize_rollout(&plan, &[1], &[], "boom".to_owned());
+
+    assert!(matches!(
+        result,
+        Err(RolloutExecutionError::Failed(reason)) if reason == "boom"
+    ));
+    exit_thread
+        .join()
+        .expect("resize rollback shutdown helper should join cleanly");
+    assert!(matches!(
+        runtime.instance_exit(&started_key),
+        Some(RuntimeInstanceExit::Success)
+    ));
+}
+
+/// Scenario: a replace rollback must clean up added candidate cores that were
+/// already serving the target generation before a later step fails.
+/// Guarantees: rollback sends shutdown to those activated added cores instead
+/// of leaving the candidate generation running.
+#[test]
+fn rollback_replace_rollout_cleans_up_activated_added_cores() {
+    let config = engine_config_with_pipeline(
+        r#"
+        policies:
+          resources:
+            core_allocation:
+              type: core_count
+              count: 1
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+    let _existing =
+        register_runtime_instance(&runtime, "g1", "p1", 0, 0, RuntimeInstanceLifecycle::Active);
+
+    let replacement = PipelineConfig::from_yaml(
+        "g1".into(),
+        "p1".into(),
+        r#"
+policies:
+  resources:
+    core_allocation:
+      type: core_count
+      count: 2
+nodes:
+  input:
+    type: "urn:test:receiver:example"
+    config: null
+  output:
+    type: "urn:test:exporter:example"
+    config: null
+connections:
+  - from: input
+    to: output
+"#,
+    )
+    .expect("replacement should parse");
+    let plan = runtime
+        .prepare_rollout_plan(
+            "g1",
+            "p1",
+            &ReconfigureRequest {
+                pipeline: replacement,
+                step_timeout_secs: 60,
+                drain_timeout_secs: 60,
+            },
+        )
+        .expect("replace rollout plan should be accepted");
+    runtime
+        .insert_rollout(&plan.pipeline_key, plan.rollout.clone())
+        .expect("rollout should register");
+
+    let added_key = deployed_key("g1", "p1", 1, plan.target_generation);
+    let added_rx = register_runtime_instance(
+        &runtime,
+        "g1",
+        "p1",
+        1,
+        plan.target_generation,
+        RuntimeInstanceLifecycle::Active,
+    );
+    let exit_thread = complete_instance_exit_on_shutdown(
+        Arc::clone(&runtime),
+        added_rx,
+        added_key.clone(),
+        "rollback cleanup",
+    );
+
+    let result = runtime.rollback_replace_rollout(&plan, &[], &[1], &[], "boom".to_owned());
+
+    assert!(matches!(
+        result,
+        Err(RolloutExecutionError::Failed(reason)) if reason == "boom"
+    ));
+    exit_thread
+        .join()
+        .expect("replace rollback shutdown helper should join cleanly");
+    assert!(matches!(
+        runtime.instance_exit(&added_key),
+        Some(RuntimeInstanceExit::Success)
+    ));
+}
+
+/// Scenario: a shutdown request targets a group id that does not exist in
+/// the controller's committed config.
+/// Guarantees: per-pipeline shutdown fails fast with `GroupNotFound`
+/// instead of creating a shutdown record.
+#[test]
+fn request_shutdown_pipeline_rejects_missing_group() {
+    let config = engine_config_with_pipeline(
+        r#"
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+
+    let err = runtime
+        .request_shutdown_pipeline("missing", "p1", 5)
+        .expect_err("missing group should be rejected");
+
+    assert_eq!(err, ControlPlaneError::GroupNotFound);
+}
+
+/// Scenario: a shutdown request targets a pipeline id that is not present
+/// in an existing group.
+/// Guarantees: per-pipeline shutdown rejects the request with
+/// `PipelineNotFound` before any runtime instances are touched.
+#[test]
+fn request_shutdown_pipeline_rejects_missing_pipeline() {
+    let config = engine_config_with_pipeline(
+        r#"
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+
+    let err = runtime
+        .request_shutdown_pipeline("g1", "missing", 5)
+        .expect_err("missing pipeline should be rejected");
+
+    assert_eq!(err, ControlPlaneError::PipelineNotFound);
+}
+
+/// Scenario: a detached shutdown worker panics before it reaches the normal
+/// terminal-state bookkeeping path.
+/// Guarantees: the shutdown is forced into a failed terminal state and the
+/// logical pipeline no longer stays blocked by a stale active-shutdown entry.
+#[test]
+fn shutdown_worker_panic_marks_failed_and_clears_conflict() {
+    let config = engine_config_with_pipeline(simple_pipeline_yaml());
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+    let _rx =
+        register_runtime_instance(&runtime, "g1", "p1", 0, 0, RuntimeInstanceLifecycle::Active);
+
+    let plan = runtime
+        .prepare_shutdown_plan("g1", "p1", 5)
+        .expect("shutdown plan should be accepted");
+    runtime
+        .insert_shutdown(&plan.pipeline_key, plan.shutdown.clone())
+        .expect("shutdown should register");
+
+    runtime.handle_shutdown_worker_panic(
+        &plan.pipeline_key,
+        &plan.shutdown.shutdown_id,
+        "shutdown-g1-p1".to_owned(),
+        Box::new("boom"),
+    );
+
+    let status = runtime
+        .shutdown_status_snapshot(&plan.shutdown.shutdown_id)
+        .expect("shutdown should remain queryable");
+    assert_eq!(status.state, "failed");
+    assert!(
+        status
+            .failure_reason
+            .as_deref()
+            .is_some_and(|message| message.contains("shutdown worker panicked: boom"))
+    );
+    assert!(
+        status
+            .failure_reason
+            .as_deref()
+            .is_some_and(|message| !message.contains("backtrace:"))
+    );
+
+    let state = runtime
+        .state
+        .lock()
+        .unwrap_or_else(|poisoned| poisoned.into_inner());
+    assert!(!state.active_shutdowns.contains_key(&plan.pipeline_key));
+    drop(state);
+
+    let _next_plan = runtime
+        .prepare_shutdown_plan("g1", "p1", 5)
+        .expect("shutdown conflict should be cleared after panic cleanup");
+}
+
+/// Scenario: a shutdown request arrives while the same logical pipeline is
+/// already under rollout.
+/// Guarantees: shutdown is rejected with a rollout conflict so the rollout
+/// controller remains the single owner of that pipeline's lifecycle.
+#[test]
+fn request_shutdown_pipeline_rejects_active_rollout() {
+    let config = engine_config_with_pipeline(
+        r#"
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+
+    let pipeline_key = PipelineKey::new("g1".into(), "p1".into());
+    let mut state = runtime
+        .state
+        .lock()
+        .unwrap_or_else(|poisoned| poisoned.into_inner());
+    _ = state
+        .active_rollouts
+        .insert(pipeline_key, "rollout-42".to_owned());
+    drop(state);
+
+    let err = runtime
+        .request_shutdown_pipeline("g1", "p1", 5)
+        .expect_err("active rollout should conflict");
+
+    assert_eq!(err, ControlPlaneError::RolloutConflict);
+}
+
+/// Scenario: a second shutdown request targets a logical pipeline that
+/// already has an active shutdown operation.
+/// Guarantees: the controller rejects the duplicate request instead of
+/// creating competing shutdown records.
+#[test]
+fn request_shutdown_pipeline_rejects_active_shutdown() {
+    let config = engine_config_with_pipeline(
+        r#"
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+    let pipeline_key = PipelineKey::new("g1".into(), "p1".into());
+    let shutdown = ShutdownRecord::new(
+        "shutdown-0".to_owned(),
+        "g1".into(),
+        "p1".into(),
+        vec![ShutdownCoreProgress {
+            core_id: 0,
+            deployment_generation: 0,
+            state: "pending".to_owned(),
+            updated_at: timestamp_now(),
+            detail: None,
+        }],
+    );
+    runtime
+        .insert_shutdown(&pipeline_key, shutdown)
+        .expect("shutdown should register");
+
+    let err = runtime
+        .request_shutdown_pipeline("g1", "p1", 5)
+        .expect_err("active shutdown should conflict");
+
+    assert_eq!(err, ControlPlaneError::RolloutConflict);
+}
+
+/// Scenario: a shutdown request targets a committed pipeline that currently
+/// has no active runtime instances.
+/// Guarantees: the controller rejects the request as invalid because the
+/// pipeline is already stopped, instead of synthesizing a no-op shutdown
+/// operation.
+#[test]
+fn request_shutdown_pipeline_rejects_already_stopped_pipeline() {
+    let config = engine_config_with_pipeline(
+        r#"
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+
+    let err = runtime
+        .request_shutdown_pipeline("g1", "p1", 5)
+        .expect_err("already stopped pipeline should be rejected");
+
+    match err {
+        ControlPlaneError::InvalidRequest { message } => {
+            assert!(message.contains("already stopped"));
+        }
+        other => panic!("unexpected error: {other:?}"),
+    }
+}
+
+/// Scenario: a shutdown request targets one logical pipeline while other
+/// pipelines and exited instances still exist in the runtime registry.
+/// Guarantees: only active instances for the requested logical pipeline
+/// receive shutdown control messages and relinquish their control senders.
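+///
+/// A released control sender is the controller's "already signaled" marker:
+/// an instance that still holds its sender was either never told to shut
+/// down or had a failed send that should be retried, which is exactly what
+/// the polling loop in this test distinguishes per instance.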
+#[test]
+fn request_shutdown_pipeline_targets_only_active_instances_for_pipeline() {
+    let config = OtelDataflowSpec::from_yaml(
+        r#"
+version: otel_dataflow/v1
+groups:
+  g1:
+    pipelines:
+      p1:
+        policies:
+          resources:
+            core_allocation:
+              type: core_count
+              count: 2
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+      p2:
+        policies:
+          resources:
+            core_allocation:
+              type: core_count
+              count: 1
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    )
+    .expect("config should parse");
+    let runtime = test_runtime(&config);
+    register_pipeline(&runtime, &config, "g1", "p1");
+    register_pipeline(&runtime, &config, "g1", "p2");
+
+    let mut p1_core0 =
+        register_runtime_instance(&runtime, "g1", "p1", 0, 0, RuntimeInstanceLifecycle::Active);
+    let mut p1_core1 =
+        register_runtime_instance(&runtime, "g1", "p1", 1, 0, RuntimeInstanceLifecycle::Active);
+    let mut p1_exited = register_runtime_instance(
+        &runtime,
+        "g1",
+        "p1",
+        2,
+        0,
+        RuntimeInstanceLifecycle::Exited(RuntimeInstanceExit::Success),
+    );
+    let mut p2_core0 =
+        register_runtime_instance(&runtime, "g1", "p2", 3, 0, RuntimeInstanceLifecycle::Active);
+
+    let _shutdown = runtime
+        .request_shutdown_pipeline("g1", "p1", 5)
+        .expect("shutdown request should be accepted");
+
+    assert!(matches!(
+        wait_for_shutdown_message(&mut p1_core0),
+        RuntimeControlMsg::Shutdown { reason, .. } if reason == "pipeline shutdown"
+    ));
+    assert!(matches!(
+        wait_for_shutdown_message(&mut p1_core1),
+        RuntimeControlMsg::Shutdown { reason, .. } if reason == "pipeline shutdown"
+    ));
+    assert!(
+        p1_exited.try_recv().is_err(),
+        "exited runtime should not receive shutdown"
+    );
+    assert!(
+        p2_core0.try_recv().is_err(),
+        "other pipelines must not receive shutdown"
+    );
+    let deadline = Instant::now() + Duration::from_secs(2);
+    loop {
+        let state = runtime
+            .state
+            .lock()
+            .unwrap_or_else(|poisoned| poisoned.into_inner());
+        let p1_core0_released = state
+            .runtime_instances
+            .get(&DeployedPipelineKey {
+                pipeline_group_id: "g1".into(),
+                pipeline_id: "p1".into(),
+                core_id: 0,
+                deployment_generation: 0,
+            })
+            .and_then(|instance| instance.control_sender.as_ref())
+            .is_none();
+        let p1_core1_released = state
+            .runtime_instances
+            .get(&DeployedPipelineKey {
+                pipeline_group_id: "g1".into(),
+                pipeline_id: "p1".into(),
+                core_id: 1,
+                deployment_generation: 0,
+            })
+            .and_then(|instance| instance.control_sender.as_ref())
+            .is_none();
+        let p2_core0_retained = state
+            .runtime_instances
+            .get(&DeployedPipelineKey {
+                pipeline_group_id: "g1".into(),
+                pipeline_id: "p2".into(),
+                core_id: 3,
+                deployment_generation: 0,
+            })
+            .and_then(|instance| instance.control_sender.as_ref())
+            .is_some();
+        drop(state);
+
+        if p1_core0_released && p1_core1_released && p2_core0_retained {
+            break;
+        }
+        assert!(
+            Instant::now() < deadline,
+            "timed out waiting for targeted control senders to be released"
+        );
+        thread::sleep(Duration::from_millis(25));
+    }
+}
+
+/// Scenario: global shutdown dispatch encounters a send failure for one
+/// active runtime instance while other active instances still need the signal.
+/// Guarantees: shutdown dispatch is best-effort across the whole snapshot:
+/// every active sender is attempted, successful sends relinquish their retained
+/// control sender, repeated calls do not re-signal instances that already
+/// accepted shutdown, and failures are reported only after the full pass.
+#[test]
+fn request_shutdown_all_attempts_all_active_instances_before_returning_error() {
+    let runtime = test_runtime(&engine_config_with_pipeline(simple_pipeline_yaml()));
+    let key0 = deployed_key("g1", "p1", 0, 0);
+    let key1 = deployed_key("g1", "p1", 1, 0);
+    let key2 = deployed_key("g1", "p1", 2, 0);
+    let (sender0, calls0) = recording_admin_sender(None);
+    let (sender1, calls1) = recording_admin_sender(Some("simulated send failure"));
+    let (sender2, calls2) = recording_admin_sender(None);
+
+    register_runtime_instance_with_sender(
+        &runtime,
+        key0.clone(),
+        sender0,
+        RuntimeInstanceLifecycle::Active,
+    );
+    register_runtime_instance_with_sender(
+        &runtime,
+        key1.clone(),
+        sender1,
+        RuntimeInstanceLifecycle::Active,
+    );
+    register_runtime_instance_with_sender(
+        &runtime,
+        key2.clone(),
+        sender2,
+        RuntimeInstanceLifecycle::Active,
+    );
+
+    let err = runtime
+        .request_shutdown_all(5)
+        .expect_err("shutdown-all should report the failed sender after dispatching all sends");
+
+    assert_eq!(
+        *calls0
+            .lock()
+            .unwrap_or_else(|poisoned| poisoned.into_inner()),
+        vec!["global shutdown".to_owned()]
+    );
+    assert_eq!(
+        *calls1
+            .lock()
+            .unwrap_or_else(|poisoned| poisoned.into_inner()),
+        vec!["global shutdown".to_owned()]
+    );
+    assert_eq!(
+        *calls2
+            .lock()
+            .unwrap_or_else(|poisoned| poisoned.into_inner()),
+        vec!["global shutdown".to_owned()]
+    );
+
+    let ControlPlaneError::Internal { message } = err else {
+        panic!("unexpected shutdown-all error: {err:?}");
+    };
+    assert!(message.contains("g1:p1 core=1 generation=0"));
+    assert!(message.contains("simulated send failure"));
+
+    let state = runtime
+        .state
+        .lock()
+        .unwrap_or_else(|poisoned| poisoned.into_inner());
+    assert!(
+        state
+            .runtime_instances
+            .get(&key0)
+            .and_then(|instance| instance.control_sender.as_ref())
+            .is_none(),
+        "successful shutdown send should release key0 control sender"
+    );
+    assert!(
+        state
+            .runtime_instances
+            .get(&key1)
+            .and_then(|instance| instance.control_sender.as_ref())
+            .is_some(),
+        "failed shutdown send should retain key1 control sender"
+    );
+    assert!(
+        state
+            .runtime_instances
+            .get(&key2)
+            .and_then(|instance| instance.control_sender.as_ref())
+            .is_none(),
+        "successful shutdown send should release key2 control sender"
+    );
+    drop(state);
+
+    // The first pass released the control sender for successful instances, so
+    // a retry should only reattempt the instance whose shutdown send failed.
+    let err = runtime
+        .request_shutdown_all(5)
+        .expect_err("shutdown-all retry should still report the failed sender");
+
+    let ControlPlaneError::Internal { message } = err else {
+        panic!("unexpected shutdown-all retry error: {err:?}");
+    };
+    assert!(message.contains("g1:p1 core=1 generation=0"));
+    assert!(message.contains("simulated send failure"));
+    assert_eq!(
+        *calls0
+            .lock()
+            .unwrap_or_else(|poisoned| poisoned.into_inner()),
+        vec!["global shutdown".to_owned()]
+    );
+    assert_eq!(
+        *calls1
+            .lock()
+            .unwrap_or_else(|poisoned| poisoned.into_inner()),
+        vec!["global shutdown".to_owned(), "global shutdown".to_owned()]
+    );
+    assert_eq!(
+        *calls2
+            .lock()
+            .unwrap_or_else(|poisoned| poisoned.into_inner()),
+        vec!["global shutdown".to_owned()]
+    );
+}
+
+/// Scenario: all targeted runtime instances exit cleanly after a pipeline
+/// shutdown request is accepted.
+/// Guarantees: the shutdown record reaches `succeeded`, tracks per-core
+/// completion, and removes the active shutdown lock for that pipeline.
+#[test]
+fn request_shutdown_pipeline_tracks_completion() {
+    let config = engine_config_with_pipeline(
+        r#"
+        policies:
+          resources:
+            core_allocation:
+              type: core_count
+              count: 2
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+
+    let mut core0 =
+        register_runtime_instance(&runtime, "g1", "p1", 0, 0, RuntimeInstanceLifecycle::Active);
+    let mut core1 =
+        register_runtime_instance(&runtime, "g1", "p1", 1, 0, RuntimeInstanceLifecycle::Active);
+
+    let shutdown = runtime
+        .request_shutdown_pipeline("g1", "p1", 5)
+        .expect("shutdown request should be accepted");
+    assert_eq!(shutdown.state, "pending");
+
+    assert!(matches!(
+        wait_for_shutdown_message(&mut core0),
+        RuntimeControlMsg::Shutdown { reason, .. } if reason == "pipeline shutdown"
+    ));
+    assert!(matches!(
+        wait_for_shutdown_message(&mut core1),
+        RuntimeControlMsg::Shutdown { reason, ..
+        } if reason == "pipeline shutdown"
+    ));
+
+    runtime.note_instance_exit(
+        DeployedPipelineKey {
+            pipeline_group_id: "g1".into(),
+            pipeline_id: "p1".into(),
+            core_id: 0,
+            deployment_generation: 0,
+        },
+        RuntimeInstanceExit::Success,
+    );
+    {
+        let state = runtime
+            .state
+            .lock()
+            .unwrap_or_else(|poisoned| poisoned.into_inner());
+        assert!(
+            state.runtime_instances.contains_key(&DeployedPipelineKey {
+                pipeline_group_id: "g1".into(),
+                pipeline_id: "p1".into(),
+                core_id: 0,
+                deployment_generation: 0,
+            }),
+            "active shutdown should retain exited instances until completion"
+        );
+    }
+    runtime.note_instance_exit(
+        DeployedPipelineKey {
+            pipeline_group_id: "g1".into(),
+            pipeline_id: "p1".into(),
+            core_id: 1,
+            deployment_generation: 0,
+        },
+        RuntimeInstanceExit::Success,
+    );
+
+    let status = wait_for_shutdown_state(&runtime, &shutdown.shutdown_id, "succeeded");
+    assert_eq!(status.cores.len(), 2);
+    assert!(status.cores.iter().all(|core| core.state == "exited"));
+
+    let state = runtime
+        .state
+        .lock()
+        .unwrap_or_else(|poisoned| poisoned.into_inner());
+    assert!(
+        !state
+            .active_shutdowns
+            .contains_key(&PipelineKey::new("g1".into(), "p1".into()))
+    );
+    assert!(!state.runtime_instances.contains_key(&DeployedPipelineKey {
+        pipeline_group_id: "g1".into(),
+        pipeline_id: "p1".into(),
+        core_id: 0,
+        deployment_generation: 0,
+    }));
+    assert!(!state.runtime_instances.contains_key(&DeployedPipelineKey {
+        pipeline_group_id: "g1".into(),
+        pipeline_id: "p1".into(),
+        core_id: 1,
+        deployment_generation: 0,
+    }));
+}
+
+/// Scenario: a pipeline shutdown request is accepted but the targeted
+/// runtime instance never exits before the shutdown deadline.
+/// Guarantees: the shutdown record transitions to `failed`, preserves the
+/// timeout reason, and records the failed per-core state for callers.
+#[test]
+fn request_shutdown_pipeline_tracks_timeout_failure() {
+    let config = engine_config_with_pipeline(
+        r#"
+        nodes:
+          receiver:
+            type: "urn:test:receiver:example"
+            config: null
+          exporter:
+            type: "urn:test:exporter:example"
+            config: null
+        connections:
+          - from: receiver
+            to: exporter
+"#,
+    );
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+
+    let mut core0 =
+        register_runtime_instance(&runtime, "g1", "p1", 0, 0, RuntimeInstanceLifecycle::Active);
+
+    let shutdown = runtime
+        .request_shutdown_pipeline("g1", "p1", 1)
+        .expect("shutdown request should be accepted");
+    assert!(matches!(
+        wait_for_shutdown_message(&mut core0),
+        RuntimeControlMsg::Shutdown { reason, .. } if reason == "pipeline shutdown"
+    ));
+
+    let status = wait_for_shutdown_state(&runtime, &shutdown.shutdown_id, "failed");
+    assert!(
+        status
+            .failure_reason
+            .as_deref()
+            .is_some_and(|reason| reason.contains("timed out waiting"))
+    );
+    assert_eq!(status.cores.len(), 1);
+    assert_eq!(status.cores[0].state, "failed");
+}
+
+/// Scenario: terminal rollout history grows beyond the retention cap for one
+/// logical pipeline while another pipeline also retains rollout history.
+/// Guarantees: eviction is oldest-first and scoped per logical pipeline rather
+/// than dropping unrelated rollout history.
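+///
+/// The cap is `TERMINAL_ROLLOUT_RETENTION_LIMIT` entries per logical
+/// pipeline, so one pipeline churning through rollouts cannot evict another
+/// pipeline's terminal history; by-id lookups for evicted rollouts simply
+/// return nothing.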
+#[test]
+fn terminal_rollout_history_is_bounded_per_pipeline() {
+    let runtime = test_runtime(&engine_config_with_pipeline(simple_pipeline_yaml()));
+    let pipeline_key = PipelineKey::new("g1".into(), "p1".into());
+    let other_pipeline_key = PipelineKey::new("g1".into(), "p2".into());
+
+    let mut state = runtime
+        .state
+        .lock()
+        .unwrap_or_else(|poisoned| poisoned.into_inner());
+    for index in 0..=TERMINAL_ROLLOUT_RETENTION_LIMIT {
+        let rollout_id = format!("rollout-{index}");
+        _ = state.rollouts.insert(
+            rollout_id.clone(),
+            terminal_rollout_record("g1", "p1", &rollout_id),
+        );
+        ControllerRuntime::<()>::record_terminal_rollout_locked(
+            &mut state,
+            &pipeline_key,
+            &rollout_id,
+            Instant::now(),
+        );
+    }
+
+    let other_rollout_id = "rollout-other".to_owned();
+    _ = state.rollouts.insert(
+        other_rollout_id.clone(),
+        terminal_rollout_record("g1", "p2", &other_rollout_id),
+    );
+    ControllerRuntime::<()>::record_terminal_rollout_locked(
+        &mut state,
+        &other_pipeline_key,
+        &other_rollout_id,
+        Instant::now(),
+    );
+
+    assert!(!state.rollouts.contains_key("rollout-0"));
+    assert!(state.rollouts.contains_key("rollout-1"));
+    assert!(state.rollouts.contains_key(&other_rollout_id));
+    assert_eq!(
+        state
+            .terminal_rollouts
+            .get(&pipeline_key)
+            .map(|queue| queue.len()),
+        Some(TERMINAL_ROLLOUT_RETENTION_LIMIT)
+    );
+    assert_eq!(
+        state
+            .terminal_rollouts
+            .get(&other_pipeline_key)
+            .map(|queue| queue.len()),
+        Some(1)
+    );
+}
+
+/// Scenario: terminal shutdown history grows beyond the retention cap for one
+/// logical pipeline while another pipeline also retains shutdown history.
+/// Guarantees: shutdown eviction is oldest-first and scoped per logical
+/// pipeline rather than trimming unrelated shutdown history.
+#[test]
+fn terminal_shutdown_history_is_bounded_per_pipeline() {
+    let runtime = test_runtime(&engine_config_with_pipeline(simple_pipeline_yaml()));
+    let pipeline_key = PipelineKey::new("g1".into(), "p1".into());
+    let other_pipeline_key = PipelineKey::new("g1".into(), "p2".into());
+
+    let mut state = runtime
+        .state
+        .lock()
+        .unwrap_or_else(|poisoned| poisoned.into_inner());
+    for index in 0..=TERMINAL_SHUTDOWN_RETENTION_LIMIT {
+        let shutdown_id = format!("shutdown-{index}");
+        _ = state.shutdowns.insert(
+            shutdown_id.clone(),
+            terminal_shutdown_record("g1", "p1", &shutdown_id),
+        );
+        ControllerRuntime::<()>::record_terminal_shutdown_locked(
+            &mut state,
+            &pipeline_key,
+            &shutdown_id,
+            Instant::now(),
+        );
+    }
+
+    let other_shutdown_id = "shutdown-other".to_owned();
+    _ = state.shutdowns.insert(
+        other_shutdown_id.clone(),
+        terminal_shutdown_record("g1", "p2", &other_shutdown_id),
+    );
+    ControllerRuntime::<()>::record_terminal_shutdown_locked(
+        &mut state,
+        &other_pipeline_key,
+        &other_shutdown_id,
+        Instant::now(),
+    );
+
+    assert!(!state.shutdowns.contains_key("shutdown-0"));
+    assert!(state.shutdowns.contains_key("shutdown-1"));
+    assert!(state.shutdowns.contains_key(&other_shutdown_id));
+    assert_eq!(
+        state
+            .terminal_shutdowns
+            .get(&pipeline_key)
+            .map(|queue| queue.len()),
+        Some(TERMINAL_SHUTDOWN_RETENTION_LIMIT)
+    );
+    assert_eq!(
+        state
+            .terminal_shutdowns
+            .get(&other_pipeline_key)
+            .map(|queue| queue.len()),
+        Some(1)
+    );
+}
+
+/// Scenario: terminal rollout and shutdown ids outlive their retention TTL in
+/// the controller's in-memory history.
+/// Guarantees: history pruning expires those terminal records, and subsequent
+/// by-id lookups return not-found instead of the history growing without
+/// bound.
+#[test]
+fn terminal_operation_history_expires_after_ttl() {
+    let runtime = test_runtime(&engine_config_with_pipeline(simple_pipeline_yaml()));
+    let pipeline_key = PipelineKey::new("g1".into(), "p1".into());
+    let rollout_id = "rollout-old".to_owned();
+    let shutdown_id = "shutdown-old".to_owned();
+    let prune_now = Instant::now()
+        .checked_add(TERMINAL_OPERATION_RETENTION_TTL + Duration::from_secs(2))
+        .expect("synthetic prune deadline should be representable");
+    let expired_at = prune_now
+        .checked_sub(TERMINAL_OPERATION_RETENTION_TTL + Duration::from_secs(1))
+        .expect("synthetic completed_at should be representable");
+
+    {
+        let mut state = runtime
+            .state
+            .lock()
+            .unwrap_or_else(|poisoned| poisoned.into_inner());
+
+        let mut rollout = terminal_rollout_record("g1", "p1", &rollout_id);
+        rollout.completed_at = Some(expired_at);
+        _ = state.rollouts.insert(rollout_id.clone(), rollout);
+        state
+            .terminal_rollouts
+            .entry(pipeline_key.clone())
+            .or_default()
+            .push_back(rollout_id.clone());
+
+        let mut shutdown = terminal_shutdown_record("g1", "p1", &shutdown_id);
+        shutdown.completed_at = Some(expired_at);
+        _ = state.shutdowns.insert(shutdown_id.clone(), shutdown);
+        state
+            .terminal_shutdowns
+            .entry(pipeline_key.clone())
+            .or_default()
+            .push_back(shutdown_id.clone());
+
+        // Use a synthetic future `now` here instead of relying on
+        // `Instant::now() - ttl`, which can underflow on Windows near the
+        // monotonic clock origin.
+        ControllerRuntime::<()>::prune_terminal_operation_history_locked(&mut state, prune_now);
+    }
+
+    assert!(runtime.rollout_status_snapshot(&rollout_id).is_none());
+    assert!(runtime.shutdown_status_snapshot(&shutdown_id).is_none());
+
+    let state = runtime
+        .state
+        .lock()
+        .unwrap_or_else(|poisoned| poisoned.into_inner());
+    assert!(!state.rollouts.contains_key(&rollout_id));
+    assert!(!state.shutdowns.contains_key(&shutdown_id));
+    assert!(!state.terminal_rollouts.contains_key(&pipeline_key));
+    assert!(!state.terminal_shutdowns.contains_key(&pipeline_key));
+}
+
+/// Scenario: an instance exits when there is no active rollout or shutdown for
+/// its logical pipeline.
+/// Guarantees: the controller does not retain that exited runtime instance as
+/// history once no active control-plane operation depends on it.
+#[test]
+fn exited_runtime_instances_without_active_operation_are_pruned_immediately() {
+    let config = engine_config_with_pipeline(simple_pipeline_yaml());
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+    let _rx =
+        register_runtime_instance(&runtime, "g1", "p1", 0, 0, RuntimeInstanceLifecycle::Active);
+
+    let deployed_key = DeployedPipelineKey {
+        pipeline_group_id: "g1".into(),
+        pipeline_id: "p1".into(),
+        core_id: 0,
+        deployment_generation: 0,
+    };
+    runtime.note_instance_exit(deployed_key.clone(), RuntimeInstanceExit::Success);
+
+    let state = runtime
+        .state
+        .lock()
+        .unwrap_or_else(|poisoned| poisoned.into_inner());
+    assert!(!state.runtime_instances.contains_key(&deployed_key));
+}
+
+/// Scenario: a runtime thread reports exit before the controller finishes
+/// registering the launched instance as active.
+/// Guarantees: early exit bookkeeping is reconciled during registration, so
+/// active-instance tracking does not leak and the pending-exit entry is cleared.
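+///
+/// Without this reconciliation the early exit notice would be lost (it
+/// arrives before the instance record exists) and registration would then
+/// count a dead thread as active indefinitely; `pending_instance_exits` is
+/// the buffer that closes that race.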
+#[test]
+fn register_launched_instance_reconciles_early_exit_without_leaking_active_count() {
+    let config = engine_config_with_pipeline(simple_pipeline_yaml());
+    let runtime = test_runtime(&config);
+    register_existing_pipeline(&runtime, &config);
+
+    let deployed_key = deployed_key("g1", "p1", 0, 0);
+    runtime.note_instance_exit(deployed_key.clone(), RuntimeInstanceExit::Success);
+
+    runtime.register_launched_instance(launched_runtime_instance("g1", "p1", 0, 0));
+
+    let state = runtime
+        .state
+        .lock()
+        .unwrap_or_else(|poisoned| poisoned.into_inner());
+    assert_eq!(state.active_instances, 0);
+    assert!(!state.pending_instance_exits.contains_key(&deployed_key));
+    assert!(!state.runtime_instances.contains_key(&deployed_key));
+}
+
+/// Scenario: a completed rollout has advanced the committed active generation,
+/// but observed state still contains the older generation for the same core.
+/// Guarantees: controller cleanup compacts observed state to the selected
+/// active generation so retained instance memory no longer grows with rollout
+/// count after completion.
+#[test]
+fn prune_pipeline_runtime_and_history_compacts_observed_state_to_active_generation() {
+    let config = engine_config_with_pipeline(simple_pipeline_yaml());
+    let runtime = test_runtime(&config);
+    let _runner = ObservedStateRunner::start(&runtime);
+    register_existing_pipeline(&runtime, &config);
+
+    let pipeline_key = PipelineKey::new("g1".into(), "p1".into());
+    report_ready(&runtime, deployed_key("g1", "p1", 0, 0));
+    report_ready(&runtime, deployed_key("g1", "p1", 0, 1));
+    let status = wait_for_observed_status(&runtime, &pipeline_key, |status| {
+        status.per_instance().len() == 2
+    });
+    assert!(status.instance_status(0, 0).is_some());
+    assert!(status.instance_status(0, 1).is_some());
+
+    runtime
+        .observed_state_store
+        .set_pipeline_active_generation(pipeline_key.clone(), 1);
+    runtime.prune_pipeline_runtime_and_history(&pipeline_key);
+
+    let status = wait_for_observed_status(&runtime, &pipeline_key, |status| {
+        status.per_instance().len() == 1
+    });
+    assert!(status.instance_status(0, 1).is_some());
+    assert!(status.instance_status(0, 0).is_none());
+}
+
+/// Scenario: a logical pipeline has fully shut down and observed state still
+/// contains an older generation alongside the final stopped generation.
+/// Guarantees: controller cleanup keeps the last stopped generation per core so
+/// `/status` remains useful after shutdown while superseded generations are
+/// released.
+#[test] +fn prune_pipeline_runtime_and_history_keeps_last_stopped_generation_view() { + let config = engine_config_with_pipeline(simple_pipeline_yaml()); + let runtime = test_runtime(&config); + let _runner = ObservedStateRunner::start(&runtime); + register_existing_pipeline(&runtime, &config); + + let pipeline_key = PipelineKey::new("g1".into(), "p1".into()); + report_stopped(&runtime, deployed_key("g1", "p1", 0, 0)); + report_stopped(&runtime, deployed_key("g1", "p1", 0, 1)); + let status = wait_for_observed_status(&runtime, &pipeline_key, |status| { + status.per_instance().len() == 2 + }); + assert!(status.instance_status(0, 0).is_some()); + assert!(status.instance_status(0, 1).is_some()); + + runtime + .observed_state_store + .set_pipeline_active_generation(pipeline_key.clone(), 1); + runtime.prune_pipeline_runtime_and_history(&pipeline_key); + + let status = wait_for_observed_status(&runtime, &pipeline_key, |status| { + status.per_instance().len() == 1 + }); + assert_eq!(status.total_cores(), 1); + assert_eq!(status.running_cores(), 0); + assert!(matches!( + status + .instance_status(0, 1) + .expect("latest stopped generation should remain") + .phase(), + PipelinePhase::Stopped + )); + assert!(status.instance_status(0, 0).is_none()); +} + +/// Scenario: a pure resize-down retires one core without changing the active +/// generation, and observed state still retains both core instances on that +/// same generation. +/// Guarantees: controller cleanup compacts observed state to the committed +/// active core footprint so `/status` stops counting the drained core as +/// serving after the resize completes. +#[test] +fn prune_pipeline_runtime_and_history_compacts_resize_down_same_generation() { + let config = engine_config_with_pipeline( + r#" + policies: + resources: + core_allocation: + type: core_count + count: 2 + nodes: + receiver: + type: "urn:test:receiver:example" + config: null + exporter: + type: "urn:test:exporter:example" + config: null + connections: + - from: receiver + to: exporter +"#, + ); + let runtime = test_runtime(&config); + let _runner = ObservedStateRunner::start(&runtime); + register_existing_pipeline(&runtime, &config); + + let pipeline_key = PipelineKey::new("g1".into(), "p1".into()); + report_ready(&runtime, deployed_key("g1", "p1", 0, 0)); + report_stopped(&runtime, deployed_key("g1", "p1", 1, 0)); + let status = wait_for_observed_status(&runtime, &pipeline_key, |status| { + status.per_instance().len() == 2 + }); + assert_eq!(status.total_cores(), 2); + assert_eq!(status.running_cores(), 1); + assert!(status.instance_status(0, 0).is_some()); + assert!(status.instance_status(1, 0).is_some()); + + runtime + .observed_state_store + .set_pipeline_active_cores(pipeline_key.clone(), [0]); + runtime.prune_pipeline_runtime_and_history(&pipeline_key); + + let status = wait_for_observed_status(&runtime, &pipeline_key, |status| { + status.per_instance().len() == 1 + }); + assert_eq!(status.total_cores(), 1); + assert_eq!(status.running_cores(), 1); + assert!(status.instance_status(0, 0).is_some()); + assert!(status.instance_status(1, 0).is_none()); +} + +/// Scenario: a runtime instance exits while a shutdown operation for the same +/// logical pipeline is still active and observed state contains overlapping +/// generations. +/// Guarantees: observed state is not compacted early, so controller wait paths +/// can continue reading generation-specific status until the shutdown finishes. 
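+/// The test simulates the in-flight operation by planting a synthetic
+/// "shutdown-0" entry in `active_shutdowns` before reporting the exit.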
+#[test] +fn note_instance_exit_does_not_compact_observed_state_while_shutdown_is_active() { + let config = engine_config_with_pipeline(simple_pipeline_yaml()); + let runtime = test_runtime(&config); + let _runner = ObservedStateRunner::start(&runtime); + register_existing_pipeline(&runtime, &config); + let _rx = + register_runtime_instance(&runtime, "g1", "p1", 0, 0, RuntimeInstanceLifecycle::Active); + + let pipeline_key = PipelineKey::new("g1".into(), "p1".into()); + report_ready(&runtime, deployed_key("g1", "p1", 0, 0)); + report_ready(&runtime, deployed_key("g1", "p1", 0, 1)); + let status = wait_for_observed_status(&runtime, &pipeline_key, |status| { + status.per_instance().len() == 2 + }); + assert!(status.instance_status(0, 0).is_some()); + assert!(status.instance_status(0, 1).is_some()); + + { + let mut state = runtime + .state + .lock() + .unwrap_or_else(|poisoned| poisoned.into_inner()); + let _ = state + .active_shutdowns + .insert(pipeline_key.clone(), "shutdown-0".to_owned()); + } + + runtime.note_instance_exit(deployed_key("g1", "p1", 0, 0), RuntimeInstanceExit::Success); + + let status = wait_for_observed_status(&runtime, &pipeline_key, |status| { + status.per_instance().len() == 2 + }); + assert!(status.instance_status(0, 0).is_some()); + assert!(status.instance_status(0, 1).is_some()); +} + +/// Scenario: a watched runtime thread panics after the runtime instance has +/// already been admitted and marked ready in observed state. +/// Guarantees: the public runtime error message stays short while the recent +/// event stores richer panic diagnostics in `ErrorSummary::source`. +#[test] +fn runtime_thread_panic_populates_error_source_in_observed_status() { + let config = engine_config_with_pipeline(simple_pipeline_yaml()); + let runtime = test_runtime(&config); + let _runner = ObservedStateRunner::start(&runtime); + register_existing_pipeline(&runtime, &config); + + let deployed_key = deployed_key("g1", "p1", 0, 0); + let _rx = + register_runtime_instance(&runtime, "g1", "p1", 0, 0, RuntimeInstanceLifecycle::Active); + report_ready(&runtime, deployed_key.clone()); + + let pipeline_key = PipelineKey::new("g1".into(), "p1".into()); + let _ = wait_for_observed_status(&runtime, &pipeline_key, |status| { + matches!( + status + .instance_status(0, 0) + .map(|instance| instance.phase()), + Some(PipelinePhase::Running) + ) + }); + + runtime.note_instance_exit( + deployed_key, + RuntimeInstanceExit::Error(RuntimeInstanceError::from_panic(PanicReport::capture( + "runtime thread", + Box::new("boom"), + Some("pipeline-g1-p1-core-0".to_owned()), + Some(11), + Some(0), + ))), + ); + + let status = wait_for_observed_status(&runtime, &pipeline_key, |status| { + matches!( + status + .instance_status(0, 0) + .map(|instance| instance.phase()), + Some(PipelinePhase::Failed(_)) + ) + }); + let json = serde_json::to_value(&status).expect("status should serialize"); + let recent_event = &json["instances"][0]["status"]["recentEvents"][0]["Engine"]; + let error = &recent_event["type"]["Error"]["RuntimeError"]["Pipeline"]; + assert_eq!( + recent_event["message"], + "Pipeline encountered a runtime error." 
+ ); + assert_eq!(error["error_kind"], "panic"); + assert_eq!(error["message"], "runtime thread panicked: boom"); + let source = error["source"] + .as_str() + .expect("runtime panic source should be serialized"); + assert!(source.contains("thread_name=pipeline-g1-p1-core-0")); + assert!(source.contains("thread_id=11")); + assert!(source.contains("core_id=0")); + assert!(source.contains("backtrace:")); +} diff --git a/rust/otap-dataflow/crates/core-nodes/src/exporters/topic_exporter/mod.rs b/rust/otap-dataflow/crates/core-nodes/src/exporters/topic_exporter/mod.rs index e2d4102e67..7c522cad97 100644 --- a/rust/otap-dataflow/crates/core-nodes/src/exporters/topic_exporter/mod.rs +++ b/rust/otap-dataflow/crates/core-nodes/src/exporters/topic_exporter/mod.rs @@ -263,6 +263,8 @@ impl TopicExporter { match queue_on_full { TopicQueueOnFullPolicy::Block => { let published = Arc::new(data.clone_without_context()); + // Preserve a cheap uncontended fast path: only retain a blocked + // publish when the topic runtime reports real backpressure. if should_track_end_to_end { let tracked_publisher = tracked_publisher .expect("tracked publisher should exist when ack propagation is auto"); @@ -385,6 +387,10 @@ impl Exporter for TopicExporter { let mut pending_outcomes: FuturesUnordered< Pin + Send>>, > = FuturesUnordered::new(); + // The exporter owns at most one blocked publish at a time. While that + // future is waiting inside the topic runtime, the inbox is switched to + // control-only reads so shutdown stays responsive and ownership of the + // blocked pdata remains unambiguous. let mut blocked_publish: Option = None; let tracked_publisher = (ack_propagation_mode == TopicAckPropagationMode::Auto) .then(|| topic.tracked_publisher()); diff --git a/rust/otap-dataflow/crates/core-nodes/src/receivers/fake_data_generator/mod.rs b/rust/otap-dataflow/crates/core-nodes/src/receivers/fake_data_generator/mod.rs index ff0107f07c..19678fae54 100644 --- a/rust/otap-dataflow/crates/core-nodes/src/receivers/fake_data_generator/mod.rs +++ b/rust/otap-dataflow/crates/core-nodes/src/receivers/fake_data_generator/mod.rs @@ -1068,6 +1068,65 @@ mod tests { .run_validation(drain_validation); } + /// Scenario: receiver-first shutdown reaches the pre-generated hot path + /// while it is sending many batches in one iteration. + /// Guarantees: the generated send loop yields often enough for the outer + /// control select to observe `DrainIngress` promptly instead of timing out + /// behind a long uncapped send burst. 
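+    /// The scenario below drives a large pre-generated burst and then sends
+    /// `DrainIngress` with a 5 s deadline; the test only passes if the
+    /// receiver yields out of its send loop and honors the drain well inside
+    /// that window.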
+    #[test]
+    fn test_drain_ingress_exits_promptly_during_high_throughput_send_loop() {
+        let test_runtime = TestRuntime::new();
+
+        let registry_path = VirtualDirectoryPath::GitRepo {
+            url: "https://github.com/open-telemetry/semantic-conventions.git".to_owned(),
+            sub_folder: Some("model".to_owned()),
+            refspec: None,
+        };
+
+        let traffic_config = TrafficConfig::new(Some(1000), None, 1, 0, 0, 1);
+        let config = Config::new(traffic_config, registry_path)
+            .with_data_source(DataSource::Static)
+            .with_generation_strategy(GenerationStrategy::PreGenerated);
+
+        let node_config = Arc::new(NodeUserConfig::new_receiver_config(
+            OTAP_FAKE_DATA_GENERATOR_URN,
+        ));
+        let telemetry_registry_handle = TelemetryRegistryHandle::new();
+        let controller_ctx = ControllerContext::new(telemetry_registry_handle.clone());
+        let pipeline_ctx =
+            controller_ctx.pipeline_context_with("grp".into(), "pipeline".into(), 0, 1, 0);
+        let receiver = ReceiverWrapper::local(
+            FakeGeneratorReceiver::new(pipeline_ctx, config),
+            test_node("fake_receiver_hot_drain"),
+            node_config,
+            test_runtime.config(),
+        );
+
+        let drain_scenario =
+            move |ctx: TestContext| -> Pin<Box<dyn Future<Output = ()>>> {
+                Box::pin(async move {
+                    sleep(Duration::from_millis(200)).await;
+                    let deadline = std::time::Instant::now() + Duration::from_secs(5);
+                    ctx.send_control_msg(NodeControlMsg::DrainIngress {
+                        deadline,
+                        reason: "test hot drain".to_owned(),
+                    })
+                    .await
+                    .expect("Failed to send DrainIngress");
+                })
+            };
+
+        let drain_validation =
+            |_ctx: NotSendValidateContext| -> Pin<Box<dyn Future<Output = ()>>> {
+                Box::pin(async {})
+            };
+
+        test_runtime
+            .set_receiver(receiver)
+            .run_test(drain_scenario)
+            .run_validation(drain_validation);
+    }
+
     /// Regression test: verifies that a non-terminal control message
     /// (CollectTelemetry) arriving during the rate-limit sleep does NOT
     /// break the sleep early – the receiver should still respect the
diff --git a/rust/otap-dataflow/crates/core-nodes/src/receivers/topic_receiver/mod.rs b/rust/otap-dataflow/crates/core-nodes/src/receivers/topic_receiver/mod.rs
index feb1766256..a770b9ff9e 100644
--- a/rust/otap-dataflow/crates/core-nodes/src/receivers/topic_receiver/mod.rs
+++ b/rust/otap-dataflow/crates/core-nodes/src/receivers/topic_receiver/mod.rs
@@ -255,6 +255,13 @@ impl local::Receiver for TopicReceiver {
         );
         let mut draining_deadline: Option<Instant> = None;
         let mut draining_reason: Option<String> = None;
+        // These represent two different handoff stages:
+        // - `pending_forward` is one permitted topic delivery that is still
+        //   trying to enter the downstream pipeline and therefore must not be
+        //   committed yet.
+        // - `pending_tracked_message_ids` are deliveries that were already
+        //   forwarded and committed locally, but whose downstream Ack/Nack has
+        //   not been bridged back to the topic runtime yet.
         let mut pending_tracked_message_ids = HashSet::new();
         let mut pending_forward: Option = None;
 
@@ -659,6 +666,12 @@ impl local::Receiver for TopicReceiver {
                     let send_started_at = Instant::now();
                     match effect_handler.try_send_message_with_source_node(pdata) {
                         Ok(()) => {
+                            // Commit the topic delivery permit only after
+                            // the downstream pipeline accepts the message.
+                            // That keeps drain precise: an unadmitted
+                            // message can still be aborted locally
+                            // instead of turning into tracked
+                            // in-flight work.
delivery.commit(); if let Some(message_id) = tracked_message_id { _ = pending_tracked_message_ids.insert(message_id); diff --git a/rust/otap-dataflow/crates/engine/src/attributes.rs b/rust/otap-dataflow/crates/engine/src/attributes.rs index 303ce09217..761816b34d 100644 --- a/rust/otap-dataflow/crates/engine/src/attributes.rs +++ b/rust/otap-dataflow/crates/engine/src/attributes.rs @@ -91,6 +91,10 @@ pub struct PipelineAttributeSet { /// Pipeline group identifier. #[attribute] pub pipeline_group_id: Cow<'static, str>, + + /// Deployment generation for this runtime instance. + #[attribute] + pub deployment_generation: u64, } /// Node attributes. diff --git a/rust/otap-dataflow/crates/engine/src/context.rs b/rust/otap-dataflow/crates/engine/src/context.rs index 0e42692a2b..2eb733173d 100644 --- a/rust/otap-dataflow/crates/engine/src/context.rs +++ b/rust/otap-dataflow/crates/engine/src/context.rs @@ -126,6 +126,7 @@ pub struct PipelineContextParams { pub struct PipelineContext { controller_context: ControllerContext, pipeline_context_params: PipelineContextParams, + deployment_generation: u64, pipeline_telemetry_attrs: HashMap, node_id: ConfigNodeId, node_urn: NodeUrn, @@ -169,7 +170,28 @@ impl ControllerContext { num_cores: usize, thread_id: usize, ) -> PipelineContext { - PipelineContext::new( + self.pipeline_context_with_generation( + pipeline_group_id, + pipeline_id, + core_id, + num_cores, + thread_id, + 0, + ) + } + + /// Returns a new pipeline context with an explicit deployment generation. + #[must_use] + pub fn pipeline_context_with_generation( + &self, + pipeline_group_id: PipelineGroupId, + pipeline_id: PipelineId, + core_id: usize, + num_cores: usize, + thread_id: usize, + deployment_generation: u64, + ) -> PipelineContext { + PipelineContext::new_with_generation( self.clone(), PipelineContextParams { pipeline_group_id, @@ -178,6 +200,7 @@ impl ControllerContext { num_cores, thread_id, }, + deployment_generation, ) } @@ -212,13 +235,24 @@ impl ControllerContext { impl PipelineContext { /// Creates a new `PipelineContext`. + #[allow(dead_code)] pub(crate) fn new( parent_ctx: ControllerContext, pipeline_context_params: PipelineContextParams, + ) -> Self { + Self::new_with_generation(parent_ctx, pipeline_context_params, 0) + } + + /// Creates a new `PipelineContext` with an explicit deployment generation. + pub(crate) fn new_with_generation( + parent_ctx: ControllerContext, + pipeline_context_params: PipelineContextParams, + deployment_generation: u64, ) -> Self { Self { controller_context: parent_ctx, pipeline_context_params, + deployment_generation, node_id: Default::default(), node_urn: Default::default(), node_kind: Default::default(), @@ -248,6 +282,12 @@ impl PipelineContext { self.pipeline_context_params.core_id } + /// Returns the deployment generation associated with this pipeline runtime. + #[must_use] + pub const fn deployment_generation(&self) -> u64 { + self.deployment_generation + } + /// Returns the total number of cores allocated to this pipeline. 
/// /// This is useful for nodes that need to share resources (like disk budgets) @@ -451,6 +491,7 @@ impl PipelineContext { engine_attrs: self.engine_attribute_set(), pipeline_id: self.pipeline_context_params.pipeline_id.clone(), pipeline_group_id: self.pipeline_context_params.pipeline_group_id.clone(), + deployment_generation: self.deployment_generation, } } @@ -542,6 +583,7 @@ impl PipelineContext { Self { controller_context: self.controller_context.clone(), pipeline_context_params: self.pipeline_context_params.clone(), + deployment_generation: self.deployment_generation, pipeline_telemetry_attrs: self.pipeline_telemetry_attrs.clone(), node_id, node_urn, diff --git a/rust/otap-dataflow/crates/engine/src/local/message.rs b/rust/otap-dataflow/crates/engine/src/local/message.rs index c6fa64bc5f..c56e6b0a3a 100644 --- a/rust/otap-dataflow/crates/engine/src/local/message.rs +++ b/rust/otap-dataflow/crates/engine/src/local/message.rs @@ -255,4 +255,14 @@ impl LocalReceiver { LocalReceiverInner::Mpmc(receiver) => receiver.is_empty(), } } + + /// Returns `true` once all senders are gone or the channel has been + /// explicitly closed. + #[must_use] + pub fn is_closed(&self) -> bool { + match &self.inner { + LocalReceiverInner::Mpsc(receiver) => receiver.is_closed(), + LocalReceiverInner::Mpmc(receiver) => receiver.is_closed(), + } + } } diff --git a/rust/otap-dataflow/crates/engine/src/message.rs b/rust/otap-dataflow/crates/engine/src/message.rs index 9957fedabf..39acff3ea1 100644 --- a/rust/otap-dataflow/crates/engine/src/message.rs +++ b/rust/otap-dataflow/crates/engine/src/message.rs @@ -182,6 +182,15 @@ impl Receiver { Receiver::Shared(receiver) => receiver.is_empty(), } } + + /// Checks whether the receive side has observed channel closure. + #[must_use] + pub fn is_closed(&self) -> bool { + match self { + Receiver::Local(receiver) => receiver.is_closed(), + Receiver::Shared(receiver) => receiver.is_closed(), + } + } } /// Small private adapter trait used by [`InboxCore`]. @@ -201,6 +210,8 @@ trait ChannelReceiver { fn try_recv(&mut self) -> Result; fn is_empty(&self) -> bool; + + fn is_closed(&self) -> bool; } impl ChannelReceiver for Receiver { @@ -215,6 +226,10 @@ impl ChannelReceiver for Receiver { fn is_empty(&self) -> bool { Receiver::is_empty(self) } + + fn is_closed(&self) -> bool { + Receiver::is_closed(self) + } } impl ChannelReceiver for SharedReceiver { @@ -229,6 +244,10 @@ impl ChannelReceiver for SharedReceiver { fn is_empty(&self) -> bool { SharedReceiver::is_empty(self) } + + fn is_closed(&self) -> bool { + SharedReceiver::is_closed(self) + } } /// Shutdown-drain policy for [`InboxCore::recv_with_policy`]. @@ -344,10 +363,14 @@ where } fn shutdown_drain_complete(&self) -> bool { - self.pdata_rx - .as_ref() - .expect("pdata_rx must exist") - .is_empty() + let pdata_rx = self.pdata_rx.as_ref().expect("pdata_rx must exist"); + // Shutdown may only be released once no upstream sender can still + // deliver more work into this inbox. Queue emptiness alone is not + // sufficient because an upstream node can still finish one already + // admitted message outside the channel and send it after we observe an + // empty buffer. + pdata_rx.is_closed() + && pdata_rx.is_empty() && self .local_scheduler .as_ref() @@ -1095,4 +1118,58 @@ mod tests { Message::Control(NodeControlMsg::Shutdown { .. 
})
        ));
    }
+
+    /// Scenario: an exporter has latched shutdown, its bounded inbox is
+    /// temporarily empty, but an upstream sender is still alive and may still
+    /// forward one already-admitted message later.
+    /// Guarantees: the exporter does not release the latched shutdown on queue
+    /// emptiness alone; it stays alive until that late pdata arrives and the
+    /// upstream channel closes.
+    #[tokio::test]
+    async fn exporter_inbox_waits_for_upstream_closure_before_shutdown() {
+        let (control_tx, control_rx) = mpsc::Channel::<NodeControlMsg<TestMsg>>::new(16);
+        let (pdata_tx, pdata_rx) = mpsc::Channel::<TestMsg>::new(16);
+        let mut inbox = ExporterInbox::new(
+            Receiver::Local(LocalReceiver::mpsc(control_rx)),
+            Receiver::Local(LocalReceiver::mpsc(pdata_rx)),
+            9,
+            Interests::empty(),
+        );
+
+        control_tx
+            .send_async(NodeControlMsg::Shutdown {
+                deadline: Instant::now() + Duration::from_secs(1),
+                reason: "shutdown".to_owned(),
+            })
+            .await
+            .expect("shutdown should enqueue");
+
+        let pending = tokio::time::timeout(Duration::from_millis(20), inbox.recv_when(false)).await;
+        assert!(
+            pending.is_err(),
+            "shutdown should stay latched while upstream senders can still deliver pdata"
+        );
+
+        pdata_tx
+            .send_async(TestMsg::new("late"))
+            .await
+            .expect("late pdata should enqueue");
+
+        let drained = tokio::time::timeout(Duration::from_millis(50), inbox.recv_when(false))
+            .await
+            .expect("late pdata should unblock the exporter inbox")
+            .expect("late pdata should drain");
+        assert!(matches!(drained, Message::PData(TestMsg(ref value)) if value == "late"));
+
+        drop(pdata_tx);
+
+        let shutdown = tokio::time::timeout(Duration::from_millis(50), inbox.recv_when(false))
+            .await
+            .expect("shutdown should follow once upstream closes")
+            .expect("shutdown control should arrive");
+        assert!(matches!(
+            shutdown,
+            Message::Control(NodeControlMsg::Shutdown { .. })
+        ));
+    }
 }
diff --git a/rust/otap-dataflow/crates/engine/src/pipeline_ctrl.rs b/rust/otap-dataflow/crates/engine/src/pipeline_ctrl.rs
index 60827c7a35..49b209a54b 100644
--- a/rust/otap-dataflow/crates/engine/src/pipeline_ctrl.rs
+++ b/rust/otap-dataflow/crates/engine/src/pipeline_ctrl.rs
@@ -1332,6 +1332,7 @@ mod tests {
                 pipeline_group_id,
                 pipeline_id,
                 core_id,
+                deployment_generation: 0,
             },
             pipeline_context,
             pipeline_rx,
@@ -1813,6 +1814,7 @@ mod tests {
            pipeline_group_id: pipeline_group_id.clone(),
            pipeline_id: pipeline_id.clone(),
            core_id,
+            deployment_generation: 0,
        };
        let controller_context = ControllerContext::new(metrics_system.registry());
        let pipeline_context_params = PipelineContextParams {
@@ -3175,6 +3177,7 @@ mod tests {
                 pipeline_group_id,
                 pipeline_id,
                 core_id: 0,
+                deployment_generation: 0,
             },
             pipeline_context.clone(),
             pipeline_rx,
@@ -3416,6 +3419,7 @@ mod tests {
                 pipeline_group_id,
                 pipeline_id,
                 core_id: 0,
+                deployment_generation: 0,
             },
             pipeline_context,
             pipeline_rx,
@@ -3492,6 +3496,7 @@ mod tests {
                 pipeline_group_id,
                 pipeline_id,
                 core_id: 0,
+                deployment_generation: 0,
             },
             pipeline_context,
             pipeline_rx,
diff --git a/rust/otap-dataflow/crates/engine/src/shared/message.rs b/rust/otap-dataflow/crates/engine/src/shared/message.rs
index 5722e307c7..23a6139227 100644
--- a/rust/otap-dataflow/crates/engine/src/shared/message.rs
+++ b/rust/otap-dataflow/crates/engine/src/shared/message.rs
@@ -267,6 +267,16 @@ impl SharedReceiver {
             SharedReceiverInner::Mpmc(receiver) => receiver.is_empty(),
         }
     }
+
+    /// Returns `true` once all senders are gone or the channel has been
+    /// explicitly closed.
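+    ///
+    /// A closed channel can still hold buffered items, so shutdown paths
+    /// should pair this with `is_empty` (as `InboxCore::shutdown_drain_complete`
+    /// does) before releasing a latched shutdown.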
+ #[must_use] + pub fn is_closed(&self) -> bool { + match &self.inner { + SharedReceiverInner::Mpsc(receiver) => receiver.is_closed(), + SharedReceiverInner::Mpmc(receiver) => receiver.is_disconnected(), + } + } } #[cfg(test)] diff --git a/rust/otap-dataflow/crates/engine/src/testing/dst/common.rs b/rust/otap-dataflow/crates/engine/src/testing/dst/common.rs index b53f9530e7..14276dc1e3 100644 --- a/rust/otap-dataflow/crates/engine/src/testing/dst/common.rs +++ b/rust/otap-dataflow/crates/engine/src/testing/dst/common.rs @@ -139,6 +139,7 @@ pub(super) fn build_manager( pipeline_group_id, pipeline_id, core_id: 0, + deployment_generation: 0, }, pipeline_context.clone(), pipeline_rx, diff --git a/rust/otap-dataflow/crates/engine/src/testing/exporter.rs b/rust/otap-dataflow/crates/engine/src/testing/exporter.rs index 6c239ad38d..dbf043ba86 100644 --- a/rust/otap-dataflow/crates/engine/src/testing/exporter.rs +++ b/rust/otap-dataflow/crates/engine/src/testing/exporter.rs @@ -40,7 +40,7 @@ pub struct TestContext { /// Sender for control messages control_tx: Sender>, /// Sender for pipeline data - pdata_tx: Sender, + pdata_tx: Option>, /// Message counter for tracking processed messages counters: CtrlMsgCounters, /// Receiver for runtime control messages @@ -71,7 +71,7 @@ impl TestContext { ) -> Self { Self { control_tx, - pdata_tx, + pdata_tx: Some(pdata_tx), counters, runtime_ctrl_msg_receiver: None, pipeline_completion_msg_receiver: None, @@ -142,7 +142,11 @@ impl TestContext { /// /// Returns an error if the message could not be sent. pub async fn send_pdata(&self, content: PData) -> Result<(), SendError> { - self.pdata_tx.send(content).await + self.pdata_tx + .as_ref() + .expect("pdata sender must exist during the active test phase") + .send(content) + .await } /// Sleeps for the specified duration. @@ -362,10 +366,15 @@ impl ValidationPhase { let ValidationPhase { rt, local_tasks, - context, + mut context, run_exporter_handle, } = self; + // Validation does not drive new pdata. Drop the harness-side sender + // clone before waiting for exporter shutdown so tests do not keep the + // exporter input channel artificially open after the scenario finishes. + let _ = context.pdata_tx.take(); + // First run all the spawned tasks to completion rt.block_on(local_tasks); diff --git a/rust/otap-dataflow/crates/engine/src/topic/topic.rs b/rust/otap-dataflow/crates/engine/src/topic/topic.rs index 806e77a5ad..5503aac5b9 100644 --- a/rust/otap-dataflow/crates/engine/src/topic/topic.rs +++ b/rust/otap-dataflow/crates/engine/src/topic/topic.rs @@ -692,8 +692,10 @@ impl MixedTopic { } // Acquire balanced-group capacity atomically from the publisher's point of - // view: if one group is full, drop any partial acquisitions before waiting - // and retry from a fresh mixed-topic snapshot. + // view. Mixed topics are intentionally all-or-nothing across balanced and + // broadcast delivery, so publishers must not keep permits reserved in + // "fast" groups while they wait on a "slow" one. Drop any partial + // acquisitions before awaiting capacity and retry from a fresh snapshot. async fn acquire_balanced_admission( &self, ) -> Result<(Arc<[Arc>]>, BalancedPermitVec), Error> { @@ -750,6 +752,9 @@ impl MixedTopic { } } + // Broadcast is intentionally last so mixed async publish has the same + // visible contract as mixed try_publish: no broadcast subscriber can + // observe a message before the balanced side has admitted it. 
self.broadcast_ring.publish(Envelope { id, tracked: false, @@ -772,6 +777,9 @@ impl MixedTopic { let id = self.next_message_id(); if self.has_balanced_groups.load(Ordering::Acquire) { let (groups, permits) = self.acquire_balanced_admission().await?; + // Tracked publish capacity is consumed only after admission + // succeeds. Waiting on a full balanced group must not hand back a + // receipt for a message that has not been accepted anywhere yet. let receipt = self.outcomes.register(id, timeout, permit); let envelope = Envelope { id, @@ -821,6 +829,8 @@ impl MixedTopic { }; let (permits, blocking_group) = Self::try_acquire_balanced_admission(&groups)?; if blocking_group.is_some() { + // Keep try_publish all-or-nothing for mixed topics: if balanced + // admission fails, nothing is published to broadcast either. Ok((PublishOutcome::DroppedOnFull, id)) } else { for (group, permit) in groups.as_ref().iter().zip(permits) { diff --git a/rust/otap-dataflow/crates/otap/tests/core_node_liveness_tests.rs b/rust/otap-dataflow/crates/otap/tests/core_node_liveness_tests.rs index e0c45127dd..2a39bfe532 100644 --- a/rust/otap-dataflow/crates/otap/tests/core_node_liveness_tests.rs +++ b/rust/otap-dataflow/crates/otap/tests/core_node_liveness_tests.rs @@ -231,6 +231,7 @@ fn run_pipeline_with_condition( pipeline_group_id: pipeline_group_id.clone(), pipeline_id: pipeline_id.clone(), core_id: 0, + deployment_generation: 0, }; let metrics_reporter = telemetry_system.reporter(); let event_reporter = observed_state_store.reporter(SendPolicy::default()); @@ -365,6 +366,7 @@ where pipeline_group_id: pipeline_group_id.clone(), pipeline_id: pipeline_id.clone(), core_id: 0, + deployment_generation: 0, }; let metrics_reporter = telemetry_system.reporter(); let event_reporter = observed_state_store.reporter(SendPolicy::default()); diff --git a/rust/otap-dataflow/crates/otap/tests/durable_buffer_processor_tests.rs b/rust/otap-dataflow/crates/otap/tests/durable_buffer_processor_tests.rs index a7ffbca265..021c896ebf 100644 --- a/rust/otap-dataflow/crates/otap/tests/durable_buffer_processor_tests.rs +++ b/rust/otap-dataflow/crates/otap/tests/durable_buffer_processor_tests.rs @@ -510,6 +510,7 @@ where pipeline_group_id: pipeline_group_id.clone(), pipeline_id: pipeline_id.clone(), core_id: 0, + deployment_generation: 0, }; // Create a metrics reporter with our own receiver so we can inspect metrics. 
// Use a very large channel so it never overflows, even on extremely slow CI @@ -825,6 +826,7 @@ where pipeline_group_id: pipeline_group_id.clone(), pipeline_id: pipeline_id.clone(), core_id: 0, + deployment_generation: 0, }; let metrics_reporter = telemetry_system.reporter(); let event_reporter = observed_state_store.reporter(SendPolicy::default()); diff --git a/rust/otap-dataflow/crates/otap/tests/pipeline_tests.rs b/rust/otap-dataflow/crates/otap/tests/pipeline_tests.rs index dc3d111cb2..c308b37e70 100644 --- a/rust/otap-dataflow/crates/otap/tests/pipeline_tests.rs +++ b/rust/otap-dataflow/crates/otap/tests/pipeline_tests.rs @@ -87,6 +87,7 @@ fn test_telemetry_registries_cleanup() { pipeline_group_id, pipeline_id, core_id: 0, + deployment_generation: 0, }; let metrics_reporter = telemetry_system.reporter(); let event_reporter = observed_state_store.reporter(SendPolicy::default()); diff --git a/rust/otap-dataflow/crates/state/src/pipeline_rt_status.rs b/rust/otap-dataflow/crates/state/src/pipeline_rt_status.rs index 54acc4d79a..8b2fd73339 100644 --- a/rust/otap-dataflow/crates/state/src/pipeline_rt_status.rs +++ b/rust/otap-dataflow/crates/state/src/pipeline_rt_status.rs @@ -269,24 +269,33 @@ impl PipelineRuntimeStatus { } } + /// Returns the current Accepted condition for this runtime instance. #[must_use] - pub(crate) const fn accepted_condition(&self) -> &ConditionState { + pub const fn accepted_condition(&self) -> &ConditionState { &self.accepted_condition } #[must_use] - pub(crate) const fn ready_condition(&self) -> &ConditionState { + /// Returns the current Ready condition for this runtime instance. + pub const fn ready_condition(&self) -> &ConditionState { &self.ready_condition } #[must_use] - pub(crate) fn conditions(&self) -> [Condition; 2] { + /// Returns the serialized condition pair for this runtime instance. + pub fn conditions(&self) -> [Condition; 2] { [ Condition::from_state(ConditionKind::Accepted, &self.accepted_condition), Condition::from_state(ConditionKind::Ready, &self.ready_condition), ] } + /// Returns the current phase for this runtime instance. + #[must_use] + pub const fn phase(&self) -> PipelinePhase { + self.phase + } + /// Apply a single observed event to this pipeline. /// Returns what changed (if anything) or an Error::InvalidTransition. 
pub(crate) fn apply(&mut self, event_type: EventType) -> Result { @@ -382,6 +391,9 @@ impl PipelineRuntimeStatus { (PipelinePhase::Draining, EventType::Error(ErrEv::DrainError(_))) => { self.goto(PipelinePhase::Failed(FailReason::DrainError)) } + (PipelinePhase::Draining, EventType::Error(ErrEv::RuntimeError(_))) => { + self.goto(PipelinePhase::Failed(FailReason::RuntimeError)) + } (PipelinePhase::Draining, EventType::Request(Req::DeleteRequested)) => { if !self.delete_pending { self.delete_pending = true; @@ -680,6 +692,29 @@ mod tests { assert_eq!(p2.phase, PipelinePhase::Failed(FailReason::DeleteError)); } + #[test] + fn draining_runtime_error_becomes_terminal_failure() { + let mut p = PipelineRuntimeStatus::default(); + p.apply_many([ + EventType::Success(OkEv::Admitted), + EventType::Success(OkEv::Ready), + EventType::Request(Req::ShutdownRequested), + ]) + .unwrap(); + + _ = p + .apply(EventType::Error(ErrEv::RuntimeError( + ErrorSummary::Pipeline { + error_kind: "".to_string(), + message: "late send failed during shutdown".to_string(), + source: None, + }, + ))) + .unwrap(); + + assert_eq!(p.phase, PipelinePhase::Failed(FailReason::RuntimeError)); + } + #[test] fn invalid_transition_is_error() { let mut p = PipelineRuntimeStatus::default(); // Pending diff --git a/rust/otap-dataflow/crates/state/src/pipeline_status.rs b/rust/otap-dataflow/crates/state/src/pipeline_status.rs index 1c281a69f8..230bdb3797 100644 --- a/rust/otap-dataflow/crates/state/src/pipeline_status.rs +++ b/rust/otap-dataflow/crates/state/src/pipeline_status.rs @@ -1,7 +1,7 @@ // Copyright The OpenTelemetry Authors // SPDX-License-Identifier: Apache-2.0 -//! Observed pipeline status and aggregation logic per core. +//! Observed pipeline status and aggregation logic per runtime instance. use crate::conditions::{ Condition, ConditionKind, ConditionReason, ConditionState, ConditionStatus, @@ -12,15 +12,81 @@ use otap_df_config::CoreId; use otap_df_config::health::{HealthPolicy, PhaseKind, Quorum}; use serde::Serialize; use serde::ser::SerializeStruct; -use std::collections::HashMap; +use std::collections::{HashMap, HashSet}; use std::time::SystemTime; -/// Aggregated, controller-synthesized view for a pipeline across all targeted -/// cores. This is what external APIs will return for `status`. +/// Unique runtime-instance key for a logical pipeline. +#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)] +pub struct RuntimeInstanceKey { + /// CPU core hosting the runtime instance. + pub core_id: CoreId, + /// Deployment generation for this runtime instance. + pub deployment_generation: u64, +} + +/// Rollout state summary exposed on pipeline status snapshots. +#[derive(Debug, Clone, Serialize, PartialEq, Eq)] +#[serde(rename_all = "snake_case")] +pub enum PipelineRolloutState { + /// Rollout has been accepted but work has not started yet. + Pending, + /// Rollout is actively applying changes. + Running, + /// Rollout completed successfully and the target generation is serving. + Succeeded, + /// Rollout failed before completion. + Failed, + /// Automatic rollback is in progress. + RollingBack, + /// Rollback could not restore a fully healthy serving set. + RollbackFailed, +} + +/// Lightweight rollout summary embedded into `/status` pipeline payloads. +#[derive(Debug, Clone, Serialize, PartialEq, Eq)] +#[serde(rename_all = "camelCase")] +pub struct PipelineRolloutSummary { + /// Controller-assigned rollout identifier. + pub rollout_id: String, + /// Current rollout lifecycle state. 
+    pub state: PipelineRolloutState,
+    /// Candidate generation being rolled out.
+    pub target_generation: u64,
+    /// RFC3339 timestamp for rollout creation.
+    pub started_at: String,
+    /// RFC3339 timestamp for the latest rollout state transition.
+    pub updated_at: String,
+    /// Human-readable failure or rollback reason when present.
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub failure_reason: Option<String>,
+}
+
+#[derive(Debug, Serialize)]
+#[serde(rename_all = "camelCase")]
+struct RuntimeInstanceStatusView<'a> {
+    core_id: CoreId,
+    deployment_generation: u64,
+    status: &'a PipelineRuntimeStatus,
+}
+
+/// Aggregated, controller-synthesized view for a logical pipeline.
 #[derive(Debug, Clone)]
 pub struct PipelineStatus {
-    /// Per-core details to aid debugging and aggregation.
-    pub(crate) cores: HashMap<CoreId, PipelineRuntimeStatus>,
+    /// Per-instance details to aid debugging and overlap-aware generation aggregation.
+    pub(crate) instances: HashMap<RuntimeInstanceKey, PipelineRuntimeStatus>,
+
+    /// Serving generation selected per core by the controller during rollout.
+    pub(crate) serving_generations: HashMap<CoreId, u64>,
+
+    /// Last committed generation for this logical pipeline.
+    pub(crate) active_generation: Option<u64>,
+
+    /// Committed core footprint for the active generation when no rollout-specific
+    /// per-core serving override is active.
+    pub(crate) active_cores: HashSet<CoreId>,
+
+    /// Optional rollout summary for UI/API consumers.
+    pub(crate) rollout: Option<PipelineRolloutSummary>,
 
     health_policy: HealthPolicy,
 }
@@ -28,40 +94,111 @@ impl PipelineStatus {
     pub(crate) fn new(health_policy: HealthPolicy) -> Self {
         Self {
-            cores: HashMap::new(),
+            instances: HashMap::new(),
+            serving_generations: HashMap::new(),
+            active_generation: None,
+            active_cores: HashSet::new(),
+            rollout: None,
             health_policy,
         }
     }
 
-    /// Returns the current per-core status map.
+    /// Returns the current per-instance status map.
     #[must_use]
-    pub const fn per_core(&self) -> &HashMap<CoreId, PipelineRuntimeStatus> {
-        &self.cores
+    pub const fn per_instance(&self) -> &HashMap<RuntimeInstanceKey, PipelineRuntimeStatus> {
+        &self.instances
+    }
+
+    /// Returns the current serving generation map keyed by core.
+    #[must_use]
+    pub const fn serving_generations(&self) -> &HashMap<CoreId, u64> {
+        &self.serving_generations
+    }
+
+    /// Returns the committed active generation, if known.
+    #[must_use]
+    pub const fn active_generation(&self) -> Option<u64> {
+        self.active_generation
+    }
+
+    /// Returns the status for one observed runtime instance generation on a core.
+    #[must_use]
+    pub fn instance_status(
+        &self,
+        core_id: CoreId,
+        deployment_generation: u64,
+    ) -> Option<&PipelineRuntimeStatus> {
+        self.instances.get(&RuntimeInstanceKey {
+            core_id,
+            deployment_generation,
+        })
+    }
+
+    /// Records the committed active generation for this logical pipeline.
+    pub(crate) fn set_active_generation(&mut self, generation: u64) {
+        self.active_generation = Some(generation);
+    }
+
+    /// Records the committed serving core footprint for the active generation.
+    pub(crate) fn set_active_cores<I>(&mut self, core_ids: I)
+    where
+        I: IntoIterator<Item = CoreId>,
+    {
+        self.active_cores = core_ids.into_iter().collect();
+    }
+
+    /// Pins the serving generation chosen for one logical core.
+    pub(crate) fn set_serving_generation(&mut self, core_id: CoreId, generation: u64) {
+        _ = self.serving_generations.insert(core_id, generation);
+    }
+
+    /// Removes the serving-generation override for one logical core.
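+    /// After the override is gone, `selected_runtime_keys` falls back to the
+    /// committed active generation (or the highest observed generation) for
+    /// this core.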
+ pub(crate) fn clear_serving_generation(&mut self, core_id: CoreId) { + let _ = self.serving_generations.remove(&core_id); + } + + /// Stores the rollout summary currently exposed for this pipeline. + pub(crate) fn set_rollout_summary(&mut self, rollout: PipelineRolloutSummary) { + self.rollout = Some(rollout); + } + + /// Clears the rollout summary once no rollout is active anymore. + pub(crate) fn clear_rollout_summary(&mut self) { + self.rollout = None; + } + + /// Compacts retained runtime instances down to the generations currently + /// selected for status aggregation. + pub(crate) fn compact_instances_to_selected(&mut self) { + let retained: HashSet<_> = self.selected_runtime_keys().into_iter().collect(); + self.instances.retain(|key, _| retained.contains(key)); } #[must_use] - /// Returns the number of cores currently tracked for this pipeline. + /// Returns the number of currently serving cores for this logical pipeline. pub fn total_cores(&self) -> usize { - self.cores.len() + self.selected_runtimes().len() } #[must_use] - /// Returns how many cores are presently in the running phase. + /// Returns how many serving cores are presently in the running phase. pub fn running_cores(&self) -> usize { - self.cores - .values() - .filter(|c| matches!(c.phase, PipelinePhase::Running)) + self.selected_runtimes() + .into_iter() + .filter(|(_, runtime)| matches!(runtime.phase, PipelinePhase::Running)) .count() } #[must_use] - /// Returns true if all cores have reached a terminal state (Stopped, Deleted, Failed, or Rejected). - /// Returns false if there are no cores tracked or if any core is still active. + /// Returns true if all observed runtime instances have reached a terminal state. pub fn is_terminated(&self) -> bool { - if self.cores.is_empty() { + if self.instances.is_empty() { return false; } - self.cores.values().all(|c| c.phase.is_terminal()) + self.instances + .values() + .all(|runtime| runtime.phase.is_terminal()) } #[must_use] @@ -73,8 +210,10 @@ impl PipelineStatus { ] } + /// Aggregates the accepted condition across the selected serving runtimes. fn aggregate_accepted_condition(&self) -> Condition { - if self.cores.is_empty() { + let selected = self.selected_runtimes(); + if selected.is_empty() { return Condition { kind: ConditionKind::Accepted, status: ConditionStatus::Unknown, @@ -89,7 +228,7 @@ impl PipelineStatus { let mut any_unknown: Option = None; let mut latest_true_time: Option = None; - for runtime in self.cores.values() { + for (_, runtime) in selected { let cond = runtime.accepted_condition().clone(); match cond.status { ConditionStatus::True => { @@ -122,7 +261,10 @@ impl PipelineStatus { status: ConditionStatus::False, reason: state.reason.clone().or(Some(ConditionReason::NotAccepted)), message: state.message.clone().or_else(|| { - Some("One or more cores have not accepted the configuration.".to_string()) + Some( + "One or more serving cores have not accepted the configuration." 
+ .to_string(), + ) }), last_transition_time: state.last_transition_time, }; @@ -136,10 +278,9 @@ impl PipelineStatus { .reason .clone() .or_else(|| Some(ConditionReason::unknown("Unknown"))), - message: state - .message - .clone() - .or_else(|| Some("Acceptance is unknown for one or more cores.".to_string())), + message: state.message.clone().or_else(|| { + Some("Acceptance is unknown for one or more serving cores.".to_string()) + }), last_transition_time: state.last_transition_time, }; } @@ -149,15 +290,17 @@ impl PipelineStatus { status: ConditionStatus::True, reason: Some(ConditionReason::ConfigValid), message: Some( - "Pipeline configuration validated and resource policy constraints are satisfied." + "Serving pipeline configuration validated and resource policy constraints are satisfied." .to_string(), ), last_transition_time: latest_true_time, } } + /// Aggregates the ready condition across the selected serving runtimes. fn aggregate_ready_condition(&self) -> Condition { - if self.cores.is_empty() { + let selected = self.selected_runtimes(); + if selected.is_empty() { return Condition { kind: ConditionKind::Ready, status: ConditionStatus::Unknown, @@ -167,8 +310,9 @@ impl PipelineStatus { }; } - let (ready_numer, ready_denom) = self.count_quorum(|c| { - c.phase.kind() != PhaseKind::Deleted && self.health_policy.is_ready(c.phase.kind()) + let (ready_numer, ready_denom) = self.count_quorum_from(&selected, |runtime| { + runtime.phase.kind() != PhaseKind::Deleted + && self.health_policy.is_ready(runtime.phase.kind()) }); let required = required_ready_count(self.health_policy.ready_quorum, ready_denom); let readiness_met = ready_denom > 0 && ready_numer >= required; @@ -178,7 +322,7 @@ impl PipelineStatus { let mut latest_false_time: Option = None; let mut latest_unknown: Option = None; - for runtime in self.cores.values() { + for (_, runtime) in selected { let cond = runtime.ready_condition().clone(); match cond.status { ConditionStatus::True => { @@ -224,14 +368,16 @@ impl PipelineStatus { kind: ConditionKind::Ready, status: ConditionStatus::False, reason: Some(ConditionReason::NoActiveCores), - message: Some("No active cores are available to evaluate readiness.".to_string()), + message: Some( + "No active serving cores are available to evaluate readiness.".to_string(), + ), last_transition_time: last_time, }; } if let Some(state) = latest_false { let message = format!( - "Pipeline is not ready; ready quorum {} not met ({} of {} cores ready).", + "Pipeline is not ready; ready quorum {} not met ({} of {} serving cores ready).", describe_quorum(self.health_policy.ready_quorum, required), ready_numer, ready_denom @@ -253,10 +399,9 @@ impl PipelineStatus { .reason .clone() .or_else(|| Some(ConditionReason::unknown("Unknown"))), - message: state - .message - .clone() - .or_else(|| Some("Readiness is unknown for one or more cores.".to_string())), + message: state.message.clone().or_else(|| { + Some("Readiness is unknown for one or more serving cores.".to_string()) + }), last_transition_time: state.last_transition_time, }; } @@ -270,43 +415,121 @@ impl PipelineStatus { } } - /// Returns a boolean representing the liveness across cores, governed by the aggregation - /// policy. + /// Returns a boolean representing the liveness across serving cores. 
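+    /// Superseded generations that are still retained for debugging do not
+    /// count against the quorum; only the selected serving set is evaluated.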
     #[must_use]
     pub fn liveness(&self) -> bool {
-        let (numer, denom) = self.count_quorum(|c| self.health_policy.is_live(c.phase.kind()));
+        let selected = self.selected_runtimes();
+        let (numer, denom) = self.count_quorum_from(&selected, |runtime| {
+            self.health_policy.is_live(runtime.phase.kind())
+        });
         quorum_satisfied(numer, denom, self.health_policy.live_quorum)
     }
 
-    /// Returns a boolean representing the readiness across cores, governed by the aggregation
-    /// policy.
+    /// Returns a boolean representing the readiness across serving cores.
     #[must_use]
     pub fn readiness(&self) -> bool {
-        let (numer, denom) = self.count_quorum(|c| {
-            c.phase.kind() != PhaseKind::Deleted && self.health_policy.is_ready(c.phase.kind())
+        let selected = self.selected_runtimes();
+        let (numer, denom) = self.count_quorum_from(&selected, |runtime| {
+            runtime.phase.kind() != PhaseKind::Deleted
+                && self.health_policy.is_ready(runtime.phase.kind())
         });
         denom > 0 && quorum_satisfied(numer, denom, self.health_policy.ready_quorum)
     }
 
-    /// Counts how many cores satisfy the given predicate, returning (numerator, denominator).
-    ///
-    /// The denominator excludes cores in `Deleted` phase.
-    /// The numerator excludes cores in `Deleted` phase and counts only cores satisfying the
-    /// predicate. The predicate is usually checking for liveness or readiness.
-    fn count_quorum<F>(&self, pred: F) -> (usize, usize)
+    /// Selects the runtime instances that currently represent this logical pipeline.
+    fn selected_runtime_keys(&self) -> Vec<RuntimeInstanceKey> {
+        if !self.serving_generations.is_empty() {
+            return self
+                .serving_generations
+                .iter()
+                .map(|(core_id, generation)| RuntimeInstanceKey {
+                    core_id: *core_id,
+                    deployment_generation: *generation,
+                })
+                .filter(|key| self.instances.contains_key(key))
+                .collect();
+        }
+
+        if let Some(active_generation) = self.active_generation {
+            let selected: Vec<_> = self
+                .instances
+                .iter()
+                .filter(|(key, _)| {
+                    key.deployment_generation == active_generation
+                        && (self.active_cores.is_empty()
+                            || self.active_cores.contains(&key.core_id))
+                })
+                .map(|(key, _)| *key)
+                .collect();
+            if !selected.is_empty() {
+                return selected;
+            }
+        }
+
+        let mut per_core: HashMap<CoreId, RuntimeInstanceKey> = HashMap::new();
+        for key in self.instances.keys() {
+            if !self.active_cores.is_empty() && !self.active_cores.contains(&key.core_id) {
+                continue;
+            }
+            let replace = per_core
+                .get(&key.core_id)
+                .is_none_or(|existing| key.deployment_generation > existing.deployment_generation);
+            if replace {
+                _ = per_core.insert(key.core_id, *key);
+            }
+        }
+        per_core.into_values().collect()
+    }
+
+    /// Selects the runtime instances that should contribute to aggregated status.
+    fn selected_runtimes(&self) -> Vec<(RuntimeInstanceKey, &PipelineRuntimeStatus)> {
+        self.selected_runtime_keys()
+            .into_iter()
+            .filter_map(|key| self.instances.get(&key).map(|runtime| (key, runtime)))
+            .collect()
+    }
+
+    /// Builds a per-core view of the selected runtime instances.
+    fn selected_core_map(&self) -> HashMap<CoreId, PipelineRuntimeStatus> {
+        self.selected_runtimes()
+            .into_iter()
+            .map(|(key, runtime)| (key.core_id, runtime.clone()))
+            .collect()
+    }
+
+    /// Builds the retained per-instance view exposed for overlap-aware status
+    /// debugging.
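+    /// Unlike the aggregated `cores` field, entries here are sorted by
+    /// `(core_id, deployment_generation)` and may include superseded
+    /// generations that have not been compacted away yet.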
+ fn retained_instance_views(&self) -> Vec> { + let mut instances = self + .instances + .iter() + .map(|(key, status)| RuntimeInstanceStatusView { + core_id: key.core_id, + deployment_generation: key.deployment_generation, + status, + }) + .collect::>(); + instances.sort_by_key(|instance| (instance.core_id, instance.deployment_generation)); + instances + } + + /// Counts how many selected runtimes satisfy a quorum predicate. + fn count_quorum_from( + &self, + selected: &[(RuntimeInstanceKey, &PipelineRuntimeStatus)], + pred: F, + ) -> (usize, usize) where F: Fn(&PipelineRuntimeStatus) -> bool, { - let denom = self - .cores - .values() - .filter(|c| c.phase.kind() != PhaseKind::Deleted) + let denom = selected + .iter() + .filter(|(_, runtime)| runtime.phase.kind() != PhaseKind::Deleted) .count(); - let numer = self - .cores - .values() - .filter(|c| c.phase.kind() != PhaseKind::Deleted) - .filter(|c| pred(c)) + let numer = selected + .iter() + .filter(|(_, runtime)| runtime.phase.kind() != PhaseKind::Deleted) + .filter(|(_, runtime)| pred(runtime)) .count(); (numer, denom) } @@ -368,12 +591,18 @@ impl Serialize for PipelineStatus { where S: serde::Serializer, { - let mut state = serializer.serialize_struct("PipelineStatus", 5)?; - let conditions = self.conditions(); - state.serialize_field("conditions", &conditions)?; + let selected_cores = self.selected_core_map(); + let retained_instances = self.retained_instance_views(); + + let mut state = serializer.serialize_struct("PipelineStatus", 8)?; + state.serialize_field("conditions", &self.conditions())?; state.serialize_field("totalCores", &self.total_cores())?; state.serialize_field("runningCores", &self.running_cores())?; - state.serialize_field("cores", &self.cores)?; + state.serialize_field("cores", &selected_cores)?; + state.serialize_field("instances", &retained_instances)?; + state.serialize_field("activeGeneration", &self.active_generation)?; + state.serialize_field("servingGenerations", &self.serving_generations)?; + state.serialize_field("rollout", &self.rollout)?; state.end() } } @@ -383,7 +612,6 @@ mod tests { use super::*; use crate::conditions::{ConditionKind, ConditionReason, ConditionState, ConditionStatus}; use crate::phase::FailReason; - use std::collections::HashMap; use std::time::{Duration, SystemTime}; fn runtime(phase: PipelinePhase) -> PipelineRuntimeStatus { @@ -393,11 +621,23 @@ mod tests { } } + fn insert_runtime( + status: &mut PipelineStatus, + core_id: CoreId, + generation: u64, + runtime: PipelineRuntimeStatus, + ) { + _ = status.instances.insert( + RuntimeInstanceKey { + core_id, + deployment_generation: generation, + }, + runtime, + ); + } + fn new_status(policy: HealthPolicy) -> PipelineStatus { - PipelineStatus { - cores: HashMap::new(), - health_policy: policy, - } + PipelineStatus::new(policy) } fn runtime_with_conditions( @@ -440,24 +680,31 @@ mod tests { ready_quorum: Quorum::Percent(100), }; let mut status = new_status(policy); - _ = status.cores.insert(0, runtime(PipelinePhase::Running)); - _ = status.cores.insert(1, runtime(PipelinePhase::Running)); - _ = status - .cores - .insert(2, runtime(PipelinePhase::Failed(FailReason::RuntimeError))); - _ = status.cores.insert(3, runtime(PipelinePhase::Deleted)); + insert_runtime(&mut status, 0, 0, runtime(PipelinePhase::Running)); + insert_runtime(&mut status, 1, 0, runtime(PipelinePhase::Running)); + insert_runtime( + &mut status, + 2, + 0, + runtime(PipelinePhase::Failed(FailReason::RuntimeError)), + ); + insert_runtime(&mut status, 3, 0, 
runtime(PipelinePhase::Deleted)); + status.set_active_generation(0); assert!(status.liveness()); - _ = status - .cores - .insert(1, runtime(PipelinePhase::Failed(FailReason::RuntimeError))); + insert_runtime( + &mut status, + 1, + 0, + runtime(PipelinePhase::Failed(FailReason::RuntimeError)), + ); assert!(!status.liveness()); } #[test] - fn readiness_requires_all_non_deleted_cores_to_be_ready() { + fn readiness_requires_all_selected_cores_to_be_ready() { let policy = HealthPolicy { live_if: vec![PhaseKind::Running], ready_if: vec![PhaseKind::Running], @@ -465,21 +712,24 @@ mod tests { ready_quorum: Quorum::Percent(100), }; let mut status = new_status(policy); - _ = status.cores.insert(0, runtime(PipelinePhase::Running)); - _ = status.cores.insert(1, runtime(PipelinePhase::Running)); + insert_runtime(&mut status, 0, 0, runtime(PipelinePhase::Running)); + insert_runtime(&mut status, 1, 0, runtime(PipelinePhase::Running)); + status.set_active_generation(0); assert!(status.readiness()); - _ = status.cores.insert(1, runtime(PipelinePhase::Updating)); + insert_runtime(&mut status, 1, 0, runtime(PipelinePhase::Updating)); assert!(!status.readiness()); } #[test] - fn aggregated_accept_condition_false_if_any_core_not_accepted() { + fn aggregated_accept_condition_false_if_any_serving_core_not_accepted() { let policy = HealthPolicy::default(); let mut status = new_status(policy); - _ = status.cores.insert( + insert_runtime( + &mut status, + 0, 0, runtime_with_conditions( PipelinePhase::Running, @@ -491,8 +741,10 @@ mod tests { Some(ts(10)), ), ); - _ = status.cores.insert( + insert_runtime( + &mut status, 1, + 0, runtime_with_conditions( PipelinePhase::Pending, ConditionStatus::False, @@ -503,6 +755,7 @@ mod tests { Some(ts(20)), ), ); + status.set_active_generation(0); let accepted = status .conditions() @@ -524,7 +777,9 @@ mod tests { ready_quorum: Quorum::Percent(100), }; let mut status = new_status(policy); - _ = status.cores.insert( + insert_runtime( + &mut status, + 0, 0, runtime_with_conditions( PipelinePhase::Running, @@ -536,8 +791,10 @@ mod tests { Some(ts(5)), ), ); - _ = status.cores.insert( + insert_runtime( + &mut status, 1, + 0, runtime_with_conditions( PipelinePhase::Failed(FailReason::RuntimeError), ConditionStatus::True, @@ -548,6 +805,7 @@ mod tests { Some(ts(12)), ), ); + status.set_active_generation(0); let ready = status .conditions() @@ -565,4 +823,184 @@ mod tests { ); assert_eq!(ready.last_transition_time, Some(ts(12))); } + + /// Scenario: observed state contains overlapping generations during a + /// mixed-generation rollout and the controller marks which generation is + /// serving on each core. + /// Guarantees: aggregation selects the serving generation per core so + /// total/running core counts and readiness reflect the active serving set. 
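+    /// Core 0 serves generation 1 while core 1 still serves generation 0; the
+    /// retained stopped generation-0 instance on core 0 must not drag
+    /// readiness down.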
+ #[test] + fn serving_generation_selection_supports_mixed_blue_green_rollout() { + let mut status = new_status(HealthPolicy::default()); + insert_runtime(&mut status, 0, 0, runtime(PipelinePhase::Stopped)); + insert_runtime(&mut status, 0, 1, runtime(PipelinePhase::Running)); + insert_runtime(&mut status, 1, 0, runtime(PipelinePhase::Running)); + status.set_active_generation(0); + status.set_serving_generation(0, 1); + status.set_serving_generation(1, 0); + + assert_eq!(status.total_cores(), 2); + assert_eq!(status.running_cores(), 2); + assert!(status.readiness()); + } + + /// Scenario: observed state contains multiple generations for the same + /// cores while the controller has already pinned the serving generation on + /// each core. + /// Guarantees: compaction retains only the serving `(core, generation)` + /// instances and removes superseded generations from retained state. + #[test] + fn compact_instances_retains_only_serving_generations() { + let mut status = new_status(HealthPolicy::default()); + insert_runtime(&mut status, 0, 0, runtime(PipelinePhase::Stopped)); + insert_runtime(&mut status, 0, 1, runtime(PipelinePhase::Running)); + insert_runtime(&mut status, 1, 0, runtime(PipelinePhase::Running)); + insert_runtime(&mut status, 1, 1, runtime(PipelinePhase::Stopped)); + status.set_active_generation(0); + status.set_serving_generation(0, 1); + status.set_serving_generation(1, 0); + + status.compact_instances_to_selected(); + + assert_eq!(status.per_instance().len(), 2); + assert!(status.instance_status(0, 1).is_some()); + assert!(status.instance_status(1, 0).is_some()); + assert!(status.instance_status(0, 0).is_none()); + assert!(status.instance_status(1, 1).is_none()); + } + + /// Scenario: observed state has multiple generations but there is no + /// mixed-generation serving override and the controller has committed a new + /// active generation. + /// Guarantees: compaction retains only the committed active generation. + #[test] + fn compact_instances_retains_only_active_generation_when_no_serving_override() { + let mut status = new_status(HealthPolicy::default()); + insert_runtime(&mut status, 0, 0, runtime(PipelinePhase::Stopped)); + insert_runtime(&mut status, 0, 1, runtime(PipelinePhase::Running)); + insert_runtime(&mut status, 1, 0, runtime(PipelinePhase::Stopped)); + insert_runtime(&mut status, 1, 1, runtime(PipelinePhase::Running)); + status.set_active_generation(1); + + status.compact_instances_to_selected(); + + assert_eq!(status.per_instance().len(), 2); + assert!(status.instance_status(0, 1).is_some()); + assert!(status.instance_status(1, 1).is_some()); + assert!(status.instance_status(0, 0).is_none()); + assert!(status.instance_status(1, 0).is_none()); + } + + /// Scenario: observed state contains only superseded generations relative + /// to the last committed active generation. + /// Guarantees: compaction falls back to the highest observed generation per + /// core so status remains bounded without dropping the last known view. 
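+    /// Generation 9 was committed but never observed, so the highest observed
+    /// generation per core (2 on core 0, 3 on core 1) is what survives
+    /// compaction.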
+ #[test] + fn compact_instances_falls_back_to_latest_generation_per_core() { + let mut status = new_status(HealthPolicy::default()); + insert_runtime(&mut status, 0, 0, runtime(PipelinePhase::Stopped)); + insert_runtime(&mut status, 0, 2, runtime(PipelinePhase::Stopped)); + insert_runtime(&mut status, 1, 1, runtime(PipelinePhase::Running)); + insert_runtime(&mut status, 1, 3, runtime(PipelinePhase::Stopped)); + status.set_active_generation(9); + + status.compact_instances_to_selected(); + + assert_eq!(status.per_instance().len(), 2); + assert!(status.instance_status(0, 2).is_some()); + assert!(status.instance_status(1, 3).is_some()); + assert!(status.instance_status(0, 0).is_none()); + assert!(status.instance_status(1, 1).is_none()); + } + + /// Scenario: a logical pipeline has been fully shut down and observed state + /// still contains an older generation alongside the final stopped + /// generation. + /// Guarantees: compaction keeps the last stopped generation per core so + /// `/status` continues to surface the final stopped view instead of + /// collapsing to an empty runtime set. + #[test] + fn compact_instances_preserves_last_stopped_generation_view_after_shutdown() { + let mut status = new_status(HealthPolicy::default()); + insert_runtime(&mut status, 0, 0, runtime(PipelinePhase::Stopped)); + insert_runtime(&mut status, 0, 1, runtime(PipelinePhase::Stopped)); + status.set_active_generation(1); + status.set_active_cores([0]); + + status.compact_instances_to_selected(); + + assert_eq!(status.per_instance().len(), 1); + assert_eq!(status.total_cores(), 1); + assert_eq!(status.running_cores(), 0); + assert!(matches!( + status + .instance_status(0, 1) + .expect("latest generation should remain") + .phase(), + PipelinePhase::Stopped + )); + assert!(status.instance_status(0, 0).is_none()); + } + + /// Scenario: a pure resize-down retires one core without changing the + /// committed generation, so multiple retained instances share the same + /// active generation across different cores. + /// Guarantees: aggregated status and compaction respect the committed core + /// footprint instead of treating every instance on the active generation as + /// still serving. + #[test] + fn active_generation_selection_respects_committed_core_footprint() { + let mut status = new_status(HealthPolicy::default()); + insert_runtime(&mut status, 0, 0, runtime(PipelinePhase::Running)); + insert_runtime(&mut status, 1, 0, runtime(PipelinePhase::Stopped)); + status.set_active_generation(0); + status.set_active_cores([0]); + + assert_eq!(status.total_cores(), 1); + assert_eq!(status.running_cores(), 1); + + status.compact_instances_to_selected(); + + assert_eq!(status.per_instance().len(), 1); + assert!(status.instance_status(0, 0).is_some()); + assert!(status.instance_status(1, 0).is_none()); + } + + /// Scenario: a rolling cutover overlaps the old and new generations on one + /// core while aggregation must still reflect only the selected serving set. + /// Guarantees: `/status.instances` preserves both retained generations for + /// debugging, while aggregated `cores` and core counts continue to use the + /// selected serving generation per core. 
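+    /// Three instances are retained in `instances` (two generations on core 0,
+    /// one on core 1) while the aggregated `cores` object and the core counts
+    /// reflect only the two serving `(core, generation)` pairs.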
+    #[test]
+    fn serialization_preserves_overlap_aware_instances_while_aggregating_selected_cores() {
+        let mut status = new_status(HealthPolicy::default());
+        insert_runtime(&mut status, 0, 0, runtime(PipelinePhase::Stopped));
+        insert_runtime(&mut status, 0, 1, runtime(PipelinePhase::Running));
+        insert_runtime(&mut status, 1, 0, runtime(PipelinePhase::Running));
+        status.set_active_generation(0);
+        status.set_serving_generation(0, 1);
+        status.set_serving_generation(1, 0);
+
+        let json = serde_json::to_value(&status).expect("pipeline status should serialize");
+        let instances = json["instances"]
+            .as_array()
+            .expect("instances should serialize as an array");
+
+        assert_eq!(json["totalCores"], 2);
+        assert_eq!(json["runningCores"], 2);
+        assert_eq!(
+            json["cores"]
+                .as_object()
+                .expect("cores should serialize as an object")
+                .len(),
+            2
+        );
+        assert_eq!(instances.len(), 3);
+        assert_eq!(instances[0]["coreId"], 0);
+        assert_eq!(instances[0]["deploymentGeneration"], 0);
+        assert_eq!(instances[1]["coreId"], 0);
+        assert_eq!(instances[1]["deploymentGeneration"], 1);
+        assert_eq!(instances[2]["coreId"], 1);
+        assert_eq!(instances[2]["deploymentGeneration"], 0);
+    }
}
diff --git a/rust/otap-dataflow/crates/state/src/store.rs b/rust/otap-dataflow/crates/state/src/store.rs
index 8b1660d3f8..da1dd01ec5 100644
--- a/rust/otap-dataflow/crates/state/src/store.rs
+++ b/rust/otap-dataflow/crates/state/src/store.rs
@@ -7,7 +7,7 @@ use crate::ObservedEventRingBuffer;
use crate::error::Error;
use crate::phase::PipelinePhase;
use crate::pipeline_rt_status::{ApplyOutcome, PipelineRuntimeStatus};
-use crate::pipeline_status::PipelineStatus;
+use crate::pipeline_status::{PipelineRolloutSummary, PipelineStatus, RuntimeInstanceKey};
use otap_df_config::PipelineKey;
use otap_df_config::health::HealthPolicy;
use otap_df_config::observed_state::{ObservedStateSettings, SendPolicy};
@@ -170,6 +170,134 @@ impl ObservedStateStore {
        _ = policies.insert(pipeline_key, health_policy);
    }
+    /// Returns the health policy currently configured for one logical pipeline.
+    fn health_policy_for_pipeline(&self, pipeline_key: &PipelineKey) -> HealthPolicy {
+        self.health_policies
+            .lock()
+            .ok()
+            .and_then(|policies| policies.get(pipeline_key).cloned())
+            .unwrap_or_else(|| self.default_health_policy.clone())
+    }
+
+    /// Records the committed active generation for a logical pipeline.
+    pub fn set_pipeline_active_generation(&self, pipeline_key: PipelineKey, generation: u64) {
+        let mut pipelines = self.pipelines.lock().unwrap_or_else(|poisoned| {
+            otel_error!(
+                "state.mutex_poisoned",
+                action = "continuing with possibly inconsistent state"
+            );
+            poisoned.into_inner()
+        });
+        let status = pipelines
+            .entry(pipeline_key.clone())
+            .or_insert_with(|| PipelineStatus::new(self.health_policy_for_pipeline(&pipeline_key)));
+        status.set_active_generation(generation);
+    }
+
+    /// Records the committed serving core footprint for a logical pipeline.
+    pub fn set_pipeline_active_cores<I>(&self, pipeline_key: PipelineKey, core_ids: I)
+    where
+        I: IntoIterator<Item = otap_df_config::CoreId>,
+    {
+        let mut pipelines = self.pipelines.lock().unwrap_or_else(|poisoned| {
+            otel_error!(
+                "state.mutex_poisoned",
+                action = "continuing with possibly inconsistent state"
+            );
+            poisoned.into_inner()
+        });
+        let status = pipelines
+            .entry(pipeline_key.clone())
+            .or_insert_with(|| PipelineStatus::new(self.health_policy_for_pipeline(&pipeline_key)));
+        status.set_active_cores(core_ids);
+    }
+
+    /// Records which generation is serving traffic for the given logical core.
+ pub fn set_pipeline_serving_generation( + &self, + pipeline_key: PipelineKey, + core_id: otap_df_config::CoreId, + generation: u64, + ) { + let mut pipelines = self.pipelines.lock().unwrap_or_else(|poisoned| { + otel_error!( + "state.mutex_poisoned", + action = "continuing with possibly inconsistent state" + ); + poisoned.into_inner() + }); + let status = pipelines + .entry(pipeline_key.clone()) + .or_insert_with(|| PipelineStatus::new(self.health_policy_for_pipeline(&pipeline_key))); + status.set_serving_generation(core_id, generation); + } + + /// Removes the serving-generation marker for a logical core. + pub fn clear_pipeline_serving_generation( + &self, + pipeline_key: PipelineKey, + core_id: otap_df_config::CoreId, + ) { + let mut pipelines = self.pipelines.lock().unwrap_or_else(|poisoned| { + otel_error!( + "state.mutex_poisoned", + action = "continuing with possibly inconsistent state" + ); + poisoned.into_inner() + }); + if let Some(status) = pipelines.get_mut(&pipeline_key) { + status.clear_serving_generation(core_id); + } + } + + /// Updates the rollout summary exposed in `/status`. + pub fn set_pipeline_rollout_summary( + &self, + pipeline_key: PipelineKey, + rollout: PipelineRolloutSummary, + ) { + let mut pipelines = self.pipelines.lock().unwrap_or_else(|poisoned| { + otel_error!( + "state.mutex_poisoned", + action = "continuing with possibly inconsistent state" + ); + poisoned.into_inner() + }); + let status = pipelines + .entry(pipeline_key.clone()) + .or_insert_with(|| PipelineStatus::new(self.health_policy_for_pipeline(&pipeline_key))); + status.set_rollout_summary(rollout); + } + + /// Clears any rollout summary for the logical pipeline. + pub fn clear_pipeline_rollout_summary(&self, pipeline_key: PipelineKey) { + let mut pipelines = self.pipelines.lock().unwrap_or_else(|poisoned| { + otel_error!( + "state.mutex_poisoned", + action = "continuing with possibly inconsistent state" + ); + poisoned.into_inner() + }); + if let Some(status) = pipelines.get_mut(&pipeline_key) { + status.clear_rollout_summary(); + } + } + + /// Compacts retained observed instances for one logical pipeline to the + /// generations currently selected for status aggregation. + pub fn compact_pipeline_instances(&self, pipeline_key: &PipelineKey) { + let mut pipelines = self.pipelines.lock().unwrap_or_else(|poisoned| { + otel_error!( + "state.mutex_poisoned", + action = "continuing with possibly inconsistent state" + ); + poisoned.into_inner() + }); + if let Some(status) = pipelines.get_mut(pipeline_key) { + status.compact_instances_to_selected(); + } + } + /// Returns a handle that can be used to read the current observed state. 
#[must_use] pub fn handle(&self) -> ObservedStateHandle { @@ -281,11 +409,17 @@ impl ObservedStateStore { let ps = pipelines .entry(pipeline_key) .or_insert_with(|| PipelineStatus::new(health_policy)); + if ps.active_generation().is_none() { + ps.set_active_generation(key.deployment_generation); + } - // Upsert the core record and its condition snapshot + // Upsert the runtime-instance record and its condition snapshot let cs = ps - .cores - .entry(key.core_id) + .instances + .entry(RuntimeInstanceKey { + core_id: key.core_id, + deployment_generation: key.deployment_generation, + }) .or_insert_with(|| PipelineRuntimeStatus { phase: PipelinePhase::Pending, last_heartbeat_time: observed_event.time, @@ -444,6 +578,7 @@ mod tests { pipeline_group_id: Cow::Borrowed("group"), pipeline_id: Cow::Borrowed("pipeline"), core_id, + deployment_generation: 0, } } @@ -587,7 +722,7 @@ mod tests { "All {num_cores} cores should reach Running when engine events are reliable. \ Stuck in Pending: {}", status - .per_core() + .per_instance() .values() .filter(|c| matches!(c.phase, PipelinePhase::Pending)) .count(), @@ -684,7 +819,7 @@ mod tests { "All {num_cores} cores should reach Running despite log channel contention. \ Stuck in Pending: {}", status - .per_core() + .per_instance() .values() .filter(|c| matches!(c.phase, PipelinePhase::Pending)) .count(), diff --git a/rust/otap-dataflow/crates/telemetry/src/lib.rs b/rust/otap-dataflow/crates/telemetry/src/lib.rs index b9d3e78cec..ecb7b93143 100644 --- a/rust/otap-dataflow/crates/telemetry/src/lib.rs +++ b/rust/otap-dataflow/crates/telemetry/src/lib.rs @@ -443,6 +443,7 @@ mod tests { use otap_df_pdata::proto::opentelemetry::resource::v1::Resource; use otap_df_pdata::testing::equiv::assert_equivalent; use prost::Message; + use std::time::Duration; fn test_reporter() -> ObservedEventReporter { let (sender, _receiver) = flume::bounded(16); @@ -503,7 +504,9 @@ mod tests { }); // Receiver should have the log - let recv = rx.recv().expect("receiver should have log after emit"); + let recv = rx + .recv_timeout(Duration::from_secs(1)) + .expect("receiver should have log after emit"); assert!(matches!(recv, ObservedEvent::Log(_))); let text = recv.to_string(); assert!(text.contains("test log message"), "log message is {}", text); diff --git a/rust/otap-dataflow/crates/validation/src/simulate.rs b/rust/otap-dataflow/crates/validation/src/simulate.rs index de963c04aa..02f53ec0a5 100644 --- a/rust/otap-dataflow/crates/validation/src/simulate.rs +++ b/rust/otap-dataflow/crates/validation/src/simulate.rs @@ -5,7 +5,7 @@ use crate::error::ValidationError; use crate::metrics_types::{MetricSetSnapshot, MetricsSnapshot}; use otap_df_admin_api::{ AdminClient, AdminEndpoint, HttpAdminClientSettings, engine::ProbeStatus, - operations::OperationOptions, pipeline_groups::ShutdownStatus, telemetry::MetricsOptions, + groups::ShutdownStatus, operations::OperationOptions, telemetry::MetricsOptions, }; use otap_df_config::engine::OtelDataflowSpec; use otap_df_controller::Controller; @@ -147,7 +147,7 @@ async fn wait_for_validation_finished( /// shutdown pipeline after running async fn shutdown_pipeline(client: &AdminClient) -> Result<(), ValidationError> { let response = client - .pipeline_groups() + .groups() .shutdown(&OperationOptions::default()) .await .map_err(admin_error)?; @@ -476,7 +476,7 @@ mod tests { .await; Mock::given(method("POST")) - .and(path("/api/v1/pipeline-groups/shutdown")) + .and(path("/api/v1/groups/shutdown")) .and(query_param("wait", "false")) 
        .and(query_param("timeout_secs", "60"))
        .respond_with(ResponseTemplate::new(202).set_body_json(serde_json::json!({
diff --git a/rust/otap-dataflow/docs/admin/README.md b/rust/otap-dataflow/docs/admin/README.md
index f7e2c187d8..e31505f3ec 100644
--- a/rust/otap-dataflow/docs/admin/README.md
+++ b/rust/otap-dataflow/docs/admin/README.md
@@ -3,12 +3,14 @@
This section documents the admin surface of the OTAP Dataflow Engine:

- runtime endpoints used for health, status, and telemetry;
+- live pipeline reconfiguration and shutdown operations;
- embedded browser UI behavior and architecture.
- the public Rust admin SDK.

## Document map

- [Admin UI Architecture](architecture.md)
+- [Live Pipeline Reconfiguration](live-reconfiguration.md)
- [Crate README (admin endpoints and crate layout)](../../crates/admin/README.md)
- [Public Rust SDK README](../../crates/admin-api/README.md)
@@ -26,6 +28,9 @@ raw HTTP requests directly.

For architecture details (state model, derivation rules, graph rules, testing),
start with [Admin UI Architecture](architecture.md).

+For the live mutation API used to create, replace, resize, and shut down
+logical pipelines, see [Live Pipeline Reconfiguration](live-reconfiguration.md).
+
## UI module tests

Prerequisite:
diff --git a/rust/otap-dataflow/docs/admin/live-reconfiguration.md b/rust/otap-dataflow/docs/admin/live-reconfiguration.md
new file mode 100644
index 0000000000..4c728228d6
--- /dev/null
+++ b/rust/otap-dataflow/docs/admin/live-reconfiguration.md
@@ -0,0 +1,480 @@
+# Live Pipeline Reconfiguration
+
+This document describes the live reconfiguration flow exposed by the admin API.
+
+The feature lets a running OTel Dataflow Engine mutate one logical pipeline at
+a time without restarting the process or reloading the full startup file.
+
+## Goals
+
+- Reconfigure one pipeline in a running engine instance.
+- Keep the mutation scoped to a single `(pipeline_group_id, pipeline_id)`.
+- Preserve service continuity for topology/config changes with a serial rolling
+  cutover that overlaps old and new instances only on the affected cores.
+- Support pure resource policy changes, including scale up and scale down,
+  without restarting unchanged cores.
+- Make progress observable through admin endpoints instead of hidden internal
+  controller state.
+
+## Supported Operations
+
+- Create a new pipeline inside an existing pipeline group.
+- Replace an existing pipeline with a new topology or node configuration.
+- Resize an existing pipeline when the only effective runtime change is
+  `policies.resources.core_allocation`.
+- Accept an effectively identical update as a `noop`.
+- Track rollout progress with a rollout id.
+- Shut down a logical pipeline and track shutdown progress with a shutdown id.
+
+## Terminology
+
+Live reconfiguration uses a few controller-specific terms. They are important
+because the admin API exposes both committed pipeline state and in-progress
+runtime state.
+
+- Logical pipeline: the named pipeline addressed by `(pipeline_group_id,
+  pipeline_id)`. A logical pipeline can have several runtime instances over
+  time as it is rolled, resized, or shut down.
+- Runtime instance: one concrete execution of a logical pipeline on one core.
+  Runtime instances are identified by `(pipeline_group_id, pipeline_id, core_id,
+  deployment_generation)`.
+- Deployment generation: a monotonically assigned version for runtime
+  instances of one logical pipeline. `create` and `replace` rollouts start a new
+  generation. `resize` keeps the same generation and only changes the active
+  core set.
+- Active generation: the generation currently committed by the controller as
+  the logical pipeline's desired serving generation.
+- Serving generation: the generation currently selected for a specific core in
+  observed state. During a rolling cutover, different cores may temporarily
+  serve different generations.
+- Candidate pipeline config: the pipeline config submitted by the client and
+  validated by the controller before it is committed into the live in-memory
+  engine config.
+- Candidate generation: the target generation for a `create` or `replace`
+  rollout while it is still being tested and has not yet been committed as the
+  active generation.
+- Candidate instance: a runtime instance launched from the candidate generation.
+  Candidate instances must become admitted and ready before the controller uses
+  them for serving. If the rollout fails before commit, the controller
+  best-effort shuts them down.
+- Rollout worker: the background controller thread that executes an accepted
+  rollout plan after the admin request has been accepted. The API can return
+  before this worker finishes when `wait=false`.
+- Rollout worker panic: an unexpected Rust panic in the rollout worker itself,
+  not a normal pipeline runtime error. The controller catches this panic, marks
+  the rollout failed, reports diagnostics, clears the active-operation conflict,
+  and cleans up uncommitted candidate instances when needed.
+- Drain: a graceful shutdown step. The runtime stops accepting new ingress,
+  lets already admitted work finish as far as the node contracts allow, and
+  exits before the drain timeout.
+
+This document uses the term `serial rolling cutover with overlap` for
+topology-changing replacement.
+
+During `replace`, the controller overlaps old and new instances only on the
+core currently being switched:
+
+- start the new instance for one core;
+- wait for `Admitted` and `Ready`;
+- drain the old instance on that same core;
+- move to the next core.
+
+The controller never starts a second complete serving fleet and then performs
+one atomic traffic flip across the whole pipeline.
+
+## Boundaries and Current Limits
+
+- Updates are in-memory only. The startup YAML file is not rewritten.
+- The target pipeline group must already exist.
+- Runtime topic broker mutation is rejected. In practice this means:
+  - no new or removed declared topics;
+  - no change to the selected topic mode;
+  - no change to topic backend or topic policies.
+- Group-level and engine-level policy mutation is out of scope.
+- There is no dedicated scale endpoint. Scale-only changes use the same `PUT`
+  endpoint as topology changes.
+
+## Consistency Model
+
+The current API serializes live operations per logical pipeline, identified by
+`(pipeline_group_id, pipeline_id)`. A rollout or shutdown conflicts with another
+active operation for the same logical pipeline, while operations for different
+logical pipelines may run concurrently.
+
+Rollout planning validates a candidate by patching one pipeline into the
+controller's current in-memory `OtelDataflowSpec` snapshot and running full
+engine validation on that candidate snapshot. That validation does not make the
+operation a whole-config transaction: another logical pipeline can commit before
+this rollout commits, and commit applies only the accepted pipeline back into
+the latest live config.
+
+The API intentionally leaves room to widen the consistency scope later.
If +group-level invariants become mutable, the controller can serialize +config-mutating operations per pipeline group and return `409 Conflict` for +concurrent operations in that group without changing the existing pipeline +endpoint or response schema. Engine-level reconfiguration can be added as a +separate operation surface if full-engine transactions become necessary. + +## How It Works + +1. The client submits a candidate pipeline config to + `PUT /groups/{group}/pipelines/{id}`. +1. The controller patches exactly that pipeline into its live in-memory + `OtelDataflowSpec`. +1. The candidate config is validated as a full engine snapshot: + - pipeline structure and canonicalization; + - component config validation; + - whole-config validation, including topic cycle checks; + - topic runtime profile compatibility. +1. The controller classifies the update: + - `create`: the logical pipeline does not exist yet; + - `noop`: the resolved pipeline and active serving footprint already match + the request; + - `replace`: the runtime graph or runtime-significant node config changed; + - `resize`: only the effective core allocation changed. +1. The controller executes the plan: + - `create`: start all target instances in parallel and commit only if they + all become healthy. + - `noop`: record an immediately successful rollout result without restarting + any runtime instances. + - `replace`: do a serial rolling cutover with overlap per common core. + Start the new generation on one core, wait for admission and readiness, + then drain the old generation on that core. + - `resize`: start only newly added cores and drain only removed cores. + Common cores stay up and keep serving the current generation. +1. The controller records rollout progress and mirrors a summary into + `GET /groups/{group}/pipelines/{id}/status`. + +### Success Gate + +For `replace` and `create`, a new instance must reach both `Admitted` and +`Ready` before the controller commits the new serving state for that step. + +The request body carries two timeouts: + +- `stepTimeoutSecs`: how long to wait for the new instance to admit and become + ready. Default: `60`. +- `drainTimeoutSecs`: how long to wait for graceful drain of the old instance. + Default: `60`. + +The query string also supports an overall client wait timeout: + +- `timeout_secs` on the `PUT` request when `wait=true`. + +### Failure Handling + +- `create`: if any target instance fails to admit or become ready, the + controller shuts down the instances that were already launched and leaves the + committed config unchanged. +- `replace`: if a core fails during the rollout, the controller stops and + automatically rolls back already switched cores to the previous generation. +- `resize`: if added or removed cores fail during the operation, the controller + rolls the resize back by draining newly added cores and relaunching retired + cores when possible. +- If rollback cannot restore a healthy serving set, the rollout ends in + `rollback_failed` and the mixed state remains visible through status + endpoints. + +### Controller Safety Behaviors + +The controller treats live reconfiguration as a runtime lifecycle operation, +not just as an in-memory config edit. Several edge cases are handled explicitly +to avoid orphaned runtime instances, stale conflicts, or unbounded status +growth. 
+
+- Partial `create` launch failure: if one core fails to launch after earlier
+  cores were already started, the controller best-effort shuts down the
+  candidate instances that were launched by that same create operation before
+  returning rollout failure.
+- Readiness failure after candidate launch: if a candidate generation starts
+  but does not reach `Admitted` and `Ready` before the step timeout, the
+  controller shuts down the candidate instance before continuing with failure
+  handling or rollback.
+- Rollout worker panic: if the detached rollout worker panics, the controller
+  records a terminal failed rollout, clears the active-operation conflict, and
+  emits internal panic diagnostics. If the panic happened while an uncommitted
+  target generation was active, the controller best-effort sends shutdown to
+  those candidate instances first.
+- Committed generation protection: panic cleanup does not shut down a target
+  generation that is already committed as the active serving generation. This
+  prevents a late bookkeeping panic from turning a successful rollout into an
+  outage.
+- Shutdown worker panic: if the detached shutdown worker panics, the controller
+  records a terminal failed shutdown and clears the active-operation conflict,
+  so later operations for the same logical pipeline are not blocked until
+  restart.
+- Runtime thread panic or error: runtime instance failures are reported back
+  into observed state with a concise operator message and diagnostic source
+  detail. The instance is marked exited so controller liveness accounting can
+  progress.
+- Launch and exit races: a runtime thread can exit before its launch
+  registration is visible to the controller. The controller records early exits
+  and reconciles them during registration, avoiding stale active-instance
+  counts.
+- Global shutdown dispatch: `POST /groups/shutdown` snapshots active instances
+  and attempts shutdown delivery to all of them. One failed send does not
+  prevent later instances from receiving shutdown. Dispatch is idempotent for
+  instances that already accepted shutdown.
+- Observed-state compaction: after active controller work no longer needs old
+  generations, the controller compacts retained instance status to the selected
+  serving view. During active rollout overlap, status still exposes both old and
+  new generations so operators can debug cutover behavior.
+- Bounded operation history: terminal rollout and shutdown records are retained
+  only in a bounded in-memory window. Recent terminal ids remain useful for
+  follow-up inspection, but old by-id history is intentionally evictable.
+
+## API Surface
+
+### Read current pipeline config
+
+`GET /groups/{group}/pipelines/{id}`
+
+Returns:
+
+- `pipelineGroupId`
+- `pipelineId`
+- `activeGeneration`
+- `pipeline`
+- optional `rollout` summary
+
+### Create, replace, or resize a pipeline
+
+`PUT /groups/{group}/pipelines/{id}?wait=true|false&timeout_secs=<seconds>`
+
+Request body:
+
+```json
+{
+  "pipeline": {
+    "...": "PipelineConfig"
+  },
+  "stepTimeoutSecs": 60,
+  "drainTimeoutSecs": 60
+}
+```
+
+Behavior:
+
+- If `(group, id)` does not exist, the action is `create`.
+- If the submitted config is already in effect, the action is `noop`.
+- If only the effective core allocation changed, the action is `resize`.
+- Otherwise the action is `replace`.
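+
+The same request can be driven through the typed Rust SDK instead of raw
+HTTP. The sketch below is illustrative only: `AdminClient`,
+`pipelines().reconfigure(...)`, and `operations::OperationOptions` are real
+admin-api names, but the exact argument list and the outcome shape shown here
+are assumptions, not a stabilized signature.
+
+```rust
+// Hedged sketch: submit a live update with wait=true semantics, where a
+// failed or timed-out rollout comes back as a typed outcome and only
+// request rejection surfaces as an SDK error.
+use otap_df_admin_api::{AdminClient, operations::OperationOptions};
+
+async fn update_pipeline(
+    client: &AdminClient,
+    body: serde_json::Value, // {"pipeline": ..., "stepTimeoutSecs": 60, ...}
+) -> Result<(), Box<dyn std::error::Error>> {
+    let outcome = client
+        .pipelines()
+        .reconfigure(
+            "topic_multitenant_isolation", // pipeline group id
+            "tenant_c_pipeline",           // pipeline id
+            &body,
+            &OperationOptions::default(), // assumed to carry wait/timeout knobs
+        )
+        .await?;
+    println!("rollout outcome: {outcome:?}");
+    Ok(())
+}
+```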
+ +Response body is a `RolloutStatus` with: + +- `rolloutId` +- `action` (`create`, `noop`, `replace`, `resize`) +- `state` (`pending`, `running`, `succeeded`, `failed`, `rolling_back`, + `rollback_failed`) +- `targetGeneration` +- `previousGeneration` +- `startedAt` +- `updatedAt` +- optional `failureReason` +- `cores` + +Status codes: + +- `202 Accepted`: request accepted and `wait=false` +- `200 OK`: `wait=true` and the rollout finished successfully +- `404 Not Found`: pipeline group does not exist +- `409 Conflict`: another incompatible live operation is active in the + controller's current consistency scope, or a waited rollout finished in + failure. In the current version of the API, that scope is one logical + pipeline. +- `422 Unprocessable Entity`: validation failure or unsupported runtime + mutation +- `504 Gateway Timeout`: `wait=true` exceeded the overall wait timeout + +### Read rollout progress + +`GET /groups/{group}/pipelines/{id}/rollouts/{rolloutId}` + +Returns the current `RolloutStatus` snapshot for that operation. +Terminal rollout ids are retained only within a bounded in-memory window, so +older ids may return `404 Not Found` after eviction. + +### Read observed pipeline status + +`GET /groups/{group}/pipelines/{id}/status` + +Returns the aggregated pipeline status. Useful fields during rollout: + +- `conditions` +- `totalCores` +- `runningCores` +- `activeGeneration` +- `servingGenerations` +- `rollout` +- `instances` + +Each `instances` entry is keyed by `(coreId, deploymentGeneration)`, so +overlapping old/new generations stay distinguishable during a rolling cutover. + +### Related shutdown endpoints + +- `POST /groups/{group}/pipelines/{id}/shutdown` +- `GET /groups/{group}/pipelines/{id}/shutdowns/{shutdownId}` +- `POST /groups/shutdown` + +These are separate from reconfiguration, but they use the same resident +controller and the same live-operation consistency scope. +Terminal shutdown ids are retained only within a bounded in-memory window, so +older ids may return `404 Not Found` after eviction. + +## Manual Examples + +The examples below use +[`configs/engine-conf/topic_multitenant_isolation.yaml`](../../configs/engine-conf/topic_multitenant_isolation.yaml). +That config binds admin HTTP to `127.0.0.1:8085` and defines the logical +pipeline `topic_multitenant_isolation/tenant_c_pipeline`. + +### Start the sample engine + +```bash +cargo run -- -c configs/engine-conf/topic_multitenant_isolation.yaml +``` + +In another terminal: + +```bash +BASE=http://127.0.0.1:8085/api/v1 +GROUP=topic_multitenant_isolation +PIPE=tenant_c_pipeline +``` + +Inspect the current committed config and observed runtime state: + +```bash +curl -s "$BASE/groups/$GROUP/pipelines/$PIPE" | jq . +curl -s "$BASE/groups/$GROUP/pipelines/$PIPE/status" | jq . +``` + +### Example: Topology change with serial rolling cutover + +This example inserts a debug processor between the topic receiver and the retry +processor. 
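+
+If you script the change from Rust instead of jq, the same request body can be
+built with `serde_json`. This is a hedged sketch of the body construction only
+(the node and connection names match this example); it is not part of the SDK
+surface.
+
+```rust
+// Hedged sketch: take the GET /groups/{group}/pipelines/{id} response and
+// build the PUT request body, inserting the debug processor between the
+// receiver and the retry processor.
+use serde_json::{Value, json};
+
+fn insert_debug_processor(mut live: Value) -> Value {
+    let pipeline = &mut live["pipeline"];
+    // Add the new processor node.
+    pipeline["nodes"]["tenant_c_debug"] = json!({
+        "type": "processor:debug",
+        "config": { "verbosity": "basic" }
+    });
+    // Rewire connections so the debug node sits on the receiver -> retry path.
+    pipeline["connections"] = json!([
+        { "from": "tenant_c_receiver", "to": "tenant_c_debug" },
+        { "from": "tenant_c_debug", "to": "tenant_c_retry" },
+        { "from": "tenant_c_retry", "to": "tenant_c_sink" }
+    ]);
+    // Wrap the edited pipeline in the PUT body with default timeouts.
+    json!({
+        "pipeline": pipeline.take(),
+        "stepTimeoutSecs": 60,
+        "drainTimeoutSecs": 60
+    })
+}
+```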
+ +Build the request body from the live config: + +```bash +curl -s "$BASE/groups/$GROUP/pipelines/$PIPE" \ + | jq ' + { + pipeline: ( + .pipeline + | .nodes += { + tenant_c_debug: { + type: "processor:debug", + config: { + verbosity: "basic" + } + } + } + | .connections = [ + {from: "tenant_c_receiver", to: "tenant_c_debug"}, + {from: "tenant_c_debug", to: "tenant_c_retry"}, + {from: "tenant_c_retry", to: "tenant_c_sink"} + ] + ), + stepTimeoutSecs: 60, + drainTimeoutSecs: 60 + } + ' \ + > /tmp/tenant_c_pipeline-debug.json +``` + +Submit the update and wait for completion: + +```bash +curl -sS -X PUT \ + "$BASE/groups/$GROUP/pipelines/$PIPE?wait=true&timeout_secs=120" \ + -H 'content-type: application/json' \ + --data-binary @/tmp/tenant_c_pipeline-debug.json | jq . +``` + +Expected result: + +- `action` is `replace` +- `state` ends as `succeeded` +- `targetGeneration` is greater than `previousGeneration` + +Verify the committed config and rollout-aware status: + +```bash +curl -s "$BASE/groups/$GROUP/pipelines/$PIPE" | jq . +curl -s "$BASE/groups/$GROUP/pipelines/$PIPE/status" \ + | jq '{conditions, totalCores, runningCores, activeGeneration, servingGenerations, rollout, instances}' +``` + +### Example: Async rollout tracking + +Use `wait=false` to return immediately, then poll the rollout resource: + +```bash +ROLLOUT_ID=$( + curl -sS -X PUT \ + "$BASE/groups/$GROUP/pipelines/$PIPE?wait=false" \ + -H 'content-type: application/json' \ + --data-binary @/tmp/tenant_c_pipeline-debug.json \ + | jq -r '.rolloutId' +) + +curl -s "$BASE/groups/$GROUP/pipelines/$PIPE/rollouts/$ROLLOUT_ID" | jq . +``` + +### Example: Pure resource-policy resize + +This example changes only `coreAllocation.count` from `1` to `2`. The +controller detects that the runtime shape is otherwise unchanged and executes a +`resize` rollout instead of a full replace. + +```bash +curl -s "$BASE/groups/$GROUP/pipelines/$PIPE" \ + | jq ' + { + pipeline: .pipeline, + stepTimeoutSecs: 60, + drainTimeoutSecs: 60 + } + | .pipeline.policies.resources.coreAllocation.count = 2 + ' \ + > /tmp/tenant_c_pipeline-scale-up.json +``` + +```bash +curl -sS -X PUT \ + "$BASE/groups/$GROUP/pipelines/$PIPE?wait=true&timeout_secs=120" \ + -H 'content-type: application/json' \ + --data-binary @/tmp/tenant_c_pipeline-scale-up.json | jq . +``` + +Expected result: + +- `action` is `resize` +- `targetGeneration` stays equal to `previousGeneration` +- only the added core is started + +Verify the pipeline footprint: + +```bash +curl -s "$BASE/groups/$GROUP/pipelines/$PIPE/status" \ + | jq '{totalCores, runningCores, activeGeneration, servingGenerations, rollout}' +``` + +Scale back down by setting `coreAllocation.count = 1` in the same request body +pattern. + +## Operational Notes + +- Different logical pipelines may roll concurrently in the current + implementation. +- A single logical pipeline allows only one active rollout or shutdown at a + time. +- Future group-level consistency can widen the conflict scope so concurrent + operations in the same group return `409 Conflict`. +- `GET /groups/{group}/pipelines/{id}` always returns the committed + live config, not an uncommitted candidate. +- `GET /groups/{group}/pipelines/{id}/status` is the best endpoint + for watching serving generations and per-instance phase changes during a + rollout. diff --git a/rust/otap-dataflow/src/main.rs b/rust/otap-dataflow/src/main.rs index 1c419ab935..d130a542d9 100644 --- a/rust/otap-dataflow/src/main.rs +++ b/rust/otap-dataflow/src/main.rs @@ -3,6 +3,7 @@ //! 
Create and run a multi-core pipeline
+use cfg_if::cfg_if;
use clap::Parser;
use otap_df_config::config_provider::{ConfigFormat, resolve_config};
use otap_df_config::engine::OtelDataflowSpec;
@@ -16,7 +17,6 @@ use otap_df_controller::startup;
// Keep this side-effect import so the crate is linked and its `linkme`
// distributed-slice registrations (core nodes) are visible
// in `OTAP_PIPELINE_FACTORY` at runtime.
-use cfg_if::cfg_if;
use otap_df_core_nodes as _;
use otap_df_otap::OTAP_PIPELINE_FACTORY;
/// Project license text (Apache-2.0), embedded at compile time.
@@ -238,6 +238,7 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
    {
        dhat_start();
    }
+
    // Install the rustls crypto provider selected by the crypto-* feature flag.
    // This must happen before any TLS connections (reqwest, tonic, etc.).
    otap_df_otap::crypto::install_crypto_provider()
@@ -324,14 +325,14 @@ mod tests {
            Ok(CoreAllocation::core_set(vec![
                CoreRange { start: 0, end: 4 },
                CoreRange { start: 5, end: 5 },
-                CoreRange { start: 6, end: 7 },
+                CoreRange { start: 6, end: 7 }
            ]))
        );
        assert_eq!(
            parse_core_id_allocation("0..4"),
            Ok(CoreAllocation::core_set(vec![CoreRange {
                start: 0,
-                end: 4,
+                end: 4
            }]))
        );
    }
@@ -499,7 +500,7 @@ connections:
        args.core_id_range,
        Some(CoreAllocation::core_set(vec![
            CoreRange { start: 1, end: 3 },
-            CoreRange { start: 7, end: 7 },
+            CoreRange { start: 7, end: 7 }
        ]))
    );
    assert_eq!(args.num_cores, None);