feat(rkl): Add the container probe functionality. #249

Ivanbeethoven · 2025-11-04T07:34:43Z

No description provided.

Copilot

Pull Request Overview

This PR adds comprehensive health probe support (liveness, readiness, and startup probes) to the RKL container orchestration system, bringing it closer to Kubernetes compatibility. The implementation includes probe execution (HTTP, TCP, Exec), state management, and integration with pod lifecycle operations.

Introduces a new probe subsystem in project/rkl/src/daemon/probe.rs with support for HTTP, TCP, and Exec probe types
Integrates probes into pod startup and deletion workflows in both static pod and client/worker modes
Extends the common data model with probe specifications and status reporting
Updates the pod state command to display probe status information

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
project/rkl/src/daemon/probe.rs	New 861-line module implementing probe execution, state management, and registry
project/rkl/src/daemon/static_pods.rs	Integrates probe registration on pod start and deregistration on pod deletion
project/rkl/src/daemon/client.rs	Adds probe handling to worker pod lifecycle (run/delete operations)
project/rkl/src/daemon/mod.rs	Exposes the new probe module
project/rkl/src/commands/pod/standalone.rs	Adds probe status output to the pod state command
project/rkl/src/commands/pod/mod.rs	Refactors return type to include probe-relevant data (pod IP, task)
project/rkl/src/commands/container/mod.rs	Adds probe fields to ContainerSpec initialization
project/rkl/src/commands/compose/mod.rs	Adds probe fields to ContainerSpec initialization
project/common/src/lib.rs	Extends data model with Probe, ProbeAction, and status types

Copilot · 2025-11-04T08:16:41Z

project/rkl/src/daemon/probe.rs

+    }
+
+    let old = {
+        let mut registry = PROBE_REGISTRY.lock().unwrap_or_else(|err| err.into_inner());


The unwrap_or_else(|err| err.into_inner()) pattern recovers from a poisoned mutex by extracting the inner data. While this prevents panics, it silently ignores poisoning which could indicate another thread panicked while holding the lock. This is risky for shared state integrity. Consider either: (1) returning an error to the caller, (2) logging a warning when poison is detected, or (3) documenting why it's safe to recover from poison in this specific case.

Suggested change

let mut registry = PROBE_REGISTRY.lock().unwrap_or_else(|err| err.into_inner());

let mut registry = match PROBE_REGISTRY.lock() {

Ok(guard) => guard,

Err(poisoned) => {

warn!(

pod = pod_name,

"probe registry mutex was poisoned; recovering inner data"

);

poisoned.into_inner()

}

};

Copilot · 2025-11-04T08:16:41Z

project/rkl/src/daemon/probe.rs

+        }
+    }
+}
+


This function silently enforces a minimum probe period of 1 second. This constraint should be documented either in the function doc comment or inline, as it affects probe configuration behavior and differs from Kubernetes defaults (which allow sub-second periods).

Suggested change

/// Returns a tokio interval for the given period, enforcing a minimum of 1 second.

///

/// This constraint differs from Kubernetes defaults, which allow sub-second probe periods.

/// Any configured period less than 1 second will be silently increased to 1 second.

/// This is done for resource and implementation reasons.

Copilot · 2025-11-04T08:16:42Z

project/rkl/src/daemon/probe.rs

+
+    stream.write_all(request.as_bytes()).await?;
+
+    let mut buf = vec![0u8; 512];


The 512-byte buffer size for HTTP responses is a magic number that may be insufficient for HTTP headers with large cookies or custom headers. Consider either: (1) extracting this as a named constant (e.g., HTTP_PROBE_BUFFER_SIZE), or (2) documenting why 512 bytes is sufficient for probe health checks (since only the status line is typically needed).

Copilot · 2025-11-04T08:16:42Z

project/rkl/src/daemon/probe.rs

+    tokio::runtime::Handle::try_current()
+        .context("probe registration requires an active Tokio runtime")?;
+


This function checks for a Tokio runtime but doesn't use the returned handle. The check appears to be a validation that probes can be spawned, but spawn_probe (line 571) will spawn on the current runtime regardless. Either: (1) remove this check if spawn_probe will fail appropriately on its own, or (2) store the handle and use it explicitly via handle.spawn() to make the dependency clear.

Suggested change

tokio::runtime::Handle::try_current()

.context("probe registration requires an active Tokio runtime")?;

Copilot · 2025-11-04T08:16:42Z

project/rkl/src/daemon/static_pods.rs

+                tokio::spawn({
+                    let pod_name = pod.metadata.name.clone();
+                    async move {
+                        deregister_pod_probes(&pod_name).await;
+                    }
+                });


The spawned task for probe deregistration is detached and its result is never checked. If deregistration fails or panics, it will be silent. Consider either: (1) using tokio::spawn with .await to ensure cleanup completes before returning, or (2) storing the JoinHandle and awaiting it, or (3) at minimum documenting why fire-and-forget is acceptable here.

Copilot · 2025-11-04T08:16:43Z

project/rkl/src/daemon/client.rs

+                                                eprintln!(
+                                                    "[worker] registering probes for pod {} failed: {err:?}",
+                                                    pod_name
+                                                );


Using eprintln! for error logging is inconsistent with the tracing infrastructure used elsewhere in the codebase (imported on line 8). Use tracing::error! or tracing::warn! instead for consistent structured logging and observability.

Copilot · 2025-11-04T08:16:43Z

project/rkl/src/daemon/client.rs

+                                            eprintln!(
+                                                "[worker] building probe registrations for pod {} failed: {err:?}",
+                                                pod_name
+                                            );


Using eprintln! for error logging is inconsistent with the tracing infrastructure used elsewhere in the codebase. Use tracing::error! or tracing::warn! instead for consistent structured logging and observability.

Copilot · 2025-11-04T08:16:43Z

project/rkl/src/daemon/client.rs

+                                    tokio::spawn({
+                                        let pod_name = name.clone();
+                                        async move {
+                                            deregister_pod_probes(&pod_name).await;
+                                        }
+                                    });
                                    let _ = send_uni(&connection, &RksMessage::Ack).await;


The spawned task for probe deregistration is detached and its result is never checked. If deregistration fails or panics, it will be silent. Consider either: (1) awaiting the task to ensure cleanup completes before sending the Ack, or (2) at minimum documenting why fire-and-forget is acceptable in this deletion flow.

Suggested change

tokio::spawn({

let pod_name = name.clone();

async move {

deregister_pod_probes(&pod_name).await;

}

});

let _ = send_uni(&connection, &RksMessage::Ack).await;

// Await probe deregistration before sending Ack.

match deregister_pod_probes(&name).await {

Ok(_) => {

let _ = send_uni(&connection, &RksMessage::Ack).await;

}

Err(e) => {

eprintln!(

"[worker] deregister_pod_probes for pod {} failed: {e:?}",

name

);

let _ = send_uni(

&connection,

&RksMessage::Error(format!(

"deregister probes for {name} failed: {e}"

)),

)

.await;

}

}

Copilot · 2025-11-04T08:16:43Z

project/rkl/src/commands/pod/mod.rs

 pub mod cluster;
 pub mod standalone;

+#[allow(dead_code)]


The #[allow(dead_code)] attribute on a newly introduced public struct suggests unused fields. Since this struct is used in the return type of run_pod_from_taskrunner and its fields are accessed in multiple call sites (static_pods.rs:81, client.rs:338-342), this attribute appears unnecessary and should be removed.

Suggested change

#[allow(dead_code)]

Signed-off-by: Luxian <[email protected]>

Copilot

Pull Request Overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Copilot · 2025-11-05T03:17:30Z

project/rkl/src/daemon/probe.rs

+
+/// Returns a tokio interval for the given period, enforcing a minimum of 1 second.  
+fn new_interval(period: Duration) -> Interval {
+    let mut interval = tokio::time::interval(period);


The comment states 'enforcing a minimum of 1 second', but the implementation does not actually enforce any minimum period. Either update the implementation to enforce this constraint or correct the comment to remove the incorrect claim.

Suggested change

let mut interval = tokio::time::interval(period);

let min_period = Duration::from_secs(1);

let mut interval = tokio::time::interval(period.max(min_period));

Copilot · 2025-11-05T03:17:31Z

project/rkl/src/commands/pod/mod.rs

+    #[allow(unused)]
+    pub pod_sandbox_id: String,
+    pub pod_ip: String,
+    #[allow(unused)]
+    pub container_names: Vec<String>,


The pod_sandbox_id and container_names fields are marked with #[allow(unused)]. Since this is a public API, consider removing these fields if they are truly unused, or remove the attribute if they will be used in the future. Keeping unused public fields in a struct makes the API unclear.

Suggested change

#[allow(unused)]

pub pod_sandbox_id: String,

pub pod_ip: String,

#[allow(unused)]

pub container_names: Vec<String>,

pub pod_ip: String,

Signed-off-by: Luxian <[email protected]>

Copilot

Pull Request Overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 1 comment.

Copilot · 2025-11-05T07:22:02Z

project/slayerfs/examples/persistence_s3_demo.rs

 #[tokio::main]
 async fn main() -> Result<(), Box<dyn std::error::Error>> {
-    // 加载 .env 文件
+    // load .env 


Incomplete comment: 'load .env' should be 'load .env file' for clarity.

Suggested change

// load .env

// load .env file

Ivanbeethoven marked this pull request as draft November 4, 2025 07:34

Ivanbeethoven force-pushed the main branch from 51ee32b to 5ade7e4 Compare November 4, 2025 08:08

genedna requested a review from Copilot November 4, 2025 08:13

Copilot AI reviewed Nov 4, 2025

View reviewed changes

Ivanbeethoven force-pushed the main branch from 5ade7e4 to 41ff1a0 Compare November 4, 2025 08:32

feat(rkl): Add the container probe functionality.

f84b345

Signed-off-by: Luxian <[email protected]>

Ivanbeethoven force-pushed the main branch from 41ff1a0 to f84b345 Compare November 5, 2025 03:14

genedna requested a review from Copilot November 5, 2025 03:14

Copilot AI reviewed Nov 5, 2025

View reviewed changes

Ivanbeethoven marked this pull request as ready for review November 5, 2025 03:35

Ivanbeethoven requested a review from ichAB November 5, 2025 03:36

doc(slayerfs): translate comment to EN.

bdb30ee

Signed-off-by: Luxian <[email protected]>

genedna requested a review from Copilot November 5, 2025 07:20

Copilot AI reviewed Nov 5, 2025

View reviewed changes

fix typo

e68cfb3

genedna assigned Ivanbeethoven Nov 5, 2025

-        let mut registry = PROBE_REGISTRY.lock().unwrap_or_else(|err| err.into_inner());
+        let mut registry = match PROBE_REGISTRY.lock() {
+            Ok(guard) => guard,
+            Err(poisoned) => {
+                warn!(
+                    pod = pod_name,
+                    "probe registry mutex was poisoned; recovering inner data"
+                );
+                poisoned.into_inner()
+            }
+        };

+/// Returns a tokio interval for the given period, enforcing a minimum of 1 second.
+///
+/// This constraint differs from Kubernetes defaults, which allow sub-second probe periods.
+/// Any configured period less than 1 second will be silently increased to 1 second.
+/// This is done for resource and implementation reasons.


		stream.write_all(request.as_bytes()).await?;

		let mut buf = vec![0u8; 512];

		tokio::runtime::Handle::try_current()
		.context("probe registration requires an active Tokio runtime")?;

	let mut interval = tokio::time::interval(period);
	let min_period = Duration::from_secs(1);
	let mut interval = tokio::time::interval(period.max(min_period));

feat(rkl): Add the container probe functionality. #249

Are you sure you want to change the base?

feat(rkl): Add the container probe functionality. #249

Conversation

Ivanbeethoven commented Nov 4, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant