-
Notifications
You must be signed in to change notification settings - Fork 36
feat(rkl): Add the container probe functionality. #249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds comprehensive health probe support (liveness, readiness, and startup probes) to the RKL container orchestration system, bringing it closer to Kubernetes compatibility. The implementation includes probe execution (HTTP, TCP, Exec), state management, and integration with pod lifecycle operations.
- Introduces a new probe subsystem in
project/rkl/src/daemon/probe.rswith support for HTTP, TCP, and Exec probe types - Integrates probes into pod startup and deletion workflows in both static pod and client/worker modes
- Extends the common data model with probe specifications and status reporting
- Updates the pod state command to display probe status information
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| project/rkl/src/daemon/probe.rs | New 861-line module implementing probe execution, state management, and registry |
| project/rkl/src/daemon/static_pods.rs | Integrates probe registration on pod start and deregistration on pod deletion |
| project/rkl/src/daemon/client.rs | Adds probe handling to worker pod lifecycle (run/delete operations) |
| project/rkl/src/daemon/mod.rs | Exposes the new probe module |
| project/rkl/src/commands/pod/standalone.rs | Adds probe status output to the pod state command |
| project/rkl/src/commands/pod/mod.rs | Refactors return type to include probe-relevant data (pod IP, task) |
| project/rkl/src/commands/container/mod.rs | Adds probe fields to ContainerSpec initialization |
| project/rkl/src/commands/compose/mod.rs | Adds probe fields to ContainerSpec initialization |
| project/common/src/lib.rs | Extends data model with Probe, ProbeAction, and status types |
project/rkl/src/daemon/probe.rs
Outdated
| } | ||
|
|
||
| let old = { | ||
| let mut registry = PROBE_REGISTRY.lock().unwrap_or_else(|err| err.into_inner()); |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unwrap_or_else(|err| err.into_inner()) pattern recovers from a poisoned mutex by extracting the inner data. While this prevents panics, it silently ignores poisoning which could indicate another thread panicked while holding the lock. This is risky for shared state integrity. Consider either: (1) returning an error to the caller, (2) logging a warning when poison is detected, or (3) documenting why it's safe to recover from poison in this specific case.
| let mut registry = PROBE_REGISTRY.lock().unwrap_or_else(|err| err.into_inner()); | |
| let mut registry = match PROBE_REGISTRY.lock() { | |
| Ok(guard) => guard, | |
| Err(poisoned) => { | |
| warn!( | |
| pod = pod_name, | |
| "probe registry mutex was poisoned; recovering inner data" | |
| ); | |
| poisoned.into_inner() | |
| } | |
| }; |
| } | ||
| } | ||
| } | ||
|
|
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function silently enforces a minimum probe period of 1 second. This constraint should be documented either in the function doc comment or inline, as it affects probe configuration behavior and differs from Kubernetes defaults (which allow sub-second periods).
| /// Returns a tokio interval for the given period, enforcing a minimum of 1 second. | |
| /// | |
| /// This constraint differs from Kubernetes defaults, which allow sub-second probe periods. | |
| /// Any configured period less than 1 second will be silently increased to 1 second. | |
| /// This is done for resource and implementation reasons. |
project/rkl/src/daemon/probe.rs
Outdated
|
|
||
| stream.write_all(request.as_bytes()).await?; | ||
|
|
||
| let mut buf = vec![0u8; 512]; |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 512-byte buffer size for HTTP responses is a magic number that may be insufficient for HTTP headers with large cookies or custom headers. Consider either: (1) extracting this as a named constant (e.g., HTTP_PROBE_BUFFER_SIZE), or (2) documenting why 512 bytes is sufficient for probe health checks (since only the status line is typically needed).
project/rkl/src/daemon/probe.rs
Outdated
| tokio::runtime::Handle::try_current() | ||
| .context("probe registration requires an active Tokio runtime")?; | ||
|
|
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function checks for a Tokio runtime but doesn't use the returned handle. The check appears to be a validation that probes can be spawned, but spawn_probe (line 571) will spawn on the current runtime regardless. Either: (1) remove this check if spawn_probe will fail appropriately on its own, or (2) store the handle and use it explicitly via handle.spawn() to make the dependency clear.
| tokio::runtime::Handle::try_current() | |
| .context("probe registration requires an active Tokio runtime")?; |
| tokio::spawn({ | ||
| let pod_name = pod.metadata.name.clone(); | ||
| async move { | ||
| deregister_pod_probes(&pod_name).await; | ||
| } | ||
| }); |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spawned task for probe deregistration is detached and its result is never checked. If deregistration fails or panics, it will be silent. Consider either: (1) using tokio::spawn with .await to ensure cleanup completes before returning, or (2) storing the JoinHandle and awaiting it, or (3) at minimum documenting why fire-and-forget is acceptable here.
| eprintln!( | ||
| "[worker] registering probes for pod {} failed: {err:?}", | ||
| pod_name | ||
| ); |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using eprintln! for error logging is inconsistent with the tracing infrastructure used elsewhere in the codebase (imported on line 8). Use tracing::error! or tracing::warn! instead for consistent structured logging and observability.
| eprintln!( | ||
| "[worker] building probe registrations for pod {} failed: {err:?}", | ||
| pod_name | ||
| ); |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using eprintln! for error logging is inconsistent with the tracing infrastructure used elsewhere in the codebase. Use tracing::error! or tracing::warn! instead for consistent structured logging and observability.
project/rkl/src/daemon/client.rs
Outdated
| tokio::spawn({ | ||
| let pod_name = name.clone(); | ||
| async move { | ||
| deregister_pod_probes(&pod_name).await; | ||
| } | ||
| }); | ||
| let _ = send_uni(&connection, &RksMessage::Ack).await; |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spawned task for probe deregistration is detached and its result is never checked. If deregistration fails or panics, it will be silent. Consider either: (1) awaiting the task to ensure cleanup completes before sending the Ack, or (2) at minimum documenting why fire-and-forget is acceptable in this deletion flow.
| tokio::spawn({ | |
| let pod_name = name.clone(); | |
| async move { | |
| deregister_pod_probes(&pod_name).await; | |
| } | |
| }); | |
| let _ = send_uni(&connection, &RksMessage::Ack).await; | |
| // Await probe deregistration before sending Ack. | |
| match deregister_pod_probes(&name).await { | |
| Ok(_) => { | |
| let _ = send_uni(&connection, &RksMessage::Ack).await; | |
| } | |
| Err(e) => { | |
| eprintln!( | |
| "[worker] deregister_pod_probes for pod {} failed: {e:?}", | |
| name | |
| ); | |
| let _ = send_uni( | |
| &connection, | |
| &RksMessage::Error(format!( | |
| "deregister probes for {name} failed: {e}" | |
| )), | |
| ) | |
| .await; | |
| } | |
| } |
project/rkl/src/commands/pod/mod.rs
Outdated
| pub mod cluster; | ||
| pub mod standalone; | ||
|
|
||
| #[allow(dead_code)] |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The #[allow(dead_code)] attribute on a newly introduced public struct suggests unused fields. Since this struct is used in the return type of run_pod_from_taskrunner and its fields are accessed in multiple call sites (static_pods.rs:81, client.rs:338-342), this attribute appears unnecessary and should be removed.
| #[allow(dead_code)] |
Signed-off-by: Luxian <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
|
|
||
| /// Returns a tokio interval for the given period, enforcing a minimum of 1 second. | ||
| fn new_interval(period: Duration) -> Interval { | ||
| let mut interval = tokio::time::interval(period); |
Copilot
AI
Nov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment states 'enforcing a minimum of 1 second', but the implementation does not actually enforce any minimum period. Either update the implementation to enforce this constraint or correct the comment to remove the incorrect claim.
| let mut interval = tokio::time::interval(period); | |
| let min_period = Duration::from_secs(1); | |
| let mut interval = tokio::time::interval(period.max(min_period)); |
| #[allow(unused)] | ||
| pub pod_sandbox_id: String, | ||
| pub pod_ip: String, | ||
| #[allow(unused)] | ||
| pub container_names: Vec<String>, |
Copilot
AI
Nov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pod_sandbox_id and container_names fields are marked with #[allow(unused)]. Since this is a public API, consider removing these fields if they are truly unused, or remove the attribute if they will be used in the future. Keeping unused public fields in a struct makes the API unclear.
| #[allow(unused)] | |
| pub pod_sandbox_id: String, | |
| pub pod_ip: String, | |
| #[allow(unused)] | |
| pub container_names: Vec<String>, | |
| pub pod_ip: String, |
Signed-off-by: Luxian <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 32 out of 32 changed files in this pull request and generated 1 comment.
| #[tokio::main] | ||
| async fn main() -> Result<(), Box<dyn std::error::Error>> { | ||
| // 加载 .env 文件 | ||
| // load .env |
Copilot
AI
Nov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete comment: 'load .env' should be 'load .env file' for clarity.
| // load .env | |
| // load .env file |
No description provided.