Feat/k3s autoscaler #287
Conversation
* Implement NodePool CRD for manual and dynamic node provisioning
* Add ScalingPolicy CRD for automated GPU node scaling decisions
* Create SSH-based node provisioner with K3s installation support
* Implement WireGuard VPN setup for secure cluster networking
* Add leader election for high availability
* Implement health/readiness probes and Prometheus metrics
* Add SecureCloud API client for dynamic node provisioning
* Create unified error handling with AutoscalerError type
* Support configurable reconciliation intervals and phase timeouts

* Create multi-stage Dockerfile with rust:1.88 builder
* Add build.sh script with --image-name and --image-tag options
* Support debug/release build modes via --debug flag
* Configure non-root runtime user for security
* Install openssh-client for node provisioning via SSH

* Define NodePool and ScalingPolicy CRDs with comprehensive validation
* Add ClusterRole and Role RBAC for node, pod, and secret access
* Configure leader election lease permissions
* Create Deployment with security context (non-root, read-only fs)
* Add ConfigMap with default autoscaler configuration
* Include ServiceMonitor for Prometheus metrics scraping
* Add PodDisruptionBudget for high availability
* Set up namespaces for autoscaler and SSH keys storage

* Create autoscaler.yml playbook for K3s deployment
* Apply CRDs, RBAC, and deployment manifests in order
* Support configurable autoscaler_image variable
* Wait for CRD establishment before proceeding
* Verify rollout completion with 300s timeout
* Display pod status after deployment
Caution: Review failed. The pull request is closed.

Walkthrough: Adds a new workspace crate, basilica-autoscaler.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant K8s as Kubernetes API
participant Controller as NodePool Controller
participant KubeClient as KubeClient
participant SecureCloud as SecureCloud API
participant Provisioner as SSH Provisioner
participant Node as Remote Node
rect rgba(200,220,255,0.6)
Note over Controller,KubeClient: Reconcile — load NodePool
Controller->>KubeClient: Get NodePool(ns,name)
KubeClient->>K8s: GET NodePool
K8s-->>KubeClient: NodePool spec/status
KubeClient-->>Controller: NodePool
end
rect rgba(220,255,220,0.6)
Note over Controller,SecureCloud: Dynamic provisioning
Controller->>SecureCloud: start_rental(offering,ssh_key)
SecureCloud-->>Controller: RentalInfo (external_ip,ssh_port)
Controller->>KubeClient: Patch NodePool status (rental_id, external_ip)
end
rect rgba(255,240,200,0.6)
Note over Controller,Provisioner: Configure node & WireGuard
Controller->>Provisioner: configure_base_system(host,port,ssh_cfg)
Provisioner->>Node: SSH run scripts
Node-->>Provisioner: output
Provisioner-->>Controller: done
Controller->>Provisioner: install_wireguard(...)
Provisioner-->>Controller: public_key
Controller->>SecureCloud: register_wireguard_key(node_id, public_key)
SecureCloud-->>Controller: registration response
Controller->>SecureCloud: get_peers(node_id)
SecureCloud-->>Controller: peers
Controller->>Provisioner: configure_wireguard_peers(peers)
end
rect rgba(255,220,220,0.6)
Note over Controller,KubeClient: Join cluster → Ready
Controller->>Provisioner: install_k3s_agent(..., token)
Provisioner->>Node: install k3s agent
Provisioner-->>Controller: agent running
Controller->>KubeClient: Wait for Node readiness
KubeClient->>K8s: GET Node status
K8s-->>KubeClient: Node Ready
KubeClient-->>Controller: Node ready -> Patch NodePool Ready
end
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
Actionable comments posted: 17
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🟡 Minor comments (7)
crates/basilica-autoscaler/Cargo.toml (1)

37-42: Ensure the `tokio-util` workspace dependency includes the "rt" feature. The kube/k8s-openapi version compatibility is correct. However, `tokio-util` is specified in the autoscaler with features `["rt"]`, while the workspace dependency defines it without features. Either add `features = ["rt"]` to the workspace dependency definition or remove the feature specification here to use the workspace version. The `prometheus` crate is legitimately used directly (not just through `metrics-exporter-prometheus`) for metrics instrumentation in `metrics.rs`, so it should remain as a direct dependency rather than a workspace dependency.

crates/basilica-autoscaler/src/crd/node_pool.rs (1)
212-262: Type mismatch: `delete_grace_period_seconds` is `u32` in Rust but `i64` in NodePoolStatus. In `LifecycleConfig` (line 226), `delete_grace_period_seconds` is `u32`, but in `NodePoolStatus` (as referenced in the relevant snippets), there's a reference to `delete_grace_period_seconds: i64`. Additionally, the Kubernetes API uses `i64` for grace periods in `DeleteOptions`.

Consider using `i64` for consistency with Kubernetes API conventions:

```diff
 /// Delete grace period in seconds
 #[serde(default = "default_delete_grace_period")]
-pub delete_grace_period_seconds: u32,
+pub delete_grace_period_seconds: i64,
```

And update the default function accordingly:

```diff
-fn default_delete_grace_period() -> u32 {
+fn default_delete_grace_period() -> i64 {
     30
 }
```

orchestrator/k8s/services/autoscaler/deployment.yaml (1)
196-205: PDB with `minAvailable: 1` conflicts with single-replica deployment. With `replicas: 1` and `minAvailable: 1`, the PDB will block any voluntary disruption (including node drains and rolling updates). Consider either increasing replicas or using `maxUnavailable: 0`, which has similar semantics but is more explicit about the intent.

Either accept that voluntary disruptions will be blocked, or adjust:

```diff
 spec:
-  minAvailable: 1
+  maxUnavailable: 0  # More explicit: block all voluntary disruptions
   selector:
```

Or add a comment explaining the intent:

```diff
 spec:
+  # Single replica - intentionally blocks voluntary disruptions
+  # to maintain leader election continuity
   minAvailable: 1
```

orchestrator/k8s/services/autoscaler/crds.yaml (1)
369-374: ScalingPolicy `minNodes` and `maxNodes` lack validation for min <= max. Users could set `minNodes: 10` and `maxNodes: 5`, which would be invalid. Consider adding a CEL validation rule.

Add a validation rule:

```diff
 spec:
   type: object
+  x-kubernetes-validations:
+    - rule: "!has(self.minNodes) || !has(self.maxNodes) || self.minNodes <= self.maxNodes"
+      message: "minNodes must be less than or equal to maxNodes"
   properties:
```

Committable suggestion skipped: line range outside the PR's diff.
crates/basilica-autoscaler/src/provisioner/ssh.rs (1)

404-440: Appending peers creates duplicates on re-run. Each call appends new `[Peer]` sections without checking if the peer already exists. Re-running this method will create duplicate entries, which WireGuard may reject.

Consider regenerating the entire config file or checking for existing peers before appending.
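A minimal sketch of such a check, written as a pure function over the existing config text so it can be unit tested; the `Peer` struct and function name here are illustrative stand-ins, not the crate's actual `WireGuardPeer` API:

```rust
/// Only emit a [Peer] section when the peer's public key is not already
/// present in the existing config, making repeated runs a no-op.
struct Peer {
    public_key: String,
    allowed_ips: String,
}

fn peer_section_if_missing(existing_config: &str, peer: &Peer) -> Option<String> {
    // A peer is considered present if any PublicKey line ends with its key.
    let already_present = existing_config.lines().any(|line| {
        line.contains("PublicKey") && line.trim().ends_with(peer.public_key.as_str())
    });
    if already_present {
        return None; // re-running the provisioner appends nothing
    }
    Some(format!(
        "\n[Peer]\nPublicKey = {}\nAllowedIPs = {}\n",
        peer.public_key, peer.allowed_ips
    ))
}

fn main() {
    let existing = "[Interface]\nPrivateKey = abc\n\n[Peer]\nPublicKey = KEY1\nAllowedIPs = 10.0.0.2/32\n";
    let peer = Peer { public_key: "KEY1".into(), allowed_ips: "10.0.0.2/32".into() };
    assert!(peer_section_if_missing(existing, &peer).is_none());
}
```

The same idea works for the full-regeneration approach: build the complete peer list first, then write the file once instead of appending.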
crates/basilica-autoscaler/src/provisioner/ssh.rs (1)

383-390: Appending to sysctl.conf creates duplicates on re-run. Each invocation appends `net.ipv4.ip_forward=1` to `/etc/sysctl.conf`, creating duplicate entries if the provisioner runs multiple times.

Use a drop-in file or check for existing entries:

```diff
-        self.execute_ssh_command(
-            host,
-            port,
-            ssh_config,
-            "echo 'net.ipv4.ip_forward=1' >> /etc/sysctl.conf && sysctl -p",
-        )
+        self.execute_ssh_command(
+            host,
+            port,
+            ssh_config,
+            "echo 'net.ipv4.ip_forward=1' > /etc/sysctl.d/99-wireguard.conf && sysctl --system",
+        )
```

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)
319-353: Pools without existing status are silently skipped during scale-down. If `pool.status` is `None` (line 343), the pool won't be transitioned to the `Draining` state. This could leave pools that should be drained in an inconsistent state.

Consider creating a default status when none exists:

```diff
-        for pool in candidates.into_iter().take(count as usize) {
-            let pool_name = pool.name_any();
-            info!(pool = %pool_name, "Initiating drain for scale-down");
-
-            if let Some(mut status) = pool.status.clone() {
-                status.phase = Some(NodePoolPhase::Draining);
-                status.phase_entered_at = Some(Utc::now());
-                self.k8s
-                    .update_node_pool_status(ns, &pool_name, status)
-                    .await?;
-            }
-        }
+        for pool in candidates.into_iter().take(count as usize) {
+            let pool_name = pool.name_any();
+            info!(pool = %pool_name, "Initiating drain for scale-down");
+
+            let mut status = pool.status.clone().unwrap_or_default();
+            status.phase = Some(NodePoolPhase::Draining);
+            status.phase_entered_at = Some(Utc::now());
+            self.k8s
+                .update_node_pool_status(ns, &pool_name, status)
+                .await?;
+        }
```

Committable suggestion skipped: line range outside the PR's diff.
🧹 Nitpick comments (42)
crates/basilica-autoscaler/src/leader_election.rs (2)
116-129: Initial delay before first lease attempt. The loop sleeps for `config.retry_period` before the first lease acquisition attempt. On startup, this means leader election won't begin immediately. Consider using `tokio::time::interval` or attempting acquisition before entering the loop.

```diff
 loop {
+    // Attempt lease acquisition/renewal first, then wait
+    match try_acquire_or_renew(&api, &config, &identity).await {
+        // ... handle result
+    }
+
     tokio::select! {
         _ = shutdown_rx.changed() => {
             // ... shutdown handling
         }
-        _ = tokio::time::sleep(config.retry_period) => {
-            match try_acquire_or_renew(&api, &config, &identity).await {
+        _ = tokio::time::sleep(config.retry_period) => {}
     }
 }
```

72-83: Spawned task is not tracked or awaited. The spawned `leader_election_loop` task handle is dropped immediately. If the loop panics, the error won't propagate. Consider storing the `JoinHandle` and checking it periodically or on shutdown.

```diff
 pub struct LeaderElector {
     // ...
     shutdown_tx: Option<watch::Sender<bool>>,
+    task_handle: Option<tokio::task::JoinHandle<()>>,
 }

 // In start():
-tokio::spawn(async move {
+let handle = tokio::spawn(async move {
     leader_election_loop(/* ... */).await;
 });
+self.task_handle = Some(handle);
```

crates/basilica-autoscaler/src/main.rs (1)
17-17: Preserve error context when converting errors. Using `anyhow::anyhow!("{}", e)` loses the error chain. Use `.context()` or `anyhow::Error::from()` to preserve the full error information for debugging.

```diff
-    runtime.run().await.map_err(|e| anyhow::anyhow!("{}", e))
+    runtime.run().await.map_err(|e| anyhow::anyhow!(e).context("controller runtime failed"))
```

Or if the runtime error implements `std::error::Error`:

```diff
-    runtime.run().await.map_err(|e| anyhow::anyhow!("{}", e))
+    runtime.run().await.map_err(anyhow::Error::from)
```

scripts/autoscaler/Dockerfile (2)
23-40: Clarify cleanup chaining for readability and future maintenance. The `rm -rf`/`find` cleanup chain mixes `&&` and `||`:

```sh
... && \
rm -rf ... 2>/dev/null || true && \
find /tmp/target -name "*.rlib" -delete 2>/dev/null || true
```

This works, but the precedence is non-obvious (`&&`/`||` left-associative), which makes it easy to misread. Consider explicitly grouping and separating the commands to make intent clearer, e.g.:

```dockerfile
RUN if [ "$BUILD_MODE" = "release" ]; then \
        cargo build --locked --release -p basilica-autoscaler; \
    else \
        cargo build --locked -p basilica-autoscaler; \
    fi && \
    (rm -rf /tmp/target/debug/deps /tmp/target/debug/incremental \
        /tmp/target/release/deps /tmp/target/release/incremental 2>/dev/null || true) && \
    (find /tmp/target -name "*.rlib" -delete 2>/dev/null || true)
```

Same behavior, but easier to reason about.

45-57: Align build-time and runtime expectations for BUILD_MODE. The runtime stage also declares `ARG BUILD_MODE=release` and copies:

```dockerfile
COPY --from=builder /tmp/target/${BUILD_MODE}/basilica-autoscaler /usr/local/bin/basilica-autoscaler
```

This is correct as long as callers always pass a matching `--build-arg BUILD_MODE=...` to both stages (which Docker does), and only `debug`/`release` are used. If you ever add more modes or change the target dir layout, this tight coupling can bite. A slightly more defensive pattern is to normalize the build output (e.g., always copy from `/tmp/target/release` and only tweak optimization flags) or validate `BUILD_MODE` in the builder RUN step and fail fast on unknown values.

orchestrator/ansible/playbooks/02-deploy/autoscaler.yml (1)
7-22: Confirm repo_root_final points at the path you expect. `_detected_repo_root` walks three directories up from `playbook_dir`, so by default `repo_root_final` resolves to the `orchestrator` directory, and the copy task uses:

```yaml
src: "{{ repo_root_final }}/k8s/services/autoscaler/"
```

This assumes `k8s/services/autoscaler` lives directly under `orchestrator/`. If the actual repo layout differs (e.g., manifests live under `orchestrator/k8s/...` but `_detected_repo_root` should really be the repository root), this path could silently break.

Worth double-checking this against your other playbooks and adjusting the number of `dirname` hops if needed.

crates/basilica-autoscaler/src/lib.rs (1)
1-17: Crate root API surface is cohesive and clear. The crate root exposes the main functional areas as modules and re-exports the key types (`AutoscalerConfig`, `AutoscalerError`, `Result`, `AutoscalerK8sClient`, `KubeClient`, `NodeProvisioner`, `SshProvisioner`). This should make downstream usage straightforward (`basilica_autoscaler::{AutoscalerConfig, Result, ...}`) without having to dig into internal modules.

Once `ControllerRuntime` stabilizes, you may also want to re-export it from here for an even simpler `basilica_autoscaler::ControllerRuntime` entry point, but that's optional.

scripts/autoscaler/build.sh (2)
1-41: Tighten argument and shell robustness. Two small improvements here:

- The script currently uses `set -e` only. Adding `-u` and `-o pipefail` makes it more robust to unset variables and pipeline failures:

```sh
#!/bin/bash
set -euo pipefail
```

- `--image-name`/`--image-tag` assume a following argument. If the flag is last by mistake, `shift 2` will error out somewhat cryptically under `set -e`. You can make this failure mode clearer:

```sh
--image-name)
    if [[ $# -lt 2 ]]; then
        echo "Missing value for --image-name" >&2
        exit 1
    fi
    IMAGE_NAME="$2"
    shift 2
    ;;
```

Same idea for `--image-tag`.

7-8: Be aware of default image name vs deployment defaults. The build script defaults to:

```sh
IMAGE_NAME="basilica/basilica-autoscaler"
IMAGE_TAG="latest"
```

while the Ansible playbook defaults `autoscaler_image` to `ghcr.io/one-covenant/basilica-autoscaler:latest`. That's fine if you intentionally distinguish "local build name" from "published image name", but if the intention is to build exactly what you deploy via Ansible, consider aligning these defaults or documenting the difference to avoid confusion.

crates/basilica-autoscaler/src/runtime.rs (2)
322-325: The `shutdown` API is currently hard to use with `run(self)`. `ControllerRuntime::run(self)` consumes the runtime instance, while `shutdown(&self)` expects a shared reference. In the common pattern:

```rust
let runtime = ControllerRuntime::new(config);
runtime.run().await?;
```

there's no way to call `shutdown()` while `run` is in progress, so the `CancellationToken` path in the `select!` is effectively unreachable unless you redesign the caller.

If you intend to support programmatic shutdown (e.g., embedding the runtime in another service), consider changing one of:

- Make `run` take `&self` or `Arc<Self>` instead of `self`, or
- Expose a separate shutdown handle that wraps the `CancellationToken` and can be cloned and sent to other tasks (see the sketch below).

Right now the only practical shutdown trigger is `ctrl_c` or internal leader failures.
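A small sketch of the second option, assuming `tokio-util`'s `CancellationToken` (the same token the runtime already selects on); the type and method names are illustrative, not the crate's existing API:

```rust
use tokio_util::sync::CancellationToken;

/// Cloneable shutdown handle: callers keep the handle, the runtime keeps a
/// child token and selects on it inside `run`. Purely a sketch.
#[derive(Clone)]
pub struct ShutdownHandle {
    token: CancellationToken,
}

impl ShutdownHandle {
    pub fn new() -> Self {
        Self { token: CancellationToken::new() }
    }

    /// Token the runtime would await via `child_token().cancelled()`.
    pub fn child_token(&self) -> CancellationToken {
        self.token.child_token()
    }

    /// Any task holding a clone can trigger shutdown.
    pub fn shutdown(&self) {
        self.token.cancel();
    }
}
```

With this shape, `run(self)` can stay consuming: the caller clones the handle before handing the runtime off, and later calls `shutdown()` from any task.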
332-337: Test depends on `AutoscalerConfig::from_env` being infallible. The unit test uses:

```rust
let config = AutoscalerConfig::from_env();
```

If `from_env` ever becomes fallible (e.g., reading required env vars and panicking on missing ones), this test will be environment-dependent and potentially flaky in CI.

If `from_env` is already fallible, consider a `Result`-returning constructor and using a helper like `AutoscalerConfig::for_tests()` here instead, so tests don't rely on ambient environment configuration.

crates/basilica-autoscaler/src/controllers/k8s_client.rs (1)
225-232: Label selector may need escaping for special characters. If `key` or `value` contain special characters (e.g., `/`, `=`), the label selector format `key=value` could be malformed or misinterpreted.

Consider validating or escaping the key/value, or use `ListParams::default().labels_from()` if available in kube-rs for safer construction.
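One possible shape for the defensive option, shown as a small validation helper; the charset check is a simplification of the full Kubernetes label rules and the function is illustrative, not part of the crate:

```rust
/// Reject key/value strings outside a conservative label charset instead of
/// interpolating them blindly into "key=value". Real label validation also
/// splits the key into prefix/name and enforces 63-character name limits.
fn build_label_selector(key: &str, value: &str) -> Result<String, String> {
    fn ok(s: &str) -> bool {
        !s.is_empty()
            && s.len() <= 63
            && s.chars()
                .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.' | '/'))
    }
    if !ok(key) || (!value.is_empty() && !ok(value)) {
        return Err(format!("invalid label selector components: {key}={value}"));
    }
    Ok(format!("{key}={value}"))
}

fn main() {
    assert!(build_label_selector("nvidia.com/gpu", "true").is_ok());
    assert!(build_label_selector("bad key", "x=y,z").is_err());
}
```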
crates/basilica-autoscaler/src/crd/node_pool.rs (1)

334-449: Status struct is comprehensive, but consider using an `Option` wrapper for `failure_count`. The `failure_count` field defaults to 0, which is fine, but other integer fields like `gpu_count` use `Option<u32>`. For consistency, and to distinguish "never checked" from "zero failures," consider making it optional.

```diff
     /// Failure count
-    #[serde(default)]
-    pub failure_count: u32,
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub failure_count: Option<u32>,
```

orchestrator/k8s/services/autoscaler/crds.yaml (1)
122-124: Server URL pattern allows any `https://` prefix. The pattern `^https://.+` is minimal but functional. For production, consider adding validation for valid URL format or at least requiring a host portion.

orchestrator/k8s/services/autoscaler/deployment.yaml (2)
54-55: Using the `latest` tag and `imagePullPolicy: Always` may cause deployment instability. Using the `latest` tag can lead to unexpected behavior when images are updated. Consider using semantic versioning or commit SHA tags for reproducibility.

```diff
-        image: ghcr.io/one-covenant/basilica-autoscaler:latest
-        imagePullPolicy: Always
+        image: ghcr.io/one-covenant/basilica-autoscaler:v0.1.0  # Pin to specific version
+        imagePullPolicy: IfNotPresent
```

69-74: Duplicate leader election configuration in env vars and ConfigMap. Leader election settings are defined both as environment variables (lines 69-74) and in the ConfigMap (lines 177-183). This could lead to confusion about which takes precedence.

Choose one source of truth. If the application reads from the config file, remove the env vars:

```diff
-        - name: LEADER_ELECTION_LEASE_DURATION
-          value: "15s"
-        - name: LEADER_ELECTION_RENEW_DEADLINE
-          value: "10s"
-        - name: LEADER_ELECTION_RETRY_PERIOD
-          value: "2s"
```

crates/basilica-autoscaler/src/health.rs (2)
26-46: Consider batching state updates to reduce lock contention. Each setter acquires a write lock independently. If multiple state fields are updated together during reconciliation, consider adding a combined setter to reduce lock overhead:

```rust
pub async fn update_status(&self, ready: bool, leader: bool) {
    let mut status = self.inner.write().await;
    status.ready = ready;
    status.leader = leader;
}
```

88-102: Health server runs indefinitely without graceful shutdown support. The `axum::serve` call blocks forever. Consider accepting a shutdown signal for graceful termination:

```diff
 pub async fn start_health_server(
     host: &str,
     port: u16,
     state: HealthState,
+    shutdown: impl std::future::Future<Output = ()> + Send + 'static,
 ) -> Result<(), std::io::Error> {
     let addr = format!("{}:{}", host, port);
     info!(addr = %addr, "Starting health server");
     let app = health_router(state);
     let listener = tokio::net::TcpListener::bind(&addr).await?;
-    axum::serve(listener, app).await?;
+    axum::serve(listener, app)
+        .with_graceful_shutdown(shutdown)
+        .await?;
     Ok(())
 }
```

crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
42-43: Fixed 10-second sleep is fragile for installation verification. A hardcoded `sleep 10` may be insufficient on slower systems or excessive on fast ones. Consider polling the service status with a timeout loop instead:

```sh
# Wait for agent to be ready (up to 60 seconds)
for i in $(seq 1 12); do
    if systemctl is-active --quiet k3s-agent || systemctl is-active --quiet k3s; then
        exit 0
    fi
    sleep 5
done
echo "K3s agent failed to start"
exit 1
```

111-119: Version parsing may fail on non-standard output formats. The parser assumes the version is always the third whitespace-separated token. Consider making this more robust or documenting the expected format:

```diff
 pub fn parse_k3s_version(output: &str) -> Option<String> {
-    // K3s version format: "k3s version v1.27.4+k3s1 (hash)"
+    // Expected format: "k3s version v1.27.4+k3s1 (hash)"
+    // Returns None if format doesn't match
     output
         .lines()
         .next()
-        .and_then(|line| line.split_whitespace().nth(2))
+        .and_then(|line| {
+            let token = line.split_whitespace().nth(2)?;
+            if token.starts_with('v') { Some(token.to_string()) } else { None }
+        })
-        .map(|v| v.to_string())
 }
```

orchestrator/k8s/services/autoscaler/rbac.yaml (2)
85-108: Consider restricting the SSH keys Role to specific secret names. The Role grants `get` and `list` on all secrets in the `basilica-ssh-keys` namespace, which is broader than necessary. If the SSH key secret names are known, restrict access using `resourceNames`:

```diff
 rules:
   - apiGroups: [""]
     resources: ["secrets"]
-    verbs: ["get", "list"]
+    verbs: ["get"]
+    resourceNames:
+      - basilica-ssh-private-key
+      # Add other specific SSH key secret names
```

This follows the principle of least privilege and addresses the static analysis finding.

116-125: Redundant `create` verb in a resourceNames-scoped rule. The rule on lines 117-121 includes `create` with `resourceNames`, but `resourceNames` doesn't apply to `create` operations. The separate rule on lines 123-125 correctly handles initial lease creation. Remove `create` from the first rule:

```diff
 - apiGroups: ["coordination.k8s.io"]
   resources: ["leases"]
-  verbs: ["get", "create", "update", "patch"]
+  verbs: ["get", "update", "patch"]
   resourceNames:
     - basilica-autoscaler-leader
```

crates/basilica-autoscaler/src/metrics.rs (5)
46-54: Metric creation uses unwrap() which can panic. While unlikely to fail for fresh registries, consider propagating errors or using `expect()` with context for better debugging:

```diff
-        let node_pools_total = IntGauge::new(
-            "autoscaler_node_pools_total",
-            "Total number of NodePool resources",
-        )
-        .unwrap();
+        let node_pools_total = IntGauge::new(
+            "autoscaler_node_pools_total",
+            "Total number of NodePool resources",
+        )
+        .expect("failed to create node_pools_total metric");
```

This applies to all metric creations in this file.

65-72: Potential high cardinality on the `phase_transitions_total` metric. The `pool` label can grow unbounded with many NodePools. Consider using a fixed identifier or aggregating by phase only:

```rust
let phase_transitions_total = IntCounterVec::new(
    Opts::new(
        "autoscaler_phase_transitions_total",
        "Total number of phase transitions",
    ),
    &["to_phase"], // Remove "pool" label to prevent cardinality explosion
)
```

Alternatively, implement label cleanup when pools are deleted.

264-269: Debug formatting for phase may produce inconsistent label values. `format!("{:?}", phase)` produces Rust debug output which may include variant details. Consider using a stable string representation:

```diff
 pub fn record_phase_transition(&self, pool: &str, phase: &NodePoolPhase) {
-    let phase_str = format!("{:?}", phase);
+    let phase_str = phase.as_str(); // Implement Display or as_str() on NodePoolPhase
     self.phase_transitions_total
         .with_label_values(&[pool, &phase_str])
         .inc();
 }
```

290-293: The `rentals_active` gauge can go negative if `record_rental_stopped` is called without a matching start. Consider guarding against negative values:

```diff
 pub fn record_rental_stopped(&self, _pool: &str) {
     self.rentals_stopped_total.inc();
-    self.rentals_active.dec();
+    if self.rentals_active.get() > 0 {
+        self.rentals_active.dec();
+    }
 }
```

359-373: Metrics server also lacks graceful shutdown support. Same concern as the health server - consider adding shutdown signal support for clean termination during pod lifecycle events.
crates/basilica-autoscaler/src/config.rs (3)
76-88: Leader election timing constraints should be validated. The Kubernetes leader election algorithm requires `lease_duration > renew_deadline > retry_period`. Invalid configurations will cause runtime failures. Consider adding validation:

```rust
impl LeaderElectionConfig {
    fn from_env() -> Self {
        let config = Self { /* ... current impl ... */ };
        if config.enabled {
            assert!(
                config.lease_duration > config.renew_deadline,
                "lease_duration must be greater than renew_deadline"
            );
            assert!(
                config.renew_deadline > config.retry_period,
                "renew_deadline must be greater than retry_period"
            );
        }
        config
    }
}
```

157-168: Empty control plane IPs list could cause silent failures. If the environment variable is set to an empty string or whitespace-only, `control_plane_ips` will be empty, potentially causing network validation to pass vacuously. Consider logging a warning or validating:

```diff
 impl NetworkValidationConfig {
     fn from_env() -> Self {
         let default_ips = "10.200.0.1,10.200.0.2,10.200.0.3".to_string();
         let ips_str = std::env::var("BASILICA_AUTOSCALER_CONTROL_PLANE_IPS").unwrap_or(default_ips);
         let control_plane_ips: Vec<String> = ips_str
             .split(',')
             .map(|s| s.trim().to_string())
             .filter(|s| !s.is_empty())
             .collect();
+        if control_plane_ips.is_empty() {
+            tracing::warn!("No control plane IPs configured for network validation");
+        }
         Self { control_plane_ips }
     }
 }
```

206-211: `parse_bool_env` silently treats invalid values as `false`. Values like "yes", "YES", "True" (with capital T followed by lowercase) would return `false` instead of the default. Consider being more explicit:

```diff
 fn parse_bool_env(key: &str, default: bool) -> bool {
     std::env::var(key)
         .ok()
-        .map(|s| s.eq_ignore_ascii_case("true") || s == "1")
+        .map(|s| match s.to_lowercase().as_str() {
+            "true" | "1" | "yes" | "on" => true,
+            "false" | "0" | "no" | "off" => false,
+            _ => {
+                tracing::warn!(key = %key, value = %s, "Invalid boolean value, using default");
+                default
+            }
+        })
         .unwrap_or(default)
 }
```

crates/basilica-autoscaler/src/provisioner/wireguard.rs (2)
92-101: Hardcoded network interface and subnet mask may not match all deployments. The configuration hardcodes `eth0` for NAT masquerading and `/16` for the address subnet. These assumptions may not hold across different cloud providers or network configurations.

Consider making these configurable via `WireGuardConfig` or as parameters:

```diff
 pub fn generate_wg_config(
     private_key: &str,
     address: &str,
     listen_port: u16,
     interface_name: &str,
     peers: &[WireGuardPeer],
+    subnet_mask: u8,
+    external_iface: &str,
 ) -> String {
     let mut config = format!(
         r#"[Interface]
 PrivateKey = {}
-Address = {}/16
+Address = {}/{}
 ListenPort = {}
-PostUp = iptables -A FORWARD -i {} -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
-PostDown = iptables -D FORWARD -i {} -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
+PostUp = iptables -A FORWARD -i {} -j ACCEPT; iptables -t nat -A POSTROUTING -o {} -j MASQUERADE
+PostDown = iptables -D FORWARD -i {} -j ACCEPT; iptables -t nat -D POSTROUTING -o {} -j MASQUERADE
 "#,
-        private_key, address, listen_port, interface_name, interface_name
+        private_key, address, subnet_mask, listen_port, interface_name, external_iface, interface_name, external_iface
     );
```

17-17: Unused field `k8s` in `WireGuardInstaller`. The `k8s` field is stored in the struct but never used in the current implementation. Either remove it if not needed, or add a TODO comment explaining the planned usage.

crates/basilica-autoscaler/src/provisioner/ssh.rs (2)
481-486: `unwrap()` on `SystemTime` could panic in edge cases. `duration_since(UNIX_EPOCH).unwrap()` can panic if the system clock is set before the Unix epoch. Use `unwrap_or_default()` for robustness:

```diff
 let now = std::time::SystemTime::now()
     .duration_since(std::time::UNIX_EPOCH)
-    .unwrap()
+    .unwrap_or_default()
     .as_secs();
```

672-685: Test coverage is minimal. The test only verifies struct creation. Consider adding unit tests for error handling paths, particularly for SSH timeout scenarios.
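A sketch of what a timeout-path test could look like, using a stand-in error type and a helper that wraps a never-resolving future in `tokio::time::timeout`; it assumes tokio's `test-util` feature for paused time and does not reflect the provisioner's real API:

```rust
/// Hypothetical stand-in for the crate's SSH connection error variant.
#[derive(Debug, PartialEq)]
enum ConnectError {
    Timeout,
}

/// Wraps a future that never resolves (simulating an unreachable host) and
/// maps the elapsed timeout to a retryable connection error.
async fn connect_with_timeout(timeout: std::time::Duration) -> Result<(), ConnectError> {
    let never_resolves = std::future::pending::<()>();
    tokio::time::timeout(timeout, never_resolves)
        .await
        .map_err(|_| ConnectError::Timeout)
}

#[tokio::test(start_paused = true)] // requires tokio's "test-util" feature
async fn connect_times_out_and_maps_to_error() {
    let result = connect_with_timeout(std::time::Duration::from_secs(5)).await;
    assert_eq!(result, Err(ConnectError::Timeout));
}
```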
crates/basilica-autoscaler/src/error.rs (1)

132-145: Consider adding `CircuitBreakerOpen` to retryable errors. `CircuitBreakerOpen` indicates a transient condition that resolves after the circuit breaker's reset timeout. It should likely be retryable.

```diff
 pub fn is_retryable(&self) -> bool {
     matches!(
         self,
         Self::SshConnection { .. }
             | Self::SshExecution { .. }
             | Self::NetworkValidation(_)
             | Self::KubeApi(_)
             | Self::Http(_)
             | Self::SecureCloudApi(_)
             | Self::LeaderElection(_)
+            | Self::CircuitBreakerOpen(_)
     )
 }
```

crates/basilica-autoscaler/src/crd/scaling_policy.rs (1)

106-108: Consider adding validation for the `gpu_utilization_threshold` range. The threshold is an f32 without bounds validation. A value outside 0-100 would be nonsensical. Consider adding a validation webhook or documenting the expected range.
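A minimal sketch of the range check, assuming the threshold is a percentage; the function name and error type are illustrative and the check could equally live in a validating webhook:

```rust
/// Reject non-finite values and anything outside 0-100 before acting on a
/// ScalingPolicy; purely illustrative, not the crate's API.
fn validate_gpu_utilization_threshold(threshold: f32) -> Result<(), String> {
    if !threshold.is_finite() || !(0.0..=100.0).contains(&threshold) {
        return Err(format!(
            "gpu_utilization_threshold must be between 0 and 100, got {threshold}"
        ));
    }
    Ok(())
}

fn main() {
    assert!(validate_gpu_utilization_threshold(75.0).is_ok());
    assert!(validate_gpu_utilization_threshold(150.0).is_err());
    assert!(validate_gpu_utilization_threshold(f32::NAN).is_err());
}
```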
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (2)

158-189: Best-effort cleanup may leave orphaned resources. Errors during cleanup (delete_node, stop_rental, deregister_node) are logged but the finalizer is still removed. This prevents re-reconciliation and may leave orphaned resources.

Consider tracking cleanup failures in status and retrying before removing the finalizer, or at minimum emit a metric/event for alerting:

```diff
+    let mut cleanup_errors = Vec::new();
+
     if let Some(node_name) = &status.node_name {
         if let Err(e) = self.k8s.delete_node(node_name).await {
             warn!(node = %node_name, error = %e, "Failed to delete K8s node");
+            cleanup_errors.push(format!("node/{}: {}", node_name, e));
         }
     }
     // ... similar for other cleanup steps
+
+    if !cleanup_errors.is_empty() {
+        self.metrics.record_cleanup_failures(&name, &cleanup_errors);
+    }
```

234-234: Redundant import of `ResourceExt`. `kube::ResourceExt` is already imported at the module level (line 4).

```diff
-    use kube::ResourceExt;
```
165-211: Metrics collection silently ignores errors for node listing.Line 173 uses
unwrap_or_default()to swallow errors fromlist_nodes_with_label, and line 192 does the same forlist_node_pools. While this prevents reconciliation failures, it means the controller may make scaling decisions based on incomplete data (e.g., thinking there are 0 GPU nodes when the API call failed).Consider logging when these calls fail to aid debugging:
let nodes = self .k8s .list_nodes_with_label("nvidia.com/gpu", "true") .await - .unwrap_or_default(); + .unwrap_or_else(|e| { + warn!("Failed to list GPU nodes: {}", e); + Vec::new() + });
301-307: Timestamp-based naming could cause collisions across rapid scale-up calls.The pool name uses second-precision timestamps. While the index
iprevents collisions within a singlescale_upcall, two concurrent reconciliations or rapid successive calls in the same second could generate duplicate names. Kubernetes will reject duplicates, but this would cause the operation to fail.Consider adding a random suffix or using a more granular timestamp:
- let pool_name = format!( - "{}-{}-{}", - policy_name, - Utc::now().format("%Y%m%d%H%M%S"), - i - ); + let pool_name = format!( + "{}-{}-{:04x}", + policy_name, + Utc::now().format("%Y%m%d%H%M%S"), + rand::random::<u16>() + );
463-473: Tests are minimal; consider adding coverage for core logic.The current test only verifies
ScalingDecisionequality semantics. Key logic likeevaluate_scalingdecision boundaries, cooldown checks, and metrics evaluation would benefit from unit tests.Would you like me to generate additional unit tests for the
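A sketch of a cooldown-boundary test against a stand-alone helper that mirrors the check described above; `cooldown_elapsed` is hypothetical, not the controller's actual method:

```rust
use chrono::{Duration, Utc};

/// Scale actions are allowed only once the last scale event is older than the
/// cooldown window; this mirrors the cooldown check in spirit only.
fn cooldown_elapsed(last_scale: chrono::DateTime<Utc>, cooldown: Duration) -> bool {
    Utc::now() - last_scale >= cooldown
}

#[test]
fn scale_up_blocked_inside_cooldown_window() {
    let cooldown = Duration::seconds(300);
    let just_scaled = Utc::now() - Duration::seconds(30);
    let long_ago = Utc::now() - Duration::seconds(600);

    assert!(!cooldown_elapsed(just_scaled, cooldown)); // still cooling down
    assert!(cooldown_elapsed(long_ago, cooldown)); // window has passed
}
```

Decision-boundary tests for `evaluate_scaling` could follow the same pattern: fix the inputs (node counts, utilization), call the function, and assert on the returned `ScalingDecision`.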
crates/basilica-autoscaler/src/api/secure_cloud.rs (1)

296-310: Using `expect()` in constructors can cause panics in production. `reqwest::Client::builder().build()` can fail (e.g., TLS initialization issues). Using `expect()` will panic, which is problematic for library code.

Consider returning `Result` from constructors:

```diff
-    pub fn new(base_url: String, api_key: String, timeout: Duration) -> Self {
+    pub fn new(base_url: String, api_key: String, timeout: Duration) -> Result<Self> {
         let client = reqwest::Client::builder()
             .timeout(timeout)
-            .build()
-            .expect("Failed to create HTTP client");
+            .build()?;

-        Self {
+        Ok(Self {
             client,
             base_url,
             api_key,
             circuit_breaker: Arc::new(CircuitBreaker::new(CircuitBreakerConfig::default())),
-        }
+        })
     }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (30)
- Cargo.toml (1 hunks)
- crates/basilica-autoscaler/Cargo.toml (1 hunks)
- crates/basilica-autoscaler/src/api/mod.rs (1 hunks)
- crates/basilica-autoscaler/src/api/secure_cloud.rs (1 hunks)
- crates/basilica-autoscaler/src/bin/crdgen.rs (1 hunks)
- crates/basilica-autoscaler/src/config.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/k8s_client.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/mod.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/crd/mod.rs (1 hunks)
- crates/basilica-autoscaler/src/crd/node_pool.rs (1 hunks)
- crates/basilica-autoscaler/src/crd/scaling_policy.rs (1 hunks)
- crates/basilica-autoscaler/src/error.rs (1 hunks)
- crates/basilica-autoscaler/src/health.rs (1 hunks)
- crates/basilica-autoscaler/src/leader_election.rs (1 hunks)
- crates/basilica-autoscaler/src/lib.rs (1 hunks)
- crates/basilica-autoscaler/src/main.rs (1 hunks)
- crates/basilica-autoscaler/src/metrics.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/k3s.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/mod.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/ssh.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/wireguard.rs (1 hunks)
- crates/basilica-autoscaler/src/runtime.rs (1 hunks)
- orchestrator/ansible/playbooks/02-deploy/autoscaler.yml (1 hunks)
- orchestrator/k8s/services/autoscaler/crds.yaml (1 hunks)
- orchestrator/k8s/services/autoscaler/deployment.yaml (1 hunks)
- orchestrator/k8s/services/autoscaler/rbac.yaml (1 hunks)
- scripts/autoscaler/Dockerfile (1 hunks)
- scripts/autoscaler/build.sh (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (9)
crates/basilica-autoscaler/src/main.rs (2)
crates/basilica-autoscaler/src/runtime.rs (2)
fmt(162-164)new(32-37)crates/basilica-autoscaler/src/config.rs (8)
from_env(17-27)from_env(39-45)from_env(56-62)from_env(77-87)from_env(98-104)from_env(115-121)from_env(135-147)from_env(158-167)
crates/basilica-autoscaler/src/controllers/k8s_client.rs (3)
crates/basilica-api/src/server.rs (1)
std(1211-1211)crates/basilica-autoscaler/src/crd/node_pool.rs (2)
default(189-196)default(242-251)crates/basilica-autoscaler/src/crd/scaling_policy.rs (3)
default(79-85)default(124-132)default(210-215)
crates/basilica-autoscaler/src/crd/node_pool.rs (1)
crates/basilica-autoscaler/src/crd/scaling_policy.rs (2)
crd_metadata_is_correct(325-332)spec_has_required_fields(335-354)
crates/basilica-autoscaler/src/lib.rs (2)
crates/basilica-autoscaler/src/controllers/k8s_client.rs (1)
api(328-329)crates/basilica-common/src/ssh/connection.rs (1)
config(131-133)
crates/basilica-autoscaler/src/api/mod.rs (1)
orchestrator/cluster-manager/src/clustermgr/commands/topology.py (1)
WireGuardPeer(30-83)
crates/basilica-autoscaler/src/crd/scaling_policy.rs (1)
crates/basilica-autoscaler/src/crd/node_pool.rs (2)
default(189-196)default(242-251)
crates/basilica-autoscaler/src/error.rs (1)
crates/basilica-api/src/server.rs (1)
std(1211-1211)
crates/basilica-autoscaler/src/provisioner/ssh.rs (2)
crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
parse_k3s_version(112-119)verify_k3s_status(107-109)crates/basilica-autoscaler/src/provisioner/wireguard.rs (3)
generate_wg_config(85-117)parse_wg_show_output(120-156)new(27-33)
crates/basilica-autoscaler/src/provisioner/wireguard.rs (1)
crates/basilica-api/src/wireguard/config.rs (1)
allowed_ips(61-73)
🪛 Checkov (3.2.334)
orchestrator/k8s/services/autoscaler/rbac.yaml
[high] 71-84: No ServiceAccount/Node should be able to read all secrets
(CKV2_K8S_5)
[high] 96-109: No ServiceAccount/Node should be able to read all secrets
(CKV2_K8S_5)
🔇 Additional comments (45)
Cargo.toml (1)
19-19: LGTM! The new workspace member is correctly added and maintains the existing ordering.
crates/basilica-autoscaler/src/leader_election.rs (1)
310-322: LGTM!The unit test correctly validates the atomic behavior of the shared
is_leaderflag across clones.crates/basilica-autoscaler/src/bin/crdgen.rs (1)
4-9: LGTM!Clean utility for CRD generation. The use of
expect()is appropriate for a CLI tool where failure should be immediately visible. The YAML document separator (---) correctly separates multiple CRD documents.crates/basilica-autoscaler/src/main.rs (1)
7-10: LGTM!The tracing setup is clean with a sensible default log level for the autoscaler module. The
EnvFilter::from_default_env()allows runtime configuration viaRUST_LOG.crates/basilica-autoscaler/src/provisioner/mod.rs (1)
1-7: Provisioner module wiring looks consistentThe module declarations and re-exports line up cleanly with the crate root API surface (K3s installer, SSH provisioner, WireGuard installer). No issues from a structure or visibility standpoint.
crates/basilica-autoscaler/src/api/mod.rs (1)
1-7: API module re-exports are well-scopedRe-exporting the Secure Cloud types and client from
secure_cloudat this level gives a nice, flat API (basilica_autoscaler::api::{SecureCloudClient, GpuOffering, ...}) without over-exposing the underlying module structure. This looks good and keeps callers’ imports simple.crates/basilica-autoscaler/src/crd/mod.rs (1)
1-12: CRD re-exports provide a clean, centralized type surfaceRe-exporting the NodePool and ScalingPolicy types (plus their associated configs, status types, and
FINALIZER) from this module gives consumers a single import point for all CRD-related types. The selection of exported items looks intentional and avoids leaking internal details.No issues from a structure or visibility standpoint.
crates/basilica-autoscaler/src/controllers/k8s_client.rs (3)
1-63: Well-designed trait abstraction for Kubernetes operations.The
AutoscalerK8sClienttrait provides a clean, testable abstraction over Kubernetes API operations. TheSend + Syncbounds are appropriate for async runtimes.
310-341: Pod eviction implementation looks correct.The eviction uses the proper Eviction subresource API and correctly handles the 429 (Too Many Requests) status for PodDisruptionBudget violations by returning
Ok(false).
350-358: Good test forSend + Syncbounds.This compile-time test ensures the client can be safely shared across threads, which is essential for async Kubernetes controllers.
crates/basilica-autoscaler/src/crd/node_pool.rs (3)
1-65: Well-structured CRD definition with comprehensive spec fields.The NodePoolSpec is well-designed with clear separation of concerns for different provisioning modes (Manual/Dynamic), K3s configuration, and lifecycle management. The use of
serde(default)andskip_serializing_ifis appropriate for optional fields.
287-324: Comprehensive phase enum with clear lifecycle stages.The
NodePoolPhaseenum covers the full lifecycle fromPendingthroughReady,Unhealthy, and terminal states. TheDisplayimplementation is straightforward and correct.
496-540: Good test coverage for CRD metadata validation.The tests verify CRD name, group, kind, scope, and required fields. This aligns with similar tests in
scaling_policy.rs(per the relevant code snippets).orchestrator/k8s/services/autoscaler/crds.yaml (3)
45-49: Excellent use of CEL validation rules for mode-based requirements.The
x-kubernetes-validationsrules ensuresshis required for Manual mode andsecureCloudfor Dynamic mode. This provides server-side validation before the controller sees the resource.
405-437:nodeTemplate.secureCloudrequiressshKeyIdandsshKeySecretRefbut noofferingId.Unlike the NodePool CRD's
secureCloudwhich requiresofferingId, the ScalingPolicy'snodeTemplate.secureCloudomits it. Verify this is intentional—perhaps the autoscaler dynamically selects offerings based ongpuTypeand other criteria.
1-553: CRD manifests are comprehensive and well-structured.Both CRDs follow Kubernetes conventions with proper versioning, status subresources, printer columns, and extensive schema validation. The structure aligns well with the Rust type definitions.
crates/basilica-autoscaler/src/controllers/mod.rs (1)
1-7: Clean module organization with appropriate public exports.The module correctly declares submodules and re-exports the key types needed by consumers of the controllers module.
orchestrator/k8s/services/autoscaler/deployment.yaml (5)
1-16: Good namespace separation for security-sensitive resources.Creating a dedicated
basilica-ssh-keysnamespace withrestricted: "true"label enables fine-grained RBAC and PSP/PSA policies for SSH key secrets.
38-51: Strong security context with control-plane tolerations.The pod security settings (non-root user, seccompProfile, runAsNonRoot) are well configured. Tolerations for control-plane nodes allow scheduling on master nodes if needed.
111-120: Container security context is comprehensive and follows best practices.Dropping all capabilities, read-only root filesystem, and RuntimeDefault seccomp profile provide defense in depth.
121-124: PreStop hook withsleep 15aids graceful shutdown.This allows in-flight requests to complete and load balancers to drain connections before the container terminates.
207-221: ServiceMonitor correctly configured for Prometheus scraping.The 30-second scrape interval matches the metrics collection interval in the ConfigMap.
crates/basilica-autoscaler/src/health.rs (3)
1-11: LGTM! Clean health state implementation.The use of
Arc<RwLock<HealthStatus>>for thread-safe async state management is appropriate for this use case.
61-78: LGTM! Proper liveness and readiness semantics.Liveness always returns 200 (correct for K8s restart behavior), while readiness returns 503 when not ready, which will remove the pod from service endpoints appropriately.
104-165: LGTM! Good test coverage for health endpoints.Tests verify the key behaviors: liveness always returns OK, readiness returns 503 when not ready and 200 when ready.
crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
56-76: LGTM! Uninstall script handles edge cases well.The script correctly checks for both agent and server uninstall scripts, uses
2>/dev/null || trueto suppress errors for missing services, and cleans up residual directories.
78-103: LGTM! Label collection is straightforward.The function correctly builds the expected labels and handles optional GPU fields appropriately.
orchestrator/k8s/services/autoscaler/rbac.yaml (1)
1-54: LGTM! ClusterRole permissions are appropriately scoped.The permissions granted align with autoscaler functionality: CRD management, node operations, pod eviction, PDB reads, and event creation.
crates/basilica-autoscaler/src/metrics.rs (1)
375-397: LGTM! Test coverage validates metric creation and export format.Tests confirm metrics are properly registered and export valid Prometheus format.
crates/basilica-autoscaler/src/config.rs (2)
1-28: LGTM! Clean configuration structure with nested components.The hierarchical config design with
from_env()entry point is clean and maintainable.
213-231: LGTM! Tests validate sensible defaults.The tests confirm that default configuration values are reasonable and phase timeouts are set correctly.
crates/basilica-autoscaler/src/provisioner/wireguard.rs (1)
165-213: Good test coverage for utility functions.The tests adequately cover config generation and parsing for both empty and populated peer scenarios.
crates/basilica-autoscaler/src/error.rs (1)
1-130: Well-structured error hierarchy with clear categorization.The error enum is comprehensive and follows Rust best practices with
thiserror. TheFromimplementations enable ergonomic error propagation.crates/basilica-autoscaler/src/crd/scaling_policy.rs (2)
1-51: Well-designed CRD with sensible defaults and print columns.The ScalingPolicy CRD follows Kubernetes conventions with proper versioning, status subresource, and useful print columns.
319-365: Good test coverage for CRD metadata and defaults.Tests validate CRD naming, required fields, and sensible default values.
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (3)
50-70: Clean controller design with proper dependency injection.The generic controller with Arc-wrapped dependencies enables testability and follows the controller-runtime pattern.
93-134: Phase timeout handling with cleanup is well-structured.The timeout detection and phase-specific cleanup logic properly handles the various states and prevents re-entering cleanup.
1066-1098: Unit tests cover core helper functions.Tests verify phase timeout definitions and deterministic node ID generation properties.
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (4)
1-28: LGTM!Clean struct definition with appropriate trait bounds. The use of
Arcfor shared ownership and#[allow(dead_code)]annotation for the unusedapifield is noted - this suggests future integration work.
30-51: LGTM!Manual
Cloneimplementation correctly clonesArcreferences, and the constructor is straightforward.
355-398: LGTM!The template-to-NodePool conversion is well-structured with appropriate label management and configuration mapping.
408-439: LGTM!Thorough GPU detection covering both regular and init containers. The logic correctly checks for
nvidia.com/gpuin resource requests.crates/basilica-autoscaler/src/api/secure_cloud.rs (3)
12-51: LGTM on circuit breaker state design.The circuit breaker state machine and configuration structure are well-designed with sensible defaults.
379-414: LGTM on API method structure.The pattern of checking circuit, making requests, and recording outcomes is consistently applied. Good error handling with status code and body in error messages.
668-712: LGTM on unit tests.Good coverage for serialization/deserialization of API types.
* Rename autoscaler-crdgen binary to crdgen for consistency
* Regenerate CRDs from Rust derive macros with kube-rs
* Add schema descriptions for all spec and status fields
* Include proper nullable annotations for optional fields
* Add default values for configuration structs

* Add push.sh for tagging and pushing images to GHCR
* Add rollout.sh for restarting deployment in K3s cluster
* Support configurable source/target images and tags
* Add kubeconfig path and namespace options for rollout
* Include optional log following after rollout completes
Actionable comments posted: 0
🧹 Nitpick comments (2)
scripts/autoscaler/push.sh (1)
43-57: Image existence check works; consider `docker inspect` for robustness. The current approach using `docker images` with grep and regex anchors should work correctly. However, `docker inspect` would be semantically clearer and more direct for validating image existence, avoiding reliance on exact output formatting.

Optional refactor: use `docker inspect` instead of `docker images` and `grep`.

```diff
 # Check if source image exists
-if ! docker images --format "table {{.Repository}}:{{.Tag}}" | grep -q "^${SOURCE_IMAGE_FULL}$"; then
+if ! docker inspect "$SOURCE_IMAGE_FULL" &>/dev/null; then
     echo "Error: Source image $SOURCE_IMAGE_FULL not found" >&2
     echo "Please build the image first using: ./scripts/autoscaler/build.sh" >&2
     exit 1
 fi
```

This is cleaner and doesn't depend on output format assumptions.
scripts/autoscaler/rollout.sh (1)
5-5: Remove unused variable or clarify its purpose. The `PROJECT_ROOT` variable is defined but never used within this script. If it's intended for external consumption (e.g., when sourced by another script), please add a comment to clarify. Otherwise, remove the line.

If the variable is truly unused and not exported, apply this diff:

```diff
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
```

Alternatively, if it's intentionally exported for external use, add a comment above its definition:

```diff
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# Exported for use by external scripts
 PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- crates/basilica-autoscaler/Cargo.toml (1 hunks)
- orchestrator/k8s/services/autoscaler/crds.yaml (1 hunks)
- scripts/autoscaler/push.sh (1 hunks)
- scripts/autoscaler/rollout.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- crates/basilica-autoscaler/Cargo.toml
- orchestrator/k8s/services/autoscaler/crds.yaml
🧰 Additional context used
🪛 Shellcheck (0.11.0)
scripts/autoscaler/rollout.sh
[warning] 5-5: PROJECT_ROOT appears unused. Verify use (or export if used externally).
(SC2034)
🔇 Additional comments (4)
scripts/autoscaler/push.sh (2)
1-41: Well-structured argument parsing and setup. The option handling is clean, with proper shift semantics for flag-value pairs and helpful documentation via `--help`. Path resolution is correct and properly quoted.

59-65: Tag and push operations are solid. The commands are properly quoted and will correctly tag the local image and push to the target registry. Note that `docker push` will fail at runtime if the user is not authenticated to the target registry; consider documenting this prerequisite if needed.

scripts/autoscaler/rollout.sh (2)

1-93: Excellent script structure and error handling. The script provides clear argument parsing, proper validation of prerequisites (kubeconfig existence), informative output, and sound error handling via `set -e`. Variable quoting is consistent, and error messages are appropriately routed to stderr. The rollout logic is correct: restart → wait for completion → verify pod status → optionally tail logs.

Minor suggestion (optional): consider adding an early check for `kubectl` availability before attempting rollout to provide a clearer error message if the tool is missing:

```diff
+command -v kubectl &> /dev/null || { echo "Error: kubectl not found in PATH" >&2; exit 1; }
+
 export KUBECONFIG="$KUBECONFIG_PATH"
```

85-85: Label selector is correctly configured. Line 85's `-l app="$DEPLOYMENT"` correctly matches the pod labels in the deployment manifest. All deployments consistently use `app: <deployment-name>` as the pod label, so the selector will properly identify pods for the given deployment name.
* Add #[allow(unused_assignments)] to Ed25519PrivateKey struct
* Add #[allow(unused_assignments)] to Ed25519KeyPair struct
* Add #[allow(unused_assignments)] to P256PrivateKey struct
* Add #[allow(unused_assignments)] to P256KeyPair struct
* Fix false positives caused by ZeroizeOnDrop derive with #[zeroize(skip)] fields

…ilds
* Add sed exclusion for basilica-autoscaler in billing Dockerfile
* Add sed exclusion for basilica-autoscaler in payments Dockerfile
* Prevent cargo build failure when autoscaler crate is not in build context
Actionable comments posted: 0
🧹 Nitpick comments (4)
crates/basilica-common/src/crypto/p256.rs (2)
24-29: Document why `#[allow(unused_assignments)]` is needed (and ensure it's not masking real issues). Please add a brief comment explaining the exact warning this suppresses (e.g., a false positive from `#[derive(ZeroizeOnDrop)]` expansion) and keep it only if it's still reproducible.

38-44: Same concern for `P256KeyPair`: prefer justified/traceable lint suppression. If this is to silence macro-generated code warnings, a short note helps prevent future accidental masking of genuine unused-assignment bugs.

crates/basilica-common/src/crypto/ed25519.rs (2)

16-21: Add a short rationale for `#[allow(unused_assignments)]` on `Ed25519PrivateKey`. Helps ensure this is only suppressing a known macro/derive false-positive and not hiding a legitimate issue.

30-36: Add rationale / verify necessity for the `Ed25519KeyPair` lint suppression. If the warning is from `ZeroizeOnDrop` expansion, consider noting that explicitly to avoid future confusion.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- crates/basilica-common/src/crypto/ed25519.rs (2 hunks)
- crates/basilica-common/src/crypto/p256.rs (2 hunks)
- scripts/billing/Dockerfile (1 hunks)
- scripts/payments/Dockerfile (1 hunks)
🔇 Additional comments (2)
scripts/billing/Dockerfile (1)
35-35: LGTM! The addition correctly excludes the new `basilica-autoscaler` crate from the billing service Docker build, following the established pattern of workspace member removals (lines 25–36). The sed command syntax is correct and necessary since the billing service doesn't depend on the autoscaler.

scripts/payments/Dockerfile (1)

23-36: Good: keep Docker "pruned workspace" in sync with new workspace members (autoscaler).
Line 35 prevents the payments image build from breaking when `crates/basilica-autoscaler` exists in the workspace but isn't copied into the build context. The root `Cargo.toml` uses explicit member declarations (not glob patterns), so the `sed` approach will work correctly. `basilica-payments` has no dependency on `basilica-autoscaler`.
* Add sha2 workspace dependency for stable hashing across Rust versions
* Add base64 workspace dependency for safe file uploads via SSH
* These dependencies are required for security and stability fixes

* Add Runtime error variant to AutoscalerError enum
* Track fatal errors in tokio::select! branches for unexpected exits
* Return Err(AutoscalerError::Runtime) when controllers exit unexpectedly
* Return Err when leader election fails fatally
* Return Err when health or metrics servers fail or panic
* Enables Kubernetes restart policy to properly handle failures
* Replace heredoc file uploads with base64 encoding to prevent command injection
* Add sanitize_shell_arg() to filter dangerous characters from shell inputs
* Add shell_escape() for safe single-quote escaping of URLs and tokens
* Sanitize node_name, node_id, flannel_interface, and labels in K3s scripts
* Use safe slice for node_id to prevent panic on short strings
* Update tests to verify sanitization behavior

Security fixes for:
- Heredoc delimiter collision in SSH file uploads
- Shell injection in K3s installation script generation
- Potential panic on short node_id values
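The two escaping techniques named in this commit can be sketched briefly. The snippet below is illustrative only: `shell_escape` mirrors the single-quote escaping described above, and `base64_upload_command` shows the general shape of a base64-encoded file upload; the exact signatures and remote command used by the crate are assumptions.

```rust
use base64::Engine;

/// Wrap a value in single quotes; embedded single quotes become '\'' so the
/// shell closes the string, emits a literal quote, and reopens it.
fn shell_escape(value: &str) -> String {
    format!("'{}'", value.replace('\'', r"'\''"))
}

/// Upload file contents without a heredoc: encode locally, decode remotely.
/// The remote command shape is an assumption for illustration.
fn base64_upload_command(contents: &str, remote_path: &str) -> String {
    let encoded = base64::engine::general_purpose::STANDARD.encode(contents);
    format!("echo {} | base64 -d > {}", encoded, shell_escape(remote_path))
}

fn main() {
    let cmd = base64_upload_command("[Interface]\nPrivateKey = x\n", "/etc/wireguard/wg0.conf");
    println!("{cmd}");
    assert!(shell_escape("it's").contains(r"'\''"));
}
```

Because the payload travels as base64, there is no delimiter for an attacker-controlled file body to collide with.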
* Use replace() with resourceVersion for optimistic concurrency in acquire_lease
* Increment leaseTransitions counter properly during lease acquisition
* Fix try_increment_pending_scale_up to use replace_status instead of patch
* Verify resourceVersion before attempting status update
* Replace DefaultHasher with SHA-256 for stable node ID generation
* SHA-256 provides consistent output across Rust toolchain versions

Fixes:
- Race condition when multiple instances acquire expired lease
- resourceVersion in patch body not enforcing optimistic locking
- Potential node ID changes on Rust version upgrades
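A minimal sketch of the SHA-256-based ID derivation described above; the exact inputs (pool name plus host) and the 16-hex-character truncation are assumptions, not necessarily what the crate hashes.

```rust
use sha2::{Digest, Sha256};

/// Deterministic node-ID derivation: unlike std's DefaultHasher, SHA-256 is
/// specified, so the same inputs yield the same ID on every Rust toolchain.
fn stable_node_id(pool_name: &str, host: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(pool_name.as_bytes());
    hasher.update(b"/");
    hasher.update(host.as_bytes());
    let digest = hasher.finalize();
    // Hex-encode the first 8 bytes (16 hex chars) as the ID.
    digest.iter().take(8).map(|b| format!("{b:02x}")).collect()
}

fn main() {
    let a = stable_node_id("gpu-pool", "203.0.113.10");
    let b = stable_node_id("gpu-pool", "203.0.113.10");
    assert_eq!(a, b);
    assert_eq!(a.len(), 16);
}
```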
* Remove pending counter reset on NoAction to preserve in-flight operation tracking
* Counters should only decrement when NodePools reach Ready/Failed state
* Fix deregister_node to return error on API failure instead of Ok(())
* Circuit breaker now correctly records failures for non-success HTTP responses
* Add success log message for successful node deregistration

* Add AUTOSCALER_NAMESPACE env var using fieldRef to metadata.namespace
* Runtime now watches correct namespace instead of defaulting to basilica-system
* Dynamically derives namespace from pod's actual deployment namespace
* Ensures controllers watch resources in the same namespace as the autoscaler
Actionable comments posted: 9
♻️ Duplicate comments (10)
crates/basilica-autoscaler/src/provisioner/ssh.rs (2)
140-147: SSH host key verification remains disabled. This was flagged in a previous review. `ServerCheckMethod::NoCheck` creates a MITM vulnerability for production infrastructure provisioning.

463-468: Same public key slicing issue for peer keys. Apply the same safe slice pattern here:

```diff
 debug!(
     host = %host,
-    peer_key = %&peer.public_key[..8],
+    peer_key = %&peer.public_key[..peer.public_key.len().min(8)],
     endpoint = ?peer.endpoint,
     "WireGuard peer status"
 );
```

crates/basilica-autoscaler/src/api/secure_cloud.rs (1)
91-105: `success_threshold` config field is still unused. This was flagged in a previous review. The `CircuitBreakerConfig::success_threshold` field (line 40) is defined but `record_success` immediately transitions to `Closed` on the first success in `HalfOpen` state.

crates/basilica-autoscaler/src/runtime.rs (1)

61-62: AUTOSCALER_NAMESPACE default may not match deployment namespace. This was flagged in a previous review. The deployment manifest needs to set `AUTOSCALER_NAMESPACE` to match the actual deployment namespace.

crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1)
733-742: `handle_terminating` uses `spec.node_id` but should use `status.node_id` for dynamic mode. Similar to the WireGuard handler, line 734 reads from `pool.spec.node_id`, which may not be set in dynamic provisioning mode where the ID is generated.

```diff
-        if let Some(node_id) = &pool.spec.node_id {
+        let node_id = status.node_id.as_ref().or(pool.spec.node_id.as_ref());
+        if let Some(node_id) = node_id {
             if let Err(e) = self.api.deregister_node(node_id).await {
```

crates/basilica-autoscaler/src/leader_election.rs (1)
249-293: Good: `acquire_lease` now uses `resourceVersion` + `replace` (fixes prior lease acquisition race).
Only tweak I'd suggest: don't `unwrap_or_default()` the `resourceVersion`; treat missing as an error, and ensure `leaseDurationSeconds` fits the expected int range.

```diff
-        let resource_version = current
-            .metadata
-            .resource_version
-            .clone()
-            .unwrap_or_default();
+        let resource_version = current
+            .metadata
+            .resource_version
+            .clone()
+            .ok_or_else(|| crate::error::AutoscalerError::Unexpected("Lease missing resourceVersion".into()))?;
@@
-            "leaseDurationSeconds": config.lease_duration.as_secs(),
+            "leaseDurationSeconds": i64::try_from(config.lease_duration.as_secs())
+                .map_err(|_| crate::error::AutoscalerError::InvalidConfiguration("lease_duration too large".into()))?,
```
146-159: Good: NoAction no longer resets pending counters.
This matches the intended "decrement on NodePool Ready/Failed" behavior.
121-133: Finalizer add overwrites existing finalizers.
Patch::Mergewith"finalizers": [FINALIZER]replaces the array; you’ll drop other controllers’ finalizers. (Same concern as prior review.)async fn add_node_pool_finalizer(&self, ns: &str, name: &str) -> Result<()> { use crate::crd::FINALIZER; - use kube::api::{Api, Patch, PatchParams}; + use kube::api::{Api, Patch, PatchParams}; let api: Api<NodePool> = Api::namespaced(self.client.clone(), ns); - let patch = serde_json::json!({ - "metadata": { - "finalizers": [FINALIZER] - } - }); - api.patch(name, &PatchParams::default(), &Patch::Merge(&patch)) - .await?; + let node_pool = api.get(name).await?; + let mut finalizers = node_pool.metadata.finalizers.unwrap_or_default(); + if !finalizers.iter().any(|f| f == FINALIZER) { + finalizers.push(FINALIZER.to_string()); + let patch = serde_json::json!({ "metadata": { "finalizers": finalizers } }); + api.patch(name, &PatchParams::default(), &Patch::Merge(&patch)).await?; + } Ok(()) }
135-146: Finalizer removal clears all finalizers.
Setting"finalizers": nullwill remove other controllers’ finalizers too. (Same concern as prior review.)async fn remove_node_pool_finalizer(&self, ns: &str, name: &str) -> Result<()> { + use crate::crd::FINALIZER; use kube::api::{Api, Patch, PatchParams}; let api: Api<NodePool> = Api::namespaced(self.client.clone(), ns); - let patch = serde_json::json!({ - "metadata": { - "finalizers": null - } - }); + let node_pool = api.get(name).await?; + let finalizers: Vec<String> = node_pool + .metadata + .finalizers + .unwrap_or_default() + .into_iter() + .filter(|f| f != FINALIZER) + .collect(); + let patch = serde_json::json!({ + "metadata": { + "finalizers": if finalizers.is_empty() { serde_json::Value::Null } else { serde_json::json!(finalizers) } + } + }); api.patch(name, &PatchParams::default(), &Patch::Merge(&patch)) .await?; Ok(()) }
176-215:try_increment_pending_scale_upcan “succeed” without updating status whenpolicy.statusisNone.
You currently no-op if status is absent, then stillreplace_status(...)and returnOk(true).- // Update the status with incremented pending_scale_up - if let Some(ref mut status) = policy.status { - status.pending_scale_up = current_value + increment; - } + // Ensure status exists, then update atomically + let status = policy.status.get_or_insert_with(Default::default); + status.pending_scale_up = current_value.saturating_add(increment); @@ - // Serialize policy to bytes for replace_status - let data = serde_json::to_vec(&policy)?; - - // Use replace_status with resourceVersion in metadata for optimistic locking - match api.replace_status(name, &PostParams::default(), data).await { + // Use replace_status (RV in metadata) for optimistic locking + match api.replace_status(name, &PostParams::default(), serde_json::to_vec(&policy)?).await { Ok(_) => Ok(true),
🧹 Nitpick comments (10)
crates/basilica-autoscaler/Cargo.toml (1)
17-62: Some of these dependencies could be centralized, but others are appropriately crate-specific. The workspace has an established `[workspace.dependencies]` section that covers common dependencies like `tokio`, `serde`, `metrics`, etc. However, the Kubernetes-specific packages (`kube`, `kube-runtime`, `k8s-openapi`, `schemars`, `prometheus`) are intentionally pinned at the crate level because:
- They're required only by this crate.
- The versions may be tied to Kubernetes API compatibility (v1_29) rather than workspace standards.

That said, consider consolidating the generic, non-Kubernetes dependencies into `[workspace.dependencies]`:
- `serde_yaml` (0.9)
- `mockall` (0.12) and `tokio-test` (0.4) could move to a `[workspace.lints]` or shared `[dev-dependencies]` if the workspace supports it.

Note: `tokio-util` is already in the workspace at version 0.7.

crates/basilica-autoscaler/src/provisioner/k3s.rs (4)
18-80: Install script generation is much safer now; consider guarding against empty sanitized values + reducecurl | shrisk.
- If
safe_node_name/safe_flannel/safe_node_idsanitize down to empty, the script will fail in a less actionable way; better to fail early with a clear message.curl -sfL https://get.k3s.io | ... sh -s -is convenient but is still supply-chain sensitive; consider pinningINSTALL_K3S_VERSION(or similar) and/or verifying what you can.Example early-guard inside the script:
format!( r#"#!/bin/bash set -euo pipefail + +if [ -z "{safe_node_name}" ] || [ -z "{safe_flannel}" ]; then + echo "Invalid sanitized inputs (empty after sanitization)" >&2 + exit 2 +fi ... "#,
104-130: `collect_node_labels`: consider validating label key/value format/length (esp. GPU fields).
These labels are likely fed to Kubernetes APIs later; a small validator (or centralized label builder) would prevent propagating invalid labels and failing later in harder-to-debug places.
132-145: Status/version helpers are OK; version parsing is a bit positional.
`parse_k3s_version` assumes the version token is the 3rd whitespace token; consider scanning for the first token that looks like `v<digits>.` to be more resilient to banners/warnings.
147-221: Tests are a solid baseline; add one for quoted tokens + “invalid label char” behavior once sanitizer is tightened.
A test case with a token containing a single quote (or newline) would lock in the `shell_escape` behavior in the install script output, and a test for rejecting/normalizing invalid label characters will prevent regressions.

crates/basilica-autoscaler/src/error.rs (1)
137-160: Consider expanding the `is_retryable` classification. Some timeout-related errors like `FlannelTimeout`, `DrainTimeout`, and `PhaseTimeout` could potentially be retryable after intervention or longer waits. Additionally, `Io` errors are often transient (network filesystems, temp file issues). The current classification might cause permanent failures for recoverable situations. Consider whether these should be retryable:

```diff
     pub fn is_retryable(&self) -> bool {
         matches!(
             self,
             Self::SshConnection { .. }
                 | Self::SshExecution { .. }
                 | Self::NetworkValidation(_)
                 | Self::KubeApi(_)
                 | Self::Http(_)
                 | Self::SecureCloudApi(_)
                 | Self::LeaderElection(_)
+                | Self::Io(_)
+                | Self::FlannelTimeout { .. }
+                | Self::CircuitBreakerOpen(_)
         )
     }
```
296-310: `expect` on HTTP client creation could panic in edge cases. While unlikely, `reqwest::Client::builder().build()` can fail (e.g., TLS backend issues). Consider returning a `Result` instead of panicking.

```diff
     pub fn new(base_url: String, api_key: String, timeout: Duration) -> Self {
-        let client = reqwest::Client::builder()
+        Self::try_new(base_url, api_key, timeout).expect("Failed to create HTTP client")
+    }
+
+    pub fn try_new(base_url: String, api_key: String, timeout: Duration) -> Result<Self> {
+        let client = reqwest::Client::builder()
             .timeout(timeout)
             .build()
-            .expect("Failed to create HTTP client");
+            .map_err(|e| AutoscalerError::Internal(format!("Failed to create HTTP client: {}", e)))?;
```

crates/basilica-autoscaler/src/runtime.rs (1)
119-142: Consider extracting context structs to module level. Defining `ControllerContext` and `ScalingContext` structs inside the `run` method limits testability and reusability. Consider moving these to the module level.

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (2)

167-213: Avoid `unwrap_or_default()` on K8s list errors in `collect_metrics` (silent failure).
This can hide RBAC/network issues and report misleading metrics (e.g., 0 GPU nodes). Prefer logging + propagating (or at least emitting a degraded condition).
292-319: Scale-up NodePool naming can collide (second-level timestamp + `i`).
Two scale-ups within the same second (or retries) can generate identical names and fail with AlreadyExists. Consider adding a random suffix (UUID) or a monotonic counter.

```diff
-        let pool_name = format!(
-            "{}-{}-{}",
-            policy_name,
-            Utc::now().format("%Y%m%d%H%M%S"),
-            i
-        );
+        let pool_name = format!(
+            "{}-{}-{}",
+            policy_name,
+            Utc::now().format("%Y%m%d%H%M%S"),
+            uuid::Uuid::new_v4().simple()
+        );
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (11)
- crates/basilica-autoscaler/Cargo.toml (1 hunks)
- crates/basilica-autoscaler/src/api/secure_cloud.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/k8s_client.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/error.rs (1 hunks)
- crates/basilica-autoscaler/src/leader_election.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/k3s.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/ssh.rs (1 hunks)
- crates/basilica-autoscaler/src/runtime.rs (1 hunks)
- orchestrator/k8s/services/autoscaler/deployment.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- orchestrator/k8s/services/autoscaler/deployment.yaml
🧰 Additional context used
🧬 Code graph analysis (3)
crates/basilica-autoscaler/src/runtime.rs (7)
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (2)
new(56-70)clone(39-47)crates/basilica-autoscaler/src/leader_election.rs (2)
new(24-40)is_leader(43-45)crates/basilica-autoscaler/src/health.rs (3)
new(27-31)start_health_server(89-102)default(21-23)crates/basilica-autoscaler/src/metrics.rs (3)
new(46-246)start_metrics_server(360-373)default(340-342)crates/basilica-autoscaler/src/config.rs (8)
from_env(17-27)from_env(39-45)from_env(56-62)from_env(77-87)from_env(98-104)from_env(115-121)from_env(135-147)from_env(158-167)crates/basilica-autoscaler/src/crd/node_pool.rs (3)
fmt(307-323)default(189-196)default(242-251)crates/basilica-autoscaler/src/crd/scaling_policy.rs (3)
default(79-85)default(124-132)default(210-215)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (2)
crates/basilica-autoscaler/src/crd/scaling_policy.rs (3)
default(79-85)default(124-132)default(210-215)crates/basilica-autoscaler/src/crd/node_pool.rs (2)
default(189-196)default(242-251)
crates/basilica-autoscaler/src/provisioner/ssh.rs (2)
crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
parse_k3s_version(138-145)verify_k3s_status(133-135)crates/basilica-autoscaler/src/provisioner/wireguard.rs (3)
generate_wg_config(85-117)parse_wg_show_output(120-156)new(27-33)
🔇 Additional comments (16)
crates/basilica-autoscaler/Cargo.toml (1)
9-16: Binaries look correctly declared; ensure the paths exist and are intended entrypoints. Please double-check that `src/main.rs` and `src/bin/crdgen.rs` exist in this PR and that `crdgen` is meant to be shipped as a separate binary (vs. a feature-gated dev tool).

crates/basilica-autoscaler/src/provisioner/k3s.rs (2)

12-16: `shell_escape` implementation looks correct for single-quoted contexts.
The `'"'"'` pattern is the right approach here.

82-103: Uninstall script is pragmatic and defensive (service stops tolerate missing units).
Nothing concerning here given the stated "agent" scope.

crates/basilica-autoscaler/src/error.rs (1)

1-133: LGTM! Well-structured error handling. The error enum is comprehensive with descriptive messages and appropriate structured data. The `From` trait implementations enable clean `?` propagation.

crates/basilica-autoscaler/src/provisioner/ssh.rs (3)
481-492: Handshake timestamp comparison assumes synchronized clocks. The handshake validation compares the remote node's timestamp against the local system time. Clock skew between the provisioning host and the remote node could cause false negatives (valid handshakes appearing stale) or false positives.
Consider adding a tolerance margin or documenting the clock sync requirement.
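A hedged sketch of one way to add such a tolerance, assuming the handshake is reported as a UNIX timestamp in seconds (as `wg show ... latest-handshakes` prints); the constants and function name are illustrative.

```rust
// Illustrative freshness check; the thresholds and the assumption that the
// handshake arrives as a UNIX timestamp in seconds are not from the crate.
use std::time::{Duration, SystemTime, UNIX_EPOCH};

const MAX_HANDSHAKE_AGE: Duration = Duration::from_secs(180);
const CLOCK_SKEW_TOLERANCE: Duration = Duration::from_secs(30);

fn handshake_is_fresh(handshake_unix_secs: u64) -> bool {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .as_secs();
    // A handshake slightly "in the future" is tolerated rather than treated
    // as stale, so modest clock skew does not fail otherwise healthy nodes.
    let age = now.saturating_sub(handshake_unix_secs);
    age <= (MAX_HANDSHAKE_AGE + CLOCK_SKEW_TOLERANCE).as_secs()
}
```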
203-222: LGTM! Base64 encoding fixes the heredoc injection vulnerability. Good fix using base64 encoding to avoid delimiter collision and command injection.

554-638: LGTM! Safe node_id slicing implemented. The fix at line 566 using `&node_id[..node_id.len().min(8)]` correctly prevents panics on short IDs.

crates/basilica-autoscaler/src/api/secure_cloud.rs (2)

590-623: LGTM! `deregister_node` now correctly returns error on API failure. The fix properly returns an `Err` on non-success HTTP status, ensuring the circuit breaker correctly records failures.

379-414: LGTM! Consistent error handling pattern across API methods. The pattern of checking circuit breaker, making request, handling errors, and updating circuit breaker state is well-implemented and consistent.
crates/basilica-autoscaler/src/runtime.rs (3)
284-338: LGTM! Fatal errors now properly propagate to process exit. The implementation correctly tracks `fatal_error` and returns `Err(AutoscalerError::Runtime(...))` on fatal conditions, allowing the orchestrator to detect and restart the process.

268-280: Leader watch task logs error but doesn't set `fatal_error` on channel close. When the leader election channel closes (line 279), it logs an error but the task completes normally. This triggers the `leader_watch_task` branch in `select!`, which does set `fatal_error`. The flow is correct, but the error message could be clearer that this is the termination path.

354-359: Proposed fix is invalid: `AutoscalerConfig` does not implement the `Default` trait. The suggestion to use `AutoscalerConfig::default()` will not compile because `AutoscalerConfig` only derives `Clone` and `Debug`. More importantly, the concern about environment variables is overstated: `from_env()` provides sensible fallbacks for all configuration values (e.g., "https://api.basilica.ai" for `BASILICA_API_URL`, port 8080 for health checks), so the test will not fail in CI environments without specific env vars set. The test as written is safe to use.
If test isolation from production defaults is desired, consider adding a test-specific constructor rather than `Default`.
Likely an incorrect or invalid review comment.
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (4)
1030-1047: LGTM! SHA-256 provides stable deterministic node IDs. The fix correctly uses `sha2::Sha256` instead of `DefaultHasher`, ensuring stable node IDs across Rust versions. The format remains consistent with the previous implementation.

158-189: Deletion handler continues cleanup despite individual failures. The approach of logging warnings and continuing on cleanup failures (lines 166-182) is appropriate for deletion handlers: best-effort cleanup avoids leaving the system in a stuck state.

92-134: Phase timeout logic correctly prevents concurrent cleanup. The `cleanup_in_progress` flag prevents race conditions when multiple reconciliation cycles detect the same timeout. Good defensive programming.

1071-1103: LGTM! Comprehensive tests for deterministic node ID generation. Tests properly verify stability, uniqueness by IP, and uniqueness by pool name.
- CircuitBreaker now tracks consecutive successful calls in HalfOpen state and only transitions to Closed after success_threshold calls - add_node_pool_finalizer uses fetch-modify-update pattern to append the finalizer without overwriting existing ones - remove_node_pool_finalizer filters out only the autoscaler's finalizer, preserving other controllers' finalizers
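A minimal sketch of the HalfOpen accounting described in the first bullet; the type and field names mirror the review discussion but are illustrative rather than the crate's exact definitions.

```rust
// Sketch of the HalfOpen accounting; names are illustrative.
enum CircuitState {
    Closed,
    Open,
    HalfOpen,
}

struct CircuitBreaker {
    state: CircuitState,
    consecutive_successes: u32,
    success_threshold: u32,
}

impl CircuitBreaker {
    fn record_success(&mut self) {
        match self.state {
            CircuitState::HalfOpen => {
                self.consecutive_successes += 1;
                // Close only after enough consecutive probes succeed instead
                // of on the first success.
                if self.consecutive_successes >= self.success_threshold {
                    self.state = CircuitState::Closed;
                    self.consecutive_successes = 0;
                }
            }
            _ => self.consecutive_successes = 0,
        }
    }
}
```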
…tency - Safe slice WireGuard public key to prevent panic on malformed data - Split sanitize_shell_arg into context-aware functions for K8s labels: sanitize_shell_word (hostnames), sanitize_label_key (with /), sanitize_label_value (max 63 chars, no : = /) - Handle shutdown_rx.changed() error when sender dropped unexpectedly - Add separate leader_tx for proper leadership signaling when disabled - Use saturating_add for pending_scale_down to accumulate in-flight ops - Resolve node_id from status before spec in deletion, terminating, timeout cleanup, and wireguard handlers for dynamic provisioning - Add v1_29 feature to basilica-api k8s-openapi for workspace consistency
Actionable comments posted: 11
♻️ Duplicate comments (3)
crates/basilica-autoscaler/src/provisioner/k3s.rs (1)
4-30: Silent character removal can cause identity collisions and mask configuration errors.All three sanitization functions silently filter or truncate invalid characters without returning errors or warnings. This creates several risks:
- Identity collisions:
node-123!andnode-123?both becomenode-123, potentially causing node identity conflicts.- Silent configuration errors: Invalid inputs are "fixed" rather than rejected, masking misconfigurations.
- Missing validations:
sanitize_shell_word: No length limits (hostnames limited to 253 chars)sanitize_label_key: No length limits (K8s limit: 253 chars for prefix, 63 for name)- No check for empty results after sanitization
- No validation that labels start/end with alphanumeric (K8s requirement)
Consider returning
Result<String, Error>from these functions to reject invalid inputs explicitly, or at minimum, validate and error on empty results after sanitization.-fn sanitize_shell_word(s: &str) -> String { +fn sanitize_shell_word(s: &str) -> Result<String, String> { + if s.is_empty() { + return Err("Input cannot be empty".to_string()); + } + if s.len() > 253 { + return Err("Input exceeds hostname length limit of 253 chars".to_string()); + } + let sanitized = s.chars() .filter(|c| c.is_alphanumeric() || matches!(c, '-' | '_' | '.')) - .collect() + .collect::<String>(); + + if sanitized.is_empty() { + return Err(format!("Input '{}' contains no valid characters", s)); + } + + Ok(sanitized) }Apply similar patterns to
sanitize_label_keyandsanitize_label_value, and propagate errors in calling code.crates/basilica-autoscaler/src/provisioner/ssh.rs (1)
140-147: SSH host key verification is disabled, creating MITM vulnerability.
`ServerCheckMethod::NoCheck` disables SSH host key verification, making connections vulnerable to man-in-the-middle attacks. For production infrastructure provisioning, this is a security concern. Consider implementing host key verification, or at minimum, document the security tradeoff. Options include:
- Store and verify host keys after first connection (TOFU)
- Accept host keys from a trusted source (e.g., cloud provider API)
- Use `ServerCheckMethod::HashKnownHosts` with pre-populated keys

Based on learnings, this issue was previously flagged but remains unaddressed.

crates/basilica-autoscaler/src/leader_election.rs (1)
crates/basilica-autoscaler/src/leader_election.rs (1)
110-180: Good: clean shutdown semantics + leadership notification separation.
Handling the `shutdown_rx.changed()` error and using `leader_tx` (esp. in "disabled" mode) avoids the previous "never flips to false"/closed-channel pitfalls.
🧹 Nitpick comments (1)
crates/basilica-autoscaler/src/leader_election.rs (1)
182-234: Minor: avoid per-iteration allocation when comparing holder identity.
`holder == Some(&identity.to_string())` allocates; compare as `Option<&str>`.

```diff
-            if holder == Some(&identity.to_string()) {
+            if holder.map(String::as_str) == Some(identity) {
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- crates/basilica-api/Cargo.toml (1 hunks)
- crates/basilica-autoscaler/src/api/secure_cloud.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/k8s_client.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/leader_election.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/k3s.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/ssh.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (3)
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (2)
new(56-70)add_condition(1009-1031)crates/basilica-autoscaler/src/crd/scaling_policy.rs (3)
default(79-85)default(124-132)default(210-215)crates/basilica-autoscaler/src/crd/node_pool.rs (2)
default(189-196)default(242-251)
crates/basilica-autoscaler/src/provisioner/ssh.rs (2)
crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
parse_k3s_version(158-165)verify_k3s_status(153-155)crates/basilica-autoscaler/src/provisioner/wireguard.rs (3)
generate_wg_config(85-117)parse_wg_show_output(120-156)new(27-33)
🔇 Additional comments (3)
crates/basilica-api/Cargo.toml (1)
52-52: This k8s-openapi configuration change is safe and compatible. Verification confirms that `k8s-openapi 0.21` with `features = ["v1_29"]` is fully compatible with `kube 0.88` (kube 0.88.0 officially bumped k8s-openapi to 0.21 for Kubernetes v1.29 support). All workspace code uses only stable v1 APIs (`core::v1`, `batch::v1`, `networking::v1`, `policy::v1`, `apps::v1`, `rbac::v1`, etc.); no v1beta1 or v1beta2 dependencies exist. Disabling default-features is safe and reduces binary bloat by including only necessary Kubernetes API versions.

crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
153-165: LGTM - Verification functions are clear and correct. Both `verify_k3s_status` and `parse_k3s_version` implement straightforward parsing logic appropriate for their use cases. The trimming in `verify_k3s_status` handles potential trailing newlines, and `parse_k3s_version` safely extracts the version token.

167-270: Tests provide good coverage of happy paths. The test suite validates the main functionality of script generation, sanitization, and parsing. The tests for dangerous character removal (lines 192-199) and shell escaping (lines 231-234) are particularly valuable for security verification.
Consider adding tests for edge cases once the sanitization functions return `Result` types:
- Empty strings after sanitization
- Inputs exceeding length limits
- Empty keys/values in `extra_labels`

These can be addressed when implementing the suggested refactors.
Add module-level #![allow(unused_assignments)] to ed25519.rs and p256.rs. The ZeroizeOnDrop derive macro generates assignments in the Drop impl that trigger warnings. Field-level attributes don't suppress derive macro generated code, so module-level is required.
Actionable comments posted: 0
🧹 Nitpick comments (2)
crates/basilica-common/src/crypto/p256.rs (1)
6-6: Prefer narrowing the `#![allow(unused_assignments)]` scope (module-wide suppression can hide real bugs). If the warning is only emitted by the `ZeroizeOnDrop`-generated `Drop` impl, consider moving the allow to the smallest scope possible (e.g., item-level on the affected type, or `#[expect(unused_assignments)]` if your MSRV supports it) instead of the whole module. Also, please double-check that the lint is indeed coming from this derive and not from a genuine unused assignment elsewhere.

crates/basilica-common/src/crypto/ed25519.rs (1)
6-6: Prefer narrowing#![allow(unused_assignments)]to the specificZeroizeOnDropderives (avoid file-wide masking).-#![allow(unused_assignments)] // ZeroizeOnDrop derive generates assignments in Drop impl - use crate::error::CryptoError; use base64::{engine::general_purpose::STANDARD as BASE64, Engine}; use ed25519_dalek::{ Signature, Signer, SigningKey, Verifier, VerifyingKey, PUBLIC_KEY_LENGTH, SECRET_KEY_LENGTH, }; use rand::rngs::OsRng; use std::fmt; use zeroize::ZeroizeOnDrop; /// Ed25519 private key wrapper with secure memory handling +#[allow(unused_assignments)] // ZeroizeOnDrop derive generates assignments in Drop impl #[derive(ZeroizeOnDrop)] pub struct Ed25519PrivateKey { #[zeroize(skip)] signing_key: SigningKey, } @@ /// Ed25519 key pair containing both private and public keys +#[allow(unused_assignments)] // ZeroizeOnDrop derive generates assignments in Drop impl #[derive(ZeroizeOnDrop)] pub struct Ed25519KeyPair { private_key: Ed25519PrivateKey, #[zeroize(skip)] public_key: Ed25519PublicKey, }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- crates/basilica-common/src/crypto/ed25519.rs (1 hunks)
- crates/basilica-common/src/crypto/p256.rs (1 hunks)
…tion * Add idempotency guard to prevent multiple start() calls from spawning competing loops * Fix TOCTOU vulnerability in renew_lease() by using replace with resourceVersion * Fix TOCTOU vulnerability in release_lease() by using replace with resourceVersion * Replace unwrap_or_default() with explicit error for missing resourceVersion * Handle 409 conflict gracefully during lease release * Remove unused Patch and PatchParams imports
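For the replace-with-resourceVersion approach, a simplified sketch using kube's coordination Lease API; it assumes chrono for MicroTime, omits acquisition and state transitions, and is not the crate's actual implementation.

```rust
// Simplified sketch of optimistic lease renewal with kube's coordination API.
use k8s_openapi::api::coordination::v1::Lease;
use k8s_openapi::apimachinery::pkg::apis::meta::v1::MicroTime;
use kube::api::{Api, PostParams};

async fn renew_lease(api: &Api<Lease>, name: &str, identity: &str) -> Result<(), kube::Error> {
    // get() returns the object with metadata.resourceVersion populated;
    // replace() sends it back, so a concurrent writer causes a 409 Conflict
    // instead of a silent overwrite.
    let mut lease = api.get(name).await?;
    let spec = lease.spec.get_or_insert_with(Default::default);
    if spec.holder_identity.as_deref() != Some(identity) {
        return Ok(()); // lost the lease; do not stomp another holder
    }
    spec.renew_time = Some(MicroTime(chrono::Utc::now()));
    api.replace(name, &PostParams::default(), &lease).await?;
    Ok(())
}
```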
* Add validation to skip labels that become empty after sanitization * Apply sanitization to collect_node_labels with "unknown" fallback for empty values * Make /etc/hosts update idempotent using grep guard * Add structured logging for skipped invalid labels
* Use safe slice for peer.public_key to prevent panic on short/malformed keys * Make /etc/hosts update idempotent using grep guard
…king * Add required offering_id field to SecureCloudTemplate CRD * Add try_increment_pending_scale_down trait method to AutoscalerK8sClient * Implement optimistic locking for scale-down using resourceVersion * Handle 409 conflict gracefully to prevent race conditions
* Use resolved node_id from status for dynamic mode instead of spec.node_id * Reset failure_count to 0 on successful health check in Ready state * Apply optimistic locking for scale-down operations in ScalingPolicyController * Fix offering_id mapping to use correct field from SecureCloudTemplate
Actionable comments posted: 5
♻️ Duplicate comments (13)
crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
136-175: Avoid "unknown" label collisions for `node_id`/`pool_name` fallbacks.
Falling back to "unknown" is safe for API acceptance but can collapse distinct nodes/pools into the same label value, which is painful for scheduling/debugging and can break selectors/metrics assumptions. Consider a stable fallback like "unknown-<short-hash>" derived from the original string (or return an error upstream instead of silently normalizing).
This is in the same family as the earlier “lossy normalization can cause identity collisions” feedback.
13-30: Sanitization still doesn’t guarantee Kubernetes-valid label keys/values (can break installs/APIs)
`sanitize_label_key`/`sanitize_label_value` only filter characters (and truncate values), but Kubernetes label syntax also requires constraints like start/end alphanumeric, and label keys have prefix/name rules (single `/`, prefix is a DNS subdomain, name max 63). As-is, inputs like "-bad" or "...." or "a//b" can remain non-empty yet still be rejected by kubelet/K8s, causing K3s agent startup/registration failures.
This is essentially the earlier “tighten validation vs lossy filtering” concern, just in a different edge-case shape.Concrete direction: after sanitization, validate and skip (with warn) if invalid; ideally return
Resultto fail fast.+fn is_valid_label_value(s: &str) -> bool { + if s.is_empty() || s.len() > 63 { return false; } + let bytes = s.as_bytes(); + let is_alnum = |b: u8| b.is_ascii_alphanumeric(); + if !is_alnum(bytes[0]) || !is_alnum(bytes[bytes.len() - 1]) { return false; } + bytes.iter().all(|b| is_alnum(*b) || matches!(*b, b'-' | b'_' | b'.')) +} + +fn is_valid_label_key(s: &str) -> bool { + // Minimal safe check: 0 or 1 '/', name part like label-value rules but may include '/' + if s.is_empty() { return false; } + let mut parts = s.splitn(3, '/'); + let first = parts.next().unwrap(); + let second = parts.next(); + let third = parts.next(); + if third.is_some() { return false; } // more than one '/' + + let name = second.unwrap_or(first); + // Note: if there is a prefix, it must be DNS-subdomain; consider enforcing RFC1123 here. + is_valid_label_value(name) // (or a slightly different name validator allowing more chars) +} @@ for (key, value) in extra_labels { let safe_key = sanitize_label_key(key); let safe_value = sanitize_label_value(value); - // Skip labels that become empty after sanitization - if safe_key.is_empty() || safe_value.is_empty() { + if safe_key.is_empty() || safe_value.is_empty() || !is_valid_label_key(&safe_key) || !is_valid_label_value(&safe_value) { tracing::warn!( original_key = %key, original_value = %value, "Skipping invalid label: sanitized key or value is empty" ); continue; } labels.push(format!("--node-label={}={}", safe_key, safe_value)); }Also applies to: 59-72
crates/basilica-autoscaler/src/leader_election.rs (1)
54-102: Idempotency guard incomplete: fails to prevent duplicate `start()` calls when leader election is disabled. The guard at line 57 checks only `shutdown_tx.is_some()`, but the disabled path (lines 63-68) sets only `leader_tx` and leaves `shutdown_tx` as `None`. A second `start()` call when disabled would pass the guard, create a new `leader_tx`, and return a new receiver, while the first receiver remains active. This breaks the single-start invariant. Apply this diff to fix the guard:

```diff
-        if self.shutdown_tx.is_some() {
+        if self.shutdown_tx.is_some() || self.leader_tx.is_some() {
             return Err(crate::error::AutoscalerError::LeaderElection(
                 "LeaderElector already started".to_string(),
             ));
         }
```

crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1)
1037-1054: Addressed: Deterministic node ID now uses SHA-256 for stability across Rust versions. The implementation correctly uses `sha2::Sha256` instead of `DefaultHasher`, ensuring stable node IDs across toolchain upgrades.

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (2)

128-170: Addressed: Scale-down now has optimistic locking matching scale-up. The `try_increment_pending_scale_down` call (lines 139-148) with resourceVersion conflict detection prevents overlapping scale-downs.

171-184: Addressed: NoAction no longer resets pending counters. The comment at lines 173-176 correctly documents that pending counters track in-flight operations and should only be decremented when NodePools reach Ready or Failed state.
crates/basilica-autoscaler/src/controllers/k8s_client.rs (2)
132-153: Addressed: Finalizer addition now preserves existing finalizers. The fetch-modify-update pattern (lines 137-151) correctly reads existing finalizers, checks for duplicates, and appends only if needed.

155-186: Addressed: Finalizer removal now filters only the specific finalizer. The implementation correctly filters out only `FINALIZER` (line 166) while preserving other finalizers. Setting to `null` when empty (lines 169-174) is appropriate.

crates/basilica-autoscaler/src/provisioner/ssh.rs (5)
203-222: Addressed: File upload now uses base64 encoding to prevent command injection. The base64 encoding approach (lines 213-218) eliminates heredoc delimiter collision vulnerabilities.

401-403: Addressed: Public key logging now uses safe slice. The `get(..8).unwrap_or(&public_key)` pattern prevents panic on short strings.

464-471: Addressed: Peer public key logging now uses safe slice. The `get(..8).unwrap_or(&peer.public_key)` pattern is correctly applied.

569-569: Addressed: Node ID slicing now uses safe bounds. The `&node_id[..node_id.len().min(8)]` pattern prevents panic on short node IDs.

140-147: SSH host key verification is disabled, creating MITM vulnerability.
`ServerCheckMethod::NoCheck` disables SSH host key verification. For production infrastructure provisioning, this allows man-in-the-middle attacks. Consider implementing host key verification via TOFU (trust-on-first-use), pre-populated known hosts, or cloud provider API-provided host keys.
At minimum, add a configuration flag to control this behavior and document the security trade-off.
🧹 Nitpick comments (6)
crates/basilica-autoscaler/src/leader_election.rs (1)
377-381: Log a warning when `resourceVersion` is missing instead of silently returning. Line 380 returns `Ok(())` when `resourceVersion` is `None`, but this prevents safe release and indicates an unexpected API state. Callers won't know the lease wasn't released. Apply this diff to add visibility:

```diff
     let resource_version = match current.metadata.resource_version.clone() {
         Some(rv) => rv,
-        None => return Ok(()), // Cannot release without resourceVersion
+        None => {
+            warn!(identity = %identity, "Cannot release lease: missing resourceVersion");
+            return Ok(());
+        }
     };
```

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)
240-272: Consider adding scale-down trigger for idle nodes without GPU utilization metrics. The current implementation only triggers utilization-based scale-down when `average_gpu_utilization` is available (lines 258-267). Since this metric requires Prometheus/DCGM integration (per line 235 comment), scale-down may never trigger until that integration is complete. Consider whether the `idle_nodes` metric (collected at line 236) should also inform scale-down decisions.

crates/basilica-autoscaler/src/provisioner/ssh.rs (4)

11-17: Consider deriving `Debug` for `SshConnectionConfig` to aid troubleshooting. Adding `#[derive(Debug)]` would help with logging during debugging. Note: if added, ensure `private_key` and `passphrase` fields are redacted to avoid leaking secrets in logs.
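If Debug is added, a redacting manual impl along these lines avoids the secret-leak concern; the field set here is assumed from the review text rather than copied from the crate.

```rust
// Illustrative redacting Debug impl; field names are assumptions.
use std::fmt;

pub struct SshConnectionConfig {
    pub username: String,
    pub private_key: String,
    pub passphrase: Option<String>,
    pub connect_timeout_secs: u64,
}

impl fmt::Debug for SshConnectionConfig {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("SshConnectionConfig")
            .field("username", &self.username)
            // Never print key material, even at debug/trace level.
            .field("private_key", &"<redacted>")
            .field("passphrase", &self.passphrase.as_ref().map(|_| "<redacted>"))
            .field("connect_timeout_secs", &self.connect_timeout_secs)
            .finish()
    }
}
```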
484-495: Handshake timestamp comparison may have clock skew issues. The code compares the WireGuard handshake timestamp against local `SystemTime::now()`. If there's clock skew between the local system and the remote node, this comparison may incorrectly flag connectivity issues. Consider documenting this assumption or using the remote node's time.

590-599: K3s token is included in the command string, potentially visible in process listings. The K3s token is embedded directly in the install command (line 591), which may be visible in `ps` output on the remote machine during execution. Consider using environment variables or a temporary file with restricted permissions:

```diff
-        let install_cmd = format!(
-            r#"curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="agent" K3S_URL="{}" K3S_TOKEN="{}" sh -s - \
+        // Write token to temporary file to avoid exposure in process listings
+        let token_setup = format!(
+            "umask 077 && echo '{}' > /tmp/.k3s_token && trap 'rm -f /tmp/.k3s_token' EXIT",
+            token
+        );
+        let install_cmd = format!(
+            r#"{} && curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="agent" K3S_URL="{}" K3S_TOKEN="$(cat /tmp/.k3s_token)" sh -s - \
```

This is a defense-in-depth concern since SSH sessions are typically trusted, but it reduces the exposure window.

676-689: Unit test coverage is minimal. The test only verifies struct construction. Consider adding tests for:
- SSH command building with special characters
- Base64 encoding of file content
- Error handling paths (timeouts, failures)
Integration tests with a mock SSH server would provide better coverage.
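As one example of the second item in the list above, a hypothetical unit test that pins down the property the base64 upload relies on; it exercises only the base64 crate, not the provisioner's internal API.

```rust
// Hypothetical test; the module and test names are illustrative.
#[cfg(test)]
mod upload_encoding_tests {
    use base64::{engine::general_purpose::STANDARD as BASE64, Engine};

    #[test]
    fn payload_survives_shell_metacharacters() {
        let content: &[u8] = b"line with 'quotes' && $(subshell) <<EOF";
        let encoded = BASE64.encode(content);
        // Decoding must return the original bytes regardless of quotes,
        // heredoc markers, or other shell metacharacters in the input.
        assert_eq!(BASE64.decode(&encoded).unwrap().as_slice(), content);
    }
}
```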
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- crates/basilica-autoscaler/src/controllers/k8s_client.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/crd/scaling_policy.rs (1 hunks)
- crates/basilica-autoscaler/src/leader_election.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/k3s.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/ssh.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/basilica-autoscaler/src/crd/scaling_policy.rs
🧰 Additional context used
🧬 Code graph analysis (2)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (2)
crates/basilica-autoscaler/src/crd/scaling_policy.rs (3)
default(79-85)default(124-132)default(213-218)crates/basilica-autoscaler/src/crd/node_pool.rs (2)
default(189-196)default(242-251)
crates/basilica-autoscaler/src/provisioner/ssh.rs (3)
crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
parse_k3s_version(187-194)verify_k3s_status(182-184)crates/basilica-autoscaler/src/provisioner/wireguard.rs (3)
generate_wg_config(85-117)parse_wg_show_output(120-156)new(27-33)crates/basilica-api/src/wireguard/config.rs (1)
allowed_ips(61-73)
🔇 Additional comments (24)
crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
181-194: Status/version helpers look fine for the intended outputs
`verify_k3s_status()` trimming and `parse_k3s_version()` token extraction match `systemctl is-active` and the common `k3s version ...` format.
196-299: Tests are solid; consider adding failing cases once label validation is tightened
After adding stricter label validation, add cases like "-bad" / "...." / "a//b" to ensure they're skipped (or rejected) deterministically.

crates/basilica-autoscaler/src/leader_election.rs (6)
1-22: LGTM! Struct design properly separates shutdown and leadership signaling. The separation of `shutdown_tx` and `leader_tx` correctly addresses the previous issue where leadership state changes couldn't be signaled independently from shutdown events.

24-52: LGTM! Identity generation and helper methods are well-implemented. The fallback chain (`POD_NAME` → `HOSTNAME` → UUID) ensures a unique identity in various deployment environments.

104-187: LGTM! Shutdown handling and election loop properly implemented. The loop correctly handles both explicit shutdown signals and sender-dropped scenarios (line 132), and the consecutive failure tracking with configurable limits provides resilience against transient API issues.

189-267: LGTM! Lease acquisition logic correctly handles all cases. The function properly distinguishes between holding, expired, and non-existent leases, and delegates to the appropriate helper functions that enforce optimistic concurrency control.
269-312: LGTM! Lease operations properly enforce optimistic concurrency control. Both `acquire_lease()` and `renew_lease()` correctly:
- Fetch the current lease to obtain `resourceVersion`
- Use `ok_or_else()` to fail fast on a missing `resourceVersion`
- Verify holder identity before acting (in `renew_lease`)
- Use `replace()` instead of `patch()` to enforce atomic updates

This addresses the previous TOCTOU vulnerabilities.
Also applies to: 314-360
412-424: LGTM! Basic atomic test validates shared flag semantics. While minimal, this test confirms the `Arc<AtomicBool>` pattern works correctly. Integration tests with a mock Kubernetes API would improve coverage but are not required for this review.

crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (9)
1-70: LGTM! Clean controller structure with proper dependency injection. The generic controller design with trait bounds (`AutoscalerK8sClient`, `SecureCloudApi`, `NodeProvisioner`) enables testability through mock implementations. The `Clone` implementation correctly uses `Arc::clone` for shared ownership.

92-133: Phase timeout handling is well-structured with a cleanup flag to prevent concurrent cleanup. The `cleanup_in_progress` flag (lines 98-101, 104-108, 119) prevents race conditions when multiple reconciliations detect the same timeout simultaneously.

177-183: Addressed: Deletion handler now correctly resolves node_id from both status and spec. The `resolved_node_id` pattern (line 178) properly handles both dynamic mode (status.node_id) and manual mode (spec.node_id).

396-397: Addressed: WireGuard installation now uses the resolve_node_id helper. This correctly supports both static (spec) and dynamic (status) node ID sources.

547-551: Addressed: Node labeling now uses resolved node_id from status. The pattern at lines 548-550 correctly prefers status.node_id for dynamic mode while falling back to spec.node_id for manual mode.

604-614: Addressed: failure_count is now reset on successful health check. Line 613 correctly resets `status.failure_count = 0` when the node is confirmed ready.

784-790: Addressed: Timeout cleanup now uses consistent node_id resolution. Both cleanup branches correctly use the `status.node_id.as_ref().or(pool.spec.node_id.as_ref())` pattern.
Also applies to: 805-811

1056-1076: Well-designed node_id resolution with clear precedence rules. The helper properly prioritizes: (1) user-specified spec.node_id, (2) previously persisted status.node_id, (3) deterministic generation from pool name + IP. This ensures idempotent reconciliation.

1078-1111: Good test coverage for phase timeouts and node ID generation stability. The tests verify critical invariants: timeouts are defined, node IDs are deterministic, and different inputs produce different IDs.
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (4)
30-51: LGTM! Controller structure with proper Clone and constructor. The generic design enables testing with mock implementations of `AutoscalerK8sClient` and `SecureCloudApi`.

79-126: Well-implemented optimistic locking for scale-up operations. The resourceVersion-based conflict detection (lines 82-105) prevents concurrent reconciles from double-counting scale-ups. The comment at lines 108-111 clearly documents that partial failures are handled by the NodePool controller.

435-466: LGTM! Comprehensive GPU request detection. The `requests_gpu` function correctly checks both regular containers and init containers for `nvidia.com/gpu` resource requests.
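A hedged sketch of what such a check can look like over k8s-openapi types; it mirrors the behavior described above but is not the crate's exact code.

```rust
// Sketch of GPU-request detection over k8s-openapi PodSpec/Container types.
use k8s_openapi::api::core::v1::{Container, PodSpec};

fn container_requests_gpu(c: &Container) -> bool {
    c.resources
        .as_ref()
        .and_then(|r| r.requests.as_ref())
        .map(|req| req.contains_key("nvidia.com/gpu"))
        .unwrap_or(false)
}

fn requests_gpu(spec: &PodSpec) -> bool {
    // Init containers can also pin a GPU, so both lists are checked.
    spec.containers.iter().any(container_requests_gpu)
        || spec
            .init_containers
            .as_ref()
            .map(|ics| ics.iter().any(container_requests_gpu))
            .unwrap_or(false)
}
```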
400-405: The `offering_id` field in both `SecureCloudTemplate` and `SecureCloudConfig` is documented as the "GPU offering ID from Basilica API," confirming it correctly represents the actual GPU offering ID. No action needed.
8-74: Well-designed trait abstraction for Kubernetes operations. The `AutoscalerK8sClient` trait provides a comprehensive interface for all required K8s operations, enabling mock implementations for testing. The optimistic locking methods (try_increment_pending_scale_up/down) have clear semantics documented in comments.

397-428: LGTM! Pod eviction correctly handles PodDisruptionBudget blocks. The eviction implementation uses the proper Eviction subresource API and correctly interprets HTTP 429 as "blocked by PDB", returning `Ok(false)`.
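For reference, a simplified sketch of that pattern using kube's eviction subresource support; the error-matching details are assumptions and may differ from the crate's implementation.

```rust
// Simplified sketch; not the crate's actual eviction code.
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, EvictParams};

async fn try_evict(pods: &Api<Pod>, name: &str) -> Result<bool, kube::Error> {
    match pods.evict(name, &EvictParams::default()).await {
        Ok(_) => Ok(true),
        // 429 means a PodDisruptionBudget is currently blocking the eviction;
        // report false so the caller retries on the next reconcile.
        Err(kube::Error::Api(e)) if e.code == 429 => Ok(false),
        Err(e) => Err(e),
    }
}
```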
216-255: Optimistic locking implementation is correct. The approach of fetching the full object, comparing resourceVersion locally, and calling `replace_status` works as intended. Kubernetes `replace_status` uses the resourceVersion from metadata for optimistic concurrency control: any stale resourceVersion will result in a 409 Conflict, which the code correctly handles.
The concern about independent resourceVersions between spec and status is unfounded. Kubernetes uses a single resourceVersion per object that increments for any modification (spec or status); if the spec changes, the resourceVersion will be incremented, causing the status replace to fail with a 409 Conflict as expected.
* Use grep guard to check if net.ipv4.ip_forward=1 already exists * Only append to /etc/sysctl.conf if entry is missing * Prevent duplicate entries on repeated provisioning runs
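A small sketch of the generated shell fragment this commit describes, wrapped in an illustrative Rust helper in the style of the provisioner's script builders; the function name is hypothetical.

```rust
// Illustrative helper; only the shell fragment reflects the commit above.
fn ip_forward_sysctl_snippet() -> &'static str {
    // grep -q succeeds when the entry already exists, so the append runs at
    // most once across repeated provisioning runs.
    r#"grep -qF 'net.ipv4.ip_forward=1' /etc/sysctl.conf || echo 'net.ipv4.ip_forward=1' | sudo tee -a /etc/sysctl.conf
sudo sysctl -w net.ipv4.ip_forward=1"#
}
```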
Summary by CodeRabbit
New Features
Chores