Feat/k3s autoscaler #287
Conversation
* Implement NodePool CRD for manual and dynamic node provisioning
* Add ScalingPolicy CRD for automated GPU node scaling decisions
* Create SSH-based node provisioner with K3s installation support
* Implement WireGuard VPN setup for secure cluster networking
* Add leader election for high availability
* Implement health/readiness probes and Prometheus metrics
* Add SecureCloud API client for dynamic node provisioning
* Create unified error handling with AutoscalerError type
* Support configurable reconciliation intervals and phase timeouts

* Create multi-stage Dockerfile with rust:1.88 builder
* Add build.sh script with --image-name and --image-tag options
* Support debug/release build modes via --debug flag
* Configure non-root runtime user for security
* Install openssh-client for node provisioning via SSH

* Define NodePool and ScalingPolicy CRDs with comprehensive validation
* Add ClusterRole and Role RBAC for node, pod, and secret access
* Configure leader election lease permissions
* Create Deployment with security context (non-root, read-only fs)
* Add ConfigMap with default autoscaler configuration
* Include ServiceMonitor for Prometheus metrics scraping
* Add PodDisruptionBudget for high availability
* Set up namespaces for autoscaler and SSH keys storage

* Create autoscaler.yml playbook for K3s deployment
* Apply CRDs, RBAC, and deployment manifests in order
* Support configurable autoscaler_image variable
* Wait for CRD establishment before proceeding
* Verify rollout completion with 300s timeout
* Display pod status after deployment
Caution: Review failed. The pull request is closed.

Walkthrough: Adds a new workspace crate, basilica-autoscaler.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant K8s as Kubernetes API
participant Controller as NodePool Controller
participant KubeClient as KubeClient
participant SecureCloud as SecureCloud API
participant Provisioner as SSH Provisioner
participant Node as Remote Node
rect rgba(200,220,255,0.6)
Note over Controller,KubeClient: Reconcile — load NodePool
Controller->>KubeClient: Get NodePool(ns,name)
KubeClient->>K8s: GET NodePool
K8s-->>KubeClient: NodePool spec/status
KubeClient-->>Controller: NodePool
end
rect rgba(220,255,220,0.6)
Note over Controller,SecureCloud: Dynamic provisioning
Controller->>SecureCloud: start_rental(offering,ssh_key)
SecureCloud-->>Controller: RentalInfo (external_ip,ssh_port)
Controller->>KubeClient: Patch NodePool status (rental_id, external_ip)
end
rect rgba(255,240,200,0.6)
Note over Controller,Provisioner: Configure node & WireGuard
Controller->>Provisioner: configure_base_system(host,port,ssh_cfg)
Provisioner->>Node: SSH run scripts
Node-->>Provisioner: output
Provisioner-->>Controller: done
Controller->>Provisioner: install_wireguard(...)
Provisioner-->>Controller: public_key
Controller->>SecureCloud: register_wireguard_key(node_id, public_key)
SecureCloud-->>Controller: registration response
Controller->>SecureCloud: get_peers(node_id)
SecureCloud-->>Controller: peers
Controller->>Provisioner: configure_wireguard_peers(peers)
end
rect rgba(255,220,220,0.6)
Note over Controller,KubeClient: Join cluster → Ready
Controller->>Provisioner: install_k3s_agent(..., token)
Provisioner->>Node: install k3s agent
Provisioner-->>Controller: agent running
Controller->>KubeClient: Wait for Node readiness
KubeClient->>K8s: GET Node status
K8s-->>KubeClient: Node Ready
KubeClient-->>Controller: Node ready -> Patch NodePool Ready
end
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
Actionable comments posted: 17
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🟡 Minor comments (7)
crates/basilica-autoscaler/Cargo.toml (1)

37-42: Ensure the `tokio-util` workspace dependency includes the "rt" feature. The kube/k8s-openapi version compatibility is correct. However, `tokio-util` is specified in the autoscaler with features `["rt"]`, while the workspace dependency defines it without features. Either add `features = ["rt"]` to the workspace dependency definition or remove the feature specification here to use the workspace version. The `prometheus` crate is legitimately used directly (not just through `metrics-exporter-prometheus`) for metrics instrumentation in `metrics.rs`, so it should remain as a direct dependency rather than a workspace dependency.

crates/basilica-autoscaler/src/crd/node_pool.rs (1)
212-262: Type mismatch: `delete_grace_period_seconds` is `u32` in Rust but `i64` in NodePoolStatus. In `LifecycleConfig` (line 226), `delete_grace_period_seconds` is `u32`, but in `NodePoolStatus` (as referenced in the relevant snippets), there's a reference to `delete_grace_period_seconds: i64`. Additionally, the Kubernetes API uses `i64` for grace periods in `DeleteOptions`.

Consider using `i64` for consistency with Kubernetes API conventions:

```diff
 /// Delete grace period in seconds
 #[serde(default = "default_delete_grace_period")]
-pub delete_grace_period_seconds: u32,
+pub delete_grace_period_seconds: i64,
```

And update the default function accordingly:

```diff
-fn default_delete_grace_period() -> u32 {
+fn default_delete_grace_period() -> i64 {
     30
 }
```

orchestrator/k8s/services/autoscaler/deployment.yaml (1)
196-205: PDB with `minAvailable: 1` conflicts with single-replica deployment. With `replicas: 1` and `minAvailable: 1`, the PDB will block any voluntary disruption (including node drains and rolling updates). Consider either increasing replicas or using `maxUnavailable: 0`, which has similar semantics but is more explicit about the intent.

Either accept that voluntary disruptions will be blocked, or adjust:

```diff
 spec:
-  minAvailable: 1
+  maxUnavailable: 0  # More explicit: block all voluntary disruptions
   selector:
```

Or add a comment explaining the intent:

```diff
 spec:
+  # Single replica - intentionally blocks voluntary disruptions
+  # to maintain leader election continuity
   minAvailable: 1
```

orchestrator/k8s/services/autoscaler/crds.yaml (1)
369-374: ScalingPolicy `minNodes` and `maxNodes` lack validation for min <= max. Users could set `minNodes: 10` and `maxNodes: 5`, which would be invalid. Consider adding a CEL validation rule.

Add a validation rule:

```diff
 spec:
   type: object
+  x-kubernetes-validations:
+    - rule: "!has(self.minNodes) || !has(self.maxNodes) || self.minNodes <= self.maxNodes"
+      message: "minNodes must be less than or equal to maxNodes"
   properties:
```

Committable suggestion skipped: line range outside the PR's diff.
crates/basilica-autoscaler/src/provisioner/ssh.rs (1)

404-440: Appending peers creates duplicates on re-run. Each call appends new `[Peer]` sections without checking if the peer already exists. Re-running this method will create duplicate entries, which WireGuard may reject.

Consider regenerating the entire config file or checking for existing peers before appending.
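A minimal sketch of such a check, written as a pure function over the existing config text so it can be unit tested; the `Peer` struct and function name here are illustrative stand-ins, not the crate's actual `WireGuardPeer` API:

```rust
/// Only emit a [Peer] section when the peer's public key is not already
/// present in the existing config, making repeated runs a no-op.
struct Peer {
    public_key: String,
    allowed_ips: String,
}

fn peer_section_if_missing(existing_config: &str, peer: &Peer) -> Option<String> {
    // A peer is considered present if any PublicKey line ends with its key.
    let already_present = existing_config.lines().any(|line| {
        line.contains("PublicKey") && line.trim().ends_with(peer.public_key.as_str())
    });
    if already_present {
        return None; // re-running the provisioner appends nothing
    }
    Some(format!(
        "\n[Peer]\nPublicKey = {}\nAllowedIPs = {}\n",
        peer.public_key, peer.allowed_ips
    ))
}

fn main() {
    let existing = "[Interface]\nPrivateKey = abc\n\n[Peer]\nPublicKey = KEY1\nAllowedIPs = 10.0.0.2/32\n";
    let peer = Peer { public_key: "KEY1".into(), allowed_ips: "10.0.0.2/32".into() };
    assert!(peer_section_if_missing(existing, &peer).is_none());
}
```

The same idea works for the full-regeneration approach: build the complete peer list first, then write the file once instead of appending.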
crates/basilica-autoscaler/src/provisioner/ssh.rs (1)

383-390: Appending to sysctl.conf creates duplicates on re-run. Each invocation appends `net.ipv4.ip_forward=1` to `/etc/sysctl.conf`, creating duplicate entries if the provisioner runs multiple times.

Use a drop-in file or check for existing entries:

```diff
-        self.execute_ssh_command(
-            host,
-            port,
-            ssh_config,
-            "echo 'net.ipv4.ip_forward=1' >> /etc/sysctl.conf && sysctl -p",
-        )
+        self.execute_ssh_command(
+            host,
+            port,
+            ssh_config,
+            "echo 'net.ipv4.ip_forward=1' > /etc/sysctl.d/99-wireguard.conf && sysctl --system",
+        )
```

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)
319-353: Pools without existing status are silently skipped during scale-down. If `pool.status` is `None` (line 343), the pool won't be transitioned to the `Draining` state. This could leave pools that should be drained in an inconsistent state.

Consider creating a default status when none exists:

```diff
-        for pool in candidates.into_iter().take(count as usize) {
-            let pool_name = pool.name_any();
-            info!(pool = %pool_name, "Initiating drain for scale-down");
-
-            if let Some(mut status) = pool.status.clone() {
-                status.phase = Some(NodePoolPhase::Draining);
-                status.phase_entered_at = Some(Utc::now());
-                self.k8s
-                    .update_node_pool_status(ns, &pool_name, status)
-                    .await?;
-            }
-        }
+        for pool in candidates.into_iter().take(count as usize) {
+            let pool_name = pool.name_any();
+            info!(pool = %pool_name, "Initiating drain for scale-down");
+
+            let mut status = pool.status.clone().unwrap_or_default();
+            status.phase = Some(NodePoolPhase::Draining);
+            status.phase_entered_at = Some(Utc::now());
+            self.k8s
+                .update_node_pool_status(ns, &pool_name, status)
+                .await?;
+        }
```

Committable suggestion skipped: line range outside the PR's diff.
🧹 Nitpick comments (42)
crates/basilica-autoscaler/src/leader_election.rs (2)
116-129: Initial delay before first lease attempt. The loop sleeps for `config.retry_period` before the first lease acquisition attempt. On startup, this means leader election won't begin immediately. Consider using `tokio::time::interval` or attempting acquisition before entering the loop.

```diff
 loop {
+    // Attempt lease acquisition/renewal first, then wait
+    match try_acquire_or_renew(&api, &config, &identity).await {
+        // ... handle result
+    }
+
     tokio::select! {
         _ = shutdown_rx.changed() => {
             // ... shutdown handling
         }
-        _ = tokio::time::sleep(config.retry_period) => {
-            match try_acquire_or_renew(&api, &config, &identity).await {
+        _ = tokio::time::sleep(config.retry_period) => {}
     }
 }
```

72-83: Spawned task is not tracked or awaited. The spawned `leader_election_loop` task handle is dropped immediately. If the loop panics, the error won't propagate. Consider storing the `JoinHandle` and checking it periodically or on shutdown.

```diff
 pub struct LeaderElector {
     // ...
     shutdown_tx: Option<watch::Sender<bool>>,
+    task_handle: Option<tokio::task::JoinHandle<()>>,
 }

 // In start():
-tokio::spawn(async move {
+let handle = tokio::spawn(async move {
     leader_election_loop(/* ... */).await;
 });
+self.task_handle = Some(handle);
```

crates/basilica-autoscaler/src/main.rs (1)
17-17: Preserve error context when converting errors. Using `anyhow::anyhow!("{}", e)` loses the error chain. Use `.context()` or `anyhow::Error::from()` to preserve the full error information for debugging.

```diff
-    runtime.run().await.map_err(|e| anyhow::anyhow!("{}", e))
+    runtime.run().await.map_err(|e| anyhow::anyhow!(e).context("controller runtime failed"))
```

Or if the runtime error implements `std::error::Error`:

```diff
-    runtime.run().await.map_err(|e| anyhow::anyhow!("{}", e))
+    runtime.run().await.map_err(anyhow::Error::from)
```

scripts/autoscaler/Dockerfile (2)
23-40: Clarify cleanup chaining for readability and future maintenance. The `rm -rf`/`find` cleanup chain mixes `&&` and `||`:

```sh
... && \
rm -rf ... 2>/dev/null || true && \
find /tmp/target -name "*.rlib" -delete 2>/dev/null || true
```

This works, but the precedence is non-obvious (`&&`/`||` left-associative), which makes it easy to misread. Consider explicitly grouping and separating the commands to make intent clearer, e.g.:

```dockerfile
RUN if [ "$BUILD_MODE" = "release" ]; then \
        cargo build --locked --release -p basilica-autoscaler; \
    else \
        cargo build --locked -p basilica-autoscaler; \
    fi && \
    (rm -rf /tmp/target/debug/deps /tmp/target/debug/incremental \
        /tmp/target/release/deps /tmp/target/release/incremental 2>/dev/null || true) && \
    (find /tmp/target -name "*.rlib" -delete 2>/dev/null || true)
```

Same behavior, but easier to reason about.

45-57: Align build-time and runtime expectations for BUILD_MODE. The runtime stage also declares `ARG BUILD_MODE=release` and copies:

```dockerfile
COPY --from=builder /tmp/target/${BUILD_MODE}/basilica-autoscaler /usr/local/bin/basilica-autoscaler
```

This is correct as long as callers always pass a matching `--build-arg BUILD_MODE=...` to both stages (which Docker does), and only `debug`/`release` are used. If you ever add more modes or change the target dir layout, this tight coupling can bite. A slightly more defensive pattern is to normalize the build output (e.g., always copy from `/tmp/target/release` and only tweak optimization flags) or validate `BUILD_MODE` in the builder RUN step and fail fast on unknown values.

orchestrator/ansible/playbooks/02-deploy/autoscaler.yml (1)
7-22: Confirm repo_root_final points at the path you expect. `_detected_repo_root` walks three directories up from `playbook_dir`, so by default `repo_root_final` resolves to the `orchestrator` directory, and the copy task uses:

```yaml
src: "{{ repo_root_final }}/k8s/services/autoscaler/"
```

This assumes `k8s/services/autoscaler` lives directly under `orchestrator/`. If the actual repo layout differs (e.g., manifests live under `orchestrator/k8s/...` but `_detected_repo_root` should really be the repository root), this path could silently break.

Worth double-checking this against your other playbooks and adjusting the number of `dirname` hops if needed.

crates/basilica-autoscaler/src/lib.rs (1)
1-17: Crate root API surface is cohesive and clear. The crate root exposes the main functional areas as modules and re-exports the key types (`AutoscalerConfig`, `AutoscalerError`, `Result`, `AutoscalerK8sClient`, `KubeClient`, `NodeProvisioner`, `SshProvisioner`). This should make downstream usage straightforward (`basilica_autoscaler::{AutoscalerConfig, Result, ...}`) without having to dig into internal modules.

Once `ControllerRuntime` stabilizes, you may also want to re-export it from here for an even simpler `basilica_autoscaler::ControllerRuntime` entry point, but that's optional.

scripts/autoscaler/build.sh (2)
1-41: Tighten argument and shell robustness. Two small improvements here:

- The script currently uses `set -e` only. Adding `-u` and `-o pipefail` makes it more robust to unset variables and pipeline failures:

```sh
#!/bin/bash
set -euo pipefail
```

- `--image-name`/`--image-tag` assume a following argument. If the flag is last by mistake, `shift 2` will error out somewhat cryptically under `set -e`. You can make this failure mode clearer:

```sh
--image-name)
    if [[ $# -lt 2 ]]; then
        echo "Missing value for --image-name" >&2
        exit 1
    fi
    IMAGE_NAME="$2"
    shift 2
    ;;
```

Same idea for `--image-tag`.

7-8: Be aware of default image name vs deployment defaults. The build script defaults to:

```sh
IMAGE_NAME="basilica/basilica-autoscaler"
IMAGE_TAG="latest"
```

while the Ansible playbook defaults `autoscaler_image` to `ghcr.io/one-covenant/basilica-autoscaler:latest`. That's fine if you intentionally distinguish "local build name" from "published image name", but if the intention is to build exactly what you deploy via Ansible, consider aligning these defaults or documenting the difference to avoid confusion.

crates/basilica-autoscaler/src/runtime.rs (2)
322-325: The `shutdown` API is currently hard to use with `run(self)`. `ControllerRuntime::run(self)` consumes the runtime instance, while `shutdown(&self)` expects a shared reference. In the common pattern:

```rust
let runtime = ControllerRuntime::new(config);
runtime.run().await?;
```

there's no way to call `shutdown()` while `run` is in progress, so the `CancellationToken` path in the `select!` is effectively unreachable unless you redesign the caller.

If you intend to support programmatic shutdown (e.g., embedding the runtime in another service), consider changing one of:

- Make `run` take `&self` or `Arc<Self>` instead of `self`, or
- Expose a separate shutdown handle that wraps the `CancellationToken` and can be cloned and sent to other tasks (see the sketch below).

Right now the only practical shutdown trigger is `ctrl_c` or internal leader failures.
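A small sketch of the second option, assuming `tokio-util`'s `CancellationToken` (the same token the runtime already selects on); the type and method names are illustrative, not the crate's existing API:

```rust
use tokio_util::sync::CancellationToken;

/// Cloneable shutdown handle: callers keep the handle, the runtime keeps a
/// child token and selects on it inside `run`. Purely a sketch.
#[derive(Clone)]
pub struct ShutdownHandle {
    token: CancellationToken,
}

impl ShutdownHandle {
    pub fn new() -> Self {
        Self { token: CancellationToken::new() }
    }

    /// Token the runtime would await via `child_token().cancelled()`.
    pub fn child_token(&self) -> CancellationToken {
        self.token.child_token()
    }

    /// Any task holding a clone can trigger shutdown.
    pub fn shutdown(&self) {
        self.token.cancel();
    }
}
```

With this shape, `run(self)` can stay consuming: the caller clones the handle before handing the runtime off, and later calls `shutdown()` from any task.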
332-337: Test depends on `AutoscalerConfig::from_env` being infallible. The unit test uses:

```rust
let config = AutoscalerConfig::from_env();
```

If `from_env` ever becomes fallible (e.g., reading required env vars and panicking on missing ones), this test will be environment-dependent and potentially flaky in CI.

If `from_env` is already fallible, consider a `Result`-returning constructor and using a helper like `AutoscalerConfig::for_tests()` here instead, so tests don't rely on ambient environment configuration.

crates/basilica-autoscaler/src/controllers/k8s_client.rs (1)
225-232: Label selector may need escaping for special characters. If `key` or `value` contain special characters (e.g., `/`, `=`), the label selector format `key=value` could be malformed or misinterpreted.

Consider validating or escaping the key/value, or use `ListParams::default().labels_from()` if available in kube-rs for safer construction.
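One possible shape for the defensive option, shown as a small validation helper; the charset check is a simplification of the full Kubernetes label rules and the function is illustrative, not part of the crate:

```rust
/// Reject key/value strings outside a conservative label charset instead of
/// interpolating them blindly into "key=value". Real label validation also
/// splits the key into prefix/name and enforces 63-character name limits.
fn build_label_selector(key: &str, value: &str) -> Result<String, String> {
    fn ok(s: &str) -> bool {
        !s.is_empty()
            && s.len() <= 63
            && s.chars()
                .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.' | '/'))
    }
    if !ok(key) || (!value.is_empty() && !ok(value)) {
        return Err(format!("invalid label selector components: {key}={value}"));
    }
    Ok(format!("{key}={value}"))
}

fn main() {
    assert!(build_label_selector("nvidia.com/gpu", "true").is_ok());
    assert!(build_label_selector("bad key", "x=y,z").is_err());
}
```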
crates/basilica-autoscaler/src/crd/node_pool.rs (1)

334-449: Status struct is comprehensive, but consider using an `Option` wrapper for `failure_count`. The `failure_count` field defaults to 0, which is fine, but other integer fields like `gpu_count` use `Option<u32>`. For consistency, and to distinguish "never checked" from "zero failures," consider making it optional.

```diff
     /// Failure count
-    #[serde(default)]
-    pub failure_count: u32,
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub failure_count: Option<u32>,
```

orchestrator/k8s/services/autoscaler/crds.yaml (1)
122-124: Server URL pattern allows any `https://` prefix. The pattern `^https://.+` is minimal but functional. For production, consider adding validation for valid URL format or at least requiring a host portion.

orchestrator/k8s/services/autoscaler/deployment.yaml (2)
54-55: Using the `latest` tag and `imagePullPolicy: Always` may cause deployment instability. Using the `latest` tag can lead to unexpected behavior when images are updated. Consider using semantic versioning or commit SHA tags for reproducibility.

```diff
-        image: ghcr.io/one-covenant/basilica-autoscaler:latest
-        imagePullPolicy: Always
+        image: ghcr.io/one-covenant/basilica-autoscaler:v0.1.0  # Pin to specific version
+        imagePullPolicy: IfNotPresent
```

69-74: Duplicate leader election configuration in env vars and ConfigMap. Leader election settings are defined both as environment variables (lines 69-74) and in the ConfigMap (lines 177-183). This could lead to confusion about which takes precedence.

Choose one source of truth. If the application reads from the config file, remove the env vars:

```diff
-        - name: LEADER_ELECTION_LEASE_DURATION
-          value: "15s"
-        - name: LEADER_ELECTION_RENEW_DEADLINE
-          value: "10s"
-        - name: LEADER_ELECTION_RETRY_PERIOD
-          value: "2s"
```

crates/basilica-autoscaler/src/health.rs (2)
26-46: Consider batching state updates to reduce lock contention. Each setter acquires a write lock independently. If multiple state fields are updated together during reconciliation, consider adding a combined setter to reduce lock overhead:

```rust
pub async fn update_status(&self, ready: bool, leader: bool) {
    let mut status = self.inner.write().await;
    status.ready = ready;
    status.leader = leader;
}
```

88-102: Health server runs indefinitely without graceful shutdown support. The `axum::serve` call blocks forever. Consider accepting a shutdown signal for graceful termination:

```diff
 pub async fn start_health_server(
     host: &str,
     port: u16,
     state: HealthState,
+    shutdown: impl std::future::Future<Output = ()> + Send + 'static,
 ) -> Result<(), std::io::Error> {
     let addr = format!("{}:{}", host, port);
     info!(addr = %addr, "Starting health server");
     let app = health_router(state);
     let listener = tokio::net::TcpListener::bind(&addr).await?;
-    axum::serve(listener, app).await?;
+    axum::serve(listener, app)
+        .with_graceful_shutdown(shutdown)
+        .await?;
     Ok(())
 }
```

crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
42-43: Fixed 10-second sleep is fragile for installation verification. A hardcoded `sleep 10` may be insufficient on slower systems or excessive on fast ones. Consider polling the service status with a timeout loop instead:

```sh
# Wait for agent to be ready (up to 60 seconds)
for i in $(seq 1 12); do
    if systemctl is-active --quiet k3s-agent || systemctl is-active --quiet k3s; then
        exit 0
    fi
    sleep 5
done
echo "K3s agent failed to start"
exit 1
```

111-119: Version parsing may fail on non-standard output formats. The parser assumes the version is always the third whitespace-separated token. Consider making this more robust or documenting the expected format:

```diff
 pub fn parse_k3s_version(output: &str) -> Option<String> {
-    // K3s version format: "k3s version v1.27.4+k3s1 (hash)"
+    // Expected format: "k3s version v1.27.4+k3s1 (hash)"
+    // Returns None if format doesn't match
     output
         .lines()
         .next()
-        .and_then(|line| line.split_whitespace().nth(2))
+        .and_then(|line| {
+            let token = line.split_whitespace().nth(2)?;
+            if token.starts_with('v') { Some(token.to_string()) } else { None }
+        })
-        .map(|v| v.to_string())
 }
```

orchestrator/k8s/services/autoscaler/rbac.yaml (2)
85-108: Consider restricting the SSH keys Role to specific secret names. The Role grants `get` and `list` on all secrets in the `basilica-ssh-keys` namespace, which is broader than necessary. If the SSH key secret names are known, restrict access using `resourceNames`:

```diff
 rules:
   - apiGroups: [""]
     resources: ["secrets"]
-    verbs: ["get", "list"]
+    verbs: ["get"]
+    resourceNames:
+      - basilica-ssh-private-key
+      # Add other specific SSH key secret names
```

This follows the principle of least privilege and addresses the static analysis finding.

116-125: Redundant `create` verb in a resourceNames-scoped rule. The rule on lines 117-121 includes `create` with `resourceNames`, but `resourceNames` doesn't apply to `create` operations. The separate rule on lines 123-125 correctly handles initial lease creation. Remove `create` from the first rule:

```diff
 - apiGroups: ["coordination.k8s.io"]
   resources: ["leases"]
-  verbs: ["get", "create", "update", "patch"]
+  verbs: ["get", "update", "patch"]
   resourceNames:
     - basilica-autoscaler-leader
```

crates/basilica-autoscaler/src/metrics.rs (5)
46-54: Metric creation uses unwrap() which can panic. While unlikely to fail for fresh registries, consider propagating errors or using `expect()` with context for better debugging:

```diff
-        let node_pools_total = IntGauge::new(
-            "autoscaler_node_pools_total",
-            "Total number of NodePool resources",
-        )
-        .unwrap();
+        let node_pools_total = IntGauge::new(
+            "autoscaler_node_pools_total",
+            "Total number of NodePool resources",
+        )
+        .expect("failed to create node_pools_total metric");
```

This applies to all metric creations in this file.

65-72: Potential high cardinality on the `phase_transitions_total` metric. The `pool` label can grow unbounded with many NodePools. Consider using a fixed identifier or aggregating by phase only:

```rust
let phase_transitions_total = IntCounterVec::new(
    Opts::new(
        "autoscaler_phase_transitions_total",
        "Total number of phase transitions",
    ),
    &["to_phase"], // Remove "pool" label to prevent cardinality explosion
)
```

Alternatively, implement label cleanup when pools are deleted.

264-269: Debug formatting for phase may produce inconsistent label values. `format!("{:?}", phase)` produces Rust debug output which may include variant details. Consider using a stable string representation:

```diff
 pub fn record_phase_transition(&self, pool: &str, phase: &NodePoolPhase) {
-    let phase_str = format!("{:?}", phase);
+    let phase_str = phase.as_str(); // Implement Display or as_str() on NodePoolPhase
     self.phase_transitions_total
         .with_label_values(&[pool, &phase_str])
         .inc();
 }
```

290-293: The `rentals_active` gauge can go negative if `record_rental_stopped` is called without a matching start. Consider guarding against negative values:

```diff
 pub fn record_rental_stopped(&self, _pool: &str) {
     self.rentals_stopped_total.inc();
-    self.rentals_active.dec();
+    if self.rentals_active.get() > 0 {
+        self.rentals_active.dec();
+    }
 }
```

359-373: Metrics server also lacks graceful shutdown support. Same concern as the health server - consider adding shutdown signal support for clean termination during pod lifecycle events.
crates/basilica-autoscaler/src/config.rs (3)
76-88: Leader election timing constraints should be validated. The Kubernetes leader election algorithm requires `lease_duration > renew_deadline > retry_period`. Invalid configurations will cause runtime failures. Consider adding validation:

```rust
impl LeaderElectionConfig {
    fn from_env() -> Self {
        let config = Self { /* ... current impl ... */ };
        if config.enabled {
            assert!(
                config.lease_duration > config.renew_deadline,
                "lease_duration must be greater than renew_deadline"
            );
            assert!(
                config.renew_deadline > config.retry_period,
                "renew_deadline must be greater than retry_period"
            );
        }
        config
    }
}
```

157-168: Empty control plane IPs list could cause silent failures. If the environment variable is set to an empty string or whitespace-only, `control_plane_ips` will be empty, potentially causing network validation to pass vacuously. Consider logging a warning or validating:

```diff
 impl NetworkValidationConfig {
     fn from_env() -> Self {
         let default_ips = "10.200.0.1,10.200.0.2,10.200.0.3".to_string();
         let ips_str = std::env::var("BASILICA_AUTOSCALER_CONTROL_PLANE_IPS").unwrap_or(default_ips);
         let control_plane_ips: Vec<String> = ips_str
             .split(',')
             .map(|s| s.trim().to_string())
             .filter(|s| !s.is_empty())
             .collect();
+        if control_plane_ips.is_empty() {
+            tracing::warn!("No control plane IPs configured for network validation");
+        }
         Self { control_plane_ips }
     }
 }
```

206-211: `parse_bool_env` silently treats invalid values as `false`. Values like "yes", "YES", "True" (with capital T followed by lowercase) would return `false` instead of the default. Consider being more explicit:

```diff
 fn parse_bool_env(key: &str, default: bool) -> bool {
     std::env::var(key)
         .ok()
-        .map(|s| s.eq_ignore_ascii_case("true") || s == "1")
+        .map(|s| match s.to_lowercase().as_str() {
+            "true" | "1" | "yes" | "on" => true,
+            "false" | "0" | "no" | "off" => false,
+            _ => {
+                tracing::warn!(key = %key, value = %s, "Invalid boolean value, using default");
+                default
+            }
+        })
         .unwrap_or(default)
 }
```

crates/basilica-autoscaler/src/provisioner/wireguard.rs (2)
92-101: Hardcoded network interface and subnet mask may not match all deployments. The configuration hardcodes `eth0` for NAT masquerading and `/16` for the address subnet. These assumptions may not hold across different cloud providers or network configurations.

Consider making these configurable via `WireGuardConfig` or as parameters:

```diff
 pub fn generate_wg_config(
     private_key: &str,
     address: &str,
     listen_port: u16,
     interface_name: &str,
     peers: &[WireGuardPeer],
+    subnet_mask: u8,
+    external_iface: &str,
 ) -> String {
     let mut config = format!(
         r#"[Interface]
 PrivateKey = {}
-Address = {}/16
+Address = {}/{}
 ListenPort = {}
-PostUp = iptables -A FORWARD -i {} -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
-PostDown = iptables -D FORWARD -i {} -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
+PostUp = iptables -A FORWARD -i {} -j ACCEPT; iptables -t nat -A POSTROUTING -o {} -j MASQUERADE
+PostDown = iptables -D FORWARD -i {} -j ACCEPT; iptables -t nat -D POSTROUTING -o {} -j MASQUERADE
 "#,
-        private_key, address, listen_port, interface_name, interface_name
+        private_key, address, subnet_mask, listen_port, interface_name, external_iface, interface_name, external_iface
     );
```

17-17: Unused field `k8s` in `WireGuardInstaller`. The `k8s` field is stored in the struct but never used in the current implementation. Either remove it if not needed, or add a TODO comment explaining the planned usage.

crates/basilica-autoscaler/src/provisioner/ssh.rs (2)
481-486: `unwrap()` on `SystemTime` could panic in edge cases. `duration_since(UNIX_EPOCH).unwrap()` can panic if the system clock is set before the Unix epoch. Use `unwrap_or_default()` for robustness:

```diff
 let now = std::time::SystemTime::now()
     .duration_since(std::time::UNIX_EPOCH)
-    .unwrap()
+    .unwrap_or_default()
     .as_secs();
```

672-685: Test coverage is minimal. The test only verifies struct creation. Consider adding unit tests for error handling paths, particularly for SSH timeout scenarios.
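A sketch of what a timeout-path test could look like, using a stand-in error type and a helper that wraps a never-resolving future in `tokio::time::timeout`; it assumes tokio's `test-util` feature for paused time and does not reflect the provisioner's real API:

```rust
/// Hypothetical stand-in for the crate's SSH connection error variant.
#[derive(Debug, PartialEq)]
enum ConnectError {
    Timeout,
}

/// Wraps a future that never resolves (simulating an unreachable host) and
/// maps the elapsed timeout to a retryable connection error.
async fn connect_with_timeout(timeout: std::time::Duration) -> Result<(), ConnectError> {
    let never_resolves = std::future::pending::<()>();
    tokio::time::timeout(timeout, never_resolves)
        .await
        .map_err(|_| ConnectError::Timeout)
}

#[tokio::test(start_paused = true)] // requires tokio's "test-util" feature
async fn connect_times_out_and_maps_to_error() {
    let result = connect_with_timeout(std::time::Duration::from_secs(5)).await;
    assert_eq!(result, Err(ConnectError::Timeout));
}
```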
crates/basilica-autoscaler/src/error.rs (1)

132-145: Consider adding `CircuitBreakerOpen` to retryable errors. `CircuitBreakerOpen` indicates a transient condition that resolves after the circuit breaker's reset timeout. It should likely be retryable.

```diff
 pub fn is_retryable(&self) -> bool {
     matches!(
         self,
         Self::SshConnection { .. }
             | Self::SshExecution { .. }
             | Self::NetworkValidation(_)
             | Self::KubeApi(_)
             | Self::Http(_)
             | Self::SecureCloudApi(_)
             | Self::LeaderElection(_)
+            | Self::CircuitBreakerOpen(_)
     )
 }
```

crates/basilica-autoscaler/src/crd/scaling_policy.rs (1)

106-108: Consider adding validation for the `gpu_utilization_threshold` range. The threshold is an f32 without bounds validation. A value outside 0-100 would be nonsensical. Consider adding a validation webhook or documenting the expected range.
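A minimal sketch of the range check, assuming the threshold is a percentage; the function name and error type are illustrative and the check could equally live in a validating webhook:

```rust
/// Reject non-finite values and anything outside 0-100 before acting on a
/// ScalingPolicy; purely illustrative, not the crate's API.
fn validate_gpu_utilization_threshold(threshold: f32) -> Result<(), String> {
    if !threshold.is_finite() || !(0.0..=100.0).contains(&threshold) {
        return Err(format!(
            "gpu_utilization_threshold must be between 0 and 100, got {threshold}"
        ));
    }
    Ok(())
}

fn main() {
    assert!(validate_gpu_utilization_threshold(75.0).is_ok());
    assert!(validate_gpu_utilization_threshold(150.0).is_err());
    assert!(validate_gpu_utilization_threshold(f32::NAN).is_err());
}
```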
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (2)

158-189: Best-effort cleanup may leave orphaned resources. Errors during cleanup (delete_node, stop_rental, deregister_node) are logged but the finalizer is still removed. This prevents re-reconciliation and may leave orphaned resources.

Consider tracking cleanup failures in status and retrying before removing the finalizer, or at minimum emit a metric/event for alerting:

```diff
+    let mut cleanup_errors = Vec::new();
+
     if let Some(node_name) = &status.node_name {
         if let Err(e) = self.k8s.delete_node(node_name).await {
             warn!(node = %node_name, error = %e, "Failed to delete K8s node");
+            cleanup_errors.push(format!("node/{}: {}", node_name, e));
         }
     }
     // ... similar for other cleanup steps
+
+    if !cleanup_errors.is_empty() {
+        self.metrics.record_cleanup_failures(&name, &cleanup_errors);
+    }
```

234-234: Redundant import of `ResourceExt`. `kube::ResourceExt` is already imported at the module level (line 4).

```diff
-    use kube::ResourceExt;
```
165-211: Metrics collection silently ignores errors for node listing.Line 173 uses
unwrap_or_default()to swallow errors fromlist_nodes_with_label, and line 192 does the same forlist_node_pools. While this prevents reconciliation failures, it means the controller may make scaling decisions based on incomplete data (e.g., thinking there are 0 GPU nodes when the API call failed).Consider logging when these calls fail to aid debugging:
let nodes = self .k8s .list_nodes_with_label("nvidia.com/gpu", "true") .await - .unwrap_or_default(); + .unwrap_or_else(|e| { + warn!("Failed to list GPU nodes: {}", e); + Vec::new() + });
301-307: Timestamp-based naming could cause collisions across rapid scale-up calls.The pool name uses second-precision timestamps. While the index
iprevents collisions within a singlescale_upcall, two concurrent reconciliations or rapid successive calls in the same second could generate duplicate names. Kubernetes will reject duplicates, but this would cause the operation to fail.Consider adding a random suffix or using a more granular timestamp:
- let pool_name = format!( - "{}-{}-{}", - policy_name, - Utc::now().format("%Y%m%d%H%M%S"), - i - ); + let pool_name = format!( + "{}-{}-{:04x}", + policy_name, + Utc::now().format("%Y%m%d%H%M%S"), + rand::random::<u16>() + );
463-473: Tests are minimal; consider adding coverage for core logic.The current test only verifies
ScalingDecisionequality semantics. Key logic likeevaluate_scalingdecision boundaries, cooldown checks, and metrics evaluation would benefit from unit tests.Would you like me to generate additional unit tests for the
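A sketch of a cooldown-boundary test against a stand-alone helper that mirrors the check described above; `cooldown_elapsed` is hypothetical, not the controller's actual method:

```rust
use chrono::{Duration, Utc};

/// Scale actions are allowed only once the last scale event is older than the
/// cooldown window; this mirrors the cooldown check in spirit only.
fn cooldown_elapsed(last_scale: chrono::DateTime<Utc>, cooldown: Duration) -> bool {
    Utc::now() - last_scale >= cooldown
}

#[test]
fn scale_up_blocked_inside_cooldown_window() {
    let cooldown = Duration::seconds(300);
    let just_scaled = Utc::now() - Duration::seconds(30);
    let long_ago = Utc::now() - Duration::seconds(600);

    assert!(!cooldown_elapsed(just_scaled, cooldown)); // still cooling down
    assert!(cooldown_elapsed(long_ago, cooldown)); // window has passed
}
```

Decision-boundary tests for `evaluate_scaling` could follow the same pattern: fix the inputs (node counts, utilization), call the function, and assert on the returned `ScalingDecision`.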
crates/basilica-autoscaler/src/api/secure_cloud.rs (1)

296-310: Using `expect()` in constructors can cause panics in production. `reqwest::Client::builder().build()` can fail (e.g., TLS initialization issues). Using `expect()` will panic, which is problematic for library code.

Consider returning `Result` from constructors:

```diff
-    pub fn new(base_url: String, api_key: String, timeout: Duration) -> Self {
+    pub fn new(base_url: String, api_key: String, timeout: Duration) -> Result<Self> {
         let client = reqwest::Client::builder()
             .timeout(timeout)
-            .build()
-            .expect("Failed to create HTTP client");
+            .build()?;

-        Self {
+        Ok(Self {
             client,
             base_url,
             api_key,
             circuit_breaker: Arc::new(CircuitBreaker::new(CircuitBreakerConfig::default())),
-        }
+        })
     }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (30)
- Cargo.toml (1 hunks)
- crates/basilica-autoscaler/Cargo.toml (1 hunks)
- crates/basilica-autoscaler/src/api/mod.rs (1 hunks)
- crates/basilica-autoscaler/src/api/secure_cloud.rs (1 hunks)
- crates/basilica-autoscaler/src/bin/crdgen.rs (1 hunks)
- crates/basilica-autoscaler/src/config.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/k8s_client.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/mod.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/crd/mod.rs (1 hunks)
- crates/basilica-autoscaler/src/crd/node_pool.rs (1 hunks)
- crates/basilica-autoscaler/src/crd/scaling_policy.rs (1 hunks)
- crates/basilica-autoscaler/src/error.rs (1 hunks)
- crates/basilica-autoscaler/src/health.rs (1 hunks)
- crates/basilica-autoscaler/src/leader_election.rs (1 hunks)
- crates/basilica-autoscaler/src/lib.rs (1 hunks)
- crates/basilica-autoscaler/src/main.rs (1 hunks)
- crates/basilica-autoscaler/src/metrics.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/k3s.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/mod.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/ssh.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/wireguard.rs (1 hunks)
- crates/basilica-autoscaler/src/runtime.rs (1 hunks)
- orchestrator/ansible/playbooks/02-deploy/autoscaler.yml (1 hunks)
- orchestrator/k8s/services/autoscaler/crds.yaml (1 hunks)
- orchestrator/k8s/services/autoscaler/deployment.yaml (1 hunks)
- orchestrator/k8s/services/autoscaler/rbac.yaml (1 hunks)
- scripts/autoscaler/Dockerfile (1 hunks)
- scripts/autoscaler/build.sh (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (9)
crates/basilica-autoscaler/src/main.rs (2)
crates/basilica-autoscaler/src/runtime.rs (2)
fmt(162-164)new(32-37)crates/basilica-autoscaler/src/config.rs (8)
from_env(17-27)from_env(39-45)from_env(56-62)from_env(77-87)from_env(98-104)from_env(115-121)from_env(135-147)from_env(158-167)
crates/basilica-autoscaler/src/controllers/k8s_client.rs (3)
crates/basilica-api/src/server.rs (1)
std(1211-1211)crates/basilica-autoscaler/src/crd/node_pool.rs (2)
default(189-196)default(242-251)crates/basilica-autoscaler/src/crd/scaling_policy.rs (3)
default(79-85)default(124-132)default(210-215)
crates/basilica-autoscaler/src/crd/node_pool.rs (1)
crates/basilica-autoscaler/src/crd/scaling_policy.rs (2)
crd_metadata_is_correct(325-332)spec_has_required_fields(335-354)
crates/basilica-autoscaler/src/lib.rs (2)
crates/basilica-autoscaler/src/controllers/k8s_client.rs (1)
api(328-329)crates/basilica-common/src/ssh/connection.rs (1)
config(131-133)
crates/basilica-autoscaler/src/api/mod.rs (1)
orchestrator/cluster-manager/src/clustermgr/commands/topology.py (1)
WireGuardPeer(30-83)
crates/basilica-autoscaler/src/crd/scaling_policy.rs (1)
crates/basilica-autoscaler/src/crd/node_pool.rs (2)
default(189-196)default(242-251)
crates/basilica-autoscaler/src/error.rs (1)
crates/basilica-api/src/server.rs (1)
std(1211-1211)
crates/basilica-autoscaler/src/provisioner/ssh.rs (2)
crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
parse_k3s_version(112-119)verify_k3s_status(107-109)crates/basilica-autoscaler/src/provisioner/wireguard.rs (3)
generate_wg_config(85-117)parse_wg_show_output(120-156)new(27-33)
crates/basilica-autoscaler/src/provisioner/wireguard.rs (1)
crates/basilica-api/src/wireguard/config.rs (1)
allowed_ips(61-73)
🪛 Checkov (3.2.334)
orchestrator/k8s/services/autoscaler/rbac.yaml
[high] 71-84: No ServiceAccount/Node should be able to read all secrets
(CKV2_K8S_5)
[high] 96-109: No ServiceAccount/Node should be able to read all secrets
(CKV2_K8S_5)
🔇 Additional comments (45)
Cargo.toml (1)
19-19: LGTM! The new workspace member is correctly added and maintains the existing ordering.
crates/basilica-autoscaler/src/leader_election.rs (1)
310-322: LGTM!The unit test correctly validates the atomic behavior of the shared
is_leaderflag across clones.crates/basilica-autoscaler/src/bin/crdgen.rs (1)
4-9: LGTM!Clean utility for CRD generation. The use of
expect()is appropriate for a CLI tool where failure should be immediately visible. The YAML document separator (---) correctly separates multiple CRD documents.crates/basilica-autoscaler/src/main.rs (1)
7-10: LGTM!The tracing setup is clean with a sensible default log level for the autoscaler module. The
EnvFilter::from_default_env()allows runtime configuration viaRUST_LOG.crates/basilica-autoscaler/src/provisioner/mod.rs (1)
1-7: Provisioner module wiring looks consistentThe module declarations and re-exports line up cleanly with the crate root API surface (K3s installer, SSH provisioner, WireGuard installer). No issues from a structure or visibility standpoint.
crates/basilica-autoscaler/src/api/mod.rs (1)
1-7: API module re-exports are well-scopedRe-exporting the Secure Cloud types and client from
secure_cloudat this level gives a nice, flat API (basilica_autoscaler::api::{SecureCloudClient, GpuOffering, ...}) without over-exposing the underlying module structure. This looks good and keeps callers’ imports simple.crates/basilica-autoscaler/src/crd/mod.rs (1)
1-12: CRD re-exports provide a clean, centralized type surfaceRe-exporting the NodePool and ScalingPolicy types (plus their associated configs, status types, and
FINALIZER) from this module gives consumers a single import point for all CRD-related types. The selection of exported items looks intentional and avoids leaking internal details.No issues from a structure or visibility standpoint.
crates/basilica-autoscaler/src/controllers/k8s_client.rs (3)
1-63: Well-designed trait abstraction for Kubernetes operations.The
AutoscalerK8sClienttrait provides a clean, testable abstraction over Kubernetes API operations. TheSend + Syncbounds are appropriate for async runtimes.
310-341: Pod eviction implementation looks correct.The eviction uses the proper Eviction subresource API and correctly handles the 429 (Too Many Requests) status for PodDisruptionBudget violations by returning
Ok(false).
350-358: Good test forSend + Syncbounds.This compile-time test ensures the client can be safely shared across threads, which is essential for async Kubernetes controllers.
crates/basilica-autoscaler/src/crd/node_pool.rs (3)
1-65: Well-structured CRD definition with comprehensive spec fields.The NodePoolSpec is well-designed with clear separation of concerns for different provisioning modes (Manual/Dynamic), K3s configuration, and lifecycle management. The use of
serde(default)andskip_serializing_ifis appropriate for optional fields.
287-324: Comprehensive phase enum with clear lifecycle stages.The
NodePoolPhaseenum covers the full lifecycle fromPendingthroughReady,Unhealthy, and terminal states. TheDisplayimplementation is straightforward and correct.
496-540: Good test coverage for CRD metadata validation.The tests verify CRD name, group, kind, scope, and required fields. This aligns with similar tests in
scaling_policy.rs(per the relevant code snippets).orchestrator/k8s/services/autoscaler/crds.yaml (3)
45-49: Excellent use of CEL validation rules for mode-based requirements.The
x-kubernetes-validationsrules ensuresshis required for Manual mode andsecureCloudfor Dynamic mode. This provides server-side validation before the controller sees the resource.
405-437:nodeTemplate.secureCloudrequiressshKeyIdandsshKeySecretRefbut noofferingId.Unlike the NodePool CRD's
secureCloudwhich requiresofferingId, the ScalingPolicy'snodeTemplate.secureCloudomits it. Verify this is intentional—perhaps the autoscaler dynamically selects offerings based ongpuTypeand other criteria.
1-553: CRD manifests are comprehensive and well-structured.Both CRDs follow Kubernetes conventions with proper versioning, status subresources, printer columns, and extensive schema validation. The structure aligns well with the Rust type definitions.
crates/basilica-autoscaler/src/controllers/mod.rs (1)
1-7: Clean module organization with appropriate public exports.The module correctly declares submodules and re-exports the key types needed by consumers of the controllers module.
orchestrator/k8s/services/autoscaler/deployment.yaml (5)
1-16: Good namespace separation for security-sensitive resources.Creating a dedicated
basilica-ssh-keysnamespace withrestricted: "true"label enables fine-grained RBAC and PSP/PSA policies for SSH key secrets.
38-51: Strong security context with control-plane tolerations.The pod security settings (non-root user, seccompProfile, runAsNonRoot) are well configured. Tolerations for control-plane nodes allow scheduling on master nodes if needed.
111-120: Container security context is comprehensive and follows best practices.Dropping all capabilities, read-only root filesystem, and RuntimeDefault seccomp profile provide defense in depth.
121-124: PreStop hook withsleep 15aids graceful shutdown.This allows in-flight requests to complete and load balancers to drain connections before the container terminates.
207-221: ServiceMonitor correctly configured for Prometheus scraping.The 30-second scrape interval matches the metrics collection interval in the ConfigMap.
crates/basilica-autoscaler/src/health.rs (3)
1-11: LGTM! Clean health state implementation.The use of
Arc<RwLock<HealthStatus>>for thread-safe async state management is appropriate for this use case.
61-78: LGTM! Proper liveness and readiness semantics.Liveness always returns 200 (correct for K8s restart behavior), while readiness returns 503 when not ready, which will remove the pod from service endpoints appropriately.
104-165: LGTM! Good test coverage for health endpoints.Tests verify the key behaviors: liveness always returns OK, readiness returns 503 when not ready and 200 when ready.
crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
56-76: LGTM! Uninstall script handles edge cases well.The script correctly checks for both agent and server uninstall scripts, uses
2>/dev/null || trueto suppress errors for missing services, and cleans up residual directories.
78-103: LGTM! Label collection is straightforward.The function correctly builds the expected labels and handles optional GPU fields appropriately.
orchestrator/k8s/services/autoscaler/rbac.yaml (1)
1-54: LGTM! ClusterRole permissions are appropriately scoped.The permissions granted align with autoscaler functionality: CRD management, node operations, pod eviction, PDB reads, and event creation.
crates/basilica-autoscaler/src/metrics.rs (1)
375-397: LGTM! Test coverage validates metric creation and export format.Tests confirm metrics are properly registered and export valid Prometheus format.
crates/basilica-autoscaler/src/config.rs (2)
1-28: LGTM! Clean configuration structure with nested components.The hierarchical config design with
from_env()entry point is clean and maintainable.
213-231: LGTM! Tests validate sensible defaults.The tests confirm that default configuration values are reasonable and phase timeouts are set correctly.
crates/basilica-autoscaler/src/provisioner/wireguard.rs (1)
165-213: Good test coverage for utility functions.The tests adequately cover config generation and parsing for both empty and populated peer scenarios.
crates/basilica-autoscaler/src/error.rs (1)
1-130: Well-structured error hierarchy with clear categorization.The error enum is comprehensive and follows Rust best practices with
thiserror. TheFromimplementations enable ergonomic error propagation.crates/basilica-autoscaler/src/crd/scaling_policy.rs (2)
1-51: Well-designed CRD with sensible defaults and print columns.The ScalingPolicy CRD follows Kubernetes conventions with proper versioning, status subresource, and useful print columns.
319-365: Good test coverage for CRD metadata and defaults.Tests validate CRD naming, required fields, and sensible default values.
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (3)
50-70: Clean controller design with proper dependency injection.The generic controller with Arc-wrapped dependencies enables testability and follows the controller-runtime pattern.
93-134: Phase timeout handling with cleanup is well-structured.The timeout detection and phase-specific cleanup logic properly handles the various states and prevents re-entering cleanup.
1066-1098: Unit tests cover core helper functions.Tests verify phase timeout definitions and deterministic node ID generation properties.
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (4)
1-28: LGTM!Clean struct definition with appropriate trait bounds. The use of
Arcfor shared ownership and#[allow(dead_code)]annotation for the unusedapifield is noted - this suggests future integration work.
30-51: LGTM!Manual
Cloneimplementation correctly clonesArcreferences, and the constructor is straightforward.
355-398: LGTM!The template-to-NodePool conversion is well-structured with appropriate label management and configuration mapping.
408-439: LGTM!Thorough GPU detection covering both regular and init containers. The logic correctly checks for
nvidia.com/gpuin resource requests.crates/basilica-autoscaler/src/api/secure_cloud.rs (3)
12-51: LGTM on circuit breaker state design.The circuit breaker state machine and configuration structure are well-designed with sensible defaults.
379-414: LGTM on API method structure.The pattern of checking circuit, making requests, and recording outcomes is consistently applied. Good error handling with status code and body in error messages.
668-712: LGTM on unit tests.Good coverage for serialization/deserialization of API types.
* Rename autoscaler-crdgen binary to crdgen for consistency
* Regenerate CRDs from Rust derive macros with kube-rs
* Add schema descriptions for all spec and status fields
* Include proper nullable annotations for optional fields
* Add default values for configuration structs

* Add push.sh for tagging and pushing images to GHCR
* Add rollout.sh for restarting deployment in K3s cluster
* Support configurable source/target images and tags
* Add kubeconfig path and namespace options for rollout
* Include optional log following after rollout completes
Actionable comments posted: 0
🧹 Nitpick comments (2)
scripts/autoscaler/push.sh (1)
43-57: Image existence check works; consider `docker inspect` for robustness. The current approach using `docker images` with grep and regex anchors should work correctly. However, `docker inspect` would be semantically clearer and more direct for validating image existence, avoiding reliance on exact output formatting.

Optional refactor: use `docker inspect` instead of `docker images` and `grep`.

```diff
 # Check if source image exists
-if ! docker images --format "table {{.Repository}}:{{.Tag}}" | grep -q "^${SOURCE_IMAGE_FULL}$"; then
+if ! docker inspect "$SOURCE_IMAGE_FULL" &>/dev/null; then
     echo "Error: Source image $SOURCE_IMAGE_FULL not found" >&2
     echo "Please build the image first using: ./scripts/autoscaler/build.sh" >&2
     exit 1
 fi
```

This is cleaner and doesn't depend on output format assumptions.
scripts/autoscaler/rollout.sh (1)
5-5: Remove unused variable or clarify its purpose. The `PROJECT_ROOT` variable is defined but never used within this script. If it's intended for external consumption (e.g., when sourced by another script), please add a comment to clarify. Otherwise, remove the line.

If the variable is truly unused and not exported, apply this diff:

```diff
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
```

Alternatively, if it's intentionally exported for external use, add a comment above its definition:

```diff
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# Exported for use by external scripts
 PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- crates/basilica-autoscaler/Cargo.toml (1 hunks)
- orchestrator/k8s/services/autoscaler/crds.yaml (1 hunks)
- scripts/autoscaler/push.sh (1 hunks)
- scripts/autoscaler/rollout.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- crates/basilica-autoscaler/Cargo.toml
- orchestrator/k8s/services/autoscaler/crds.yaml
🧰 Additional context used
🪛 Shellcheck (0.11.0)
scripts/autoscaler/rollout.sh
[warning] 5-5: PROJECT_ROOT appears unused. Verify use (or export if used externally).
(SC2034)
🔇 Additional comments (4)
scripts/autoscaler/push.sh (2)
1-41: Well-structured argument parsing and setup. The option handling is clean, with proper shift semantics for flag-value pairs and helpful documentation via `--help`. Path resolution is correct and properly quoted.

59-65: Tag and push operations are solid. The commands are properly quoted and will correctly tag the local image and push to the target registry. Note that `docker push` will fail at runtime if the user is not authenticated to the target registry; consider documenting this prerequisite if needed.

scripts/autoscaler/rollout.sh (2)

1-93: Excellent script structure and error handling. The script provides clear argument parsing, proper validation of prerequisites (kubeconfig existence), informative output, and sound error handling via `set -e`. Variable quoting is consistent, and error messages are appropriately routed to stderr. The rollout logic is correct: restart → wait for completion → verify pod status → optionally tail logs.

Minor suggestion (optional): consider adding an early check for `kubectl` availability before attempting rollout to provide a clearer error message if the tool is missing:

```diff
+command -v kubectl &> /dev/null || { echo "Error: kubectl not found in PATH" >&2; exit 1; }
+
 export KUBECONFIG="$KUBECONFIG_PATH"
```

85-85: Label selector is correctly configured. Line 85's `-l app="$DEPLOYMENT"` correctly matches the pod labels in the deployment manifest. All deployments consistently use `app: <deployment-name>` as the pod label, so the selector will properly identify pods for the given deployment name.
* Add #[allow(unused_assignments)] to Ed25519PrivateKey struct
* Add #[allow(unused_assignments)] to Ed25519KeyPair struct
* Add #[allow(unused_assignments)] to P256PrivateKey struct
* Add #[allow(unused_assignments)] to P256KeyPair struct
* Fix false positives caused by ZeroizeOnDrop derive with #[zeroize(skip)] fields

…ilds
* Add sed exclusion for basilica-autoscaler in billing Dockerfile
* Add sed exclusion for basilica-autoscaler in payments Dockerfile
* Prevent cargo build failure when autoscaler crate is not in build context
Actionable comments posted: 0
🧹 Nitpick comments (4)
crates/basilica-common/src/crypto/p256.rs (2)
24-29: Document why `#[allow(unused_assignments)]` is needed (and ensure it's not masking real issues). Please add a brief comment explaining the exact warning this suppresses (e.g., a false positive from `#[derive(ZeroizeOnDrop)]` expansion) and keep it only if it's still reproducible.

38-44: Same concern for `P256KeyPair`: prefer justified/traceable lint suppression. If this is to silence macro-generated code warnings, a short note helps prevent future accidental masking of genuine unused-assignment bugs.

crates/basilica-common/src/crypto/ed25519.rs (2)

16-21: Add a short rationale for `#[allow(unused_assignments)]` on `Ed25519PrivateKey`. Helps ensure this is only suppressing a known macro/derive false-positive and not hiding a legitimate issue.

30-36: Add rationale / verify necessity for the `Ed25519KeyPair` lint suppression. If the warning is from `ZeroizeOnDrop` expansion, consider noting that explicitly to avoid future confusion.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- crates/basilica-common/src/crypto/ed25519.rs (2 hunks)
- crates/basilica-common/src/crypto/p256.rs (2 hunks)
- scripts/billing/Dockerfile (1 hunks)
- scripts/payments/Dockerfile (1 hunks)
🔇 Additional comments (2)
scripts/billing/Dockerfile (1)
35-35: LGTM! The addition correctly excludes the new `basilica-autoscaler` crate from the billing service Docker build, following the established pattern of workspace member removals (lines 25–36). The sed command syntax is correct and necessary since the billing service doesn't depend on the autoscaler.

scripts/payments/Dockerfile (1)

23-36: Good: keep Docker "pruned workspace" in sync with new workspace members (autoscaler).
Line 35 prevents the payments image build from breaking when `crates/basilica-autoscaler` exists in the workspace but isn't copied into the build context. The root `Cargo.toml` uses explicit member declarations (not glob patterns), so the `sed` approach will work correctly. `basilica-payments` has no dependency on `basilica-autoscaler`.
* Add sha2 workspace dependency for stable hashing across Rust versions
* Add base64 workspace dependency for safe file uploads via SSH
* These dependencies are required for security and stability fixes

* Add Runtime error variant to AutoscalerError enum
* Track fatal errors in tokio::select! branches for unexpected exits
* Return Err(AutoscalerError::Runtime) when controllers exit unexpectedly
* Return Err when leader election fails fatally
* Return Err when health or metrics servers fail or panic
* Enables Kubernetes restart policy to properly handle failures
* Replace heredoc file uploads with base64 encoding to prevent command injection
* Add sanitize_shell_arg() to filter dangerous characters from shell inputs
* Add shell_escape() for safe single-quote escaping of URLs and tokens
* Sanitize node_name, node_id, flannel_interface, and labels in K3s scripts
* Use safe slice for node_id to prevent panic on short strings
* Update tests to verify sanitization behavior

Security fixes for:
- Heredoc delimiter collision in SSH file uploads
- Shell injection in K3s installation script generation
- Potential panic on short node_id values
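The two escaping techniques named in this commit can be sketched briefly. The snippet below is illustrative only: `shell_escape` mirrors the single-quote escaping described above, and `base64_upload_command` shows the general shape of a base64-encoded file upload; the exact signatures and remote command used by the crate are assumptions.

```rust
use base64::Engine;

/// Wrap a value in single quotes; embedded single quotes become '\'' so the
/// shell closes the string, emits a literal quote, and reopens it.
fn shell_escape(value: &str) -> String {
    format!("'{}'", value.replace('\'', r"'\''"))
}

/// Upload file contents without a heredoc: encode locally, decode remotely.
/// The remote command shape is an assumption for illustration.
fn base64_upload_command(contents: &str, remote_path: &str) -> String {
    let encoded = base64::engine::general_purpose::STANDARD.encode(contents);
    format!("echo {} | base64 -d > {}", encoded, shell_escape(remote_path))
}

fn main() {
    let cmd = base64_upload_command("[Interface]\nPrivateKey = x\n", "/etc/wireguard/wg0.conf");
    println!("{cmd}");
    assert!(shell_escape("it's").contains(r"'\''"));
}
```

Because the payload travels as base64, there is no delimiter for an attacker-controlled file body to collide with.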
* Use replace() with resourceVersion for optimistic concurrency in acquire_lease
* Increment leaseTransitions counter properly during lease acquisition
* Fix try_increment_pending_scale_up to use replace_status instead of patch
* Verify resourceVersion before attempting status update
* Replace DefaultHasher with SHA-256 for stable node ID generation
* SHA-256 provides consistent output across Rust toolchain versions

Fixes:
- Race condition when multiple instances acquire expired lease
- resourceVersion in patch body not enforcing optimistic locking
- Potential node ID changes on Rust version upgrades
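A minimal sketch of the SHA-256-based ID derivation described above; the exact inputs (pool name plus host) and the 16-hex-character truncation are assumptions, not necessarily what the crate hashes.

```rust
use sha2::{Digest, Sha256};

/// Deterministic node-ID derivation: unlike std's DefaultHasher, SHA-256 is
/// specified, so the same inputs yield the same ID on every Rust toolchain.
fn stable_node_id(pool_name: &str, host: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(pool_name.as_bytes());
    hasher.update(b"/");
    hasher.update(host.as_bytes());
    let digest = hasher.finalize();
    // Hex-encode the first 8 bytes (16 hex chars) as the ID.
    digest.iter().take(8).map(|b| format!("{b:02x}")).collect()
}

fn main() {
    let a = stable_node_id("gpu-pool", "203.0.113.10");
    let b = stable_node_id("gpu-pool", "203.0.113.10");
    assert_eq!(a, b);
    assert_eq!(a.len(), 16);
}
```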
* Remove pending counter reset on NoAction to preserve in-flight operation tracking
* Counters should only decrement when NodePools reach Ready/Failed state
* Fix deregister_node to return error on API failure instead of Ok(())
* Circuit breaker now correctly records failures for non-success HTTP responses
* Add success log message for successful node deregistration

* Add AUTOSCALER_NAMESPACE env var using fieldRef to metadata.namespace
* Runtime now watches correct namespace instead of defaulting to basilica-system
* Dynamically derives namespace from pod's actual deployment namespace
* Ensures controllers watch resources in the same namespace as the autoscaler
Actionable comments posted: 9
♻️ Duplicate comments (10)
crates/basilica-autoscaler/src/provisioner/ssh.rs (2)
140-147: SSH host key verification remains disabled. This was flagged in a previous review. `ServerCheckMethod::NoCheck` creates a MITM vulnerability for production infrastructure provisioning.

463-468: Same public key slicing issue for peer keys. Apply the same safe slice pattern here:

```diff
 debug!(
     host = %host,
-    peer_key = %&peer.public_key[..8],
+    peer_key = %&peer.public_key[..peer.public_key.len().min(8)],
     endpoint = ?peer.endpoint,
     "WireGuard peer status"
 );
```

crates/basilica-autoscaler/src/api/secure_cloud.rs (1)
91-105: `success_threshold` config field is still unused. This was flagged in a previous review. The `CircuitBreakerConfig::success_threshold` field (line 40) is defined but `record_success` immediately transitions to `Closed` on the first success in `HalfOpen` state.

crates/basilica-autoscaler/src/runtime.rs (1)

61-62: AUTOSCALER_NAMESPACE default may not match deployment namespace. This was flagged in a previous review. The deployment manifest needs to set `AUTOSCALER_NAMESPACE` to match the actual deployment namespace.

crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1)
733-742: `handle_terminating` uses `spec.node_id` but should use `status.node_id` for dynamic mode. Similar to the WireGuard handler, line 734 reads from `pool.spec.node_id`, which may not be set in dynamic provisioning mode where the ID is generated.

```diff
-        if let Some(node_id) = &pool.spec.node_id {
+        let node_id = status.node_id.as_ref().or(pool.spec.node_id.as_ref());
+        if let Some(node_id) = node_id {
             if let Err(e) = self.api.deregister_node(node_id).await {
```

crates/basilica-autoscaler/src/leader_election.rs (1)
249-293: Good: `acquire_lease` now uses `resourceVersion` + `replace` (fixes prior lease acquisition race).
Only tweak I'd suggest: don't `unwrap_or_default()` the `resourceVersion`; treat missing as an error, and ensure `leaseDurationSeconds` fits the expected int range.

```diff
-        let resource_version = current
-            .metadata
-            .resource_version
-            .clone()
-            .unwrap_or_default();
+        let resource_version = current
+            .metadata
+            .resource_version
+            .clone()
+            .ok_or_else(|| crate::error::AutoscalerError::Unexpected("Lease missing resourceVersion".into()))?;
@@
-            "leaseDurationSeconds": config.lease_duration.as_secs(),
+            "leaseDurationSeconds": i64::try_from(config.lease_duration.as_secs())
+                .map_err(|_| crate::error::AutoscalerError::InvalidConfiguration("lease_duration too large".into()))?,
```
146-159: Good: NoAction no longer resets pending counters.
This matches the intended "decrement on NodePool Ready/Failed" behavior.
121-133: Finalizer add overwrites existing finalizers.
Patch::Mergewith"finalizers": [FINALIZER]replaces the array; you’ll drop other controllers’ finalizers. (Same concern as prior review.)async fn add_node_pool_finalizer(&self, ns: &str, name: &str) -> Result<()> { use crate::crd::FINALIZER; - use kube::api::{Api, Patch, PatchParams}; + use kube::api::{Api, Patch, PatchParams}; let api: Api<NodePool> = Api::namespaced(self.client.clone(), ns); - let patch = serde_json::json!({ - "metadata": { - "finalizers": [FINALIZER] - } - }); - api.patch(name, &PatchParams::default(), &Patch::Merge(&patch)) - .await?; + let node_pool = api.get(name).await?; + let mut finalizers = node_pool.metadata.finalizers.unwrap_or_default(); + if !finalizers.iter().any(|f| f == FINALIZER) { + finalizers.push(FINALIZER.to_string()); + let patch = serde_json::json!({ "metadata": { "finalizers": finalizers } }); + api.patch(name, &PatchParams::default(), &Patch::Merge(&patch)).await?; + } Ok(()) }
135-146: Finalizer removal clears all finalizers.
Setting"finalizers": nullwill remove other controllers’ finalizers too. (Same concern as prior review.)async fn remove_node_pool_finalizer(&self, ns: &str, name: &str) -> Result<()> { + use crate::crd::FINALIZER; use kube::api::{Api, Patch, PatchParams}; let api: Api<NodePool> = Api::namespaced(self.client.clone(), ns); - let patch = serde_json::json!({ - "metadata": { - "finalizers": null - } - }); + let node_pool = api.get(name).await?; + let finalizers: Vec<String> = node_pool + .metadata + .finalizers + .unwrap_or_default() + .into_iter() + .filter(|f| f != FINALIZER) + .collect(); + let patch = serde_json::json!({ + "metadata": { + "finalizers": if finalizers.is_empty() { serde_json::Value::Null } else { serde_json::json!(finalizers) } + } + }); api.patch(name, &PatchParams::default(), &Patch::Merge(&patch)) .await?; Ok(()) }
176-215:try_increment_pending_scale_upcan “succeed” without updating status whenpolicy.statusisNone.
You currently no-op if status is absent, then stillreplace_status(...)and returnOk(true).- // Update the status with incremented pending_scale_up - if let Some(ref mut status) = policy.status { - status.pending_scale_up = current_value + increment; - } + // Ensure status exists, then update atomically + let status = policy.status.get_or_insert_with(Default::default); + status.pending_scale_up = current_value.saturating_add(increment); @@ - // Serialize policy to bytes for replace_status - let data = serde_json::to_vec(&policy)?; - - // Use replace_status with resourceVersion in metadata for optimistic locking - match api.replace_status(name, &PostParams::default(), data).await { + // Use replace_status (RV in metadata) for optimistic locking + match api.replace_status(name, &PostParams::default(), serde_json::to_vec(&policy)?).await { Ok(_) => Ok(true),
🧹 Nitpick comments (10)
crates/basilica-autoscaler/Cargo.toml (1)
17-62: Some of these dependencies could be centralized, but others are appropriately crate-specific. The workspace has an established `[workspace.dependencies]` section that covers common dependencies like `tokio`, `serde`, `metrics`, etc. However, the Kubernetes-specific packages (`kube`, `kube-runtime`, `k8s-openapi`, `schemars`, `prometheus`) are intentionally pinned at the crate level because:
- They're required only by this crate.
- The versions may be tied to Kubernetes API compatibility (v1_29) rather than workspace standards.

That said, consider consolidating the generic, non-Kubernetes dependencies into `[workspace.dependencies]`:
- `serde_yaml` (0.9)
- `mockall` (0.12) and `tokio-test` (0.4) could move to a `[workspace.lints]` or shared `[dev-dependencies]` if the workspace supports it.

Note: `tokio-util` is already in the workspace at version 0.7.

crates/basilica-autoscaler/src/provisioner/k3s.rs (4)
18-80: Install script generation is much safer now; consider guarding against empty sanitized values + reducecurl | shrisk.
- If
safe_node_name/safe_flannel/safe_node_idsanitize down to empty, the script will fail in a less actionable way; better to fail early with a clear message.curl -sfL https://get.k3s.io | ... sh -s -is convenient but is still supply-chain sensitive; consider pinningINSTALL_K3S_VERSION(or similar) and/or verifying what you can.Example early-guard inside the script:
format!( r#"#!/bin/bash set -euo pipefail + +if [ -z "{safe_node_name}" ] || [ -z "{safe_flannel}" ]; then + echo "Invalid sanitized inputs (empty after sanitization)" >&2 + exit 2 +fi ... "#,
104-130: `collect_node_labels`: consider validating label key/value format/length (esp. GPU fields).
These labels are likely fed to Kubernetes APIs later; a small validator (or centralized label builder) would prevent propagating invalid labels and failing later in harder-to-debug places.
132-145: Status/version helpers are OK; version parsing is a bit positional.
`parse_k3s_version` assumes the version token is the 3rd whitespace token; consider scanning for the first token that looks like `v<digits>.` to be more resilient to banners/warnings.
147-221: Tests are a solid baseline; add one for quoted tokens + “invalid label char” behavior once sanitizer is tightened.
A test case with a token containing a single quote (or newline) would lock in the `shell_escape` behavior in the install script output, and a test for rejecting/normalizing invalid label characters will prevent regressions.

crates/basilica-autoscaler/src/error.rs (1)
137-160: Consider expanding the `is_retryable` classification. Some timeout-related errors like `FlannelTimeout`, `DrainTimeout`, and `PhaseTimeout` could potentially be retryable after intervention or longer waits. Additionally, `Io` errors are often transient (network filesystems, temp file issues). The current classification might cause permanent failures for recoverable situations. Consider whether these should be retryable:

```diff
     pub fn is_retryable(&self) -> bool {
         matches!(
             self,
             Self::SshConnection { .. }
                 | Self::SshExecution { .. }
                 | Self::NetworkValidation(_)
                 | Self::KubeApi(_)
                 | Self::Http(_)
                 | Self::SecureCloudApi(_)
                 | Self::LeaderElection(_)
+                | Self::Io(_)
+                | Self::FlannelTimeout { .. }
+                | Self::CircuitBreakerOpen(_)
         )
     }
```
296-310: `expect` on HTTP client creation could panic in edge cases. While unlikely, `reqwest::Client::builder().build()` can fail (e.g., TLS backend issues). Consider returning a `Result` instead of panicking.

```diff
     pub fn new(base_url: String, api_key: String, timeout: Duration) -> Self {
-        let client = reqwest::Client::builder()
+        Self::try_new(base_url, api_key, timeout).expect("Failed to create HTTP client")
+    }
+
+    pub fn try_new(base_url: String, api_key: String, timeout: Duration) -> Result<Self> {
+        let client = reqwest::Client::builder()
             .timeout(timeout)
             .build()
-            .expect("Failed to create HTTP client");
+            .map_err(|e| AutoscalerError::Internal(format!("Failed to create HTTP client: {}", e)))?;
```

crates/basilica-autoscaler/src/runtime.rs (1)
119-142: Consider extracting context structs to module level. Defining `ControllerContext` and `ScalingContext` structs inside the `run` method limits testability and reusability. Consider moving these to the module level.

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (2)

167-213: Avoid `unwrap_or_default()` on K8s list errors in `collect_metrics` (silent failure).
This can hide RBAC/network issues and report misleading metrics (e.g., 0 GPU nodes). Prefer logging + propagating (or at least emitting a degraded condition).
292-319: Scale-up NodePool naming can collide (second-level timestamp + `i`).
Two scale-ups within the same second (or retries) can generate identical names and fail with AlreadyExists. Consider adding a random suffix (UUID) or a monotonic counter.

```diff
-        let pool_name = format!(
-            "{}-{}-{}",
-            policy_name,
-            Utc::now().format("%Y%m%d%H%M%S"),
-            i
-        );
+        let pool_name = format!(
+            "{}-{}-{}",
+            policy_name,
+            Utc::now().format("%Y%m%d%H%M%S"),
+            uuid::Uuid::new_v4().simple()
+        );
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (11)
- crates/basilica-autoscaler/Cargo.toml (1 hunks)
- crates/basilica-autoscaler/src/api/secure_cloud.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/k8s_client.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/error.rs (1 hunks)
- crates/basilica-autoscaler/src/leader_election.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/k3s.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/ssh.rs (1 hunks)
- crates/basilica-autoscaler/src/runtime.rs (1 hunks)
- orchestrator/k8s/services/autoscaler/deployment.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- orchestrator/k8s/services/autoscaler/deployment.yaml
🧰 Additional context used
🧬 Code graph analysis (3)
crates/basilica-autoscaler/src/runtime.rs (7)
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (2)
new(56-70)clone(39-47)crates/basilica-autoscaler/src/leader_election.rs (2)
new(24-40)is_leader(43-45)crates/basilica-autoscaler/src/health.rs (3)
new(27-31)start_health_server(89-102)default(21-23)crates/basilica-autoscaler/src/metrics.rs (3)
new(46-246)start_metrics_server(360-373)default(340-342)crates/basilica-autoscaler/src/config.rs (8)
from_env(17-27)from_env(39-45)from_env(56-62)from_env(77-87)from_env(98-104)from_env(115-121)from_env(135-147)from_env(158-167)crates/basilica-autoscaler/src/crd/node_pool.rs (3)
fmt(307-323)default(189-196)default(242-251)crates/basilica-autoscaler/src/crd/scaling_policy.rs (3)
default(79-85)default(124-132)default(210-215)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (2)
crates/basilica-autoscaler/src/crd/scaling_policy.rs (3)
default(79-85)default(124-132)default(210-215)crates/basilica-autoscaler/src/crd/node_pool.rs (2)
default(189-196)default(242-251)
crates/basilica-autoscaler/src/provisioner/ssh.rs (2)
crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
parse_k3s_version(138-145)verify_k3s_status(133-135)crates/basilica-autoscaler/src/provisioner/wireguard.rs (3)
generate_wg_config(85-117)parse_wg_show_output(120-156)new(27-33)
🔇 Additional comments (16)
crates/basilica-autoscaler/Cargo.toml (1)
9-16: Binaries look correctly declared; ensure the paths exist and are intended entrypoints. Please double-check that `src/main.rs` and `src/bin/crdgen.rs` exist in this PR and that `crdgen` is meant to be shipped as a separate binary (vs. a feature-gated dev tool).

crates/basilica-autoscaler/src/provisioner/k3s.rs (2)

12-16: `shell_escape` implementation looks correct for single-quoted contexts.
The `'"'"'` pattern is the right approach here.

82-103: Uninstall script is pragmatic and defensive (service stops tolerate missing units).
Nothing concerning here given the stated "agent" scope.

crates/basilica-autoscaler/src/error.rs (1)

1-133: LGTM! Well-structured error handling. The error enum is comprehensive with descriptive messages and appropriate structured data. The `From` trait implementations enable clean `?` propagation.

crates/basilica-autoscaler/src/provisioner/ssh.rs (3)
481-492: Handshake timestamp comparison assumes synchronized clocks. The handshake validation compares the remote node's timestamp against the local system time. Clock skew between the provisioning host and the remote node could cause false negatives (valid handshakes appearing stale) or false positives.
Consider adding a tolerance margin or documenting the clock sync requirement.
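A hedged sketch of one way to add such a tolerance, assuming the handshake is reported as a UNIX timestamp in seconds (as `wg show ... latest-handshakes` prints); the constants and function name are illustrative.

```rust
// Illustrative freshness check; the thresholds and the assumption that the
// handshake arrives as a UNIX timestamp in seconds are not from the crate.
use std::time::{Duration, SystemTime, UNIX_EPOCH};

const MAX_HANDSHAKE_AGE: Duration = Duration::from_secs(180);
const CLOCK_SKEW_TOLERANCE: Duration = Duration::from_secs(30);

fn handshake_is_fresh(handshake_unix_secs: u64) -> bool {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .as_secs();
    // A handshake slightly "in the future" is tolerated rather than treated
    // as stale, so modest clock skew does not fail otherwise healthy nodes.
    let age = now.saturating_sub(handshake_unix_secs);
    age <= (MAX_HANDSHAKE_AGE + CLOCK_SKEW_TOLERANCE).as_secs()
}
```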
203-222: LGTM! Base64 encoding fixes the heredoc injection vulnerability. Good fix using base64 encoding to avoid delimiter collision and command injection.

554-638: LGTM! Safe node_id slicing implemented. The fix at line 566 using `&node_id[..node_id.len().min(8)]` correctly prevents panics on short IDs.

crates/basilica-autoscaler/src/api/secure_cloud.rs (2)

590-623: LGTM! `deregister_node` now correctly returns error on API failure. The fix properly returns an `Err` on non-success HTTP status, ensuring the circuit breaker correctly records failures.

379-414: LGTM! Consistent error handling pattern across API methods. The pattern of checking circuit breaker, making request, handling errors, and updating circuit breaker state is well-implemented and consistent.
crates/basilica-autoscaler/src/runtime.rs (3)
284-338: LGTM! Fatal errors now properly propagate to process exit. The implementation correctly tracks `fatal_error` and returns `Err(AutoscalerError::Runtime(...))` on fatal conditions, allowing the orchestrator to detect and restart the process.

268-280: Leader watch task logs error but doesn't set `fatal_error` on channel close. When the leader election channel closes (line 279), it logs an error but the task completes normally. This triggers the `leader_watch_task` branch in `select!`, which does set `fatal_error`. The flow is correct, but the error message could be clearer that this is the termination path.

354-359: Proposed fix is invalid: `AutoscalerConfig` does not implement the `Default` trait. The suggestion to use `AutoscalerConfig::default()` will not compile because `AutoscalerConfig` only derives `Clone` and `Debug`. More importantly, the concern about environment variables is overstated: `from_env()` provides sensible fallbacks for all configuration values (e.g., "https://api.basilica.ai" for `BASILICA_API_URL`, port 8080 for health checks), so the test will not fail in CI environments without specific env vars set. The test as written is safe to use.
If test isolation from production defaults is desired, consider adding a test-specific constructor rather than `Default`.
Likely an incorrect or invalid review comment.
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (4)
1030-1047: LGTM! SHA-256 provides stable deterministic node IDs. The fix correctly uses `sha2::Sha256` instead of `DefaultHasher`, ensuring stable node IDs across Rust versions. The format remains consistent with the previous implementation.

158-189: Deletion handler continues cleanup despite individual failures. The approach of logging warnings and continuing on cleanup failures (lines 166-182) is appropriate for deletion handlers: best-effort cleanup avoids leaving the system in a stuck state.

92-134: Phase timeout logic correctly prevents concurrent cleanup. The `cleanup_in_progress` flag prevents race conditions when multiple reconciliation cycles detect the same timeout. Good defensive programming.

1071-1103: LGTM! Comprehensive tests for deterministic node ID generation. Tests properly verify stability, uniqueness by IP, and uniqueness by pool name.
- CircuitBreaker now tracks consecutive successful calls in HalfOpen state and only transitions to Closed after success_threshold calls - add_node_pool_finalizer uses fetch-modify-update pattern to append the finalizer without overwriting existing ones - remove_node_pool_finalizer filters out only the autoscaler's finalizer, preserving other controllers' finalizers
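A minimal sketch of the HalfOpen accounting described in the first bullet; the type and field names mirror the review discussion but are illustrative rather than the crate's exact definitions.

```rust
// Sketch of the HalfOpen accounting; names are illustrative.
enum CircuitState {
    Closed,
    Open,
    HalfOpen,
}

struct CircuitBreaker {
    state: CircuitState,
    consecutive_successes: u32,
    success_threshold: u32,
}

impl CircuitBreaker {
    fn record_success(&mut self) {
        match self.state {
            CircuitState::HalfOpen => {
                self.consecutive_successes += 1;
                // Close only after enough consecutive probes succeed instead
                // of on the first success.
                if self.consecutive_successes >= self.success_threshold {
                    self.state = CircuitState::Closed;
                    self.consecutive_successes = 0;
                }
            }
            _ => self.consecutive_successes = 0,
        }
    }
}
```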
…tency - Safe slice WireGuard public key to prevent panic on malformed data - Split sanitize_shell_arg into context-aware functions for K8s labels: sanitize_shell_word (hostnames), sanitize_label_key (with /), sanitize_label_value (max 63 chars, no : = /) - Handle shutdown_rx.changed() error when sender dropped unexpectedly - Add separate leader_tx for proper leadership signaling when disabled - Use saturating_add for pending_scale_down to accumulate in-flight ops - Resolve node_id from status before spec in deletion, terminating, timeout cleanup, and wireguard handlers for dynamic provisioning - Add v1_29 feature to basilica-api k8s-openapi for workspace consistency
Actionable comments posted: 11
♻️ Duplicate comments (3)
crates/basilica-autoscaler/src/provisioner/k3s.rs (1)
4-30: Silent character removal can cause identity collisions and mask configuration errors.All three sanitization functions silently filter or truncate invalid characters without returning errors or warnings. This creates several risks:
- Identity collisions:
node-123!andnode-123?both becomenode-123, potentially causing node identity conflicts.- Silent configuration errors: Invalid inputs are "fixed" rather than rejected, masking misconfigurations.
- Missing validations:
sanitize_shell_word: No length limits (hostnames limited to 253 chars)sanitize_label_key: No length limits (K8s limit: 253 chars for prefix, 63 for name)- No check for empty results after sanitization
- No validation that labels start/end with alphanumeric (K8s requirement)
Consider returning
Result<String, Error>from these functions to reject invalid inputs explicitly, or at minimum, validate and error on empty results after sanitization.-fn sanitize_shell_word(s: &str) -> String { +fn sanitize_shell_word(s: &str) -> Result<String, String> { + if s.is_empty() { + return Err("Input cannot be empty".to_string()); + } + if s.len() > 253 { + return Err("Input exceeds hostname length limit of 253 chars".to_string()); + } + let sanitized = s.chars() .filter(|c| c.is_alphanumeric() || matches!(c, '-' | '_' | '.')) - .collect() + .collect::<String>(); + + if sanitized.is_empty() { + return Err(format!("Input '{}' contains no valid characters", s)); + } + + Ok(sanitized) }Apply similar patterns to
sanitize_label_keyandsanitize_label_value, and propagate errors in calling code.crates/basilica-autoscaler/src/provisioner/ssh.rs (1)
140-147: SSH host key verification is disabled, creating MITM vulnerability.
`ServerCheckMethod::NoCheck` disables SSH host key verification, making connections vulnerable to man-in-the-middle attacks. For production infrastructure provisioning, this is a security concern. Consider implementing host key verification, or at minimum, document the security tradeoff. Options include:
- Store and verify host keys after first connection (TOFU)
- Accept host keys from a trusted source (e.g., cloud provider API)
- Use `ServerCheckMethod::HashKnownHosts` with pre-populated keys

Based on learnings, this issue was previously flagged but remains unaddressed.

crates/basilica-autoscaler/src/leader_election.rs (1)
crates/basilica-autoscaler/src/leader_election.rs (1)
110-180: Good: clean shutdown semantics + leadership notification separation.
Handling the `shutdown_rx.changed()` error and using `leader_tx` (esp. in "disabled" mode) avoids the previous "never flips to false"/closed-channel pitfalls.
🧹 Nitpick comments (1)
crates/basilica-autoscaler/src/leader_election.rs (1)
182-234: Minor: avoid per-iteration allocation when comparing holder identity.
`holder == Some(&identity.to_string())` allocates; compare as `Option<&str>`.

```diff
-            if holder == Some(&identity.to_string()) {
+            if holder.map(String::as_str) == Some(identity) {
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- crates/basilica-api/Cargo.toml (1 hunks)
- crates/basilica-autoscaler/src/api/secure_cloud.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/k8s_client.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/leader_election.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/k3s.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/ssh.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (3)
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (2)
new(56-70)add_condition(1009-1031)crates/basilica-autoscaler/src/crd/scaling_policy.rs (3)
default(79-85)default(124-132)default(210-215)crates/basilica-autoscaler/src/crd/node_pool.rs (2)
default(189-196)default(242-251)
crates/basilica-autoscaler/src/provisioner/ssh.rs (2)
crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
parse_k3s_version(158-165)verify_k3s_status(153-155)crates/basilica-autoscaler/src/provisioner/wireguard.rs (3)
generate_wg_config(85-117)parse_wg_show_output(120-156)new(27-33)
🔇 Additional comments (3)
crates/basilica-api/Cargo.toml (1)
52-52: This k8s-openapi configuration change is safe and compatible. Verification confirms that `k8s-openapi 0.21` with `features = ["v1_29"]` is fully compatible with `kube 0.88` (kube 0.88.0 officially bumped k8s-openapi to 0.21 for Kubernetes v1.29 support). All workspace code uses only stable v1 APIs (`core::v1`, `batch::v1`, `networking::v1`, `policy::v1`, `apps::v1`, `rbac::v1`, etc.); no v1beta1 or v1beta2 dependencies exist. Disabling default-features is safe and reduces binary bloat by including only necessary Kubernetes API versions.

crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
153-165: LGTM - Verification functions are clear and correct. Both `verify_k3s_status` and `parse_k3s_version` implement straightforward parsing logic appropriate for their use cases. The trimming in `verify_k3s_status` handles potential trailing newlines, and `parse_k3s_version` safely extracts the version token.

167-270: Tests provide good coverage of happy paths. The test suite validates the main functionality of script generation, sanitization, and parsing. The tests for dangerous character removal (lines 192-199) and shell escaping (lines 231-234) are particularly valuable for security verification.
Consider adding tests for edge cases once the sanitization functions return `Result` types:
- Empty strings after sanitization
- Inputs exceeding length limits
- Empty keys/values in `extra_labels`

These can be addressed when implementing the suggested refactors.
Add module-level #![allow(unused_assignments)] to ed25519.rs and p256.rs. The ZeroizeOnDrop derive macro generates assignments in the Drop impl that trigger warnings. Field-level attributes don't suppress derive macro generated code, so module-level is required.
Actionable comments posted: 0
🧹 Nitpick comments (2)
crates/basilica-common/src/crypto/p256.rs (1)
6-6: Prefer narrowing the `#![allow(unused_assignments)]` scope (module-wide suppression can hide real bugs). If the warning is only emitted by the `ZeroizeOnDrop`-generated `Drop` impl, consider moving the allow to the smallest scope possible (e.g., item-level on the affected type, or `#[expect(unused_assignments)]` if your MSRV supports it) instead of the whole module. Also, please double-check that the lint is indeed coming from this derive and not from a genuine unused assignment elsewhere.

crates/basilica-common/src/crypto/ed25519.rs (1)
6-6: Prefer narrowing#![allow(unused_assignments)]to the specificZeroizeOnDropderives (avoid file-wide masking).-#![allow(unused_assignments)] // ZeroizeOnDrop derive generates assignments in Drop impl - use crate::error::CryptoError; use base64::{engine::general_purpose::STANDARD as BASE64, Engine}; use ed25519_dalek::{ Signature, Signer, SigningKey, Verifier, VerifyingKey, PUBLIC_KEY_LENGTH, SECRET_KEY_LENGTH, }; use rand::rngs::OsRng; use std::fmt; use zeroize::ZeroizeOnDrop; /// Ed25519 private key wrapper with secure memory handling +#[allow(unused_assignments)] // ZeroizeOnDrop derive generates assignments in Drop impl #[derive(ZeroizeOnDrop)] pub struct Ed25519PrivateKey { #[zeroize(skip)] signing_key: SigningKey, } @@ /// Ed25519 key pair containing both private and public keys +#[allow(unused_assignments)] // ZeroizeOnDrop derive generates assignments in Drop impl #[derive(ZeroizeOnDrop)] pub struct Ed25519KeyPair { private_key: Ed25519PrivateKey, #[zeroize(skip)] public_key: Ed25519PublicKey, }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- crates/basilica-common/src/crypto/ed25519.rs (1 hunks)
- crates/basilica-common/src/crypto/p256.rs (1 hunks)
…tion * Add idempotency guard to prevent multiple start() calls from spawning competing loops * Fix TOCTOU vulnerability in renew_lease() by using replace with resourceVersion * Fix TOCTOU vulnerability in release_lease() by using replace with resourceVersion * Replace unwrap_or_default() with explicit error for missing resourceVersion * Handle 409 conflict gracefully during lease release * Remove unused Patch and PatchParams imports
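For the replace-with-resourceVersion approach, a simplified sketch using kube's coordination Lease API; it assumes chrono for MicroTime, omits acquisition and state transitions, and is not the crate's actual implementation.

```rust
// Simplified sketch of optimistic lease renewal with kube's coordination API.
use k8s_openapi::api::coordination::v1::Lease;
use k8s_openapi::apimachinery::pkg::apis::meta::v1::MicroTime;
use kube::api::{Api, PostParams};

async fn renew_lease(api: &Api<Lease>, name: &str, identity: &str) -> Result<(), kube::Error> {
    // get() returns the object with metadata.resourceVersion populated;
    // replace() sends it back, so a concurrent writer causes a 409 Conflict
    // instead of a silent overwrite.
    let mut lease = api.get(name).await?;
    let spec = lease.spec.get_or_insert_with(Default::default);
    if spec.holder_identity.as_deref() != Some(identity) {
        return Ok(()); // lost the lease; do not stomp another holder
    }
    spec.renew_time = Some(MicroTime(chrono::Utc::now()));
    api.replace(name, &PostParams::default(), &lease).await?;
    Ok(())
}
```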
* Add validation to skip labels that become empty after sanitization * Apply sanitization to collect_node_labels with "unknown" fallback for empty values * Make /etc/hosts update idempotent using grep guard * Add structured logging for skipped invalid labels
* Use safe slice for peer.public_key to prevent panic on short/malformed keys * Make /etc/hosts update idempotent using grep guard
…king * Add required offering_id field to SecureCloudTemplate CRD * Add try_increment_pending_scale_down trait method to AutoscalerK8sClient * Implement optimistic locking for scale-down using resourceVersion * Handle 409 conflict gracefully to prevent race conditions
* Use resolved node_id from status for dynamic mode instead of spec.node_id * Reset failure_count to 0 on successful health check in Ready state * Apply optimistic locking for scale-down operations in ScalingPolicyController * Fix offering_id mapping to use correct field from SecureCloudTemplate
Actionable comments posted: 5
♻️ Duplicate comments (13)
crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
136-175: Avoid "unknown" label collisions for `node_id`/`pool_name` fallbacks.
Falling back to "unknown" is safe for API acceptance but can collapse distinct nodes/pools into the same label value, which is painful for scheduling/debugging and can break selectors/metrics assumptions. Consider a stable fallback like "unknown-<short-hash>" derived from the original string (or return an error upstream instead of silently normalizing).
This is in the same family as the earlier “lossy normalization can cause identity collisions” feedback.
13-30: Sanitization still doesn’t guarantee Kubernetes-valid label keys/values (can break installs/APIs)
`sanitize_label_key`/`sanitize_label_value` only filter characters (and truncate values), but Kubernetes label syntax also requires constraints like start/end alphanumeric, and label keys have prefix/name rules (single `/`, prefix is a DNS subdomain, name max 63). As-is, inputs like "-bad" or "...." or "a//b" can remain non-empty yet still be rejected by kubelet/K8s, causing K3s agent startup/registration failures.
This is essentially the earlier “tighten validation vs lossy filtering” concern, just in a different edge-case shape.Concrete direction: after sanitization, validate and skip (with warn) if invalid; ideally return
Resultto fail fast.+fn is_valid_label_value(s: &str) -> bool { + if s.is_empty() || s.len() > 63 { return false; } + let bytes = s.as_bytes(); + let is_alnum = |b: u8| b.is_ascii_alphanumeric(); + if !is_alnum(bytes[0]) || !is_alnum(bytes[bytes.len() - 1]) { return false; } + bytes.iter().all(|b| is_alnum(*b) || matches!(*b, b'-' | b'_' | b'.')) +} + +fn is_valid_label_key(s: &str) -> bool { + // Minimal safe check: 0 or 1 '/', name part like label-value rules but may include '/' + if s.is_empty() { return false; } + let mut parts = s.splitn(3, '/'); + let first = parts.next().unwrap(); + let second = parts.next(); + let third = parts.next(); + if third.is_some() { return false; } // more than one '/' + + let name = second.unwrap_or(first); + // Note: if there is a prefix, it must be DNS-subdomain; consider enforcing RFC1123 here. + is_valid_label_value(name) // (or a slightly different name validator allowing more chars) +} @@ for (key, value) in extra_labels { let safe_key = sanitize_label_key(key); let safe_value = sanitize_label_value(value); - // Skip labels that become empty after sanitization - if safe_key.is_empty() || safe_value.is_empty() { + if safe_key.is_empty() || safe_value.is_empty() || !is_valid_label_key(&safe_key) || !is_valid_label_value(&safe_value) { tracing::warn!( original_key = %key, original_value = %value, "Skipping invalid label: sanitized key or value is empty" ); continue; } labels.push(format!("--node-label={}={}", safe_key, safe_value)); }Also applies to: 59-72
crates/basilica-autoscaler/src/leader_election.rs (1)
54-102: Idempotency guard incomplete: fails to prevent duplicate `start()` calls when leader election is disabled. The guard at line 57 checks only `shutdown_tx.is_some()`, but the disabled path (lines 63-68) sets only `leader_tx` and leaves `shutdown_tx` as `None`. A second `start()` call when disabled would pass the guard, create a new `leader_tx`, and return a new receiver, while the first receiver remains active. This breaks the single-start invariant. Apply this diff to fix the guard:

```diff
-        if self.shutdown_tx.is_some() {
+        if self.shutdown_tx.is_some() || self.leader_tx.is_some() {
             return Err(crate::error::AutoscalerError::LeaderElection(
                 "LeaderElector already started".to_string(),
             ));
         }
```

crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1)
1037-1054: Addressed: Deterministic node ID now uses SHA-256 for stability across Rust versions. The implementation correctly uses `sha2::Sha256` instead of `DefaultHasher`, ensuring stable node IDs across toolchain upgrades.

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (2)

128-170: Addressed: Scale-down now has optimistic locking matching scale-up. The `try_increment_pending_scale_down` call (lines 139-148) with resourceVersion conflict detection prevents overlapping scale-downs.

171-184: Addressed: NoAction no longer resets pending counters. The comment at lines 173-176 correctly documents that pending counters track in-flight operations and should only be decremented when NodePools reach Ready or Failed state.
crates/basilica-autoscaler/src/controllers/k8s_client.rs (2)
132-153: Addressed: Finalizer addition now preserves existing finalizers. The fetch-modify-update pattern (lines 137-151) correctly reads existing finalizers, checks for duplicates, and appends only if needed.

155-186: Addressed: Finalizer removal now filters only the specific finalizer. The implementation correctly filters out only `FINALIZER` (line 166) while preserving other finalizers. Setting to `null` when empty (lines 169-174) is appropriate.

crates/basilica-autoscaler/src/provisioner/ssh.rs (5)
203-222: Addressed: File upload now uses base64 encoding to prevent command injection. The base64 encoding approach (lines 213-218) eliminates heredoc delimiter collision vulnerabilities.

401-403: Addressed: Public key logging now uses safe slice. The `get(..8).unwrap_or(&public_key)` pattern prevents panic on short strings.

464-471: Addressed: Peer public key logging now uses safe slice. The `get(..8).unwrap_or(&peer.public_key)` pattern is correctly applied.

569-569: Addressed: Node ID slicing now uses safe bounds. The `&node_id[..node_id.len().min(8)]` pattern prevents panic on short node IDs.

140-147: SSH host key verification is disabled, creating MITM vulnerability.
`ServerCheckMethod::NoCheck` disables SSH host key verification. For production infrastructure provisioning, this allows man-in-the-middle attacks. Consider implementing host key verification via TOFU (trust-on-first-use), pre-populated known hosts, or cloud provider API-provided host keys.
At minimum, add a configuration flag to control this behavior and document the security trade-off.
🧹 Nitpick comments (6)
crates/basilica-autoscaler/src/leader_election.rs (1)
377-381: Log a warning when `resourceVersion` is missing instead of silently returning. Line 380 returns `Ok(())` when `resourceVersion` is `None`, but this prevents safe release and indicates an unexpected API state. Callers won't know the lease wasn't released. Apply this diff to add visibility:

```diff
     let resource_version = match current.metadata.resource_version.clone() {
         Some(rv) => rv,
-        None => return Ok(()), // Cannot release without resourceVersion
+        None => {
+            warn!(identity = %identity, "Cannot release lease: missing resourceVersion");
+            return Ok(());
+        }
     };
```

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)
240-272: Consider adding scale-down trigger for idle nodes without GPU utilization metrics. The current implementation only triggers utilization-based scale-down when `average_gpu_utilization` is available (lines 258-267). Since this metric requires Prometheus/DCGM integration (per line 235 comment), scale-down may never trigger until that integration is complete. Consider whether the `idle_nodes` metric (collected at line 236) should also inform scale-down decisions.

crates/basilica-autoscaler/src/provisioner/ssh.rs (4)

11-17: Consider deriving `Debug` for `SshConnectionConfig` to aid troubleshooting. Adding `#[derive(Debug)]` would help with logging during debugging. Note: if added, ensure `private_key` and `passphrase` fields are redacted to avoid leaking secrets in logs.
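If Debug is added, a redacting manual impl along these lines avoids the secret-leak concern; the field set here is assumed from the review text rather than copied from the crate.

```rust
// Illustrative redacting Debug impl; field names are assumptions.
use std::fmt;

pub struct SshConnectionConfig {
    pub username: String,
    pub private_key: String,
    pub passphrase: Option<String>,
    pub connect_timeout_secs: u64,
}

impl fmt::Debug for SshConnectionConfig {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("SshConnectionConfig")
            .field("username", &self.username)
            // Never print key material, even at debug/trace level.
            .field("private_key", &"<redacted>")
            .field("passphrase", &self.passphrase.as_ref().map(|_| "<redacted>"))
            .field("connect_timeout_secs", &self.connect_timeout_secs)
            .finish()
    }
}
```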
484-495: Handshake timestamp comparison may have clock skew issues. The code compares the WireGuard handshake timestamp against local `SystemTime::now()`. If there's clock skew between the local system and the remote node, this comparison may incorrectly flag connectivity issues. Consider documenting this assumption or using the remote node's time.

590-599: K3s token is included in the command string, potentially visible in process listings. The K3s token is embedded directly in the install command (line 591), which may be visible in `ps` output on the remote machine during execution. Consider using environment variables or a temporary file with restricted permissions:

```diff
-        let install_cmd = format!(
-            r#"curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="agent" K3S_URL="{}" K3S_TOKEN="{}" sh -s - \
+        // Write token to temporary file to avoid exposure in process listings
+        let token_setup = format!(
+            "umask 077 && echo '{}' > /tmp/.k3s_token && trap 'rm -f /tmp/.k3s_token' EXIT",
+            token
+        );
+        let install_cmd = format!(
+            r#"{} && curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="agent" K3S_URL="{}" K3S_TOKEN="$(cat /tmp/.k3s_token)" sh -s - \
```

This is a defense-in-depth concern since SSH sessions are typically trusted, but it reduces the exposure window.

676-689: Unit test coverage is minimal. The test only verifies struct construction. Consider adding tests for:
- SSH command building with special characters
- Base64 encoding of file content
- Error handling paths (timeouts, failures)
Integration tests with a mock SSH server would provide better coverage.
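As one example of the second item in the list above, a hypothetical unit test that pins down the property the base64 upload relies on; it exercises only the base64 crate, not the provisioner's internal API.

```rust
// Hypothetical test; the module and test names are illustrative.
#[cfg(test)]
mod upload_encoding_tests {
    use base64::{engine::general_purpose::STANDARD as BASE64, Engine};

    #[test]
    fn payload_survives_shell_metacharacters() {
        let content: &[u8] = b"line with 'quotes' && $(subshell) <<EOF";
        let encoded = BASE64.encode(content);
        // Decoding must return the original bytes regardless of quotes,
        // heredoc markers, or other shell metacharacters in the input.
        assert_eq!(BASE64.decode(&encoded).unwrap().as_slice(), content);
    }
}
```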
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- crates/basilica-autoscaler/src/controllers/k8s_client.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1 hunks)
- crates/basilica-autoscaler/src/crd/scaling_policy.rs (1 hunks)
- crates/basilica-autoscaler/src/leader_election.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/k3s.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/ssh.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/basilica-autoscaler/src/crd/scaling_policy.rs
🧰 Additional context used
🧬 Code graph analysis (2)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (2)
crates/basilica-autoscaler/src/crd/scaling_policy.rs (3)
default(79-85)default(124-132)default(213-218)crates/basilica-autoscaler/src/crd/node_pool.rs (2)
default(189-196)default(242-251)
crates/basilica-autoscaler/src/provisioner/ssh.rs (3)
crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
parse_k3s_version(187-194)verify_k3s_status(182-184)crates/basilica-autoscaler/src/provisioner/wireguard.rs (3)
generate_wg_config(85-117)parse_wg_show_output(120-156)new(27-33)crates/basilica-api/src/wireguard/config.rs (1)
allowed_ips(61-73)
🔇 Additional comments (24)
crates/basilica-autoscaler/src/provisioner/k3s.rs (2)
181-194: Status/version helpers look fine for the intended outputs
`verify_k3s_status()` trimming and `parse_k3s_version()` token extraction match `systemctl is-active` and the common `k3s version ...` format.
196-299: Tests are solid; consider adding failing cases once label validation is tightened
After adding stricter label validation, add cases like "-bad" / "...." / "a//b" to ensure they're skipped (or rejected) deterministically.

crates/basilica-autoscaler/src/leader_election.rs (6)
1-22: LGTM! Struct design properly separates shutdown and leadership signaling. The separation of `shutdown_tx` and `leader_tx` correctly addresses the previous issue where leadership state changes couldn't be signaled independently from shutdown events.

24-52: LGTM! Identity generation and helper methods are well-implemented. The fallback chain (`POD_NAME` → `HOSTNAME` → UUID) ensures a unique identity in various deployment environments.

104-187: LGTM! Shutdown handling and election loop properly implemented. The loop correctly handles both explicit shutdown signals and sender-dropped scenarios (line 132), and the consecutive failure tracking with configurable limits provides resilience against transient API issues.

189-267: LGTM! Lease acquisition logic correctly handles all cases. The function properly distinguishes between holding, expired, and non-existent leases, and delegates to the appropriate helper functions that enforce optimistic concurrency control.
269-312: LGTM! Lease operations properly enforce optimistic concurrency control. Both `acquire_lease()` and `renew_lease()` correctly:
- Fetch the current lease to obtain `resourceVersion`
- Use `ok_or_else()` to fail fast on a missing `resourceVersion`
- Verify holder identity before acting (in `renew_lease`)
- Use `replace()` instead of `patch()` to enforce atomic updates

This addresses the previous TOCTOU vulnerabilities.
Also applies to: 314-360
412-424: LGTM! Basic atomic test validates shared flag semantics. While minimal, this test confirms the `Arc<AtomicBool>` pattern works correctly. Integration tests with a mock Kubernetes API would improve coverage but are not required for this review.

crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (9)
1-70: LGTM! Clean controller structure with proper dependency injection. The generic controller design with trait bounds (`AutoscalerK8sClient`, `SecureCloudApi`, `NodeProvisioner`) enables testability through mock implementations. The `Clone` implementation correctly uses `Arc::clone` for shared ownership.

92-133: Phase timeout handling is well-structured with a cleanup flag to prevent concurrent cleanup. The `cleanup_in_progress` flag (lines 98-101, 104-108, 119) prevents race conditions when multiple reconciliations detect the same timeout simultaneously.

177-183: Addressed: Deletion handler now correctly resolves node_id from both status and spec. The `resolved_node_id` pattern (line 178) properly handles both dynamic mode (status.node_id) and manual mode (spec.node_id).

396-397: Addressed: WireGuard installation now uses the resolve_node_id helper. This correctly supports both static (spec) and dynamic (status) node ID sources.

547-551: Addressed: Node labeling now uses resolved node_id from status. The pattern at lines 548-550 correctly prefers status.node_id for dynamic mode while falling back to spec.node_id for manual mode.

604-614: Addressed: failure_count is now reset on successful health check. Line 613 correctly resets `status.failure_count = 0` when the node is confirmed ready.

784-790: Addressed: Timeout cleanup now uses consistent node_id resolution. Both cleanup branches correctly use the `status.node_id.as_ref().or(pool.spec.node_id.as_ref())` pattern.
Also applies to: 805-811

1056-1076: Well-designed node_id resolution with clear precedence rules. The helper properly prioritizes: (1) user-specified spec.node_id, (2) previously persisted status.node_id, (3) deterministic generation from pool name + IP. This ensures idempotent reconciliation.

1078-1111: Good test coverage for phase timeouts and node ID generation stability. The tests verify critical invariants: timeouts are defined, node IDs are deterministic, and different inputs produce different IDs.
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (4)
30-51: LGTM! Controller structure with proper Clone and constructor. The generic design enables testing with mock implementations of `AutoscalerK8sClient` and `SecureCloudApi`.

79-126: Well-implemented optimistic locking for scale-up operations. The resourceVersion-based conflict detection (lines 82-105) prevents concurrent reconciles from double-counting scale-ups. The comment at lines 108-111 clearly documents that partial failures are handled by the NodePool controller.

435-466: LGTM! Comprehensive GPU request detection. The `requests_gpu` function correctly checks both regular containers and init containers for `nvidia.com/gpu` resource requests.
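A hedged sketch of what such a check can look like over k8s-openapi types; it mirrors the behavior described above but is not the crate's exact code.

```rust
// Sketch of GPU-request detection over k8s-openapi PodSpec/Container types.
use k8s_openapi::api::core::v1::{Container, PodSpec};

fn container_requests_gpu(c: &Container) -> bool {
    c.resources
        .as_ref()
        .and_then(|r| r.requests.as_ref())
        .map(|req| req.contains_key("nvidia.com/gpu"))
        .unwrap_or(false)
}

fn requests_gpu(spec: &PodSpec) -> bool {
    // Init containers can also pin a GPU, so both lists are checked.
    spec.containers.iter().any(container_requests_gpu)
        || spec
            .init_containers
            .as_ref()
            .map(|ics| ics.iter().any(container_requests_gpu))
            .unwrap_or(false)
}
```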
400-405: The `offering_id` field in both `SecureCloudTemplate` and `SecureCloudConfig` is documented as the "GPU offering ID from Basilica API," confirming it correctly represents the actual GPU offering ID. No action needed.
8-74: Well-designed trait abstraction for Kubernetes operations. The `AutoscalerK8sClient` trait provides a comprehensive interface for all required K8s operations, enabling mock implementations for testing. The optimistic locking methods (try_increment_pending_scale_up/down) have clear semantics documented in comments.

397-428: LGTM! Pod eviction correctly handles PodDisruptionBudget blocks. The eviction implementation uses the proper Eviction subresource API and correctly interprets HTTP 429 as "blocked by PDB", returning `Ok(false)`.
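For reference, a simplified sketch of that pattern using kube's eviction subresource support; the error-matching details are assumptions and may differ from the crate's implementation.

```rust
// Simplified sketch; not the crate's actual eviction code.
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, EvictParams};

async fn try_evict(pods: &Api<Pod>, name: &str) -> Result<bool, kube::Error> {
    match pods.evict(name, &EvictParams::default()).await {
        Ok(_) => Ok(true),
        // 429 means a PodDisruptionBudget is currently blocking the eviction;
        // report false so the caller retries on the next reconcile.
        Err(kube::Error::Api(e)) if e.code == 429 => Ok(false),
        Err(e) => Err(e),
    }
}
```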
216-255: Optimistic locking implementation is correct. The approach of fetching the full object, comparing resourceVersion locally, and calling `replace_status` works as intended. Kubernetes `replace_status` uses the resourceVersion from metadata for optimistic concurrency control: any stale resourceVersion will result in a 409 Conflict, which the code correctly handles.
The concern about independent resourceVersions between spec and status is unfounded. Kubernetes uses a single resourceVersion per object that increments for any modification (spec or status); if the spec changes, the resourceVersion will be incremented, causing the status replace to fail with a 409 Conflict as expected.
* Use grep guard to check if net.ipv4.ip_forward=1 already exists * Only append to /etc/sysctl.conf if entry is missing * Prevent duplicate entries on repeated provisioning runs
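A small sketch of the generated shell fragment this commit describes, wrapped in an illustrative Rust helper in the style of the provisioner's script builders; the function name is hypothetical.

```rust
// Illustrative helper; only the shell fragment reflects the commit above.
fn ip_forward_sysctl_snippet() -> &'static str {
    // grep -q succeeds when the entry already exists, so the append runs at
    // most once across repeated provisioning runs.
    r#"grep -qF 'net.ipv4.ip_forward=1' /etc/sysctl.conf || echo 'net.ipv4.ip_forward=1' | sudo tee -a /etc/sysctl.conf
sudo sysctl -w net.ipv4.ip_forward=1"#
}
```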
Summary by CodeRabbit
New Features
Chores