
Conversation

@epappas (Member) commented Dec 15, 2025

Summary by CodeRabbit

  • New Features

    • Dynamic GPU offering selection, warm-pool sizing & metrics; image pre-puller DaemonSet.
    • vLLM and SGLang deployment templates (CLI, SDK, examples) and deployment helpers.
    • API SSH access, VPC peering support for K3s connectivity, and AEAD migration tooling.
  • Bug Fixes

    • Insufficient cluster capacity now logs non‑fatal warnings so deployments/scales proceed with pods pending.
  • Improvements

    • Enhanced offering caching, selection diagnostics, events, metrics, and reconciliation/monitoring.


@epappas epappas self-assigned this Dec 15, 2025
@coderabbitai bot (Contributor) commented Dec 15, 2025

Walkthrough

Dynamic GPU offering selection, warm-pool sizing, and offering caching added to the autoscaler; secure‑cloud/rental types and client updated; SSH provisioning switched to system SSH; NodePool/ScalingPolicy controllers and CRDs extended; CLI/Python deployment templates for vLLM/SGLang added; two capacity errors converted to non‑fatal warnings.

Changes

Cohort, file(s), and summary:

  • Offering matcher & warm pool
    crates/basilica-autoscaler/src/offering_matcher.rs, crates/basilica-autoscaler/src/warm_pool.rs, crates/basilica-autoscaler/src/lib.rs, crates/basilica-autoscaler/src/metrics.rs
    New OfferingMatcher with cache, constraint/fallback matching and invalidation; warm-pool VRAM accounting, target/decision logic, metrics and public types; re-exports added.
  • Autoscaler controllers & runtime
    crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs, crates/basilica-autoscaler/src/controllers/node_pool_controller.rs, crates/basilica-autoscaler/src/runtime.rs, crates/basilica-autoscaler/src/config.rs
    Controllers become generic over an OfferingSelector, wired to OfferingMatcher; pending-pod filtering, provisioning guards, warm-pool evaluation, offering invalidation and GC/cleanup flows added.
  • Secure Cloud API & client
    crates/basilica-autoscaler/src/api/secure_cloud.rs, crates/basilica-autoscaler/src/api/mod.rs
    API types reshaped (GpuSpecs, aliasing/deserializers), added get_offering/get_rental, list_offerings parsing updated, client endpoints/URLs adjusted and registration/rental mappings changed.
  • Provisioner (SSH & WireGuard)
    crates/basilica-autoscaler/src/provisioner/ssh.rs, crates/basilica-autoscaler/src/provisioner/wireguard.rs
    Replaced in-process SSH with system ssh using temp key files; removed passphrase from config; configure_wireguard_peers gained node_ip; wg config generation simplified (MTU, allowed_ips helper).
  • Kubernetes client & events
    crates/basilica-autoscaler/src/controllers/k8s_client.rs, crates/basilica-operator/src/k8s_client.rs
    Added pod/nodepool event creation, pod listing, NodePool deletion; operator client gained ReplicaSet listing and replica-set scaling; mocks updated.
  • NodePool / ScalingPolicy CRDs & types
    crates/basilica-autoscaler/src/crd/node_pool.rs, crates/basilica-autoscaler/src/crd/scaling_policy.rs, orchestrator/k8s/services/autoscaler/crds.yaml
    Added offering_id, labels_applied, WireGuardPeerStatus, ssh_user/datacenter_id, offeringConstraints, warm_pool and image_prepull spec/status fields and related defaults.
  • Autoscaler rentals & API
    crates/basilica-api/src/api/routes/secure_cloud.rs, crates/basilica-api/src/config/mod.rs
    list_secure_cloud_rentals now refreshes active rentals from the provider, joins offerings and billing, computes hourly/accumulated costs; secure-cloud health-check interval made configurable.
  • Deployment validation (warnings)
    crates/basilica-api/src/api/routes/deployments/handlers.rs
    Two insufficient-capacity early-return errors converted to non-fatal warnings with structured fields so deployments/scales proceed with pods pending until capacity is available.
  • Node labeling & operator changes
    crates/basilica-operator/src/labels.rs, crates/basilica-operator/src/controllers/*
    Added normalize_gpu_model(s) helpers, normalized/deduplicated GPU labels, controllers use normalized values; tests updated.
  • CLI templates & examples
    crates/basilica-cli/src/cli/commands.rs, crates/basilica-cli/src/cli/handlers/deploy/templates/*, examples/19_deploy_vllm_template.py, examples/20_deploy_sglang_template.py
    Added vllm and sglang deploy subcommands, common helpers and templates (model_size, vllm, sglang) plus example scripts.
  • Python SDK templates
    crates/basilica-sdk-python/python/basilica/__init__.py, crates/basilica-sdk-python/python/basilica/templates/*
    Added high-level deploy_vllm and deploy_sglang client methods and model_size helpers; templates package re-exports GpuRequirements and estimate_gpu_requirements.
  • Payments & reconciliation
    crates/basilica-payments/src/blockchain/monitor.rs, crates/basilica-payments/src/blockchain/mod.rs, crates/basilica-payments/src/blockchain/client.rs, crates/basilica-payments/src/config.rs, crates/basilica-payments/src/reconciliation/{calculator,service}.rs, crates/basilica-payments/src/storage/reconciliation.rs
    MonitorConfig added, reconnect/retry/pruning logic, transfer receipts include block_number; sweep rate-limiting, stale-sweep reconciliation and new storage APIs added; config validation extended.
  • Aggregator provider IP handling
    crates/basilica-aggregator/src/providers/mod.rs
    ProviderDeployment.ip_address now uses floating_ip only (fixed_ip fallback removed).
  • CLI & minor UX
    crates/basilica-cli/src/cli/handlers/ssh_keys.rs, crates/basilica-cli/Cargo.toml
    SSH key listing prints key ID inline; added workspace regex dependency.
  • Dependencies & build
    crates/basilica-autoscaler/Cargo.toml, scripts/autoscaler/Dockerfile
    Replaced async-ssh2-tokio and adjusted tempfile usage; runtime container user UID/GID pinned to 1000.
  • Terraform & infra
    scripts/cloud/*, scripts/cloud/modules/networking/*, scripts/cloud/migration.tf, scripts/migrations/*, orchestrator/k8s/services/autoscaler/*, orchestrator/ansible/roles/api_ssh_access/*
    Added VPC peering inputs/routes and ignore_changes, AEAD migration tooling and ECS task, new namespaces/scaling-policy templates/secrets/ServiceMonitor/image-prepuller manifests, Ansible role/playbook for API SSH access, and related Terraform variable/secret updates.

Sequence Diagram(s)

sequenceDiagram
    participant Scaling as ScalingPolicyController
    participant Offering as OfferingMatcher (OfferingSelector)
    participant Secure as SecureCloudApi / SecureCloudClient
    participant K8s as K8sClient
    participant NodePool as NodePool CR

    Scaling->>K8s: collect_metrics(ns) & list pending GPU pods
    Scaling->>Offering: find_offering_with_fallback(requirements, constraints)
    Offering->>Secure: list_offerings / get_offering (uses cache/TTL)
    Secure-->>Offering: offerings / offering
    Offering-->>Scaling: OfferingResult (Found / UsedFallback / NotFound)
    alt Found or UsedFallback
        Scaling->>K8s: create_node_pool_from_template(offering_id, is_warm_pool)
        K8s-->>NodePool: create NodePool CR
        NodePool->>K8s: persist status.offering_id & labels
        NodePool->>Secure: start_rental / register node with gpu_specs
        Secure-->>NodePool: rental + registration (wireguard info)
        NodePool->>K8s: label node, create events (create_node_pool_event)
    else NotFound
        Scaling->>K8s: emit_no_offering_events(pending pods)
    end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120+ minutes

Focus areas:

  • offering_matcher.rs (matching logic, cache, fallback, metrics)
  • scaling_policy_controller.rs & node_pool_controller.rs (generic OfferingSelector wiring, warm-pool flows, GC)
  • provisioner/ssh.rs (system ssh invocation, temp key handling, timeouts/signal handling)
  • secure_cloud API/client (serde changes, URL/endpoint changes, get_offering/get_rental)
  • WireGuard peer/config handling and connectivity checks
  • Payments reconciliation (sweep idempotency, stale-sweep reconciliation, DB/storage APIs)
  • CRD/schema and serde defaults (offering_id, peers, warm_pool, ssh_user)

Suggested reviewers

  • itzlambda
  • distributedstatemachine

Poem

🐰 I found the best offering under the moon,
I warmed a pool and hummed a dev tune,
WireGuard stitched and SSH hopped right,
vLLM and SGLang wake at night—
I dug a tunnel, code deployed — hop delight!

Pre-merge checks

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly describes the main feature addition: dynamic offering selection for the K3s autoscaler. It is specific, concise, and directly reflects the core change across multiple files.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 96.09%, which is sufficient. The required threshold is 80.00%.

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/autoscaler/rollout.sh (1)

7-52: Help text and example still reference old default namespace

Runtime default is now basilica-autoscaler (Line 7), but:

  • Line 41: help says default is basilica-system.
  • Line 51: example uses --namespace basilica-system.

Recommend updating those strings to basilica-autoscaler to avoid confusion.

♻️ Duplicate comments (1)
crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (1)

230-377: Duplicate code with vllm.rs.

As noted in the vllm.rs review, these helper functions (parse_env_vars, create_with_retry, is_quota_exceeded, extract_quota_message, WaitResult, wait_for_ready) are duplicated. Please refer to the refactoring suggestion in vllm.rs.

🧹 Nitpick comments (23)
orchestrator/k8s/services/autoscaler/secrets-template.yaml (1)

1-40: Clarify secrets workflow and avoid secret-scanner noise on SSH key placeholder

  • The template structure and placeholders look good and align with the deployment/env vars.
  • The SSH key block (Lines 38–40) intentionally mimics a real private key and will be flagged by tools like gitleaks even though it’s a placeholder.

Two small suggestions:

  1. Consider adjusting the placeholder so it doesn’t use a real-looking OpenSSH header/footer (or add a repo-level allow rule) to keep CI scanners quieter.
  2. Since the file is named secrets-template.yaml, you might explicitly state that users should copy it to secrets.yaml before filling values, to align with the comment on Line 4 and .gitignore for secrets.yaml.
crates/basilica-autoscaler/src/lib.rs (1)

9-21: Public re-exports for offering matcher look coherent

Exposing offering_matcher and re-exporting GpuRequirements, OfferingMatcher*, OfferingSelector, etc., from the crate root is a clean way to make the dynamic-offering functionality available to controllers/runtime without deep module imports. Just be aware this solidifies these type names as part of the public API surface going forward.

crates/basilica-autoscaler/src/api/secure_cloud.rs (1)

263-265: get_offering implementation is correct; consider future optimization

The new trait method and implementation are straightforward and reuse list_offerings, which keeps circuit-breaker behavior consistent. For now this is fine.

If the offerings list grows large or get_offering is used in hot paths, you may eventually want a more direct API call or a cached view in the offering matcher layer to avoid fetching the full list on each lookup.

Also applies to: 431-434
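
If that optimization is ever needed, a rough sketch of a matcher-side cached lookup could look like the following; the cache shape and the GpuOffering fields here are assumptions for illustration, not the crate's actual API.

use std::sync::Arc;
use tokio::sync::RwLock;

// Stand-in for the crate's GpuOffering; the `id` field is an assumption.
#[derive(Clone)]
struct GpuOffering { id: String }

// Hypothetical matcher-side lookup: answer get_offering from the cached list
// that backs list_offerings instead of re-fetching the full list per call.
async fn get_offering_cached(
    cache: &RwLock<Arc<Vec<GpuOffering>>>,
    offering_id: &str,
) -> Option<GpuOffering> {
    let offerings = Arc::clone(&*cache.read().await); // cheap Arc clone under the read lock
    offerings.iter().find(|o| o.id == offering_id).cloned()
}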

crates/basilica-autoscaler/src/crd/node_pool.rs (1)

386-388: New status fields align with dynamic offering and labeling

  • offering_id: Option<String> with #[serde(default, skip_serializing_if = "Option::is_none")] is backward-compatible and gives you a clear link back to the Basilica offering for dynamic mode.
  • labels_applied: bool with #[serde(default)] is a simple, robust flag to gate GPU label application and avoid races.

Schema-wise this looks good. The main follow‑through is on the controller side: ensuring labels_applied is reset appropriately if node labels or selected offering change.

Also applies to: 454-457
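
For reference, the serde shape described above looks roughly like this; the struct name is illustrative, while the field attributes are the ones quoted in this comment.

use serde::{Deserialize, Serialize};

// Illustrative fragment showing only the two new status fields.
#[derive(Serialize, Deserialize, Default)]
struct NodePoolStatusFragment {
    /// Offering that provisioned this pool; omitted entirely when unset,
    /// so older stored objects still round-trip cleanly.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    offering_id: Option<String>,
    /// Defaults to false when missing from existing status objects.
    #[serde(default)]
    labels_applied: bool,
}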

crates/basilica-api/src/api/routes/deployments/handlers.rs (1)

74-82: Intentional non-blocking capacity validation looks good.

The change from error to warning allows deployments to proceed when capacity is insufficient, enabling the autoscaler to provision additional nodes dynamically. The structured log fields (namespace, reason) provide good observability.

Consider adding a metric here to track how often deployments proceed with insufficient capacity - this would help correlate with autoscaler provisioning events.
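
If such a metric is added, the pairing could look roughly like this sketch; the counter name, labels, and helper function are assumptions, not existing code.

use prometheus::{register_int_counter_vec, IntCounterVec};
use std::sync::LazyLock;
use tracing::warn;

// Hypothetical counter; name and labels are illustrative only.
static INSUFFICIENT_CAPACITY_PROCEEDED: LazyLock<IntCounterVec> = LazyLock::new(|| {
    register_int_counter_vec!(
        "api_deployments_proceeded_with_insufficient_capacity_total",
        "Deployments/scales that proceeded despite insufficient cluster capacity",
        &["namespace", "operation"]
    )
    .expect("metric registration")
});

fn warn_insufficient_capacity(namespace: &str, operation: &str, reason: &str) {
    // Keep the structured warning, and count it so it can be correlated
    // with autoscaler provisioning events.
    warn!(namespace, operation, reason, "insufficient capacity; proceeding with pods pending");
    INSUFFICIENT_CAPACITY_PROCEEDED
        .with_label_values(&[namespace, operation])
        .inc();
}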

orchestrator/k8s/services/autoscaler/scaling-policy.yaml (1)

72-80: K3s server URL is hardcoded.

The server URL https://connect.basilica.ai:6443 is hardcoded. For production use, consider documenting that this needs to be customized, or add a comment indicating this is a placeholder that must be updated per deployment.

crates/basilica-autoscaler/src/controllers/k8s_client.rs (1)

474-474: Potential event name collision when nanoseconds unavailable.

Using timestamp_nanos_opt().unwrap_or(0) could create duplicate event names if multiple events are created when nanoseconds are unavailable (returns 0). Consider adding a random suffix or using a UUID fragment:

-let event_name = format!("{}.{}", pod_name, now.timestamp_nanos_opt().unwrap_or(0));
+let event_name = format!(
+    "{}.{:x}",
+    pod_name,
+    now.timestamp_nanos_opt().unwrap_or_else(|| rand::random::<i64>().unsigned_abs() as i64)
+);

Alternatively, use the simpler pattern with millis which is always available:

let event_name = format!("{}.{}", pod_name, now.timestamp_millis());
crates/basilica-autoscaler/src/metrics.rs (1)

191-199: High cardinality risk with offering_id label.

The offering_selection_total metric includes offering_id as a label. If there are many unique offering IDs (which is likely with dynamic offerings from multiple providers), this could cause metrics cardinality explosion and impact Prometheus performance.

Consider:

  1. Using a separate counter without offering_id for total selections
  2. Adding offering_id only to a lower-frequency metric or log
  3. Limiting the label to a normalized/bucketed value (e.g., GPU type only)
 let offering_selection_total = IntCounterVec::new(
     Opts::new(
         "autoscaler_offering_selection_total",
         "Total number of GPU offering selections",
     ),
-    &["policy", "offering_id", "gpu_type"],
+    &["policy", "gpu_type"],  // Remove offering_id to avoid cardinality issues
 )
 .unwrap();
orchestrator/k8s/services/autoscaler/crds.yaml (1)

676-679: Clarify nullable vs optional behavior for offeringId.

The offeringId field is marked as nullable: true but lacks a default value. When using dynamic selection, this field will be populated by the autoscaler. Consider adding a description clarifying that when null/empty, dynamic selection is triggered.

The current description mentions dynamic selection but doesn't explicitly state what happens when offeringId is omitted vs explicitly set.

crates/basilica-autoscaler/src/offering_matcher.rs (1)

295-315: Document the prefix matching behavior in user-facing docs.

The forward prefix matching (requested "A100" matches offering "A100SXM4") is useful but may surprise users. Consider adding documentation or an info log when prefix matching is used, so users understand why their pod was scheduled to a specific GPU variant.
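
A minimal sketch of the check with an info log when a variant is substituted; the helper and its inline normalization are assumptions about the implementation, shown only to illustrate the behavior.

use tracing::info;

// Hypothetical helper mirroring the behavior described above: a requested
// model matches an offering when the normalized offering name starts with
// the normalized requested name (e.g. "A100" matches "A100SXM4").
fn prefix_matches(requested: &str, offered: &str) -> bool {
    let norm = |s: &str| {
        s.to_uppercase()
            .chars()
            .filter(|c| c.is_ascii_alphanumeric())
            .collect::<String>()
    };
    let (req, off) = (norm(requested), norm(offered));
    let matched = off.starts_with(&req);
    if matched && req != off {
        info!(requested, offered, "scheduling onto a GPU variant via prefix match");
    }
    matched
}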

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)

561-602: Consider rate-limiting fallback offering lookups.

When find_offering_with_fallback repeatedly fails to find a matching offering and falls back, it calls self.api.get_offering(fallback_id) on each reconciliation. If the fallback is invalid or unavailable, this generates repeated API calls.

Consider caching the fallback validation result or using the offering cache for fallback lookups.

crates/basilica-operator/src/controllers/user_deployment_controller.rs (1)

620-624: Annotations wiring looks good; consider test coverage for GPU annotation presence

Conditionally setting metadata.annotations to None when the map is empty avoids spurious changes for non-GPU workloads, and the behavior is backward‑compatible with previous templates that had no annotations.

Given the autoscaler will likely depend on autoscaler.basilica.ai/gpu-requirements, consider adding a dedicated test that:

  • Uses a UserDeploymentSpec with GpuSpec.
  • Asserts that the pod template metadata includes the gpu-request.basilica.ai/* labels and the autoscaler.basilica.ai/gpu-requirements annotation.
crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (2)

159-189: Consider extracting shared name sanitization logic.

The model name sanitization logic (lines 161-177) is duplicated in sglang.rs. Consider extracting this into a shared helper function in the parent module or model_size.rs to reduce duplication.

// In a shared module (e.g., templates/mod.rs or a new helpers.rs):
pub fn sanitize_model_name(model: &str, max_len: usize) -> String {
    let model_part = model
        .split('/')
        .next_back()
        .unwrap_or(model)
        .to_lowercase()
        .chars()
        .filter_map(|c| {
            if c.is_ascii_alphanumeric() {
                Some(c)
            } else if c == '-' || c == '_' || c == '.' {
                Some('-')
            } else {
                None
            }
        })
        .collect::<String>();

    let sanitized = model_part.trim_matches('-');
    // Bind first: an `if`/`else` in statement position cannot take a trailing
    // method call directly, so truncate into a local and convert that.
    let truncated = if sanitized.len() > max_len {
        &sanitized[..max_len]
    } else {
        sanitized
    };
    truncated.to_string()
}

20-21: Consider pinning the vLLM image to a specific version for better reproducibility.

Using :latest tag can lead to reproducibility issues when the image is updated upstream. While vLLM's official documentation recommends the :latest tag, pinning to a specific version (e.g., vllm/vllm-openai:v0.12.0) provides more consistent deployments. Alternatively, make this configurable to allow users to specify their preferred version.

crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (1)

73-110: Consider making model family patterns extensible.

The hardcoded model family patterns work for known models but may become outdated as new models are released. Consider adding a note or TODO about future extensibility, or accepting this as a best-effort fallback since the regex-based extraction is the primary method.

crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (1)

20-21: Consider pinning the SGLang Docker image to a specific version.

Using :latest tag compromises reproducibility across deployments. Pin to a specific version (e.g., v0.4.6-cu125 or the latest stable release available on Docker Hub) or make this configurable via environment variable or command-line option.

crates/basilica-cli/src/cli/commands.rs (2)

808-846: Consider adding basic validation for vLLM numeric / enum-like options

A few options could benefit from clap-level validation for better UX:

  • gpu_memory_utilization: Option<f32> is documented as 0.0-1.0 but not range-checked.
  • dtype and quantization are free-form Strings even though they are effectively enums (e.g., auto, float16, bfloat16, etc.).

Recommended (non-blocking) improvements:

  • Add a range check for utilization; clap's built-in ranged value parsers only cover integer types, so use a small custom parser (see the parse_unit_fraction sketch below):
-    #[arg(long)]
-    pub gpu_memory_utilization: Option<f32>,
+    #[arg(long, value_parser = parse_unit_fraction)]
+    pub gpu_memory_utilization: Option<f32>,
  • Optionally later, replace dtype / quantization with ValueEnum types to surface valid values in --help and get automatic validation.
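
Since clap has no float range parser, both fraction options (gpu_memory_utilization here and mem_fraction_static in the SGLang item below) can share a small custom value parser along these lines; the name parse_unit_fraction is illustrative.

// Illustrative custom value parser for 0.0..=1.0 fractions, usable as
// `#[arg(long, value_parser = parse_unit_fraction)]`.
fn parse_unit_fraction(s: &str) -> Result<f32, String> {
    let value: f32 = s
        .parse()
        .map_err(|e| format!("'{s}' is not a valid number: {e}"))?;
    if (0.0..=1.0).contains(&value) {
        Ok(value)
    } else {
        Err(format!("'{s}' must be between 0.0 and 1.0"))
    }
}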

848-870: Mirror validation for SGLang mem_fraction_static and consider enums for quantization

Similar to vLLM:

  • mem_fraction_static: Option<f32> is documented as 0.0-1.0 but not range-validated.
  • quantization: Option<String> is conceptually an enum.

Non-blocking suggestion:

-    #[arg(long)]
-    pub mem_fraction_static: Option<f32>,
+    #[arg(long, value_parser = parse_unit_fraction)]
+    pub mem_fraction_static: Option<f32>,

(reusing the parse_unit_fraction sketch shown above, since clap has no built-in float range parser)

And optionally later, migrate quantization to a ValueEnum if you want tighter validation and better --help.

examples/20_deploy_sglang_template.py (1)

23-28: Fix redundant f-strings and optionally harden cleanup in the example

  • Lines 23 and 41 use f-strings with no interpolations; you can drop the f prefix to satisfy Ruff and keep things simple:
-print(f"SGLang deployment ready!")
+print("SGLang deployment ready!")
...
-print(f"Qwen 3B deployment ready!")
+print("Qwen 3B deployment ready!")
  • Optional: for a slightly more robust example, you could wrap deployment/printing in a try and move the delete() calls into a finally block so deployments are cleaned up even if something fails mid-script.

Also applies to: 41-48

crates/basilica-sdk-python/python/basilica/templates/model_size.py (1)

81-97: Align estimate_memory_from_params docstring with implementation

The function always rounds up to the next multiple of 8GB, not to the nearest 8GB. To avoid confusion for callers, consider tightening the wording:

-    Returns:
-        Estimated VRAM in GB, rounded to nearest 8GB
+    Returns:
+        Estimated VRAM in GB, rounded up to the nearest 8GB

The conservative “round up” behavior itself seems appropriate for sizing GPUs.

examples/19_deploy_vllm_template.py (1)

23-26: Clean up redundant f-strings and consider making cleanup more robust

  • Lines 23 and 41 use f-strings without placeholders; they can be plain strings:
-print(f"vLLM deployment ready!")
+print("vLLM deployment ready!")
...
-print(f"Llama-2-7B deployment ready!")
+print("Llama-2-7B deployment ready!")
  • Optional: like the SGLang example, you might wrap the deployments in a try and move the delete() calls into a finally so resources are cleaned up if something goes wrong between deploy and teardown.

Also applies to: 41-48

crates/basilica-sdk-python/python/basilica/__init__.py (2)

446-506: Factor shared GPU-sizing logic and avoid Optional[int] in GpuRequirementsSpec calls

The vLLM and SGLang helpers share nearly identical blocks for:

  • Estimating GPU requirements,
  • Choosing an effective GPU count and model list, and
  • Constructing GpuRequirementsSpec/ResourceRequirements.

Two small refactors would improve maintainability and type ergonomics:

  1. Use a non-optional local for the effective GPU count (helps type-checkers and makes intent explicit):
-        # Use user-specified GPU count or auto-detected
-        if gpu_count is None:
-            gpu_count = reqs.gpu_count
-
-        # Use user-specified GPU models or recommended GPU
-        effective_gpu_models = gpu_models if gpu_models else [reqs.recommended_gpu]
+        # Use user-specified GPU count or auto-detected
+        effective_gpu_count = gpu_count if gpu_count is not None else reqs.gpu_count
+
+        # Use user-specified GPU models or recommended GPU
+        effective_gpu_models = gpu_models if gpu_models else [reqs.recommended_gpu]
...
-        gpu_spec = GpuRequirementsSpec(
-            count=gpu_count,
+        gpu_spec = GpuRequirementsSpec(
+            count=effective_gpu_count,
             model=effective_gpu_models,
             min_cuda_version=None,
             min_gpu_memory_gb=None,
         )

…and mirror the same pattern in deploy_sglang().

  1. (Optional) Consider extracting a tiny internal helper for the repeated “estimate requirements → pick effective_gpu_count/effective_gpu_models → build gpu_spec/resources” sequence to keep the two methods in sync as you iterate on the heuristics.

Also applies to: 559-572, 600-651, 698-711


517-586: Consider making image and storage configuration overridable rather than hard-coded

Both deploy helpers hard-code:

  • Images: "vllm/vllm-openai:latest" and "lmsysorg/sglang:latest".
  • Storage backend details, including backend=StorageBackend.R2, bucket="", and credentials_secret="basilica-r2-credentials".

For long-term operability and self-hosted scenarios, it may be worth:

  • Exposing image (and maybe port) as optional parameters with sensible defaults, so users can pin specific tags or swap images without forking the SDK.
  • Making bucket/credentials either:
    • Derived from environment/config, or
    • Optional parameters (with these as defaults) so non-Basilica-R2 environments can still reuse the helpers.

This keeps the “one-liner” ergonomics while avoiding environment-specific assumptions being baked into the SDK.

Also applies to: 664-725

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3118197 and b9ce972.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (32)
  • crates/basilica-api/src/api/routes/deployments/handlers.rs (2 hunks)
  • crates/basilica-autoscaler/src/api/secure_cloud.rs (2 hunks)
  • crates/basilica-autoscaler/src/controllers/k8s_client.rs (3 hunks)
  • crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (11 hunks)
  • crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (13 hunks)
  • crates/basilica-autoscaler/src/crd/node_pool.rs (2 hunks)
  • crates/basilica-autoscaler/src/crd/scaling_policy.rs (3 hunks)
  • crates/basilica-autoscaler/src/error.rs (3 hunks)
  • crates/basilica-autoscaler/src/lib.rs (2 hunks)
  • crates/basilica-autoscaler/src/metrics.rs (5 hunks)
  • crates/basilica-autoscaler/src/offering_matcher.rs (1 hunks)
  • crates/basilica-autoscaler/src/runtime.rs (6 hunks)
  • crates/basilica-cli/Cargo.toml (1 hunks)
  • crates/basilica-cli/src/cli/commands.rs (1 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/mod.rs (2 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/mod.rs (1 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (1 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (1 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (1 hunks)
  • crates/basilica-operator/src/controllers/user_deployment_controller.rs (2 hunks)
  • crates/basilica-sdk-python/python/basilica/__init__.py (2 hunks)
  • crates/basilica-sdk-python/python/basilica/templates/__init__.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/templates/model_size.py (1 hunks)
  • examples/19_deploy_vllm_template.py (1 hunks)
  • examples/20_deploy_sglang_template.py (1 hunks)
  • orchestrator/k8s/services/autoscaler/.gitignore (1 hunks)
  • orchestrator/k8s/services/autoscaler/crds.yaml (4 hunks)
  • orchestrator/k8s/services/autoscaler/deployment.yaml (1 hunks)
  • orchestrator/k8s/services/autoscaler/namespaces.yaml (1 hunks)
  • orchestrator/k8s/services/autoscaler/scaling-policy.yaml (1 hunks)
  • orchestrator/k8s/services/autoscaler/secrets-template.yaml (1 hunks)
  • scripts/autoscaler/rollout.sh (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (11)
crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (6)
crates/basilica-cli/src/progress/mod.rs (2)
  • complete_spinner_and_clear (137-139)
  • create_spinner (98-109)
crates/basilica-sdk-python/python/basilica/__init__.py (1)
  • GpuRequirementsSpec (94-112)
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (1)
  • estimate_gpu_requirements (32-50)
crates/basilica-cli/src/output/mod.rs (1)
  • json_output (12-17)
crates/basilica-sdk-python/python/basilica/exceptions.py (1)
  • DeploymentFailed (237-266)
crates/basilica-sdk/src/client.rs (1)
  • timeout (1151-1154)
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (2)
crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (1)
  • model (161-176)
crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (1)
  • model (162-177)
crates/basilica-cli/src/cli/handlers/deploy/templates/mod.rs (3)
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (1)
  • estimate_gpu_requirements (32-50)
crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (1)
  • handle_sglang_deploy (36-157)
crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (1)
  • handle_vllm_deploy (36-157)
crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (3)
crates/basilica-cli/src/progress/mod.rs (2)
  • complete_spinner_and_clear (137-139)
  • create_spinner (98-109)
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (1)
  • estimate_gpu_requirements (32-50)
crates/basilica-cli/src/output/mod.rs (1)
  • json_output (12-17)
crates/basilica-autoscaler/src/metrics.rs (1)
crates/basilica-billing/src/domain/rentals.rs (1)
  • offering_id (223-225)
crates/basilica-autoscaler/src/runtime.rs (2)
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (3)
  • new (59-74)
  • clone (41-50)
  • with_offering_selector (84-100)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (2)
  • new (58-70)
  • clone (42-49)
examples/19_deploy_vllm_template.py (3)
crates/basilica-cli/src/github_releases.rs (1)
  • basilica (21-26)
crates/basilica-sdk-python/python/basilica/deployment.py (1)
  • url (183-191)
crates/basilica-cli/src/cli/handlers/deploy/validation.rs (1)
  • memory (401-402)
examples/20_deploy_sglang_template.py (2)
crates/basilica-sdk-python/python/basilica/__init__.py (1)
  • deploy_sglang (600-737)
crates/basilica-sdk-python/python/basilica/deployment.py (1)
  • url (183-191)
crates/basilica-sdk-python/python/basilica/__init__.py (3)
crates/basilica-sdk-python/python/basilica/deployment.py (3)
  • name (178-180)
  • Deployment (114-453)
  • _from_response (436-453)
crates/basilica-sdk-python/python/basilica/templates/model_size.py (1)
  • estimate_gpu_requirements (30-60)
crates/basilica-sdk-python/python/basilica/decorators.py (2)
  • deployment (35-37)
  • deployment (129-208)
crates/basilica-sdk-python/python/basilica/templates/__init__.py (1)
crates/basilica-sdk-python/python/basilica/templates/model_size.py (2)
  • GpuRequirements (22-27)
  • estimate_gpu_requirements (30-60)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)
crates/basilica-autoscaler/src/offering_matcher.rs (4)
  • calculate_nodes_needed (638-658)
  • group_pending_pods_by_requirements (620-635)
  • has_gpu_node_affinity (499-542)
  • from_template (107-120)
🪛 Gitleaks (8.30.0)
orchestrator/k8s/services/autoscaler/secrets-template.yaml

[high] 38-40: Identified a Private Key, which may compromise cryptographic security and sensitive data encryption.

(private-key)

🪛 Ruff (0.14.8)
examples/19_deploy_vllm_template.py

23-23: f-string without any placeholders

Remove extraneous f prefix

(F541)


41-41: f-string without any placeholders

Remove extraneous f prefix

(F541)

examples/20_deploy_sglang_template.py

23-23: f-string without any placeholders

Remove extraneous f prefix

(F541)


41-41: f-string without any placeholders

Remove extraneous f prefix

(F541)

crates/basilica-sdk-python/python/basilica/__init__.py

520-520: Possible binding to all interfaces

(S104)


551-551: Possible hardcoded password assigned to argument: "credentials_secret"

(S106)


667-667: Possible binding to all interfaces

(S104)


690-690: Possible hardcoded password assigned to argument: "credentials_secret"

(S106)

🔇 Additional comments (64)
crates/basilica-cli/Cargo.toml (2)

66-67: The dependency addition is well-organized.

The placement after ssh-key and before internal dependencies follows a logical grouping convention. The comment clearly indicates the purpose, and using a workspace dependency ensures consistent versioning across the monorepo.


66-67: The regex dependency addition is correct and actively used in the codebase.

The regex workspace dependency is properly imported and used in crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs for extracting parameter counts from model names (matching patterns like "7b", "70b", "0.5b", etc.). The dependency is initialized efficiently using LazyLock<Regex>. The addition is appropriate.

orchestrator/k8s/services/autoscaler/.gitignore (1)

1-1: Good call ignoring secrets.yaml

Ignoring the concrete secrets.yaml file aligns with using secrets-template.yaml and reduces risk of committing real credentials.

orchestrator/k8s/services/autoscaler/namespaces.yaml (1)

1-16: Namespaces and labeling look consistent

basilica-autoscaler and basilica-ssh-keys namespaces, plus labels (including restricted: "true" for SSH keys), line up with the deployment and secrets usage and give you room for policy enforcement.

orchestrator/k8s/services/autoscaler/deployment.yaml (2)

1-212: Deployment, Service, and config wiring are consistent with new namespace and secrets

  • Deployment, Service, and ConfigMap all target basilica-autoscaler and share app: basilica-autoscaler labels, matching the new namespaces and basilica-autoscaler-credentials Secret.
  • Env var BASILICA_API_KEY correctly references basilica-autoscaler-credentials/api-key, which matches the secrets template.

183-212: ServiceMonitor and PDB configuration look sound

  • ServiceMonitor:
    • Selects app: basilica-autoscaler and restricts to the basilica-autoscaler namespace.
    • Scrapes the metrics port on /metrics with reasonable interval/timeout settings.
  • PodDisruptionBudget:
    • minAvailable: 1 with a single replica effectively forbids voluntary disruptions, which is a reasonable choice for a leader‑elected singleton if you prefer availability over graceful drain upgrades.

No issues from a manifest perspective; just confirm that the strict PDB behavior matches your upgrade/maintenance expectations.

crates/basilica-api/src/api/routes/deployments/handlers.rs (1)

359-368: Consistent non-blocking behavior for scale operations.

The warning pattern matches validate_deployment_prerequisites and provides relevant context (current_replicas, new_replicas, reason) for debugging. This aligns well with the autoscaler-driven capacity provisioning model.

orchestrator/k8s/services/autoscaler/scaling-policy.yaml (2)

84-91: Pre-setting basilica.ai/validated: "true" may bypass validation.

Setting basilica.ai/validated: "true" as a default label on provisioned nodes could skip any node validation logic that depends on this label being set only after successful validation. Ensure this is intentional and that the provisioning flow actually validates nodes before they become schedulable.


62-69: SSH key secrets reference looks correct, but verify namespace access.

The SSH key secret is referenced from a different namespace (basilica-ssh-keys). Ensure that the autoscaler's ServiceAccount has RBAC permissions to read secrets from this namespace, or the provisioning will fail at runtime.

crates/basilica-autoscaler/src/crd/scaling_policy.rs (3)

49-52: New offering_constraints field looks good.

The optional offering_constraints field with proper serde attributes allows dynamic GPU offering selection while maintaining backward compatibility with existing policies that don't specify constraints.


176-195: Dynamic selection fields are well-structured.

The new fields for dynamic offering selection (gpu_type, min_gpu_count, max_hourly_rate, preferred_provider, region) provide good flexibility. The defaults and skip_serializing patterns are consistent with the rest of the codebase.


170-175: CRD schema already reflects offering_id as optional—no breaking change for ScalingPolicy manifests.

The CRD definition in orchestrator/k8s/services/autoscaler/crds.yaml already marks offeringId as optional (nullable: true) and excludes it from the required fields for ScalingPolicy.spec.nodeTemplate.secureCloud. The #[serde(default)] attribute on the Rust struct ensures backward compatibility for existing deserialization. No action required on CRD schema updates or existing manifests; they can safely omit the field.

crates/basilica-autoscaler/src/controllers/k8s_client.rs (3)

75-98: New event creation trait methods provide good observability.

The trait methods for creating Pod and NodePool events enable visibility into autoscaler decisions. The signatures are clean and the parameters are sufficient for meaningful event messages.


493-495: Consider using first_timestamp and last_timestamp for event aggregation.

Setting first_timestamp: None and last_timestamp: None while using event_time is valid for newer-style events, but some older K8s tooling may expect the legacy timestamp fields. The current approach is correct for modern clusters.


509-557: NodePool event creation mirrors Pod event pattern correctly.

The implementation is consistent with create_pod_event. The same event name collision concern applies here (line 523).

crates/basilica-autoscaler/src/error.rs (3)

91-96: Enhanced NoMatchingOffering error provides useful context.

The struct variant with gpu_count, models, and min_memory_gb fields produces actionable error messages that help operators understand why no offering matched. The error message format is clear.


123-128: New ApiTimeout variant is well-structured.

The variant captures the operation name and timeout duration, providing context for debugging timeout issues. This aligns with the broader timeout handling patterns in the autoscaler.


147-159: Retryability classification is appropriate.

Adding NoMatchingOffering and ApiTimeout to is_retryable is correct:

  • Offerings may become available as capacity changes
  • API timeouts are typically transient

Removing NoMatchingOffering from is_permanent is consistent with the retryability change.
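
Based on the fields called out in these comments, the two variants plausibly look like the following thiserror sketch; exact field types, message wording, and the enum name are assumptions.

use std::time::Duration;
use thiserror::Error;

// Sketch of the two variants discussed above; not the crate's full error enum.
#[derive(Debug, Error)]
enum AutoscalerErrorSketch {
    #[error("no matching offering for {gpu_count} GPU(s) of {models:?} with >= {min_memory_gb} GB")]
    NoMatchingOffering {
        gpu_count: u32,
        models: Vec<String>,
        min_memory_gb: u32,
    },
    #[error("API operation '{operation}' timed out after {timeout:?}")]
    ApiTimeout { operation: String, timeout: Duration },
}

impl AutoscalerErrorSketch {
    // Both variants are retryable: offerings can appear as capacity changes,
    // and API timeouts are usually transient.
    fn is_retryable(&self) -> bool {
        matches!(
            self,
            Self::NoMatchingOffering { .. } | Self::ApiTimeout { .. }
        )
    }
}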

crates/basilica-autoscaler/src/metrics.rs (3)

44-55: New offering-related metrics provide good observability.

The metrics cover the key aspects of offering selection: successful selections, failures, duration, and cache behavior. This enables correlation between offering selection events and scaling decisions.


210-218: Histogram buckets are well-suited for cache/lookup operations.

The sub-second buckets (0.001s to 1.0s) are appropriate for offering selection which should be fast when cached. If API calls are expected, you might want to extend the upper buckets (e.g., add 2.0, 5.0).
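
Extending the upper buckets as suggested would look roughly like this; the metric name echoes the selection metrics discussed above but is otherwise an assumption.

use prometheus::{HistogramOpts, HistogramVec};

// Sketch: sub-second buckets for cached lookups, plus two larger buckets
// (2.0s and 5.0s) to cover uncached API calls per the suggestion above.
fn offering_selection_duration() -> HistogramVec {
    HistogramVec::new(
        HistogramOpts::new(
            "autoscaler_offering_selection_duration_seconds",
            "Time spent selecting a GPU offering",
        )
        .buckets(vec![0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0]),
        &["policy"],
    )
    .expect("valid histogram options")
}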


412-439: Recording methods follow established patterns.

The new methods (record_offering_selection, record_no_matching_offering, etc.) are consistent with existing metric recording methods in the file. The method signatures provide appropriate context for the metrics.

orchestrator/k8s/services/autoscaler/crds.yaml (3)

380-383: LGTM - labelsApplied status field.

The labelsApplied boolean with default false correctly tracks GPU label application state to avoid race conditions between node readiness and label application.


417-420: LGTM - offeringId status field.

Tracking the offering ID in status enables correlation between the provisioned node and the selected GPU offering for debugging and auditing.


715-740: LGTM - offeringConstraints structure.

The constraints object provides appropriate filtering capabilities:

  • providers with empty default (any provider)
  • regions as nullable array
  • maxHourlyRate with proper double format
  • fallbackOfferingId for graceful degradation

The field structure aligns with the OfferingConstraints Rust struct in offering_matcher.rs.

crates/basilica-autoscaler/src/runtime.rs (4)

55-60: LGTM - OfferingMatcher initialization with metrics.

The offering matcher is correctly instantiated with:

  • Shared API client via Arc::clone
  • Default configuration (60s cache TTL, 30s API timeout)
  • Metrics integration for cache hit/miss observability

109-125: LGTM - Controller wiring with offering selector.

Both controllers are correctly wired:

  • NodePoolController::with_offering_selector receives the offering matcher for cache invalidation on rental failures
  • ScalingPolicyController::new receives it for dynamic offering selection during scale-up

The Arc::clone pattern ensures efficient shared ownership.


133-156: LGTM - Context structs with generic parameter.

The generic parameter S is correctly added to both context structs:

  • ControllerContext<K, A, P, S> for NodePool reconciliation
  • ScalingContext<K, A, S> with S: OfferingSelector bound for ScalingPolicy

This enables type-safe offering selection throughout the reconciliation lifecycle.
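
In outline, the generic wiring amounts to something like the following; the traits and struct members are simplified stand-ins, not the real definitions.

use std::sync::Arc;

// Simplified stand-ins for the real traits; bounds only, no methods shown.
trait KubeClient {}
trait SecureCloudApi {}
trait Provisioner {}
trait OfferingSelector {}

// NodePool reconciliation context: the selector is optional in spirit
// (MaybeOfferingSelector in the real code), so invalidation is best-effort.
struct ControllerContext<K: KubeClient, A: SecureCloudApi, P: Provisioner, S> {
    k8s: Arc<K>,
    api: Arc<A>,
    provisioner: Arc<P>,
    offering_selector: Option<Arc<S>>,
}

// ScalingPolicy reconciliation requires a selector for dynamic selection.
struct ScalingContext<K: KubeClient, A: SecureCloudApi, S: OfferingSelector> {
    k8s: Arc<K>,
    api: Arc<A>,
    offering_selector: Arc<S>,
}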


183-218: LGTM - NodePool reconcile functions with MaybeOfferingSelector bound.

Using MaybeOfferingSelector for the NodePool controller is the correct design choice - it allows the controller to optionally invalidate offerings on failure without requiring an offering selector to always be present.

crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (9)

77-101: LGTM - with_offering_selector constructor.

The separate constructor for controllers with an offering selector is a clean design pattern. It maintains backward compatibility with the existing new() constructor while enabling the new offering-aware behavior.


103-120: LGTM - invalidate_offering_on_failure implementation.

The MaybeOfferingSelector bound correctly enables optional offering invalidation. The warning log provides visibility into cache invalidation events for debugging failed rentals.


355-399: Good defensive pattern - persisting GPU metadata immediately.

Fetching offering details and persisting to status before starting the rental is a sound crash-recovery pattern. If the process crashes after fetching but before persisting, the next reconciliation will re-fetch.

The event emission for unavailable offerings provides user visibility.


406-415: LGTM - Offering invalidation on rental failure.

Invalidating the offering from the cache after a rental failure prevents the autoscaler from repeatedly attempting to provision with the same failing offering. The offering will reappear after the cache TTL expires, allowing transient issues to resolve.


646-696: Proactive GPU label application prevents scheduling race conditions.

Applying GPU labels immediately when the node is detected (before Ready) is the correct approach. This prevents pods from failing to schedule during the brief window between node readiness and label application.

The retry logic with warning logs and status persistence handles transient label application failures gracefully.


758-767: LGTM - Label re-application during Ready phase.

Re-checking and re-applying labels during health checks handles recovery from the Unhealthy state and ensures labels remain consistent even if they were inadvertently removed.


1194-1226: LGTM - build_gpu_labels helper.

The helper correctly:

  • Normalizes GPU model names for consistent matching
  • Prefers status.offering_id over spec.secure_cloud.offering_id
  • Returns an empty map when no GPU info is available (safe fallback)

1228-1241: LGTM - labels_need_reapplication helper.

Simple and correct logic - returns true if any expected label is missing or has a different value. Handles the None labels case appropriately.
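
A sketch of that check under an assumed signature; the real helper works against the node's label map, but the decision logic is the one described above.

use std::collections::BTreeMap;

// Returns true if any expected label is missing from the node or carries a
// different value; a node with no labels at all always needs reapplication
// (unless nothing is expected).
fn labels_need_reapplication(
    expected: &BTreeMap<String, String>,
    node_labels: Option<&BTreeMap<String, String>>,
) -> bool {
    let Some(actual) = node_labels else {
        return !expected.is_empty();
    };
    expected
        .iter()
        .any(|(key, value)| actual.get(key) != Some(value))
}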


1317-1431: LGTM - Comprehensive test coverage for label helpers.

The tests cover key scenarios:

  • All labels present → no reapplication needed
  • Missing labels → reapplication needed
  • Mismatched values → reapplication needed
  • Empty expected set → no reapplication needed
  • Node with no labels → reapplication needed

Good edge case coverage.

crates/basilica-autoscaler/src/offering_matcher.rs (9)

38-47: LGTM - GPU model normalization.

The normalization function correctly:

  • Converts to uppercase for case-insensitive matching
  • Removes non-alphanumeric characters (spaces, hyphens, underscores)
  • Produces consistent output for various input formats

This enables flexible user input while ensuring reliable matching.
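
A minimal sketch of that normalization, assuming the behavior described above; the real function may differ in details.

// Uppercase and drop anything that is not ASCII alphanumeric, so
// "rtx 4090", "RTX-4090" and "rtx_4090" all become "RTX4090".
fn normalize_gpu_model(model: &str) -> String {
    model
        .chars()
        .filter(|c| c.is_ascii_alphanumeric())
        .map(|c| c.to_ascii_uppercase())
        .collect()
}

#[cfg(test)]
mod normalize_sketch_tests {
    use super::normalize_gpu_model;

    #[test]
    fn variants_collapse_to_one_form() {
        for input in ["rtx 4090", "RTX-4090", "rtx_4090", "RTX 4090"] {
            assert_eq!(normalize_gpu_model(input), "RTX4090");
        }
    }
}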


49-72: LGTM - GpuRequirements with normalization.

Using BTreeSet for gpu_models ensures:

  1. Order-independent equality (important for hash-based grouping)
  2. Deterministic iteration order
  3. No duplicate models

Normalizing models in the constructor ensures all comparisons use the normalized form.


122-139: LGTM - merge_with for constraint composition.

The merge logic correctly implements "self takes precedence when set":

  • Empty providers list falls back to other's providers
  • Option fields use or_else for proper fallback
  • This enables template-level defaults with policy-level overrides

257-293: LGTM - Double-checked locking for cache refresh.

The implementation correctly:

  1. Uses read lock for fast-path cache hit
  2. Acquires write lock and re-checks before fetching (prevents thundering herd)
  3. Applies timeout to API call to prevent blocking indefinitely
  4. Records cache hits/misses for observability

The Arc<Vec<GpuOffering>> pattern enables efficient sharing of the cached offerings.
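
A condensed sketch of that pattern, assuming a tokio RwLock-guarded cache, anyhow for error handling, and an injected fetch closure standing in for the API call; names are illustrative.

use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::RwLock;

// Stand-in for GpuOffering.
#[derive(Clone)]
struct Offering;

struct OfferingCache {
    offerings: Arc<Vec<Offering>>,
    fetched_at: Instant,
}

// Double-checked refresh: fast path under the read lock, then re-check under
// the write lock so only one task fetches when the TTL has expired.
async fn cached_offerings<F, Fut>(
    cache: &RwLock<OfferingCache>,
    ttl: Duration,
    api_timeout: Duration,
    fetch: F,
) -> anyhow::Result<Arc<Vec<Offering>>>
where
    F: Fn() -> Fut,
    Fut: std::future::Future<Output = anyhow::Result<Vec<Offering>>>,
{
    {
        let guard = cache.read().await;
        if guard.fetched_at.elapsed() < ttl {
            return Ok(Arc::clone(&guard.offerings)); // cache hit
        }
    }
    let mut guard = cache.write().await;
    if guard.fetched_at.elapsed() < ttl {
        return Ok(Arc::clone(&guard.offerings)); // another task refreshed first
    }
    let fresh = tokio::time::timeout(api_timeout, fetch()).await??;
    guard.offerings = Arc::new(fresh);
    guard.fetched_at = Instant::now();
    Ok(Arc::clone(&guard.offerings))
}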


380-395: LGTM - Copy-on-Write pattern for cache invalidation.

The Arc::try_unwrap with fallback to clone is an efficient CoW pattern:

  • If no other references exist, reuses the allocation
  • If shared, clones only when necessary
  • Retains the filtered list in a new Arc

This avoids unnecessary allocations when the cache is exclusively owned.
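
In sketch form, the copy-on-write invalidation could look like this, assuming the cache stores Arc<Vec<GpuOffering>> as described; the field name is an assumption.

use std::sync::Arc;

#[derive(Clone)]
struct Offering { id: String }

// Remove one offering from a cached Arc<Vec<_>>: reuse the allocation when we
// hold the only reference, otherwise clone the list before filtering.
fn invalidate(cached: Arc<Vec<Offering>>, offering_id: &str) -> Arc<Vec<Offering>> {
    let mut list = Arc::try_unwrap(cached).unwrap_or_else(|shared| (*shared).clone());
    list.retain(|o| o.id != offering_id);
    Arc::new(list)
}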


412-419: LGTM - Non-blocking staleness check.

Using try_read() and returning false when the lock is held prevents the thundering herd problem. When another task holds the write lock (likely refreshing), callers proceed with potentially stale data rather than all blocking.


422-440: LGTM - Offering comparison logic.

The comparison priorities are well-designed:

  1. Exact GPU count match preferred (avoid over-provisioning)
  2. Lower hourly rate (cost optimization)
  3. Fewer GPUs (minimize waste when no exact match)
  4. Deterministic by ID (stable selection across reconciliations)

Using total_cmp for f64 handles NaN gracefully.
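
The described ordering, sketched as a comparator; the field names are assumptions consistent with the rest of this review rather than the actual struct.

use std::cmp::Ordering;

struct Offering { id: String, gpu_count: u32, hourly_rate: f64 }

// Prefer exact GPU-count matches, then cheaper offerings, then fewer GPUs,
// and break remaining ties deterministically by id.
fn compare_offerings(a: &Offering, b: &Offering, requested_gpus: u32) -> Ordering {
    let inexact = |o: &Offering| o.gpu_count != requested_gpus; // false (exact) sorts first
    inexact(a)
        .cmp(&inexact(b))
        .then_with(|| a.hourly_rate.total_cmp(&b.hourly_rate))
        .then_with(|| a.gpu_count.cmp(&b.gpu_count))
        .then_with(|| a.id.cmp(&b.id))
}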


637-658: Verify division behavior when offering.gpu_count equals pod_gpu_request.

The calculation pods_per_node = offering.gpu_count / pod_gpu_request is correct, but consider adding a test case for when they're equal (e.g., 4-GPU offering, 4-GPU request → 1 pod per node).

Current tests cover 8/2=4 and 4/1=4 but not the edge case where they match exactly.
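
The missing edge case could be covered along these lines; the helper shape here is hypothetical, since the real calculate_nodes_needed takes richer types, and only the arithmetic is the point.

#[cfg(test)]
mod nodes_needed_sketch {
    // Hypothetical shape: nodes_needed(pending_pods, pod_gpu_request, offering_gpu_count).
    fn nodes_needed(pending_pods: u32, pod_gpu_request: u32, offering_gpu_count: u32) -> u32 {
        let pods_per_node = (offering_gpu_count / pod_gpu_request).max(1);
        pending_pods.div_ceil(pods_per_node)
    }

    #[test]
    fn exact_match_gives_one_pod_per_node() {
        // 4-GPU offering, 4-GPU request: each pending pod needs its own node.
        assert_eq!(nodes_needed(3, 4, 4), 3);
    }
}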


660-1136: LGTM - Comprehensive test suite.

The tests provide excellent coverage:

  • Normalization edge cases
  • Hash equality for requirements
  • Offering comparison priorities
  • Node calculation scenarios
  • Constraint filtering
  • Cache behavior (hit, miss, invalidation, refresh)
  • Mock API implementations

The mock implementations are minimal but sufficient for testing.

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (7)

94-126: LGTM - Provisioning count guard prevents stuck counter.

This is a good defensive measure. The pending_scale_up counter could become stuck if NodePools are deleted externally or if status updates fail. Clamping it to the actual count of provisioning NodePools ensures accurate state representation.

The warning log provides visibility into when this correction occurs.


165-224: LGTM - Rich error handling for NoMatchingOffering.

The error handling correctly:

  1. Records the OfferingAvailability condition with detailed diagnostics
  2. Persists status before returning the error (ensures condition is visible)
  3. Clears the condition on success to prevent stale warnings
  4. Preserves the original error for the caller

432-490: LGTM - Static vs dynamic offering selection.

The branching logic is clear:

  • Static mode: Uses explicit offering_id from template, bypasses dynamic selection
  • Dynamic mode: Groups pods by requirements, selects offerings per group

Returning early for static mode avoids unnecessary pod grouping.


648-653: Timestamp format includes milliseconds - verify uniqueness guarantee.

The pool name uses %Y%m%d%H%M%S%3f (milliseconds precision) with an index i. If scale_up_with_offering is called multiple times within the same millisecond (unlikely but possible under high load), the index differentiates them.

This should be sufficient for uniqueness, but consider using a UUID suffix if collision-resistance becomes a concern.


711-759: Good UX - warning events for missing nodeAffinity.

Emitting events to pods that request specific GPU models without nodeAffinity helps users understand potential scheduling issues. The message is clear and actionable.

The per-pod lookup to find the raw Pod is O(n²) but acceptable given typical pending pod counts are small.


806-830: LGTM - Constraint merging logic.

The effective constraints correctly merge template and policy fields:

  • Policy constraints take precedence via merge_with
  • Template fields are used only when policy constraints are absent
  • Returns None when no constraints are specified at either level

886-891: LGTM - OfferingResult enum.

Clean design for representing the three possible outcomes of offering lookup:

  • Found: Normal path, offering selected
  • UsedFallback: Fallback was used, nodes already created
  • NotFound: No matching offering, caller handles the error
crates/basilica-cli/src/cli/handlers/deploy/mod.rs (2)

12-12: LGTM!

The new templates module is correctly declared as pub to allow external access to the deployment template handlers.


41-50: LGTM!

The new Vllm and Sglang deploy actions are cleanly integrated following the existing pattern. The destructuring and delegation to template handlers is consistent with other action handlers in this match block.

crates/basilica-cli/src/cli/handlers/deploy/templates/mod.rs (1)

1-13: LGTM!

Clean module structure with appropriate documentation and re-exports. The public API surface is well-organized, exposing the key items (GpuRequirements, estimate_gpu_requirements, and the deploy handlers) for downstream use.

crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (1)

36-157: LGTM!

The main deployment handler is well-structured with clear separation of concerns: GPU requirement estimation, resource building, deployment creation with retry, and status waiting. The flow handles both attached and detached modes appropriately.

crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (2)

194-228: LGTM!

The SGLang command building correctly uses python -m sglang.launch_server with appropriate arguments for model path, host, port, and optional parameters like tensor parallelism, context length, quantization, and memory fraction.


36-157: LGTM!

The SGLang deployment handler follows the same well-structured pattern as vLLM, with appropriate framework-specific customizations for the default model, port, and command structure.

crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (3)

23-26: LGTM!

Good use of LazyLock for lazy initialization of the compiled regex. The pattern (\d+\.?\d*)b correctly captures parameter counts like "7b", "70b", "0.5b", "1.5b" after the input is lowercased.


62-71: LGTM!

The memory estimation formula (2GB per billion parameters × 1.2 overhead, rounded up to 8GB) is a reasonable heuristic for FP16 models with KV cache overhead.
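
Written out, the heuristic amounts to the following; the regex and rounding mirror the description above, and this is an illustration rather than the exact CLI code.

use std::sync::LazyLock;
use regex::Regex;

// "(\d+\.?\d*)b" against a lowercased model name captures "7b", "70b", "0.5b", ...
static PARAM_RE: LazyLock<Regex> =
    LazyLock::new(|| Regex::new(r"(\d+\.?\d*)b").expect("valid regex"));

// ~2 GB per billion parameters (FP16) plus 20% KV-cache/runtime overhead,
// rounded up to the next multiple of 8 GB.
fn estimate_vram_gb(model: &str) -> Option<u32> {
    let caps = PARAM_RE.captures(&model.to_lowercase())?;
    let billions: f64 = caps[1].parse().ok()?;
    let raw_gb = billions * 2.0 * 1.2;
    Some(((raw_gb / 8.0).ceil() as u32) * 8)
}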


131-187: LGTM!

Comprehensive unit tests covering parameter extraction, memory estimation, GPU count calculation, and GPU recommendations. The tests verify edge cases like fractional parameters (0.5B), capped GPU counts, and various model sizes.

crates/basilica-cli/src/cli/commands.rs (1)

736-762: New vllm / sglang deploy actions are wired cleanly

The new subcommands and their flattened option groups are consistent with existing clap patterns in this file and give a clear, dedicated surface for templated inference deployments. No issues from this diff alone.

crates/basilica-sdk-python/python/basilica/templates/__init__.py (1)

1-14: Template API re-exports look good

The module docstring and __all__ surface (GpuRequirements, estimate_gpu_requirements) are consistent with model_size.py and give a clear, minimal public API for templates. No changes needed.

…lity

* Extract namespace definitions to separate namespaces.yaml file
* Add secrets-template.yaml with placeholder credentials documentation
* Add .gitignore to prevent accidental secrets.yaml commits
* Remove ServiceMonitor from deployment.yaml (moved elsewhere)
* Update rollout.sh to use basilica-autoscaler namespace
* Add `basilica deploy vllm` command for vLLM inference server deployment
* Add `basilica deploy sglang` command for SGLang inference server deployment
* Implement model_size module for auto-detecting GPU requirements from model names
* Add TemplateCommonOptions for shared deployment parameters (name, gpu, memory, ttl)
* Add VllmOptions for vLLM-specific settings (tensor_parallel_size, dtype, quantization)
* Add SglangOptions for SGLang-specific settings (context_length, mem_fraction_static)
* Support automatic storage configuration for HuggingFace model caching
* Add regex dependency for parameter extraction from model names
* Add BasilicaClient.deploy_vllm() for one-line vLLM deployments
* Add BasilicaClient.deploy_sglang() for one-line SGLang deployments
* Implement templates/model_size.py for GPU requirement estimation
* Auto-detect GPU count and recommended GPU based on model size
* Support all vLLM options (tensor_parallel_size, dtype, quantization, etc.)
* Support all SGLang options (context_length, mem_fraction_static, etc.)
* Auto-configure persistent storage for HuggingFace model caching
* Auto-generate deployment names from model identifiers
* Add 19_deploy_vllm_template.py demonstrating deploy_vllm() method
* Add 20_deploy_sglang_template.py demonstrating deploy_sglang() method
* Show minimal configuration using defaults
* Show custom model deployment with advanced options
* Include cleanup examples
* Extend AutoscalerError with NoMatchingOffering struct variant for detailed error reporting
* Add ApiTimeout error variant for timeout handling in API operations
* Add get_offering method to SecureCloudApi trait for single offering lookup
* Implement K8s event creation methods for Pod and NodePool visibility
* Add offering selection metrics: selection counts, cache hits/misses, duration histograms
* Add no_matching_offering counter for tracking failed selections
* Implement OfferingMatcher with caching and TTL-based refresh (60s default)
* Add OfferingSelector trait for testability and dependency injection
* Support GPU model matching with normalization and prefix matching
* Add OfferingConstraints for provider/region/rate filtering with fallback support
* Implement pod grouping by GPU requirements for efficient batch scaling
* Add node label constants (basilica.ai/gpu-model, gpu-count, gpu-memory-gb)
* Extract GPU requirements from pod annotations and labels
* Support has_gpu_node_affinity detection for scheduling validation
* Include comprehensive test suite with mock API implementations
* Add offering_constraints field for dynamic offering selection
* Make SecureCloudTemplate.offering_id optional to enable dynamic mode
* Update field descriptions to clarify dynamic selection behavior
* Support provider, region, max_hourly_rate, and fallback_offering_id constraints
…Policy controller

* Add OfferingSelector dependency to controller for dynamic offering lookup
* Implement pod grouping by GPU requirements for batch provisioning
* Support static offering_id mode (bypass dynamic selection) for backwards compatibility
* Add fallback_offering_id support when no dynamic match found
* Emit K8s events to pods when no matching offering is available
* Add OfferingAvailability condition to ScalingPolicy status for diagnostics
* Implement pending_scale_up counter correction to prevent stuck state
* Add millisecond precision to NodePool names for collision avoidance
* Emit MissingNodeAffinity warnings for pods lacking GPU node affinity
* Build effective constraints by merging policy-level and template-level settings
* Add offering_id field to track which GPU offering provisioned the node
* Add labels_applied boolean to track GPU label application state
* Prevents race conditions between node Ready and label application
…ache invalidation

* Add optional OfferingSelector for cache invalidation on rental failures
* Fetch and persist GPU metadata (model, count, memory) from offering API
* Apply GPU labels to nodes before Ready to prevent scheduling race conditions
* Implement labels_need_reapplication check for recovery scenarios (sketched after this list)
* Build GPU labels using node_labels constants for consistency
* Add OfferingUnavailable event emission when offering not found
* Include GpuInfo in node registration API call
* Fix phase timeout to skip u64::MAX (infinite timeout for Ready/Failed phases)
* Re-apply GPU labels during health checks for crash recovery
* Add comprehensive unit tests for label reapplication logic
* Initialize OfferingMatcher with metrics integration on startup
* Wire offering matcher to NodePool controller for cache invalidation
* Wire offering matcher to ScalingPolicy controller for dynamic selection
* Update controller context structs to support generic offering selector type
* Add MaybeOfferingSelector bounds to reconcile functions for flexibility
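
The labels_need_reapplication check referenced above boils down to a map comparison; a sketch assuming the desired GPU labels are a plain key/value map rather than the controller's exact types:

use std::collections::BTreeMap;

/// True if any desired GPU label is missing from the node or has drifted.
fn labels_need_reapplication(
    desired: &BTreeMap<String, String>,
    node_labels: &BTreeMap<String, String>,
) -> bool {
    desired
        .iter()
        .any(|(key, value)| node_labels.get(key) != Some(value))
}
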
…tion

* Update ScalingPolicy CRD with offeringConstraints schema
* Make SecureCloudTemplate.offeringId nullable for dynamic mode
* Add labelsApplied and offeringId fields to NodePool status
* Add ServiceMonitor for Prometheus metrics scraping
* Add default ScalingPolicy manifest with dynamic offering example
* Document offering_constraints fields with usage examples
* Add gpu-request.basilica.ai/count label to pods with GPU requirements
* Add gpu-request.basilica.ai/model label for GPU model preferences
* Add autoscaler.basilica.ai/gpu-requirements annotation with full JSON spec
* Enables autoscaler to extract GPU requirements from pending pods, as shown in the sketch below
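
On the autoscaler side, reading the annotation back is a small serde step; the GpuRequirementsAnnotation struct below is an assumed shape for illustration, not the published schema:

use std::collections::BTreeMap;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct GpuRequirementsAnnotation {
    count: u32,
    #[serde(default)]
    model: Vec<String>,
    #[serde(default)]
    min_gpu_memory_gb: Option<u32>,
}

fn parse_gpu_requirements(
    annotations: &BTreeMap<String, String>,
) -> Option<GpuRequirementsAnnotation> {
    let raw = annotations.get("autoscaler.basilica.ai/gpu-requirements")?;
    serde_json::from_str(raw).ok()
}
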
…ning

* Change capacity validation from error to warning for insufficient capacity
* Allow deployments to proceed with pending pods for autoscaler to detect
* Allow scale operations to proceed even when cluster lacks immediate capacity
* Autoscaler will provision nodes when it detects pending GPU pods
* Format model_size.rs test comment on separate line
* Format sglang.rs generate_sglang_name return expression
* Format sglang.rs print_sglang_success curl example
* Format vllm.rs gpu-memory-utilization argument extension
…dpoint

* Add GpuPricesResponse wrapper struct for paginated API response
* Rename GpuOffering fields to match API (gpu_memory_gb_per_gpu, hourly_rate_per_gpu)
* Add deserialize_string_or_f64 helper for flexible numeric parsing (sketched after this list)
* Update API endpoint from /v1/secure-cloud/offerings to /secure-cloud/gpu-prices
* Update RentalInfo struct to match API response (deployment_id, ip_address fields)
* Change stop_rental from DELETE to POST /rentals/{id}/stop
* Add convenience accessor methods (gpu_memory_gb, hourly_rate, available)
* Add tests for deserialization with string rates and field aliases
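
The string-or-number parsing mentioned above is commonly done with an untagged serde enum; this is a generic sketch of the pattern rather than the exact helper in secure_cloud.rs:

use serde::{Deserialize, Deserializer};

/// Accepts either a JSON number or a numeric string such as "1.25".
fn deserialize_string_or_f64<'de, D>(deserializer: D) -> Result<f64, D::Error>
where
    D: Deserializer<'de>,
{
    #[derive(Deserialize)]
    #[serde(untagged)]
    enum NumOrString {
        Num(f64),
        Str(String),
    }

    match NumOrString::deserialize(deserializer)? {
        NumOrString::Num(n) => Ok(n),
        NumOrString::Str(s) => s.parse::<f64>().map_err(serde::de::Error::custom),
    }
}

#[derive(Deserialize)]
struct GpuPrice {
    #[serde(deserialize_with = "deserialize_string_or_f64")]
    hourly_rate_per_gpu: f64,
}
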
…tcher tests

* Add allow_model_fallback field to OfferingConstraints for flexible GPU selection
* Update accessor methods to use GpuOffering's new convenience methods
* Simplify region matching logic (region is now non-optional)
* Add test_offering helper function to reduce test boilerplate
* Refactor all tests to use test_offering helper for cleaner assertions
* Update NodePoolController to use GpuOffering accessor methods
* Update RentalInfo field mappings (deployment_id, ip_address)
* Add model fallback logic in ScalingPolicyController when allow_model_fallback enabled
* Log warnings when falling back to alternative GPU model
…ntals

* Add health_check_interval_secs to AggregatorConfig (default 60s)
* Update rental_health_check_interval to use configured value
* Rename constant to DEFAULT_SECURE_CLOUD_HEALTH_CHECK_INTERVAL_SECS
* Add ID field to registered SSH key display for easier reference
* Add allowModelFallback field to ScalingPolicy CRD offeringConstraints
* Extract ServiceMonitor to separate servicemonitor.yaml file
* Update scaling-policy.yaml with new scaleUp/scaleDown structure
* Add metrics collection configuration to ScalingPolicy
* Add scaling-policy.template.yaml for reference configuration
* Rename secrets-template.yaml to secrets.template.yaml for consistency
* Add scaling-policy.yaml to .gitignore (now generated from template)
* Update lifecycle config with deleteGracePeriodSeconds and forceDrain
@epappas epappas force-pushed the feat/k3s-autoscaler-dynamic-offering branch from b9ce972 to 0753e19 on December 15, 2025 at 16:20

@coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)

84-127: Address inconsistent pending_scale_up handling between error cases

The comment on lines 161–164 states "pending_scale_up remains incremented" when scale_up fails, but this is inconsistent:

  • try_increment_pending_scale_up atomically increments the counter in the CRD
  • On scale_up() success: you increment status.pending_scale_up locally (line 171), preserving the change
  • On AutoscalerError::NoMatchingOffering: you do not increment status.pending_scale_up, then call update_scaling_policy_status (line 215) with the unmodified status object, which patches the CRD and resets the counter back to its pre-increment value
  • On other errors: you return immediately without calling update_scaling_policy_status, leaving the counter incremented as intended

If the intent is to keep the counter incremented only when NodePools were actually created and will be handled by the NodePool controller, then the NoMatchingOffering case is correct (do not persist the increment). Otherwise, either explicitly adjust status.pending_scale_up in the NoMatchingOffering handler or refine the comment to clarify that the counter is reset only when no offerings match.

♻️ Duplicate comments (3)
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (1)

118-129: GPU recommendation strings do not align with autoscaler offerings.

This issue was already flagged in a previous review. The GPU strings like "NVIDIA-RTX-A4000" and "A100-40GB" don't match the autoscaler's offering matcher which uses plain model names like "A100" and "H100".

crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (1)

241-258: Shared helpers are still duplicated between vLLM and SGLang templates.

parse_env_vars, create_with_retry, is_quota_exceeded, extract_quota_message, WaitResult, wait_for_ready, and format_phase_message are effectively the same as in sglang.rs. Extracting these into a shared templates::common module (or similar) would reduce maintenance and keep behavior aligned across frameworks.

For example:

  • Move the shared helpers into crates/basilica-cli/src/cli/handlers/deploy/templates/common.rs.
  • Re-export them from templates::mod.rs if you want a stable public path.
  • Update both vllm.rs and sglang.rs to use the shared helpers instead of their own copies.

This is non-blocking but will pay off as more templates are added.

Also applies to: 333-395, 397-478

crates/basilica-cli/src/cli/commands.rs (1)

737-762: Template subcommands still define a separate --json flag instead of using the global one

DeployCommand defines a global --json flag (Line 456–458), and TemplateCommonOptions defines another --json (Line 803–805), which is flattened into deploy vllm / deploy sglang (Line 743–747, 756–760). This means:

  • basilica deploy --json vllm … sets the global flag but the template handlers only see common.json.
  • basilica deploy vllm --json … sets the template-local flag, diverging from other deploy subcommands that rely solely on the global one.

This is inconsistent UX and risks clap argument conflicts.

Recommend removing json from TemplateCommonOptions and threading the global DeployCommand.json into the template handlers instead:

 pub struct TemplateCommonOptions {
@@
-    /// Output as JSON
-    #[arg(long)]
-    pub json: bool,
 }

Then adjust the deploy handler wiring so handle_vllm_deploy / handle_sglang_deploy receive cmd.json (from DeployCommand) as a separate parameter rather than reading common.json.

Also applies to: 764-806, 456-458

🧹 Nitpick comments (14)
crates/basilica-autoscaler/src/error.rs (3)

91-96: Richer NoMatchingOffering payload looks good; consider tightening message formatting

The struct-like variant with gpu_count, models, and min_memory_gb is a nice upgrade for observability. One nit: using {:?} for min_memory_gb plus the GB suffix will produce strings like Some(24)GB or NoneGB, which are a bit awkward to read in logs. If you care about operator UX, consider formatting the optional memory more explicitly (e.g., "min_memory: {min_memory_gb.map(|v| format!(\"{v}GB\")).unwrap_or_else(|| \"none\".into())}" or similar), even if that means introducing a small helper or a preformatted field.
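
A tiny helper along those lines, purely as a sketch (the name and the "unspecified" wording are arbitrary):

fn format_min_memory(min_memory_gb: Option<u64>) -> String {
    min_memory_gb
        .map(|gb| format!("{gb}GB"))
        .unwrap_or_else(|| "unspecified".to_string())
}

// format_min_memory(Some(24)) == "24GB"; format_min_memory(None) == "unspecified"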


123-127: ApiTimeout variant is clear; optionally avoid stringly-typed operations

The dedicated ApiTimeout { operation, timeout_secs } variant makes the source and SLA of timeouts more explicit than overloading PhaseTimeout, which is good. If the set of operations is finite and relatively stable (e.g., SecureCloudProvision, KubeScaleUp, etc.), you might later want to replace the String with an enum to avoid typos and ease filtering/metrics, but the current shape is fine for now.
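
If the operation set does stabilize, the stringly-typed field could later become a small enum; the variants below are illustrative only:

/// Hypothetical closed set of API operations, usable in errors and metric labels.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ApiOperation {
    ListOfferings,
    GetOffering,
    StartRental,
    GetRental,
    StopRental,
}

impl ApiOperation {
    fn as_str(self) -> &'static str {
        match self {
            ApiOperation::ListOfferings => "list_offerings",
            ApiOperation::GetOffering => "get_offering",
            ApiOperation::StartRental => "start_rental",
            ApiOperation::GetRental => "get_rental",
            ApiOperation::StopRental => "stop_rental",
        }
    }
}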


145-171: Reclassifying NoMatchingOffering as retryable is a behavioral change; confirm it’s intended

Marking NoMatchingOffering { .. } and ApiTimeout { .. } as retryable and removing NoMatchingOffering from is_permanent changes control-flow semantics for callers that rely on these helpers to decide whether to keep scaling or give up. For ApiTimeout this is intuitive, but for NoMatchingOffering it depends on whether the underlying condition is expected to be transient (e.g., provider capacity flapping) or effectively a configuration issue (e.g., impossible GPU requirements) in a given code path.

If some call sites are closer to “unsatisfiable config” than “transient shortage”, you may want to either:

  • Introduce a distinct permanent variant (e.g., UnfulfillableOfferingRequirements), or
  • Gate retries with a counter/backoff to avoid infinite retry loops with noisy logs.

At minimum, worth double‑checking the main scaling loops/watchers that use is_retryable/is_permanent to ensure this reclassification doesn’t accidentally change a previously-terminal failure into an unbounded retry.
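
If a bound on retries is wanted, a generic capped-backoff wrapper is one option; this is a sketch assuming a tokio runtime, not existing autoscaler code:

use std::future::Future;
use std::time::Duration;

/// Retries `op` up to `max_attempts` times with doubling delay, capped at 60s.
async fn retry_with_backoff<T, E, F, Fut>(max_attempts: u32, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    let mut delay = Duration::from_secs(1);
    let mut attempt = 0u32;
    loop {
        attempt += 1;
        match op().await {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(60));
            }
        }
    }
}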

orchestrator/k8s/services/autoscaler/scaling-policy.yaml (2)

66-82: Replace hardcoded SSH key ID with a placeholder.

The SSH key ID 35f0ab63-e983-40b0-b53d-953cf67b53d2 appears to be a real UUID. Templates should use placeholder values to avoid accidental usage of incorrect credentials.

       # SSH key ID registered with your cloud provider
-      sshKeyId: "35f0ab63-e983-40b0-b53d-953cf67b53d2"
+      sshKeyId: "<your-ssh-key-id>"  # Replace with your registered SSH key ID

83-108: Consider using a placeholder for the K3s server URL.

The server URL https://connect.basilica.ai:6443 may be environment-specific. A placeholder would make the template more universally applicable.

       # K3s server URL (your cluster's API endpoint)
-      serverUrl: "https://connect.basilica.ai:6443"
+      serverUrl: "<your-k3s-server-url>"  # e.g., https://k3s.example.com:6443
crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (1)

159-192: Extract shared name generation logic to avoid duplication.

This function is nearly identical to generate_vllm_name in vllm.rs (see relevant code snippets). Consider extracting to a shared utility function.

Create a shared helper in the templates module:

// In templates/mod.rs or a new templates/common.rs
pub fn generate_deployment_name(model: &str, prefix: &str) -> String {
    let model_part = model
        .split('/')
        .next_back()
        .unwrap_or(model)
        .to_lowercase()
        .chars()
        .filter_map(|c| {
            if c.is_ascii_alphanumeric() {
                Some(c)
            } else if c == '-' || c == '_' || c == '.' {
                Some('-')
            } else {
                None
            }
        })
        .collect::<String>();

    let sanitized = model_part.trim_matches('-');
    let prefix_part = if sanitized.len() > 40 {
        &sanitized[..40]
    } else {
        sanitized
    };
    let prefix_part = if prefix_part.is_empty() { prefix } else { prefix_part };
    format!("{}-{}-{}", prefix, prefix_part, &uuid::Uuid::new_v4().to_string()[..8])
}
crates/basilica-autoscaler/src/controllers/k8s_client.rs (1)

460-507: Consider adding a random suffix to prevent event name collisions.

Using only the nanosecond timestamp for event name uniqueness could lead to collisions if multiple events are created within the same nanosecond (e.g., during rapid pod processing). Consider adding a short random suffix.

+        use rand::Rng;
         let now = Utc::now();
-        let event_name = format!("{}.{}", pod_name, now.timestamp_nanos_opt().unwrap_or(0));
+        let random_suffix: u16 = rand::thread_rng().gen();
+        let event_name = format!(
+            "{}.{}.{}",
+            pod_name,
+            now.timestamp_nanos_opt().unwrap_or(0),
+            random_suffix
+        );
crates/basilica-autoscaler/src/api/secure_cloud.rs (1)

497-500: Consider caching or a dedicated API endpoint for single-offering lookup.

The current implementation calls list_offerings() and then filters, which is O(n) and makes a full API call each time. If single-offering lookups are frequent, consider:

  1. Adding a server-side endpoint for single-offering lookup
  2. Caching the offerings list with TTL

For now, this is acceptable if lookups are infrequent, but worth noting for future optimization.
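
If option 2 is chosen, a small TTL cache in front of list_offerings is enough; sketched below with a hypothetical Offering type and a synchronous refresh closure (the real client is async and uses its own types):

use std::time::{Duration, Instant};

// Hypothetical minimal offering record, for illustration only.
#[derive(Clone)]
struct Offering {
    id: String,
}

struct OfferingCache {
    ttl: Duration,
    fetched_at: Option<Instant>,
    offerings: Vec<Offering>,
}

impl OfferingCache {
    fn get_offering(
        &mut self,
        id: &str,
        refresh: impl FnOnce() -> Vec<Offering>,
    ) -> Option<Offering> {
        let stale = self
            .fetched_at
            .map(|t| t.elapsed() >= self.ttl)
            .unwrap_or(true);
        if stale {
            self.offerings = refresh();
            self.fetched_at = Some(Instant::now());
        }
        self.offerings.iter().find(|o| o.id == id).cloned()
    }
}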

examples/20_deploy_sglang_template.py (1)

23-23: Remove redundant f-strings in constant messages (and optionally harden cleanup).

Lines 23 and 41 use f"...ready!" without any placeholders, which triggers Ruff F541 and is unnecessary. You can drop the f prefix:

-print(f"SGLang deployment ready!")
+print("SGLang deployment ready!")

-print(f"Qwen 3B deployment ready!")
+print("Qwen 3B deployment ready!")

Optionally, you could wrap the body in a try/finally so that deployment.delete() still runs if something fails between the two deployments, but that’s not critical for a demo script.

Also applies to: 41-41

orchestrator/k8s/services/autoscaler/secrets.template.yaml (1)

27-40: Template private key block is fine but may keep triggering scanners.

The inlined OpenSSH private key is clearly a placeholder and marked “REPLACE_WITH_YOUR_SSH_PRIVATE_KEY”, so functionally this is OK. If CI security scanners (like gitleaks) are noisy on this, consider either:

  • Renaming the field to something like ssh-private-key-placeholder, or
  • Adding this path/signature to your scanner’s allowlist.

That way you keep the helpful scaffold without false-positive noise.

examples/19_deploy_vllm_template.py (1)

23-23: Drop unused f-strings in static messages (and optionally add a finally block).

Lines 23 and 41 use f"..." without interpolation; you can simplify to plain strings to satisfy Ruff F541:

-print(f"vLLM deployment ready!")
+print("vLLM deployment ready!")

-print(f"Llama-2-7B deployment ready!")
+print("Llama-2-7B deployment ready!")

As with the SGLang script, you could also wrap the main body in try/finally so deployment.delete() and deployment2.delete() run even if something fails, but that’s optional for an example.

Also applies to: 41-41

crates/basilica-sdk-python/python/basilica/__init__.py (2)

85-113: Consider propagating estimated VRAM into min_gpu_memory_gb for tighter offering constraints

deploy_vllm / deploy_sglang call estimate_gpu_requirements(model) and use:

  • reqs.gpu_count to set GpuRequirementsSpec.count
  • reqs.recommended_gpu to set candidate models

but leave min_gpu_memory_gb=None in GpuRequirementsSpec (Line 562–564, 701–703). That means offering selection only constrains by count + model string, not by the estimated VRAM requirement you already computed.

If the backend’s notion of gpu_type ever includes multiple memory SKUs, this could under-constrain scheduling. A lightweight improvement would be to derive a per‑GPU minimum from the estimate, e.g.:

per_gpu_mem = max(1, reqs.memory_gb // max(gpu_count, 1))
gpu_spec = GpuRequirementsSpec(
    count=gpu_count,
    model=effective_gpu_models,
    min_cuda_version=None,
    min_gpu_memory_gb=per_gpu_mem,
)

or at least make min_gpu_memory_gb configurable via an optional parameter so callers can tighten constraints when needed.

Also applies to: 446-599, 600-737


543-556: Static-analysis S104/S106 findings are expected in this context

The Ruff warnings about binding to all interfaces (--host 0.0.0.0) and a “hardcoded password” on credentials_secret refer to:

  • vLLM/OpenAI server and SGLang server listening inside the pod (--host 0.0.0.0).
  • A Kubernetes secret name ("basilica-r2-credentials"), not an inlined credential.

Given these run in cluster-scoped pods and only reference a secret identifier, there’s no obvious security issue. If you want to quiet the lints, you could:

  • Add per-rule ignores around these lines, or
  • Centralize the secret name in a constant and document that it’s an identifier, not a secret value.

Also applies to: 683-695

crates/basilica-autoscaler/src/offering_matcher.rs (1)

443-474: Optionally also consider GPU limits when deriving gpu_count from pods

get_gpu_count_from_pod inspects only resources.requests["nvidia.com/gpu"] on containers and init containers. In clusters where GPU requests are expressed solely via limits (with no matching request), such pods would be ignored by the matcher.

If you expect that pattern, you could extend the helper to fall back to resources.limits when requests is absent:

-        if let Some(resources) = &container.resources {
-            if let Some(requests) = &resources.requests {
-                if let Some(gpu_quantity) = requests.get("nvidia.com/gpu") {
+        if let Some(resources) = &container.resources {
+            let gpu_map = resources
+                .requests
+                .as_ref()
+                .or(resources.limits.as_ref());
+            if let Some(gpu_quantity) = gpu_map.and_then(|m| m.get("nvidia.com/gpu")) {

Same pattern could be applied to init containers. If your admission / templates always set requests, this is optional.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9ce972 and 0753e19.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (36)
  • crates/basilica-api/src/api/routes/deployments/handlers.rs (2 hunks)
  • crates/basilica-api/src/config/mod.rs (4 hunks)
  • crates/basilica-autoscaler/src/api/secure_cloud.rs (8 hunks)
  • crates/basilica-autoscaler/src/controllers/k8s_client.rs (3 hunks)
  • crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (10 hunks)
  • crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (13 hunks)
  • crates/basilica-autoscaler/src/crd/node_pool.rs (2 hunks)
  • crates/basilica-autoscaler/src/crd/scaling_policy.rs (3 hunks)
  • crates/basilica-autoscaler/src/error.rs (3 hunks)
  • crates/basilica-autoscaler/src/lib.rs (2 hunks)
  • crates/basilica-autoscaler/src/metrics.rs (5 hunks)
  • crates/basilica-autoscaler/src/offering_matcher.rs (1 hunks)
  • crates/basilica-autoscaler/src/runtime.rs (6 hunks)
  • crates/basilica-cli/Cargo.toml (1 hunks)
  • crates/basilica-cli/src/cli/commands.rs (1 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/mod.rs (2 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/mod.rs (1 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (1 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (1 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (1 hunks)
  • crates/basilica-cli/src/cli/handlers/ssh_keys.rs (1 hunks)
  • crates/basilica-operator/src/controllers/user_deployment_controller.rs (2 hunks)
  • crates/basilica-sdk-python/python/basilica/__init__.py (2 hunks)
  • crates/basilica-sdk-python/python/basilica/templates/__init__.py (1 hunks)
  • crates/basilica-sdk-python/python/basilica/templates/model_size.py (1 hunks)
  • examples/19_deploy_vllm_template.py (1 hunks)
  • examples/20_deploy_sglang_template.py (1 hunks)
  • orchestrator/k8s/services/autoscaler/.gitignore (1 hunks)
  • orchestrator/k8s/services/autoscaler/crds.yaml (4 hunks)
  • orchestrator/k8s/services/autoscaler/deployment.yaml (0 hunks)
  • orchestrator/k8s/services/autoscaler/namespaces.yaml (1 hunks)
  • orchestrator/k8s/services/autoscaler/scaling-policy.template.yaml (1 hunks)
  • orchestrator/k8s/services/autoscaler/scaling-policy.yaml (1 hunks)
  • orchestrator/k8s/services/autoscaler/secrets.template.yaml (1 hunks)
  • orchestrator/k8s/services/autoscaler/servicemonitor.yaml (1 hunks)
  • scripts/autoscaler/rollout.sh (1 hunks)
💤 Files with no reviewable changes (1)
  • orchestrator/k8s/services/autoscaler/deployment.yaml
✅ Files skipped from review due to trivial changes (1)
  • orchestrator/k8s/services/autoscaler/servicemonitor.yaml
🚧 Files skipped from review as they are similar to previous changes (11)
  • crates/basilica-operator/src/controllers/user_deployment_controller.rs
  • crates/basilica-autoscaler/src/lib.rs
  • orchestrator/k8s/services/autoscaler/.gitignore
  • crates/basilica-cli/src/cli/handlers/deploy/mod.rs
  • crates/basilica-api/src/api/routes/deployments/handlers.rs
  • crates/basilica-sdk-python/python/basilica/templates/__init__.py
  • crates/basilica-cli/src/cli/handlers/deploy/templates/mod.rs
  • orchestrator/k8s/services/autoscaler/namespaces.yaml
  • crates/basilica-cli/Cargo.toml
  • crates/basilica-autoscaler/src/metrics.rs
  • scripts/autoscaler/rollout.sh
🧰 Additional context used
🧬 Code graph analysis (8)
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (2)
crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (1)
  • model (161-176)
crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (1)
  • model (162-177)
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1)
crates/basilica-autoscaler/src/offering_matcher.rs (1)
  • normalize_gpu_model (41-47)
examples/19_deploy_vllm_template.py (3)
crates/basilica-cli/src/github_releases.rs (1)
  • basilica (21-26)
crates/basilica-sdk-python/python/basilica/deployment.py (1)
  • url (183-191)
crates/basilica-cli/src/cli/handlers/deploy/validation.rs (1)
  • memory (401-402)
crates/basilica-sdk-python/python/basilica/__init__.py (2)
crates/basilica-sdk-python/python/basilica/deployment.py (3)
  • name (178-180)
  • Deployment (114-453)
  • _from_response (436-453)
crates/basilica-sdk-python/python/basilica/templates/model_size.py (1)
  • estimate_gpu_requirements (30-60)
crates/basilica-autoscaler/src/runtime.rs (2)
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (3)
  • new (59-74)
  • clone (41-50)
  • with_offering_selector (84-100)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (2)
  • new (58-70)
  • clone (42-49)
crates/basilica-sdk-python/python/basilica/templates/model_size.py (1)
crates/basilica-autoscaler/src/api/secure_cloud.rs (1)
  • gpu_memory_gb (215-217)
crates/basilica-autoscaler/src/api/secure_cloud.rs (1)
crates/basilica-billing/src/domain/rentals.rs (1)
  • offering_id (223-225)
examples/20_deploy_sglang_template.py (2)
crates/basilica-sdk-python/python/basilica/__init__.py (1)
  • deploy_sglang (600-737)
crates/basilica-sdk-python/python/basilica/deployment.py (1)
  • url (183-191)
🪛 Gitleaks (8.30.0)
orchestrator/k8s/services/autoscaler/secrets.template.yaml

[high] 38-40: Identified a Private Key, which may compromise cryptographic security and sensitive data encryption.

(private-key)

🪛 Ruff (0.14.8)
examples/19_deploy_vllm_template.py

23-23: f-string without any placeholders

Remove extraneous f prefix

(F541)


41-41: f-string without any placeholders

Remove extraneous f prefix

(F541)

crates/basilica-sdk-python/python/basilica/__init__.py

520-520: Possible binding to all interfaces

(S104)


551-551: Possible hardcoded password assigned to argument: "credentials_secret"

(S106)


667-667: Possible binding to all interfaces

(S104)


690-690: Possible hardcoded password assigned to argument: "credentials_secret"

(S106)

examples/20_deploy_sglang_template.py

23-23: f-string without any placeholders

Remove extraneous f prefix

(F541)


41-41: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (38)
crates/basilica-cli/src/cli/handlers/ssh_keys.rs (1)

405-405: LGTM! Good enhancement to display the key ID.

This addition provides users with the SSH key ID, which can be useful for debugging or cross-referencing with API calls. The formatting and styling are consistent with the adjacent fields.

crates/basilica-api/src/config/mod.rs (5)

28-29: LGTM!

The constant is appropriately named and documented. The 60-second default is reasonable for health check intervals.


129-131: LGTM!

The field is properly documented and uses appropriate types. The serde default attribute ensures backward compatibility with existing configuration files that don't specify this field.


141-143: LGTM!

The function follows the established pattern in this file for providing serde defaults and correctly returns the constant.


149-149: LGTM!

The field initialization correctly uses the default function, maintaining consistency with the existing pattern.


311-313: No remaining usages of the removed constant detected.

The verification confirms that the constant RENTAL_HEALTH_CHECK_INTERVAL_SECS has been completely removed from the codebase with no dangling references. The method rental_health_check_interval() now correctly uses the configurable field self.aggregator.health_check_interval_secs, and all related code has been properly updated. The change is safe to merge.

orchestrator/k8s/services/autoscaler/scaling-policy.yaml (4)

1-27: LGTM - Well-documented policy header and scale-up configuration.

The file header clearly explains the purpose and includes the apply command. The scale-up configuration with pendingPodThreshold: 1 ensures responsive scaling while cooldownSeconds: 60 prevents thrashing.


28-40: LGTM - Reasonable scale-down configuration.

The scale-down parameters provide good defaults: the 10-minute idle timeout, 10% GPU utilization threshold, and 5-minute zero-pods duration strike a balance between cost savings and stability.


41-65: LGTM - Dynamic offering constraints well-documented.

The offering constraints section clearly explains each option with helpful inline comments. The allowModelFallback: true default is sensible for availability.


109-115: LGTM - Lifecycle hooks configuration is sensible.

The drain-before-removal with a 10-minute timeout and 30-second grace period provides safe node removal while forceDrain: false respects PDBs.

crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (6)

36-68: LGTM - Clean deployment flow with sensible defaults.

The function properly handles optional model input, auto-detects GPU requirements, and provides informative output when auto-detecting. The name generation fallback is appropriate.


230-247: LGTM - Robust environment variable parsing.

The parse_env_vars function correctly validates the KEY=VALUE format and provides clear error messages. Using splitn(2, '=') properly handles values containing = characters.


292-320: LGTM - Well-configured health checks with appropriate timing.

The health check configuration uses sensible values:

  • Startup probe with 60 failure threshold allows up to 10 minutes for model loading
  • Liveness probe with 60s initial delay gives time for model initialization
  • Readiness probe is appropriately more sensitive with 10s period

322-354: LGTM - Solid retry implementation with jitter.

The exponential backoff with jitter (0.75-1.25 factor) is well-implemented and follows best practices. The quota detection logic properly surfaces quota errors immediately without retrying.


386-452: LGTM - Well-structured wait loop with phase-aware polling.

The wait loop implementation is clean:

  • Dynamic sleep duration based on phase is efficient
  • Phase transitions trigger informative user messages
  • Terminal states (Failed, Terminating) are properly detected

20-27: The use of lmsysorg/sglang:latest aligns with SGLang's official documentation, which explicitly recommends this tag. The current stable release is v0.5.6.post2. No changes needed.

Likely an incorrect or invalid review comment.

crates/basilica-autoscaler/src/controllers/k8s_client.rs (2)

75-97: LGTM - Well-designed event trait methods.

The trait extensions for pod and node pool events provide good observability hooks. The method signatures appropriately make pod_uid/pool_uid optional to handle cases where UIDs may not be available.


509-556: LGTM - Consistent event creation for NodePool resources.

The NodePool event implementation correctly uses the CRD's API version (autoscaler.basilica.ai/v1alpha1) and appropriate action type (Provisioning). The structure mirrors the pod event implementation for consistency.

crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (4)

1-50: LGTM - Clean module structure with clear documentation.

The module is well-organized with appropriate documentation. The estimate_gpu_requirements function provides a sensible fallback chain: parameter extraction → model family detection → default.


52-71: LGTM - Reasonable memory estimation heuristics.

The formula (2GB × params × 1.2 overhead, rounded up to 8GB) is a reasonable approximation for FP16 inference. The 20% overhead accounts for KV cache and runtime memory.


73-110: LGTM - Good coverage of common model families.

The model family detection covers major models (Llama variants, Mixtral, Qwen, Mistral, Phi, Gemma, TinyLlama) with appropriate memory estimates. The 16GB default for unknown models is conservative.


131-187: LGTM - Comprehensive test coverage.

The tests thoroughly cover parameter extraction, memory estimation, GPU count calculation, and end-to-end estimation. The floating-point comparisons use appropriate epsilon values.

crates/basilica-autoscaler/src/api/secure_cloud.rs (6)

153-187: LGTM - Robust deserialization for flexible API responses.

The deserialize_string_or_f64 implementation correctly handles string, f64, i64, and u64 inputs, making it resilient to API variations where numeric fields may be serialized as strings (common with decimal types).


196-228: LGTM - Well-structured GpuOffering with backwards compatibility.

The struct uses #[serde(alias = ...)] for backward compatibility with legacy field names (available, hourly_rate). The convenience accessors (gpu_memory_gb(), hourly_rate(), available()) provide a clean API.


329-330: LGTM - Clean trait extension for offering lookup.

Adding get_offering to the trait provides a convenient way to retrieve a specific offering by ID.


543-575: LGTM - Stop rental endpoint correctly uses POST.

The change from DELETE to POST for /secure-cloud/rentals/{id}/stop is appropriate since stopping a rental is an action rather than resource deletion.


767-829: LGTM - Comprehensive deserialization tests.

The tests cover:

  • Standard API response format
  • Legacy field aliases (available, hourly_rate)
  • String-formatted hourly rates (API's Decimal type)

This ensures robust handling of API variations.


230-243: The RentalInfo struct does not have a region field. Its fields are rental_id, deployment_id, provider, status, ip_address, ssh_command, and hourly_cost. The region field exists in the GpuOffering struct (as a required String), not in RentalInfo.

Likely an incorrect or invalid review comment.

crates/basilica-autoscaler/src/crd/node_pool.rs (1)

386-389: New status fields for offering and label state look consistent and safe.

offering_id: Option<String> and labels_applied: bool are wired with sensible serde defaults and match the camelCase CRD fields (offeringId, labelsApplied). This should integrate cleanly with dynamic offering tracking and GPU label application without breaking existing consumers.

Also applies to: 454-457

crates/basilica-autoscaler/src/crd/scaling_policy.rs (1)

7-7: Dynamic offering fields are well-integrated with the CRD model.

Making SecureCloudTemplate.offering_id optional and adding offering_constraints: Option<OfferingConstraints> on ScalingPolicySpec cleanly express “static versus dynamic” offering selection in the type system. The serde defaults/skip_serializing_if usage match the CRD and should preserve backward compatibility for existing manifests that don’t set these fields.

Also applies to: 49-52, 170-201

orchestrator/k8s/services/autoscaler/scaling-policy.template.yaml (1)

11-115: ScalingPolicy template matches the CRD and clearly documents dynamic offering behavior.

The offeringConstraints block and nodeTemplate.secureCloud fields line up with the CRD schema and give a sensible default policy for dynamic GPU provisioning (including clear comments about optional offeringId overrides and cost caps). No issues from a config/CRD compatibility perspective.

crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (1)

35-157: vLLM deployment flow, resources, and readiness handling look solid.

The overall flow in handle_vllm_deploy (GPU estimation, resource construction, storage + health checks, exponential-backoff create_with_retry, and phase-aware wait_for_ready) is cohesive and matches the intended UX for template-based deployments. build_vllm_resources and build_vllm_health_check provide sane defaults, and the success output from print_vllm_success gives users clear endpoints and usage examples. I don’t see correctness issues in these paths given the surrounding types.

Also applies to: 191-331, 333-478, 480-519

orchestrator/k8s/services/autoscaler/crds.yaml (1)

380-383: CRD schema updates align with Rust CRD structs and dynamic offering semantics.

The additions of status.labelsApplied, status.offeringId, spec.nodeTemplate.secureCloud.offeringId, and spec.offeringConstraints are consistent with the corresponding Rust types (NodePoolStatus, SecureCloudTemplate, ScalingPolicySpec). Defaults and nullability are set appropriately, so these CRDs should round-trip cleanly with the controller code while exposing the new dynamic offering and label-application state.

Also applies to: 417-420, 661-711, 715-744

crates/basilica-autoscaler/src/runtime.rs (1)

23-24: OfferingMatcher integration into controller runtime looks sound

The runtime cleanly injects a single OfferingMatcher instance (with metrics) into both NodePoolController::with_offering_selector and ScalingPolicyController::new, and the updated ControllerContext / ScalingContext generics plus reconcile bounds (S: MaybeOfferingSelector vs S: OfferingSelector) are consistent with the controller impls.

No obvious correctness issues in how the selector is constructed, shared via Arc, or threaded through reconcile/error-policy paths.

Also applies to: 55-61, 109-126, 131-156, 183-218, 221-254

crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (2)

20-33: Offering selector integration and rental failure invalidation are wired correctly

The new generic parameter S with default () plus:

  • NodePoolController::new (S = (), offering_selector: None)
  • NodePoolController::with_offering_selector (S: OfferingSelector, offering_selector: Some(Arc<S>))
  • invalidate_offering_on_failure guarded on Option<Arc<S>> and S: MaybeOfferingSelector

gives a clean way to enable offering‑cache invalidation only when an OfferingMatcher is present. The call from handle_provisioning on start_rental failure (Line 406–415) looks correct and ensures repeated rental failures for a given offering_id will evict that entry from the in‑memory cache without affecting non‑dynamic setups.

No changes needed here.

Also applies to: 35-51, 53-75, 77-101, 103-121, 401-455


646-736: GPU label application and reapplication logic is robust and well‑tested

The new helpers:

  • build_gpu_labels (Line 1194–1225) centralize construction of GPU/node/offering labels using normalized GPU model strings and status/spec precedence.
  • labels_need_reapplication (Line 1228–1241) correctly detects missing/mismatched labels, with thorough unit tests (Line 1318–1431).

Their use in:

  • handle_waiting_for_node to eagerly apply labels as soon as the node appears (before Ready), and to ensure labels are applied before transitioning to Ready.
  • handle_ready to re-apply labels when drift is detected.

nicely covers both the initial race window and later drift / recovery scenarios. This should materially reduce “node is Ready but unschedulable due to missing labels” issues.

Also applies to: 738-785, 1194-1241, 1318-1431

crates/basilica-autoscaler/src/offering_matcher.rs (1)

49-72: Core offering matcher design (caching, selection, grouping) looks solid

Highlights:

  • GpuRequirements normalizes model names into a BTreeSet, making grouping by requirements deterministic and order‑insensitive (Line 49–72, 696–705).
  • OfferingMatcher’s cache (RwLock<OfferingCache>, TTL, tokio::time::timeout wrapped API calls) with metrics hooks is well‑structured (Line 206–221, 228–301).
  • Model matching (exact + forward prefix only), memory filtering, and constraint checks are sensibly layered before compare_offerings, which encodes a clear preference order (exact GPU count → cheaper → fewer GPUs → stable ID) (Line 303–357, 423–441).
  • Pod grouping and node-count calculation (group_pending_pods_by_requirements, calculate_nodes_needed) are correct for homogeneous groups and well tested (Line 620–660, 734–783).

The test suite exercises normalization, equality, comparison ordering, cache behavior, and constraint filters thoroughly, which gives good confidence in the matcher’s behavior.

Also applies to: 206-221, 228-301, 303-357, 359-421, 423-441, 620-660, 661-905

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)

432-490: Dynamic GPU offering–aware scale-up, constraints, and pod visibility are well-structured

The new dynamic path in scale_up and helpers:

  • Groups pending GPU pods by normalized GpuRequirements (via group_pending_pods_by_requirements) and guards against over‑creating nodes with the max_count / remaining logic (Line 432–490, 501–559, 630–659).
  • Uses OfferingSelector + OfferingConstraints with a clear precedence/merge strategy (build_effective_constraints) so policy‑level constraints override template hints while still honoring template defaults (Line 501–559, 831–855).
  • Handles three offering cases explicitly: exact match, relaxed model match (when allow_model_fallback), and explicit fallback offering_id, each with appropriate logging, metrics and, in the fallback case, immediate NodePool creation (Line 561–627, 911–916).
  • Surfaces user‑visible K8s events for two important failure modes:
    • Pods that specify GPU models but lack nodeAffinity (MissingNodeAffinity) (Line 738–784).
    • Pods for which no offering could be found (OfferingUnavailable) (Line 786–829).
  • Ensures NodePools are always created via scale_up_with_offering that wires the chosen offering_id into SecureCloudConfig and tags NodePools with the scaling policy label (Line 661–699, 857–901).

Overall this gives a clear, explainable autoscaling behavior around GPU offerings and makes diagnosis much easier for users.

Also applies to: 492-559, 561-627, 630-659, 738-784, 786-829, 831-855, 857-901, 911-916, 918-949

…hain monitor

* Add reconnect() method to re-establish blockchain connection on failures
* Reuse LegacyRpcMethods client instead of creating new RPC per block fetch
* Wrap client and rpc_methods in Arc<RwLock> for thread-safe reconnection (sketched after this list)
* Add is_block_likely_pruned() helper to detect pruned blocks on non-archive nodes
* Extract is_insecure_endpoint() helper to reduce code duplication
* Fix get_transfers_from_block to be a static method for cleaner code
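
The Arc<RwLock> reconnection pattern from the bullets above, in miniature; Connection is a hypothetical stand-in for the subxt client plus LegacyRpcMethods pair:

use std::sync::Arc;
use tokio::sync::RwLock;

struct Connection {
    endpoint: String,
}

#[derive(Clone)]
struct ChainHandle {
    conn: Arc<RwLock<Connection>>,
}

impl ChainHandle {
    /// Swap in a fresh connection; readers block only for the duration of the swap.
    async fn reconnect(&self, endpoint: &str) {
        let mut guard = self.conn.write().await;
        *guard = Connection {
            endpoint: endpoint.to_string(),
        };
    }

    /// Run a read-only operation against the shared connection,
    /// e.g. handle.with_connection(|c| c.endpoint.clone()).await
    async fn with_connection<T>(&self, f: impl FnOnce(&Connection) -> T) -> T {
        let guard = self.conn.read().await;
        f(&guard)
    }
}
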
…and skip behavior

* Add MonitorConfig with max_block_retries, block_retention_threshold, max_block_gap, auto_reconnect
* Add config validation for new blockchain resilience parameters
* Implement automatic block skipping when failures exceed retry threshold
* Add reconnection attempts after multiple consecutive connection failures
* Detect and handle likely-pruned blocks using bittensor is_block_likely_pruned
* Handle large block gaps on service restart by skipping to current block
* Export MonitorConfig from blockchain module for use in main
* Initialize ChainMonitor with config from BlockchainConfig
…n list

* Use only floating_ip for Hyperstack VMs (fixed_ip is VPC-internal only)
* Refresh active rentals from provider before listing to get latest IP/status
* Poll provider for pending/provisioning rentals to capture IP assignment
* Autoscaler will now poll until floating_ip is assigned by Hyperstack
…s and API types

* Add WireGuardPeerStatus struct for storing peer config in NodePool status
* Add peers field to WireGuardStatus for registration response storage
* Add ssh_user field to SecureCloudConfig and SecureCloudTemplate (default: ubuntu)
* Add datacenter_id field to SecureCloudTemplate for node registration
* Change WireGuardConfig default enabled to true for remote nodes
* Refactor NodeRegistrationRequest to use GpuSpecs instead of optional GpuInfo
* Update NodeRegistrationResponse to make wireguard optional with peers list
* Add WireGuardPeer.allowed_ips() method to compute allowed IPs including pod network (sketched after this list)
* Add get_rental() method to SecureCloudApi trait for polling rental status
* Add ListRentalsResponse and RentalListItem for parsing list endpoint
* Add ConnectivityCheck error variant for better diagnostics
* Add list_replica_sets_for_deployment to query RS owned by a deployment
* Add scale_replica_set to adjust RS replica count via JSON patch
* Implement methods in MockK8sClient, KubeClient, and RateLimitedKubeClient
* Filter RS by owner references to find deployment-owned replicas
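
A rough idea of what an allowed_ips() helper like the one in the WireGuardPeer bullet above might compute; field names are made up, and the real peer data comes from the registration response:

struct WireGuardPeer {
    peer_address: String,             // e.g. "10.8.0.2/32"
    pod_network_cidr: Option<String>, // e.g. "10.42.1.0/24"
}

impl WireGuardPeer {
    /// AllowedIPs for the wg config: the peer address plus its pod network, if any.
    fn allowed_ips(&self) -> String {
        match &self.pod_network_cidr {
            Some(cidr) => format!("{}, {}", self.peer_address, cidr),
            None => self.peer_address.clone(),
        }
    }
}
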
…e ReplicaSets

* Use normalize_gpu_models in job, rental, and user_deployment controllers
* Add cleanup_stale_replica_sets to scale down old RS with 0 ready replicas
* Prevent autoscaler from provisioning nodes for pods with outdated nodeAffinity
* Add unit tests for GPU model normalization and deduplication in nodeAffinity
…nerd config

* Add PendingPodFilterConfig to limit max pending age and skip stale node affinity
* Add filter_pods_by_age to exclude pods pending too long from scaling decisions (sketched after this list)
* Add filter_pods_with_stale_affinity to skip pods whose GPU model matches existing nodes
* Refactor NVIDIA container runtime script to use base64 encoding for transfer
* Fix containerd v3 config section ordering for default_runtime_name
* Add comprehensive unit tests for pending pod filtering behavior
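
A pared-down sketch of the age filter mentioned above; the real function operates on Kubernetes Pod objects, so PendingPod here is a hypothetical stand-in:

use chrono::{DateTime, Duration, Utc};

struct PendingPod {
    name: String,
    created_at: DateTime<Utc>,
}

fn filter_pods_by_age(pods: Vec<PendingPod>, max_pending_age_seconds: i64) -> Vec<PendingPod> {
    let cutoff = Utc::now() - Duration::seconds(max_pending_age_seconds);
    pods.into_iter()
        // Keep only pods that became pending after the cutoff; anything older is
        // treated as stale and excluded from scaling decisions.
        .filter(|p| p.created_at >= cutoff)
        .collect()
}
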
…dling

* Return 0 balance for accounts not found on chain instead of parsing errors
* Return error instead of 0 when balance parsing fails to prevent silent failures
* Extract block_number from finalized transactions for sweep tracking
* Add block_number field to TransferReceipt struct
* Separate wait_for_finalized and wait_for_success for better error handling
…rage

* Add get_pending_sweep query to check for existing pending/submitted sweeps
* Add list_stale_sweeps to find sweeps stuck beyond threshold for reconciliation
* Pre-compute required_reserve in SweepCalculator to avoid repeated addition
* Add required_reserve() accessor for sweep decision visibility
…config

* Add max_sweeps_per_cycle config to prevent transaction flooding (default: 50)
* Reject default all-zeros AEAD key in production environment
* Reject default AEAD key when reconciliation is enabled
* Validate AEAD key format: must be exactly 64 hex characters (32 bytes), as sketched after this list
* Add comprehensive unit tests for AEAD key validation rules
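
The validation rules above roughly amount to the following checks; the error type and the production flag are simplified assumptions:

fn validate_aead_key_hex(key_hex: &str, is_production: bool) -> Result<(), String> {
    // Must be exactly 32 bytes encoded as 64 hex characters.
    if key_hex.len() != 64 || !key_hex.chars().all(|c| c.is_ascii_hexdigit()) {
        return Err("AEAD key must be exactly 64 hex characters (32 bytes)".to_string());
    }
    // The default all-zeros key is never acceptable in production.
    if is_production && key_hex.chars().all(|c| c == '0') {
        return Err("default all-zeros AEAD key is not allowed in production".to_string());
    }
    Ok(())
}
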
… wallet monitoring

* Add idempotency check to skip accounts with pending/submitted sweeps
* Re-verify balance before sweep to minimize race condition window
* Add reconcile_stale_sweeps to recover sweeps stuck in pending/submitted state
* Add monitor_cold_wallet to track treasury balance for operational visibility
* Apply max_sweeps_per_cycle rate limit to prevent transaction flooding
* Update sweep status to Confirmed on successful transfer with block_number
* Mark stale sweeps as Confirmed or Failed based on on-chain balance verification
* Fix PAYMENTS_AEAD_KEY_HEX to PAYMENTS_TREASURY__AEAD_KEY_HEX (nested config)
* Add ECR repository for AEAD migration image with lifecycle policy
* Add ECS task definition for running migration in Fargate
* Add migration tool: standalone Rust binary with AEAD decrypt/re-encrypt logic
* Include Dockerfile for building minimal migration container
* Document dry-run workflow with aws ecs run-task commands in outputs

@coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/basilica-payments/src/reconciliation/service.rs (1)

176-177: Extract repeated address preview logic into a helper function.

The pattern address.chars().take(N).collect::<String>() appears throughout this file (lines 176, 193, 205, 216, 230, 260, 287-294, 457-462, 502) and in other files (per relevant code snippets). This violates DRY.

Add helper functions to a common module:

fn truncate_account_hex(account_hex: &str) -> String {
    account_hex.chars().take(8).collect()
}

fn truncate_ss58_address(address: &str) -> String {
    address.chars().take(10).collect()
}

Then use throughout:

-let account_preview = account_hex.chars().take(8).collect::<String>();
+let account_preview = truncate_account_hex(account_hex);
-let hotwallet_preview = hotwallet_ss58.chars().take(10).collect::<String>();
+let hotwallet_preview = truncate_ss58_address(&hotwallet_ss58);

Also applies to: 193-194, 205-206, 216-217, 230-231, 260-261, 287-294, 457-462, 502-503

crates/basilica-operator/src/controllers/rental_controller.rs (1)

148-155: Align GPU model normalization with BasilicaQueue per‑model limits

build_node_affinity now normalizes gpu.model via normalize_gpu_models(&gpu.model) for the basilica.ai/gpu-model selector, which is good for scheduling and for how you attribute GPUs to models when walking pod nodeAffinity later in reconcile.

However, the BasilicaQueue per‑model limit path below still uses the raw spec.container.resources.gpus.model:

  • requested_models is built directly from spec.container.resources.gpus.model (lines around 899–904).
  • Per‑model caps are looked up with per_model.get(&m) where m is one of those raw strings.

If a user specifies models with memory suffixes or mixed case (e.g. ["A100-40GB", "H100-80GB"]), node affinity and model_gpus accounting will see normalized short names like "A100"/"H100", but requested_models will still be "A100-40GB"/"H100-80GB". A GpuLimits.models map keyed by normalized short names would then never be consulted for such requests, effectively disabling per-model caps for those workloads.

Consider normalizing requested models before enforcing per‑model limits, e.g.:

-                    let requested_models = if spec.container.resources.gpus.model.is_empty() {
-                        vec!["any".to_string()]
-                    } else {
-                        spec.container.resources.gpus.model.clone()
-                    };
+                    let requested_models = if spec.container.resources.gpus.model.is_empty() {
+                        vec!["any".to_string()]
+                    } else {
+                        normalize_gpu_models(&spec.container.resources.gpus.model)
+                    };

and document that GpuLimits.models keys are expected to use the same normalized short names. This keeps affinity, usage accounting, and caps all speaking the same model vocabulary.

♻️ Duplicate comments (3)
crates/basilica-operator/src/controllers/user_deployment_controller.rs (1)

535-553: Fix invalid GPU model label value for multi‑model specs and surface JSON errors to aid autoscaler debugging

When gpu_config has multiple models, this block:

labels.insert(
    "gpu-request.basilica.ai/model".to_string(),
    gpu.model.join(","),
);

will produce label values like "A100,H100". Kubernetes label values must be alphanumeric and may only contain -, _, and . as separators; commas are not allowed. A Deployment with such a label will be rejected by the API server for multi‑model specs, breaking those deployments. This was already called out in a previous review.

To make the label both valid and consistent with your normalization story, consider:

  • Normalizing the models with normalize_gpu_models(&gpu.model) (same as node affinity), and
  • Joining with a label‑safe delimiter such as "-" or ".".

For example:

-        if !gpu.model.is_empty() {
-            labels.insert(
-                "gpu-request.basilica.ai/model".to_string(),
-                gpu.model.join(","),
-            );
-        }
+        if !gpu.model.is_empty() {
+            let models = normalize_gpu_models(&gpu.model);
+            // Kubernetes label values cannot contain commas; join with a label-safe separator.
+            labels.insert(
+                "gpu-request.basilica.ai/model".to_string(),
+                models.join("-"),
+            );
+        }

Also, JSON serialization failures for the autoscaler annotation are silently ignored:

if let Ok(json) = serde_json::to_string(gpu) {
    annotations.insert("autoscaler.basilica.ai/gpu-requirements".to_string(), json);
}

If this ever fails (schema drift, unexpected values), the autoscaler loses its primary source of truth with no signal. It would be safer to at least emit a debug log:

match serde_json::to_string(gpu) {
    Ok(json) => {
        annotations.insert(
            "autoscaler.basilica.ai/gpu-requirements".to_string(),
            json,
        );
    }
    Err(e) => {
        debug!(
            error = %e,
            "Failed to serialize GPU requirements for autoscaler annotation"
        );
    }
}

Finally, it’s worth adding a small unit test that renders a Deployment with a multi‑model GpuSpec and asserts both the label (gpu-request.basilica.ai/model) and the autoscaler.basilica.ai/gpu-requirements annotation are present and valid.

Also applies to: 618-625

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)

771-771: Fix invalid chrono format specifier for milliseconds.

Line 771 uses %3f which is not a valid chrono strftime specifier. The correct format for millisecond precision is %.3f (with a leading dot).

Apply this fix:

-                Utc::now().format("%Y%m%d%H%M%S%3f"),
+                Utc::now().format("%Y%m%d%H%M%S%.3f"),

Without this fix, milliseconds will not be included in the timestamp, reducing uniqueness guarantees when creating multiple NodePools rapidly within the same second.

crates/basilica-autoscaler/src/provisioner/ssh.rs (1)

453-456: Forcefully killing apt/dpkg and removing locks is still risky.

This issue was previously flagged: killing dpkg with -9 and removing lock files can corrupt the package database if an active operation is in progress (not just a stale lock).

The wait loop above (lines 443-451) correctly uses fuser to detect lock holders, but then forcefully kills processes and removes locks anyway. A safer approach is to fail with a clear error if locks are still held after the timeout, as suggested in the previous review.

🧹 Nitpick comments (24)
crates/basilica-payments/src/blockchain/client.rs (2)

99-121: Debug format string parsing is fragile.

The fallback relies on the Debug trait's output format ({:?}), which is not a stable API and may change between Rust/library versions. This could cause silent failures if the format changes.

Consider adding a comment documenting this fragility, or alternatively, inspect the raw decoded value's structure more robustly using subxt's Value API (e.g., iterating over composite fields).


220-224: TransferStatus::InBlock appears to be dead code.

The transfer method now waits for finalization and only returns Finalized status. The InBlock variant is never used.

Consider removing it if no longer needed, or add #[allow(dead_code)] with a comment explaining its intended future use.

crates/basilica-autoscaler/src/config.rs (2)

183-205: Consider refactoring to eliminate duplicate defaults.

The Default implementation duplicates the same default values as from_env() (600 and true). This violates DRY and creates maintenance risk if defaults need to change.

Consider one of these approaches:

  • Remove the Default impl if it's not needed outside tests
  • Or have Default::default() delegate to from_env() style defaults via a shared constant

Example refactor:

+const DEFAULT_MAX_PENDING_AGE_SECONDS: u64 = 600;
+const DEFAULT_SKIP_STALE_NODE_AFFINITY: bool = true;
+
 impl PendingPodFilterConfig {
     fn from_env() -> Self {
         Self {
             max_pending_age_seconds: parse_u64_env(
                 "BASILICA_AUTOSCALER_MAX_PENDING_AGE_SECS",
-                600, // 10 minutes default
+                DEFAULT_MAX_PENDING_AGE_SECONDS,
             ),
             skip_stale_node_affinity: parse_bool_env(
                 "BASILICA_AUTOSCALER_SKIP_STALE_NODE_AFFINITY",
-                true,
+                DEFAULT_SKIP_STALE_NODE_AFFINITY,
             ),
         }
     }
 }
 
 impl Default for PendingPodFilterConfig {
     fn default() -> Self {
         Self {
-            max_pending_age_seconds: 600,
-            skip_stale_node_affinity: true,
+            max_pending_age_seconds: DEFAULT_MAX_PENDING_AGE_SECONDS,
+            skip_stale_node_affinity: DEFAULT_SKIP_STALE_NODE_AFFINITY,
         }
     }
 }

183-205: Consider adding test coverage for the new configuration.

The existing test suite validates other config sections but doesn't cover PendingPodFilterConfig. Adding tests would ensure the environment variable parsing and defaults work as expected.

Example test:

#[test]
fn pending_pod_filter_defaults() {
    let config = PendingPodFilterConfig::from_env();
    assert_eq!(config.max_pending_age_seconds, 600);
    assert_eq!(config.skip_stale_node_affinity, true);
}

#[test]
fn pending_pod_filter_in_autoscaler_config() {
    let config = AutoscalerConfig::from_env();
    assert_eq!(config.pending_pod_filter.max_pending_age_seconds, 600);
}
scripts/cloud/migration.tf (1)

136-153: Helpful operator-facing output, but clarify the DRY_RUN default.

The generated commands are well-documented. Note that the first command (DRY RUN) relies on the task definition's default DRY_RUN=true, which is correct. Consider adding a comment in the output itself clarifying that the first command uses the default dry-run mode without needing an override.

   value       = <<-EOT
-    # DRY RUN (review logs before proceeding):
+    # DRY RUN (uses default DRY_RUN=true from task definition - review logs before proceeding):
     aws ecs run-task \
scripts/migrations/Dockerfile (1)

14-18: Include Cargo.lock for reproducible builds.

Without a Cargo.lock, each build may pull different dependency versions, potentially introducing subtle behavioral differences in cryptographic operations.

 # Copy migration source
-COPY Cargo.toml migrate_aead_key.rs ./
+COPY Cargo.toml Cargo.lock migrate_aead_key.rs ./

Generate the lock file first:

cd scripts/migrations && cargo generate-lockfile
scripts/migrations/migrate_aead_key.rs (2)

118-122: Consider reducing key material in logs.

Printing the first 8 hex characters (32 bits) of cryptographic keys in logs is low-risk but unnecessary. In production CloudWatch logs, even partial key information adds to the attack surface. Consider reducing to 4 characters or just logging a hash/fingerprint for key identification.
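
One hedged alternative is to log a fingerprint derived from the key rather than a prefix of the key itself; a sketch using the sha2 and hex crates, with a made-up function name:

use sha2::{Digest, Sha256};

/// Short, non-reversible identifier for a key, safe to put in logs.
fn key_fingerprint(key_bytes: &[u8]) -> String {
    let digest = Sha256::digest(key_bytes);
    hex::encode(&digest[..4]) // 8 hex chars, enough to tell keys apart
}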


176-206: Consider zeroizing sensitive mnemonic data after use.

The decrypted mnemonic remains in memory after the loop iteration. For defense-in-depth, consider using the zeroize crate to clear sensitive data, though this is optional given the short-lived nature of this migration task.
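
If zeroize is adopted, wrapping the plaintext is a one-liner; the surrounding re-encryption step is elided in this sketch:

use zeroize::Zeroizing;

fn reencrypt_mnemonic(decrypted: Vec<u8>) {
    // Wipes the buffer when it goes out of scope, even on early return or unwind.
    let mnemonic = Zeroizing::new(decrypted);
    // ... re-encrypt `mnemonic` with the new AEAD key here ...
    let _ = mnemonic.len();
}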

scripts/migrations/Cargo.toml (1)

13-20: Consider pinning dependency versions for reproducibility.

For a migration tool handling sensitive cryptographic operations, pinning exact versions (e.g., aes-gcm = "=0.10.3") or using a committed Cargo.lock ensures identical behavior across builds and environments. This is optional for a one-time tool but recommended for auditability of cryptographic code.

crates/basilica-payments/src/reconciliation/calculator.rs (2)

18-20: Consider returning Result instead of panicking on overflow.

The expect call will panic if target_balance_plancks + estimated_fee_plancks overflows. For a financial reconciliation system, this configuration error should ideally be surfaced to the caller rather than causing a panic.

Consider refactoring to return a Result:

-    pub fn new(
+    pub fn new(
         minimum_threshold_plancks: u128,
         target_balance_plancks: u128,
         estimated_fee_plancks: u128,
-    ) -> Self {
+    ) -> Result<Self, String> {
         let required_reserve_plancks = target_balance_plancks
             .checked_add(estimated_fee_plancks)
-            .expect("target_balance + estimated_fee overflow - invalid configuration");
+            .ok_or_else(|| "target_balance + estimated_fee overflow - invalid configuration".to_string())?;

-        Self {
+        Ok(Self {
             minimum_threshold_plancks,
             target_balance_plancks,
             estimated_fee_plancks,
             required_reserve_plancks,
-        }
+        })
     }

5-8: Document why fields are retained despite being unused.

The #[allow(dead_code)] annotations suppress warnings for target_balance_plancks and estimated_fee_plancks, but these fields are only used during construction to compute required_reserve_plancks. If they serve no other purpose, consider removing them or adding a comment explaining why they're retained (e.g., for debugging, transparency, or API stability).

crates/basilica-payments/src/storage/reconciliation.rs (1)

211-211: Document or justify the hardcoded limit of 100 stale sweeps.

The query limits results to 100 stale sweeps. If more than 100 sweeps become stale, they won't be reconciled in this cycle. Consider whether this is acceptable or if pagination/batching should be implemented.
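
Example keyset-pagination sketch, assuming sqlx with Postgres; table and column names are illustrative, not the actual schema:

async fn reconcile_all_stale(pool: &sqlx::PgPool) -> Result<(), sqlx::Error> {
    let mut last_id: i64 = 0;
    loop {
        let batch: Vec<(i64, String)> = sqlx::query_as(
            "SELECT id, account_hex FROM sweeps \
             WHERE status = 'submitted' \
               AND updated_at < now() - interval '1 hour' \
               AND id > $1 \
             ORDER BY id ASC LIMIT 100",
        )
        .bind(last_id)
        .fetch_all(pool)
        .await?;

        if batch.is_empty() {
            break;
        }
        last_id = batch.last().map(|(id, _)| *id).unwrap_or(last_id);
        // ... reconcile each sweep in `batch` ...
    }
    Ok(())
}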

crates/basilica-payments/src/reconciliation/service.rs (3)

142-165: Rate limiting prevents overload but may defer accounts indefinitely.

The per-cycle sweep limit is good for preventing overload, but accounts processed later in the list could be deferred indefinitely if the limit is consistently reached. Consider implementing round-robin or priority-based account selection to ensure fairness.
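
Example round-robin sketch; the persisted cursor and function name are illustrative:

fn select_accounts_for_cycle<'a>(
    accounts: &'a [String],
    cursor: &mut usize,
    max_sweeps_per_cycle: usize,
) -> Vec<&'a String> {
    if accounts.is_empty() {
        return Vec::new();
    }
    let start = *cursor % accounts.len();
    let selected: Vec<&String> = accounts
        .iter()
        .cycle()
        .skip(start)
        .take(max_sweeps_per_cycle.min(accounts.len()))
        .collect();
    *cursor = (start + selected.len()) % accounts.len();
    selected
}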


324-340: Sweep record uses current_balance instead of initial_balance—verify this is intentional.

Line 333 records current_balance (verified at lines 296-301) rather than initial_balance (passed as parameter). This is correct if the intent is to record the most recent verified balance at sweep initiation, but it differs from the parameter name initial_balance, which could cause confusion.

Consider renaming the parameter to reflect its role:

 async fn execute_sweep(
     &self,
     account_hex: &str,
-    initial_balance: u128,
+    caller_observed_balance: u128,
     initial_sweep_amount: u128,
 ) -> Result<()> {

538-583: Document the 80% heuristic for sweep success detection.

The 80% threshold (line 543) is used to determine if a stale sweep likely succeeded based on balance drop. While reasonable, this heuristic could produce false positives/negatives:

False negatives (marking success as failure):

  • If a large deposit arrived after the sweep, the balance might not have dropped enough
  • This would incorrectly mark a successful sweep as failed

False positives (marking failure as success):

  • If the balance dropped due to external transactions unrelated to the sweep
  • This would incorrectly mark a failed sweep as confirmed

Consider these improvements:

  1. Add tx_hash verification: If sweep.tx_hash exists, query the blockchain to verify the transaction status rather than relying solely on balance changes
  2. Document the heuristic: Add a comment explaining the 80% threshold and known limitations
  3. Emit metrics: Track reconciliation outcomes (confirmed vs failed) for monitoring

Example improvement:

// If we have a tx_hash, verify it on-chain first
if let Some(ref tx_hash) = sweep.tx_hash {
    match self.blockchain.get_transaction_status(tx_hash).await {
        Ok(status) if status.is_finalized() => {
            // Transaction is confirmed on-chain
            sweep_likely_succeeded = true;
        }
        Ok(_) => {
            // Transaction not finalized, use heuristic
        }
        Err(e) => {
            warn!("Failed to query tx_hash {}: {}, using balance heuristic", tx_hash, e);
        }
    }
}
crates/basilica-operator/src/labels.rs (1)

165-205: Case‑insensitive GPU model extraction and normalization are well‑designed

Uppercasing once in extract_short_gpu_model and matching against GPU_MODEL_PATTERNS, then layering normalize_gpu_models on top with dedup + order preservation, gives you a clean canonical model vocabulary and the tests exercise the key cases (mixed case, memory suffixes, duplicates, empty input).

One minor future‑proofing suggestion: since matching now assumes product_upper.contains(pattern), it’s worth either documenting that GPU_MODEL_PATTERNS entries must be uppercase or normalizing the patterns at compile time to avoid subtle mis‑matches if someone later adds a lowercase pattern. Otherwise the implementation and tests look solid.

Also applies to: 453-513
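
Example guard test, assuming the existing GPU_MODEL_PATTERNS: &[&str] constant in labels.rs:

#[cfg(test)]
mod pattern_guard {
    use super::GPU_MODEL_PATTERNS;

    #[test]
    fn gpu_model_patterns_are_uppercase() {
        for pattern in GPU_MODEL_PATTERNS {
            assert_eq!(
                pattern.to_uppercase(),
                *pattern,
                "pattern {pattern:?} must be uppercase for case-insensitive contains matching"
            );
        }
    }
}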

crates/basilica-operator/src/controllers/user_deployment_controller.rs (1)

913-917: ReplicaSet cleanup logic is reasonable but has a few behavioral nuances

cleanup_stale_replica_sets is a good addition to prevent the autoscaler from provisioning capacity for pods belonging to old rollouts: it scopes to ReplicaSets for this deployment, finds the newest revision, and scales older RS down to 0 when ready_replicas == 0 but spec.replicas > 0. Errors are logged and not propagated, keeping reconcile resilient.

A couple of behavior notes to keep in mind:

  • Any older RS that still has non‑ready pods (Pending/Failed) but zero ready replicas will be aggressively scaled to 0. That’s usually what you want for stale rollouts, but it means you can’t keep an old RS around with non‑ready pods for debugging without it being cleaned up on the next reconcile.
  • RS without a deployment.kubernetes.io/revision annotation are treated as revision 0; if there are multiple such RS, they’ll all be considered “newest” and thus skipped. That’s safe, but might be worth documenting or filtering out explicitly if list_replica_sets_for_deployment could ever return unrelated RS.

Overall this is a sensible best‑effort cleanup; if you start relying on it heavily, consider adding a small unit/integration test around a mocked list_replica_sets_for_deployment to lock in the behavior.

Also applies to: 1052-1133
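
Example sketch of how the revision comparison could be factored into a pure helper and locked in with a plain-data test (Rs is an illustrative stand-in for the fields read from a ReplicaSet):

struct Rs {
    name: &'static str,
    revision: u64, // parsed from deployment.kubernetes.io/revision, 0 if absent
    ready_replicas: i32,
    spec_replicas: i32,
}

fn stale_rs_names(all: &[Rs]) -> Vec<&'static str> {
    let newest = all.iter().map(|rs| rs.revision).max().unwrap_or(0);
    all.iter()
        .filter(|rs| rs.revision < newest && rs.ready_replicas == 0 && rs.spec_replicas > 0)
        .map(|rs| rs.name)
        .collect()
}

#[test]
fn only_older_unready_replica_sets_are_selected() {
    let sets = [
        Rs { name: "app-v2", revision: 2, ready_replicas: 1, spec_replicas: 1 },
        Rs { name: "app-v1", revision: 1, ready_replicas: 0, spec_replicas: 1 },
        Rs { name: "app-v0", revision: 0, ready_replicas: 0, spec_replicas: 0 },
    ];
    assert_eq!(stale_rs_names(&sets), vec!["app-v1"]);
}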

crates/basilica-operator/src/k8s_client.rs (3)

739-744: Consider verifying owner reference kind for accuracy.

The filter only checks r.name == deployment_name but doesn't verify that the owner reference is actually a Deployment (via r.kind). This could match owner references from other resource types (e.g., a StatefulSet) that happen to have the same name.

                     rs.metadata
                         .owner_references
                         .as_ref()
-                        .map(|refs| refs.iter().any(|r| r.name == deployment_name))
+                        .map(|refs| refs.iter().any(|r| r.kind == "Deployment" && r.name == deployment_name))
                         .unwrap_or(false)

753-763: Silent success when ReplicaSet doesn't exist differs from real K8s behavior.

The mock returns Ok(()) even when the namespace or ReplicaSet doesn't exist. Real Kubernetes would return a NotFound error. This could cause tests to pass incorrectly when they should fail due to missing resources.

Consider returning an error when the ReplicaSet isn't found, consistent with other methods like update_gpu_rental_status.

     async fn scale_replica_set(&self, ns: &str, name: &str, replicas: i32) -> Result<()> {
         let mut map = self.replica_sets.write().await;
-        if let Some(ns_map) = map.get_mut(ns) {
-            if let Some(rs) = ns_map.get_mut(name) {
-                if let Some(spec) = rs.spec.as_mut() {
-                    spec.replicas = Some(replicas);
-                }
-            }
-        }
-        Ok(())
+        let ns_map = map
+            .get_mut(ns)
+            .ok_or_else(|| anyhow!("namespace not found: {}", ns))?;
+        let rs = ns_map
+            .get_mut(name)
+            .ok_or_else(|| anyhow!("ReplicaSet not found: {}/{}", ns, name))?;
+        if let Some(spec) = rs.spec.as_mut() {
+            spec.replicas = Some(replicas);
+        }
+        Ok(())
     }

1446-1451: Same owner reference filter issue as in the mock.

The filter should also verify r.kind == "Deployment" to avoid incorrectly matching ReplicaSets owned by other resource types with the same name.

             .filter(|rs| {
                 rs.metadata
                     .owner_references
                     .as_ref()
-                    .map(|refs| refs.iter().any(|r| r.name == deployment_name))
+                    .map(|refs| refs.iter().any(|r| r.kind == "Deployment" && r.name == deployment_name))
                     .unwrap_or(false)
             })
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1)

1312-1330: Label sanitization may be too permissive.

The sanitize_label_value function (lines 1312-1330) converts non-compliant characters to hyphens but doesn't enforce Kubernetes label value length limits (max 63 characters) or the pattern requirement that values cannot consist solely of hyphens/dots.

Consider adding validation:

 fn sanitize_label_value(value: &str) -> String {
     let sanitized: String = value
         .chars()
         .map(|c| {
             if c.is_ascii_alphanumeric() || c == '-' || c == '_' || c == '.' {
                 c
             } else {
                 '-'
             }
         })
         .collect();
-    // Trim leading/trailing non-alphanumeric chars
-    sanitized
+    // Trim leading/trailing non-alphanumeric chars and enforce length limit
+    let trimmed = sanitized
         .trim_start_matches(|c: char| !c.is_ascii_alphanumeric())
         .trim_end_matches(|c: char| !c.is_ascii_alphanumeric())
-        .to_string()
+        .to_string();
+    
+    // Enforce the Kubernetes 63-character limit; re-trim so truncation cannot
+    // leave a trailing non-alphanumeric character.
+    if trimmed.len() > 63 {
+        trimmed[..63]
+            .trim_end_matches(|c: char| !c.is_ascii_alphanumeric())
+            .to_string()
+    } else {
+        trimmed
+    }
 }
crates/basilica-autoscaler/src/provisioner/ssh.rs (3)

192-323: Optional: Consider creating key file once outside retry loop.

The current implementation creates a new temporary key file on each retry attempt. While functionally correct and the overhead is minimal, you could create it once before the retry loop for slightly better efficiency.

Apply this diff if desired:

+    // Create temp file for private key (must stay alive during command execution)
+    let key_file = create_key_file(&ssh_config.private_key)?;
+
     for attempt in 0..self.max_retries {
         if attempt > 0 {
             debug!(host = %host, attempt = %attempt, "Retrying SSH connection");
             tokio::time::sleep(self.retry_delay).await;
         }

-        // Create temp file for private key (must stay alive during command execution)
-        let key_file = create_key_file(&ssh_config.private_key)?;
-
         let mut cmd = Command::new("ssh");

784-793: Consider more robust handshake validation.

The current string matching for handshake recency ("seconds ago", "1 minute", "Less than a minute") is fragile and depends on the output format of wg show, which could vary by version or locale.

Consider parsing the numeric latest-handshake field from wg show dump or wg show wg0 latest-handshakes for more reliable validation, or use a broader regex pattern to extract time values.

Example approach using wg show wg0 latest-handshakes:

// In validate_wireguard_connectivity, replace the handshake check with:
let handshake_output = self
    .execute_ssh_command(
        host,
        port,
        ssh_config,
        "sudo wg show wg0 latest-handshakes 2>/dev/null || true",
    )
    .await?;

// Parse timestamps and check if any peer has handshake within last 120s
let has_recent_handshake = handshake_output.lines().any(|line| {
    line.split_whitespace()
        .nth(1)
        .and_then(|ts| ts.parse::<i64>().ok())
        .map(|ts| {
            let now = std::time::SystemTime::now()
                .duration_since(std::time::UNIX_EPOCH)
                .unwrap()
                .as_secs() as i64;
            now - ts < 120
        })
        .unwrap_or(false)
});

1060-1074: Complex sed manipulation could be fragile.

The script uses sed to insert configuration lines based on the containerd config version. While functionally correct, this approach is brittle and could fail if:

  • The config structure changes in future K3s versions
  • The expected section markers don't exist
  • Multiple matches exist for the patterns

Consider adding validation after sed operations to ensure the expected content was inserted, or log the config file if sed fails to find expected patterns.

Add validation after sed operations:

# After inserting default_runtime_name
if ! sudo grep -q 'default_runtime_name' "$TMPL_FILE"; then
    echo "ERROR: Failed to add default_runtime_name to config"
    sudo cat "$TMPL_FILE"
    exit 1
fi
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e738854 and d14f5ac.

📒 Files selected for processing (20)
  • crates/basilica-autoscaler/src/config.rs (4 hunks)
  • crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (15 hunks)
  • crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (13 hunks)
  • crates/basilica-autoscaler/src/offering_matcher.rs (1 hunks)
  • crates/basilica-autoscaler/src/provisioner/ssh.rs (16 hunks)
  • crates/basilica-operator/src/controllers/job_controller.rs (2 hunks)
  • crates/basilica-operator/src/controllers/rental_controller.rs (2 hunks)
  • crates/basilica-operator/src/controllers/user_deployment_controller.rs (7 hunks)
  • crates/basilica-operator/src/k8s_client.rs (6 hunks)
  • crates/basilica-operator/src/labels.rs (3 hunks)
  • crates/basilica-payments/src/blockchain/client.rs (3 hunks)
  • crates/basilica-payments/src/config.rs (6 hunks)
  • crates/basilica-payments/src/reconciliation/calculator.rs (3 hunks)
  • crates/basilica-payments/src/reconciliation/service.rs (8 hunks)
  • crates/basilica-payments/src/storage/reconciliation.rs (3 hunks)
  • scripts/cloud/compute.tf (1 hunks)
  • scripts/cloud/migration.tf (1 hunks)
  • scripts/migrations/Cargo.toml (1 hunks)
  • scripts/migrations/Dockerfile (1 hunks)
  • scripts/migrations/migrate_aead_key.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-13T15:34:46.544Z
Learnt from: epappas
Repo: one-covenant/basilica PR: 184
File: crates/basilica-validator/src/rental/billing.rs:129-130
Timestamp: 2025-10-13T15:34:46.544Z
Learning: In the basilica-validator crate, rental_id is a String type (not UUID) throughout the system - in RentalInfo struct, database rentals.id column, and all rental operations. The persistence layer methods should accept &str for rental_id parameters, not &Uuid, since they immediately convert to String for SQL binding anyway.

Applied to files:

  • crates/basilica-operator/src/controllers/rental_controller.rs
🧬 Code graph analysis (8)
crates/basilica-operator/src/controllers/rental_controller.rs (1)
crates/basilica-operator/src/labels.rs (1)
  • normalize_gpu_models (197-204)
crates/basilica-payments/src/reconciliation/service.rs (1)
crates/basilica-payments/src/blockchain/client.rs (1)
  • account_hex (79-79)
crates/basilica-payments/src/storage/reconciliation.rs (2)
crates/basilica-payments/src/reconciliation/service.rs (6)
  • account_hex (176-176)
  • account_hex (193-193)
  • account_hex (205-205)
  • account_hex (216-216)
  • account_hex (230-230)
  • account_hex (260-260)
crates/basilica-payments/src/blockchain/client.rs (1)
  • account_hex (79-79)
crates/basilica-operator/src/controllers/job_controller.rs (1)
crates/basilica-operator/src/labels.rs (1)
  • normalize_gpu_models (197-204)
crates/basilica-operator/src/controllers/user_deployment_controller.rs (3)
crates/basilica-operator/src/labels.rs (1)
  • normalize_gpu_models (197-204)
crates/basilica-operator/src/crd/user_deployment.rs (4)
  • new (338-347)
  • new (379-386)
  • new (400-429)
  • new (500-520)
crates/basilica-operator/src/controllers/node_profile_controller.rs (4)
  • new (28-30)
  • s (260-260)
  • s (265-265)
  • s (400-400)
crates/basilica-payments/src/config.rs (2)
crates/basilica-payments/src/blockchain/monitor.rs (1)
  • default (102-109)
crates/basilica-payments/src/reconciliation/types.rs (1)
  • default (94-96)
crates/basilica-autoscaler/src/provisioner/ssh.rs (1)
crates/basilica-autoscaler/src/provisioner/wireguard.rs (2)
  • generate_wg_config (85-120)
  • parse_wg_show_output (123-159)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (2)
crates/basilica-autoscaler/src/offering_matcher.rs (9)
  • calculate_nodes_needed (739-759)
  • filter_pods_by_age (644-659)
  • filter_pods_with_stale_affinity (665-714)
  • group_pending_pods_by_requirements (625-640)
  • has_gpu_node_affinity (504-547)
  • new (62-75)
  • new (241-251)
  • default (218-223)
  • from_template (117-131)
crates/basilica-autoscaler/src/config.rs (5)
  • default (199-204)
  • s (224-224)
  • s (232-232)
  • s (239-239)
  • s (246-246)
🔇 Additional comments (45)
crates/basilica-payments/src/blockchain/client.rs (2)

172-197: LGTM on the finalization flow.

The updated sequence correctly waits for finalization before extracting block info, then verifies success via wait_for_success. This provides stronger guarantees about transaction finality and properly captures the block number for the receipt.


81-84: LGTM on handling non-existent accounts.

Returning Ok(0) for accounts not found on-chain is the correct behavior—new accounts without prior transactions won't have storage entries.

crates/basilica-autoscaler/src/config.rs (3)

13-13: LGTM!

The new pending_pod_filter field follows the established pattern for configuration fields and is properly initialized.

Also applies to: 27-27


172-181: LGTM!

The struct definition is well-documented and the fields are appropriately typed with clear purpose.


243-248: LGTM!

The parse_u64_env helper follows the same pattern as the existing parse_u16_env and parse_u32_env functions, maintaining consistency across the codebase.

crates/basilica-payments/src/config.rs (8)

79-91: LGTM! Well-documented blockchain resilience configuration.

The new fields provide sensible defaults for blockchain monitoring resilience. The documentation clearly explains each field's purpose, and the defaults align with the MonitorConfig implementation shown in the relevant code snippets.


93-107: LGTM! Default functions are correct.

The default helper functions return the expected values and match the defaults used in the PaymentsConfig::default() implementation.


142-149: LGTM! Sensible rate limiting for reconciliation.

The max_sweeps_per_cycle field with a default of 50 provides reasonable protection against transaction flooding during reconciliation cycles.


259-263: LGTM! Default implementation correctly initialized.

The new fields are properly initialized in the Default implementation with values matching the default helper functions.

Also applies to: 288-288


378-396: LGTM! Strong security validation for AEAD key.

The validation correctly rejects the default all-zeros AEAD key in production or when reconciliation is enabled, with clear error messages guiding users to set a secure key.


398-411: LGTM! Comprehensive AEAD key format validation.

The validation correctly ensures the AEAD key is exactly 64 hexadecimal characters (32 bytes) and contains only valid hex digits. The error messages are clear and informative.


419-438: LGTM! Sound validation logic for blockchain configuration.

The validation rules are logically correct:

  • Ensures positive values for retry and retention settings
  • Ensures max_block_gap >= block_retention_threshold, which makes sense since the maximum allowed gap should accommodate the retention window

Error messages are clear and include actual values for debugging.


597-668: LGTM! Comprehensive test coverage for AEAD key validation.

The tests thoroughly cover all validation scenarios including:

  • Rejection in production and with reconciliation enabled
  • Valid key acceptance
  • Format validation (length and character constraints)
  • Permitted scenarios (development without reconciliation)

Test names are clear and assertions verify both error occurrence and message content.

scripts/cloud/compute.tf (1)

277-284: LGTM! Clear documentation of the naming convention.

The rename from PAYMENTS_AEAD_KEY_HEX to PAYMENTS_TREASURY__AEAD_KEY_HEX correctly reflects the nested config structure, and the comments helpfully explain the double-underscore convention for readers unfamiliar with this pattern.

scripts/cloud/migration.tf (1)

77-84: Confirm the hardcoded old key is intentional.

Hardcoding OLD_AEAD_KEY_HEX as all zeros in the Terraform config is appropriate if this was indeed the incorrectly used default key. However, ensure this is the correct key and consider adding a comment in the code (not just the header) confirming this is the known broken state being migrated from.

scripts/migrations/Dockerfile (1)

31-33: LGTM! Good security practice.

Running as a non-root user with explicit ownership is the correct approach for containerized tools, especially those handling sensitive cryptographic material.

scripts/migrations/migrate_aead_key.rs (3)

208-221: Good fallback logic for already-migrated detection.

The approach of attempting decryption with the new key when the old key fails is a robust way to handle partially-completed migrations and re-runs. This makes the migration idempotent.


76-89: LGTM! Correct nonce generation.

Using rand::thread_rng().fill_bytes() for nonce generation is appropriate for AES-GCM. A 12-byte nonce from a CSPRNG makes nonce reuse statistically negligible, especially for a one-off migration.
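
Example sketch of the pattern, assuming the rand crate:

use rand::RngCore;

/// 96-bit AES-GCM nonce from the thread-local CSPRNG.
fn generate_nonce() -> [u8; 12] {
    let mut nonce = [0u8; 12];
    rand::thread_rng().fill_bytes(&mut nonce);
    nonce
}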


52-74: Validate nonce length before creating Nonce.

The code correctly validates the nonce length at lines 63-65 before using it at line 67. This prevents panics from Nonce::from_slice on invalid input.

crates/basilica-payments/src/reconciliation/calculator.rs (1)

50-52: LGTM: Clean accessor for computed reserve.

The public accessor provides a clean interface to expose the computed reserve threshold, which can be useful for logging, metrics, or debugging.

crates/basilica-payments/src/storage/reconciliation.rs (1)

136-149: LGTM: Idempotency check implementation is correct.

The get_pending_sweep method correctly queries for existing pending or submitted sweeps, ordered by most recent first. This enables idempotency checks in the reconciliation service.

crates/basilica-payments/src/reconciliation/service.rs (6)

103-111: LGTM: Pre-cycle reconciliation and monitoring enhance reliability.

Adding stale sweep reconciliation and cold wallet monitoring before each cycle improves operational visibility and prevents stuck sweeps from accumulating. Error handling is appropriate—logging warnings without stopping the loop ensures the service remains resilient.


262-277: LGTM: Idempotency check prevents duplicate sweeps.

Checking for existing pending/submitted sweeps before initiating a new sweep is essential for preventing duplicate transfers, which could lead to incorrect balances or fund loss.


342-363: LGTM: Dry-run mode correctly simulates post-sweep balance.

The dry-run implementation properly simulates the balance after sweep using saturating_sub (line 356), which prevents underflow. This provides accurate simulation without executing blockchain transactions.


377-410: LGTM: Transfer execution with comprehensive error handling.

The blockchain transfer execution and post-transfer balance verification are well-structured:

  • Logs transaction hash and block number (lines 379-382)
  • Fetches and records post-sweep balance (lines 384-393)
  • Gracefully handles balance query failures with a warning (lines 387-392)
  • Updates sweep status to Confirmed with all relevant data

436-478: LGTM: Cold wallet monitoring provides valuable operational visibility.

The cold wallet balance monitoring:

  • Correctly converts SS58 address to hex for balance queries
  • Logs balance in both human-readable TAO and raw plancks
  • Emits metrics for observability
  • Handles errors gracefully with warnings

This is valuable for detecting unexpected balance changes or verifying successful sweeps.


517-536: LGTM: Parse error handling prevents incorrect reconciliation.

The explicit error handling for parsing balance_before_plancks and sweep_amount_plancks (lines 517-536) correctly skips sweeps with corrupted data rather than making incorrect reconciliation decisions. The continue statements ensure the loop processes remaining sweeps.

crates/basilica-operator/src/controllers/job_controller.rs (1)

85-116: GPU node affinity normalization for Jobs looks correct

Using normalize_gpu_models(&gpu.model) for the basilica.ai/gpu-model selector keeps Job scheduling consistent with the label/normalization utilities and with user deployments, while still guarding on "miner", "validated", and "jobs" node groups. No issues from a correctness or scheduling perspective.

crates/basilica-operator/src/controllers/user_deployment_controller.rs (1)

412-476: GPU node affinity normalization and tests are aligned

Switching the basilica.ai/gpu-model selector to normalize_gpu_models(&gpu.model) and adding tests that cover memory‑suffixed inputs ("A100-40GB", "H100-80GB") and duplicate entries ensures that:

  • Node affinity always uses canonical short model names (e.g. "A100", "H100"), independent of how the UI formats them.
  • The selector values are deduplicated, reducing label cardinality and avoiding redundant constraints.

The new tests validate both normalization and deduplication behavior end‑to‑end through render_deployment, so this change looks good.

Also applies to: 1978-2073

crates/basilica-operator/src/k8s_client.rs (4)

8-8: LGTM!

Import correctly added to support the new ReplicaSet management methods.


150-158: LGTM!

Trait method signatures are well-defined and consistent with existing patterns in the K8sClient trait.


181-181: LGTM!

Field follows the established pattern for mock state storage.


1820-1835: LGTM!

Rate-limited wrapper correctly delegates to the inner client after acquiring the rate limit permit, consistent with all other methods.

crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (2)

442-449: Offering invalidation on rental failure addresses timing gap.

The code now invalidates the offering from the cache when rental fails (lines 442-449), which addresses the past review concern about offering availability timing gaps. This is a good defensive measure.

However, note that cache invalidation only affects the in-memory cache and the offering will reappear after TTL expiry. For offerings that are consistently unavailable, consider whether additional retry backoff or persistent invalidation is needed.


770-789: LGTM! Early label application prevents pod scheduling race.

Applying GPU labels immediately when the node is detected (before it becomes Ready) is a smart defensive measure. This prevents a race condition where pods might attempt to schedule during the brief window between node Ready and label application.

The validation logic at lines 792-814 ensures labels are present before transitioning to Ready, providing an additional safety layer.

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (2)

115-178: Excellent provisioning guard prevents duplicate VM creation.

The provisioning guard (lines 115-147) counts NodePools in provisioning phases and blocks scale-up if any are in-flight (lines 157-178). This is a critical safety feature that prevents race conditions where multiple reconcile loops could create duplicate VMs while waiting for nodes to register.

The correction logic (lines 139-147) also handles stuck counters gracefully.


1019-1050: LGTM! GPU request detection covers both regular and init containers.

The requests_gpu function properly checks both regular containers (lines 1026-1036) and init containers (lines 1039-1047) for GPU requests. This ensures that pods with GPU requirements in init containers are correctly identified for autoscaling.

crates/basilica-autoscaler/src/provisioner/ssh.rs (8)

16-23: LGTM: Clear documentation on key limitations.

The documentation appropriately notes that passphrase-protected keys are not supported when using system SSH, which aligns with the architectural choice to use external SSH binary.


96-124: LGTM: Secure temporary key file handling.

The implementation correctly creates a temporary file with 0o600 permissions (SSH requirement) and uses appropriate error handling. The NamedTempFile will be automatically cleaned up when dropped.


325-345: LGTM: Secure file upload implementation.

The use of base64 encoding prevents command injection and heredoc delimiter collisions, and sudo tee appropriately handles writes to protected system paths.


490-577: LGTM: Comprehensive network performance tuning.

The sysctl configuration is well-documented and appropriate for a K3s GPU cluster with WireGuard networking. The values are reasonable and the configuration includes proper module loading and error handling.


583-660: LGTM: Proper WireGuard key lifecycle management.

The implementation correctly preserves existing keys (good for idempotency), validates key format (44 characters), and sets appropriate file permissions (600 for private key, 644 for public key).


662-742: LGTM: Robust WireGuard peer configuration with IP mismatch handling.

The implementation properly handles IP address changes by tearing down the existing interface when a mismatch is detected, and includes verification of the configured peers.


861-896: LGTM: Robust hostname and NVIDIA version detection.

The implementation properly handles errors in NVIDIA version detection by falling back to "unknown", and uses tr -d '[:space:]' to ensure clean version strings. The driver and CUDA versions as node labels will be useful for GPU workload scheduling.


1118-1141: LGTM: Basic tests cover core functionality.

The tests validate provisioner construction and key file creation. While additional tests for SSH execution and provisioning methods would be beneficial, those are better suited for integration tests given the external dependencies.

…ic format

* Add normalize_gpu_model function for consistent uppercase alphanumeric output
* Change normalize_gpu_models to use normalize_gpu_model instead of extract_short_gpu_model
* Update extract_nfd_labels to use normalize_gpu_model for consistent labeling
* Update test assertions to match new normalized format (e.g., A10040GB vs A100)
* This ensures operator and autoscaler use identical normalization logic
…icy CRD

* Add WarmPoolConfig with VRAM target, cost limits, and hysteresis thresholds
* Add WarmPoolStatus for VRAM accounting (total, allocated, idle VRAM in GB)
* Add ImagePrepullConfig and PrepullImage types for container image pre-pulling
* Add warm_pool and image_prepull fields to ScalingPolicySpec
* Add warm_pool status field to ScalingPolicyStatus
* Export new types from crd module
…ng logic

* Add calculate_vram_metrics for tracking total/allocated/idle VRAM across warm pool nodes
* Add calculate_warm_pool_target to determine target nodes based on VRAM and cost constraints
* Add evaluate_warm_pool_scaling with hysteresis thresholds to prevent flapping
* Add build_warm_pool_status to construct status for CRD updates
* Add gpu_type_to_memory mapping for common GPU models
* Add comprehensive unit tests for scaling decisions and threshold behavior
…GC timeout

* Add skip_failing_deployments config to prevent provisioning for CrashLoopBackOff pods
* Increase max_pending_age_seconds default from 600s to 1200s (20 min)
* Add FAILED_GC_TIMEOUT constant (10 min) for auto-cleanup of Failed NodePools
* Add BASILICA_AUTOSCALER_SKIP_FAILING_DEPLOYMENTS env var support
* Add warm_pool_total_vram_gb gauge for total VRAM across warm pool nodes
* Add warm_pool_allocated_vram_gb gauge for allocated VRAM
* Add warm_pool_idle_vram_gb gauge for available idle VRAM
* Add warm_pool_target_vram_gb gauge for configured VRAM target
* Add warm_pool_idle_nodes and warm_pool_target_nodes gauges
* Add warm_pool_cost_per_hour gauge for estimated hourly cost
* Add warm_pool_scale_decisions_total counter by decision type
* Add set_warm_pool_metrics and record_warm_pool_scale_decision methods
* Add filter_pods_from_failing_deployments to skip pods from troubled deployments
* Add is_pod_failing to detect CrashLoopBackOff, ImagePullBackOff, and other errors
* Add get_pod_owner_uid to group pods by ReplicaSet owner
* Add FAILING_WAIT_REASONS constant for container failure detection
* Add comprehensive unit tests for failing deployment filtering
…K8sClient

* Add delete_node_pool for automatic NodePool CR cleanup in GC flows
* Add list_pods_in_namespace to support failing deployment filtering
* These methods enable auto-cleanup of Failed/Deleted NodePools
… scale-down

* Add evaluate_and_apply_warm_pool to ScalingPolicyController for proactive capacity
* Add warm pool metrics collection and status updates to reconciliation loop
* Add failing deployment filter to apply_pending_pod_filters using list_pods_in_namespace
* Add idle-time-based scale-down for non-warm-pool nodes
* Add handle_failed with auto-cleanup after FAILED_GC_TIMEOUT expires
* Add handle_deleted to clean up stuck NodePools from before the fix
* Delete NodePool CR in handle_terminating after successful cleanup
* Add is_warm_pool parameter to scale_up_with_offering for node labeling
* Add collect_pods_by_node, get_offering_rates, get_available_offerings helpers
* Add count_warm_pool_nodes, count_pending_warm_pool_nodes, find_idle_nodes_for_scale_down
* Add select_warm_pool_offering for offering selection based on GPU preferences
…ests

* Add warmPool spec and status fields to ScalingPolicy CRD schema
* Add imagePrepull spec field with images array for pre-pulling
* Add image-prepuller DaemonSet manifest for GPU nodes
* Configure scaling-policy.yaml with warm pool enabled (160GB target, $15/hr limit)
* Configure PyTorch and vLLM images for pre-pulling on warm pool nodes
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (4)
crates/basilica-autoscaler/src/offering_matcher.rs (2)

133-151: allow_model_fallback merge logic uses OR semantics.

Line 148 uses || which means the merged result is true if either constraint allows fallback. This contradicts the doc comment "preferring self's values where set" and differs from other fields that prefer self.

This was flagged in a previous review. The fix should change line 148 to use self.allow_model_fallback to maintain consistent precedence behavior.


665-714: GPU model normalization incompatibility between operator and autoscaler.

The stale affinity filter normalizes both pod requests and node labels using normalize_gpu_model. However, node labels are set by the operator using extract_short_gpu_model, which has different behavior. This can cause false negatives in stale affinity detection.

This was flagged in a previous review. The codebase should use a shared normalization function or ensure consistent label formats.

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)

1071-1076: Verify the chrono fractional-seconds specifier (duplicate issue).

Line 1074 uses %3f, which chrono accepts as a fixed-width fractional-seconds specifier rendered without a leading dot, so the format already yields millisecond precision; %.3f is the variant with a leading dot. If the dotted form is preferred, note that the dot will then appear in the generated pool name.

🔎 Apply this diff only if the dotted form is intended:
             let pool_name = format!(
                 "{}-{}-{}",
                 policy_name,
-                Utc::now().format("%Y%m%d%H%M%S%3f"),
+                Utc::now().format("%Y%m%d%H%M%S%.3f"),
                 i
             );

Based on learnings, this issue was previously identified but remains unfixed.

crates/basilica-operator/src/controllers/user_deployment_controller.rs (1)

543-548: The comma delimiter issue for multi-model label values remains unaddressed.

The gpu.model.join(",") still produces invalid Kubernetes label values when multiple models are specified (e.g., A100,H100). Kubernetes label values only allow alphanumerics, -, _, and ..

🔎 Apply this diff to use a label-safe delimiter:
         if !gpu.model.is_empty() {
             labels.insert(
                 "gpu-request.basilica.ai/model".to_string(),
-                gpu.model.join(","),
+                gpu.model.join("_"),
             );
         }
🧹 Nitpick comments (7)
orchestrator/k8s/services/autoscaler/image-prepuller.yaml (2)

37-71: ConfigMap image-prepuller-config is defined but not used.

The images are hardcoded in the initContainers (lines 39, 56) while the ConfigMap (lines 100-102) lists the same images but is never mounted or referenced. This creates a maintenance burden where updates require changes in two places.

Consider either:

  1. Remove the unused ConfigMap if the static approach is intentional
  2. Use a controller/operator pattern that reads from the ConfigMap and dynamically creates the pods

89-90: Consider using a lower priority class than system-cluster-critical.

system-cluster-critical is typically reserved for essential cluster components like kube-proxy, CNI plugins, and core system pods. Using it for image pre-pulling means these pods could preempt legitimate user workloads during resource pressure.

Consider using a lower priority class such as system-node-critical or a custom priority class with an appropriate value, since image pre-pulling is an optimization rather than a critical system function.

orchestrator/k8s/services/autoscaler/scaling-policy.yaml (1)

108-117: Hardcoded environment-specific values may cause deployment issues.

The sshKeyId and datacenterId contain specific UUIDs that appear to be environment-specific. Users deploying this manifest will need to replace these values. Consider either:

  1. Using placeholder values with clear documentation (e.g., <YOUR-SSH-KEY-ID>)
  2. Referencing a template file instead (the summary mentions scaling-policy.template.yaml)
      # SSH key ID registered with your cloud provider
      sshKeyId: "<YOUR-SSH-KEY-ID>"  # Replace with your SSH key UUID
      datacenterId: "<YOUR-DATACENTER-ID>"  # Replace with your datacenter ID
crates/basilica-autoscaler/src/controllers/k8s_client.rs (1)

526-573: Consider extracting common event construction logic.

The create_pod_event and create_node_pool_event methods share significant code. A helper method could reduce duplication:

🔎 Optional refactoring suggestion
async fn create_event(
    &self,
    ns: &str,
    involved_object: ObjectReference,
    event_type: &str,
    reason: &str,
    message: &str,
    action: &str,
) -> Result<()> {
    use chrono::Utc;
    use kube::api::{Api, PostParams};

    let api: Api<Event> = Api::namespaced(self.client.clone(), ns);
    let now = Utc::now();
    let obj_name = involved_object.name.as_deref().unwrap_or("unknown");
    let event_name = format!("{}.{}", obj_name, now.timestamp_nanos_opt().unwrap_or(0));

    let event = Event {
        metadata: ObjectMeta {
            name: Some(event_name),
            namespace: Some(ns.to_string()),
            ..Default::default()
        },
        involved_object,
        type_: Some(event_type.to_string()),
        reason: Some(reason.to_string()),
        message: Some(message.to_string()),
        event_time: Some(MicroTime(now)),
        reporting_component: Some("basilica-autoscaler".to_string()),
        reporting_instance: Some(
            std::env::var("HOSTNAME").unwrap_or_else(|_| "autoscaler".to_string()),
        ),
        action: Some(action.to_string()),
        count: Some(1),
        ..Default::default()
    };

    api.create(&PostParams::default(), &event).await?;
    Ok(())
}

This is optional since the current implementation works correctly.

crates/basilica-autoscaler/src/warm_pool.rs (1)

253-260: Normalize function differs from offering_matcher's normalize_gpu_model.

This normalize_gpu_type function filters to alphanumeric only, while offering_matcher::normalize_gpu_model does the same. They appear functionally identical, but having two separate implementations could lead to drift.

Consider re-exporting the normalization function from offering_matcher to ensure consistent behavior across modules:

-/// Normalize GPU type for comparison (uppercase, alphanumeric only)
-fn normalize_gpu_type(gpu_type: &str) -> String {
-    gpu_type
-        .chars()
-        .filter(|c| c.is_ascii_alphanumeric())
-        .collect::<String>()
-        .to_uppercase()
-}
+use crate::offering_matcher::normalize_gpu_model as normalize_gpu_type;
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)

366-428: Consider caching pods_by_node to avoid redundant API calls.

If warm pool evaluation ran (lines 338-363), collect_pods_by_node is called again at line 367 for idle scale-down. This results in duplicate Kubernetes API calls to list pods on every node.

🔎 Consider refactoring to reuse pods_by_node:

Collect pods_by_node once early in the reconciliation flow and pass it to both warm pool and idle scale-down logic to reduce API overhead.

crates/basilica-operator/src/controllers/user_deployment_controller.rs (1)

549-552: Consider logging when GPU spec serialization fails.

While serialization of GpuSpec is unlikely to fail, silent failures could make autoscaler issues difficult to diagnose.

🔎 Suggested improvement:
         // Annotation with full GPU requirements (JSON) for autoscaler
-        if let Ok(json) = serde_json::to_string(gpu) {
-            annotations.insert("autoscaler.basilica.ai/gpu-requirements".to_string(), json);
-        }
+        match serde_json::to_string(gpu) {
+            Ok(json) => {
+                annotations.insert("autoscaler.basilica.ai/gpu-requirements".to_string(), json);
+            }
+            Err(e) => {
+                debug!(error = %e, "Failed to serialize GPU requirements for autoscaler annotation");
+            }
+        }
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d14f5ac and 6636766.

📒 Files selected for processing (16)
  • crates/basilica-autoscaler/src/config.rs (5 hunks)
  • crates/basilica-autoscaler/src/controllers/k8s_client.rs (6 hunks)
  • crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (17 hunks)
  • crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (13 hunks)
  • crates/basilica-autoscaler/src/crd/mod.rs (1 hunks)
  • crates/basilica-autoscaler/src/crd/scaling_policy.rs (7 hunks)
  • crates/basilica-autoscaler/src/lib.rs (1 hunks)
  • crates/basilica-autoscaler/src/metrics.rs (5 hunks)
  • crates/basilica-autoscaler/src/offering_matcher.rs (1 hunks)
  • crates/basilica-autoscaler/src/warm_pool.rs (1 hunks)
  • crates/basilica-operator/src/controllers/node_profile_controller.rs (3 hunks)
  • crates/basilica-operator/src/controllers/user_deployment_controller.rs (7 hunks)
  • crates/basilica-operator/src/labels.rs (8 hunks)
  • orchestrator/k8s/services/autoscaler/crds.yaml (10 hunks)
  • orchestrator/k8s/services/autoscaler/image-prepuller.yaml (1 hunks)
  • orchestrator/k8s/services/autoscaler/scaling-policy.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • crates/basilica-operator/src/controllers/node_profile_controller.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/basilica-autoscaler/src/lib.rs
🧰 Additional context used
🧬 Code graph analysis (3)
crates/basilica-operator/src/controllers/user_deployment_controller.rs (1)
crates/basilica-operator/src/labels.rs (1)
  • normalize_gpu_models (222-229)
crates/basilica-autoscaler/src/warm_pool.rs (1)
crates/basilica-autoscaler/src/api/secure_cloud.rs (2)
  • gpu_memory_gb (216-218)
  • hourly_rate (221-223)
crates/basilica-autoscaler/src/metrics.rs (1)
crates/basilica-billing/src/domain/rentals.rs (1)
  • offering_id (223-225)
🔇 Additional comments (80)
orchestrator/k8s/services/autoscaler/image-prepuller.yaml (3)

1-10: LGTM!

Metadata and labels follow Kubernetes recommended labeling conventions with proper app.kubernetes.io/* annotations.


11-36: LGTM!

The update strategy with maxUnavailable: 100% is appropriate for a pre-pulling DaemonSet where availability doesn't matter. Security context is well-hardened with non-root user, seccomp profile, and proper group settings.


92-102: Remove the unused ConfigMap definition (lines 92-102).

The ConfigMap image-prepuller-config is not referenced by the DaemonSet or any other resource in the codebase. It duplicates the container images already specified in the initContainers section and serves no purpose.

Likely an incorrect or invalid review comment.

crates/basilica-autoscaler/src/crd/mod.rs (1)

7-13: LGTM!

The new CRD type exports (WireGuardPeerStatus, ImagePrepullConfig, PrepullImage, WarmPoolConfig, WarmPoolStatus) are properly organized and align with the dynamic GPU offering and warm-pool features introduced in this PR.

orchestrator/k8s/services/autoscaler/scaling-policy.yaml (2)

119-128: Verify the K3s server URL is the intended default.

The serverUrl points to connect.basilica.ai:6443. Ensure this is the correct default endpoint for your deployment environment. Users deploying to different clusters will need to update this value.


66-89: Well-designed warm pool configuration with proper hysteresis.

The warm pool configuration includes appropriate safeguards:

  • Cost cap (maxIdleCostPerHour: 15.0) prevents runaway spending
  • Hysteresis thresholds (80% up, 120% down) prevent scaling oscillation
  • Clear documentation for each parameter

Good default settings for production use.

crates/basilica-autoscaler/src/config.rs (3)

172-213: Well-structured configuration with sensible defaults.

The PendingPodFilterConfig is cleanly implemented with:

  • Consistent pattern with other config structs (environment-based with defaults)
  • Default implementation that mirrors from_env() defaults (good for testing)
  • Clear documentation on each field's purpose

The 20-minute max pending age is a reasonable safeguard against infinite retry loops.


253-258: LGTM!

The parse_u64_env helper follows the established pattern of other env parsing functions in this file.


227-228: LGTM - reasonable timeout for failed NodePool cleanup.

The 10-minute timeout provides adequate time for operators to investigate failures before automatic garbage collection, while preventing indefinite accumulation of failed resources.

crates/basilica-autoscaler/src/controllers/k8s_client.rs (3)

477-524: Event creation implementation looks correct.

The implementation properly constructs Kubernetes Events with appropriate fields. A few observations:

  1. Using timestamp_nanos_opt().unwrap_or(0) is a safe fallback for event name uniqueness.
  2. The HOSTNAME env var fallback to "autoscaler" is sensible for reporting_instance.

214-219: LGTM!

The delete_node_pool implementation follows the established pattern in this file.


430-436: LGTM!

The list_pods_in_namespace implementation is correct and follows the existing pattern.

crates/basilica-autoscaler/src/metrics.rs (4)

43-64: Comprehensive metrics coverage for the new features.

The new metrics provide good observability for:

  • Offering selection performance and outcomes
  • Cache hit/miss rates for debugging performance issues
  • Warm pool VRAM utilization and cost tracking
  • Scaling decision visibility

The metric naming follows Prometheus conventions consistently.


556-579: LGTM!

The set_warm_pool_metrics method cleanly updates all warm pool gauges from the status struct. Using a single method to update all related metrics ensures consistency in the metric snapshot.


220-227: Histogram buckets look appropriate for typical cache/selection operations.

The buckets (1ms to 1s) cover the expected range for in-memory cache lookups and quick API calls. If offering selection involves slower external API calls, you may want to add higher buckets (e.g., 2.0, 5.0) in the future based on observed p99 latencies.


201-209: Monitor for high cardinality on offering_id label.

The offering_selection_total metric includes offering_id as a label. If the number of unique offering IDs grows significantly, this could cause metric cardinality issues in Prometheus.

For GPU cloud offerings, this is likely fine (tens of unique IDs), but worth monitoring. If cardinality becomes problematic, consider:

  • Removing offering_id from the counter and tracking it separately
  • Using a histogram with offering metadata in a separate info metric
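
Example sketch of that split, assuming the prometheus crate; names mirror this discussion rather than the actual registrations in metrics.rs:

use prometheus::{IntCounterVec, IntGaugeVec, Opts, Registry};

fn register_offering_metrics(
    registry: &Registry,
) -> prometheus::Result<(IntCounterVec, IntGaugeVec)> {
    // Low-cardinality counter: outcome only.
    let selections = IntCounterVec::new(
        Opts::new("offering_selection_total", "Offering selections by outcome"),
        &["outcome"],
    )?;
    // Per-offering metadata lives on an info-style gauge set to 1, joinable in PromQL.
    let offering_info = IntGaugeVec::new(
        Opts::new("offering_info", "Static metadata about known offerings"),
        &["offering_id", "gpu_type"],
    )?;
    registry.register(Box::new(selections.clone()))?;
    registry.register(Box::new(offering_info.clone()))?;
    Ok((selections, offering_info))
}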
crates/basilica-autoscaler/src/warm_pool.rs (8)

9-17: Threshold calculation logic is sound.

The calculate_thresholds function correctly uses saturating conversion to prevent overflow when computing hysteresis thresholds from percentage values. The approach of clamping to u32::MAX before casting is appropriate for preventing wraparound on very large VRAM values.
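
Example sketch of the arithmetic; the signature and names are illustrative, not the module's actual function:

fn hysteresis_thresholds(target_vram_gb: u32, scale_up_pct: u32, scale_down_pct: u32) -> (u32, u32) {
    let apply = |pct: u32| -> u32 {
        // Widen to u64 for the multiply, then clamp before narrowing back.
        (target_vram_gb as u64 * pct as u64 / 100).min(u32::MAX as u64) as u32
    };
    (apply(scale_up_pct), apply(scale_down_pct))
}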


59-143: VRAM accounting logic is well-structured.

The calculate_vram_metrics function correctly:

  • Filters only Ready nodes with the warm-pool label
  • Calculates total VRAM from GPU memory and count
  • Tracks allocated VRAM by summing pod GPU requests per node
  • Identifies idle nodes (zero allocated GPUs)
  • Uses saturating_sub to prevent underflow when computing idle VRAM

145-183: Init container GPU handling correctly uses max instead of sum.

The count_pod_gpu_requests function properly handles init containers by taking the maximum GPU count rather than summing, since init containers run sequentially and don't consume GPUs concurrently. This matches Kubernetes resource accounting semantics.
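
Example sketch mirroring the standard Kubernetes effective-request rule (whether the warm-pool code combines regular and init requests exactly this way is an assumption):

fn effective_gpu_request(regular_requests: &[u32], init_requests: &[u32]) -> u32 {
    // Init containers run sequentially, so the pod needs
    // max(sum of regular requests, largest single init request).
    let regular_sum: u32 = regular_requests.iter().sum();
    let init_max: u32 = init_requests.iter().copied().max().unwrap_or(0);
    regular_sum.max(init_max)
}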


210-219: Offering comparison uses total node cost, not per-GPU cost.

The sorting correctly computes total_hourly_rate = hourly_rate_per_gpu * gpu_count for comparison. This ensures offerings are compared by their actual cost impact on the warm pool budget rather than just per-GPU rates.


221-235: Consider edge case: vram_per_node could be zero.

If gpu_memory_gb() returns 0 (the default when gpu_memory_gb_per_gpu is None), the loop will never meet the VRAM target and will be constrained only by max_idle_nodes and cost limits. This may be intentional but could lead to unexpected behavior.

Consider adding a guard or debug log when vram_per_node == 0:

🔎 Suggested defensive check
     if let Some(best_offering) = offerings.first() {
         let vram_per_node = best_offering.gpu_memory_gb() * best_offering.gpu_count;
         let total_hourly_rate = best_offering.hourly_rate_per_gpu * best_offering.gpu_count as f64;
+        
+        if vram_per_node == 0 {
+            debug!("Warning: best offering has 0 VRAM per node, cannot meet VRAM target");
+            return WarmPoolTarget {
+                nodes: 0,
+                total_vram_gb: 0,
+                estimated_hourly_cost: 0.0,
+            };
+        }

         while total_vram < target_vram_gb

357-360: Node removal selection is simplistic but safe.

select_node_for_removal returns the first idle node. This is simple but may not be optimal for cost savings (e.g., prefer removing the most expensive idle node). However, given the conservative one-at-a-time scale-down approach at line 315, this is acceptable.
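
Example sketch of a cost-aware alternative, with illustrative types:

struct IdleNode {
    name: String,
    hourly_cost: f64,
}

fn select_node_for_removal(idle: &[IdleNode]) -> Option<&str> {
    // Prefer removing the most expensive idle node first.
    idle.iter()
        .max_by(|a, b| {
            a.hourly_cost
                .partial_cmp(&b.hourly_cost)
                .unwrap_or(std::cmp::Ordering::Equal)
        })
        .map(|node| node.name.as_str())
}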


384-400: GPU memory mapping covers common types but may need updates.

The gpu_type_to_memory lookup table includes major GPU types (H100, A100, L40S, RTX4090, etc.) with their correct memory sizes. The fallback to 24GB is conservative.

As new GPU types are released, this table may need updates. Consider whether this mapping should be driven by the offerings API data instead of hardcoded values.


402-634: Comprehensive test coverage for warm pool logic.

The test suite covers:

  • Preference ordering and normalization
  • GPU type to memory mappings
  • Disabled warm pool behavior
  • Scaling decisions (up, down, no-action in hysteresis band)
  • Deference to reactive scaling when pending pods exist
  • Status building
crates/basilica-autoscaler/src/crd/scaling_policy.rs (4)

45-59: New optional configuration fields properly integrated.

The warm_pool, image_prepull, and offering_constraints fields are correctly:

  • Marked as Option<T> for backward compatibility
  • Using skip_serializing_if = "Option::is_none" to avoid cluttering serialized output
  • Using #[serde(default)] to handle deserialization of older CRs

158-225: WarmPoolConfig has well-chosen defaults.

The default values are sensible:

  • min_idle_vram_gb: 80 - equivalent to 1x A100-80GB
  • max_idle_nodes: 3 - reasonable upper bound
  • max_idle_cost_per_hour: 10.0 - prevents runaway costs
  • Hysteresis band of 80-120% provides good stability

269-315: SecureCloudTemplate properly supports dynamic offering selection.

Making offering_id optional (line 277) with updated documentation correctly enables dynamic offering selection based on pending pod requirements. The new datacenter_id and ssh_user fields with defaults complete the provisioning configuration.


399-442: WarmPoolStatus captures complete VRAM accounting state.

The status struct includes all metrics needed for observability:

  • Current VRAM state (total, allocated, idle)
  • Target and thresholds
  • Node counts and names
  • Cost estimation

This enables operators to debug warm pool behavior via kubectl get scalingpolicy -o yaml.

orchestrator/k8s/services/autoscaler/crds.yaml (7)

222-229: New sshUser field correctly added to NodePool secureCloud.

The field has the correct default value (ubuntu) matching the Rust struct definition, and appropriate description.


276-285: WireGuard default changed from false to true.

This is a potentially breaking change for existing deployments. Nodes created after this change will have WireGuard enabled by default, which may fail if the WireGuard infrastructure isn't set up.

Verify that:

  1. Existing NodePool CRs without explicit wireguard.enabled won't have their behavior changed (Kubernetes preserves existing values)
  2. Documentation is updated to reflect this default change
  3. The WireGuard setup is robust enough to handle being enabled by default

384-387: labelsApplied status field enables idempotent label management.

This boolean flag prevents race conditions between node readiness and GPU label application. The description is clear about its purpose.


482-504: WireGuard peers array structure is complete.

The peer status includes all necessary fields:

  • endpoint and publicKey (required)
  • wireguardIp and vpcSubnet (required for routing)
  • routePodNetwork (optional, defaults to false)

716-774: SecureCloudTemplate in ScalingPolicy correctly mirrors NodePool additions.

The datacenterId, offeringId, and sshUser fields are properly added with matching types and defaults. Note that datacenterId is now required (line 771), which is appropriate for the ScalingPolicy template.


873-916: WarmPool CRD spec matches Rust struct definition.

All fields, types, defaults, and descriptions match the WarmPoolConfig struct in scaling_policy.rs. The JSON schema validation constraints (minimum: 0.0 for integer fields) are correctly applied.


1030-1093: WarmPoolStatus in CRD matches Rust struct definition.

All status fields are present with correct types and defaults. This enables kubectl inspection of warm pool state.

crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (15)

21-101: Generic OfferingSelector integration is well-designed.

The controller now properly supports optional offering selection through:

  • Generic type parameter S with default ()
  • Separate new() constructor for non-offering-selector use
  • with_offering_selector() for full capability
  • MaybeOfferingSelector trait bound for conditional invalidation

This maintains backward compatibility while enabling dynamic offering selection.


103-120: Offering invalidation on rental failure is a good resilience pattern.

The invalidate_offering_on_failure method removes failed offerings from cache to prevent retry loops with broken offerings. The method gracefully handles the case when no offering selector is present.


142-148: Phase timeout check correctly skips infinite-timeout phases.

Using u64::MAX as a sentinel for "no timeout" and explicitly checking for it prevents spurious timeouts on Ready, Failed, and Deleted phases.
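
As a rough illustration of the sentinel check (phase names and the concrete timeout values here are invented for the example, not taken from the controller):

use std::time::Duration;

#[derive(Debug, PartialEq)]
enum PhaseSketch { Provisioning, Registering, Ready, Failed, Deleted }

// Phases that never time out report u64::MAX as a sentinel.
fn phase_timeout_secs(phase: &PhaseSketch) -> u64 {
    match phase {
        PhaseSketch::Provisioning => 15 * 60,
        PhaseSketch::Registering => 10 * 60,
        PhaseSketch::Ready | PhaseSketch::Failed | PhaseSketch::Deleted => u64::MAX,
    }
}

fn is_phase_timed_out(phase: &PhaseSketch, elapsed: Duration) -> bool {
    let timeout = phase_timeout_secs(phase);
    // Explicitly skip the sentinel so steady-state phases can never "expire".
    timeout != u64::MAX && elapsed.as_secs() > timeout
}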


356-400: GPU metadata persistence before rental prevents data loss.

Fetching offering details and persisting them to status immediately (lines 396-399) ensures GPU metadata survives process crashes. This is critical for correct node labeling after recovery.

Note: The past review concern about the timing gap between fetching offering metadata and starting the rental is mitigated by the offering-invalidation-on-failure mechanism.


402-475: Rental handling correctly prevents duplicate rentals.

The flow:

  1. Check rental_id first before creating a new rental (line 404)
  2. Persist rental_id immediately after creation (lines 460-462)
  3. Handle async IP assignment with retry logic

This prevents orphaned VMs on crash/retry scenarios.


512-541: Node registration includes GPU specs from status.

The gpu_specs struct correctly uses status values with sensible defaults for driver/CUDA versions. The WireGuard peer information is properly extracted and converted to status format.


633-651: WireGuard peer configuration uses registration response.

Converting WireGuardPeerStatus from status to WireGuardPeer for provisioner configuration ensures peers are configured from the authoritative source (API registration response) rather than a separate API call.

The warning at line 647 when peers are empty is helpful for debugging connectivity issues.


765-854: GPU label application race condition is properly addressed.

The implementation:

  1. Applies labels immediately when node is detected (before Ready)
  2. Tracks labels_applied flag in status
  3. Validates labels before transitioning to Ready
  4. Re-applies if validation fails

This prevents pods from failing to schedule during the brief window between node Ready and label application.


876-885: Label re-application in Ready phase handles recovery scenarios.

Verifying and re-applying labels on each Ready reconciliation handles cases where labels were removed (e.g., during Unhealthy recovery or manual intervention).


1030-1082: Failed NodePool GC with configurable timeout.

The handle_failed implementation:

  1. Waits for FAILED_GC_TIMEOUT before cleanup
  2. Stops rental to prevent cost leakage
  3. Deregisters node from API
  4. Deletes K8s node if created
  5. Deletes NodePool CR

This prevents stuck Failed resources from accumulating while giving operators time to investigate.


1084-1123: handle_deleted cleans up stuck resources.

This handles NodePools that reached Deleted phase but weren't removed (pre-fix). The idempotent cleanup ensures resources are freed even for legacy CRs.


1402-1420: Label sanitization correctly handles Kubernetes constraints.

The sanitize_label_value function:

  • Replaces invalid characters with -
  • Trims leading/trailing non-alphanumeric characters
  • Produces valid Kubernetes label values
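
A minimal sketch of such a sanitizer is shown below (the real implementation may differ in its allowed character set and truncation behaviour; this version also applies the 63-character Kubernetes limit):

fn sanitize_label_value(input: &str) -> String {
    let replaced: String = input
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.') { c } else { '-' })
        .take(63) // Kubernetes caps label values at 63 characters.
        .collect();
    // Label values must start and end with an alphanumeric character.
    replaced
        .trim_matches(|c: char| !c.is_ascii_alphanumeric())
        .to_string()
}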

1422-1470: GPU label building is comprehensive.

The build_gpu_labels function correctly:

  • Sets required node-type: gpu label
  • Normalizes GPU model using normalize_gpu_model
  • Sanitizes datacenter value for label compliance
  • Prefers status values over spec values for dynamic provisioning

1472-1485: Label comparison correctly detects missing or mismatched labels.

The labels_need_reapplication function checks both existence and value equality, which is necessary for detecting partial label application or value drift.
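
Conceptually, the check reduces to something like this sketch (map type and names simplified):

use std::collections::BTreeMap;

// Reapply when any expected label is missing or has drifted from its expected value.
fn labels_need_reapplication(
    expected: &BTreeMap<String, String>,
    actual: &BTreeMap<String, String>,
) -> bool {
    expected.iter().any(|(key, value)| actual.get(key) != Some(value))
}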


1561-1676: Comprehensive test coverage for label reapplication logic.

Tests cover:

  • All labels present and matching
  • Missing labels
  • Mismatched values
  • Empty expected labels
  • Node with no labels
crates/basilica-autoscaler/src/offering_matcher.rs (13)

16-29: Node label constants follow existing conventions.

The node_labels module defines constants matching the basilica.ai/ prefix convention used throughout the codebase. Including both standard labels (GPU_MODEL, GPU_COUNT) and autoscaler-specific ones (OFFERING_ID) is appropriate.


42-51: Normalization removes all non-alphanumeric characters.

The normalize_gpu_model function converts to uppercase and strips non-alphanumerics. This provides consistent matching, but note that "A100-40GB" becomes "A10040GB", which differs from how the operator's extract_short_gpu_model would handle it.
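
A sketch of the described normalization (behaviour inferred from this review, not copied from the source):

fn normalize_gpu_model(model: &str) -> String {
    model
        .chars()
        .filter(|c| c.is_ascii_alphanumeric())
        .collect::<String>()
        .to_ascii_uppercase()
}

// "A100-40GB" -> "A10040GB"; "Tesla A100 SXM4 80GB" -> "TESLAA100SXM480GB"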


153-174: MaybeOfferingSelector trait enables optional selector pattern.

The blanket implementation for () (no-op) and the forwarding implementation for T: OfferingSelector provides a clean abstraction for controllers that may or may not have an offering selector.
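
The crate reportedly achieves this with a blanket impl over T: OfferingSelector; the self-contained sketch below uses direct impls instead (to sidestep coherence details), but illustrates the same "optional capability via a unit-type default" idea. All names here are placeholders:

trait MaybeOfferingSelectorSketch {
    fn maybe_invalidate(&self, offering_id: &str);
}

// `()` means "no selector configured": every call is a no-op.
impl MaybeOfferingSelectorSketch for () {
    fn maybe_invalidate(&self, _offering_id: &str) {}
}

// Stand-in for the real OfferingMatcher: forwards to cache invalidation.
struct MatcherSketch;

impl MaybeOfferingSelectorSketch for MatcherSketch {
    fn maybe_invalidate(&self, offering_id: &str) {
        println!("invalidating cached offering {offering_id}");
    }
}

// Controllers stay generic over S and default to `()`.
struct ControllerSketch<S: MaybeOfferingSelectorSketch = ()> {
    selector: S,
}

impl<S: MaybeOfferingSelectorSketch> ControllerSketch<S> {
    fn on_rental_failure(&self, offering_id: &str) {
        // No-op when S = (); cache invalidation when a matcher is wired in.
        self.selector.maybe_invalidate(offering_id);
    }
}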


269-305: Cache implementation uses double-checked locking correctly.

The get_offerings method:

  1. Fast path: read lock, check TTL
  2. Slow path: write lock, re-check TTL (prevents thundering herd)
  3. Fetch with timeout
  4. Update cache

This is a correct and efficient caching pattern.
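
A stripped-down sketch of that double-checked pattern, using tokio and anyhow as assumed dependencies and placeholder types (the real cache additionally bounds the fetch with a timeout):

use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::RwLock;

// Placeholder offering type; the real crate caches its API response type.
#[derive(Clone)]
struct OfferingSketch { id: String }

struct CachedOfferings {
    offerings: Arc<Vec<OfferingSketch>>,
    fetched_at: Instant,
}

struct OfferingCacheSketch {
    inner: RwLock<Option<CachedOfferings>>,
    ttl: Duration,
}

impl OfferingCacheSketch {
    async fn get_offerings<F, Fut>(&self, fetch: F) -> anyhow::Result<Arc<Vec<OfferingSketch>>>
    where
        F: FnOnce() -> Fut,
        Fut: std::future::Future<Output = anyhow::Result<Vec<OfferingSketch>>>,
    {
        // Fast path: read lock, serve from cache while within TTL.
        {
            let guard = self.inner.read().await;
            if let Some(cached) = guard.as_ref() {
                if cached.fetched_at.elapsed() < self.ttl {
                    return Ok(Arc::clone(&cached.offerings));
                }
            }
        }
        // Slow path: write lock, re-check in case another task refreshed first.
        let mut guard = self.inner.write().await;
        if let Some(cached) = guard.as_ref() {
            if cached.fetched_at.elapsed() < self.ttl {
                return Ok(Arc::clone(&cached.offerings));
            }
        }
        // Fetch and update the cache.
        let offerings = Arc::new(fetch().await?);
        *guard = Some(CachedOfferings {
            offerings: Arc::clone(&offerings),
            fetched_at: Instant::now(),
        });
        Ok(offerings)
    }
}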


307-327: Forward-prefix matching only is a safe choice.

The matches_model method intentionally only supports:

  • Exact match: "A100" == "A100"
  • Forward prefix: "A100" matches "A100SXM4"

Not supporting reverse prefix (offering "A100" matching request "A100SXM4") prevents incorrect node selection.
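
In sketch form, operating on already-normalized model names (assumed helper, not the crate's exact signature):

// Exact match, or the offering model extends the requested prefix
// (request "A100" matches offering "A100SXM4", but not the reverse).
fn matches_model(requested: &str, offering: &str) -> bool {
    // starts_with covers the exact-match case as well.
    offering.starts_with(requested)
}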


385-400: Cache invalidation uses CoW pattern efficiently.

The invalidate_offering method:

  1. Checks if offering exists (fast path)
  2. Uses Arc::try_unwrap to avoid clone when possible
  3. Retains non-matching offerings

This minimizes allocations for the common case.


417-424: is_cache_stale correctly handles lock contention.

Returning false when the lock can't be acquired avoids a thundering herd: if another task holds the write lock, it is likely already refreshing the cache.


427-445: Offering comparison prioritizes exact GPU count match.

The compare_offerings function correctly:

  1. Prefers exact GPU count match
  2. Then cheaper hourly rate
  3. Then lower GPU count (avoid over-provisioning)
  4. Finally by ID for determinism

Using total_cmp handles NaN/infinity correctly.
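
A sketch of that comparator over a simplified offering shape (field names are placeholders):

use std::cmp::Ordering;

struct OfferingInfoSketch { id: String, gpu_count: u32, hourly_rate: f64 }

// Exact GPU-count match first, then cheaper rate, then fewer GPUs, then ID.
fn compare_offerings(a: &OfferingInfoSketch, b: &OfferingInfoSketch, requested_gpus: u32) -> Ordering {
    let a_exact = a.gpu_count == requested_gpus;
    let b_exact = b.gpu_count == requested_gpus;
    b_exact
        .cmp(&a_exact) // exact matches sort first
        .then(a.hourly_rate.total_cmp(&b.hourly_rate)) // NaN-safe price comparison
        .then(a.gpu_count.cmp(&b.gpu_count)) // avoid over-provisioning
        .then_with(|| a.id.cmp(&b.id)) // deterministic tie-break
}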


504-547: has_gpu_node_affinity checks both required and preferred terms.

The function correctly examines both requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution for GPU-related labels.


717-760: Failing pod detection covers common failure reasons.

The FAILING_WAIT_REASONS list includes:

  • CrashLoopBackOff
  • ImagePullBackOff / ErrImagePull
  • CreateContainerError / CreateContainerConfigError
  • InvalidImageName

This prevents over-provisioning for deployments with fundamental issues.


774-807: Failing deployment filter prevents wasted node provisioning.

The filter_pods_from_failing_deployments function uses owner UID matching to identify sibling pods from the same ReplicaSet. This prevents provisioning new nodes for deployments already known to be broken.


831-852: Node calculation uses ceiling division correctly.

The calculate_nodes_needed function:

  1. Guards against division by zero
  2. Uses div_ceil for correct node count calculation
  3. Warns when offering has fewer GPUs than pod requests
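
In sketch form (signature simplified; the real function takes richer pod/offering inputs):

// Ceiling division with a zero guard; the real code also logs a warning
// when the offering has fewer GPUs per node than a single pod requests.
fn calculate_nodes_needed(total_gpus_requested: u32, gpus_per_node: u32) -> u32 {
    if gpus_per_node == 0 {
        return 0;
    }
    total_gpus_requested.div_ceil(gpus_per_node)
}

// e.g. 5 requested GPUs on a 2-GPU offering -> 3 nodes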

854-1509: Comprehensive test suite covers edge cases.

The test module covers:

  • Normalization behavior
  • Requirements hashing/equality
  • Offering comparison
  • Node calculation scenarios
  • Constraint filtering
  • Cache operations
  • Age/stale affinity/failing deployment filters

The tests use well-structured helper functions and mock implementations.

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (11)

31-96: LGTM: Clean generic parameterization and constructor design.

The addition of the OfferingSelector generic parameter and pending_pod_filter field is well-integrated. The dual constructor pattern (new with defaults + with_pending_pod_filter for customization) provides good flexibility.


120-183: Excellent safeguards against duplicate provisioning.

The provisioning count tracking (lines 120-152) correctly identifies NodePools in transitional phases and resets stuck counters. Blocking scale-up when provisioning_count > 0 (lines 162-183) prevents creating duplicate VMs while waiting for node registration—a critical reliability improvement.


240-275: Good error handling with status update.

The code correctly updates the OfferingAvailability condition before returning the NoMatchingOffering error (lines 265-267), ensuring status reflects the failure reason for diagnostics.


437-608: LGTM: Well-structured warm pool evaluation.

The warm pool scaling logic is clear and correct:

  • Properly skips evaluation when nodes are provisioning (lines 473-492)
  • Handles missing offerings gracefully (line 514)
  • Correctly differentiates warm pool nodes from reactive nodes via labeling
  • Updates status conditions consistently across all decision paths

610-727: LGTM: Efficient metrics collection with clear filter pipeline.

The metrics collection correctly:

  • Returns both the snapshot and raw filtered pods for downstream processing
  • Conditionally fetches namespace pods only when skip_failing_deployments is enabled (lines 621-635)
  • Applies filters in a sensible order with clear logging of filter impact (lines 715-724)

806-1015: LGTM: Robust dynamic offering selection with multi-tier fallback.

The scale-up logic correctly:

  • Supports both static (offering_id in config) and dynamic (selector-based) modes
  • Refreshes offering cache when stale (lines 873-880)
  • Implements a 3-tier fallback strategy (exact match → model fallback → explicit fallback_offering_id) with clear diagnostics
  • Respects max_count across all pod groups
  • Properly emits events and metrics for observability

1137-1228: LGTM: Clear event emission for user visibility.

The event emission methods provide valuable user feedback:

  • emit_missing_affinity_warnings (lines 1137-1183) alerts users when GPU model requests lack nodeAffinity, preventing scheduling confusion
  • emit_no_offering_events (lines 1185-1228) informs users why their pods remain pending
  • Both handle errors gracefully with warning logs rather than propagating failures

1230-1254: LGTM: Correct constraint merging with policy precedence.

The build_effective_constraints method properly merges template-level and policy-level constraints, with policy constraints taking precedence when both are set. This provides flexibility while maintaining a clear override hierarchy.


1303-1461: LGTM: Well-implemented warm pool helper functions.

The warm pool helpers are correctly implemented:

  • count_warm_pool_nodes and count_pending_warm_pool_nodes use consistent labeling and phase checks
  • find_idle_nodes_for_scale_down properly excludes warm pool nodes and checks idle duration + pod presence
  • select_warm_pool_offering implements sensible preference ordering (preferred types → cost)

1463-1518: LGTM: Correct NodePool creation with warm-pool support.

The create_node_pool_from_template method correctly:

  • Applies warm-pool label conditionally based on is_warm_pool parameter
  • Propagates offering_id to the SecureCloudConfig
  • Extracts and sets datacenter_id from the template when available
  • Maintains proper labeling for autoscaler management

1528-1588: LGTM: Clean helper types and functions.

  • OfferingResult enum clearly represents the three possible outcomes of offering selection
  • requests_gpu correctly checks both regular and init containers for GPU requests
  • add_condition follows the standard Kubernetes pattern for updating or appending conditions
crates/basilica-operator/src/controllers/user_deployment_controller.rs (2)

1055-1133: LGTM – well-designed stale ReplicaSet cleanup.

The logic correctly identifies and scales down old ReplicaSets that have pending pods but no ready replicas, which prevents the autoscaler from provisioning nodes for outdated pod specs. Error handling is non-blocking with appropriate debug logging.


1978-2074: Good test coverage for GPU model normalization and deduplication.

The tests appropriately verify that:

  • Model names with memory suffixes are normalized (e.g., A100-40GB → A10040GB)
  • Case-insensitive deduplication works correctly
  • Order is preserved (first occurrence wins)
crates/basilica-operator/src/labels.rs (4)

142-165: Well-implemented normalization function.

The implementation correctly:

  • Trims whitespace
  • Returns a sensible default for empty/whitespace input
  • Normalizes to uppercase alphanumeric for consistent matching

222-229: LGTM – correct deduplication with order preservation.

The implementation properly deduplicates normalized models while preserving order (first occurrence wins). The clone() in the filter is necessary for HashSet ownership and is acceptable for typical small model lists.


478-556: Comprehensive test coverage for normalization functions.

The tests appropriately cover:

  • Case-insensitive matching in extract_short_gpu_model
  • Edge cases for normalize_gpu_model (empty, whitespace, various formats)
  • Deduplication, order preservation, and case-insensitive handling in normalize_gpu_models

310-314: Verify GPU model label normalization strategy across operator and autoscaler.

The code at lines 310-314 uses normalize_gpu_model() to set node labels, converting "Tesla-A100-SXM4-80GB" → "TESLAA100SXM480GB". However, in user_deployment_controller.rs, pod affinity rules normalize user-specified models the same way, so user input "A100" remains "A100".

This creates a label mismatch:

  • Node label value: TESLAA100SXM480GB
  • Pod affinity value: A100

The extract_short_gpu_model() function exists but is unused. Confirm whether:

  1. This labeling strategy is intentional (users must request full normalized names)
  2. The offering_matcher in basilica-autoscaler is the actual matching mechanism (which normalizes both sides using the same function)
  3. Node labels should instead use extract_short_gpu_model() for backward compatibility

* Format debug/info/error macro calls with multi-line arguments
* Use is_some_and instead of map_or for clearer Option checking
* Format method chains and string literals for readability
* No functional changes
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
crates/basilica-payments/src/reconciliation/service.rs (3)

483-587: Balance heuristics remain insufficient for confirming sweep finalization.

The stale sweep reconciliation still marks sweeps as Confirmed using only an 80% balance-drop threshold. As previously noted, this approach doesn't verify transaction finalization or query the blockchain for tx_hash status. In fork/reorg scenarios, a transaction could revert after the balance temporarily changed, yet the code would still mark it as confirmed.

The previous review recommended storing finalization status when transfers are submitted, or querying the blockchain client via tx_hash to verify finalization before marking a sweep as Confirmed.


262-277: Idempotency check helps but doesn't eliminate TOCTOU.

The check for pending sweeps reduces duplicate sweep attempts. However, the TOCTOU window persists between creating the sweep record (lines 328-344) and executing the transfer (lines 374-378). The past review recommended account-level locking or database constraints to fully close this gap.


295-326: Balance re-verification narrows but doesn't eliminate TOCTOU.

The balance re-check before creating the sweep record is an improvement, but as previously noted, a race condition remains between verification and the blockchain transfer execution. Between creating the sweep record and executing the transfer, deposits or other transactions could alter the balance, causing transfer failures or over-sweeping.

🧹 Nitpick comments (8)
crates/basilica-payments/src/config.rs (1)

426-445: Validation logic is sound; consider wording consistency.

The validation correctly ensures:

  • Positive values for retry counts and retention thresholds
  • Logical relationship between max_block_gap and block_retention_threshold

Minor observation: Line 441 uses "should be" while other validation messages use "must be". Since this is a hard validation error (returns Err), consider using "must be" for consistency with other validation messages.

🔎 Optional wording adjustment for consistency:
             details: format!(
-                "blockchain.max_block_gap ({}) should be >= block_retention_threshold ({})",
+                "blockchain.max_block_gap ({}) must be >= block_retention_threshold ({})",
                 self.blockchain.max_block_gap, self.blockchain.block_retention_threshold
             ),
crates/basilica-payments/src/blockchain/client.rs (2)

175-208: LGTM! Proper finalization flow with transaction success verification.

The sequence of waiting for finalization, extracting block info, and verifying success via wait_for_success is the correct pattern for ensuring transaction integrity.

Minor note: Lines 187-188 use .ok() to silently convert block query failures to None. This is acceptable since block_number is optional, but consider logging at debug/warn level when the block query fails, to aid troubleshooting.


102-124: Consider adding context about version compatibility for the Debug format fallback.

The string-based fallback parsing (lines 102-124) uses format!("{:?}", value) and relies on unstable Debug representation. Derived Debug formats may change with future Rust versions. While the error handling is appropriately defensive, add a comment explaining which chain versions require this fallback or what specific metadata mismatch it accommodates. This provides clarity for future maintenance and indicates whether updating chain metadata would eliminate this workaround.

crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (5)

402-475: Consider handling externally deleted rentals.

The logic correctly prevents duplicate rentals and handles IP polling. However, if a rental exists in status but get_rental returns None (lines 430-434), this could indicate the rental was deleted externally (e.g., manually via provider console). Currently, this returns an error for retry, but the rental may never reappear.

Consider transitioning to Failed phase in this scenario, or implement a retry counter to detect persistent rental absence.


488-504: GPU spec defaults may cause scheduling issues.

The defaults (model="GPU", memory_gb=0) could lead to pod scheduling failures if the status doesn't contain the expected GPU metadata. However, since the offering fetch at lines 356-400 should populate these fields, the defaults should rarely be used.

Consider adding debug or warn logging when defaults are applied to detect potential bugs in the offering fetch flow.


553-578: Consider extracting duplicate IP polling logic.

The IP polling logic here is nearly identical to lines 404-437 in handle_provisioning. Consider extracting a helper method like ensure_external_ip(&self, ns: &str, pool: &NodePool, status: &mut NodePoolStatus) -> Result<String> to reduce duplication and simplify maintenance.


764-814: Thorough but complex labeling logic.

The two-stage label application (early application on detection + validation before Ready transition) correctly prevents the race condition described in comments. However, the deeply nested conditionals make this section complex.

Consider extracting the labeling logic into helper methods like apply_node_labels_if_needed and validate_labels_before_ready to improve readability while preserving the correctness guarantees.


1397-1415: Handle edge case: empty label values.

If the input contains no alphanumeric characters, sanitize_label_value returns an empty string. While Kubernetes allows empty label values, this might indicate a data issue upstream.

Consider adding a check to return a default value (e.g., "unknown") or logging a warning when sanitization produces an empty string.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6636766 and 57c656c.

📒 Files selected for processing (4)
  • crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (17 hunks)
  • crates/basilica-payments/src/blockchain/client.rs (3 hunks)
  • crates/basilica-payments/src/config.rs (6 hunks)
  • crates/basilica-payments/src/reconciliation/service.rs (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
crates/basilica-payments/src/reconciliation/service.rs (1)
crates/basilica-payments/src/blockchain/client.rs (1)
  • account_hex (79-79)
🔇 Additional comments (24)
crates/basilica-payments/src/config.rs (6)

79-107: LGTM! Well-documented blockchain configuration fields.

The new blockchain configuration fields are clearly documented with sensible defaults. The time-based context in comments (e.g., "~51 minutes", "~1 hour") helps operators understand the practical implications of these values.


142-149: LGTM! Good defensive configuration for rate limiting.

The max_sweeps_per_cycle field provides important protection against transaction flooding. The default value of 50 strikes a reasonable balance between throughput and safety.


259-262: LGTM! Default values properly initialized.

The new configuration fields are correctly initialized with values that match their respective default helper functions.

Also applies to: 288-288


378-397: LGTM! Excellent security validation for AEAD keys.

This validation properly prevents the use of default all-zeros AEAD keys in production or when reconciliation is enabled. The error messages clearly guide operators to set a secure key via environment variables.


399-418: LGTM! Comprehensive AEAD key format validation.

The validation correctly ensures the AEAD key is exactly 64 hexadecimal characters (32 bytes) and contains only valid hex digits. This prevents configuration errors that would lead to runtime failures.


604-675: LGTM! Comprehensive test coverage for AEAD key validation.

The test suite thoroughly covers all validation scenarios:

  • Default key rejection in production and with reconciliation enabled
  • Valid key acceptance
  • Format validation (length and character constraints)
  • Development mode allowances

This excellent coverage ensures the security validations work correctly across different environment configurations.

crates/basilica-payments/src/reconciliation/service.rs (3)

103-111: Good addition of pre-cycle health checks.

Reconciling stale sweeps and monitoring the cold wallet before each cycle improves operational visibility. Treating failures as warnings ensures the main sweep cycle proceeds even if these checks fail.


142-165: LGTM: Per-cycle sweep rate limiting.

The rate limit prevents processing too many sweeps in a single cycle, which is good for load management. The early break with logging provides clear operational feedback.


439-481: Monitor implementation is correct.

The cold wallet monitoring logic properly converts SS58 to hex and queries the balance. Both sp_core and subxt are declared as workspace dependencies in Cargo.toml, confirming that the imports (subxt::ext::sp_core::crypto::Ss58Codec and sp_core::sr25519::Public) resolve correctly. The inline use statements and overall structure are sound.

crates/basilica-payments/src/blockchain/client.rs (2)

79-87: LGTM! Proper handling of non-existent accounts.

Returning 0 balance for accounts not found on chain is the correct behavior, and the debug logging provides useful diagnostics.


219-219: This is not a breaking change for external consumers.

The block_number field addition is safe because TransferReceipt is only constructed internally within client.rs (at line 202), where all fields are already being specified. External code receives the struct as a return value and reads its public fields without constructing instances. If future fields are expected to be added frequently, consider applying #[non_exhaustive] for better forward compatibility, but it's not required for this change.

Likely an incorrect or invalid review comment.

crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (13)

21-120: LGTM: Clean generic design for optional offering selector.

The generic parameter approach allows the controller to optionally integrate an offering selector without breaking existing code. The trait bounds are appropriate and the implementation correctly propagates the selector through Clone and constructors.


142-147: LGTM: Clean sentinel value for infinite timeouts.

Using u64::MAX as a sentinel for phases without timeout (Ready, Failed, Deleted) is clear and avoids additional Option wrapping.


356-400: Good improvement: Early offering fetch and persistence.

Fetching and persisting offering metadata immediately (before rental start) prevents data loss on crashes and provides better visibility through the OfferingUnavailable event. The invalidation logic at line 446 further mitigates the timing gap issue noted in previous reviews.


514-536: LGTM: Correct WireGuard peer mapping.

The conversion from API response types to CRD status types is straightforward and handles missing wireGuard data appropriately with unwrap_or_default.


633-651: LGTM: Appropriate warning for missing peers.

The warning at line 647 provides good visibility if the registration response lacks peers, which would prevent WireGuard from routing traffic correctly.


877-886: LGTM: Clean label reapplication for recovery scenarios.

Using the labels_need_reapplication helper makes this recovery logic concise and clear.


1021-1028: LGTM: Correct lifecycle completion.

Deleting the CR after transitioning to Deleted phase correctly triggers finalizer removal and completes the lifecycle.


1030-1077: LGTM: Proper auto-GC for failed NodePools.

The timeout-based garbage collection prevents accumulation of failed resources while allowing time for investigation. Error handling appropriately logs warnings without blocking cleanup.


1079-1118: LGTM: Recovery path for stuck Deleted NodePools.

This handler provides a clean migration path for NodePools that reached Deleted phase before the CR deletion fix was implemented.


1282-1288: LGTM: Appropriate SSH username resolution.

The precedence order and "ubuntu" fallback are sensible defaults for cloud VM provisioning.


1417-1465: LGTM: Comprehensive GPU label construction.

The label building logic correctly handles optional fields and applies sanitization where needed. The precedence order (status > spec) is appropriate for dynamic provisioning.


1467-1480: LGTM: Correct label comparison logic.

The function correctly identifies when labels need reapplication by checking for missing or mismatched values.


1557-1670: LGTM: Comprehensive test coverage.

The tests cover all relevant scenarios for labels_need_reapplication, including edge cases like empty expected labels and nodes without labels.

* Add /target to gitignore for migrations directory
* Prevent Rust build artifacts from being tracked in version control
* Acquire RwLock guards in consistent order (client first, then rpc_methods)
* Ensures lock ordering is predictable to prevent deadlock scenarios
* Follows Rust async lock best practices for concurrent access patterns
* Create row_to_sweep function to convert PgRow to ReconciliationSweep
* Replace duplicate mapping closures in list_sweeps and list_sweeps_for_account
* Follows DRY principle by consolidating row-to-struct conversion logic
* Replace comma with underscore in gpu-request.basilica.ai/model label
* Commas are invalid in Kubernetes label values per K8s naming conventions
* Add explicit debug logging for GPU spec JSON serialization failures
* Add unit test to verify multi-model GPU labels use valid K8s format
…aults

* Add common.rs module with shared retry logic, env parsing, wait functions
* Consolidate WaitResult enum, create_with_retry, parse_env_vars, wait_for_ready
* Remove duplicated code from sglang.rs and vllm.rs templates
* Update default GPU from RTX A4000 (16GB) to A100 (40GB) for model sizing
* Simplify GPU recommendation to use canonical A100/H100 model names
* Update wait_for_ready to accept service_name parameter for dynamic messages
* Export common module from templates mod.rs
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
crates/basilica-payments/src/storage/reconciliation.rs (1)

200-220: Implementation looks correct; index optimization remains pending.

The query logic correctly identifies stale sweeps, excludes dry runs, and uses the row_to_sweep helper for consistent mapping.

However, the index optimization issue raised in the previous review still applies here. The query filters by status, initiated_at, AND dry_run = false (line 209), but the existing idx_sweeps_status(status, initiated_at) index doesn't include dry_run. Consider updating the index to (status, dry_run, initiated_at) for optimal query performance.

🧹 Nitpick comments (10)
crates/basilica-payments/src/storage/reconciliation.rs (1)

166-179: Add composite index to optimize get_pending_sweep query performance.

The query filters by account_hex and status IN ('pending', 'submitted'), then orders by initiated_at DESC. Current indexes (idx_sweeps_account and idx_sweeps_status) don't provide optimal coverage. Consider adding a composite index on (account_hex, status, initiated_at) to eliminate in-memory filtering and sorting as the reconciliation_sweeps table grows.

crates/bittensor/src/connect/monitor.rs (1)

111-131: Remove unused parameter for clearer API.

The client parameter is explicitly unused (line 115) but still required by callers. Since the block object contains all necessary information, the client parameter should be removed to simplify the API and avoid confusion.

🔎 Proposed refactor to remove unused parameter
     /// Extract transfers from a block
     async fn get_transfers_from_block(
-        client: &OnlineClient<PolkadotConfig>,
         block: subxt::blocks::Block<PolkadotConfig, OnlineClient<PolkadotConfig>>,
     ) -> Result<Vec<TransferInfo>> {
-        let _ = client; // Used for type consistency, block already contains client ref
         let mut transfers = Vec::new();

Then update the two call sites at lines 93 and 107:

-        Self::get_transfers_from_block(&client, block).await
+        Self::get_transfers_from_block(block).await
crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (4)

20-20: Consider pinning the SGLang image version for reproducibility.

The default image uses the :latest tag, which can lead to non-deterministic deployments as the underlying image changes over time. Consider using a specific version tag (e.g., lmsysorg/sglang:v0.3.0) or documenting that users should override this default with a pinned version in production.


216-218: Document security implications of trust_remote_code flag.

The trust_remote_code flag allows model code execution from untrusted sources, which can pose security risks. Consider adding a warning in the CLI help text or documentation that users should only enable this for trusted models.


240-245: Consider making CPU/memory resources configurable.

The CPU and memory values are currently hardcoded (CPU: "4", memory_request: "8Gi"). For flexibility, consider exposing these as optional parameters in TemplateCommonOptions to allow users to tune resources based on their workload requirements.


260-260: Verify 4GB cache size is adequate for model files.

The cache is set to 4096 MB (4GB), which may be insufficient for larger models. For example, a 7B model typically requires 14-16GB. Consider increasing the default cache size or making it configurable based on the estimated model size.

crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (3)

20-20: Consider pinning the vLLM image version for reproducibility.

The default image uses the :latest tag, which can lead to non-deterministic deployments as the underlying image changes over time. Consider using a specific version tag (e.g., vllm/vllm-openai:v0.6.0) or documenting that users should override this default with a pinned version in production.


215-217: API key passed as command-line argument may be visible in logs.

The --api-key flag is passed as a command-line argument, which may be visible in process listings and logs. Consider using environment variables (e.g., VLLM_API_KEY) instead for better security.


227-229: Document security implications of trust_remote_code flag.

The trust_remote_code flag allows model code execution from untrusted sources, which can pose security risks. Consider adding a warning in the CLI help text or documentation that users should only enable this for trusted models.

crates/basilica-operator/src/controllers/user_deployment_controller.rs (1)

1984-2131: Strong test coverage for GPU model normalization and label generation.

The tests thoroughly verify:

  • GPU model normalization in node affinity (removing special characters, uppercasing)
  • Case-insensitive deduplication
  • Multi-model labels using underscore delimiter (fixing the previous comma issue)
  • JSON annotation with full GPU requirements
Optional: Add test for label vs node affinity consistency

Consider adding a test that uses GPU models requiring normalization (e.g., ["A100-40GB", "H100-80GB"]) to verify the label and node affinity values are consistent or intentionally different:

#[test]
fn test_gpu_labels_and_affinity_consistency() {
    use crate::crd::user_deployment::GpuSpec;
    
    let spec = UserDeploymentSpec::new(
        "user123".to_string(),
        "gpu-app".to_string(),
        "pytorch:latest".to_string(),
        1,
        8080,
        "/deployments/gpu-app".to_string(),
    )
    .with_resources(ResourceRequirements {
        cpu: "4000m".to_string(),
        memory: "16Gi".to_string(),
        gpus: Some(GpuSpec {
            count: 1,
            model: vec!["A100-40GB".to_string(), "H100-80GB".to_string()],
            min_cuda_version: None,
            min_gpu_memory_gb: None,
        }),
        cpu_request_ratio: 1.0,
    });

    let deployment = render_deployment("gpu-app", "u-user123", &spec, None).unwrap();
    let template = deployment.spec.unwrap().template;
    let template_meta = template.metadata.unwrap();
    let labels = template_meta.labels.unwrap();
    
    // Check if label matches normalized format or is intentionally different
    let gpu_model_label = labels.get("gpu-request.basilica.ai/model").unwrap();
    // Currently: "A100-40GB_H100-80GB" (raw)
    // Node affinity: ["A10040GB", "H10080GB"] (normalized)
    
    // Document the expected behavior
    assert!(gpu_model_label == "A100-40GB_H100-80GB" || gpu_model_label == "A10040GB_H10080GB");
}

This would document whether the inconsistency is intentional or should be fixed.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 57c656c and 219c4c3.

📒 Files selected for processing (9)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/common.rs (1 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/mod.rs (1 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (1 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (1 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (1 hunks)
  • crates/basilica-operator/src/controllers/user_deployment_controller.rs (7 hunks)
  • crates/basilica-payments/src/storage/reconciliation.rs (4 hunks)
  • crates/bittensor/src/connect/monitor.rs (3 hunks)
  • scripts/migrations/.gitignore (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • scripts/migrations/.gitignore
🧰 Additional context used
🧬 Code graph analysis (5)
crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (2)
crates/basilica-cli/src/cli/handlers/deploy/templates/common.rs (3)
  • create_with_retry (47-78)
  • parse_env_vars (28-44)
  • wait_for_ready (110-173)
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (1)
  • estimate_gpu_requirements (32-50)
crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (2)
crates/basilica-cli/src/cli/handlers/deploy/templates/common.rs (3)
  • create_with_retry (47-78)
  • parse_env_vars (28-44)
  • wait_for_ready (110-173)
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (1)
  • estimate_gpu_requirements (32-50)
crates/basilica-cli/src/cli/handlers/deploy/templates/common.rs (1)
crates/basilica-cli/src/progress/mod.rs (2)
  • complete_spinner_and_clear (137-139)
  • create_spinner (98-109)
crates/basilica-cli/src/cli/handlers/deploy/templates/mod.rs (3)
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (1)
  • estimate_gpu_requirements (32-50)
crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (1)
  • handle_sglang_deploy (29-150)
crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (1)
  • handle_vllm_deploy (29-150)
crates/basilica-operator/src/controllers/user_deployment_controller.rs (1)
crates/basilica-operator/src/labels.rs (1)
  • normalize_gpu_models (222-229)
🔇 Additional comments (38)
crates/basilica-payments/src/storage/reconciliation.rs (1)

195-198: Excellent refactoring to use the centralized helper!

This change successfully addresses the previous review feedback about duplicated status mapping logic. The code is now cleaner and more maintainable.

crates/bittensor/src/connect/monitor.rs (15)

13-15: LGTM!

The helper correctly identifies insecure endpoints for conditional client creation.


29-30: LGTM!

Wrapping clients in Arc<RwLock<...>> correctly enables thread-safe concurrent access and supports the new reconnect() capability.


36-44: LGTM!

The constructor properly delegates to helper methods and wraps clients in the appropriate Arc/RwLock containers.


46-54: LGTM!

The helper correctly handles both secure and insecure endpoints with appropriate subxt API calls and helpful debug logging.


56-64: LGTM!

Consistent with create_client, this helper properly handles endpoint security and constructs the RPC methods client.


66-80: Deadlock issue resolved!

The lock ordering has been corrected. Both reconnect() (lines 73-74) and get_transfers_at_block() (lines 99-100) now consistently acquire locks in the same order: client → rpc_methods. This eliminates the deadlock risk identified in the previous review. The implementation also correctly creates new clients before acquiring write locks, ensuring atomic updates only on success.
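
For illustration, the invariant looks like this sketch (types simplified to String placeholders; tokio assumed as the async runtime):

use std::sync::Arc;
use tokio::sync::RwLock;

// Every code path takes `client` before `rpc_methods`, so no two tasks can
// ever wait on each other's locks in opposite order.
struct MonitorSketch {
    client: Arc<RwLock<String>>,
    rpc_methods: Arc<RwLock<String>>,
}

impl MonitorSketch {
    async fn reconnect(&self, new_client: String, new_rpc: String) {
        // Order: client first, then rpc_methods.
        let mut client = self.client.write().await;
        let mut rpc = self.rpc_methods.write().await;
        *client = new_client;
        *rpc = new_rpc;
    }

    async fn read_both(&self) -> (String, String) {
        // Same order on the read path keeps the ordering invariant global.
        let client = self.client.read().await;
        let rpc = self.rpc_methods.read().await;
        (client.clone(), rpc.clone())
    }
}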


83-87: LGTM!

Correctly uses a read lock for this read-only operation.


90-94: LGTM!

Correct lock usage and proper delegation to the transfer extraction helper.


97-108: Consistent lock ordering maintained!

The explicit comment on line 98 and the implementation (lines 99-100) confirm that the lock ordering now matches reconnect(), preventing the previously identified deadlock hazard. The method correctly acquires read locks in the proper sequence.


134-179: LGTM!

The method properly handles both named and unnamed field representations from the subxt event system, with appropriate length validation and error handling via Option.


182-214: LGTM!

The polling logic correctly detects new blocks, retrieves transfers, and invokes the callback with proper error propagation.


216-219: LGTM!

Simple, correct accessor for the endpoint URL.


221-232: LGTM!

The pruning heuristic is logically sound: future blocks return false, and the age calculation correctly determines if retention has been exceeded.


236-260: LGTM!

The string-based parsing is inherently tied to subxt's Display format, but the 32-byte validation and test coverage (lines 266-292) provide adequate safeguards for extracting AccountId32 values.


262-293: LGTM!

The test validates the account hex parsing logic with appropriate assertions on length, format, and specific byte conversions.

crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (4)

29-150: LGTM: Well-structured deployment orchestration.

The deployment flow is comprehensive and handles various scenarios properly:

  • GPU auto-detection with user override
  • Environment variable parsing
  • Retry logic with backoff
  • Status polling with timeout handling
  • Both attached and detached modes

153-185: LGTM: RFC 1123 compliant name generation.

The function correctly sanitizes model names and generates valid deployment names by:

  • Extracting the model name without org prefix
  • Converting to lowercase and filtering non-compliant characters
  • Truncating to prevent excessively long names
  • Adding a UUID suffix for uniqueness
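
A hypothetical sketch of that kind of name generation (the uuid crate with the v4 feature is an assumed dependency; prefix handling, length limit, and suffix format here are illustrative only):

fn generate_deployment_name(model: &str, prefix: &str) -> String {
    // Drop the org prefix: "meta-llama/Llama-3-8B" -> "Llama-3-8B".
    let base = model.rsplit('/').next().unwrap_or(model);
    // Lowercase and replace anything RFC 1123 disallows with '-'.
    let sanitized: String = base
        .to_ascii_lowercase()
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '-' { c } else { '-' })
        .collect();
    // Truncate so the final name stays well under the 63-character limit.
    let truncated: String = sanitized.trim_matches('-').chars().take(30).collect();
    let truncated = truncated.trim_end_matches('-');
    // Short random suffix keeps names unique across repeated deploys.
    let suffix = uuid::Uuid::new_v4().simple().to_string();
    format!("{prefix}-{truncated}-{}", &suffix[..6])
}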

297-334: LGTM: Comprehensive success message with helpful examples.

The success message provides all necessary information for users to interact with their SGLang deployment, including:

  • Deployment status and endpoints
  • API endpoint documentation
  • Practical curl example
  • Management command references

267-294: Health check probe configuration aligns with standard SGLang practice.

The /health endpoint is widely documented and used across SGLang deployments: SGLang implements a GET /health endpoint for basic health checks, and the server starts at http://localhost:30000 by default. Multiple deployment guides confirm using readiness_probe: /health with SGLang, so the probe configuration in this code is correct.

crates/basilica-cli/src/cli/handlers/deploy/templates/mod.rs (1)

1-14: LGTM: Clean module organization with appropriate exports.

The module properly organizes deployment templates and exposes a clean public API through selective re-exports. The documentation clearly describes the module's purpose and supported frameworks.

crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (2)

29-150: LGTM: Consistent deployment orchestration with SGLang template.

The deployment flow follows the same robust pattern as the SGLang template, ensuring consistency across inference frameworks. The use of shared utilities from common.rs successfully eliminates code duplication.


307-346: LGTM: Comprehensive success message with OpenAI-compatible examples.

The success message effectively guides users on interacting with the vLLM deployment through OpenAI-compatible endpoints, with practical curl examples.

crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (7)

1-27: LGTM: Clean data structures and efficient regex initialization.

The module uses appropriate types and efficiently initializes the regex pattern with LazyLock. The GpuRequirements struct provides clear fields for GPU estimation results.


32-50: LGTM: Clear estimation logic with appropriate fallbacks.

The function implements a sensible strategy: first attempting to extract parameter counts from the model name, then falling back to model family heuristics.


66-71: LGTM: Reasonable memory estimation heuristic.

The 2GB per billion parameters with 20% overhead is a practical rule of thumb for FP16 inference. Rounding to the nearest 8GB provides cleaner resource allocation.
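
As a worked example of that arithmetic (a sketch; the exact rounding direction used by the CLI is an assumption based on the description above):

// ~2 GB per billion parameters for FP16 weights, plus 20% overhead,
// snapped to an 8 GB boundary.
fn estimate_memory_gb(params_billions: f64) -> u32 {
    let raw_gb = params_billions * 2.0 * 1.2;
    ((raw_gb / 8.0).round() * 8.0) as u32
}

// 7B params -> 16.8 GB raw -> 16 GB; 70B params -> 168 GB raw -> 168 GB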


74-110: LGTM: Comprehensive model family heuristics.

The function covers major model families with reasonable memory estimates. The default of 16GB for unknown models is a safe middle-ground choice.


113-116: LGTM: Correct GPU count calculation with reasonable limits.

The ceiling division ensures sufficient GPU memory, and the 1-8 GPU clamp prevents unrealistic resource requests.


118-126: LGTM: GPU recommendations now align with autoscaler offerings.

The function correctly returns canonical GPU model names ("A100", "H100") without vendor prefixes or memory suffixes, which will properly match the autoscaler's offering matcher after normalization. This addresses the previous review concern.


128-185: LGTM: Comprehensive test coverage.

The tests thoroughly verify:

  • Parameter extraction from various model name formats
  • Memory estimation calculations
  • GPU count logic including capping behavior
  • End-to-end estimation for real model names
  • GPU recommendation logic
crates/basilica-cli/src/cli/handlers/deploy/templates/common.rs (5)

14-25: LGTM: Well-defined constants and status enum.

The retry configuration values are reasonable, and the WaitResult enum provides clear deployment status outcomes. Boxing the DeploymentResponse in the Ready variant is a good optimization for enum size.


27-44: LGTM: Robust environment variable parsing.

The function correctly handles the KEY=VALUE format and properly uses splitn(2, '=') to allow equals signs in values. Error messages are clear and actionable.
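
In sketch form (error type simplified to String; the real function returns the CLI's error type):

// splitn(2, '=') keeps any additional '=' characters inside the value.
fn parse_env_vars(pairs: &[String]) -> Result<Vec<(String, String)>, String> {
    pairs
        .iter()
        .map(|pair| {
            let mut parts = pair.splitn(2, '=');
            match (parts.next(), parts.next()) {
                (Some(key), Some(value)) if !key.is_empty() => {
                    Ok((key.to_string(), value.to_string()))
                }
                _ => Err(format!("invalid env var '{pair}', expected KEY=VALUE")),
            }
        })
        .collect()
}

// "TOKEN=abc=def" -> ("TOKEN", "abc=def")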


80-101: LGTM: Comprehensive quota detection logic.

The functions properly identify quota-exceeded scenarios across different error types and status codes (403, 429), with appropriate message extraction for user feedback.


110-173: LGTM: Robust deployment status polling with phase-aware timing.

The function implements effective status monitoring with:

  • Timeout handling
  • Phase-based progress updates
  • Optimized polling intervals based on deployment phase
  • Proper terminal state detection (Active, Failed, Terminating)

175-188: LGTM: User-friendly phase messaging.

The function provides clear, actionable status messages for each deployment phase, significantly improving the user experience during deployment.

crates/basilica-operator/src/controllers/user_deployment_controller.rs (3)

918-922: Good addition to prevent autoscaler from provisioning nodes for stale pods.

Calling cleanup_stale_replica_sets after patching the deployment is a smart optimization. This ensures that old ReplicaSets with outdated pod specs (including stale GPU node affinity requirements) don't cause the autoscaler to provision nodes unnecessarily.

The best-effort approach (errors logged but reconciliation continues) is appropriate here.


1057-1139: Well-implemented cleanup logic for stale ReplicaSets.

The implementation is defensive and correct:

  • Only scales down old (non-current) ReplicaSets based on revision annotations
  • Only acts on RSs with 0 ready replicas but non-zero desired replicas
  • Handles errors gracefully with debug logging
  • Early returns minimize overhead on typical reconciliations

The approach of directly scaling RSs to 0 is safe because:

  1. Only old RSs are targeted (not the current revision)
  2. The Deployment controller expects old RSs to be scaled down during rollouts
  3. This just accelerates what Kubernetes would do eventually

535-559: Remove the incorrect autoscaler correlation concern; the label/annotation approach is intentional.

The review comment incorrectly assumes the autoscaler correlates pod labels (gpu-request.basilica.ai/model) with node labels. In reality, the autoscaler reads the JSON annotation (autoscaler.basilica.ai/gpu-requirements), not the pod label.

The pod label uses raw GPU model names (e.g., A100-40GB_H100-80GB) for human-readable pod identification, while node affinity uses the separate basilica.ai/gpu-model key with normalized model names (e.g., A10040GB_H10080GB) for selector matching. These serve different purposes and are intentionally separate—no inconsistency exists.

The 63-character label value limit is unlikely to be exceeded in practice with standard NVIDIA GPU names. If validation is desired, it can be added as an optional enhancement, but this is a minor concern.

Likely an incorrect or invalid review comment.

@epappas epappas merged commit b29987e into main Dec 19, 2025
17 checks passed
@epappas epappas deleted the feat/k3s-autoscaler-dynamic-offering branch December 19, 2025 12:33