
Conversation

@epappas
Member

@epappas epappas commented Dec 20, 2025

Summary by CodeRabbit

  • New Features

    • Richer CRD printer columns and improved model-size estimation defaults; agent join now surfaces node and GPU/driver metadata; new metric tracks forced deletions.
  • Bug Fixes

    • IP lookup now retries non-fatally when unavailable; rental stop treats not-found responses as success; deletions can be forced after a timeout to avoid stuck resources.
  • Improvements

    • GPU-aware scaling, idle-time scale-down, optimistic warm-pool scale-up, better phase-transition logging, and non-fatal GPU labeling behavior.
  • Documentation

    • New inference examples, quickstart, and reference guides.
  • Tests

    • Expanded GPU-detection and model-size tests.


…thods

* Add WARM_POOL_LABEL constant for consistent label key usage
* Add NodePool::is_warm_pool() method to check warm pool membership
* Add NodePoolPhase::is_provisioning() method for phase state checks
* Update CRD print columns to show GPU info, external IP, and warm status
* Move Mode/Provider/CUDA columns to priority 1 (hidden in normal view)
* Add K3sJoinResult struct with node_name, cuda_version, driver_version
* Update install_k3s_agent to return K3sJoinResult instead of String
* Export K3sJoinResult from provisioner module
* Parse CUDA and driver versions from node during K3s installation
* Use K3sJoinResult to populate cuda_version and driver_version in status
* Return Ok() instead of Err() when IP not yet available to prevent
  rapid retries that cause duplicate rentals before cache sync
* Add phase transition duration logging for better observability
* Log duration_ms for each phase transition including failures
* Add fallback to extract GPU models from nodeAffinity scheduling constraints
* Priority order: annotation > label > nodeAffinity for GPU model selection
* Add tests for nodeAffinity GPU extraction
* Add test to verify annotation takes priority over nodeAffinity
* Enables autoscaler to provision correct GPU type from pod affinity rules
* Use NodePool::is_warm_pool() and NodePoolPhase::is_provisioning() helpers
* Add try_idle_scale_down() to evaluate idle scale-down in all code paths
* Improve is_active_gpu_pod() to check pod phase and init containers
* Add optimistic locking for warm pool scale-up to prevent race conditions
* Treat NodePools without status as provisioning to prevent duplicate scale-ups
* Use WARM_POOL_LABEL constant instead of hardcoded string
* Add comprehensive tests for is_active_gpu_pod() behavior
* Simplify warm_pool.rs by using is_warm_pool() helper
@epappas epappas self-assigned this Dec 20, 2025
@coderabbitai
Contributor

coderabbitai bot commented Dec 20, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Introduce timeout-aware forced deletion with a metric tracking forced deletions; make missing-IP cases non-fatal for reconciles; persist GPU/offering and rental metadata earlier; return structured K3s join results with GPU/driver info; add warm-pool helpers and GPU-aware scaling; extend CRDs and CLI model-size heuristics; add inference examples and docs.

Changes

Cohort / File(s) Summary
Node pool controller — deletion, phase transitions, provisioning
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs
Add timeout-based forced cleanup using DELETION_FORCE_TIMEOUT, track deletion_started_at, conditional finalizer removal, record forced-deletions metric, make IP-not-yet-available a non-error for retry, persist GPU/offering/rental metadata earlier, populate status fields from K3sJoinResult, and improve phase transition logging.
Scaling policy controller — GPU-aware scaling & idle down
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs
Use WARM_POOL_LABEL / is_warm_pool(), add is_active_gpu_pod() and GPU-aware pending-pod counting, optimistic locking for pending scale-ups (try_increment_pending_scale_up), add try_idle_scale_down() and adjust provisioning detection to treat no-status pools as provisioning.
CRD: NodePool fields, helpers & re-exports
crates/basilica-autoscaler/src/crd/node_pool.rs, crates/basilica-autoscaler/src/crd/mod.rs
Add WARM_POOL_LABEL constant, implement NodePool::is_warm_pool() and NodePoolPhase::is_provisioning(), expand NodePoolSpec public fields, add deletion_started_at to NodePoolStatus, and re-export WARM_POOL_LABEL.
Provisioner: K3s join result and re-export
crates/basilica-autoscaler/src/provisioner/ssh.rs, crates/basilica-autoscaler/src/provisioner/mod.rs
Introduce public K3sJoinResult { node_name, cuda_version?, driver_version? }; change install_k3s_agent() to return Result<K3sJoinResult>; normalize "unknown" values to None; re-export K3sJoinResult.
Offering matcher — GPU model extraction priority
crates/basilica-autoscaler/src/offering_matcher.rs
Extend GPU-model extraction priority to: annotation → label → nodeAffinity fallback; normalize affinity-derived models and add tests covering priority and affinity cases.
Warm-pool helper usage
crates/basilica-autoscaler/src/warm_pool.rs
Replace inline warm-pool label checks with pool.is_warm_pool() helper; preserve warm-pool filtering and related logic.
Metrics — forced deletions
crates/basilica-autoscaler/src/metrics.rs
Add forced_deletions_total IntCounter, register it, include it in AutoscalerMetrics, and expose record_forced_deletion(&self, _pool: &str).
Config — timeout constant
crates/basilica-autoscaler/src/config.rs
Add public constant DELETION_FORCE_TIMEOUT: u64 = 3600 to PhaseTimeouts.
Secure cloud rental idempotency
crates/basilica-autoscaler/src/api/secure_cloud.rs
Treat HTTP 404 and 410 as successful outcomes in stop_rental (idempotent cleanup).
CRD printer columns (manifests)
orchestrator/k8s/services/autoscaler/crds.yaml
Add/reorder NodePool printer-columns: External-IP, GPU, GPUs, VRAM, Warm (label), Mode (priority 1), Provider (priority 1), CUDA (priority 1); add ScalingPolicy printer-columns: Enabled, Min, Max, Current, Age.
CLI model-size heuristic and tests
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs
Replace family heuristics with parameter-count-driven estimation, add DEFAULT_MEMORY_GB, bump per-device default memory, update regex for param suffixes, and adjust tests for new defaults.
Examples: inference docs & samples
examples/inference/README.md, examples/inference/quickstart.py, examples/inference/reference/README.md, examples/inference/vllm_qwen_small.py
Add inference quickstart, reference docs, and example deployment scripts demonstrating OpenAI-compatible endpoints and framework-specific configs.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay extra attention to:
    • node_pool_controller.rs: deletion timeout branching, finalizer removal conditions, and metric increment placement.
    • scaling_policy_controller.rs: correctness of optimistic locking and try_idle_scale_down() interactions.
    • provisioner/ssh.rs and all callers: ensure K3sJoinResult is handled everywhere former hostname string was used.
    • CRD/spec/status consistency between crd/node_pool.rs and crds.yaml.
    • offering_matcher.rs tests for nodeAffinity parsing and precedence.

Possibly related PRs

Suggested reviewers

  • distributedstatemachine

Poem

🐇 I nibbled logs and chased a dangling id,
Poked IPs gently — "try again" I did,
I kept GPU names safe in tidy rows,
Warm pools wake with locks and careful toes,
If deletion stalls, I hop — and nudge it closed.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title 'Fix/k3s autoscaler race condition' is clearly related to the main changes in the PR, which involve substantial refactoring of the node pool controller to handle timeout-based deletion, IP retry logic, GPU awareness, and optimistic locking in the scaling policy controller to prevent concurrent scale-up conflicts.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e1163ab and 5ea3f8c.

📒 Files selected for processing (1)
  • crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (17 hunks)


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
crates/basilica-autoscaler/src/provisioner/ssh.rs (1)

993-1010: Consider simplifying the "unknown" to None conversion.

The conversion logic is correct but could be slightly more concise. The intermediate cuda and driver variables shadow the earlier ones, which works but adds cognitive load.

🔎 Proposed simplification
-        // Return node info with detected versions
-        let cuda = if cuda_version == "unknown" {
-            None
-        } else {
-            Some(cuda_version)
-        };
-        let driver = if driver_version == "unknown" {
-            None
-        } else {
-            Some(driver_version)
-        };
-
-        Ok(K3sJoinResult {
-            node_name: hostname,
-            cuda_version: cuda,
-            driver_version: driver,
-        })
+        Ok(K3sJoinResult {
+            node_name: hostname,
+            cuda_version: (cuda_version != "unknown").then_some(cuda_version),
+            driver_version: (driver_version != "unknown").then_some(driver_version),
+        })
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)

1396-1447: Well-structured GPU pod activity detection.

The function correctly:

  • Excludes terminal phases (Succeeded, Failed)
  • Checks both containers and init_containers for GPU requests
  • Uses the standard nvidia.com/gpu resource key

Consider: GPU resources may also be specified in limits (though typically equal to requests). If pods specify only limits, they'd be missed.
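
A minimal sketch of how the check might look if it also consulted limits, using the k8s-openapi Pod types; helper names and the exact structure here are illustrative, not the PR's actual code:

    use k8s_openapi::api::core::v1::{Container, Pod};

    /// Illustrative check: a pod counts as an active GPU pod if it is not in a
    /// terminal phase and any container (or init container) asks for
    /// nvidia.com/gpu via either requests or limits.
    fn requests_gpu(container: &Container) -> bool {
        container
            .resources
            .as_ref()
            .map(|r| {
                r.requests
                    .as_ref()
                    .map(|m| m.contains_key("nvidia.com/gpu"))
                    .unwrap_or(false)
                    || r.limits
                        .as_ref()
                        .map(|m| m.contains_key("nvidia.com/gpu"))
                        .unwrap_or(false)
            })
            .unwrap_or(false)
    }

    fn is_active_gpu_pod(pod: &Pod) -> bool {
        let phase = pod
            .status
            .as_ref()
            .and_then(|s| s.phase.as_deref())
            .unwrap_or("");
        if phase == "Succeeded" || phase == "Failed" {
            return false;
        }
        pod.spec
            .as_ref()
            .map(|spec| {
                spec.containers.iter().any(requests_gpu)
                    || spec
                        .init_containers
                        .as_ref()
                        .map(|cs| cs.iter().any(requests_gpu))
                        .unwrap_or(false)
            })
            .unwrap_or(false)
    }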

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 52f15cb and fdb6812.

📒 Files selected for processing (9)
  • crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (6 hunks)
  • crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (15 hunks)
  • crates/basilica-autoscaler/src/crd/mod.rs (1 hunks)
  • crates/basilica-autoscaler/src/crd/node_pool.rs (4 hunks)
  • crates/basilica-autoscaler/src/offering_matcher.rs (3 hunks)
  • crates/basilica-autoscaler/src/provisioner/mod.rs (1 hunks)
  • crates/basilica-autoscaler/src/provisioner/ssh.rs (4 hunks)
  • crates/basilica-autoscaler/src/warm_pool.rs (1 hunks)
  • orchestrator/k8s/services/autoscaler/crds.yaml (1 hunks)
🔇 Additional comments (22)
crates/basilica-autoscaler/src/crd/mod.rs (1)

4-9: LGTM!

The re-export of WARM_POOL_LABEL follows the existing pattern and correctly expands the public API surface to support warm pool detection across the codebase.

crates/basilica-autoscaler/src/warm_pool.rs (1)

84-87: LGTM!

Replacing the inline label check with pool.is_warm_pool() improves code maintainability by encapsulating the warm-pool detection logic in a single location. This aligns with the DRY principle and makes future label changes easier to manage.

crates/basilica-autoscaler/src/provisioner/ssh.rs (1)

25-34: LGTM!

Well-structured result type that cleanly encapsulates node identity and GPU metadata. The optional fields for cuda_version and driver_version appropriately handle cases where NVIDIA drivers aren't present.

crates/basilica-autoscaler/src/provisioner/mod.rs (1)

5-7: LGTM!

The re-export of K3sJoinResult correctly exposes the new type for consumers of the provisioner module, maintaining a clean public API surface.

crates/basilica-autoscaler/src/offering_matcher.rs (4)

588-622: LGTM! Well-structured fallback chain for GPU model extraction.

The prioritized extraction order (annotation → label → nodeAffinity) ensures explicit user intent takes precedence while still supporting pods that only specify GPU requirements via scheduling constraints. The addition of .filter(|set: &BTreeSet<String>| !set.is_empty()) on line 609 correctly prevents empty label-based sets from blocking the affinity fallback.
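
A rough sketch of the priority chain described above. The helper names, the annotation key, and the stub bodies are assumptions for illustration; only the chain shape and the empty-set filtering mirror the review text:

    use k8s_openapi::api::core::v1::Pod;
    use std::collections::BTreeSet;

    fn normalize(model: &str) -> String {
        // Matches the normalization hinted at above, e.g. "A100-80GB" -> "A10080GB".
        model.trim().replace('-', "").to_uppercase()
    }

    fn models_from_annotation(pod: &Pod) -> BTreeSet<String> {
        pod.metadata
            .annotations
            .as_ref()
            .and_then(|a| a.get("basilica.ai/gpu-models")) // hypothetical annotation key
            .map(|v| v.split(',').map(normalize).collect())
            .unwrap_or_default()
    }

    fn models_from_label(_pod: &Pod) -> BTreeSet<String> {
        BTreeSet::new() // stub: would read a GPU-model label analogously
    }

    fn models_from_node_affinity(_pod: &Pod) -> BTreeSet<String> {
        BTreeSet::new() // stub: would walk the pod's nodeAffinity selector terms
    }

    fn gpu_models_for_pod(pod: &Pod) -> Option<BTreeSet<String>> {
        // Priority: annotation > label > nodeAffinity; an empty set at one
        // level must not block the fallback to the next.
        Some(models_from_annotation(pod))
            .filter(|s: &BTreeSet<String>| !s.is_empty())
            .or_else(|| Some(models_from_label(pod)).filter(|s| !s.is_empty()))
            .or_else(|| Some(models_from_node_affinity(pod)).filter(|s| !s.is_empty()))
    }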


1525-1578: LGTM!

The test helper test_pod_with_gpu_request_and_affinity is well-constructed, properly setting up both the GPU resource request and nodeAffinity constraints needed to exercise the new fallback path.


1580-1607: LGTM!

Good test coverage for the nodeAffinity fallback path. The tests verify both single-model extraction and multi-model scenarios with normalization (e.g., "A100-80GB" → "A10080GB").


1609-1674: LGTM!

Critical test case that validates the priority ordering. By setting up a pod with both an annotation (RTX4090) and conflicting nodeAffinity (H100), it confirms that the annotation correctly takes precedence.

crates/basilica-autoscaler/src/crd/node_pool.rs (4)

9-11: LGTM! Centralizing the warm pool label key.

Good practice to define this as a constant for consistent usage across the codebase.


23-31: Good observability enhancements for kubectl get nodepools.

The new print columns surface key runtime details. Priority 1 columns (Mode, Provider, CUDA) will appear with -o wide, keeping default output clean.


77-87: Clean helper for warm pool detection.

The implementation correctly handles missing labels/metadata and uses the centralized constant.


342-357: Comprehensive provisioning phase detection.

All intermediate phases from Pending through WaitingForNode are correctly included. This aligns with the race condition fix to prevent duplicate scale-ups while nodes are being provisioned.
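
A plausible shape for such a helper. Pending and WaitingForNode are named in the review; the middle variants are placeholders, so treat this as a sketch rather than the merged enum:

    pub enum NodePoolPhase {
        Pending,
        Provisioning, // placeholder variant
        Installing,   // placeholder variant
        WaitingForNode,
        Ready,
        Terminating,
    }

    impl NodePoolPhase {
        /// True while the pool is still being set up, so it counts against
        /// pending capacity and the autoscaler does not scale up twice.
        pub fn is_provisioning(&self) -> bool {
            matches!(
                self,
                NodePoolPhase::Pending
                    | NodePoolPhase::Provisioning
                    | NodePoolPhase::Installing
                    | NodePoolPhase::WaitingForNode
            )
        }
    }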

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (4)

120-134: Critical race condition fix: treating status-less NodePools as provisioning.

The comment at lines 121-123 and the unwrap_or(true) at line 132 are essential. Newly created NodePools may not have status set yet by the NodePoolController, and treating them as provisioning prevents duplicate scale-ups during this window.


464-489: Optimistic locking prevents warm pool scale-up race conditions.

The pattern correctly uses resource_version for conflict detection. Returning Ok(()) on conflict (line 488) allows the winning reconcile to proceed without triggering error backoff on losing reconciles.
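
A hedged sketch of what resourceVersion-based optimistic locking can look like with kube's Api. It assumes the crate's ScalingPolicy CRD type and a pending_scale_ups status field, both of which are assumptions here:

    use anyhow::Result;
    use kube::{api::PostParams, Api};

    async fn try_increment_pending_scale_up(
        api: &Api<ScalingPolicy>,
        name: &str,
    ) -> Result<bool> {
        let mut policy = api.get_status(name).await?;
        if let Some(status) = policy.status.as_mut() {
            status.pending_scale_ups += 1; // assumed status field
        }
        // replace_status carries the resourceVersion we just read; the API
        // server answers 409 Conflict if another reconcile updated the object
        // in between, which is exactly the race being guarded against.
        match api
            .replace_status(name, &PostParams::default(), serde_json::to_vec(&policy)?)
            .await
        {
            Ok(_) => Ok(true),
            Err(kube::Error::Api(ae)) if ae.code == 409 => Ok(false), // lost the race
            Err(e) => Err(e.into()),
        }
    }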


510-511: Verify local status synchronization after atomic K8s update.

The atomic try_increment_pending_scale_up already persists the incremented value to K8s. Line 511 updates the local status copy to stay consistent. Ensure that the subsequent update_scaling_policy_status at the end of reconcile (line 385) doesn't conflict or cause double-counting if the atomic update used a different resource version than the final status update.


1695-1753: Good test coverage for is_active_gpu_pod.

Tests cover the key scenarios: running/pending (active), succeeded/failed (inactive), no GPU (inactive), and init containers with GPU.

crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (4)

425-426: LGTM: Non-error retry path prevents tight error-backoff loop.

Returning Ok(()) when IP isn't yet available uses the success interval (~10s) for retry, giving the K8s cache time to sync with the persisted rental_id. This prevents the race condition described in the comments above.


462-470: Critical fix: prevents duplicate rentals by avoiding error backoff.

The comment block clearly explains the issue: error backoff retries within milliseconds, before the K8s cache reflects the persisted rental_id. Returning Ok(()) uses the longer success interval, allowing the cache to sync. On the next reconcile, line 404 will reuse the existing rental.
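
In kube-rs controllers the requeue behaviour depends on whether the reconcile returns Ok or Err, which is what this fix leans on. A schematic illustration only; the duration is a placeholder, not the PR's actual interval:

    use kube::runtime::controller::Action;
    use std::time::Duration;

    // When the rental exists but no external IP is reported yet, treat it as
    // "not ready yet" rather than an error, so the controller requeues on the
    // calmer success interval instead of the tight error backoff that caused
    // the duplicate-rental race described above.
    fn requeue_for_missing_ip() -> Action {
        Action::requeue(Duration::from_secs(10)) // ~success interval
    }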


730-744: Clean structured data flow from K3s join.

Using K3sJoinResult (from provisioner) provides explicit fields for node_name, cuda_version, and driver_version. This aligns with the new CRD print columns for runtime visibility.


1203-1222: Improved observability: phase transition duration logging.

Logging duration_ms, phase_from, and phase_to enables debugging slow provisioning phases and tracking node lifecycle performance.
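
For reference, structured duration fields like these are typically emitted with tracing; a minimal sketch, with the surrounding controller types omitted and field names following the review's wording:

    use std::time::Instant;

    fn log_phase_transition(pool: &str, from: &str, to: &str, started: Instant) {
        let duration_ms = started.elapsed().as_millis() as u64;
        tracing::info!(
            node_pool = pool,
            phase_from = from,
            phase_to = to,
            duration_ms,
            "node pool phase transition"
        );
    }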

orchestrator/k8s/services/autoscaler/crds.yaml (2)

559-574: Useful ScalingPolicy overview columns.

The columns expose key scaling configuration (Enabled, Min, Max) and runtime state (Current) for quick kubectl get scalingpolicies inspection.


22-51: NodePool print columns enhance operational visibility.

The new columns provide at-a-glance information:

  • External-IP, GPU, GPUs, VRAM for resource details
  • Warm for warm pool membership
  • Priority 1 columns (Mode, Provider, CUDA) for detailed debugging with -o wide

The Rust #[kube(printcolumn = ...)] attributes in node_pool.rs are synchronized with all YAML definitions—all 11 columns match in names, types, jsonPath values, and priorities.
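
For context, kube-derive expresses these columns as JSON strings on the CustomResource derive, and it is that JSON which has to stay in sync with crds.yaml. A representative (not verbatim) pair of columns, with spec field names and jsonPath values assumed:

    use kube::CustomResource;
    use schemars::JsonSchema;
    use serde::{Deserialize, Serialize};

    #[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
    #[serde(rename_all = "camelCase")]
    #[kube(group = "basilica.ai", version = "v1", kind = "NodePoolExample", namespaced)]
    #[kube(printcolumn = r#"{"name":"GPU","type":"string","jsonPath":".spec.gpuModel"}"#)]
    #[kube(printcolumn = r#"{"name":"CUDA","type":"string","jsonPath":".status.cudaVersion","priority":1}"#)]
    pub struct NodePoolExampleSpec {
        pub gpu_model: Option<String>,
    }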

* Format long printcolumn attributes in node_pool.rs
* Format multi-line expression in scaling_policy_controller.rs
* Format insert call in offering_matcher.rs tests
* Add check_known_moe_models() for models where param extraction fails
* DeepSeek-V3.1: 671B MoE requires 8x H100 (640GB)
* Qwen3-235B-A22B (Qwen3-Max): 235B MoE requires 4x H100 (320GB)
* MiniMax-M2: 230B MoE requires 4x H100 (320GB)
* Kimi-K2-Thinking: 1T MoE requires 2x H100 (160GB)
* Add tests for MoE flagship model GPU estimation
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)

762-815: Idle node identification doesn't match scale-down selection.

find_idle_nodes_for_scale_down at line 777 identifies specific idle nodes (returns their names), but scale_down at line 802 receives all non-warm-pool NodePools and selects the `count` newest Ready nodes based on its internal logic at lines 1127-1145. This means the nodes actually removed may not be the ones identified as idle.

Consider passing the identified idle node names to scale_down or filtering non_warm_pools to only include the idle nodes:

🔎 Proposed fix to ensure idle nodes are removed
        let non_warm_pools: Vec<_> = node_pools
            .iter()
            .filter(|p| !p.is_warm_pool())
+           .filter(|p| {
+               p.metadata.name.as_ref()
+                   .map(|n| idle_nodes.contains(n))
+                   .unwrap_or(false)
+           })
            .cloned()
            .collect();
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fdb6812 and a93d92a.

📒 Files selected for processing (4)
  • crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (15 hunks)
  • crates/basilica-autoscaler/src/crd/node_pool.rs (4 hunks)
  • crates/basilica-autoscaler/src/offering_matcher.rs (3 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (3 hunks)
🔇 Additional comments (19)
crates/basilica-autoscaler/src/offering_matcher.rs (4)

588-622: LGTM! Well-implemented fallback hierarchy.

The three-tier priority for GPU model extraction is cleanly implemented with proper filtering at each level. The addition of nodeAffinity as a third fallback ensures pods with only scheduling constraints (no explicit annotations/labels) can still be provisioned correctly.

Key strengths:

  • Lazy evaluation prevents unnecessary work
  • Empty results properly filtered at each priority level
  • Consistent normalization applied to all sources
  • Idiomatic use of or_else chains and then for Option conversion

1524-1581: LGTM! Comprehensive test helper.

The new test helper properly constructs pods with both GPU resource requests and nodeAffinity constraints, enabling thorough testing of the new fallback logic.


1583-1610: LGTM! Thorough coverage of nodeAffinity extraction.

The tests verify both single and multiple GPU model extraction from nodeAffinity, including proper normalization (e.g., "A100-80GB" → "A10080GB"). Good validation of the new fallback path.


1612-1677: LGTM! Priority order correctly validated.

Excellent test that verifies annotation takes precedence over nodeAffinity when both are present. The conflicting values (RTX4090 in annotation vs H100 in affinity) clearly demonstrate the intended priority hierarchy.

crates/basilica-autoscaler/src/crd/node_pool.rs (4)

9-11: LGTM: Centralized constant for warm pool labeling.

The constant provides a single source of truth for the warm pool label, improving maintainability.


21-37: LGTM: Enhanced CRD printer columns for better visibility.

The additional columns improve operational visibility when inspecting NodePools via kubectl. The JSONPath expressions are correctly formatted, including proper escaping for the Warm column label.


83-93: LGTM: Clean helper method for warm pool detection.

The implementation correctly handles missing labels and metadata gracefully, returning false by default.


348-363: LGTM: Well-designed helper for provisioning state detection.

The method cleanly encapsulates all provisioning phases and provides a clear API for callers. The use of matches! is idiomatic and efficient.

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (10)

14-14: LGTM: Import of centralized constant.

Using the constant from the CRD module ensures consistency across the codebase.


120-134: LGTM: Defensive provisioning counter prevents race conditions.

Treating NodePools without status as provisioning is a smart safeguard that prevents duplicate scale-ups before the NodePoolController has initialized the phase.


173-176: LGTM: Idle scale-down runs even when scale-up is blocked.

The design correctly decouples idle-time scale-down from scale-up logic, preventing cost waste from idle nodes regardless of scale-up status.

Also applies to: 277-279


368-380: LGTM: Idle scale-down integrated with NoAction path.

The logic correctly sets the Stable condition only when no scale-down action was taken, avoiding conflicting status updates.


464-489: LGTM: Optimistic locking prevents warm pool scale-up races.

The implementation correctly uses optimistic locking to prevent concurrent reconciles from creating duplicate warm pool nodes, consistent with the regular scale-up pattern.


537-541: LGTM: Consistent use of is_warm_pool() helper.

Replacing inline label checks with the centralized helper improves maintainability and reduces code duplication.

Also applies to: 1341-1341, 1367-1367


1394-1445: LGTM: Robust active GPU pod detection.

The implementation correctly identifies active GPU workloads by checking both pod phase and GPU resource requests in containers and init containers. The logic properly excludes terminal states (Succeeded/Failed).


1453-1461: LGTM: Use of is_provisioning() helper.

The refactoring improves code clarity and consistency by using the centralized helper method.


1517-1517: LGTM: Use of WARM_POOL_LABEL constant.

Using the constant ensures consistency with the CRD label definition and improves maintainability.


1627-1751: LGTM: Comprehensive test coverage for GPU pod detection.

The test helpers are well-designed, and the test cases cover all relevant scenarios including edge cases like init containers and terminal pod states.

crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (1)

35-42: LGTM! Clean priority chain for model detection.

The logic correctly prioritizes known MoE/flagship models before attempting parameter extraction, with appropriate fallbacks. The flow is clear and maintainable.

* Add DELETION_FORCE_TIMEOUT (1 hour) for stuck deletions
* Add deletion_started_at to NodePoolStatus for timeout tracking
* Add forced_deletions_total metric to track potential orphaned VMs
* Force finalizer removal after timeout in handle_deletion and terminating
* Treat 404/410 as success in stop_rental for idempotent cleanup
* Make stop_rental critical in GC and Deleted cleanup (return error to retry)
* Add node_pool_matches_requirements() to check GPU compatibility
* Add count_serviceable_pending_pods() to count pods that idle nodes can serve
* Skip scale-up when idle nodes can serve all pending pods based on GPU model/count/memory
* Pass pending_pods and node_pools to evaluate_scaling for GPU-aware decisions
* Remove hardcoded MoE model exceptions (users can override with CLI flags)
* Remove model family heuristics in favor of parameter extraction
* Use 80GB default GPU memory (A100/H100)
* Use 16GB default memory when param count cannot be extracted
* Simplify regex to handle both lowercase and uppercase 'B' suffix
* Update tests to reflect simplified behavior
* Add quickstart.py with @deployment decorator usage
* Add vllm_qwen_small.py for verified small model deployment
* Add README.md with configuration reference and model sizing guide
* Add reference/README.md with vLLM and SGLang multi-GPU examples
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/basilica-autoscaler/src/api/secure_cloud.rs (1)

777-810: Align deregister_node with stop_rental's idempotent deletion pattern.

stop_rental treats 404/410 as successful idempotent cleanup (lines 651–658), but deregister_node returns errors for all non-success statuses. Since deregister_node is called from 6 distinct locations in node_pool_controller.rs during various lifecycle operations, concurrent or retried calls could fail unnecessarily. Apply the same 404/410 → success pattern to ensure consistency and resilience.
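
A minimal sketch of the idempotent-deletion pattern the reviewer wants mirrored in deregister_node, written reqwest-style; the URL handling and error type are illustrative:

    use reqwest::StatusCode;

    // A DELETE that treats "already gone" as success, so concurrent or retried
    // cleanup calls do not surface spurious errors.
    async fn stop_rental_idempotent(client: &reqwest::Client, url: &str) -> anyhow::Result<()> {
        let resp = client.delete(url).send().await?;
        let status = resp.status();
        if status == StatusCode::NOT_FOUND || status == StatusCode::GONE {
            // Rental already removed (e.g. by a competing reconcile): idempotent success.
            return Ok(());
        }
        if !status.is_success() {
            anyhow::bail!("stop_rental failed with status {status}");
        }
        Ok(())
    }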

♻️ Duplicate comments (1)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)

800-825: Potential mismatch: idle node identification vs. actual scale-down selection.

find_idle_nodes_for_scale_down identifies specific idle nodes (line 800), but lines 819-823 pass all non-warm-pool NodePools to scale_down. The scale_down method (lines 1148-1182) selects the `count` newest Ready nodes from the provided list, which may not be the same nodes identified as idle.

🔎 Recommended fix to ensure idle nodes are actually removed
-        let non_warm_pools: Vec<_> = node_pools
+        // Filter to only the idle nodes identified for removal
+        let idle_pools: Vec<_> = node_pools
             .iter()
-            .filter(|p| !p.is_warm_pool())
+            .filter(|p| {
+                p.metadata.name.as_ref()
+                    .map(|n| idle_nodes.contains(n))
+                    .unwrap_or(false)
+            })
             .cloned()
             .collect();
 
-        self.scale_down(ns, &non_warm_pools, count).await?;
+        self.scale_down(ns, &idle_pools, count).await?;

This ensures the nodes actually removed are the ones identified as idle based on your timeout criteria.

🧹 Nitpick comments (9)
examples/inference/quickstart.py (2)

14-22: Consider adding timeout parameter for consistency.

The related example vllm_qwen_small.py includes timeout=600 in the deployment decorator. Adding a timeout would provide consistent behavior and better user experience if the deployment takes longer than expected.

🔎 Suggested addition
 @deployment(
     name="my-llm",
     image="vllm/vllm-openai:latest",
     gpu="A100",
     gpu_count=1,
     memory="40Gi",
     port=8000,
     ttl_seconds=3600,
+    timeout=600,
 )

34-39: Remove unnecessary f-string prefixes.

Lines 34, 36, 38, and 39 use f-string prefixes but contain no placeholders. Only lines 35 and 37 actually interpolate values.

🔎 Proposed fix
     result = serve()
-    print(f"\nDeployment ready!")
+    print("\nDeployment ready!")
     print(f"  Endpoint: {result.url}/v1/chat/completions")
-    print(f"\nTest with:")
+    print("\nTest with:")
     print(f'  curl {result.url}/v1/chat/completions \\')
-    print(f'    -H "Content-Type: application/json" \\')
-    print(f'    -d \'{{"model": "Qwen/Qwen2.5-0.5B-Instruct", "messages": [{{"role": "user", "content": "Hello!"}}]}}\'')
+    print('    -H "Content-Type: application/json" \\')
+    print('    -d \'{"model": "Qwen/Qwen2.5-0.5B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}\'')
examples/inference/vllm_qwen_small.py (2)

9-15: Clarify GPU requirements in documentation.

The docstring states the model requires only ~2GB GPU RAM but then specifies "1x A100 GPU (80GB VRAM)" as a requirement. This appears inconsistent. If the model truly only needs 2GB, consider:

  • Clarifying that A100 is suggested for testing but not strictly required
  • Mentioning that smaller GPUs could work
  • Or explaining why the full A100 is specified despite low memory needs

58-58: Remove unnecessary f-string prefix.

Line 58 uses an f-string prefix but contains no placeholders.

🔎 Proposed fix
-    print(f"vLLM Qwen2.5-0.5B deployment ready!")
+    print("vLLM Qwen2.5-0.5B deployment ready!")
examples/inference/reference/README.md (1)

89-107: Add blank lines around tables for proper markdown formatting.

Markdown tables should be surrounded by blank lines. Add a blank line before the vLLM table (line 92) and the SGLang table (line 101).

🔎 Proposed fix
 ## Framework Options
 
 ### vLLM
+
 | Option | Description |
 |--------|-------------|
 | `--tensor-parallel-size` | Split model across GPUs |
 | `--max-model-len` | Maximum sequence length |
 | `--gpu-memory-utilization` | GPU memory fraction (0.0-1.0) |
 | `--quantization` | Quantization method (awq, gptq) |
 | `--trust-remote-code` | Trust model's custom code |
 
 ### SGLang
+
 | Option | Description |
 |--------|-------------|
 | `--tp` | Tensor parallelism size |
 | `--mem-fraction-static` | Static memory fraction |
 | `--context-length` | Maximum context length |
 | `--quantization` | Quantization method (fp8) |
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (3)

35-48: Consider adding special handling for quantized and MoE models.

The current heuristic (params × 2 × 1.2) assumes FP16 weights, but quantized models (Q8, Q4, etc.) require significantly less memory, while MoE architectures may need more due to activated expert subsets. This could lead to over-provisioning for quantized models and under-provisioning for MoE models.

Consider parsing quantization hints from model names (e.g., "Q8", "Q4") and adjusting the multiplier accordingly, or documenting that users should override defaults for these model types using CLI flags.


68-72: GPU count capped at 8 may under-provision very large models.

The clamp at 8 GPUs will cap models requiring >640GB (e.g., models needing 16+ GPUs). While users can override via CLI flags, consider adding a log warning or documentation note for models that hit this ceiling.

Optional: Add warning for capped GPU counts
 fn calculate_gpu_count(required_memory_gb: u32, gpu_memory_gb: u32) -> u32 {
     let count = (required_memory_gb as f32 / gpu_memory_gb as f32).ceil() as u32;
-    count.clamp(1, 8)
+    let clamped = count.clamp(1, 8);
+    if count > 8 {
+        eprintln!(
+            "Warning: Model requires {} GPUs but capped at 8. Use --gpu flag to override.",
+            count
+        );
+    }
+    clamped
 }

96-106: Clarify the rounding behavior in the comment.

The comment on Line 98 states "truncated then rounded up to 8" which is slightly misleading. The actual behavior is: convert to u32 (truncating the decimal), then round up to the nearest multiple of 8 using div_ceil(8) * 8.

Clearer comment
-        // 7B: 7 * 2 * 1.2 = 16.8 -> 16GB (truncated then rounded up to 8)
+        // 7B: 7 * 2 * 1.2 = 16.8 -> 16 (as u32) -> 16 (rounded to nearest 8GB)
         assert_eq!(estimate_memory_from_params(7.0), 16);
-        // 0.5B: 0.5 * 2 * 1.2 = 1.2 -> 8GB
+        // 0.5B: 0.5 * 2 * 1.2 = 1.2 -> 1 (as u32) -> 8 (rounded to nearest 8GB)
         assert_eq!(estimate_memory_from_params(0.5), 8);
-        // 70B: 70 * 2 * 1.2 = 168 -> 168GB
+        // 70B: 70 * 2 * 1.2 = 168 -> 168 (already multiple of 8)
         assert_eq!(estimate_memory_from_params(70.0), 168);
-        // 235B: 235 * 2 * 1.2 = 564 -> 568GB
+        // 235B: 235 * 2 * 1.2 = 564 -> 564 (as u32) -> 568 (rounded to nearest 8GB)
         assert_eq!(estimate_memory_from_params(235.0), 568);
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1)

209-274: Add operational guardrails for forced deletion scenarios.

The timeout-based forced deletion logic is sound—it prevents stuck CustomResources by attempting critical cleanup (rental stop) but proceeding with finalizer removal after 1 hour even on failure. This accepts the risk of orphaned VMs when rental APIs are unavailable. The record_forced_deletion metric is properly called in both deletion paths and DELETION_FORCE_TIMEOUT is consistently enforced.

To operationalize this pattern, implement:

  • Alerts on record_forced_deletion metric increments to detect forced deletions in production
  • Periodic reconciliation job to detect and clean up orphaned VMs by cross-referencing inactive rentals
  • Runbook documenting manual recovery procedures when forced deletions occur
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a93d92a and e1163ab.

📒 Files selected for processing (11)
  • crates/basilica-autoscaler/src/api/secure_cloud.rs (1 hunks)
  • crates/basilica-autoscaler/src/config.rs (1 hunks)
  • crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (10 hunks)
  • crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (17 hunks)
  • crates/basilica-autoscaler/src/crd/node_pool.rs (5 hunks)
  • crates/basilica-autoscaler/src/metrics.rs (5 hunks)
  • crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (4 hunks)
  • examples/inference/README.md (1 hunks)
  • examples/inference/quickstart.py (1 hunks)
  • examples/inference/reference/README.md (1 hunks)
  • examples/inference/vllm_qwen_small.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-13T15:34:46.544Z
Learnt from: epappas
Repo: one-covenant/basilica PR: 184
File: crates/basilica-validator/src/rental/billing.rs:129-130
Timestamp: 2025-10-13T15:34:46.544Z
Learning: In the basilica-validator crate, rental_id is a String type (not UUID) throughout the system - in RentalInfo struct, database rentals.id column, and all rental operations. The persistence layer methods should accept &str for rental_id parameters, not &Uuid, since they immediately convert to String for SQL binding anyway.

Applied to files:

  • crates/basilica-autoscaler/src/api/secure_cloud.rs
🧬 Code graph analysis (2)
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (2)
crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (1)
  • model (154-169)
crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (1)
  • model (155-170)
examples/inference/quickstart.py (2)
examples/inference/vllm_qwen_small.py (1)
  • serve (37-52)
crates/basilica-sdk-python/python/basilica/deployment.py (1)
  • url (183-191)
🪛 markdownlint-cli2 (0.18.1)
examples/inference/reference/README.md

92-92: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)


101-101: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)

🪛 Ruff (0.14.8)
examples/inference/quickstart.py

25-29: Starting a process with a partial executable path

(S607)


27-27: Possible binding to all interfaces

(S104)


34-34: f-string without any placeholders

Remove extraneous f prefix

(F541)


36-36: f-string without any placeholders

Remove extraneous f prefix

(F541)


38-38: f-string without any placeholders

Remove extraneous f prefix

(F541)


39-39: f-string without any placeholders

Remove extraneous f prefix

(F541)

examples/inference/vllm_qwen_small.py

46-46: Possible binding to all interfaces

(S104)


52-52: subprocess call: check for execution of untrusted input

(S603)


58-58: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (25)
crates/basilica-autoscaler/src/api/secure_cloud.rs (1)

649-658: Excellent idempotency fix for race condition!

Treating 404/410 as success correctly handles the case where multiple tasks attempt to delete the same rental. The early status capture and explicit check before is_success() ensures clean, idempotent deletion semantics. The circuit breaker will appropriately record these as successes rather than failures.

examples/inference/vllm_qwen_small.py (1)

24-36: LGTM!

The deployment configuration is well-structured with appropriate timeout and environment variables for optimizing HuggingFace downloads.

examples/inference/README.md (2)

45-51: Verify memory requirements for small models.

Line 47 states that models <1B require 8GB of memory, but the vllm_qwen_small.py example documents that the Qwen2.5-0.5B model (500M parameters) requires only ~2GB GPU RAM. Consider updating the table to reflect more accurate memory requirements, perhaps "2-8GB" to account for variation.


56-80: LGTM!

The framework examples and management commands are well-documented and align with the example scripts in the repository.

examples/inference/reference/README.md (2)

8-44: LGTM!

The vLLM reference examples are well-structured and demonstrate appropriate resource configurations for both single-GPU and multi-GPU deployments.


108-128: LGTM!

The gated models and volume caching examples provide clear guidance for advanced deployment scenarios.

crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (5)

26-29: LGTM: Regex pattern correctly handles case-insensitive parameter extraction.

The updated pattern now matches both lowercase and uppercase 'B' suffixes, supporting patterns like "7b", "235B", and decimal values like "0.5b".


50-56: LGTM: Parameter extraction logic is clean and correct.


58-66: LGTM: Memory estimation formula is reasonable for FP16 models.

The 2GB per billion parameters with 20% overhead is a sensible heuristic, and rounding to 8GB increments aligns with typical GPU memory allocation patterns.
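
The heuristic as described (2 GB per billion parameters for FP16 weights, 20% overhead, rounded up to the next 8 GB) can be written down directly. A sketch consistent with the test expectations quoted earlier, assuming f32 arithmetic and a floor guard that is my own addition:

    /// Rough FP16 memory estimate: 2 GB per billion parameters plus 20%
    /// overhead, truncated to u32 and rounded up to a multiple of 8 GB.
    fn estimate_memory_from_params(params_billions: f32) -> u32 {
        let raw_gb = (params_billions * 2.0 * 1.2) as u32; // truncates the decimal
        raw_gb.max(1).div_ceil(8) * 8 // the max(1) floor is an assumption
    }

    fn main() {
        assert_eq!(estimate_memory_from_params(7.0), 16);
        assert_eq!(estimate_memory_from_params(0.5), 8);
        assert_eq!(estimate_memory_from_params(70.0), 168);
        assert_eq!(estimate_memory_from_params(235.0), 568);
    }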


74-81: LGTM: GPU recommendation logic is sensible for common use cases.

The threshold at 80GB appropriately distinguishes between A100 and H100 requirements.


118-147: LGTM: Comprehensive test coverage for the updated estimation logic.

Tests properly verify:

  • Small models (0.5B) with 1 GPU
  • Medium models (7B) with A100 recommendation
  • Large models (70B, 235B) with H100 recommendation and appropriate GPU counts
  • Unknown models falling back to DEFAULT_MEMORY_GB
  • Case-insensitive parameter extraction (uppercase B)
crates/basilica-autoscaler/src/config.rs (1)

229-232: LGTM! Timeout constant is well-documented.

The 1-hour timeout provides a reasonable balance between allowing transient API issues to resolve and preventing indefinite blocking. The documentation clearly explains the purpose and risk mitigation strategy.

crates/basilica-autoscaler/src/metrics.rs (1)

34-34: LGTM! Forced deletions metric properly implemented.

The new metric is correctly initialized, registered with Prometheus, and includes a clear help text warning about potential orphaned VMs. The recording method follows the existing pattern.

Also applies to: 170-174, 361-363, 434-434, 504-506
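
The metric wiring is standard prometheus-crate boilerplate; a hedged sketch of the relevant slice, with the struct simplified relative to the real AutoscalerMetrics:

    use prometheus::{IntCounter, Registry};

    pub struct AutoscalerMetricsSketch {
        pub forced_deletions_total: IntCounter,
    }

    impl AutoscalerMetricsSketch {
        pub fn new(registry: &Registry) -> prometheus::Result<Self> {
            let forced_deletions_total = IntCounter::new(
                "forced_deletions_total",
                "NodePool deletions forced after DELETION_FORCE_TIMEOUT; may indicate orphaned VMs",
            )?;
            registry.register(Box::new(forced_deletions_total.clone()))?;
            Ok(Self { forced_deletions_total })
        }

        pub fn record_forced_deletion(&self, _pool: &str) {
            // Pool name is unused for now (mirroring the `_pool` parameter in
            // the summary above) but kept for future labelling.
            self.forced_deletions_total.inc();
        }
    }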

crates/basilica-autoscaler/src/crd/node_pool.rs (3)

9-10: LGTM! Clean abstraction for warm pool detection.

The WARM_POOL_LABEL constant and is_warm_pool() helper replace magic strings with a clear, reusable abstraction used consistently across the codebase.

Also applies to: 83-93


348-363: LGTM! Comprehensive provisioning phase detection.

The is_provisioning() method correctly identifies all in-progress phases and provides a clear way to count NodePools being set up. This prevents duplicate scale-up decisions.


517-522: LGTM! Deletion timeout tracking support.

The deletion_started_at field enables timeout-based forced cleanup to prevent indefinitely stuck deletions. The documentation clearly explains its purpose.

crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (3)

458-460: IP-not-available handling change is correct but requires careful monitoring.

Changing from Err to Ok() when IP is not yet available prevents duplicate rentals by using success_interval (~10s) instead of error backoff (milliseconds), giving the K8s cache time to sync with the persisted rental_id.

This is a critical race condition fix. Monitor that:

  1. Rental IDs are persisted before returning Ok()
  2. The success_interval is long enough for cache sync
  3. No duplicate rentals are created

The comments explain the rationale well, but consider adding integration tests to verify this behavior under concurrent reconciliation.

Also applies to: 502-503


763-778: LGTM! GPU metadata captured from join result.

The code correctly captures node_name, cuda_version, and driver_version from the K3s join result, providing better observability and enabling GPU-aware scheduling decisions.


1271-1294: LGTM! Phase transition duration logging improves observability.

The added duration logging in phase transitions provides valuable debugging information and helps identify bottlenecks in the provisioning pipeline.

Also applies to: 1303-1329

crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (6)

121-134: LGTM! Correct handling of newly-created NodePools.

Treating NodePools without status as provisioning prevents race conditions where scale-up happens before the NodePoolController sets the initial phase. This is a defensive programming practice.


470-496: LGTM! Optimistic locking prevents duplicate warm pool scale-ups.

The optimistic locking using try_increment_pending_scale_up with resource version correctly prevents race conditions between concurrent reconciles. The early return on conflict is appropriate.


713-730: LGTM! GPU-aware idle node matching prevents unnecessary scale-up.

The serviceable count check ensures that idle nodes can actually serve pending pods based on GPU model, count, and memory requirements. This prevents provisioning new nodes when existing idle nodes can handle the workload.


1369-1401: LGTM! Comprehensive GPU requirement matching.

The node_pool_matches_requirements method correctly validates GPU count, memory, and model requirements. The normalized model matching with prefix support handles model variations well.
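
A sketch of the kind of matching being described; the requirement and pool shapes, and the exact prefix rule, are assumptions drawn from the review text rather than the merged code:

    struct PodGpuRequirements {
        models: Vec<String>, // requested GPU models, e.g. "A100", "H100"
        count: u32,
        min_memory_gb: u32,
    }

    struct PoolGpuSpec {
        model: String, // pool's GPU model
        count: u32,
        memory_gb: u32,
    }

    fn normalize(model: &str) -> String {
        model.trim().replace('-', "").to_uppercase()
    }

    fn node_pool_matches_requirements(pool: &PoolGpuSpec, req: &PodGpuRequirements) -> bool {
        let enough_gpus = pool.count >= req.count;
        let enough_memory = pool.memory_gb >= req.min_memory_gb;
        // Prefix matching lets an "A100" requirement accept an "A10080GB" pool.
        let model_ok = req.models.is_empty()
            || req
                .models
                .iter()
                .any(|m| normalize(&pool.model).starts_with(&normalize(m)));
        enough_gpus && enough_memory && model_ok
    }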


1482-1533: LGTM! Comprehensive active GPU pod detection.

The is_active_gpu_pod method correctly:

  • Excludes terminal phase pods (Succeeded/Failed)
  • Checks both regular containers and init containers for GPU requests
  • Prevents premature scale-down of nodes running GPU workloads

1727-1839: LGTM! Thorough test coverage for is_active_gpu_pod.

The test cases cover all critical scenarios:

  • Running/Pending pods with GPU (active)
  • Succeeded/Failed pods with GPU (inactive)
  • Running pods without GPU (inactive)
  • Init containers with GPU (active)

* Use function reference instead of closure in filter_map
@epappas epappas merged commit 8513c67 into main Dec 21, 2025
11 of 12 checks passed
@epappas epappas deleted the fix/k3s-autoscaler-race-condition branch December 21, 2025 02:04
