Fix/k3s autoscaler race condition #302
Conversation
…thods
* Add WARM_POOL_LABEL constant for consistent label key usage
* Add NodePool::is_warm_pool() method to check warm pool membership
* Add NodePoolPhase::is_provisioning() method for phase state checks
* Update CRD print columns to show GPU info, external IP, and warm status
* Move Mode/Provider/CUDA columns to priority 1 (hidden in normal view)
* Add K3sJoinResult struct with node_name, cuda_version, driver_version
* Update install_k3s_agent to return K3sJoinResult instead of String
* Export K3sJoinResult from provisioner module
* Parse CUDA and driver versions from node during K3s installation
* Use K3sJoinResult to populate cuda_version and driver_version in status
* Return Ok() instead of Err() when IP not yet available to prevent rapid retries that cause duplicate rentals before cache sync
* Add phase transition duration logging for better observability
* Log duration_ms for each phase transition including failures
* Add fallback to extract GPU models from nodeAffinity scheduling constraints
* Priority order: annotation > label > nodeAffinity for GPU model selection
* Add tests for nodeAffinity GPU extraction
* Add test to verify annotation takes priority over nodeAffinity
* Enables autoscaler to provision correct GPU type from pod affinity rules
* Use NodePool::is_warm_pool() and NodePoolPhase::is_provisioning() helpers
* Add try_idle_scale_down() to evaluate idle scale-down in all code paths
* Improve is_active_gpu_pod() to check pod phase and init containers
* Add optimistic locking for warm pool scale-up to prevent race conditions
* Treat NodePools without status as provisioning to prevent duplicate scale-ups
* Use WARM_POOL_LABEL constant instead of hardcoded string
* Add comprehensive tests for is_active_gpu_pod() behavior
* Simplify warm_pool.rs by using is_warm_pool() helper
Caution: Review failed. The pull request is closed.
Walkthrough
Introduce timeout-aware forced deletion and a metric for forced deletions; make IP-missing cases non-fatal for reconciles; persist GPU/offering and rental metadata earlier; return structured K3s join results with GPU/driver info; add warm-pool helpers and GPU-aware scaling; extend CRDs and CLI model-size heuristics; add inference examples and docs.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
📜 Recent review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
Actionable comments posted: 1
🧹 Nitpick comments (2)
crates/basilica-autoscaler/src/provisioner/ssh.rs (1)
993-1010: Consider simplifying the "unknown" to `None` conversion. The conversion logic is correct but could be slightly more concise. The intermediate `cuda` and `driver` variables shadow the earlier ones, which works but adds cognitive load.
🔎 Proposed simplification
```diff
-    // Return node info with detected versions
-    let cuda = if cuda_version == "unknown" {
-        None
-    } else {
-        Some(cuda_version)
-    };
-    let driver = if driver_version == "unknown" {
-        None
-    } else {
-        Some(driver_version)
-    };
-
-    Ok(K3sJoinResult {
-        node_name: hostname,
-        cuda_version: cuda,
-        driver_version: driver,
-    })
+    Ok(K3sJoinResult {
+        node_name: hostname,
+        cuda_version: (cuda_version != "unknown").then_some(cuda_version),
+        driver_version: (driver_version != "unknown").then_some(driver_version),
+    })
```
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)
1396-1447: Well-structured GPU pod activity detection. The function correctly:
- Excludes terminal phases (`Succeeded`, `Failed`)
- Checks both containers and init_containers for GPU requests
- Uses the standard `nvidia.com/gpu` resource key
Consider: GPU resources may also be specified in `limits` (though typically equal to `requests`). If pods specify only limits, they'd be missed.
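For illustration, a minimal sketch of a limits-aware check, assuming k8s-openapi types; the helper names here are illustrative, not the PR's actual functions:
```rust
use k8s_openapi::api::core::v1::{Container, Pod};
use k8s_openapi::apimachinery::pkg::api::resource::Quantity;
use std::collections::BTreeMap;

// Illustrative only: treat a container as GPU-requesting if nvidia.com/gpu
// appears in either requests or limits (the PR checks requests).
fn container_wants_gpu(container: &Container) -> bool {
    let has_gpu = |resources: &Option<BTreeMap<String, Quantity>>| {
        resources
            .as_ref()
            .and_then(|map| map.get("nvidia.com/gpu"))
            .map(|quantity| quantity.0 != "0")
            .unwrap_or(false)
    };
    container
        .resources
        .as_ref()
        .map(|r| has_gpu(&r.requests) || has_gpu(&r.limits))
        .unwrap_or(false)
}

// Mirrors the behavior described above: terminal pods are never "active",
// and both regular and init containers are inspected.
fn is_active_gpu_pod(pod: &Pod) -> bool {
    let phase = pod
        .status
        .as_ref()
        .and_then(|status| status.phase.as_deref())
        .unwrap_or("");
    if matches!(phase, "Succeeded" | "Failed") {
        return false;
    }
    let Some(spec) = pod.spec.as_ref() else { return false };
    spec.containers
        .iter()
        .chain(spec.init_containers.iter().flatten())
        .any(container_wants_gpu)
}
```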
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (6 hunks)
- crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (15 hunks)
- crates/basilica-autoscaler/src/crd/mod.rs (1 hunks)
- crates/basilica-autoscaler/src/crd/node_pool.rs (4 hunks)
- crates/basilica-autoscaler/src/offering_matcher.rs (3 hunks)
- crates/basilica-autoscaler/src/provisioner/mod.rs (1 hunks)
- crates/basilica-autoscaler/src/provisioner/ssh.rs (4 hunks)
- crates/basilica-autoscaler/src/warm_pool.rs (1 hunks)
- orchestrator/k8s/services/autoscaler/crds.yaml (1 hunks)
🔇 Additional comments (22)
crates/basilica-autoscaler/src/crd/mod.rs (1)
4-9: LGTM! The re-export of `WARM_POOL_LABEL` follows the existing pattern and correctly expands the public API surface to support warm pool detection across the codebase.
crates/basilica-autoscaler/src/warm_pool.rs (1)
84-87: LGTM! Replacing the inline label check with `pool.is_warm_pool()` improves code maintainability by encapsulating the warm-pool detection logic in a single location. This aligns with the DRY principle and makes future label changes easier to manage.
crates/basilica-autoscaler/src/provisioner/ssh.rs (1)
25-34: LGTM! Well-structured result type that cleanly encapsulates node identity and GPU metadata. The optional fields for `cuda_version` and `driver_version` appropriately handle cases where NVIDIA drivers aren't present.
crates/basilica-autoscaler/src/provisioner/mod.rs (1)
5-7: LGTM! The re-export of `K3sJoinResult` correctly exposes the new type for consumers of the provisioner module, maintaining a clean public API surface.
crates/basilica-autoscaler/src/offering_matcher.rs (4)
588-622: LGTM! Well-structured fallback chain for GPU model extraction. The prioritized extraction order (annotation → label → nodeAffinity) ensures explicit user intent takes precedence while still supporting pods that only specify GPU requirements via scheduling constraints. The addition of `.filter(|set: &BTreeSet<String>| !set.is_empty())` on line 609 correctly prevents empty label-based sets from blocking the affinity fallback.
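A minimal sketch of that priority chain, with the three candidate sources assumed to be already parsed into optional sets (the real extraction reads pod annotations, labels, and nodeAffinity terms):
```rust
use std::collections::BTreeSet;

// Priority: annotation > label > nodeAffinity. An empty label-derived set
// must not block the affinity fallback, hence the filter.
fn pick_gpu_models(
    from_annotation: Option<BTreeSet<String>>,
    from_labels: Option<BTreeSet<String>>,
    from_node_affinity: Option<BTreeSet<String>>,
) -> Option<BTreeSet<String>> {
    from_annotation
        .or_else(|| from_labels.filter(|set| !set.is_empty()))
        .or_else(|| from_node_affinity)
}
```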
1525-1578: LGTM! The test helper `test_pod_with_gpu_request_and_affinity` is well-constructed, properly setting up both the GPU resource request and nodeAffinity constraints needed to exercise the new fallback path.
1580-1607: LGTM! Good test coverage for the nodeAffinity fallback path. The tests verify both single-model extraction and multi-model scenarios with normalization (e.g., "A100-80GB" → "A10080GB").
1609-1674: LGTM! Critical test case that validates the priority ordering. By setting up a pod with both an annotation (RTX4090) and conflicting nodeAffinity (H100), it confirms that the annotation correctly takes precedence.
crates/basilica-autoscaler/src/crd/node_pool.rs (4)
9-11: LGTM! Centralizing the warm pool label key. Good practice to define this as a constant for consistent usage across the codebase.
23-31: Good observability enhancements for `kubectl get nodepools`. The new print columns surface key runtime details. Priority 1 columns (`Mode`, `Provider`, `CUDA`) will appear with `-o wide`, keeping default output clean.
77-87: Clean helper for warm pool detection. The implementation correctly handles missing labels/metadata and uses the centralized constant.
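A minimal sketch of such a helper, assuming the label value is compared against "true"; the actual key string behind WARM_POOL_LABEL lives in node_pool.rs and may differ:
```rust
use std::collections::BTreeMap;

// Hypothetical key; the real value is the WARM_POOL_LABEL constant in node_pool.rs.
const WARM_POOL_LABEL: &str = "basilica.ai/warm-pool";

// Missing metadata or labels simply means "not a warm pool".
fn is_warm_pool(labels: Option<&BTreeMap<String, String>>) -> bool {
    labels
        .and_then(|labels| labels.get(WARM_POOL_LABEL))
        .map(|value| value == "true")
        .unwrap_or(false)
}
```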
342-357: Comprehensive provisioning phase detection. All intermediate phases from `Pending` through `WaitingForNode` are correctly included. This aligns with the race condition fix to prevent duplicate scale-ups while nodes are being provisioned.
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (4)
120-134: Critical race condition fix: treating status-less NodePools as provisioning. The comment at lines 121-123 and the `unwrap_or(true)` at line 132 are essential. Newly created NodePools may not have status set yet by the NodePoolController, and treating them as provisioning prevents duplicate scale-ups during this window.
464-489: Optimistic locking prevents warm pool scale-up race conditions. The pattern correctly uses `resource_version` for conflict detection. Returning `Ok(())` on conflict (line 488) allows the winning reconcile to proceed without triggering error backoff on losing reconciles.
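For reference, a minimal sketch of the conflict handling this pattern relies on, assuming a kube API write guarded by the resourceVersion the object was read with; a 409 means another reconcile won the race and is treated as a clean skip rather than an error:
```rust
use kube::Error;

// Interpret the result of a resourceVersion-guarded write:
// Ok        -> this reconcile owns the scale-up,
// 409       -> lost the optimistic lock, skip without error backoff,
// other Err -> surface as a real error.
fn interpret_guarded_write<T>(result: Result<T, Error>) -> Result<bool, Error> {
    match result {
        Ok(_) => Ok(true),
        Err(Error::Api(response)) if response.code == 409 => Ok(false),
        Err(err) => Err(err),
    }
}
```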
510-511: Verify local status synchronization after atomic K8s update. The atomic `try_increment_pending_scale_up` already persists the incremented value to K8s. Line 511 updates the local `status` copy to stay consistent. Ensure that the subsequent `update_scaling_policy_status` at the end of `reconcile` (line 385) doesn't conflict or cause double-counting if the atomic update used a different resource version than the final status update.
1695-1753: Good test coverage for `is_active_gpu_pod`. Tests cover the key scenarios: running/pending (active), succeeded/failed (inactive), no GPU (inactive), and init containers with GPU.
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (4)
425-426: LGTM: Non-error retry path prevents tight error-backoff loop. Returning `Ok(())` when IP isn't yet available uses the success interval (~10s) for retry, giving the K8s cache time to sync with the persisted `rental_id`. This prevents the race condition described in the comments above.
462-470: Critical fix: prevents duplicate rentals by avoiding error backoff. The comment block clearly explains the issue: error backoff retries within milliseconds, before the K8s cache reflects the persisted `rental_id`. Returning `Ok(())` uses the longer success interval, allowing the cache to sync. On the next reconcile, line 404 will reuse the existing rental.
730-744: Clean structured data flow from K3s join. Using `K3sJoinResult` (from provisioner) provides explicit fields for `node_name`, `cuda_version`, and `driver_version`. This aligns with the new CRD print columns for runtime visibility.
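Based on the fields named in the commits, the result type presumably looks roughly like this (a sketch, not the exact definition in ssh.rs):
```rust
/// Result of joining a rented machine to the K3s cluster.
pub struct K3sJoinResult {
    /// Kubernetes node name reported after the agent joins.
    pub node_name: String,
    /// Detected CUDA version, None when detection returned "unknown" (e.g. no NVIDIA driver).
    pub cuda_version: Option<String>,
    /// Detected NVIDIA driver version, None when detection returned "unknown".
    pub driver_version: Option<String>,
}
```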
1203-1222: Improved observability: phase transition duration logging. Logging `duration_ms`, `phase_from`, and `phase_to` enables debugging slow provisioning phases and tracking node lifecycle performance.
orchestrator/k8s/services/autoscaler/crds.yaml (2)
559-574: Useful ScalingPolicy overview columns. The columns expose key scaling configuration (`Enabled`, `Min`, `Max`) and runtime state (`Current`) for quick `kubectl get scalingpolicies` inspection.
22-51: NodePool print columns enhance operational visibility. The new columns provide at-a-glance information:
- `External-IP`, `GPU`, `GPUs`, `VRAM` for resource details
- `Warm` for warm pool membership
- Priority 1 columns (`Mode`, `Provider`, `CUDA`) for detailed debugging with `-o wide`
The Rust `#[kube(printcolumn = ...)]` attributes in `node_pool.rs` are synchronized with all YAML definitions—all 11 columns match in names, types, jsonPath values, and priorities.
* Format long printcolumn attributes in node_pool.rs
* Format multi-line expression in scaling_policy_controller.rs
* Format insert call in offering_matcher.rs tests
* Add check_known_moe_models() for models where param extraction fails
* DeepSeek-V3.1: 671B MoE requires 8x H100 (640GB)
* Qwen3-235B-A22B (Qwen3-Max): 235B MoE requires 4x H100 (320GB)
* MiniMax-M2: 230B MoE requires 4x H100 (320GB)
* Kimi-K2-Thinking: 1T MoE requires 2x H100 (160GB)
* Add tests for MoE flagship model GPU estimation
Actionable comments posted: 2
♻️ Duplicate comments (1)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)
762-815: Idle node identification doesn't match scale-down selection.
`find_idle_nodes_for_scale_down` at line 777 identifies specific idle nodes (returns their names), but `scale_down` at line 802 receives all non-warm-pool NodePools and selects the `count` newest Ready nodes based on its internal logic at lines 1127-1145. This means the nodes actually removed may not be the idle nodes identified. Consider passing the identified idle node names to `scale_down` or filtering `non_warm_pools` to only include the idle nodes:
🔎 Proposed fix to ensure idle nodes are removed
```diff
 let non_warm_pools: Vec<_> = node_pools
     .iter()
     .filter(|p| !p.is_warm_pool())
+    .filter(|p| {
+        p.metadata.name.as_ref()
+            .map(|n| idle_nodes.contains(n))
+            .unwrap_or(false)
+    })
     .cloned()
     .collect();
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (15 hunks)
- crates/basilica-autoscaler/src/crd/node_pool.rs (4 hunks)
- crates/basilica-autoscaler/src/offering_matcher.rs (3 hunks)
- crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (3 hunks)
🔇 Additional comments (19)
crates/basilica-autoscaler/src/offering_matcher.rs (4)
588-622: LGTM! Well-implemented fallback hierarchy. The three-tier priority for GPU model extraction is cleanly implemented with proper filtering at each level. The addition of nodeAffinity as a third fallback ensures pods with only scheduling constraints (no explicit annotations/labels) can still be provisioned correctly.
Key strengths:
- Lazy evaluation prevents unnecessary work
- Empty results properly filtered at each priority level
- Consistent normalization applied to all sources
- Idiomatic use of `or_else` chains and `then` for Option conversion
1524-1581: LGTM! Comprehensive test helper. The new test helper properly constructs pods with both GPU resource requests and nodeAffinity constraints, enabling thorough testing of the new fallback logic.
1583-1610: LGTM! Thorough coverage of nodeAffinity extraction. The tests verify both single and multiple GPU model extraction from nodeAffinity, including proper normalization (e.g., "A100-80GB" → "A10080GB"). Good validation of the new fallback path.
1612-1677: LGTM! Priority order correctly validated. Excellent test that verifies annotation takes precedence over nodeAffinity when both are present. The conflicting values (RTX4090 in annotation vs H100 in affinity) clearly demonstrate the intended priority hierarchy.
crates/basilica-autoscaler/src/crd/node_pool.rs (4)
9-11: LGTM: Centralized constant for warm pool labeling. The constant provides a single source of truth for the warm pool label, improving maintainability.
21-37: LGTM: Enhanced CRD printer columns for better visibility. The additional columns improve operational visibility when inspecting NodePools via kubectl. The JSONPath expressions are correctly formatted, including proper escaping for the Warm column label.
83-93: LGTM: Clean helper method for warm pool detection. The implementation correctly handles missing labels and metadata gracefully, returning false by default.
348-363: LGTM: Well-designed helper for provisioning state detection. The method cleanly encapsulates all provisioning phases and provides a clear API for callers. The use of `matches!` is idiomatic and efficient.
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (10)
14-14: LGTM: Import of centralized constant. Using the constant from the CRD module ensures consistency across the codebase.
120-134: LGTM: Defensive provisioning counter prevents race conditions. Treating NodePools without status as provisioning is a smart safeguard that prevents duplicate scale-ups before the NodePoolController has initialized the phase.
173-176: LGTM: Idle scale-down runs even when scale-up is blocked. The design correctly decouples idle-time scale-down from scale-up logic, preventing cost waste from idle nodes regardless of scale-up status.
Also applies to: 277-279
368-380: LGTM: Idle scale-down integrated with NoAction path. The logic correctly sets the Stable condition only when no scale-down action was taken, avoiding conflicting status updates.
464-489: LGTM: Optimistic locking prevents warm pool scale-up races. The implementation correctly uses optimistic locking to prevent concurrent reconciles from creating duplicate warm pool nodes, consistent with the regular scale-up pattern.
537-541: LGTM: Consistent use of `is_warm_pool()` helper. Replacing inline label checks with the centralized helper improves maintainability and reduces code duplication.
Also applies to: 1341-1341, 1367-1367
1394-1445: LGTM: Robust active GPU pod detection. The implementation correctly identifies active GPU workloads by checking both pod phase and GPU resource requests in containers and init containers. The logic properly excludes terminal states (Succeeded/Failed).
1453-1461: LGTM: Use of `is_provisioning()` helper. The refactoring improves code clarity and consistency by using the centralized helper method.
1517-1517: LGTM: Use of WARM_POOL_LABEL constant. Using the constant ensures consistency with the CRD label definition and improves maintainability.
1627-1751: LGTM: Comprehensive test coverage for GPU pod detection. The test helpers are well-designed, and the test cases cover all relevant scenarios including edge cases like init containers and terminal pod states.
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (1)
35-42: LGTM! Clean priority chain for model detection. The logic correctly prioritizes known MoE/flagship models before attempting parameter extraction, with appropriate fallbacks. The flow is clear and maintainable.
* Add DELETION_FORCE_TIMEOUT (1 hour) for stuck deletions
* Add deletion_started_at to NodePoolStatus for timeout tracking
* Add forced_deletions_total metric to track potential orphaned VMs
* Force finalizer removal after timeout in handle_deletion and terminating
* Treat 404/410 as success in stop_rental for idempotent cleanup
* Make stop_rental critical in GC and Deleted cleanup (return error to retry)
* Add node_pool_matches_requirements() to check GPU compatibility
* Add count_serviceable_pending_pods() to count pods that idle nodes can serve
* Skip scale-up when idle nodes can serve all pending pods based on GPU model/count/memory
* Pass pending_pods and node_pools to evaluate_scaling for GPU-aware decisions
* Remove hardcoded MoE model exceptions (users can override with CLI flags)
* Remove model family heuristics in favor of parameter extraction
* Use 80GB default GPU memory (A100/H100)
* Use 16GB default memory when param count cannot be extracted
* Simplify regex to handle both lowercase and uppercase 'B' suffix
* Update tests to reflect simplified behavior
* Add quickstart.py with @deployment decorator usage
* Add vllm_qwen_small.py for verified small model deployment
* Add README.md with configuration reference and model sizing guide
* Add reference/README.md with vLLM and SGLang multi-GPU examples
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/basilica-autoscaler/src/api/secure_cloud.rs (1)
777-810: Align deregister_node with stop_rental's idempotent deletion pattern.
`stop_rental` treats 404/410 as successful idempotent cleanup (lines 651–658), but `deregister_node` returns errors for all non-success statuses. Since `deregister_node` is called from 6 distinct locations in `node_pool_controller.rs` during various lifecycle operations, concurrent or retried calls could fail unnecessarily. Apply the same 404/410 → success pattern to ensure consistency and resilience.
♻️ Duplicate comments (1)
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (1)
800-825: Potential mismatch: idle node identification vs. actual scale-down selection.
`find_idle_nodes_for_scale_down` identifies specific idle nodes (line 800), but lines 819-823 pass all non-warm-pool NodePools to `scale_down`. The `scale_down` method (lines 1148-1182) selects the `count` newest Ready nodes from the provided list, which may not be the same nodes identified as idle.
🔎 Recommended fix to ensure idle nodes are actually removed
```diff
-let non_warm_pools: Vec<_> = node_pools
+// Filter to only the idle nodes identified for removal
+let idle_pools: Vec<_> = node_pools
     .iter()
-    .filter(|p| !p.is_warm_pool())
+    .filter(|p| {
+        p.metadata.name.as_ref()
+            .map(|n| idle_nodes.contains(n))
+            .unwrap_or(false)
+    })
     .cloned()
     .collect();

-self.scale_down(ns, &non_warm_pools, count).await?;
+self.scale_down(ns, &idle_pools, count).await?;
```
This ensures the nodes actually removed are the ones identified as idle based on your timeout criteria.
🧹 Nitpick comments (9)
examples/inference/quickstart.py (2)
14-22: Consider adding timeout parameter for consistency. The related example `vllm_qwen_small.py` includes `timeout=600` in the deployment decorator. Adding a timeout would provide consistent behavior and better user experience if the deployment takes longer than expected.
🔎 Suggested addition
```diff
 @deployment(
     name="my-llm",
     image="vllm/vllm-openai:latest",
     gpu="A100",
     gpu_count=1,
     memory="40Gi",
     port=8000,
     ttl_seconds=3600,
+    timeout=600,
 )
```
34-39: Remove unnecessary f-string prefixes. Lines 34, 36, 38, and 39 use f-string prefixes but contain no placeholders. Only lines 35 and 37 actually interpolate values.
🔎 Proposed fix
```diff
 result = serve()
-print(f"\nDeployment ready!")
+print("\nDeployment ready!")
 print(f" Endpoint: {result.url}/v1/chat/completions")
-print(f"\nTest with:")
+print("\nTest with:")
 print(f' curl {result.url}/v1/chat/completions \\')
-print(f' -H "Content-Type: application/json" \\')
-print(f' -d \'{{"model": "Qwen/Qwen2.5-0.5B-Instruct", "messages": [{{"role": "user", "content": "Hello!"}}]}}\'')
+print(' -H "Content-Type: application/json" \\')
+print(' -d \'{"model": "Qwen/Qwen2.5-0.5B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}\'')
```
examples/inference/vllm_qwen_small.py (2)
9-15: Clarify GPU requirements in documentation. The docstring states the model requires only ~2GB GPU RAM but then specifies "1x A100 GPU (80GB VRAM)" as a requirement. This appears inconsistent. If the model truly only needs 2GB, consider:
- Clarifying that A100 is suggested for testing but not strictly required
- Mentioning that smaller GPUs could work
- Or explaining why the full A100 is specified despite low memory needs
58-58: Remove unnecessary f-string prefix. Line 58 uses an f-string prefix but contains no placeholders.
🔎 Proposed fix
```diff
-print(f"vLLM Qwen2.5-0.5B deployment ready!")
+print("vLLM Qwen2.5-0.5B deployment ready!")
```
examples/inference/reference/README.md (1)
89-107: Add blank lines around tables for proper markdown formatting. Markdown tables should be surrounded by blank lines. Add a blank line before the vLLM table (line 92) and the SGLang table (line 101).
🔎 Proposed fix
```diff
 ## Framework Options
 ### vLLM
+
 | Option | Description |
 |--------|-------------|
 | `--tensor-parallel-size` | Split model across GPUs |
 | `--max-model-len` | Maximum sequence length |
 | `--gpu-memory-utilization` | GPU memory fraction (0.0-1.0) |
 | `--quantization` | Quantization method (awq, gptq) |
 | `--trust-remote-code` | Trust model's custom code |
 ### SGLang
+
 | Option | Description |
 |--------|-------------|
 | `--tp` | Tensor parallelism size |
 | `--mem-fraction-static` | Static memory fraction |
 | `--context-length` | Maximum context length |
 | `--quantization` | Quantization method (fp8) |
```
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (3)
35-48: Consider adding special handling for quantized and MoE models. The current heuristic (params × 2 × 1.2) assumes FP16 weights, but quantized models (Q8, Q4, etc.) require significantly less memory, while MoE architectures may need more due to activated expert subsets. This could lead to over-provisioning for quantized models and under-provisioning for MoE models.
Consider parsing quantization hints from model names (e.g., "Q8", "Q4") and adjusting the multiplier accordingly, or documenting that users should override defaults for these model types using CLI flags.
68-72: GPU count capped at 8 may under-provision very large models. The clamp at 8 GPUs will cap models requiring >640GB (e.g., models needing 16+ GPUs). While users can override via CLI flags, consider adding a log warning or documentation note for models that hit this ceiling.
Optional: Add warning for capped GPU counts
```diff
 fn calculate_gpu_count(required_memory_gb: u32, gpu_memory_gb: u32) -> u32 {
     let count = (required_memory_gb as f32 / gpu_memory_gb as f32).ceil() as u32;
-    count.clamp(1, 8)
+    let clamped = count.clamp(1, 8);
+    if count > 8 {
+        eprintln!(
+            "Warning: Model requires {} GPUs but capped at 8. Use --gpu flag to override.",
+            count
+        );
+    }
+    clamped
 }
```
96-106: Clarify the rounding behavior in the comment. The comment on line 98 states "truncated then rounded up to 8", which is slightly misleading. The actual behavior is: convert to `u32` (truncating the decimal), then round up to the nearest multiple of 8 using `div_ceil(8) * 8`.
Clearer comment
```diff
-// 7B: 7 * 2 * 1.2 = 16.8 -> 16GB (truncated then rounded up to 8)
+// 7B: 7 * 2 * 1.2 = 16.8 -> 16 (as u32) -> 16 (rounded to nearest 8GB)
 assert_eq!(estimate_memory_from_params(7.0), 16);
-// 0.5B: 0.5 * 2 * 1.2 = 1.2 -> 8GB
+// 0.5B: 0.5 * 2 * 1.2 = 1.2 -> 1 (as u32) -> 8 (rounded to nearest 8GB)
 assert_eq!(estimate_memory_from_params(0.5), 8);
-// 70B: 70 * 2 * 1.2 = 168 -> 168GB
+// 70B: 70 * 2 * 1.2 = 168 -> 168 (already multiple of 8)
 assert_eq!(estimate_memory_from_params(70.0), 168);
-// 235B: 235 * 2 * 1.2 = 564 -> 568GB
+// 235B: 235 * 2 * 1.2 = 564 -> 564 (as u32) -> 568 (rounded to nearest 8GB)
 assert_eq!(estimate_memory_from_params(235.0), 568);
```
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (1)
209-274: Add operational guardrails for forced deletion scenarios. The timeout-based forced deletion logic is sound—it prevents stuck CustomResources by attempting critical cleanup (rental stop) but proceeding with finalizer removal after 1 hour even on failure. This accepts the risk of orphaned VMs when rental APIs are unavailable. The `record_forced_deletion` metric is properly called in both deletion paths and `DELETION_FORCE_TIMEOUT` is consistently enforced. To operationalize this pattern, implement the following (a sketch of the underlying timeout check follows this list):
- Alerts on `record_forced_deletion` metric increments to detect forced deletions in production
- Periodic reconciliation job to detect and clean up orphaned VMs by cross-referencing inactive rentals
- Runbook documenting manual recovery procedures when forced deletions occur
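A minimal sketch of the timeout check the guardrails would alert on; field, constant, and metric names follow the PR notes, while the surrounding controller wiring is assumed:
```rust
use chrono::{DateTime, Utc};
use std::time::Duration;

// Mirrors DELETION_FORCE_TIMEOUT from config.rs (1 hour).
const DELETION_FORCE_TIMEOUT: Duration = Duration::from_secs(60 * 60);

// True once a deletion has been pending longer than the timeout, at which
// point the controller records forced_deletions_total and strips the finalizer.
fn should_force_delete(deletion_started_at: Option<DateTime<Utc>>, now: DateTime<Utc>) -> bool {
    deletion_started_at
        .map(|started| (now - started).to_std().unwrap_or_default() >= DELETION_FORCE_TIMEOUT)
        .unwrap_or(false)
}
```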
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- crates/basilica-autoscaler/src/api/secure_cloud.rs (1 hunks)
- crates/basilica-autoscaler/src/config.rs (1 hunks)
- crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (10 hunks)
- crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (17 hunks)
- crates/basilica-autoscaler/src/crd/node_pool.rs (5 hunks)
- crates/basilica-autoscaler/src/metrics.rs (5 hunks)
- crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (4 hunks)
- examples/inference/README.md (1 hunks)
- examples/inference/quickstart.py (1 hunks)
- examples/inference/reference/README.md (1 hunks)
- examples/inference/vllm_qwen_small.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-13T15:34:46.544Z
Learnt from: epappas
Repo: one-covenant/basilica PR: 184
File: crates/basilica-validator/src/rental/billing.rs:129-130
Timestamp: 2025-10-13T15:34:46.544Z
Learning: In the basilica-validator crate, rental_id is a String type (not UUID) throughout the system - in RentalInfo struct, database rentals.id column, and all rental operations. The persistence layer methods should accept &str for rental_id parameters, not &Uuid, since they immediately convert to String for SQL binding anyway.
Applied to files:
crates/basilica-autoscaler/src/api/secure_cloud.rs
🧬 Code graph analysis (2)
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (2)
crates/basilica-cli/src/cli/handlers/deploy/templates/sglang.rs (1)
model (154-169)
crates/basilica-cli/src/cli/handlers/deploy/templates/vllm.rs (1)
model (155-170)
examples/inference/quickstart.py (2)
examples/inference/vllm_qwen_small.py (1)
serve (37-52)
crates/basilica-sdk-python/python/basilica/deployment.py (1)
url (183-191)
🪛 markdownlint-cli2 (0.18.1)
examples/inference/reference/README.md
92-92: Tables should be surrounded by blank lines
(MD058, blanks-around-tables)
101-101: Tables should be surrounded by blank lines
(MD058, blanks-around-tables)
🪛 Ruff (0.14.8)
examples/inference/quickstart.py
25-29: Starting a process with a partial executable path
(S607)
27-27: Possible binding to all interfaces
(S104)
34-34: f-string without any placeholders
Remove extraneous f prefix
(F541)
36-36: f-string without any placeholders
Remove extraneous f prefix
(F541)
38-38: f-string without any placeholders
Remove extraneous f prefix
(F541)
39-39: f-string without any placeholders
Remove extraneous f prefix
(F541)
examples/inference/vllm_qwen_small.py
46-46: Possible binding to all interfaces
(S104)
52-52: subprocess call: check for execution of untrusted input
(S603)
58-58: f-string without any placeholders
Remove extraneous f prefix
(F541)
🔇 Additional comments (25)
crates/basilica-autoscaler/src/api/secure_cloud.rs (1)
649-658: Excellent idempotency fix for race condition! Treating 404/410 as success correctly handles the case where multiple tasks attempt to delete the same rental. The early status capture and explicit check before `is_success()` ensures clean, idempotent deletion semantics. The circuit breaker will appropriately record these as successes rather than failures.
examples/inference/vllm_qwen_small.py (1)
24-36: LGTM! The deployment configuration is well-structured with appropriate timeout and environment variables for optimizing HuggingFace downloads.
examples/inference/README.md (2)
45-51: Verify memory requirements for small models. Line 47 states that models <1B require 8GB of memory, but the `vllm_qwen_small.py` example documents that the Qwen2.5-0.5B model (500M parameters) requires only ~2GB GPU RAM. Consider updating the table to reflect more accurate memory requirements, perhaps "2-8GB" to account for variation.
56-80: LGTM! The framework examples and management commands are well-documented and align with the example scripts in the repository.
examples/inference/reference/README.md (2)
8-44: LGTM! The vLLM reference examples are well-structured and demonstrate appropriate resource configurations for both single-GPU and multi-GPU deployments.
108-128: LGTM! The gated models and volume caching examples provide clear guidance for advanced deployment scenarios.
crates/basilica-cli/src/cli/handlers/deploy/templates/model_size.rs (5)
26-29: LGTM: Regex pattern correctly handles case-insensitive parameter extraction. The updated pattern now matches both lowercase and uppercase 'B' suffixes, supporting patterns like "7b", "235B", and decimal values like "0.5b".
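A hedged guess at what such a pattern looks like; the exact regex lives in model_size.rs and may differ in detail:
```rust
use regex::Regex;

// A decimal parameter count followed by a 'b'/'B' suffix, e.g. "0.5b", "7b", "235B".
fn extract_params_billions(model: &str) -> Option<f32> {
    let re = Regex::new(r"(\d+(?:\.\d+)?)\s*[bB]\b").ok()?;
    re.captures(model)?.get(1)?.as_str().parse().ok()
}

fn main() {
    assert_eq!(extract_params_billions("Qwen/Qwen2.5-0.5b-Instruct"), Some(0.5));
    assert_eq!(extract_params_billions("Llama-3-70B"), Some(70.0));
    assert_eq!(extract_params_billions("no-size-here"), None);
    println!("parameter extraction examples pass");
}
```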
50-56: LGTM: Parameter extraction logic is clean and correct.
58-66: LGTM: Memory estimation formula is reasonable for FP16 models. The 2GB per billion parameters with 20% overhead is a sensible heuristic, and rounding to 8GB increments aligns with typical GPU memory allocation patterns.
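Under that reading, and consistent with the test values quoted in the earlier nitpick, the heuristic is roughly the following (a sketch, not the crate's exact code):
```rust
// FP16 weights (~2 GB per billion parameters) plus 20% overhead,
// truncated to u32 and rounded up to the next multiple of 8 GB.
fn estimate_memory_from_params(params_billions: f32) -> u32 {
    let raw_gb = (params_billions * 2.0 * 1.2) as u32;
    raw_gb.div_ceil(8) * 8
}

fn main() {
    assert_eq!(estimate_memory_from_params(0.5), 8);     // 1.2  -> 1   -> 8
    assert_eq!(estimate_memory_from_params(7.0), 16);    // 16.8 -> 16  -> 16
    assert_eq!(estimate_memory_from_params(70.0), 168);  // 168  -> 168 -> 168
    assert_eq!(estimate_memory_from_params(235.0), 568); // 564  -> 564 -> 568
    println!("memory heuristic matches the documented cases");
}
```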
74-81: LGTM: GPU recommendation logic is sensible for common use cases. The threshold at 80GB appropriately distinguishes between A100 and H100 requirements.
118-147: LGTM: Comprehensive test coverage for the updated estimation logic. Tests properly verify:
- Small models (0.5B) with 1 GPU
- Medium models (7B) with A100 recommendation
- Large models (70B, 235B) with H100 recommendation and appropriate GPU counts
- Unknown models falling back to DEFAULT_MEMORY_GB
- Case-insensitive parameter extraction (uppercase B)
crates/basilica-autoscaler/src/config.rs (1)
229-232: LGTM! Timeout constant is well-documented. The 1-hour timeout provides a reasonable balance between allowing transient API issues to resolve and preventing indefinite blocking. The documentation clearly explains the purpose and risk mitigation strategy.
crates/basilica-autoscaler/src/metrics.rs (1)
34-34: LGTM! Forced deletions metric properly implemented. The new metric is correctly initialized, registered with Prometheus, and includes a clear help text warning about potential orphaned VMs. The recording method follows the existing pattern.
Also applies to: 170-174, 361-363, 434-434, 504-506
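A minimal sketch of such a counter registration with the prometheus crate; the help text is paraphrased from this review and the crate's actual metrics struct wiring is assumed:
```rust
use prometheus::{IntCounter, Registry};

// Counts NodePools whose finalizer was force-removed after the deletion
// timeout, i.e. cases where the backing VM may have been orphaned.
fn register_forced_deletions(registry: &Registry) -> prometheus::Result<IntCounter> {
    let counter = IntCounter::new(
        "forced_deletions_total",
        "NodePool deletions forced after timeout; backing VM may be orphaned",
    )?;
    registry.register(Box::new(counter.clone()))?;
    Ok(counter)
}
```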
crates/basilica-autoscaler/src/crd/node_pool.rs (3)
9-10: LGTM! Clean abstraction for warm pool detection. The `WARM_POOL_LABEL` constant and `is_warm_pool()` helper replace magic strings with a clear, reusable abstraction used consistently across the codebase.
Also applies to: 83-93
348-363: LGTM! Comprehensive provisioning phase detection. The `is_provisioning()` method correctly identifies all in-progress phases and provides a clear way to count NodePools being set up. This prevents duplicate scale-up decisions.
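For illustration, a minimal sketch of what such a helper typically looks like; aside from Pending and WaitingForNode, the phase names below are hypothetical stand-ins for the CRD's real variants:
```rust
// Hypothetical subset of NodePoolPhase for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodePoolPhase {
    Pending,
    Renting,
    InstallingK3s,
    WaitingForNode,
    Ready,
    Failed,
}

impl NodePoolPhase {
    // Everything between creation and Ready counts as "still provisioning",
    // so the scaler does not start a duplicate scale-up for the same demand.
    fn is_provisioning(&self) -> bool {
        matches!(
            self,
            Self::Pending | Self::Renting | Self::InstallingK3s | Self::WaitingForNode
        )
    }
}
```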
517-522: LGTM! Deletion timeout tracking support. The `deletion_started_at` field enables timeout-based forced cleanup to prevent indefinitely stuck deletions. The documentation clearly explains its purpose.
crates/basilica-autoscaler/src/controllers/node_pool_controller.rs (3)
458-460: IP-not-available handling change is correct but requires careful monitoring. Changing from `Err` to `Ok()` when IP is not yet available prevents duplicate rentals by using success_interval (~10s) instead of error backoff (milliseconds), giving the K8s cache time to sync with the persisted rental_id. This is a critical race condition fix. Monitor that:
- Rental IDs are persisted before returning Ok()
- The success_interval is long enough for cache sync
- No duplicate rentals are created
The comments explain the rationale well, but consider adding integration tests to verify this behavior under concurrent reconciliation.
Also applies to: 502-503
763-778: LGTM! GPU metadata captured from join result. The code correctly captures `node_name`, `cuda_version`, and `driver_version` from the K3s join result, providing better observability and enabling GPU-aware scheduling decisions.
1271-1294: LGTM! Phase transition duration logging improves observability. The added duration logging in phase transitions provides valuable debugging information and helps identify bottlenecks in the provisioning pipeline.
Also applies to: 1303-1329
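A minimal sketch of that logging using the tracing crate; field names follow the review, and the surrounding controller state is assumed:
```rust
use std::time::Instant;

// Emit how long the NodePool spent in the previous phase.
fn log_phase_transition(entered_previous_phase_at: Instant, phase_from: &str, phase_to: &str) {
    let duration_ms = entered_previous_phase_at.elapsed().as_millis() as u64;
    tracing::info!(duration_ms, phase_from, phase_to, "NodePool phase transition");
}
```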
crates/basilica-autoscaler/src/controllers/scaling_policy_controller.rs (6)
121-134: LGTM! Correct handling of newly-created NodePools. Treating NodePools without status as provisioning prevents race conditions where scale-up happens before the NodePoolController sets the initial phase. This is a defensive programming practice.
470-496: LGTM! Optimistic locking prevents duplicate warm pool scale-ups. The optimistic locking using `try_increment_pending_scale_up` with resource version correctly prevents race conditions between concurrent reconciles. The early return on conflict is appropriate.
713-730: LGTM! GPU-aware idle node matching prevents unnecessary scale-up. The serviceable count check ensures that idle nodes can actually serve pending pods based on GPU model, count, and memory requirements. This prevents provisioning new nodes when existing idle nodes can handle the workload.
1369-1401: LGTM! Comprehensive GPU requirement matching. The `node_pool_matches_requirements` method correctly validates GPU count, memory, and model requirements. The normalized model matching with prefix support handles model variations well.
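A minimal sketch of the normalization and prefix matching described here, consistent with the "A100-80GB" → "A10080GB" example from the tests; the real matcher may differ in detail:
```rust
// Strip separators and uppercase so "A100-80GB" and "a100 80gb" compare equal.
fn normalize_gpu_model(model: &str) -> String {
    model
        .chars()
        .filter(|c| c.is_ascii_alphanumeric())
        .collect::<String>()
        .to_ascii_uppercase()
}

// Accept exact or prefix matches, so a pod asking for "A100" is satisfied
// by a pool offering "A100-80GB".
fn models_match(required: &str, offered: &str) -> bool {
    let (required, offered) = (normalize_gpu_model(required), normalize_gpu_model(offered));
    offered.starts_with(&required) || required.starts_with(&offered)
}

fn main() {
    assert_eq!(normalize_gpu_model("A100-80GB"), "A10080GB");
    assert!(models_match("A100", "A100-80GB"));
    assert!(!models_match("H100", "A100-80GB"));
    println!("model matching examples pass");
}
```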
1482-1533: LGTM! Comprehensive active GPU pod detection. The `is_active_gpu_pod` method correctly:
- Excludes terminal phase pods (Succeeded/Failed)
- Checks both regular containers and init containers for GPU requests
- Prevents premature scale-down of nodes running GPU workloads
1727-1839: LGTM! Thorough test coverage for is_active_gpu_pod. The test cases cover all critical scenarios:
- Running/Pending pods with GPU (active)
- Succeeded/Failed pods with GPU (inactive)
- Running pods without GPU (inactive)
- Init containers with GPU (active)
* Use function reference instead of closure in filter_map