-
Notifications
You must be signed in to change notification settings - Fork 7
feat(hyperstack): add webhook callback support for VM lifecycle events #307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdds Hyperstack webhook support: new HyperstackConfig, optional hyperstack provider wiring, webhook callback types and API route, provider/client changes to pass callback URLs, service updates to store structured connection_info, and CLI SSH retry/connection improvements with silent checks and exponential backoff. Changes
Sequence Diagram(s)sequenceDiagram
actor CloudProvider as Hyperstack
participant API as Basilica API (webhook handler)
participant Config as AggregatorConfig
participant DB as Database
participant Aggregator as AggregatorService
participant CLI as basilica-cli
CloudProvider->>API: POST /webhooks/cloud-provider/hyperstack (token + payload)
activate API
API->>Config: Validate token via HyperstackConfig.webhook_secret
Config-->>API: validation result
API->>API: Parse HyperstackCallback (vm_id, operation, data)
API->>DB: SELECT deployment WHERE provider_instance_id = vm_id
DB-->>API: deployment record or none
alt deployment found
API->>API: map_callback_status / map_hyperstack_status
alt status == Running and data.floating_ip present
API->>API: build connection_info JSON (ssh_host, ssh_port, ssh_user)
API->>DB: UPDATE deployment SET status, ip_address, connection_info, updated_at
else other status
API->>DB: UPDATE deployment SET status, updated_at
end
DB-->>API: OK
else deployment missing
API-->>CloudProvider: 200 OK (noop)
end
API-->>CloudProvider: 200 OK
deactivate API
DB-->>Aggregator: deployment status change observed
Aggregator-->>CLI: clients poll see Running
CLI->>CLI: retry_ssh_connection -> try_connect_silently/test_connection with backoff
CLI-->>User: interactive session (on success)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
9f8a68b to
2827dff
Compare
Replace polling with push-based callbacks from Hyperstack for deployment status updates. The webhook secret in the URL path provides authentication.
Handle different Hyperstack callback formats by making resource and operation fields optional, adding root-level field support, and extracting data.status for actual VM state. Exclude Hyperstack rentals from polling health checks since they use webhooks.
Move authentication token from URL path (/webhooks/hyperstack/:token) to query parameter (/webhooks/hyperstack?token=...) for cleaner API.
2827dff to
c51e4aa
Compare
Change webhook route from /webhooks/hyperstack to /webhooks/cloud-provider/hyperstack for clearer API organization.
Add try_connect_silently() for passphrase-enabled SSH retries. Normalize connection_info to consistent JSON format across providers.
c51e4aa to
0557338
Compare
Make resource and operation fields required instead of optional with fallbacks. Add explicit JSON rejection handling and improve status mapping with extracted helper function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (5)
crates/basilica-cli/src/cli/handlers/gpu_rental.rs (1)
2136-2186: Well-structured retry loop with one minor UX observation.The exponential backoff logic is correct:
- Timeout check at loop start (line 2140) with accurate attempt count (
attempt - 1)- Sleep skipped on first attempt (line 2155)
- Spinner cleared before SSH attempt to avoid clutter with passphrase prompts (line 2161)
- Proper transition from
try_connect_silentlyprobe tointeractive_sessionon successOne minor observation: on the first attempt, the spinner is created (line 2152) and immediately cleared (line 2161) with no visible wait time, potentially causing a brief flicker. This is cosmetic and doesn't affect functionality.
🔎 Optional: Skip spinner creation on first attempt
loop { attempt += 1; // Check timeout before attempting if start_time.elapsed() >= max_wait { return Err(CliError::Internal( eyre!( "SSH connection failed after {} attempts over {}s", attempt - 1, start_time.elapsed().as_secs(), ) .suggestion("The SSH service may not be ready yet. Wait a minute and try 'basilica ssh <rental_id>'"), )); } - // Show spinner during wait periods - let spinner = create_spinner(&format!("Waiting for SSH... (attempt {})", attempt)); - // Brief delay between attempts (skip on first attempt) if attempt > 1 { + let spinner = create_spinner(&format!("Waiting for SSH... (attempt {})", attempt)); tokio::time::sleep(interval).await; interval = std::cmp::min(interval * 2, MAX_INTERVAL); + complete_spinner_and_clear(spinner); } - // Clear spinner before SSH attempt (in case passphrase prompt appears) - complete_spinner_and_clear(spinner); - // Try silent connection (suppresses errors, allows passphrase)crates/basilica-aggregator/src/config.rs (1)
64-72: Consider URL-encoding the webhook secret in the callback URL.If
webhook_secretcontains URL-special characters (e.g.,&,=,+,#), the generated URL could break or be misinterpreted. Consider usingurlencoding::encode()orpercent_encodingto safely encode the token value.🔎 Proposed fix
+use urlencoding::encode; + impl HyperstackConfig { /// Build callback URL with token as query parameter pub fn callback_url(&self) -> String { format!( "{}/webhooks/cloud-provider/hyperstack?token={}", self.callback_base_url.trim_end_matches('/'), - self.webhook_secret + encode(&self.webhook_secret) ) } }crates/basilica-aggregator/src/providers/hyperstack/types.rs (1)
382-395: Consider adding a test for numeric ID deserialization.The current test only covers string ID format. Adding a test for numeric ID would validate the
deserialize_idhelper's number-to-string conversion path.🔎 Suggested additional test
#[test] fn vm_id_handles_numeric_resource_id() { let callback: HyperstackCallback = serde_json::from_value(json!({ "resource": { "id": 507279 }, "operation": { "name": "createInstance", "status": "CREATING" } })) .expect("valid callback"); assert_eq!(callback.vm_id(), "507279"); }crates/basilica-api/src/api/routes/webhooks.rs (2)
108-115: Duplication of connection_info construction logic.This JSON structure for
connection_infoduplicates the logic inbuild_connection_infofromcrates/basilica-aggregator/src/service.rs. Consider importing and reusing that helper to ensure consistency.🔎 Potential refactor to reuse helper
If the
build_connection_infofunction were made public and accessible from this module, you could use it here:- // Build connection_info if we have an IP - let connection_info = ip_address.as_ref().map(|ip| { - json!({ - "ssh_host": ip, - "ssh_port": 22, - "ssh_user": "ubuntu" - }) - }); + // Build connection_info if we have an IP + let connection_info = basilica_aggregator::service::build_connection_info(&ip_address);This requires making
build_connection_infopublic in the aggregator crate.
69-73: Consider logging at debug level instead of trace for unknown VM IDs.While returning 200 OK for unknown VM IDs is correct to avoid information leakage, logging at
tracelevel may make it difficult to detect legitimate issues (e.g., webhooks arriving before deployment records are created). Considerdebuglevel for better operational visibility.🔎 Proposed change
- tracing::trace!(vm_id = %vm_id, "Webhook received for unknown VM ID"); + tracing::debug!(vm_id = %vm_id, "Webhook received for unknown VM ID");
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
crates/basilica-aggregator/src/config.rscrates/basilica-aggregator/src/providers/hyperstack/client.rscrates/basilica-aggregator/src/providers/hyperstack/mod.rscrates/basilica-aggregator/src/providers/hyperstack/types.rscrates/basilica-aggregator/src/service.rscrates/basilica-api/src/api/mod.rscrates/basilica-api/src/api/routes/jobs.rscrates/basilica-api/src/api/routes/mod.rscrates/basilica-api/src/api/routes/secure_cloud.rscrates/basilica-api/src/api/routes/webhooks.rscrates/basilica-api/src/config/mod.rscrates/basilica-api/src/server.rscrates/basilica-cli/src/cli/handlers/gpu_rental.rscrates/basilica-cli/src/ssh/mod.rs
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-13T15:34:46.544Z
Learnt from: epappas
Repo: one-covenant/basilica PR: 184
File: crates/basilica-validator/src/rental/billing.rs:129-130
Timestamp: 2025-10-13T15:34:46.544Z
Learning: In the basilica-validator crate, rental_id is a String type (not UUID) throughout the system - in RentalInfo struct, database rentals.id column, and all rental operations. The persistence layer methods should accept &str for rental_id parameters, not &Uuid, since they immediately convert to String for SQL binding anyway.
Applied to files:
crates/basilica-api/src/api/routes/secure_cloud.rs
🧬 Code graph analysis (8)
crates/basilica-api/src/api/mod.rs (2)
crates/basilica-api/src/api/routes/health.rs (2)
health_check(9-19)k3s_health_check(22-40)crates/basilica-api/src/api/routes/webhooks.rs (1)
hyperstack_callback(23-155)
crates/basilica-cli/src/ssh/mod.rs (1)
crates/basilica-common/src/ssh/connection.rs (3)
test_connection(66-66)test_connection(630-662)new(119-123)
crates/basilica-api/src/server.rs (1)
crates/basilica-aggregator/src/service.rs (1)
new(51-97)
crates/basilica-api/src/api/routes/webhooks.rs (4)
crates/basilica-api/src/api/routes/secure_cloud.rs (2)
response(626-626)s(295-295)crates/basilica-sdk-python/python/basilica/deployment.py (1)
DeploymentStatus(66-111)crates/basilica-aggregator/src/providers/hyperstack/types.rs (4)
vm_id(332-334)operation_name(342-344)operation_status(337-339)from(274-289)crates/basilica-api/src/server.rs (2)
sqlx(919-919)sqlx(1146-1146)
crates/basilica-api/src/api/routes/secure_cloud.rs (2)
crates/basilica-billing/src/domain/rentals.rs (1)
provider(203-205)crates/basilica-api/src/api/query.rs (1)
provider(27-31)
crates/basilica-cli/src/cli/handlers/gpu_rental.rs (1)
crates/basilica-cli/src/progress/mod.rs (2)
create_spinner(98-109)complete_spinner_and_clear(137-139)
crates/basilica-api/src/api/routes/jobs.rs (1)
crates/basilica-aggregator/src/config.rs (1)
default_for_tests(249-269)
crates/basilica-aggregator/src/providers/hyperstack/client.rs (2)
crates/basilica-aggregator/src/service.rs (1)
new(51-97)crates/basilica-aggregator/src/config.rs (1)
callback_url(66-72)
🔇 Additional comments (26)
crates/basilica-cli/src/ssh/mod.rs (2)
127-185: LGTM - Well-structured timeout handling with spawn_blocking.The implementation properly wraps the blocking SSH command execution in
spawn_blockingand adds a timeout wrapper with 5 seconds overhead to account for SSH connection overhead beyondConnectTimeout. The error handling covers all cases (success, non-zero exit, task join errors, and timeout).
190-233: Synchronous.status()blocks the async runtime, but is acceptable here.The
.status()call on line 224 is synchronous and will block the tokio worker thread. However, this is acceptable because:
ConnectTimeoutis set (line 215), capped at 10 seconds, preventing indefinite blocking- The method intentionally inherits stdin/stdout (lines 220-222) for passphrase prompts, which requires synchronous execution
- This pattern is consistent with
interactive_sessionandinteractive_session_with_optionsin this fileThe capped timeout on line 196 (
std::cmp::min(details.timeout.as_secs(), 10)) is a sensible choice for a quick retry probe.crates/basilica-cli/src/cli/handlers/gpu_rental.rs (1)
2066-2069: LGTM - Correct behavior to continue polling for SSH info.Previously the function might have returned early when a running rental had no SSH command. Now it correctly continues polling, which aligns with the webhook-based approach where SSH info may not be immediately available after a rental is marked as running.
crates/basilica-api/src/api/routes/secure_cloud.rs (1)
152-175: LGTM! Clean implementation of polling skip for Hyperstack.The query extension to include
providerand the skip logic for Hyperstack rentals correctly implements the webhook-based status update pattern. VIP rentals continue to be skipped as before, and the earlycontinueprevents unnecessary API calls to Hyperstack since webhooks handle status updates.crates/basilica-aggregator/src/config.rs (1)
376-415: Good test coverage for HyperstackConfig.The tests properly validate both configuration validation and callback URL generation, including the trailing slash normalization.
crates/basilica-aggregator/src/providers/hyperstack/types.rs (2)
181-182: LGTM! Clean addition of callback_url to deployment requests.The optional
callback_urlfield withskip_serializing_ifcorrectly omits it from the request body when not set.
368-380: Good handling of polymorphic ID deserialization.The
deserialize_idhelper correctly handles both string and numeric IDs from the Hyperstack API. This defensive approach prevents deserialization failures when the API returns either format.crates/basilica-api/src/api/routes/mod.rs (1)
14-14: LGTM!New webhooks module follows the existing pattern for route organization.
crates/basilica-api/src/api/routes/jobs.rs (1)
293-293: LGTM!Test scaffolding correctly includes the new
aggregator_configfield using the test default configuration.crates/basilica-api/src/api/mod.rs (1)
24-28: LGTM! Webhook route correctly placed in public routes.The webhook endpoint appropriately bypasses auth middleware since it uses token-based authentication via query parameter. The handler (per the relevant snippet) validates the token and returns HTTP 200 even for unauthorized requests to prevent information leakage—a good security practice.
crates/basilica-aggregator/src/providers/hyperstack/mod.rs (1)
3-6: LGTM!Making the types module public and re-exporting
HyperstackCallbackenables the webhook handler to consume these types cleanly without reaching into internal module paths.crates/basilica-api/src/config/mod.rs (1)
166-166: LGTM!The type change from
ProviderConfigtoOption<HyperstackConfig>correctly aligns with the aggregator crate's updated configuration structure and enables the webhook-specific fields (webhook_secret,callback_base_url).crates/basilica-api/src/server.rs (3)
81-83: LGTM - Webhook configuration access.Adding
aggregator_configtoAppStateenables the webhook handler to validate tokens against the configuredwebhook_secret. This is the appropriate place to store this configuration for shared access across handlers.
768-775: LGTM - Config cloning for service initialization.Cloning
aggregator_configbefore passing toAggregatorService::newis correct since the config is also needed forAppState. The service now owns its copy of the configuration.
1143-1148: LGTM - Hyperstack exclusion from polling-based health checks.Excluding Hyperstack rentals from the polling-based health check query is correct since Hyperstack now uses webhook callbacks for status updates. The comment clearly explains the rationale.
crates/basilica-aggregator/src/providers/hyperstack/client.rs (4)
23-25: LGTM - Callback URL field for webhook notifications.The optional
callback_urlfield allows Hyperstack to send webhook notifications to this URL when VM lifecycle events occur. The field is properly documented with a comment explaining its purpose.
27-38: LGTM - Constructor signature update.The constructor properly accepts and stores the optional
callback_url. This is a non-breaking API change since it's a new optional parameter.
73-73: LGTM - Log level adjustments.Downgrading pricebook fetch and entry logs from
debugtotracereduces noise during normal operation since these are called frequently by background refresh tasks.Also applies to: 116-126
571-590: LGTM - Callback URL propagation to deployment request.The
callback_urlis correctly cloned fromself.callback_urland included in theDeployVmRequest. This enables Hyperstack to call back to the webhook endpoint when the VM status changes.crates/basilica-aggregator/src/service.rs (3)
76-84: LGTM - Conditional Hyperstack provider initialization.Hyperstack provider initialization is now properly gated behind the optional
HyperstackConfig. The callback URL is correctly retrieved from config and passed to the provider constructor. The log message clearly indicates webhook callbacks are enabled.
525-526: LGTM - Consistent connection_info for new deployments.Using
build_connection_infoensures consistent JSON structure forconnection_infoacross all deployment creation paths.
610-621: LGTM - Connection info updated on deployment status refresh.The
get_deploymentflow now properly updatesconnection_infousing the helper function when refreshing deployment status from the provider.crates/basilica-api/src/api/routes/webhooks.rs (4)
175-207: LGTM - Robust status mapping with fallback logic.The
map_hyperstack_statusfunction handles various Hyperstack status strings with proper case-insensitive matching. The fallback logic using operation names (op_name) provides resilience when the status field is ambiguous (e.g., "success"). The empty string check returnsPendingas a safe default.
209-272: LGTM - Comprehensive unit tests for status mapping.The tests cover key scenarios including:
- Mapping from
data.statusfield- Mapping from
operation.statusfield- Various status values (CREATING, BUILD, ACTIVE, DELETED, ERROR)
- Delete operations with different statuses
121-137: No action needed. Theconnection_infocolumn insecure_cloud_rentalsis alreadyJSONBtype (as defined in migrations), andsqlxproperly handles bindingOption<serde_json::Value>toJSONBcolumns without requiring explicit serialization.
50-54: The proposed constant-time comparison syntax is incorrect and requires correction.While timing-safe comparison is a valid defense-in-depth practice, the suggested fix using
subtle::ConstantTimeEqhas two issues:
- The API syntax
ct_eq().unwrap_u8() != 1is incorrect for thesubtlecrate (v2.6.1). Thect_eq()method returns aChoicetype without anunwrap_u8()method.- The
subtlecrate is not a direct dependency ofcrates/basilica-api—it would need to be explicitly declared.Additionally, the timing attack risk for webhook secret validation is minimal because network latency dominates any timing variations from string comparison, especially over HTTPS. The current approach of returning
StatusCode::OKalready mitigates information leakage effectively.If constant-time comparison is desired, use the correct API (e.g.,
token.as_bytes().ct_eq(hyperstack_config.webhook_secret.as_bytes()).into()and addsubtleas an explicit dependency), or consider hashing both values for comparison as an alternative.
Validate api_key, webhook_secret (URL-safe chars), and callback_base_url.
Extract ssh_params_for_provider to return port/user per provider type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
crates/basilica-aggregator/src/config.rs (1)
261-265: Consider validating thatcallback_base_urlis a valid URL.The validation checks that
callback_base_urlis non-empty, but doesn't verify it's a valid URL (e.g., starts withhttp://orhttps://). An invalid URL would result in malformed callback URLs being sent to Hyperstack.🔎 Proposed enhancement
if hyperstack.callback_base_url.trim().is_empty() { return Err(AggregatorError::Config( "Hyperstack callback_base_url must be non-empty".to_string(), )); } + if !hyperstack.callback_base_url.starts_with("http://") + && !hyperstack.callback_base_url.starts_with("https://") + { + return Err(AggregatorError::Config( + "Hyperstack callback_base_url must start with http:// or https://".to_string(), + )); + }
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
crates/basilica-aggregator/src/config.rscrates/basilica-aggregator/src/service.rs
🧰 Additional context used
🧬 Code graph analysis (1)
crates/basilica-aggregator/src/service.rs (2)
crates/basilica-aggregator/src/config.rs (1)
callback_url(66-72)crates/basilica-aggregator/src/providers/hyperstack/client.rs (1)
new(28-38)
🔇 Additional comments (8)
crates/basilica-aggregator/src/config.rs (3)
53-73: LGTM on the HyperstackConfig struct and callback_url helper.The struct cleanly encapsulates Hyperstack-specific configuration. The
callback_url()method correctly trims trailing slashes from the base URL before constructing the webhook URL. The webhook secret is passed as a query parameter, which is a common pattern for webhook authentication.
295-299: LGTM on the URL-safe token validation.The allowed character set (alphanumeric plus
- _ . ~) correctly matches RFC 3986 unreserved characters, ensuring the token can be safely used in query parameters without URL encoding.
432-444: Good test for callback_url with trailing slash handling.The test correctly verifies that trailing slashes are stripped from
callback_base_urlbefore constructing the webhook URL, preventing double-slash issues.crates/basilica-aggregator/src/service.rs (5)
32-58: LGTM on thebuild_connection_infohelper.Clean extraction of connection info construction. Good defensive handling of empty/whitespace-only usernames by trimming and falling back to "ubuntu". The JSON structure matches the VIP format as documented.
60-65: Partial improvement toward configurable SSH parameters.This helper addresses the past review concern by centralizing SSH parameters per provider. However, the values are still hardcoded. For
Hyperstack,(22, "ubuntu")is returned explicitly, while other providers fall through to(None, None), which causesbuild_connection_infoto use the same defaults anyway.Consider whether other providers (DataCrunch, Lambda, HydraHost) might need different SSH users based on their images. The current structure enables future extension but doesn't yet support per-image customization.
99-107: LGTM on the Hyperstack provider initialization.Clean integration with the new
HyperstackConfig. The callback URL is derived from configuration and passed to the provider, enabling webhook-based status updates instead of polling.
533-552: Clean integration of SSH parameters into deployment creation.The deployment flow correctly uses
ssh_params_for_providerandbuild_connection_infoto construct structured connection information. Storing bothip_addressandconnection_infomaintains backward compatibility while providing the richer connection metadata.
630-648: LGTM on theget_deploymentstatus update flow.Consistent use of the new helpers mirrors the deployment creation pattern. When fetching updated status from the provider,
connection_infois correctly rebuilt if the IP address changes.
Add webhook_secret and callback_base_url configuration for Hyperstack provider to support webhook callbacks for VM status updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
scripts/cloud/compute.tfscripts/cloud/secrets.tfscripts/cloud/terraform.tfvars.examplescripts/cloud/variables.tf
🔇 Additional comments (4)
scripts/cloud/compute.tf (1)
419-421: LGTM!The callback base URL is correctly passed as a non-sensitive environment variable, following the existing naming convention for nested configuration.
scripts/cloud/variables.tf (1)
167-178: LGTM!The variable definitions are well-structured with appropriate defaults, sensitivity marking, and clear descriptions. The URL-safe character constraint mentioned in the description could optionally be enforced with a
validationblock, but the current documentation approach is acceptable.scripts/cloud/terraform.tfvars.example (1)
105-116: LGTM!Clear example configuration with helpful inline comments. The URL-safe character requirement for the webhook secret is appropriately documented.
scripts/cloud/secrets.tf (1)
121-145: LGTM!The secret management implementation follows established patterns in this file:
- The
lifecycle { ignore_changes = [secret_string] }blocks prevent Terraform from flagging changes when secrets are rotated outside of Terraform.- The conditional
counton the secret version allows the secret to be populated only when provided.- The secret resource is always created (providing the "shell"), while the version is optional.
This is consistent with how
kubeconfig,k3s_ssh_key, andhyperstack_api_keyare handled.
Adds webhook callback support for Hyperstack VM lifecycle events, eliminating the need for polling on Hyperstack deployments.
Changes
Webhook Callback System
POST /webhooks/cloud-provider/hyperstack?token=...for receiving VM status updatescreateVM,deleteVM,startVM,stopVM,hibernateVM,restoreVMHyperstack Configuration
HyperstackConfigstruct with required fields:api_key: Hyperstack API keywebhook_secret: URL-safe token for authenticating callbackscallback_base_url: Base URL for webhooks (e.g.,https://api.basilica.ai)Provider-Aware Connection Info
build_connection_info()creates structured JSON withssh_host,ssh_port,ssh_userssh_params_for_provider()returns provider-specific SSH defaults