Skip to content

Conversation

@saber04414
Copy link

@saber04414 saber04414 commented Dec 12, 2025

Adds multi-cluster federation system for managing multiple K3s clusters with geographic distribution and high availability.

Features

  • Multi-cluster API gateway with unified REST API
  • Cross-cluster service discovery
  • Cluster health aggregation
  • Cross-cluster load balancing (5 algorithms)
  • Federated resource management

Implementation

  • New basilica-federation crate
  • Ansible playbook for federation setup
  • Kubernetes deployment manifests
  • API integration module

API Endpoints

  • GET /health - Federation health
  • GET /clusters - List clusters
  • GET /services - List services across clusters
  • GET /proxy/*path - Proxy requests
  • GET /metrics - Prometheus metrics

Contribution by Gittensor, learn more at https://gittensor.io/

Summary by CodeRabbit

Release Notes

  • New Features

    • Multi-cluster federation system for managing geographically distributed clusters with unified REST API
    • Cross-cluster service discovery with automatic caching and health-aware routing
    • Five load-balancing algorithms (RoundRobin, LeastConnections, WeightedRoundRobin, Random, Geographic)
    • Cluster health monitoring and aggregation across federated infrastructure
    • Federated resource management with automatic service synchronization
    • Prometheus metrics support for federation operations
  • Documentation

    • Comprehensive architecture and troubleshooting guides
    • Deployment examples for Kubernetes and Ansible
    • Configuration documentation with federation.toml reference
  • Tests

    • Full test coverage for discovery, health, load balancing, events, metrics, and integration scenarios
  • Chores

    • Kubernetes deployment manifests with RBAC configuration
    • Prometheus alerting rules and service monitoring configuration

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 12, 2025

Walkthrough

Introduces a comprehensive multi-cluster federation system for Basilica as a new basilica-federation crate with API gateway, cross-cluster service discovery, health aggregation, load balancing, and resource management. Integrates with basilica-api and includes Kubernetes deployment manifests and Ansible automation.

Changes

Cohort / File(s) Summary
Workspace Setup
Cargo.toml, crates/basilica-api/Cargo.toml
Added basilica-federation crate to workspace members; added basilica-federation as internal dependency to basilica-api.
Basilica-API Integration
crates/basilica-api/src/federation.rs, crates/basilica-api/src/lib.rs
Introduced FederationClient wrapper struct with async initialization and pub module export for federation functionality.
Configuration
crates/basilica-federation/src/config.rs
Comprehensive configuration structures (FederationConfig, ClusterConfig, GatewayConfig, HealthConfig, LoadBalancerConfig, ResourceManagerConfig, etc.) with Figment-based loading, environment variable support, and accessor methods.
Error Handling
crates/basilica-federation/src/error.rs
FederationError enum with 12 variants covering cluster, Kubernetes, HTTP, discovery, health, load balancing, resource management, and state errors; Result type alias.
Caching
crates/basilica-federation/src/cache.rs
Generic CacheBuilder with TTL/TTI support; ServiceCache and HealthCache type aliases; convenience factory functions.
Service Discovery
crates/basilica-federation/src/discovery/mod.rs
ServiceDiscovery struct managing per-cluster Kubernetes clients with TTL-based caching; periodic refresh; cross-cluster service aggregation; ServiceInfo structure.
Health Aggregation
crates/basilica-federation/src/health/mod.rs
HealthAggregator with per-cluster health checks on Node objects; HealthStatus enum (Healthy, Degraded, Unhealthy); failure-threshold logic; concurrent health state tracking.
Load Balancing
crates/basilica-federation/src/load_balancer/mod.rs
LoadBalancer supporting five algorithms (RoundRobin, LeastConnections, WeightedRoundRobin, Random, Geographic); health-aware and region-aware routing.
Resource Management
crates/basilica-federation/src/resource_manager/mod.rs
ResourceManager for pod listing and retrieval across federated clusters; async resource synchronization with configurable intervals.
Events
crates/basilica-federation/src/events.rs
FederationEvent enum; EventPublisher with broadcast channel; EventHandler trait; EventManager for subscribe/publish workflows.
Metrics
crates/basilica-federation/src/metrics.rs
FederationMetrics struct with Prometheus counters, gauges, and histograms for requests, clusters, discovery, load balancing, health, and resources.
Retry Logic
crates/basilica-federation/src/retry.rs
RetryConfig and RetryExecutor with exponential backoff; per-error-type retry eligibility; async execute method.
API Layer
crates/basilica-federation/src/api/mod.rs
FederationApi orchestrating initialization and startup of all components (discovery, health, load balancer, resource manager, gateway); accessor methods.
Gateway
crates/basilica-federation/src/api/gateway.rs
FederationGateway HTTP server with Axum routing; endpoints for health, cluster discovery, service discovery, pod management, proxying, metrics; permissive CORS and timeout handling.
API Handlers
crates/basilica-federation/src/api/handler.rs
FederationHandler with methods for listing/getting clusters, discovering services, listing resources, selecting clusters, and reporting statistics with error mapping.
API Router
crates/basilica-federation/src/api/router.rs
FederationRouter constructing Axum router with endpoints for health, readiness, liveness, statistics, cluster management, service discovery, resource handling, load balancer, and metrics.
Utilities & Entry Point
crates/basilica-federation/src/utils.rs, crates/basilica-federation/src/main.rs, crates/basilica-federation/src/lib.rs
Kubernetes client creation utility; CLI entry point with Clap for config and log level; module exports and VERSION constant.
Documentation
crates/basilica-federation/README.md, crates/basilica-federation/docs/ARCHITECTURE.md, crates/basilica-federation/docs/TROUBLESHOOTING.md
Comprehensive README with features, configuration, usage, deployment, and API documentation; architecture overview; troubleshooting placeholder.
Examples
crates/basilica-federation/examples/basic_federation.rs, examples/load_balancing.rs, examples/service_discovery.rs
Demonstration code for federation initialization, load balancer routing, and service discovery workflows.
Tests
crates/basilica-federation/tests/*_test.rs, tests/integration_test.rs
Coverage for discovery, error handling, events, health aggregation, load balancer, metrics, and integration scenarios.
Kubernetes Manifests
orchestrator/k8s/services/federation/deployment.yaml, servicemonitor.yaml, prometheusrule.yaml
Kubernetes Deployment (2 replicas) with ServiceAccount, ClusterRole, and ClusterRoleBinding; Prometheus ServiceMonitor and PrometheusRule for alerting.
Ansible Automation
orchestrator/ansible/playbooks/01-setup/federation.yml, orchestrator/ansible/roles/federation/tasks/main.yml
Playbook and role tasks for federation namespace, ConfigMap/Secret provisioning, configuration validation, and deployment health checks.

Sequence Diagram

sequenceDiagram
    participant Main as Federation Main
    participant API as FederationApi
    participant Config as FederationConfig
    participant Discovery as ServiceDiscovery
    participant Health as HealthAggregator
    participant LB as LoadBalancer
    participant ResMgr as ResourceManager
    participant Gateway as FederationGateway
    participant Client as HTTP Client

    Main->>Config: Load configuration
    Main->>API: new(config)
    
    API->>Discovery: new(config)
    Discovery->>Discovery: Initialize Kube clients
    API->>Health: new(config)
    Health->>Health: Initialize Kube clients
    API->>LB: new(config)
    API->>ResMgr: new(config)
    ResMgr->>ResMgr: Initialize Kube clients
    API->>Gateway: new(config, discovery, health, LB, ResMgr)
    Gateway->>Gateway: Build Axum router
    
    API->>Discovery: start()
    Discovery->>Discovery: Spawn refresh task
    API->>Health: start()
    Health->>Health: Spawn health check task
    API->>ResMgr: start()
    ResMgr->>ResMgr: Spawn sync task
    API->>Gateway: start()
    Gateway->>Gateway: Bind and serve

    Client->>Gateway: GET /health
    Gateway->>Health: aggregate_health()
    Health->>Health: Query per-cluster health
    Gateway->>Client: Health status JSON

    Client->>Gateway: GET /clusters
    Gateway->>Discovery: discover_services()
    Discovery->>Discovery: Query cache
    Gateway->>Client: Cluster list JSON

    Client->>Gateway: GET /proxy/*
    Gateway->>LB: select_cluster()
    LB->>Health: Get cluster health
    Gateway->>Client: Forward response
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120+ minutes

Areas requiring careful attention:

  • Kubernetes client initialization and error propagation (discovery/mod.rs, utils.rs, health/mod.rs, resource_manager/mod.rs) — Multiple kubeconfig loading paths, YAML deserialization, per-cluster error handling, and resilience to client creation failures.
  • Async component lifecycle management (api/mod.rs, main.rs) — Initialization order, startup sequencing, error propagation through ? operator in async constructors, and potential deadlock or race conditions during component startup.
  • HTTP API routing and handler implementation (gateway.rs, router.rs, handler.rs) — Proper extraction of state, parameter parsing, error-to-status-code mapping, and hop-by-hop header filtering in proxy logic.
  • Cache consistency and TTL behavior (cache.rs, discovery/mod.rs, health/mod.rs) — TTL expirations, race conditions on cache hits/misses, and invalidation semantics.
  • Load balancing algorithm correctness (load_balancer/mod.rs) — Round-robin counter management, health-aware filtering, region-aware logic, and handling of empty candidate sets.
  • Configuration validation and defaults (config.rs) — Figment layer merging, environment variable precedence, validation of thresholds (e.g., failure_threshold > success_threshold), and interaction with serde defaults.
  • Error type completeness and mappings (error.rs) — All error paths properly captured; error messages user-friendly; automatic From conversions working as intended.

Poem

🐰 Federation blooms in clusters bright,
Cross-cloud discovery takes flight!
Load balancing hops from shore to shore,
Health checks watch, and gates explore,
A rabbit's gift: one mesh to bind them all. 🌍✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: K3s Multi-Cluster Federation' is directly aligned with the main change: adding a comprehensive multi-cluster federation system for K3s clusters. It clearly summarizes the primary feature being introduced.
Docstring Coverage ✅ Passed Docstring coverage is 94.04% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Implement comprehensive multi-cluster federation for managing multiple K3s
clusters with geographic distribution and high availability.

Components:
- Multi-cluster API gateway with unified REST API
- Cross-cluster service discovery with caching
- Cluster health aggregation and monitoring
- Cross-cluster load balancing (5 algorithms)
- Federated resource management

Implementation:
- New basilica-federation crate (~1,655 lines Rust)
- Ansible playbook for federation setup
- Kubernetes deployment manifests
- Integration with basilica-api

Features:
- Health-aware and region-aware routing
- Automatic service discovery across clusters
- Real-time cluster health monitoring
- Request proxying to target clusters
- Prometheus metrics support
@saber04414 saber04414 force-pushed the feat/k3s-multi-cluster-federation branch from 1e94e64 to 330ee23 Compare December 12, 2025 19:11
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (6)
crates/basilica-federation/README.md-15-32 (1)

15-32: Add a language tag to the fenced diagram block (markdownlint MD040).

The ASCII architecture diagram fence should specify a language (e.g., text) to satisfy lint and render consistently.

crates/basilica-federation/src/load_balancer/mod.rs-77-83 (1)

77-83: WeightedRoundRobin is misnamed — this selects max priority only.

Weighted round-robin should distribute requests proportionally to weights across all candidates over time. This implementation always selects the highest-priority cluster, which is priority-based selection, not weighted distribution.

Either rename this to reflect actual behavior, or implement proper weighted distribution using the priority values as weights.

crates/basilica-api/src/federation.rs-39-44 (1)

39-44: Test passes trivially with default config containing no clusters.

The FederationConfig::default() creates a valid but non-functional configuration with an empty clusters vector. While FederationApi::new() initializes successfully with no clusters (sub-components handle empty clusters gracefully), the test doesn't verify that the federation client can actually perform load balancing or service discovery. The test should either provide a valid configuration with at least one cluster or mock the dependencies to verify meaningful behavior.

crates/basilica-federation/src/discovery/mod.rs-91-108 (1)

91-108: Discovery starts “empty” until the first interval tick.
start() only refreshes inside the interval loop (Lines 97-107), so /services can return nothing immediately after startup. Consider calling refresh_services(...) once before entering the loop.

crates/basilica-federation/src/health/mod.rs-158-187 (1)

158-187: total == 0 currently reports Unhealthy; probably should be Unknown.
As written, a cluster with zero nodes (or an empty list due to RBAC scoping quirks) is treated as Unhealthy (Line 180-186). If you want “no data” vs “bad data” separation, return HealthStatus::Unknown when total == 0.

crates/basilica-federation/src/api/gateway.rs-113-125 (1)

113-125: Health endpoint always returns "status": "healthy" even on partial failures.
Since aggregate_health() can include unhealthy clusters, the top-level "status" (Lines 116-119) is misleading. Consider deriving overall status (e.g., healthy/degraded/unhealthy) based on per-cluster statuses + missing-data cases.

🧹 Nitpick comments (10)
crates/basilica-api/Cargo.toml (1)

9-20: Consider feature-gating basilica-federation to avoid forcing the full dependency graph.

If federation isn’t required for all basilica-api builds, make the dependency optional = true and enable it via a federation feature (then gate pub mod federation / usages behind that feature).

crates/basilica-api/src/federation.rs (1)

29-32: is_enabled() always returns true — consider deriving from configuration.

This stub always returns true, but federation may be disabled in certain deployments. Consider reading from FederationConfig or checking if clusters are configured.

-    /// Check if federation is enabled
-    pub fn is_enabled(&self) -> bool {
-        true
-    }
+    /// Check if federation is enabled
+    pub fn is_enabled(&self) -> bool {
+        !self.api.config().enabled_clusters().is_empty()
+    }
crates/basilica-federation/src/resource_manager/mod.rs (2)

46-63: Background task is fire-and-forget with no shutdown mechanism.

The spawned task runs indefinitely with no way to cancel it or await its completion. This can cause issues during graceful shutdown or testing. Consider returning a JoinHandle or using a cancellation token.

+    use tokio::sync::watch;
+    use tokio::task::JoinHandle;
+
     /// Start resource synchronization
-    pub async fn start(&self) {
+    pub fn start(&self, shutdown: watch::Receiver<bool>) -> JoinHandle<()> {
         let config = self.config.clone();
         let clients = self.clients.clone();
         
-        tokio::spawn(async move {
+        tokio::spawn(async move {
             let mut interval = tokio::time::interval(config.resource_manager.sync_interval);
+            let mut shutdown = shutdown;
             
             loop {
-                interval.tick().await;
+                tokio::select! {
+                    _ = interval.tick() => {}
+                    _ = shutdown.changed() => break,
+                }
                 
                 if config.resource_manager.auto_distribute {
                     if let Err(e) = Self::sync_resources(&config, &clients).await {
                         warn!(error = %e, "Failed to sync resources");
                     }
                 }
             }
-        });
+        })
     }

82-104: Sequential cluster iteration — consider parallel fetching for better latency.

Pod listing iterates clusters sequentially. For federations with multiple clusters, parallel fetching via futures::future::join_all would reduce total latency.

use futures::future::join_all;

let results = join_all(self.clients.iter().map(|(cluster_id, client)| {
    let cluster_id = cluster_id.clone();
    let client = client.clone();
    async move {
        let pods_api: Api<Pod> = Api::namespaced(client, namespace);
        (cluster_id, pods_api.list(&ListParams::default()).await)
    }
})).await;
crates/basilica-federation/src/lib.rs (1)

14-27: Crate public surface looks coherent; consider whether consumers need a typed version instead of &str.
If external integrations compare versions, a semver::Version (or at least a fn version() -> Version) can avoid string parsing at call sites.

crates/basilica-federation/src/api/mod.rs (2)

31-60: Potential double-initialization of HealthAggregator via LoadBalancer::new + set_health.
Given LoadBalancer::new(config) appears to create its own HealthAggregator (see crates/basilica-federation/src/load_balancer/mod.rs snippet), FederationApi::new likely allocates two aggregators and discards one. Consider changing LoadBalancer::new to accept Arc<HealthAggregator> (or make health optional until set).


63-90: Task spawning is redundant and complicates shutdown.
ServiceDiscovery::start, HealthAggregator::start, and ResourceManager::start already tokio::spawn their own loops (per provided snippets), so FederationApi::start spawning another task just to call them adds noise and makes graceful shutdown/cancellation harder. Prefer calling discovery.start().await; directly (or making start() non-async), and return JoinHandles if you want lifecycle control.

crates/basilica-federation/src/health/mod.rs (1)

201-233: Avoid format!("{:?}", health.status) as an API contract.
Since HealthStatus already derives Serialize, returning the enum directly (or a stable #[serde(rename_all = "snake_case")]) avoids accidental breaking changes if variants/Debug formatting changes.

crates/basilica-federation/src/discovery/mod.rs (2)

64-90: Remove duplicated create_kube_client helper (prefer crate::utils::create_kube_client).
You already import and use crate::utils::create_kube_client (Line 5, Line 38), so the in-module async fn create_kube_client (Lines 64-89) is redundant and will drift over time.


196-217: get_service selection is ambiguous across clusters (and namespaces if not provided).
get_service returns the first match by name (Lines 204-207). With multi-cluster federation, returning all matches (or requiring namespace + returning per-cluster entries) avoids surprising routing/UX.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 484cf92 and 330ee23.

📒 Files selected for processing (19)
  • Cargo.toml (1 hunks)
  • crates/basilica-api/Cargo.toml (1 hunks)
  • crates/basilica-api/src/federation.rs (1 hunks)
  • crates/basilica-api/src/lib.rs (1 hunks)
  • crates/basilica-federation/Cargo.toml (1 hunks)
  • crates/basilica-federation/README.md (1 hunks)
  • crates/basilica-federation/src/api/gateway.rs (1 hunks)
  • crates/basilica-federation/src/api/mod.rs (1 hunks)
  • crates/basilica-federation/src/config.rs (1 hunks)
  • crates/basilica-federation/src/discovery/mod.rs (1 hunks)
  • crates/basilica-federation/src/error.rs (1 hunks)
  • crates/basilica-federation/src/health/mod.rs (1 hunks)
  • crates/basilica-federation/src/lib.rs (1 hunks)
  • crates/basilica-federation/src/load_balancer/mod.rs (1 hunks)
  • crates/basilica-federation/src/main.rs (1 hunks)
  • crates/basilica-federation/src/resource_manager/mod.rs (1 hunks)
  • crates/basilica-federation/src/utils.rs (1 hunks)
  • orchestrator/ansible/playbooks/01-setup/federation.yml (1 hunks)
  • orchestrator/k8s/services/federation/deployment.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (8)
crates/basilica-federation/src/utils.rs (1)
crates/basilica-federation/src/discovery/mod.rs (2)
  • create_kube_client (65-89)
  • new (33-62)
crates/basilica-api/src/federation.rs (1)
crates/basilica-federation/src/api/mod.rs (1)
  • new (31-60)
crates/basilica-federation/src/api/mod.rs (5)
crates/basilica-federation/src/discovery/mod.rs (3)
  • create_kube_client (65-89)
  • new (33-62)
  • start (92-108)
crates/basilica-federation/src/api/gateway.rs (2)
  • new (37-59)
  • start (96-111)
crates/basilica-federation/src/health/mod.rs (2)
  • new (56-85)
  • start (88-145)
crates/basilica-federation/src/load_balancer/mod.rs (1)
  • new (28-35)
crates/basilica-federation/src/resource_manager/mod.rs (2)
  • new (19-43)
  • start (46-63)
crates/basilica-federation/src/discovery/mod.rs (2)
crates/basilica-federation/src/utils.rs (1)
  • create_kube_client (7-31)
crates/basilica-federation/src/config.rs (1)
  • default (223-264)
crates/basilica-federation/src/load_balancer/mod.rs (5)
crates/basilica-federation/src/api/mod.rs (2)
  • health (103-105)
  • new (31-60)
crates/basilica-api/src/federation.rs (1)
  • new (15-22)
crates/basilica-federation/src/api/gateway.rs (1)
  • new (37-59)
crates/basilica-federation/src/health/mod.rs (1)
  • new (56-85)
crates/basilica-federation/src/config.rs (1)
  • enabled_clusters (277-279)
crates/basilica-federation/src/api/gateway.rs (6)
crates/basilica-federation/src/api/mod.rs (7)
  • discovery (98-100)
  • health (103-105)
  • load_balancer (108-110)
  • resource_manager (113-115)
  • new (31-60)
  • start (63-90)
  • gateway (93-95)
crates/basilica-federation/src/discovery/mod.rs (3)
  • new (33-62)
  • get_service (197-217)
  • start (92-108)
crates/basilica-federation/src/health/mod.rs (2)
  • new (56-85)
  • start (88-145)
crates/basilica-federation/src/load_balancer/mod.rs (1)
  • new (28-35)
crates/basilica-federation/src/resource_manager/mod.rs (4)
  • new (19-43)
  • list_pods (76-107)
  • get_pod (110-133)
  • start (46-63)
crates/basilica-federation/src/config.rs (1)
  • get_cluster (282-284)
crates/basilica-federation/src/main.rs (5)
crates/basilica-api/src/federation.rs (1)
  • new (15-22)
crates/basilica-federation/src/api/mod.rs (1)
  • new (31-60)
crates/basilica-federation/src/health/mod.rs (1)
  • new (56-85)
crates/basilica-federation/src/load_balancer/mod.rs (1)
  • new (28-35)
crates/basilica-federation/src/config.rs (2)
  • load (269-274)
  • default (223-264)
crates/basilica-federation/src/health/mod.rs (4)
crates/basilica-federation/src/utils.rs (1)
  • create_kube_client (7-31)
crates/basilica-federation/src/api/mod.rs (2)
  • new (31-60)
  • health (103-105)
crates/basilica-federation/src/load_balancer/mod.rs (1)
  • new (28-35)
crates/basilica-federation/src/config.rs (1)
  • default (223-264)
🪛 Checkov (3.2.334)
orchestrator/k8s/services/federation/deployment.yaml

[medium] 1-76: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 1-76: Minimize the admission of root containers

(CKV_K8S_23)

🪛 markdownlint-cli2 (0.18.1)
crates/basilica-federation/README.md

15-15: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (7)
Cargo.toml (1)

1-22: Workspace member addition looks correct.

crates/basilica-api/src/lib.rs (1)

14-29: pub mod federation; is fine—just ensure it matches the intended feature/dependency story.

If you decide to feature-gate the dependency in crates/basilica-api/Cargo.toml, also gate this module export to keep builds consistent.

crates/basilica-federation/src/error.rs (1)

1-56: Well-structured error handling with comprehensive variants.

The error enum properly covers all federation operation categories with clear messages. Good use of thiserror with #[from] conversions for underlying error types.

crates/basilica-federation/src/resource_manager/mod.rs (1)

114-127: Pod search order is non-deterministic.

HashMap iteration order is not guaranteed. If the same pod name exists in multiple clusters (unlikely but possible), the returned pod depends on iteration order. Consider documenting this behavior or accepting a cluster_id parameter.

crates/basilica-federation/src/load_balancer/mod.rs (1)

64-66: Fallback to first enabled cluster when all candidates are unhealthy.

When candidates.is_empty() (all clusters unhealthy), this falls back to enabled_clusters.first(). This may route traffic to an unhealthy cluster. Consider returning None or logging a warning.

crates/basilica-federation/src/api/gateway.rs (1)

61-93: and

crates/basilica-federation/src/config.rs (1)

145-160: LGTM: clean enum-based LB algorithm config + optional sticky sessions.
The shape is straightforward to consume elsewhere and avoids “stringly-typed” algorithms.

Also applies to: 161-179

Copy link
Member

@epappas epappas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code, doesn't actually create a federation, and it's not in compliance with the overall design, neither the proxy nor network design of the platform.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

🧹 Nitpick comments (25)
COMMIT_MESSAGE.txt (3)

3-5: Consider wrapping lines at ~72–100 chars for readability in git tooling.
Line 3-4 may display awkwardly depending on viewer; minor polish.


13-18: Implementation bullets are good; consider adding concrete integration touchpoints.
If you want this commit message to help future archaeology, add brief identifiers (e.g., crate path, key module names, or entrypoint binary) rather than only “Integration with basilica-api”.


19-24: Feature list is solid; consider clarifying “region-aware” source of truth.
If “region-aware routing” is driven by labels/annotations/config, naming it here can reduce ambiguity.

crates/basilica-federation/tests/error_test.rs (3)

5-51: Strengthen Display tests and reduce duplication.

The Display tests only verify substring containment, which doesn't catch formatting regressions or ensure consistent error messages across variants. Additionally, the tests follow an identical pattern that could be parameterized.

Consider:

  1. Asserting exact error message formats (e.g., "Config error: test error") rather than substring matching
  2. Parameterizing tests to reduce duplication and improve maintainability

Example refactor using a parameterized test pattern:

#[test]
fn test_federation_error_display() {
    let test_cases = vec![
        (FederationError::Config("test error".to_string()), "Configuration error: test error"),
        (FederationError::ClusterNotFound("cluster-1".to_string()), "Cluster not found: cluster-1"),
        (FederationError::Discovery("discovery failed".to_string()), "Discovery error: discovery failed"),
        // ... other variants
    ];
    
    for (error, expected) in test_cases {
        assert_eq!(format!("{}", error), expected);
    }
}

1-70: Add tests for Error trait implementation.

The test suite doesn't verify the std::error::Error trait implementation or error conversions. Consider adding tests for:

  • source() method if error chaining is supported
  • From/Into implementations for error conversions
  • Debug formatting output

Example:

#[test]
fn test_error_trait_implementation() {
    let error = FederationError::Config("test".to_string());
    // Verify it implements Error trait
    let _: &dyn std::error::Error = &error;
}

#[test]
fn test_error_debug_format() {
    let error = FederationError::Config("test error".to_string());
    let debug_output = format!("{:?}", error);
    assert!(debug_output.contains("Config"));
    assert!(debug_output.contains("test error"));
}

53-69: Result type tests provide minimal coverage.

These tests verify trivial is_ok()/is_err() behavior but don't exercise meaningful error handling scenarios like error propagation with the ? operator or realistic error flows.

Consider either removing these tests or replacing them with integration tests that demonstrate actual error handling patterns in the federation codebase.

orchestrator/ansible/roles/federation/tasks/main.yml (1)

18-31: Config validation runs even when Kubernetes isn’t being configured (missing when: kubeconfig is defined parity).
If this role is meant to support “generate-only” runs, that’s fine; otherwise, consider gating both template + validation (or at least the validation) behind the same condition (or a dedicated flag like federation_generate_config_only).

orchestrator/k8s/services/federation/servicemonitor.yaml (1)

1-17: Ensure Prometheus can actually “see” this ServiceMonitor (selector labels + port name).
Two common gotchas to double-check:

  • Your Prometheus’ serviceMonitorSelector matches metadata.labels here (often requires a stack-specific label like release: ...).
  • The Service exposing federation metrics has a port named metrics and has app: basilica-federation label (so the selector matches).
 apiVersion: monitoring.coreos.com/v1
 kind: ServiceMonitor
 metadata:
   name: basilica-federation
   namespace: basilica-federation
   labels:
     app: basilica-federation
+    # Add any label required by your Prometheus Operator serviceMonitorSelector, e.g.:
+    # release: kube-prometheus-stack
 spec:
   selector:
     matchLabels:
       app: basilica-federation
+  # Consider making the namespace selection explicit if your setup expects it:
+  # namespaceSelector:
+  #   matchNames: ["basilica-federation"]
   endpoints:
   - port: metrics
     path: /metrics
     interval: 30s
     scrapeTimeout: 10s
crates/basilica-federation/src/events.rs (1)

112-136: EventManager::start() snapshots handlers; later register_handler() calls won’t take effect.
If dynamic handler registration is expected, handlers likely needs shared mutability (e.g., Arc<RwLock<Vec<_>>>) or start() should be called only after final registration and documented accordingly.

crates/basilica-federation/examples/basic_federation.rs (2)

16-42: Consider documenting the placeholder paths.

The example uses placeholder kubeconfig paths (/path/to/kubeconfig1, /path/to/kubeconfig2) that won't work as-is. Consider adding a comment at the top of the cluster configuration section explaining that users should replace these with actual paths.

Example comment:

+    // NOTE: Replace the kubeconfig paths below with actual paths to your cluster configurations
     config.clusters.push(basilica_federation::config::ClusterConfig {

47-49: Clarify the blocking nature of start().

If federation.start().await? is a long-running blocking operation (e.g., starting an HTTP server), consider adding a comment to make this clear. Users might expect the program to continue or do additional work after this call.

Example:

-    // Start federation
+    // Start federation (this will block until shutdown)
     println!("Starting federation...");
     federation.start().await?;
crates/basilica-federation/examples/load_balancing.rs (1)

24-29: Handle the None case when no cluster is selected.

When select_cluster() returns None (which will happen for all iterations in this example since clusters aren't configured), the loop silently continues without output. This could confuse users running the example.

Consider handling the None case:

     // Select cluster
     for i in 0..10 {
         if let Some(cluster) = load_balancer.select_cluster().await {
             println!("Request {} routed to cluster: {}", i + 1, cluster.id);
+        } else {
+            println!("Request {} could not be routed (no available clusters)", i + 1);
         }
     }
crates/basilica-federation/tests/discovery_test.rs (1)

1-54: Good edge-case coverage; consider adding positive test cases.

The tests appropriately validate behavior with empty clusters and basic struct creation. However, all tests currently exercise edge cases (no clusters configured). Consider adding integration tests that verify service discovery with actual cluster configurations to ensure the happy path is also covered.

crates/basilica-federation/tests/health_test.rs (1)

14-43: Simplify the health status enum test.

The test uses verbose pattern matching with assert!(true) and assert!(false), which is not idiomatic Rust. The test could be simplified significantly.

Consider this simpler approach:

 #[tokio::test]
 async fn test_health_status_enum() {
     use basilica_federation::health::HealthStatus;
     
-    let healthy = HealthStatus::Healthy;
-    let degraded = HealthStatus::Degraded;
-    let unhealthy = HealthStatus::Unhealthy;
-    let unknown = HealthStatus::Unknown;
-    
-    // Test that all variants exist
-    match healthy {
-        HealthStatus::Healthy => assert!(true),
-        _ => assert!(false),
-    }
-    
-    match degraded {
-        HealthStatus::Degraded => assert!(true),
-        _ => assert!(false),
-    }
-    
-    match unhealthy {
-        HealthStatus::Unhealthy => assert!(true),
-        _ => assert!(false),
-    }
-    
-    match unknown {
-        HealthStatus::Unknown => assert!(true),
-        _ => assert!(false),
-    }
+    // Test that all variants can be constructed
+    assert!(matches!(HealthStatus::Healthy, HealthStatus::Healthy));
+    assert!(matches!(HealthStatus::Degraded, HealthStatus::Degraded));
+    assert!(matches!(HealthStatus::Unhealthy, HealthStatus::Unhealthy));
+    assert!(matches!(HealthStatus::Unknown, HealthStatus::Unknown));
 }
crates/basilica-federation/tests/events_test.rs (3)

5-12: Remove or improve the trivial assertion.

The assert!(true) on line 11 provides no value - it always passes. If the test is only verifying compilation, consider either removing it or adding a more meaningful assertion (e.g., checking that the receiver is not lagged).

Consider:

 #[test]
 fn test_event_publisher_creation() {
     let publisher = EventPublisher::new();
-    let receiver = publisher.subscribe();
+    let _receiver = publisher.subscribe();
     
-    // Should be able to create publisher and subscriber
-    assert!(true);
+    // Test passes if publisher and subscriber can be created without panicking
 }

33-41: Simplify event matching and error handling.

The nested match and assert!(false, "...") pattern is not idiomatic. Consider using panic! or assert_eq! for clearer intent.

Consider:

     if let Ok(Ok(received_event)) = received {
         match (received_event, event) {
             (FederationEvent::ClusterHealthChanged { cluster_id: id1, .. },
              FederationEvent::ClusterHealthChanged { cluster_id: id2, .. }) => {
                 assert_eq!(id1, id2);
             }
-            _ => assert!(false, "Event types don't match"),
+            _ => panic!("Event types don't match"),
         }
+    } else {
+        panic!("Failed to receive event within timeout");
     }

72-97: Remove or improve trivial assertions in manager tests.

Similar to the publisher creation test, these tests contain assert!(true) statements that provide no value. If the tests are only validating compilation, consider removing the assertions or adding meaningful checks.

For test_event_manager_creation:

 #[test]
 fn test_event_manager_creation() {
     let manager = EventManager::new();
-    let publisher = manager.publisher();
+    let _publisher = manager.publisher();
     
-    // Should be able to get publisher
-    assert!(true);
+    // Test passes if manager and publisher accessor work without panicking
 }

For test_event_manager_handler_registration:

     let mut manager = EventManager::new();
     manager.register_handler(std::sync::Arc::new(TestHandler));
     
-    assert!(true);
+    // Test passes if handler registration works without panicking
 }
crates/basilica-federation/src/cache.rs (1)

8-64: Consider documenting or enforcing default capacity.

The build() method creates an unbounded cache if max_capacity is not set. While moka handles this gracefully, unbounded caches can lead to memory issues in long-running services. Consider either:

  1. Documenting this behavior in the build() method
  2. Setting a sensible default capacity
     /// Build the cache
+    /// 
+    /// Note: If max_capacity is not set, the cache will be unbounded.
     pub fn build(self) -> Cache<K, V> {
crates/basilica-federation/src/api/mod.rs (2)

16-16: Remove unused import.

create_kube_client is imported but not used in this file. The Kubernetes client creation is handled within the individual component constructors.

-use crate::utils::create_kube_client;

74-84: Fire-and-forget spawns lose task handles.

The background tasks are spawned without retaining their JoinHandles. If any task panics, the failure will go unnoticed until downstream effects manifest. Consider:

  1. Storing handles in the struct for graceful shutdown
  2. Using tokio::spawn with panic handling or catch_unwind
  3. Adding a shutdown mechanism
+use tokio::task::JoinHandle;
+
 pub struct FederationApi {
     config: Arc<FederationConfig>,
     gateway: Arc<FederationGateway>,
     discovery: Arc<ServiceDiscovery>,
     health: Arc<HealthAggregator>,
     load_balancer: Arc<LoadBalancer>,
     resource_manager: Arc<ResourceManager>,
+    task_handles: Vec<JoinHandle<()>>,
 }

Then store the handles from tokio::spawn and join them on shutdown.

crates/basilica-federation/src/retry.rs (1)

50-99: Consider using saturating arithmetic for backoff calculation.

The conversion chain delay.as_millis() as f64 * multiplier as u64 can lose precision for large delays. While max_delay caps the result, using saturating operations is more defensive.

-                    delay = Duration::from_millis(
-                        (delay.as_millis() as f64 * self.config.backoff_multiplier) as u64
-                    ).min(self.config.max_delay);
+                    let next_millis = delay
+                        .as_millis()
+                        .saturating_mul(self.config.backoff_multiplier as u128)
+                        .min(self.config.max_delay.as_millis());
+                    delay = Duration::from_millis(next_millis as u64);

Alternatively, keep the f64 approach but ensure clamping before the cast.

crates/basilica-federation/src/metrics.rs (1)

190-194: Default implementation panics on failure.

expect() in Default::default() will panic if metric creation fails. For a metrics subsystem, this is aggressive—consider returning a fallback no-op implementation or making initialization fallible at the call site.

crates/basilica-federation/src/api/handler.rs (1)

102-127: Consider extracting the service-to-JSON mapping to reduce duplication.

The mapping closure is duplicated in both branches. Extract it for DRY.

+        let to_json = |s: crate::discovery::ServiceInfo| serde_json::json!({
+            "name": s.name,
+            "namespace": s.namespace,
+            "cluster_id": s.cluster_id,
+            "endpoints": s.endpoints,
+            "labels": s.labels,
+        });
+
         let filtered_services: Vec<_> = if let Some(filter_labels) = labels {
             services.into_iter()
                 .filter(|s| {
                     filter_labels.iter().all(|(k, v)| {
                         s.labels.get(k).map(|sv| sv == v).unwrap_or(false)
                     })
                 })
-                .map(|s| serde_json::json!({
-                    "name": s.name,
-                    "namespace": s.namespace,
-                    "cluster_id": s.cluster_id,
-                    "endpoints": s.endpoints,
-                    "labels": s.labels,
-                }))
+                .map(to_json)
                 .collect()
         } else {
-            services.into_iter()
-                .map(|s| serde_json::json!({
-                    "name": s.name,
-                    "namespace": s.namespace,
-                    "cluster_id": s.cluster_id,
-                    "endpoints": s.endpoints,
-                    "labels": s.labels,
-                }))
-                .collect()
+            services.into_iter().map(to_json).collect()
         };
crates/basilica-federation/src/api/router.rs (2)

87-96: Consider storing FederationHandler in GatewayState to avoid repeated instantiation.

FederationHandler::new() is called in nearly every handler, creating a new instance per request. Since all constructor arguments are Arc-wrapped and come from GatewayState, this is cheap but unnecessary. Storing a pre-constructed FederationHandler in GatewayState would be cleaner and more efficient.

Example refactor for GatewayState:

pub struct GatewayState {
    // ... existing fields ...
    pub handler: Arc<FederationHandler>,
}

Then handlers simplify to:

 async fn statistics(State(state): State<GatewayState>) -> Result<Json<serde_json::Value>, StatusCode> {
-    let handler = FederationHandler::new(
-        state.config.clone(),
-        state.discovery.clone(),
-        state.health.clone(),
-        state.load_balancer.clone(),
-        state.resource_manager.clone(),
-    );
-    handler.handle_statistics().await.map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)
+    state.handler.handle_statistics().await.map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)
 }

166-177: Error handling conflates different failure modes.

Mapping all errors to NOT_FOUND (line 175) loses distinction between "service doesn't exist" and "internal error contacting the discovery service." Consider differentiating error types to return appropriate status codes.

 async fn get_service(
     State(state): State<GatewayState>,
     Path(service_name): Path<String>,
     Query(params): Query<HashMap<String, String>>,
 ) -> Result<Json<serde_json::Value>, StatusCode> {
     let namespace = params.get("namespace");
     match state.discovery.get_service(&service_name, namespace).await {
         Ok(service) => Ok(Json(service)),
-        Err(_) => Err(StatusCode::NOT_FOUND),
+        Err(e) if e.is_not_found() => Err(StatusCode::NOT_FOUND),
+        Err(_) => Err(StatusCode::INTERNAL_SERVER_ERROR),
     }
 }

This depends on whether the error type from discovery.get_service() exposes error categorization.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 330ee23 and 3cc5e42.

📒 Files selected for processing (25)
  • COMMIT_MESSAGE.txt (1 hunks)
  • crates/basilica-federation/docs/ARCHITECTURE.md (1 hunks)
  • crates/basilica-federation/docs/DEPLOYMENT.md (1 hunks)
  • crates/basilica-federation/docs/TROUBLESHOOTING.md (1 hunks)
  • crates/basilica-federation/examples/basic_federation.rs (1 hunks)
  • crates/basilica-federation/examples/load_balancing.rs (1 hunks)
  • crates/basilica-federation/examples/service_discovery.rs (1 hunks)
  • crates/basilica-federation/src/api/handler.rs (1 hunks)
  • crates/basilica-federation/src/api/mod.rs (1 hunks)
  • crates/basilica-federation/src/api/router.rs (1 hunks)
  • crates/basilica-federation/src/cache.rs (1 hunks)
  • crates/basilica-federation/src/events.rs (1 hunks)
  • crates/basilica-federation/src/lib.rs (1 hunks)
  • crates/basilica-federation/src/metrics.rs (1 hunks)
  • crates/basilica-federation/src/retry.rs (1 hunks)
  • crates/basilica-federation/tests/discovery_test.rs (1 hunks)
  • crates/basilica-federation/tests/error_test.rs (1 hunks)
  • crates/basilica-federation/tests/events_test.rs (1 hunks)
  • crates/basilica-federation/tests/health_test.rs (1 hunks)
  • crates/basilica-federation/tests/integration_test.rs (1 hunks)
  • crates/basilica-federation/tests/load_balancer_test.rs (1 hunks)
  • crates/basilica-federation/tests/metrics_test.rs (1 hunks)
  • orchestrator/ansible/roles/federation/tasks/main.yml (1 hunks)
  • orchestrator/k8s/services/federation/prometheusrule.yaml (1 hunks)
  • orchestrator/k8s/services/federation/servicemonitor.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • crates/basilica-federation/docs/TROUBLESHOOTING.md
  • crates/basilica-federation/docs/DEPLOYMENT.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/basilica-federation/src/lib.rs
🧰 Additional context used
🧬 Code graph analysis (12)
crates/basilica-federation/examples/load_balancing.rs (2)
crates/basilica-federation/src/api/mod.rs (2)
  • load_balancer (108-110)
  • new (31-60)
crates/basilica-federation/examples/basic_federation.rs (1)
  • main (7-52)
crates/basilica-federation/tests/discovery_test.rs (2)
crates/basilica-federation/src/api/mod.rs (2)
  • discovery (98-100)
  • new (31-60)
crates/basilica-federation/src/api/handler.rs (1)
  • new (29-43)
crates/basilica-federation/tests/events_test.rs (1)
crates/basilica-federation/src/events.rs (4)
  • publisher (118-120)
  • new (65-68)
  • new (105-110)
  • handle (94-94)
crates/basilica-federation/examples/service_discovery.rs (1)
crates/basilica-federation/src/api/mod.rs (2)
  • discovery (98-100)
  • new (31-60)
crates/basilica-federation/src/api/mod.rs (9)
crates/basilica-federation/src/api/router.rs (1)
  • health (57-65)
crates/basilica-federation/src/discovery/mod.rs (3)
  • create_kube_client (65-89)
  • new (33-62)
  • start (92-108)
crates/basilica-federation/src/utils.rs (1)
  • create_kube_client (7-31)
crates/basilica-federation/src/api/handler.rs (1)
  • new (29-43)
crates/basilica-federation/src/metrics.rs (1)
  • new (42-179)
crates/basilica-api/src/federation.rs (1)
  • new (15-22)
crates/basilica-federation/src/health/mod.rs (2)
  • new (56-85)
  • start (88-145)
crates/basilica-federation/src/api/gateway.rs (2)
  • new (37-59)
  • start (96-111)
crates/basilica-federation/src/resource_manager/mod.rs (2)
  • new (19-43)
  • start (46-63)
crates/basilica-federation/examples/basic_federation.rs (1)
crates/basilica-federation/src/api/mod.rs (1)
  • new (31-60)
crates/basilica-federation/tests/load_balancer_test.rs (1)
crates/basilica-federation/src/api/mod.rs (2)
  • load_balancer (108-110)
  • new (31-60)
crates/basilica-federation/tests/health_test.rs (3)
crates/basilica-federation/src/api/mod.rs (2)
  • health (103-105)
  • new (31-60)
crates/basilica-federation/src/api/router.rs (1)
  • health (57-65)
crates/basilica-federation/src/api/handler.rs (1)
  • new (29-43)
crates/basilica-federation/src/events.rs (1)
crates/basilica-federation/tests/events_test.rs (1)
  • handle (88-90)
crates/basilica-federation/tests/integration_test.rs (1)
crates/basilica-federation/src/config.rs (1)
  • enabled_clusters (277-279)
crates/basilica-federation/src/api/router.rs (4)
crates/basilica-federation/src/api/mod.rs (3)
  • gateway (93-95)
  • new (31-60)
  • health (103-105)
crates/basilica-federation/src/api/handler.rs (1)
  • new (29-43)
crates/basilica-federation/src/metrics.rs (1)
  • new (42-179)
crates/basilica-federation/src/config.rs (1)
  • enabled_clusters (277-279)
crates/basilica-federation/src/api/handler.rs (2)
crates/basilica-federation/src/api/mod.rs (5)
  • discovery (98-100)
  • health (103-105)
  • load_balancer (108-110)
  • resource_manager (113-115)
  • new (31-60)
crates/basilica-federation/src/api/router.rs (1)
  • health (57-65)
🪛 Gitleaks (8.30.0)
orchestrator/ansible/roles/federation/tasks/main.yml

[high] 40-49: Possible Kubernetes Secret detected, posing a risk of leaking credentials/tokens from your deployments

(kubernetes-secret-yaml)

🪛 LanguageTool
crates/basilica-federation/docs/ARCHITECTURE.md

[uncategorized] ~91-~91: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ... - Secure inter-cluster communication - Rate limiting - Request validation ## Monitoring Pr...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

🪛 markdownlint-cli2 (0.18.1)
crates/basilica-federation/docs/ARCHITECTURE.md

59-59: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (17)
COMMIT_MESSAGE.txt (2)

1-1: Subject line is clear and conventional-commit friendly.
Reads well as feat(federation): ....


10-10: Avoid hard-coded quantitative claims in commit messages unless they will be kept up-to-date.

Hard-coded numbers like "(5 algorithms)" and "(~1,655 lines Rust)" become stale when code changes. Either remove the numbers entirely or replace them with concrete, verifiable details (e.g., list the algorithm names instead of a count, or reference the actual crate structure). If your commit message makes specific claims, ensure they remain accurate as the codebase evolves.

Also applies to: 14-14

crates/basilica-federation/tests/integration_test.rs (1)

1-138: Excellent configuration test coverage!

The integration tests provide comprehensive coverage of the FederationConfig surface, including cluster management, enabled cluster filtering, and all subsystem defaults. The tests are well-structured and verify both positive cases and edge cases (like nonexistent cluster retrieval).

crates/basilica-federation/tests/health_test.rs (1)

6-12: Good test coverage for health aggregation.

The tests appropriately validate HealthAggregator creation, NodeHealth structure, and edge cases (empty clusters, nonexistent clusters). The test structure is solid and uses tokio::test appropriately for async operations.

Also applies to: 45-76

crates/basilica-federation/tests/load_balancer_test.rs (1)

1-74: Comprehensive load balancer test coverage!

The tests validate all five load balancing algorithms mentioned in the PR objectives (RoundRobin, LeastConnections, WeightedRoundRobin, Random, Geographic) and verify behavior with health-aware and region-aware configurations. The test structure is solid and appropriate.

crates/basilica-federation/tests/events_test.rs (1)

44-70: Good validation of event variants.

The test properly validates that all three event types can be created and matched. The use of matches! macro is appropriate and idiomatic.

crates/basilica-federation/src/cache.rs (2)

1-6: LGTM!

Module documentation and imports are appropriate. The use of moka::future::Cache is a good choice for async-compatible caching.


66-96: LGTM!

The Default implementation and helper constructors are well-designed. The capacity choices (10,000 for services, 100 for health) are reasonable defaults for their respective use cases.

crates/basilica-federation/src/api/mod.rs (1)

92-115: LGTM!

Accessor methods are straightforward and provide read-only access to the Arc-wrapped components.

crates/basilica-federation/src/retry.rs (3)

7-36: LGTM!

RetryConfig has sensible defaults: 3 attempts, 100ms initial delay, 5s max delay, 2x multiplier. The struct is well-documented.


101-113: LGTM!

The is_retryable implementation correctly categorizes transient errors as retryable and uses a wildcard for unknown variants, ensuring forward compatibility.


116-124: LGTM!

The convenience retry function provides a clean API for one-off retry operations.

crates/basilica-federation/src/api/handler.rs (3)

27-43: LGTM!

Constructor properly initializes the handler with all required dependencies.


46-65: LGTM!

The cluster listing implementation is straightforward and returns appropriate metadata.


68-92: LGTM!

Good error handling with appropriate status codes. The fallback for missing health data is reasonable.

crates/basilica-federation/src/api/router.rs (2)

15-54: Router structure looks well-organized.

The routing setup follows standard Axum patterns with clear grouping of endpoints by domain (health, clusters, services, resources, load balancer, metrics). The use of a unit struct with associated functions is a clean approach for organizing the handlers.

Note: The Arc import on line 13 appears unused since GatewayState fields are already Arc-wrapped. Consider removing it if not needed elsewhere.


201-242: Resource endpoints follow consistent patterns.

The resource listing and pod retrieval handlers are consistent with the established patterns. The same error handling note from the service endpoints applies here.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (3)
crates/basilica-federation/src/api/gateway.rs (3)

161-162: Consider logging health fetch failures.

unwrap_or_default() silently discards errors when fetching cluster health. While returning a default may be appropriate for graceful degradation, consider logging the error to aid debugging and monitoring.

Apply this diff to log the error:

         let health = state.health.get_cluster_health(&cluster_id).await
-            .unwrap_or_default();
+            .unwrap_or_else(|e| {
+                warn!(cluster_id = %cluster_id, error = %e, "Failed to fetch cluster health");
+                Default::default()
+            });

302-318: Consider using reqwest's built-in query parameter handling.

The current implementation manually encodes and concatenates query parameters to the URL string. While functional, this approach is more error-prone and verbose than using reqwest's built-in .query() method, which handles encoding automatically.

Apply this diff to use reqwest's query handling:

-        // Build target URL with query parameters
-        let mut target_url = format!("{}/{}", cluster.api_server, path);
-        if !params.is_empty() {
-            let query_string: Vec<String> = params
-                .iter()
-                .map(|(k, v)| format!("{}={}", urlencoding::encode(k), urlencoding::encode(v)))
-                .collect();
-            target_url.push('?');
-            target_url.push_str(&query_string.join("&"));
-        }
-        
-        // Parse as reqwest::Url (not http::Uri)
-        let url: reqwest::Url = target_url.parse()
+        // Build target URL
+        let target_url = format!("{}/{}", cluster.api_server, path);
+        let url: reqwest::Url = target_url.parse()
             .map_err(|e| {
                 error!(error = %e, "Invalid target URL");
                 StatusCode::BAD_REQUEST
             })?;

Then pass params to the request builder at line 333:

-        let mut request_builder = state.http_client.request(reqwest_method, url);
+        let mut request_builder = state.http_client.request(reqwest_method, url)
+            .query(&params);

416-419: Implement actual metrics collection.

The metrics handler currently returns a placeholder string with no actual metrics. For production use, this endpoint should expose Prometheus-formatted metrics such as request counts, latencies, error rates, cluster health status, and proxy success/failure rates.

Consider integrating a metrics library like prometheus or metrics to track:

  • HTTP request counts by method, path, and status code
  • Request duration histograms
  • Proxy request success/failure rates by target cluster
  • Cluster health check results
  • Active connections

Do you want me to generate a more complete metrics implementation or open an issue to track this?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3cc5e42 and 195e617.

📒 Files selected for processing (3)
  • crates/basilica-federation/Cargo.toml (1 hunks)
  • crates/basilica-federation/src/api/gateway.rs (1 hunks)
  • crates/basilica-federation/src/config.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • crates/basilica-federation/src/config.rs
🔇 Additional comments (1)
crates/basilica-federation/Cargo.toml (1)

13-46: No action needed. The workspace dependencies are correctly pinned and compatible with the http = "1.1" version used in this crate.

@saber04414
Copy link
Author

@epappas Thanks for the feedback - you're right that the current implementation doesn't align with the platform's architecture (Envoy/WireGuard/Operator patterns). I've fixed the security issues, but I need clarification on what federation should actually do and how it should integrate with the existing infrastructure before redesigning it properly.

@epappas
Copy link
Member

epappas commented Dec 12, 2025

@epappas Thanks for the feedback - you're right that the current implementation doesn't align with the platform's architecture (Envoy/WireGuard/Operator patterns). I've fixed the security issues, but I need clarification on what federation should actually do and how it should integrate with the existing infrastructure before redesigning it properly.

You started federation by your own initiative and you're asking me to what it should be done for it?

@epappas
Copy link
Member

epappas commented Dec 15, 2025

closing since this is an AI gen work, and the author didn't provide any additional reasoning behind the change or the intended plan.

@epappas epappas closed this Dec 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants