-
Notifications
You must be signed in to change notification settings - Fork 7
feat: K3s Multi-Cluster Federation #289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: K3s Multi-Cluster Federation #289
Conversation
WalkthroughIntroduces a comprehensive multi-cluster federation system for Basilica as a new Changes
Sequence DiagramsequenceDiagram
participant Main as Federation Main
participant API as FederationApi
participant Config as FederationConfig
participant Discovery as ServiceDiscovery
participant Health as HealthAggregator
participant LB as LoadBalancer
participant ResMgr as ResourceManager
participant Gateway as FederationGateway
participant Client as HTTP Client
Main->>Config: Load configuration
Main->>API: new(config)
API->>Discovery: new(config)
Discovery->>Discovery: Initialize Kube clients
API->>Health: new(config)
Health->>Health: Initialize Kube clients
API->>LB: new(config)
API->>ResMgr: new(config)
ResMgr->>ResMgr: Initialize Kube clients
API->>Gateway: new(config, discovery, health, LB, ResMgr)
Gateway->>Gateway: Build Axum router
API->>Discovery: start()
Discovery->>Discovery: Spawn refresh task
API->>Health: start()
Health->>Health: Spawn health check task
API->>ResMgr: start()
ResMgr->>ResMgr: Spawn sync task
API->>Gateway: start()
Gateway->>Gateway: Bind and serve
Client->>Gateway: GET /health
Gateway->>Health: aggregate_health()
Health->>Health: Query per-cluster health
Gateway->>Client: Health status JSON
Client->>Gateway: GET /clusters
Gateway->>Discovery: discover_services()
Discovery->>Discovery: Query cache
Gateway->>Client: Cluster list JSON
Client->>Gateway: GET /proxy/*
Gateway->>LB: select_cluster()
LB->>Health: Get cluster health
Gateway->>Client: Forward response
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120+ minutes Areas requiring careful attention:
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Implement comprehensive multi-cluster federation for managing multiple K3s clusters with geographic distribution and high availability. Components: - Multi-cluster API gateway with unified REST API - Cross-cluster service discovery with caching - Cluster health aggregation and monitoring - Cross-cluster load balancing (5 algorithms) - Federated resource management Implementation: - New basilica-federation crate (~1,655 lines Rust) - Ansible playbook for federation setup - Kubernetes deployment manifests - Integration with basilica-api Features: - Health-aware and region-aware routing - Automatic service discovery across clusters - Real-time cluster health monitoring - Request proxying to target clusters - Prometheus metrics support
1e94e64 to
330ee23
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 15
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🟡 Minor comments (6)
crates/basilica-federation/README.md-15-32 (1)
15-32: Add a language tag to the fenced diagram block (markdownlint MD040).The ASCII architecture diagram fence should specify a language (e.g.,
text) to satisfy lint and render consistently.crates/basilica-federation/src/load_balancer/mod.rs-77-83 (1)
77-83:WeightedRoundRobinis misnamed — this selects max priority only.Weighted round-robin should distribute requests proportionally to weights across all candidates over time. This implementation always selects the highest-priority cluster, which is priority-based selection, not weighted distribution.
Either rename this to reflect actual behavior, or implement proper weighted distribution using the priority values as weights.
crates/basilica-api/src/federation.rs-39-44 (1)
39-44: Test passes trivially with default config containing no clusters.The
FederationConfig::default()creates a valid but non-functional configuration with an emptyclustersvector. WhileFederationApi::new()initializes successfully with no clusters (sub-components handle empty clusters gracefully), the test doesn't verify that the federation client can actually perform load balancing or service discovery. The test should either provide a valid configuration with at least one cluster or mock the dependencies to verify meaningful behavior.crates/basilica-federation/src/discovery/mod.rs-91-108 (1)
91-108: Discovery starts “empty” until the first interval tick.
start()only refreshes inside the interval loop (Lines 97-107), so/servicescan return nothing immediately after startup. Consider callingrefresh_services(...)once before entering the loop.crates/basilica-federation/src/health/mod.rs-158-187 (1)
158-187:total == 0currently reportsUnhealthy; probably should beUnknown.
As written, a cluster with zero nodes (or an empty list due to RBAC scoping quirks) is treated as Unhealthy (Line 180-186). If you want “no data” vs “bad data” separation, returnHealthStatus::Unknownwhentotal == 0.crates/basilica-federation/src/api/gateway.rs-113-125 (1)
113-125: Health endpoint always returns"status": "healthy"even on partial failures.
Sinceaggregate_health()can include unhealthy clusters, the top-level"status"(Lines 116-119) is misleading. Consider deriving overall status (e.g., healthy/degraded/unhealthy) based on per-cluster statuses + missing-data cases.
🧹 Nitpick comments (10)
crates/basilica-api/Cargo.toml (1)
9-20: Consider feature-gatingbasilica-federationto avoid forcing the full dependency graph.If federation isn’t required for all
basilica-apibuilds, make the dependencyoptional = trueand enable it via afederationfeature (then gatepub mod federation/ usages behind that feature).crates/basilica-api/src/federation.rs (1)
29-32:is_enabled()always returnstrue— consider deriving from configuration.This stub always returns
true, but federation may be disabled in certain deployments. Consider reading fromFederationConfigor checking if clusters are configured.- /// Check if federation is enabled - pub fn is_enabled(&self) -> bool { - true - } + /// Check if federation is enabled + pub fn is_enabled(&self) -> bool { + !self.api.config().enabled_clusters().is_empty() + }crates/basilica-federation/src/resource_manager/mod.rs (2)
46-63: Background task is fire-and-forget with no shutdown mechanism.The spawned task runs indefinitely with no way to cancel it or await its completion. This can cause issues during graceful shutdown or testing. Consider returning a
JoinHandleor using a cancellation token.+ use tokio::sync::watch; + use tokio::task::JoinHandle; + /// Start resource synchronization - pub async fn start(&self) { + pub fn start(&self, shutdown: watch::Receiver<bool>) -> JoinHandle<()> { let config = self.config.clone(); let clients = self.clients.clone(); - tokio::spawn(async move { + tokio::spawn(async move { let mut interval = tokio::time::interval(config.resource_manager.sync_interval); + let mut shutdown = shutdown; loop { - interval.tick().await; + tokio::select! { + _ = interval.tick() => {} + _ = shutdown.changed() => break, + } if config.resource_manager.auto_distribute { if let Err(e) = Self::sync_resources(&config, &clients).await { warn!(error = %e, "Failed to sync resources"); } } } - }); + }) }
82-104: Sequential cluster iteration — consider parallel fetching for better latency.Pod listing iterates clusters sequentially. For federations with multiple clusters, parallel fetching via
futures::future::join_allwould reduce total latency.use futures::future::join_all; let results = join_all(self.clients.iter().map(|(cluster_id, client)| { let cluster_id = cluster_id.clone(); let client = client.clone(); async move { let pods_api: Api<Pod> = Api::namespaced(client, namespace); (cluster_id, pods_api.list(&ListParams::default()).await) } })).await;crates/basilica-federation/src/lib.rs (1)
14-27: Crate public surface looks coherent; consider whether consumers need a typed version instead of&str.
If external integrations compare versions, asemver::Version(or at least afn version() -> Version) can avoid string parsing at call sites.crates/basilica-federation/src/api/mod.rs (2)
31-60: Potential double-initialization ofHealthAggregatorviaLoadBalancer::new+set_health.
GivenLoadBalancer::new(config)appears to create its ownHealthAggregator(seecrates/basilica-federation/src/load_balancer/mod.rssnippet),FederationApi::newlikely allocates two aggregators and discards one. Consider changingLoadBalancer::newto acceptArc<HealthAggregator>(or make health optional until set).
63-90: Task spawning is redundant and complicates shutdown.
ServiceDiscovery::start,HealthAggregator::start, andResourceManager::startalreadytokio::spawntheir own loops (per provided snippets), soFederationApi::startspawning another task just to call them adds noise and makes graceful shutdown/cancellation harder. Prefer callingdiscovery.start().await;directly (or makingstart()non-async), and returnJoinHandles if you want lifecycle control.crates/basilica-federation/src/health/mod.rs (1)
201-233: Avoidformat!("{:?}", health.status)as an API contract.
SinceHealthStatusalready derivesSerialize, returning the enum directly (or a stable#[serde(rename_all = "snake_case")]) avoids accidental breaking changes if variants/Debug formatting changes.crates/basilica-federation/src/discovery/mod.rs (2)
64-90: Remove duplicatedcreate_kube_clienthelper (prefercrate::utils::create_kube_client).
You already import and usecrate::utils::create_kube_client(Line 5, Line 38), so the in-moduleasync fn create_kube_client(Lines 64-89) is redundant and will drift over time.
196-217:get_serviceselection is ambiguous across clusters (and namespaces if not provided).
get_servicereturns the first match byname(Lines 204-207). With multi-cluster federation, returning all matches (or requiring namespace + returning per-cluster entries) avoids surprising routing/UX.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
Cargo.toml(1 hunks)crates/basilica-api/Cargo.toml(1 hunks)crates/basilica-api/src/federation.rs(1 hunks)crates/basilica-api/src/lib.rs(1 hunks)crates/basilica-federation/Cargo.toml(1 hunks)crates/basilica-federation/README.md(1 hunks)crates/basilica-federation/src/api/gateway.rs(1 hunks)crates/basilica-federation/src/api/mod.rs(1 hunks)crates/basilica-federation/src/config.rs(1 hunks)crates/basilica-federation/src/discovery/mod.rs(1 hunks)crates/basilica-federation/src/error.rs(1 hunks)crates/basilica-federation/src/health/mod.rs(1 hunks)crates/basilica-federation/src/lib.rs(1 hunks)crates/basilica-federation/src/load_balancer/mod.rs(1 hunks)crates/basilica-federation/src/main.rs(1 hunks)crates/basilica-federation/src/resource_manager/mod.rs(1 hunks)crates/basilica-federation/src/utils.rs(1 hunks)orchestrator/ansible/playbooks/01-setup/federation.yml(1 hunks)orchestrator/k8s/services/federation/deployment.yaml(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (8)
crates/basilica-federation/src/utils.rs (1)
crates/basilica-federation/src/discovery/mod.rs (2)
create_kube_client(65-89)new(33-62)
crates/basilica-api/src/federation.rs (1)
crates/basilica-federation/src/api/mod.rs (1)
new(31-60)
crates/basilica-federation/src/api/mod.rs (5)
crates/basilica-federation/src/discovery/mod.rs (3)
create_kube_client(65-89)new(33-62)start(92-108)crates/basilica-federation/src/api/gateway.rs (2)
new(37-59)start(96-111)crates/basilica-federation/src/health/mod.rs (2)
new(56-85)start(88-145)crates/basilica-federation/src/load_balancer/mod.rs (1)
new(28-35)crates/basilica-federation/src/resource_manager/mod.rs (2)
new(19-43)start(46-63)
crates/basilica-federation/src/discovery/mod.rs (2)
crates/basilica-federation/src/utils.rs (1)
create_kube_client(7-31)crates/basilica-federation/src/config.rs (1)
default(223-264)
crates/basilica-federation/src/load_balancer/mod.rs (5)
crates/basilica-federation/src/api/mod.rs (2)
health(103-105)new(31-60)crates/basilica-api/src/federation.rs (1)
new(15-22)crates/basilica-federation/src/api/gateway.rs (1)
new(37-59)crates/basilica-federation/src/health/mod.rs (1)
new(56-85)crates/basilica-federation/src/config.rs (1)
enabled_clusters(277-279)
crates/basilica-federation/src/api/gateway.rs (6)
crates/basilica-federation/src/api/mod.rs (7)
discovery(98-100)health(103-105)load_balancer(108-110)resource_manager(113-115)new(31-60)start(63-90)gateway(93-95)crates/basilica-federation/src/discovery/mod.rs (3)
new(33-62)get_service(197-217)start(92-108)crates/basilica-federation/src/health/mod.rs (2)
new(56-85)start(88-145)crates/basilica-federation/src/load_balancer/mod.rs (1)
new(28-35)crates/basilica-federation/src/resource_manager/mod.rs (4)
new(19-43)list_pods(76-107)get_pod(110-133)start(46-63)crates/basilica-federation/src/config.rs (1)
get_cluster(282-284)
crates/basilica-federation/src/main.rs (5)
crates/basilica-api/src/federation.rs (1)
new(15-22)crates/basilica-federation/src/api/mod.rs (1)
new(31-60)crates/basilica-federation/src/health/mod.rs (1)
new(56-85)crates/basilica-federation/src/load_balancer/mod.rs (1)
new(28-35)crates/basilica-federation/src/config.rs (2)
load(269-274)default(223-264)
crates/basilica-federation/src/health/mod.rs (4)
crates/basilica-federation/src/utils.rs (1)
create_kube_client(7-31)crates/basilica-federation/src/api/mod.rs (2)
new(31-60)health(103-105)crates/basilica-federation/src/load_balancer/mod.rs (1)
new(28-35)crates/basilica-federation/src/config.rs (1)
default(223-264)
🪛 Checkov (3.2.334)
orchestrator/k8s/services/federation/deployment.yaml
[medium] 1-76: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-76: Minimize the admission of root containers
(CKV_K8S_23)
🪛 markdownlint-cli2 (0.18.1)
crates/basilica-federation/README.md
15-15: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (7)
Cargo.toml (1)
1-22: Workspace member addition looks correct.crates/basilica-api/src/lib.rs (1)
14-29:pub mod federation;is fine—just ensure it matches the intended feature/dependency story.If you decide to feature-gate the dependency in
crates/basilica-api/Cargo.toml, also gate this module export to keep builds consistent.crates/basilica-federation/src/error.rs (1)
1-56: Well-structured error handling with comprehensive variants.The error enum properly covers all federation operation categories with clear messages. Good use of
thiserrorwith#[from]conversions for underlying error types.crates/basilica-federation/src/resource_manager/mod.rs (1)
114-127: Pod search order is non-deterministic.
HashMapiteration order is not guaranteed. If the same pod name exists in multiple clusters (unlikely but possible), the returned pod depends on iteration order. Consider documenting this behavior or accepting acluster_idparameter.crates/basilica-federation/src/load_balancer/mod.rs (1)
64-66: Fallback to first enabled cluster when all candidates are unhealthy.When
candidates.is_empty()(all clusters unhealthy), this falls back toenabled_clusters.first(). This may route traffic to an unhealthy cluster. Consider returningNoneor logging a warning.crates/basilica-federation/src/api/gateway.rs (1)
61-93:andcrates/basilica-federation/src/config.rs (1)
145-160: LGTM: clean enum-based LB algorithm config + optional sticky sessions.
The shape is straightforward to consume elsewhere and avoids “stringly-typed” algorithms.Also applies to: 161-179
epappas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code, doesn't actually create a federation, and it's not in compliance with the overall design, neither the proxy nor network design of the platform.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 17
🧹 Nitpick comments (25)
COMMIT_MESSAGE.txt (3)
3-5: Consider wrapping lines at ~72–100 chars for readability in git tooling.
Line 3-4 may display awkwardly depending on viewer; minor polish.
13-18: Implementation bullets are good; consider adding concrete integration touchpoints.
If you want this commit message to help future archaeology, add brief identifiers (e.g., crate path, key module names, or entrypoint binary) rather than only “Integration with basilica-api”.
19-24: Feature list is solid; consider clarifying “region-aware” source of truth.
If “region-aware routing” is driven by labels/annotations/config, naming it here can reduce ambiguity.crates/basilica-federation/tests/error_test.rs (3)
5-51: Strengthen Display tests and reduce duplication.The Display tests only verify substring containment, which doesn't catch formatting regressions or ensure consistent error messages across variants. Additionally, the tests follow an identical pattern that could be parameterized.
Consider:
- Asserting exact error message formats (e.g.,
"Config error: test error") rather than substring matching- Parameterizing tests to reduce duplication and improve maintainability
Example refactor using a parameterized test pattern:
#[test] fn test_federation_error_display() { let test_cases = vec![ (FederationError::Config("test error".to_string()), "Configuration error: test error"), (FederationError::ClusterNotFound("cluster-1".to_string()), "Cluster not found: cluster-1"), (FederationError::Discovery("discovery failed".to_string()), "Discovery error: discovery failed"), // ... other variants ]; for (error, expected) in test_cases { assert_eq!(format!("{}", error), expected); } }
1-70: Add tests for Error trait implementation.The test suite doesn't verify the
std::error::Errortrait implementation or error conversions. Consider adding tests for:
source()method if error chaining is supportedFrom/Intoimplementations for error conversions- Debug formatting output
Example:
#[test] fn test_error_trait_implementation() { let error = FederationError::Config("test".to_string()); // Verify it implements Error trait let _: &dyn std::error::Error = &error; } #[test] fn test_error_debug_format() { let error = FederationError::Config("test error".to_string()); let debug_output = format!("{:?}", error); assert!(debug_output.contains("Config")); assert!(debug_output.contains("test error")); }
53-69: Result type tests provide minimal coverage.These tests verify trivial
is_ok()/is_err()behavior but don't exercise meaningful error handling scenarios like error propagation with the?operator or realistic error flows.Consider either removing these tests or replacing them with integration tests that demonstrate actual error handling patterns in the federation codebase.
orchestrator/ansible/roles/federation/tasks/main.yml (1)
18-31: Config validation runs even when Kubernetes isn’t being configured (missingwhen: kubeconfig is definedparity).
If this role is meant to support “generate-only” runs, that’s fine; otherwise, consider gating both template + validation (or at least the validation) behind the same condition (or a dedicated flag likefederation_generate_config_only).orchestrator/k8s/services/federation/servicemonitor.yaml (1)
1-17: Ensure Prometheus can actually “see” this ServiceMonitor (selector labels + port name).
Two common gotchas to double-check:
- Your Prometheus’
serviceMonitorSelectormatchesmetadata.labelshere (often requires a stack-specific label likerelease: ...).- The Service exposing federation metrics has a port named
metricsand hasapp: basilica-federationlabel (so the selector matches).apiVersion: monitoring.coreos.com/v1 kind: ServiceMonitor metadata: name: basilica-federation namespace: basilica-federation labels: app: basilica-federation + # Add any label required by your Prometheus Operator serviceMonitorSelector, e.g.: + # release: kube-prometheus-stack spec: selector: matchLabels: app: basilica-federation + # Consider making the namespace selection explicit if your setup expects it: + # namespaceSelector: + # matchNames: ["basilica-federation"] endpoints: - port: metrics path: /metrics interval: 30s scrapeTimeout: 10scrates/basilica-federation/src/events.rs (1)
112-136:EventManager::start()snapshots handlers; laterregister_handler()calls won’t take effect.
If dynamic handler registration is expected,handlerslikely needs shared mutability (e.g.,Arc<RwLock<Vec<_>>>) orstart()should be called only after final registration and documented accordingly.crates/basilica-federation/examples/basic_federation.rs (2)
16-42: Consider documenting the placeholder paths.The example uses placeholder kubeconfig paths (
/path/to/kubeconfig1,/path/to/kubeconfig2) that won't work as-is. Consider adding a comment at the top of the cluster configuration section explaining that users should replace these with actual paths.Example comment:
+ // NOTE: Replace the kubeconfig paths below with actual paths to your cluster configurations config.clusters.push(basilica_federation::config::ClusterConfig {
47-49: Clarify the blocking nature ofstart().If
federation.start().await?is a long-running blocking operation (e.g., starting an HTTP server), consider adding a comment to make this clear. Users might expect the program to continue or do additional work after this call.Example:
- // Start federation + // Start federation (this will block until shutdown) println!("Starting federation..."); federation.start().await?;crates/basilica-federation/examples/load_balancing.rs (1)
24-29: Handle the None case when no cluster is selected.When
select_cluster()returnsNone(which will happen for all iterations in this example since clusters aren't configured), the loop silently continues without output. This could confuse users running the example.Consider handling the None case:
// Select cluster for i in 0..10 { if let Some(cluster) = load_balancer.select_cluster().await { println!("Request {} routed to cluster: {}", i + 1, cluster.id); + } else { + println!("Request {} could not be routed (no available clusters)", i + 1); } }crates/basilica-federation/tests/discovery_test.rs (1)
1-54: Good edge-case coverage; consider adding positive test cases.The tests appropriately validate behavior with empty clusters and basic struct creation. However, all tests currently exercise edge cases (no clusters configured). Consider adding integration tests that verify service discovery with actual cluster configurations to ensure the happy path is also covered.
crates/basilica-federation/tests/health_test.rs (1)
14-43: Simplify the health status enum test.The test uses verbose pattern matching with
assert!(true)andassert!(false), which is not idiomatic Rust. The test could be simplified significantly.Consider this simpler approach:
#[tokio::test] async fn test_health_status_enum() { use basilica_federation::health::HealthStatus; - let healthy = HealthStatus::Healthy; - let degraded = HealthStatus::Degraded; - let unhealthy = HealthStatus::Unhealthy; - let unknown = HealthStatus::Unknown; - - // Test that all variants exist - match healthy { - HealthStatus::Healthy => assert!(true), - _ => assert!(false), - } - - match degraded { - HealthStatus::Degraded => assert!(true), - _ => assert!(false), - } - - match unhealthy { - HealthStatus::Unhealthy => assert!(true), - _ => assert!(false), - } - - match unknown { - HealthStatus::Unknown => assert!(true), - _ => assert!(false), - } + // Test that all variants can be constructed + assert!(matches!(HealthStatus::Healthy, HealthStatus::Healthy)); + assert!(matches!(HealthStatus::Degraded, HealthStatus::Degraded)); + assert!(matches!(HealthStatus::Unhealthy, HealthStatus::Unhealthy)); + assert!(matches!(HealthStatus::Unknown, HealthStatus::Unknown)); }crates/basilica-federation/tests/events_test.rs (3)
5-12: Remove or improve the trivial assertion.The
assert!(true)on line 11 provides no value - it always passes. If the test is only verifying compilation, consider either removing it or adding a more meaningful assertion (e.g., checking that the receiver is not lagged).Consider:
#[test] fn test_event_publisher_creation() { let publisher = EventPublisher::new(); - let receiver = publisher.subscribe(); + let _receiver = publisher.subscribe(); - // Should be able to create publisher and subscriber - assert!(true); + // Test passes if publisher and subscriber can be created without panicking }
33-41: Simplify event matching and error handling.The nested match and
assert!(false, "...")pattern is not idiomatic. Consider usingpanic!orassert_eq!for clearer intent.Consider:
if let Ok(Ok(received_event)) = received { match (received_event, event) { (FederationEvent::ClusterHealthChanged { cluster_id: id1, .. }, FederationEvent::ClusterHealthChanged { cluster_id: id2, .. }) => { assert_eq!(id1, id2); } - _ => assert!(false, "Event types don't match"), + _ => panic!("Event types don't match"), } + } else { + panic!("Failed to receive event within timeout"); }
72-97: Remove or improve trivial assertions in manager tests.Similar to the publisher creation test, these tests contain
assert!(true)statements that provide no value. If the tests are only validating compilation, consider removing the assertions or adding meaningful checks.For test_event_manager_creation:
#[test] fn test_event_manager_creation() { let manager = EventManager::new(); - let publisher = manager.publisher(); + let _publisher = manager.publisher(); - // Should be able to get publisher - assert!(true); + // Test passes if manager and publisher accessor work without panicking }For test_event_manager_handler_registration:
let mut manager = EventManager::new(); manager.register_handler(std::sync::Arc::new(TestHandler)); - assert!(true); + // Test passes if handler registration works without panicking }crates/basilica-federation/src/cache.rs (1)
8-64: Consider documenting or enforcing default capacity.The
build()method creates an unbounded cache ifmax_capacityis not set. While moka handles this gracefully, unbounded caches can lead to memory issues in long-running services. Consider either:
- Documenting this behavior in the
build()method- Setting a sensible default capacity
/// Build the cache + /// + /// Note: If max_capacity is not set, the cache will be unbounded. pub fn build(self) -> Cache<K, V> {crates/basilica-federation/src/api/mod.rs (2)
16-16: Remove unused import.
create_kube_clientis imported but not used in this file. The Kubernetes client creation is handled within the individual component constructors.-use crate::utils::create_kube_client;
74-84: Fire-and-forget spawns lose task handles.The background tasks are spawned without retaining their
JoinHandles. If any task panics, the failure will go unnoticed until downstream effects manifest. Consider:
- Storing handles in the struct for graceful shutdown
- Using
tokio::spawnwith panic handling orcatch_unwind- Adding a shutdown mechanism
+use tokio::task::JoinHandle; + pub struct FederationApi { config: Arc<FederationConfig>, gateway: Arc<FederationGateway>, discovery: Arc<ServiceDiscovery>, health: Arc<HealthAggregator>, load_balancer: Arc<LoadBalancer>, resource_manager: Arc<ResourceManager>, + task_handles: Vec<JoinHandle<()>>, }Then store the handles from
tokio::spawnand join them on shutdown.crates/basilica-federation/src/retry.rs (1)
50-99: Consider using saturating arithmetic for backoff calculation.The conversion chain
delay.as_millis() as f64 * multiplier as u64can lose precision for large delays. Whilemax_delaycaps the result, using saturating operations is more defensive.- delay = Duration::from_millis( - (delay.as_millis() as f64 * self.config.backoff_multiplier) as u64 - ).min(self.config.max_delay); + let next_millis = delay + .as_millis() + .saturating_mul(self.config.backoff_multiplier as u128) + .min(self.config.max_delay.as_millis()); + delay = Duration::from_millis(next_millis as u64);Alternatively, keep the f64 approach but ensure clamping before the cast.
crates/basilica-federation/src/metrics.rs (1)
190-194: Default implementation panics on failure.
expect()inDefault::default()will panic if metric creation fails. For a metrics subsystem, this is aggressive—consider returning a fallback no-op implementation or making initialization fallible at the call site.crates/basilica-federation/src/api/handler.rs (1)
102-127: Consider extracting the service-to-JSON mapping to reduce duplication.The mapping closure is duplicated in both branches. Extract it for DRY.
+ let to_json = |s: crate::discovery::ServiceInfo| serde_json::json!({ + "name": s.name, + "namespace": s.namespace, + "cluster_id": s.cluster_id, + "endpoints": s.endpoints, + "labels": s.labels, + }); + let filtered_services: Vec<_> = if let Some(filter_labels) = labels { services.into_iter() .filter(|s| { filter_labels.iter().all(|(k, v)| { s.labels.get(k).map(|sv| sv == v).unwrap_or(false) }) }) - .map(|s| serde_json::json!({ - "name": s.name, - "namespace": s.namespace, - "cluster_id": s.cluster_id, - "endpoints": s.endpoints, - "labels": s.labels, - })) + .map(to_json) .collect() } else { - services.into_iter() - .map(|s| serde_json::json!({ - "name": s.name, - "namespace": s.namespace, - "cluster_id": s.cluster_id, - "endpoints": s.endpoints, - "labels": s.labels, - })) - .collect() + services.into_iter().map(to_json).collect() };crates/basilica-federation/src/api/router.rs (2)
87-96: Consider storingFederationHandlerinGatewayStateto avoid repeated instantiation.
FederationHandler::new()is called in nearly every handler, creating a new instance per request. Since all constructor arguments areArc-wrapped and come fromGatewayState, this is cheap but unnecessary. Storing a pre-constructedFederationHandlerinGatewayStatewould be cleaner and more efficient.Example refactor for
GatewayState:pub struct GatewayState { // ... existing fields ... pub handler: Arc<FederationHandler>, }Then handlers simplify to:
async fn statistics(State(state): State<GatewayState>) -> Result<Json<serde_json::Value>, StatusCode> { - let handler = FederationHandler::new( - state.config.clone(), - state.discovery.clone(), - state.health.clone(), - state.load_balancer.clone(), - state.resource_manager.clone(), - ); - handler.handle_statistics().await.map_err(|_| StatusCode::INTERNAL_SERVER_ERROR) + state.handler.handle_statistics().await.map_err(|_| StatusCode::INTERNAL_SERVER_ERROR) }
166-177: Error handling conflates different failure modes.Mapping all errors to
NOT_FOUND(line 175) loses distinction between "service doesn't exist" and "internal error contacting the discovery service." Consider differentiating error types to return appropriate status codes.async fn get_service( State(state): State<GatewayState>, Path(service_name): Path<String>, Query(params): Query<HashMap<String, String>>, ) -> Result<Json<serde_json::Value>, StatusCode> { let namespace = params.get("namespace"); match state.discovery.get_service(&service_name, namespace).await { Ok(service) => Ok(Json(service)), - Err(_) => Err(StatusCode::NOT_FOUND), + Err(e) if e.is_not_found() => Err(StatusCode::NOT_FOUND), + Err(_) => Err(StatusCode::INTERNAL_SERVER_ERROR), } }This depends on whether the error type from
discovery.get_service()exposes error categorization.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
COMMIT_MESSAGE.txt(1 hunks)crates/basilica-federation/docs/ARCHITECTURE.md(1 hunks)crates/basilica-federation/docs/DEPLOYMENT.md(1 hunks)crates/basilica-federation/docs/TROUBLESHOOTING.md(1 hunks)crates/basilica-federation/examples/basic_federation.rs(1 hunks)crates/basilica-federation/examples/load_balancing.rs(1 hunks)crates/basilica-federation/examples/service_discovery.rs(1 hunks)crates/basilica-federation/src/api/handler.rs(1 hunks)crates/basilica-federation/src/api/mod.rs(1 hunks)crates/basilica-federation/src/api/router.rs(1 hunks)crates/basilica-federation/src/cache.rs(1 hunks)crates/basilica-federation/src/events.rs(1 hunks)crates/basilica-federation/src/lib.rs(1 hunks)crates/basilica-federation/src/metrics.rs(1 hunks)crates/basilica-federation/src/retry.rs(1 hunks)crates/basilica-federation/tests/discovery_test.rs(1 hunks)crates/basilica-federation/tests/error_test.rs(1 hunks)crates/basilica-federation/tests/events_test.rs(1 hunks)crates/basilica-federation/tests/health_test.rs(1 hunks)crates/basilica-federation/tests/integration_test.rs(1 hunks)crates/basilica-federation/tests/load_balancer_test.rs(1 hunks)crates/basilica-federation/tests/metrics_test.rs(1 hunks)orchestrator/ansible/roles/federation/tasks/main.yml(1 hunks)orchestrator/k8s/services/federation/prometheusrule.yaml(1 hunks)orchestrator/k8s/services/federation/servicemonitor.yaml(1 hunks)
✅ Files skipped from review due to trivial changes (2)
- crates/basilica-federation/docs/TROUBLESHOOTING.md
- crates/basilica-federation/docs/DEPLOYMENT.md
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/basilica-federation/src/lib.rs
🧰 Additional context used
🧬 Code graph analysis (12)
crates/basilica-federation/examples/load_balancing.rs (2)
crates/basilica-federation/src/api/mod.rs (2)
load_balancer(108-110)new(31-60)crates/basilica-federation/examples/basic_federation.rs (1)
main(7-52)
crates/basilica-federation/tests/discovery_test.rs (2)
crates/basilica-federation/src/api/mod.rs (2)
discovery(98-100)new(31-60)crates/basilica-federation/src/api/handler.rs (1)
new(29-43)
crates/basilica-federation/tests/events_test.rs (1)
crates/basilica-federation/src/events.rs (4)
publisher(118-120)new(65-68)new(105-110)handle(94-94)
crates/basilica-federation/examples/service_discovery.rs (1)
crates/basilica-federation/src/api/mod.rs (2)
discovery(98-100)new(31-60)
crates/basilica-federation/src/api/mod.rs (9)
crates/basilica-federation/src/api/router.rs (1)
health(57-65)crates/basilica-federation/src/discovery/mod.rs (3)
create_kube_client(65-89)new(33-62)start(92-108)crates/basilica-federation/src/utils.rs (1)
create_kube_client(7-31)crates/basilica-federation/src/api/handler.rs (1)
new(29-43)crates/basilica-federation/src/metrics.rs (1)
new(42-179)crates/basilica-api/src/federation.rs (1)
new(15-22)crates/basilica-federation/src/health/mod.rs (2)
new(56-85)start(88-145)crates/basilica-federation/src/api/gateway.rs (2)
new(37-59)start(96-111)crates/basilica-federation/src/resource_manager/mod.rs (2)
new(19-43)start(46-63)
crates/basilica-federation/examples/basic_federation.rs (1)
crates/basilica-federation/src/api/mod.rs (1)
new(31-60)
crates/basilica-federation/tests/load_balancer_test.rs (1)
crates/basilica-federation/src/api/mod.rs (2)
load_balancer(108-110)new(31-60)
crates/basilica-federation/tests/health_test.rs (3)
crates/basilica-federation/src/api/mod.rs (2)
health(103-105)new(31-60)crates/basilica-federation/src/api/router.rs (1)
health(57-65)crates/basilica-federation/src/api/handler.rs (1)
new(29-43)
crates/basilica-federation/src/events.rs (1)
crates/basilica-federation/tests/events_test.rs (1)
handle(88-90)
crates/basilica-federation/tests/integration_test.rs (1)
crates/basilica-federation/src/config.rs (1)
enabled_clusters(277-279)
crates/basilica-federation/src/api/router.rs (4)
crates/basilica-federation/src/api/mod.rs (3)
gateway(93-95)new(31-60)health(103-105)crates/basilica-federation/src/api/handler.rs (1)
new(29-43)crates/basilica-federation/src/metrics.rs (1)
new(42-179)crates/basilica-federation/src/config.rs (1)
enabled_clusters(277-279)
crates/basilica-federation/src/api/handler.rs (2)
crates/basilica-federation/src/api/mod.rs (5)
discovery(98-100)health(103-105)load_balancer(108-110)resource_manager(113-115)new(31-60)crates/basilica-federation/src/api/router.rs (1)
health(57-65)
🪛 Gitleaks (8.30.0)
orchestrator/ansible/roles/federation/tasks/main.yml
[high] 40-49: Possible Kubernetes Secret detected, posing a risk of leaking credentials/tokens from your deployments
(kubernetes-secret-yaml)
🪛 LanguageTool
crates/basilica-federation/docs/ARCHITECTURE.md
[uncategorized] ~91-~91: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ... - Secure inter-cluster communication - Rate limiting - Request validation ## Monitoring Pr...
(EN_COMPOUND_ADJECTIVE_INTERNAL)
🪛 markdownlint-cli2 (0.18.1)
crates/basilica-federation/docs/ARCHITECTURE.md
59-59: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (17)
COMMIT_MESSAGE.txt (2)
1-1: Subject line is clear and conventional-commit friendly.
Reads well asfeat(federation): ....
10-10: Avoid hard-coded quantitative claims in commit messages unless they will be kept up-to-date.Hard-coded numbers like "(5 algorithms)" and "(~1,655 lines Rust)" become stale when code changes. Either remove the numbers entirely or replace them with concrete, verifiable details (e.g., list the algorithm names instead of a count, or reference the actual crate structure). If your commit message makes specific claims, ensure they remain accurate as the codebase evolves.
Also applies to: 14-14
crates/basilica-federation/tests/integration_test.rs (1)
1-138: Excellent configuration test coverage!The integration tests provide comprehensive coverage of the
FederationConfigsurface, including cluster management, enabled cluster filtering, and all subsystem defaults. The tests are well-structured and verify both positive cases and edge cases (like nonexistent cluster retrieval).crates/basilica-federation/tests/health_test.rs (1)
6-12: Good test coverage for health aggregation.The tests appropriately validate HealthAggregator creation, NodeHealth structure, and edge cases (empty clusters, nonexistent clusters). The test structure is solid and uses
tokio::testappropriately for async operations.Also applies to: 45-76
crates/basilica-federation/tests/load_balancer_test.rs (1)
1-74: Comprehensive load balancer test coverage!The tests validate all five load balancing algorithms mentioned in the PR objectives (RoundRobin, LeastConnections, WeightedRoundRobin, Random, Geographic) and verify behavior with health-aware and region-aware configurations. The test structure is solid and appropriate.
crates/basilica-federation/tests/events_test.rs (1)
44-70: Good validation of event variants.The test properly validates that all three event types can be created and matched. The use of
matches!macro is appropriate and idiomatic.crates/basilica-federation/src/cache.rs (2)
1-6: LGTM!Module documentation and imports are appropriate. The use of
moka::future::Cacheis a good choice for async-compatible caching.
66-96: LGTM!The
Defaultimplementation and helper constructors are well-designed. The capacity choices (10,000 for services, 100 for health) are reasonable defaults for their respective use cases.crates/basilica-federation/src/api/mod.rs (1)
92-115: LGTM!Accessor methods are straightforward and provide read-only access to the Arc-wrapped components.
crates/basilica-federation/src/retry.rs (3)
7-36: LGTM!
RetryConfighas sensible defaults: 3 attempts, 100ms initial delay, 5s max delay, 2x multiplier. The struct is well-documented.
101-113: LGTM!The
is_retryableimplementation correctly categorizes transient errors as retryable and uses a wildcard for unknown variants, ensuring forward compatibility.
116-124: LGTM!The convenience
retryfunction provides a clean API for one-off retry operations.crates/basilica-federation/src/api/handler.rs (3)
27-43: LGTM!Constructor properly initializes the handler with all required dependencies.
46-65: LGTM!The cluster listing implementation is straightforward and returns appropriate metadata.
68-92: LGTM!Good error handling with appropriate status codes. The fallback for missing health data is reasonable.
crates/basilica-federation/src/api/router.rs (2)
15-54: Router structure looks well-organized.The routing setup follows standard Axum patterns with clear grouping of endpoints by domain (health, clusters, services, resources, load balancer, metrics). The use of a unit struct with associated functions is a clean approach for organizing the handlers.
Note: The
Arcimport on line 13 appears unused sinceGatewayStatefields are alreadyArc-wrapped. Consider removing it if not needed elsewhere.
201-242: Resource endpoints follow consistent patterns.The resource listing and pod retrieval handlers are consistent with the established patterns. The same error handling note from the service endpoints applies here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Nitpick comments (3)
crates/basilica-federation/src/api/gateway.rs (3)
161-162: Consider logging health fetch failures.
unwrap_or_default()silently discards errors when fetching cluster health. While returning a default may be appropriate for graceful degradation, consider logging the error to aid debugging and monitoring.Apply this diff to log the error:
let health = state.health.get_cluster_health(&cluster_id).await - .unwrap_or_default(); + .unwrap_or_else(|e| { + warn!(cluster_id = %cluster_id, error = %e, "Failed to fetch cluster health"); + Default::default() + });
302-318: Consider using reqwest's built-in query parameter handling.The current implementation manually encodes and concatenates query parameters to the URL string. While functional, this approach is more error-prone and verbose than using reqwest's built-in
.query()method, which handles encoding automatically.Apply this diff to use reqwest's query handling:
- // Build target URL with query parameters - let mut target_url = format!("{}/{}", cluster.api_server, path); - if !params.is_empty() { - let query_string: Vec<String> = params - .iter() - .map(|(k, v)| format!("{}={}", urlencoding::encode(k), urlencoding::encode(v))) - .collect(); - target_url.push('?'); - target_url.push_str(&query_string.join("&")); - } - - // Parse as reqwest::Url (not http::Uri) - let url: reqwest::Url = target_url.parse() + // Build target URL + let target_url = format!("{}/{}", cluster.api_server, path); + let url: reqwest::Url = target_url.parse() .map_err(|e| { error!(error = %e, "Invalid target URL"); StatusCode::BAD_REQUEST })?;Then pass
paramsto the request builder at line 333:- let mut request_builder = state.http_client.request(reqwest_method, url); + let mut request_builder = state.http_client.request(reqwest_method, url) + .query(¶ms);
416-419: Implement actual metrics collection.The metrics handler currently returns a placeholder string with no actual metrics. For production use, this endpoint should expose Prometheus-formatted metrics such as request counts, latencies, error rates, cluster health status, and proxy success/failure rates.
Consider integrating a metrics library like
prometheusormetricsto track:
- HTTP request counts by method, path, and status code
- Request duration histograms
- Proxy request success/failure rates by target cluster
- Cluster health check results
- Active connections
Do you want me to generate a more complete metrics implementation or open an issue to track this?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
crates/basilica-federation/Cargo.toml(1 hunks)crates/basilica-federation/src/api/gateway.rs(1 hunks)crates/basilica-federation/src/config.rs(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- crates/basilica-federation/src/config.rs
🔇 Additional comments (1)
crates/basilica-federation/Cargo.toml (1)
13-46: No action needed. The workspace dependencies are correctly pinned and compatible with thehttp = "1.1"version used in this crate.
|
@epappas Thanks for the feedback - you're right that the current implementation doesn't align with the platform's architecture (Envoy/WireGuard/Operator patterns). I've fixed the security issues, but I need clarification on what federation should actually do and how it should integrate with the existing infrastructure before redesigning it properly. |
You started federation by your own initiative and you're asking me to what it should be done for it? |
|
closing since this is an AI gen work, and the author didn't provide any additional reasoning behind the change or the intended plan. |
Adds multi-cluster federation system for managing multiple K3s clusters with geographic distribution and high availability.
Features
Implementation
basilica-federationcrateAPI Endpoints
GET /health- Federation healthGET /clusters- List clustersGET /services- List services across clustersGET /proxy/*path- Proxy requestsGET /metrics- Prometheus metricsContribution by Gittensor, learn more at https://gittensor.io/
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.