Rental Orphan Garbage Collection System Architecture
Version: 1.0
Date: 2025-10-29
Status: Design Document
Author: System Architecture Team
Classification: CRITICAL - Financial System Component
Executive Summary
This document specifies the architecture for an orphan rental garbage collection system within the Basilica validator. This is a financial-critical feature that directly affects miner payouts and billing accuracy. The system must operate with extreme reliability, conservative decision-making, and comprehensive audit logging.
Key Principles:
- Safety First: Never delete rentals with active billing or recent telemetry
- Conservative: When in doubt, preserve the rental record
- Auditable: Log all decisions and actions comprehensively
- Non-disruptive: Operate independently without affecting active rentals
- Simple: Follow KISS, DRY, and SOLID principles
1. Problem Statement
1.1 Current Issues
The validator rental system can create orphaned rentals in several scenarios:
- Database Orphans: Rental records in database without corresponding containers
  - Container deployment failures that don't update state
  - Manual container deletion bypassing rental manager
  - SSH connection failures preventing container verification
- Container Orphans: Running containers without rental records
  - Database corruption or data loss
  - Manual rental record deletion
  - Race conditions during deployment
- Stuck State Rentals: Rentals in non-terminal states indefinitely
  - Rentals stuck in Provisioning after deployment failures
  - Rentals stuck in Stopping after termination failures
  - Rentals in Active with dead containers
- Stale References: Rentals referencing deleted miners or nodes
  - Miner deregistration without rental cleanup
  - Node removal without rental termination
1.2 Financial Impact
Critical Considerations:
- Orphaned active rentals may generate incorrect billing telemetry
- Stuck rentals block node availability for legitimate rentals
- Missing container cleanup wastes miner resources
- Incorrect state transitions affect miner scoring and rewards
- Billing service requires accurate rental lifecycle events
1.3 Success Criteria
- Accuracy: 100% correct identification of orphan rentals (zero false positives)
- Billing Safety: Never interfere with rentals having billing activity in last 48 hours
- Performance: Minimal impact on validator operations (<1% CPU/memory)
- Auditability: Complete log trail for all cleanup actions
- Reliability: Graceful handling of SSH failures and network issues
2. System Architecture
2.1 Component Overview
┌─────────────────────────────────────────────────────────────┐
│ RentalManager │
│ ┌────────────────┐ ┌────────────────┐ ┌───────────────┐ │
│ │ Health Monitor │ │ Billing Monitor│ │ GC Monitor │ │
│ │ (30s loop) │ │ (60s loop) │ │ (3600s loop) │ │
│ └────────────────┘ └────────────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
┌───────▼────────┐ ┌────────▼─────────┐ ┌───────▼──────┐
│ Persistence │ │ ContainerClient │ │ Metrics │
│ (SQLite) │ │ (SSH + Docker) │ │ (Prometheus) │
└────────────────┘ └──────────────────┘ └──────────────┘
2.2 New Component: RentalGarbageCollector
File: crates/basilica-validator/src/rental/garbage_collector.rs
Responsibilities:
- Periodically scan for orphaned rental records
- Verify container states against database records
- Safely transition rentals to terminal states
- Clean up container resources when appropriate
- Record metrics and audit logs
- Respect billing integrity constraints
Non-Responsibilities:
- Does NOT handle billing finalization (handled by billing service)
- Does NOT make payment decisions
- Does NOT interfere with health monitoring
- Does NOT handle active rental operations
2.3 Architecture Pattern
Follows established monitoring pattern:
pub struct RentalGarbageCollector {
persistence: Arc<SimplePersistence>,
ssh_key_manager: Arc<ValidatorSshKeyManager>,
metrics: Arc<ValidatorPrometheusMetrics>,
config: GarbageCollectorConfig,
cancellation_token: CancellationToken,
}
Key Design Decisions:
- Independent Loop: Separate from health and billing monitors
- Read-Heavy: Primarily reads database and container state
- Conservative Updates: Only updates obviously orphaned rentals
- Idempotent: Can run multiple times without adverse effects
- Fail-Safe: Errors in processing one rental don't affect others
3. Orphan Rental Criteria
3.1 Classification System
Orphan rentals are classified into three risk categories:
Category A: Safe to Clean (Low Risk)
Rentals meeting ALL criteria:
- State is Failed or Stopped (terminal states)
- Age > 7 days since last update
- No billing telemetry in last 48 hours
- Container verified as non-existent or stopped
Action: Delete from database after audit logging
Category B: Needs Termination (Medium Risk)
Rentals meeting ANY of the following state criteria, with no billing telemetry in the last 2 hours:
- State is Provisioning AND age > 30 minutes
- State is Stopping AND age > 15 minutes
- State is Active AND container verified as not running
Action: Transition to Failed state, attempt container cleanup
Category C: Needs Investigation (High Risk)
Rentals meeting ANY of these criteria:
- References non-existent miner_id or node_id
- Missing required fields (container_id, ssh_credentials)
- Database record corruption
Action: Log warning, transition to Failed state, manual review required
Category X: Never Touch (Protected)
Rentals with ANY of these characteristics:
- Billing telemetry in last 48 hours
- Age < 5 minutes (grace period for deployment)
- State is Active and container is running
- Currently being processed by health monitor (see the coordination sketch below)
Action: Skip entirely, preserve existing state
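The health-monitor criterion implies explicit coordination between the two loops, which the current design does not define. A minimal sketch of one way to do it, assuming a small shared guard is created by RentalManager and handed to both the health monitor and the GC (the InFlightRentals type and its wiring are assumptions, not existing code):
use std::collections::HashSet;
use std::sync::Arc;
use tokio::sync::Mutex;

/// Hypothetical shared guard tracking rentals the health monitor is
/// currently processing. The GC treats any marked rental as Protected.
#[derive(Clone, Default)]
pub struct InFlightRentals {
    inner: Arc<Mutex<HashSet<String>>>,
}

impl InFlightRentals {
    /// Health monitor marks a rental before processing it.
    pub async fn mark(&self, rental_id: &str) {
        self.inner.lock().await.insert(rental_id.to_string());
    }

    /// Health monitor clears the mark when processing finishes.
    pub async fn unmark(&self, rental_id: &str) {
        self.inner.lock().await.remove(rental_id);
    }

    /// GC checks this at the top of classify_rental and returns Protected
    /// for any rental that is currently marked.
    pub async fn contains(&self, rental_id: &str) -> bool {
        self.inner.lock().await.contains(rental_id)
    }
}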
3.2 Orphan Detection Algorithm
async fn classify_rental(&self, rental: &RentalInfo) -> Result<OrphanCategory> {
// CRITICAL: Never touch recent rentals
if rental.age() < Duration::seconds(300) {
return Ok(OrphanCategory::Protected);
}
// CRITICAL: Check billing telemetry
if self.has_recent_billing_activity(&rental.rental_id).await? {
return Ok(OrphanCategory::Protected);
}
// Verify container state
let container_state = self.verify_container_state(rental).await;
Ok(match (rental.state.clone(), container_state) {
// Terminal states old enough to delete
(RentalState::Failed | RentalState::Stopped, ContainerState::NotFound)
if rental.age() > Duration::days(7) => {
OrphanCategory::SafeToClean
}
// Stuck in provisioning
(RentalState::Provisioning, _)
if rental.age() > Duration::seconds(1800) => {
OrphanCategory::NeedsTermination
}
// Stuck in stopping
(RentalState::Stopping, _)
if rental.age() > Duration::seconds(900) => {
OrphanCategory::NeedsTermination
}
// Active but container dead
(RentalState::Active, ContainerState::NotRunning | ContainerState::NotFound) => {
OrphanCategory::NeedsTermination
}
// Data integrity issues
_ if rental.has_data_integrity_issues() => {
OrphanCategory::NeedsInvestigation
}
// Everything else is protected
_ => OrphanCategory::Protected,
})
}
3.3 Container State Verification
enum ContainerState {
Running, // Container exists and is running
NotRunning, // Container exists but stopped
NotFound, // Container does not exist
Unknown, // Cannot determine (SSH failure, etc.)
}
async fn verify_container_state(&self, rental: &RentalInfo) -> ContainerState {
// Get validator's SSH key
let validator_key_path = match self.ssh_key_manager.get_persistent_key() {
Some((_, path)) => path,
None => return ContainerState::Unknown,
};
// Create container client
let client = match ContainerClient::new(
rental.ssh_credentials.clone(),
Some(validator_key_path),
) {
Ok(c) => c,
Err(e) => {
warn!("Failed to create container client: {}", e);
return ContainerState::Unknown;
}
};
// Check container status with timeout
match tokio::time::timeout(
Duration::from_secs(10),
client.get_container_status(&rental.container_id)
).await {
Ok(Ok(status)) => {
if status.state == "running" {
ContainerState::Running
} else {
ContainerState::NotRunning
}
}
Ok(Err(_)) => ContainerState::NotFound,
Err(_) => ContainerState::Unknown,
}
}
4. Billing Safety System
4.1 Billing Activity Detection
Critical Requirement: Never clean rentals with recent billing activity.
async fn has_recent_billing_activity(&self, rental_id: &str) -> Result<bool> {
// Check if billing telemetry was collected recently
// This requires adding last_telemetry_at field to RentalInfo
// or querying billing service directly
let rental = self.persistence.load_rental(rental_id).await?;
if let Some(last_telemetry) = rental.last_telemetry_at {
let age = Utc::now() - last_telemetry;
if age < Duration::hours(48) {
return Ok(true);
}
}
Ok(false)
}
4.2 Safe State Transitions
All state transitions must preserve billing integrity:
async fn transition_to_failed(&self, rental: &RentalInfo) -> Result<()> {
// CRITICAL: Only transition if no recent billing
if self.has_recent_billing_activity(&rental.rental_id).await? {
warn!(
"Refusing to transition rental {} with recent billing activity",
rental.rental_id
);
return Ok(());
}
let mut updated = rental.clone();
updated.state = RentalState::Failed;
updated.updated_at = Some(Utc::now());
updated.terminated_at = Some(Utc::now());
updated.termination_reason = Some("Garbage collected: orphaned rental".to_string());
// Audit log BEFORE state change
info!(
"GC: Transitioning rental {} from {:?} to Failed (age: {:?}, reason: orphaned)",
rental.rental_id,
rental.state,
rental.age()
);
self.persistence.save_rental(&updated).await?;
// Clear metrics
self.clear_rental_metrics(rental);
Ok(())
}
4.3 Container Cleanup Protocol
async fn cleanup_container(&self, rental: &RentalInfo) -> Result<()> {
let validator_key_path = self.ssh_key_manager
.get_persistent_key()
.ok_or_else(|| anyhow!("No SSH key"))?
.1;
let client = ContainerClient::new(
rental.ssh_credentials.clone(),
Some(validator_key_path),
)?;
// Try graceful stop first
match tokio::time::timeout(
Duration::from_secs(30),
client.stop_container(&rental.container_id, false)
).await {
Ok(Ok(_)) => {
info!("GC: Gracefully stopped container {}", rental.container_id);
}
Ok(Err(e)) => {
warn!("GC: Failed to stop container gracefully: {}", e);
}
Err(_) => {
warn!("GC: Timeout stopping container, attempting force stop");
// Force stop on timeout
let _ = client.stop_container(&rental.container_id, true).await;
}
}
// Remove container
match client.remove_container(&rental.container_id).await {
Ok(_) => {
info!("GC: Removed container {}", rental.container_id);
}
Err(e) => {
warn!("GC: Failed to remove container: {}", e);
}
}
Ok(())
}
5. Configuration Specification
5.1 Configuration Structure
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct GarbageCollectorConfig {
/// Enable garbage collection
#[serde(default = "default_enabled")]
pub enabled: bool,
/// Scan interval in seconds (default: 1 hour)
#[serde(default = "default_scan_interval_secs")]
pub scan_interval_secs: u64,
/// Minimum age before considering rental for cleanup (seconds)
#[serde(default = "default_min_age_secs")]
pub min_age_secs: u64,
/// Age threshold for stuck provisioning rentals (seconds)
#[serde(default = "default_provisioning_timeout_secs")]
pub provisioning_timeout_secs: u64,
/// Age threshold for stuck stopping rentals (seconds)
#[serde(default = "default_stopping_timeout_secs")]
pub stopping_timeout_secs: u64,
/// Age threshold for deleting terminal state rentals (days)
#[serde(default = "default_terminal_retention_days")]
pub terminal_retention_days: u32,
/// Billing safety window - never touch rentals with telemetry in this window (hours)
#[serde(default = "default_billing_safety_window_hours")]
pub billing_safety_window_hours: u32,
/// Container verification timeout (seconds)
#[serde(default = "default_container_check_timeout_secs")]
pub container_check_timeout_secs: u64,
/// Maximum rentals to process per scan
#[serde(default = "default_max_batch_size")]
pub max_batch_size: usize,
/// Enable dry-run mode (log actions without executing)
#[serde(default = "default_dry_run")]
pub dry_run: bool,
}
// Default values
fn default_enabled() -> bool { true }
fn default_scan_interval_secs() -> u64 { 3600 } // 1 hour
fn default_min_age_secs() -> u64 { 300 } // 5 minutes grace period
fn default_provisioning_timeout_secs() -> u64 { 1800 } // 30 minutes
fn default_stopping_timeout_secs() -> u64 { 900 } // 15 minutes
fn default_terminal_retention_days() -> u32 { 7 } // 1 week
fn default_billing_safety_window_hours() -> u32 { 48 } // 2 days
fn default_container_check_timeout_secs() -> u64 { 10 }
fn default_max_batch_size() -> usize { 100 }
fn default_dry_run() -> bool { false }
5.2 TOML Configuration Example
[garbage_collector]
enabled = true
scan_interval_secs = 3600
min_age_secs = 300
provisioning_timeout_secs = 1800
stopping_timeout_secs = 900
terminal_retention_days = 7
billing_safety_window_hours = 48
container_check_timeout_secs = 10
max_batch_size = 100
dry_run = false
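For reference, a minimal sketch of deserializing this section into the 5.1 struct, assuming the validator reads its configuration as TOML via the toml crate; the GcConfigFile wrapper and parse_gc_section function are illustrative names, not existing code:
use serde::Deserialize;

/// Hypothetical wrapper for just the [garbage_collector] table.
#[derive(Debug, Deserialize)]
struct GcConfigFile {
    garbage_collector: GarbageCollectorConfig,
}

fn parse_gc_section(text: &str) -> Result<GarbageCollectorConfig, toml::de::Error> {
    let parsed: GcConfigFile = toml::from_str(text)?;
    // The caller should run GarbageCollectorConfig::validate() (section 5.3)
    // before constructing the collector.
    Ok(parsed.garbage_collector)
}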
5.3 Configuration Validation
impl GarbageCollectorConfig {
pub fn validate(&self) -> Result<(), ConfigurationError> {
if self.scan_interval_secs == 0 {
return Err(ConfigurationError::InvalidValue {
key: "garbage_collector.scan_interval_secs".to_string(),
value: "0".to_string(),
reason: "Scan interval must be greater than 0".to_string(),
});
}
if self.min_age_secs < 60 {
return Err(ConfigurationError::InvalidValue {
key: "garbage_collector.min_age_secs".to_string(),
value: self.min_age_secs.to_string(),
reason: "Minimum age must be at least 60 seconds".to_string(),
});
}
if self.billing_safety_window_hours < 24 {
return Err(ConfigurationError::InvalidValue {
key: "garbage_collector.billing_safety_window_hours".to_string(),
value: self.billing_safety_window_hours.to_string(),
reason: "Billing safety window must be at least 24 hours".to_string(),
});
}
Ok(())
}
}
6. Implementation Specification
6.1 Module Structure
crates/basilica-validator/src/rental/
├── mod.rs (RentalManager with GC integration)
├── garbage_collector.rs (NEW: Core GC implementation)
├── monitoring.rs (Existing: Health monitoring)
├── billing.rs (Existing: Billing telemetry)
├── types.rs (Updated: Add OrphanCategory, last_telemetry_at)
├── container_client.rs (Existing: Container operations)
└── deployment.rs (Existing: Container deployment)
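The new module also needs to be declared and re-exported from rental/mod.rs; a small sketch of the likely additions (the exact re-export list is an assumption):
// In crates/basilica-validator/src/rental/mod.rs (additions)
pub mod garbage_collector;

pub use garbage_collector::{GarbageCollectorConfig, OrphanCategory, RentalGarbageCollector};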
6.2 Core Implementation
File: garbage_collector.rs
use anyhow::{Context, Result};
use chrono::{Duration, Utc};
use std::sync::Arc;
use tokio::time::interval;
use tokio_util::sync::CancellationToken;
use tracing::{debug, error, info, warn};
use super::container_client::ContainerClient;
use super::types::{RentalInfo, RentalState};
use crate::metrics::ValidatorPrometheusMetrics;
use crate::persistence::SimplePersistence;
use crate::ssh::ValidatorSshKeyManager;
/// Orphan rental categories
#[derive(Debug, Clone, PartialEq)]
pub enum OrphanCategory {
/// Safe to delete - terminal state, old, no billing activity
SafeToClean,
/// Needs state transition to Failed and container cleanup
NeedsTermination,
/// Data integrity issues requiring manual investigation
NeedsInvestigation,
/// Protected - never touch
Protected,
}
/// Container state as determined by verification
#[derive(Debug, Clone, PartialEq)]
pub enum ContainerState {
Running,
NotRunning,
NotFound,
Unknown,
}
/// Configuration for garbage collector
#[derive(Debug, Clone)]
pub struct GarbageCollectorConfig {
pub enabled: bool,
pub scan_interval_secs: u64,
pub min_age_secs: u64,
pub provisioning_timeout_secs: u64,
pub stopping_timeout_secs: u64,
pub terminal_retention_days: u32,
pub billing_safety_window_hours: u32,
pub container_check_timeout_secs: u64,
pub max_batch_size: usize,
pub dry_run: bool,
}
impl Default for GarbageCollectorConfig {
fn default() -> Self {
Self {
enabled: true,
scan_interval_secs: 3600,
min_age_secs: 300,
provisioning_timeout_secs: 1800,
stopping_timeout_secs: 900,
terminal_retention_days: 7,
billing_safety_window_hours: 48,
container_check_timeout_secs: 10,
max_batch_size: 100,
dry_run: false,
}
}
}
/// Rental garbage collector
#[derive(Clone)]
pub struct RentalGarbageCollector {
persistence: Arc<SimplePersistence>,
ssh_key_manager: Arc<ValidatorSshKeyManager>,
metrics: Arc<ValidatorPrometheusMetrics>,
config: GarbageCollectorConfig,
cancellation_token: CancellationToken,
}
impl RentalGarbageCollector {
/// Create new garbage collector
pub fn new(
persistence: Arc<SimplePersistence>,
ssh_key_manager: Arc<ValidatorSshKeyManager>,
metrics: Arc<ValidatorPrometheusMetrics>,
config: GarbageCollectorConfig,
) -> Self {
Self {
persistence,
ssh_key_manager,
metrics,
config,
cancellation_token: CancellationToken::new(),
}
}
/// Start garbage collection loop
pub fn start(&self) {
let collector = self.clone();
tokio::spawn(async move {
collector.collection_loop().await;
});
}
/// Stop garbage collection
pub fn stop(&self) {
self.cancellation_token.cancel();
}
/// Main collection loop
async fn collection_loop(&self) {
let mut scan_interval = interval(
std::time::Duration::from_secs(self.config.scan_interval_secs)
);
info!("Rental garbage collector started (dry_run: {})", self.config.dry_run);
loop {
tokio::select! {
_ = self.cancellation_token.cancelled() => {
info!("Rental garbage collector stopped");
break;
}
_ = scan_interval.tick() => {
if let Err(e) = self.perform_collection().await {
error!("Garbage collection error: {}", e);
}
}
}
}
}
/// Perform garbage collection scan
async fn perform_collection(&self) -> Result<()> {
info!("Starting garbage collection scan");
// Query all non-terminated rentals
let rentals = self.persistence
.query_non_terminated_rentals()
.await
.context("Failed to query rentals")?;
debug!("Scanning {} non-terminated rentals", rentals.len());
let mut stats = CollectionStats::default();
// Process rentals one by one
for rental in rentals.iter().take(self.config.max_batch_size) {
match self.process_rental(rental, &mut stats).await {
Ok(_) => {}
Err(e) => {
error!(
"Failed to process rental {}: {}",
rental.rental_id, e
);
stats.errors += 1;
}
}
}
info!(
"Garbage collection scan complete: {} scanned, {} cleaned, {} terminated, \
{} investigated, {} protected, {} errors",
stats.scanned,
stats.cleaned,
stats.terminated,
stats.investigated,
stats.protected,
stats.errors
);
Ok(())
}
/// Process a single rental
async fn process_rental(
&self,
rental: &RentalInfo,
stats: &mut CollectionStats,
) -> Result<()> {
stats.scanned += 1;
// Classify the rental
let category = self.classify_rental(rental).await?;
match category {
OrphanCategory::SafeToClean => {
stats.cleaned += 1;
self.handle_safe_to_clean(rental).await?;
}
OrphanCategory::NeedsTermination => {
stats.terminated += 1;
self.handle_needs_termination(rental).await?;
}
OrphanCategory::NeedsInvestigation => {
stats.investigated += 1;
self.handle_needs_investigation(rental).await?;
}
OrphanCategory::Protected => {
stats.protected += 1;
debug!("Rental {} is protected, skipping", rental.rental_id);
}
}
Ok(())
}
/// Classify a rental into orphan category
async fn classify_rental(&self, rental: &RentalInfo) -> Result<OrphanCategory> {
// CRITICAL: Never touch recent rentals
let age = rental.age();
if age < Duration::seconds(self.config.min_age_secs as i64) {
return Ok(OrphanCategory::Protected);
}
// CRITICAL: Check billing telemetry
if self.has_recent_billing_activity(rental).await? {
return Ok(OrphanCategory::Protected);
}
// Verify container state
let container_state = self.verify_container_state(rental).await;
// Classification logic
match (&rental.state, container_state, age) {
// Terminal states old enough to delete
(RentalState::Failed | RentalState::Stopped, ContainerState::NotFound, age)
if age > Duration::days(self.config.terminal_retention_days as i64) => {
Ok(OrphanCategory::SafeToClean)
}
// Stuck in provisioning
(RentalState::Provisioning, _, age)
if age > Duration::seconds(self.config.provisioning_timeout_secs as i64) => {
Ok(OrphanCategory::NeedsTermination)
}
// Stuck in stopping
(RentalState::Stopping, _, age)
if age > Duration::seconds(self.config.stopping_timeout_secs as i64) => {
Ok(OrphanCategory::NeedsTermination)
}
// Active but container dead
(RentalState::Active, ContainerState::NotRunning | ContainerState::NotFound, _) => {
Ok(OrphanCategory::NeedsTermination)
}
// Data integrity issues
_ if self.has_data_integrity_issues(rental) => {
Ok(OrphanCategory::NeedsInvestigation)
}
// Everything else is protected
_ => Ok(OrphanCategory::Protected),
}
}
/// Check if rental has recent billing activity
async fn has_recent_billing_activity(&self, rental: &RentalInfo) -> Result<bool> {
// Check last_telemetry_at field if available
if let Some(last_telemetry) = rental.last_telemetry_at {
let age = Utc::now() - last_telemetry;
let threshold = Duration::hours(self.config.billing_safety_window_hours as i64);
if age < threshold {
debug!(
"Rental {} has recent billing activity ({:?} ago)",
rental.rental_id, age
);
return Ok(true);
}
}
Ok(false)
}
/// Verify container state via SSH
async fn verify_container_state(&self, rental: &RentalInfo) -> ContainerState {
// Implementation details...
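// See section 3.3 for the full sketch (SSH key lookup, ContainerClient,
// 10-second timeout); Unknown is the conservative fallback when SSH fails.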
ContainerState::Unknown
}
/// Check for data integrity issues
fn has_data_integrity_issues(&self, rental: &RentalInfo) -> bool {
rental.container_id.is_empty()
|| rental.node_id.is_empty()
|| rental.ssh_credentials.host.is_empty()
}
/// Handle rentals safe to clean
async fn handle_safe_to_clean(&self, rental: &RentalInfo) -> Result<()> {
info!(
"GC: Cleaning rental {} (state: {:?}, age: {:?})",
rental.rental_id,
rental.state,
rental.age()
);
if self.config.dry_run {
info!("GC: [DRY RUN] Would delete rental {}", rental.rental_id);
return Ok(());
}
// Delete from database
self.persistence.delete_rental(&rental.rental_id).await?;
info!("GC: Deleted rental {}", rental.rental_id);
Ok(())
}
/// Handle rentals needing termination
async fn handle_needs_termination(&self, rental: &RentalInfo) -> Result<()> {
// Implementation details...
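// Intended flow (sections 4.2 and 4.3): honor dry_run, call
// transition_to_failed(), then attempt best-effort cleanup_container().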
Ok(())
}
/// Handle rentals needing investigation
async fn handle_needs_investigation(&self, rental: &RentalInfo) -> Result<()> {
// Implementation details...
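// Intended flow (section 3.1, Category C): log a structured warning with
// the detected issue, transition to Failed, and flag for manual review.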
Ok(())
}
}
/// Collection statistics
#[derive(Debug, Default)]
struct CollectionStats {
scanned: usize,
cleaned: usize,
terminated: usize,
investigated: usize,
protected: usize,
errors: usize,
}
6.3 Integration with RentalManager
File: rental/mod.rs (modifications)
pub struct RentalManager {
// ... existing fields
garbage_collector: Option<Arc<RentalGarbageCollector>>,
}
impl RentalManager {
pub async fn create(
config: &ValidatorConfig,
persistence: Arc<SimplePersistence>,
metrics: Arc<ValidatorPrometheusMetrics>,
) -> Result<Self> {
// ... existing initialization
// Create garbage collector if enabled
let garbage_collector = if config.garbage_collector.enabled {
let gc = RentalGarbageCollector::new(
persistence.clone(),
ssh_key_manager.clone(),
metrics.clone(),
config.garbage_collector.clone(),
);
Some(Arc::new(gc))
} else {
None
};
Ok(Self {
// ... existing fields
garbage_collector,
})
}
pub fn start(&self) {
// ... existing monitor starts
if let Some(gc) = &self.garbage_collector {
gc.start();
}
}
}
impl Drop for RentalManager {
fn drop(&mut self) {
// ... existing cleanup
if let Some(gc) = &self.garbage_collector {
gc.stop();
}
}
}
6.4 Database Schema Updates
New Migration: migrations/XXX_add_rental_telemetry_tracking.sql
-- Add last_telemetry_at column to track billing activity
ALTER TABLE rentals ADD COLUMN last_telemetry_at TEXT;
-- Add indexes for efficient orphan queries
CREATE INDEX IF NOT EXISTS idx_rentals_state ON rentals(state);
CREATE INDEX IF NOT EXISTS idx_rentals_created_at ON rentals(created_at);
CREATE INDEX IF NOT EXISTS idx_rentals_updated_at ON rentals(updated_at);
CREATE INDEX IF NOT EXISTS idx_rentals_node_id ON rentals(node_id);
CREATE INDEX IF NOT EXISTS idx_rentals_miner_id ON rentals(miner_id);
-- Composite index for garbage collection queries
CREATE INDEX IF NOT EXISTS idx_rentals_gc_scan
ON rentals(state, created_at, last_telemetry_at)
WHERE state NOT IN ('stopped', 'failed');
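The scan in section 6.2 calls a query_non_terminated_rentals method on SimplePersistence that does not exist yet. A hedged sketch of one possible implementation against the schema above; the pool field, the reuse of load_rental, and any column names beyond those in the migration are assumptions about the existing persistence layer:
use anyhow::Result;

impl SimplePersistence {
    /// Fetch all rentals still in non-terminal states for the GC scan.
    /// This query should be able to use the idx_rentals_gc_scan index above.
    pub async fn query_non_terminated_rentals(&self) -> Result<Vec<RentalInfo>> {
        // Fetch ids first, then reuse the existing load_rental path so the
        // row -> RentalInfo mapping stays in one place.
        let ids: Vec<(String,)> = sqlx::query_as(
            "SELECT rental_id FROM rentals \
             WHERE state NOT IN ('stopped', 'failed') \
             ORDER BY created_at ASC",
        )
        .fetch_all(&self.pool) // assumes a SQLite pool field on SimplePersistence
        .await?;

        let mut rentals = Vec::with_capacity(ids.len());
        for (rental_id,) in ids {
            rentals.push(self.load_rental(&rental_id).await?);
        }
        Ok(rentals)
    }
}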
6.5 Types Updates
File: rental/types.rs (additions)
impl RentalInfo {
/// Get rental age since creation
pub fn age(&self) -> Duration {
Utc::now() - self.created_at
}
/// Get time since last update
pub fn staleness(&self) -> Option<Duration> {
self.updated_at.map(|updated| Utc::now() - updated)
}
/// Check if rental is in terminal state
pub fn is_terminal(&self) -> bool {
matches!(self.state, RentalState::Stopped | RentalState::Failed)
}
}
// Add last_telemetry_at field to RentalInfo
pub struct RentalInfo {
// ... existing fields
/// Timestamp of last billing telemetry collection
pub last_telemetry_at: Option<DateTime<Utc>>,
}
6.6 Billing Integration
File: rental/billing.rs (modification)
impl RentalBillingMonitor {
async fn collect_rental_telemetry(&self, rental: &RentalInfo) -> Result<()> {
// ... existing collection logic
// Update last_telemetry_at timestamp
let mut updated = rental.clone();
updated.last_telemetry_at = Some(Utc::now());
self.persistence.save_rental(&updated).await?;
Ok(())
}
}
7. Testing Strategy
7.1 Unit Tests
File: rental/garbage_collector.rs (test module)
#[cfg(test)]
mod tests {
use super::*;
#[tokio::test]
async fn test_classify_recent_rental_as_protected() {
// Rentals < 5 minutes old should never be touched
}
#[tokio::test]
async fn test_classify_rental_with_recent_billing_as_protected() {
// Rentals with telemetry in last 48h should be protected
}
#[tokio::test]
async fn test_classify_stuck_provisioning_rental() {
// Provisioning > 30min should need termination
}
#[tokio::test]
async fn test_classify_old_terminal_rental_as_cleanable() {
// Failed/Stopped > 7 days should be cleanable
}
#[tokio::test]
async fn test_classify_active_with_dead_container() {
// Active state but container not found should need termination
}
#[tokio::test]
async fn test_has_data_integrity_issues() {
// Missing required fields should be flagged
}
#[tokio::test]
async fn test_safe_deletion_in_dry_run_mode() {
// Dry run should log but not delete
}
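// Hedged example of one concrete test body: the config validator from
// section 5.3 is self-contained, so no mocks or async runtime are needed.
#[test]
fn test_config_rejects_short_billing_safety_window() {
    let config = GarbageCollectorConfig {
        billing_safety_window_hours: 12, // below the 24-hour minimum
        ..Default::default()
    };
    assert!(config.validate().is_err());
}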
}
7.2 Integration Tests
File: tests/rental_garbage_collection_integration.rs
#[tokio::test]
async fn test_full_garbage_collection_cycle() {
// 1. Create test rentals in various states
// 2. Start garbage collector
// 3. Wait for scan
// 4. Verify correct classification and actions
}
#[tokio::test]
async fn test_gc_preserves_active_rentals() {
// Ensure active rentals with running containers are not touched
}
#[tokio::test]
async fn test_gc_respects_billing_safety_window() {
// Rentals with recent telemetry must not be cleaned
}
#[tokio::test]
async fn test_gc_handles_ssh_failures_gracefully() {
// SSH failures should not cause crashes or false positives
}
#[tokio::test]
async fn test_gc_metrics_updated_correctly() {
// Verify metrics are cleared when rentals are cleaned
}
7.3 Manual Testing Scenarios
- Normal Operation: Create rental, let it run, verify GC doesn't touch it
- Stuck Provisioning: Create rental, kill deployment, wait 30 minutes, verify GC terminates it
- Dead Container: Create rental, manually kill container, verify GC detects and terminates it
- Billing Safety: Create rental with recent telemetry, verify GC protects it
- Old Terminal: Create failed rental, wait 7 days, verify GC deletes it
- Dry Run: Enable dry_run mode, verify logging without actions
8. Metrics and Observability
8.1 New Prometheus Metrics
// In metrics.rs
pub struct ValidatorPrometheusMetrics {
// ... existing metrics
/// Garbage collection scans
pub gc_scans_total: Counter,
/// Rentals processed by category
pub gc_rentals_processed: CounterVec, // labels: category
/// Garbage collection errors
pub gc_errors_total: Counter,
/// Last scan duration
pub gc_scan_duration_seconds: Histogram,
/// Rentals by classification
pub gc_rental_classification: GaugeVec, // labels: category
}
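A sketch of how perform_collection might record these at the end of each scan; the label values and the scan_started timer are assumptions, not existing code:
// At the end of perform_collection(), assuming scan_started: std::time::Instant
// was captured when the scan began.
self.metrics.gc_scans_total.inc();
self.metrics
    .gc_scan_duration_seconds
    .observe(scan_started.elapsed().as_secs_f64());
for (category, count) in [
    ("cleaned", stats.cleaned),
    ("terminated", stats.terminated),
    ("investigated", stats.investigated),
    ("protected", stats.protected),
] {
    self.metrics
        .gc_rentals_processed
        .with_label_values(&[category])
        .inc_by(count as f64);
}
if stats.errors > 0 {
    self.metrics.gc_errors_total.inc_by(stats.errors as f64);
}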
8.2 Structured Logging
All GC operations must log with:
- Rental ID
- Current state
- Action taken
- Reason for action
- Age of rental
- Container state
- Billing activity status
Example:
info!(
rental_id = %rental.rental_id,
state = ?rental.state,
age_secs = rental.age().num_seconds(),
container_state = ?container_state,
has_billing = billing_active,
action = "terminate",
"GC: Terminating orphaned rental"
);
8.3 Alert Conditions
Recommended alerts:
- High orphan rate: More than 10% of rentals classified as orphans
- GC errors: More than 5 errors per hour
- Stuck rentals: Any rental in non-terminal state > 24 hours
- Data integrity issues: Any rentals flagged for investigation
9. Rollout Plan
9.1 Phase 1: Implementation (Week 1-2)
Tasks:
- Implement core GarbageCollector struct
- Implement classification logic
- Implement container verification
- Implement billing safety checks
- Add database migrations
- Update RentalInfo with last_telemetry_at
- Integrate with RentalManager
- Add configuration validation
Deliverables:
- Fully implemented garbage_collector.rs module
- Database migration file
- Configuration structure and defaults
- Unit tests with >90% coverage
9.2 Phase 2: Integration Testing (Week 2-3)
Tasks:
- Write integration tests
- Test with mock billing service
- Test SSH failure scenarios
- Test concurrent operation with health monitor
- Performance testing under load
- Dry-run testing in staging environment
Deliverables:
- Complete integration test suite
- Performance benchmarks
- Staging environment validation report
9.3 Phase 3: Safe Rollout (Week 3-4)
Tasks:
- Deploy with dry_run=true in production
- Monitor logs for 48 hours
- Verify classification accuracy
- Check for false positives
- Enable actual cleanup if validated
- Monitor for 1 week with active cleanup
Deliverables:
- Production deployment with dry-run validation
- Classification accuracy report
- Full production deployment with monitoring
9.4 Phase 4: Documentation and Handoff (Week 4)
Tasks:
- Write operational runbook
- Document alert response procedures
- Create dashboard for GC monitoring
- Train operations team
- Document manual override procedures
Deliverables:
- Operations runbook
- Monitoring dashboard
- Training materials
- Manual intervention procedures
10. Risk Assessment
10.1 Critical Risks
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| False positive deletion of active rental | CRITICAL | Low | Billing safety window, age thresholds, dry-run testing |
| Billing data loss | CRITICAL | Low | Never delete rentals with recent telemetry, audit logging |
| SSH key compromise | HIGH | Low | Use existing SSH key manager, no new credentials |
| Container cleanup failures | MEDIUM | Medium | Graceful degradation, retry logic, manual cleanup |
| Database corruption | MEDIUM | Low | Transaction safety, comprehensive error handling |
| Performance degradation | LOW | Low | Batch limits, configurable intervals, efficient queries |
10.2 Operational Risks
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Excessive logging volume | LOW | Medium | Appropriate log levels, sampling |
| Alert fatigue | LOW | Medium | Tuned alert thresholds, actionable alerts only |
| Configuration errors | MEDIUM | Low | Validation on startup, safe defaults |
| Monitoring blind spots | MEDIUM | Low | Comprehensive metrics, regular review |
10.3 Financial Risks
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Incorrect miner payout | CRITICAL | Very Low | Billing safety window, conservative thresholds |
| Resource cost overruns | LOW | Low | Batch limits, efficient queries |
| Audit trail gaps | MEDIUM | Low | Comprehensive logging before actions |
11. Success Metrics
11.1 Functional Metrics
- Orphan Detection Rate: Percentage of actual orphans correctly identified
- False Positive Rate: Target < 0.1% (must be near zero)
- Cleanup Success Rate: Percentage of cleanups completed without errors
- Time to Detection: Average time from orphan creation to detection
11.2 Performance Metrics
- Scan Duration: Target < 30 seconds for 1000 rentals
- CPU Impact: Target < 1% average CPU usage
- Memory Impact: Target < 50MB additional memory
- Database Load: Target < 10 queries per scan
11.3 Reliability Metrics
- Error Rate: Target < 1 error per 1000 rentals processed
- Uptime: Target 100% (no GC-related crashes)
- Recovery Time: From error to normal operation < 1 minute
12. Open Questions and Decisions
12.1 Resolved Decisions
- Billing Safety Window: 48 hours (conservative)
- Terminal Retention: 7 days (balance audit needs vs storage)
- Scan Interval: 1 hour (sufficient for most orphans)
- Dry Run Default: False (but required for initial deployment)
12.2 Questions Requiring Clarification
- Payment Settlement Process:
  - Where/how are miner payments processed?
  - Can we query payment status before cleanup?
  - Recommendation: Add payment service integration or extend the safety window
- Manual Override Mechanism:
  - Should there be a way to protect specific rentals from GC?
  - Recommendation: Add a gc_exempt flag to rental metadata (see the sketch after this list)
- Audit Trail Retention:
  - Should deleted rentals be backed up separately?
  - Recommendation: Add an audit table or external log archival
- Cross-Service Coordination:
  - Does the billing service need notification before deletion?
  - Recommendation: Add a billing service health check before GC actions
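If the gc_exempt recommendation is adopted, the guard would be a short addition at the top of classify_rental; a hypothetical sketch (the field does not exist today):
// Hypothetical operator-settable exemption flag on RentalInfo.
if rental.gc_exempt {
    debug!("Rental {} is gc_exempt, skipping", rental.rental_id);
    return Ok(OrphanCategory::Protected);
}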
13. Implementation Checklist
Phase 1: Core Implementation
- Create garbage_collector.rs module
- Implement OrphanCategory enum
- Implement ContainerState enum
- Implement GarbageCollectorConfig struct with defaults
- Implement RentalGarbageCollector struct
- Implement classification logic
- Implement container verification
- Implement billing safety checks
- Implement safe cleanup methods
- Add unit tests (>90% coverage)
Phase 2: Integration
- Add last_telemetry_at field to RentalInfo
- Update billing monitor to track telemetry timestamps
- Create database migration for new column and indexes
- Add GC configuration to ValidatorConfig
- Integrate GC into RentalManager
- Add GC metrics to ValidatorPrometheusMetrics
- Implement graceful shutdown in Drop
- Add integration tests
Phase 3: Safety and Validation
- Implement dry-run mode
- Add configuration validation
- Add comprehensive logging
- Create monitoring dashboard
- Write operational runbook
- Conduct security review
- Performance testing
- Load testing
Phase 4: Deployment
- Deploy with dry_run=true to staging
- Monitor and validate for 48 hours
- Review classification accuracy
- Fix any issues identified
- Deploy with dry_run=true to production
- Monitor and validate for 48 hours
- Enable actual cleanup if validated
- Monitor for 1 week
- Document lessons learned
14. Appendix
A. References
- Health Monitoring: crates/basilica-validator/src/rental/monitoring.rs
- Billing Telemetry: crates/basilica-validator/src/rental/billing.rs
- Rental Manager: crates/basilica-validator/src/rental/mod.rs
- Container Client: crates/basilica-validator/src/rental/container_client.rs
- Persistence Layer: crates/basilica-validator/src/persistence/
B. Glossary
- Orphan Rental: A rental record without valid container or stuck in non-terminal state
- Terminal State: Stopped or Failed rental states (final states)
- Billing Safety Window: Time period during which rentals with telemetry are protected
- Dry Run: Mode where GC logs actions without executing them
- Grace Period: Minimum age before rental can be considered for cleanup
C. Contact and Escalation
For questions or issues with this implementation:
- Review this architecture document
- Check operational runbook (to be created)
- Review code comments and tests
- Escalate to validator team lead if financial safety concerns arise
Document Status: Ready for Implementation Review
Next Review Date: Upon completion of Phase 1
Approval Required: Technical Lead, Financial Systems Owner