[FEATURE] Rental Orphan Garbage Collection System #217

Description

Rental Orphan Garbage Collection System Architecture

Version: 1.0
Date: 2025-10-29
Status: Design Document
Author: System Architecture Team
Classification: CRITICAL - Financial System Component

Executive Summary

This document specifies the architecture for an orphan rental garbage collection system within the Basilica validator. This is a financial-critical feature that directly affects miner payouts and billing accuracy. The system must operate with extreme reliability, conservative decision-making, and comprehensive audit logging.

Key Principles:

  • Safety First: Never delete rentals with active billing or recent telemetry
  • Conservative: When in doubt, preserve the rental record
  • Auditable: Log all decisions and actions comprehensively
  • Non-disruptive: Operate independently without affecting active rentals
  • Simple: Follow KISS, DRY, and SOLID principles

1. Problem Statement

1.1 Current Issues

The validator rental system can create orphaned rentals in several scenarios:

  1. Database Orphans: Rental records in database without corresponding containers

    • Container deployment failures that don't update state
    • Manual container deletion bypassing rental manager
    • SSH connection failures preventing container verification
  2. Container Orphans: Running containers without rental records

    • Database corruption or data loss
    • Manual rental record deletion
    • Race conditions during deployment
  3. Stuck State Rentals: Rentals in non-terminal states indefinitely

    • Rentals stuck in Provisioning after deployment failures
    • Rentals stuck in Stopping after termination failures
    • Rentals in Active with dead containers
  4. Stale References: Rentals referencing deleted miners or nodes

    • Miner deregistration without rental cleanup
    • Node removal without rental termination

1.2 Financial Impact

Critical Considerations:

  • Orphaned active rentals may generate incorrect billing telemetry
  • Stuck rentals block node availability for legitimate rentals
  • Missing container cleanup wastes miner resources
  • Incorrect state transitions affect miner scoring and rewards
  • Billing service requires accurate rental lifecycle events

1.3 Success Criteria

  1. Accuracy: Correct classification of every rental scanned, with zero false positives (a healthy rental must never be flagged as an orphan)
  2. Billing Safety: Never interfere with rentals having billing activity in last 48 hours
  3. Performance: Minimal impact on validator operations (<1% CPU/memory)
  4. Auditability: Complete log trail for all cleanup actions
  5. Reliability: Graceful handling of SSH failures and network issues

2. System Architecture

2.1 Component Overview

┌─────────────────────────────────────────────────────────────┐
│                     RentalManager                            │
│  ┌────────────────┐  ┌────────────────┐  ┌───────────────┐ │
│  │ Health Monitor │  │ Billing Monitor│  │ GC Monitor    │ │
│  │  (30s loop)    │  │  (60s loop)    │  │ (3600s loop)  │ │
│  └────────────────┘  └────────────────┘  └───────────────┘ │
└─────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
┌───────▼────────┐   ┌────────▼─────────┐  ┌───────▼──────┐
│  Persistence   │   │ ContainerClient  │  │   Metrics    │
│   (SQLite)     │   │  (SSH + Docker)  │  │ (Prometheus) │
└────────────────┘   └──────────────────┘  └──────────────┘

2.2 New Component: RentalGarbageCollector

File: crates/basilica-validator/src/rental/garbage_collector.rs

Responsibilities:

  1. Periodically scan for orphaned rental records
  2. Verify container states against database records
  3. Safely transition rentals to terminal states
  4. Clean up container resources when appropriate
  5. Record metrics and audit logs
  6. Respect billing integrity constraints

Non-Responsibilities:

  • Does NOT handle billing finalization (handled by billing service)
  • Does NOT make payment decisions
  • Does NOT interfere with health monitoring
  • Does NOT handle active rental operations

2.3 Architecture Pattern

Follows established monitoring pattern:

pub struct RentalGarbageCollector {
    persistence: Arc<SimplePersistence>,
    ssh_key_manager: Arc<ValidatorSshKeyManager>,
    metrics: Arc<ValidatorPrometheusMetrics>,
    config: GarbageCollectorConfig,
    cancellation_token: CancellationToken,
}

Key Design Decisions:

  1. Independent Loop: Separate from health and billing monitors
  2. Read-Heavy: Primarily reads database and container state
  3. Conservative Updates: Only updates obviously orphaned rentals
  4. Idempotent: Can run multiple times without adverse effects
  5. Fail-Safe: Errors in processing one rental don't affect others

3. Orphan Rental Criteria

3.1 Classification System

Orphan rentals are classified into three risk categories:

Category A: Safe to Clean (Low Risk)

Rentals meeting ALL criteria:

  • State is Failed or Stopped (terminal states)
  • Age > 7 days (terminal_retention_days; measured from creation via rental.age() in the code below)
  • No billing telemetry in last 48 hours
  • Container verified as non-existent or stopped

Action: Delete from database after audit logging

Category B: Needs Termination (Medium Risk)

Rentals meeting ANY of these state criteria, and (per Category X below) having no billing telemetry within the 48-hour billing safety window:

  • State is Provisioning AND age > 30 minutes
  • State is Stopping AND age > 15 minutes
  • State is Active AND container verified as not running

Action: Transition to Failed state, attempt container cleanup

Category C: Needs Investigation (High Risk)

Rentals meeting ANY of these criteria:

  • References non-existent miner_id or node_id
  • Missing required fields (container_id, ssh_credentials)
  • Database record corruption

Action: Log warning, transition to Failed state, manual review required

Category X: Never Touch (Protected)

Rentals with ANY of these characteristics:

  • Billing telemetry in last 48 hours
  • Age < 5 minutes (grace period for deployment)
  • State is Active and container is running
  • Currently being processed by health monitor

Action: Skip entirely, preserve existing state

3.2 Orphan Detection Algorithm

async fn classify_rental(&self, rental: &RentalInfo) -> Result<OrphanCategory> {
    // CRITICAL: Never touch recent rentals (deployment grace period)
    if rental.age() < Duration::seconds(300) {
        return Ok(OrphanCategory::Protected);
    }

    // CRITICAL: Check billing telemetry
    if self.has_recent_billing_activity(rental).await? {
        return Ok(OrphanCategory::Protected);
    }

    // Verify container state
    let container_state = self.verify_container_state(rental).await;

    match (&rental.state, container_state) {
        // Terminal states old enough to delete
        (RentalState::Failed | RentalState::Stopped, ContainerState::NotFound)
            if rental.age() > Duration::days(7) => {
                Ok(OrphanCategory::SafeToClean)
            }

        // Stuck in provisioning
        (RentalState::Provisioning, _)
            if rental.age() > Duration::seconds(1800) => {
                Ok(OrphanCategory::NeedsTermination)
            }

        // Stuck in stopping
        (RentalState::Stopping, _)
            if rental.age() > Duration::seconds(900) => {
                Ok(OrphanCategory::NeedsTermination)
            }

        // Active but container dead
        (RentalState::Active, ContainerState::NotRunning | ContainerState::NotFound) => {
            Ok(OrphanCategory::NeedsTermination)
        }

        // Data integrity issues
        _ if self.has_data_integrity_issues(rental) => {
            Ok(OrphanCategory::NeedsInvestigation)
        }

        // Everything else is protected (including Unknown container state)
        _ => Ok(OrphanCategory::Protected),
    }
}

3.3 Container State Verification

enum ContainerState {
    Running,      // Container exists and is running
    NotRunning,   // Container exists but stopped
    NotFound,     // Container does not exist
    Unknown,      // Cannot determine (SSH failure, etc.)
}

async fn verify_container_state(&self, rental: &RentalInfo) -> ContainerState {
    // Get validator's SSH key
    let validator_key_path = match self.ssh_key_manager.get_persistent_key() {
        Some((_, path)) => path,
        None => return ContainerState::Unknown,
    };

    // Create container client
    let client = match ContainerClient::new(
        rental.ssh_credentials.clone(),
        Some(validator_key_path),
    ) {
        Ok(c) => c,
        Err(e) => {
            warn!("Failed to create container client: {}", e);
            return ContainerState::Unknown;
        }
    };

    // Check container status with timeout
    match tokio::time::timeout(
        std::time::Duration::from_secs(10),
        client.get_container_status(&rental.container_id)
    ).await {
        Ok(Ok(status)) => {
            if status.state == "running" {
                ContainerState::Running
            } else {
                ContainerState::NotRunning
            }
        }
        // Command ran but the lookup failed: container assumed gone
        Ok(Err(_)) => ContainerState::NotFound,
        // Timeout: cannot determine state, fail safe
        Err(_) => ContainerState::Unknown,
    }
}

4. Billing Safety System

4.1 Billing Activity Detection

Critical Requirement: Never clean rentals with recent billing activity.

async fn has_recent_billing_activity(&self, rental_id: &str) -> Result<bool> {
    // Check if billing telemetry was collected recently
    // This requires adding last_telemetry_at field to RentalInfo
    // or querying billing service directly

    let rental = self.persistence.load_rental(rental_id).await?;

    if let Some(last_telemetry) = rental.last_telemetry_at {
        let age = Utc::now() - last_telemetry;
        if age < Duration::hours(48) {
            return Ok(true);
        }
    }

    Ok(false)
}

4.2 Safe State Transitions

All state transitions must preserve billing integrity:

async fn transition_to_failed(&self, rental: &RentalInfo) -> Result<()> {
    // CRITICAL: Only transition if no recent billing
    if self.has_recent_billing_activity(&rental.rental_id).await? {
        warn!(
            "Refusing to transition rental {} with recent billing activity",
            rental.rental_id
        );
        return Ok(());
    }

    let mut updated = rental.clone();
    updated.state = RentalState::Failed;
    updated.updated_at = Some(Utc::now());
    updated.terminated_at = Some(Utc::now());
    updated.termination_reason = Some("Garbage collected: orphaned rental".to_string());

    // Audit log BEFORE state change
    info!(
        "GC: Transitioning rental {} from {:?} to Failed (age: {:?}, reason: orphaned)",
        rental.rental_id,
        rental.state,
        rental.age()
    );

    self.persistence.save_rental(&updated).await?;

    // Clear metrics
    self.clear_rental_metrics(rental);

    Ok(())
}

4.3 Container Cleanup Protocol

async fn cleanup_container(&self, rental: &RentalInfo) -> Result<()> {
    let validator_key_path = self.ssh_key_manager
        .get_persistent_key()
        .ok_or_else(|| anyhow!("No SSH key"))?
        .1;

    let client = ContainerClient::new(
        rental.ssh_credentials.clone(),
        Some(validator_key_path),
    )?;

    // Try graceful stop first
    match tokio::time::timeout(
        std::time::Duration::from_secs(30),
        client.stop_container(&rental.container_id, false)
    ).await {
        Ok(Ok(_)) => {
            info!("GC: Gracefully stopped container {}", rental.container_id);
        }
        Ok(Err(e)) => {
            warn!("GC: Failed to stop container gracefully: {}", e);
        }
        Err(_) => {
            warn!("GC: Timeout stopping container, attempting force stop");
            // Force stop on timeout
            let _ = client.stop_container(&rental.container_id, true).await;
        }
    }

    // Remove container
    match client.remove_container(&rental.container_id).await {
        Ok(_) => {
            info!("GC: Removed container {}", rental.container_id);
        }
        Err(e) => {
            warn!("GC: Failed to remove container: {}", e);
        }
    }

    Ok(())
}

5. Configuration Specification

5.1 Configuration Structure

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct GarbageCollectorConfig {
    /// Enable garbage collection
    #[serde(default = "default_enabled")]
    pub enabled: bool,

    /// Scan interval in seconds (default: 1 hour)
    #[serde(default = "default_scan_interval_secs")]
    pub scan_interval_secs: u64,

    /// Minimum age before considering rental for cleanup (seconds)
    #[serde(default = "default_min_age_secs")]
    pub min_age_secs: u64,

    /// Age threshold for stuck provisioning rentals (seconds)
    #[serde(default = "default_provisioning_timeout_secs")]
    pub provisioning_timeout_secs: u64,

    /// Age threshold for stuck stopping rentals (seconds)
    #[serde(default = "default_stopping_timeout_secs")]
    pub stopping_timeout_secs: u64,

    /// Age threshold for deleting terminal state rentals (days)
    #[serde(default = "default_terminal_retention_days")]
    pub terminal_retention_days: u32,

    /// Billing safety window - never touch rentals with telemetry in this window (hours)
    #[serde(default = "default_billing_safety_window_hours")]
    pub billing_safety_window_hours: u32,

    /// Container verification timeout (seconds)
    #[serde(default = "default_container_check_timeout_secs")]
    pub container_check_timeout_secs: u64,

    /// Maximum rentals to process per scan
    #[serde(default = "default_max_batch_size")]
    pub max_batch_size: usize,

    /// Enable dry-run mode (log actions without executing)
    #[serde(default = "default_dry_run")]
    pub dry_run: bool,
}

// Default values
fn default_enabled() -> bool { true }
fn default_scan_interval_secs() -> u64 { 3600 }  // 1 hour
fn default_min_age_secs() -> u64 { 300 }  // 5 minutes grace period
fn default_provisioning_timeout_secs() -> u64 { 1800 }  // 30 minutes
fn default_stopping_timeout_secs() -> u64 { 900 }  // 15 minutes
fn default_terminal_retention_days() -> u32 { 7 }  // 1 week
fn default_billing_safety_window_hours() -> u32 { 48 }  // 2 days
fn default_container_check_timeout_secs() -> u64 { 10 }
fn default_max_batch_size() -> usize { 100 }
fn default_dry_run() -> bool { false }

5.2 TOML Configuration Example

[garbage_collector]
enabled = true
scan_interval_secs = 3600
min_age_secs = 300
provisioning_timeout_secs = 1800
stopping_timeout_secs = 900
terminal_retention_days = 7
billing_safety_window_hours = 48
container_check_timeout_secs = 10
max_batch_size = 100
dry_run = false

5.3 Configuration Validation

impl GarbageCollectorConfig {
    pub fn validate(&self) -> Result<(), ConfigurationError> {
        if self.scan_interval_secs == 0 {
            return Err(ConfigurationError::InvalidValue {
                key: "garbage_collector.scan_interval_secs".to_string(),
                value: "0".to_string(),
                reason: "Scan interval must be greater than 0".to_string(),
            });
        }

        if self.min_age_secs < 60 {
            return Err(ConfigurationError::InvalidValue {
                key: "garbage_collector.min_age_secs".to_string(),
                value: self.min_age_secs.to_string(),
                reason: "Minimum age must be at least 60 seconds".to_string(),
            });
        }

        if self.billing_safety_window_hours < 24 {
            return Err(ConfigurationError::InvalidValue {
                key: "garbage_collector.billing_safety_window_hours".to_string(),
                value: self.billing_safety_window_hours.to_string(),
                reason: "Billing safety window must be at least 24 hours".to_string(),
            });
        }

        Ok(())
    }
}
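
Validation should run once at startup, before any monitors are spawned. A minimal sketch, assuming ValidatorConfig exposes the [garbage_collector] section as a garbage_collector field (as in Section 6.3); validate_gc_config is a hypothetical helper:

// Hypothetical startup hook: reject invalid GC configuration before spawning monitors
fn validate_gc_config(config: &ValidatorConfig) -> Result<(), ConfigurationError> {
    config.garbage_collector.validate()?;
    if config.garbage_collector.dry_run {
        tracing::info!("GC configured in dry-run mode: actions are logged, not executed");
    }
    Ok(())
}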

6. Implementation Specification

6.1 Module Structure

crates/basilica-validator/src/rental/
├── mod.rs                    (RentalManager with GC integration)
├── garbage_collector.rs      (NEW: Core GC implementation)
├── monitoring.rs             (Existing: Health monitoring)
├── billing.rs                (Existing: Billing telemetry)
├── types.rs                  (Updated: Add OrphanCategory, last_telemetry_at)
├── container_client.rs       (Existing: Container operations)
└── deployment.rs             (Existing: Container deployment)

6.2 Core Implementation

File: garbage_collector.rs

use anyhow::{Context, Result};
use chrono::{Duration, Utc};
use std::sync::Arc;
use tokio::time::interval;
use tokio_util::sync::CancellationToken;
use tracing::{debug, error, info, warn};

use super::container_client::ContainerClient;
use super::types::{RentalInfo, RentalState};
use crate::metrics::ValidatorPrometheusMetrics;
use crate::persistence::SimplePersistence;
use crate::ssh::ValidatorSshKeyManager;

/// Orphan rental categories
#[derive(Debug, Clone, PartialEq)]
pub enum OrphanCategory {
    /// Safe to delete - terminal state, old, no billing activity
    SafeToClean,
    /// Needs state transition to Failed and container cleanup
    NeedsTermination,
    /// Data integrity issues requiring manual investigation
    NeedsInvestigation,
    /// Protected - never touch
    Protected,
}

/// Container state as determined by verification
#[derive(Debug, Clone, PartialEq)]
pub enum ContainerState {
    Running,
    NotRunning,
    NotFound,
    Unknown,
}

// GarbageCollectorConfig is specified in Section 5.1 (serde derives and field
// docs omitted here); the struct definition is not repeated, only its Default
// implementation.

impl Default for GarbageCollectorConfig {
    fn default() -> Self {
        Self {
            enabled: true,
            scan_interval_secs: 3600,
            min_age_secs: 300,
            provisioning_timeout_secs: 1800,
            stopping_timeout_secs: 900,
            terminal_retention_days: 7,
            billing_safety_window_hours: 48,
            container_check_timeout_secs: 10,
            max_batch_size: 100,
            dry_run: false,
        }
    }
}

/// Rental garbage collector
#[derive(Clone)]
pub struct RentalGarbageCollector {
    persistence: Arc<SimplePersistence>,
    ssh_key_manager: Arc<ValidatorSshKeyManager>,
    metrics: Arc<ValidatorPrometheusMetrics>,
    config: GarbageCollectorConfig,
    cancellation_token: CancellationToken,
}

impl RentalGarbageCollector {
    /// Create new garbage collector
    pub fn new(
        persistence: Arc<SimplePersistence>,
        ssh_key_manager: Arc<ValidatorSshKeyManager>,
        metrics: Arc<ValidatorPrometheusMetrics>,
        config: GarbageCollectorConfig,
    ) -> Self {
        Self {
            persistence,
            ssh_key_manager,
            metrics,
            config,
            cancellation_token: CancellationToken::new(),
        }
    }

    /// Start garbage collection loop
    pub fn start(&self) {
        let collector = self.clone();
        tokio::spawn(async move {
            collector.collection_loop().await;
        });
    }

    /// Stop garbage collection
    pub fn stop(&self) {
        self.cancellation_token.cancel();
    }

    /// Main collection loop
    async fn collection_loop(&self) {
        let mut scan_interval = interval(
            std::time::Duration::from_secs(self.config.scan_interval_secs)
        );

        info!("Rental garbage collector started (dry_run: {})", self.config.dry_run);

        loop {
            tokio::select! {
                _ = self.cancellation_token.cancelled() => {
                    info!("Rental garbage collector stopped");
                    break;
                }
                _ = scan_interval.tick() => {
                    if let Err(e) = self.perform_collection().await {
                        error!("Garbage collection error: {}", e);
                    }
                }
            }
        }
    }

    /// Perform garbage collection scan
    async fn perform_collection(&self) -> Result<()> {
        info!("Starting garbage collection scan");

        // Query all non-terminated rentals
        let rentals = self.persistence
            .query_non_terminated_rentals()
            .await
            .context("Failed to query rentals")?;

        debug!("Scanning {} non-terminated rentals", rentals.len());

        let mut stats = CollectionStats::default();

        // Process rentals one by one
        for rental in rentals.iter().take(self.config.max_batch_size) {
            match self.process_rental(rental, &mut stats).await {
                Ok(_) => {}
                Err(e) => {
                    error!(
                        "Failed to process rental {}: {}",
                        rental.rental_id, e
                    );
                    stats.errors += 1;
                }
            }
        }

        info!(
            "Garbage collection scan complete: {} scanned, {} cleaned, {} terminated, \
             {} investigated, {} protected, {} errors",
            stats.scanned,
            stats.cleaned,
            stats.terminated,
            stats.investigated,
            stats.protected,
            stats.errors
        );

        Ok(())
    }

    /// Process a single rental
    async fn process_rental(
        &self,
        rental: &RentalInfo,
        stats: &mut CollectionStats,
    ) -> Result<()> {
        stats.scanned += 1;

        // Classify the rental
        let category = self.classify_rental(rental).await?;

        match category {
            OrphanCategory::SafeToClean => {
                stats.cleaned += 1;
                self.handle_safe_to_clean(rental).await?;
            }
            OrphanCategory::NeedsTermination => {
                stats.terminated += 1;
                self.handle_needs_termination(rental).await?;
            }
            OrphanCategory::NeedsInvestigation => {
                stats.investigated += 1;
                self.handle_needs_investigation(rental).await?;
            }
            OrphanCategory::Protected => {
                stats.protected += 1;
                debug!("Rental {} is protected, skipping", rental.rental_id);
            }
        }

        Ok(())
    }

    /// Classify a rental into orphan category
    async fn classify_rental(&self, rental: &RentalInfo) -> Result<OrphanCategory> {
        // CRITICAL: Never touch recent rentals
        let age = rental.age();
        if age < Duration::seconds(self.config.min_age_secs as i64) {
            return Ok(OrphanCategory::Protected);
        }

        // CRITICAL: Check billing telemetry
        if self.has_recent_billing_activity(rental).await? {
            return Ok(OrphanCategory::Protected);
        }

        // Verify container state
        let container_state = self.verify_container_state(rental).await;

        // Classification logic
        match (&rental.state, container_state, age) {
            // Terminal states old enough to delete
            (RentalState::Failed | RentalState::Stopped, ContainerState::NotFound, age)
                if age > Duration::days(self.config.terminal_retention_days as i64) => {
                    Ok(OrphanCategory::SafeToClean)
                }

            // Stuck in provisioning
            (RentalState::Provisioning, _, age)
                if age > Duration::seconds(self.config.provisioning_timeout_secs as i64) => {
                    Ok(OrphanCategory::NeedsTermination)
                }

            // Stuck in stopping
            (RentalState::Stopping, _, age)
                if age > Duration::seconds(self.config.stopping_timeout_secs as i64) => {
                    Ok(OrphanCategory::NeedsTermination)
                }

            // Active but container dead
            (RentalState::Active, ContainerState::NotRunning | ContainerState::NotFound, _) => {
                Ok(OrphanCategory::NeedsTermination)
            }

            // Data integrity issues
            _ if self.has_data_integrity_issues(rental) => {
                Ok(OrphanCategory::NeedsInvestigation)
            }

            // Everything else is protected
            _ => Ok(OrphanCategory::Protected),
        }
    }

    /// Check if rental has recent billing activity
    async fn has_recent_billing_activity(&self, rental: &RentalInfo) -> Result<bool> {
        // Check last_telemetry_at field if available
        if let Some(last_telemetry) = rental.last_telemetry_at {
            let age = Utc::now() - last_telemetry;
            let threshold = Duration::hours(self.config.billing_safety_window_hours as i64);

            if age < threshold {
                debug!(
                    "Rental {} has recent billing activity ({:?} ago)",
                    rental.rental_id, age
                );
                return Ok(true);
            }
        }

        Ok(false)
    }

    /// Verify container state via SSH
    async fn verify_container_state(&self, rental: &RentalInfo) -> ContainerState {
        // Full implementation shown in Section 3.3; Unknown is the fail-safe placeholder
        ContainerState::Unknown
    }

    /// Check for data integrity issues
    fn has_data_integrity_issues(&self, rental: &RentalInfo) -> bool {
        rental.container_id.is_empty()
            || rental.node_id.is_empty()
            || rental.ssh_credentials.host.is_empty()
    }

    /// Handle rentals safe to clean
    async fn handle_safe_to_clean(&self, rental: &RentalInfo) -> Result<()> {
        info!(
            "GC: Cleaning rental {} (state: {:?}, age: {:?})",
            rental.rental_id,
            rental.state,
            rental.age()
        );

        if self.config.dry_run {
            info!("GC: [DRY RUN] Would delete rental {}", rental.rental_id);
            return Ok(());
        }

        // Delete from database
        self.persistence.delete_rental(&rental.rental_id).await?;

        info!("GC: Deleted rental {}", rental.rental_id);
        Ok(())
    }

    /// Handle rentals needing termination
    async fn handle_needs_termination(&self, rental: &RentalInfo) -> Result<()> {
        // See the sketch following this listing (wires Sections 4.2 and 4.3)
        Ok(())
    }

    /// Handle rentals needing investigation
    async fn handle_needs_investigation(&self, rental: &RentalInfo) -> Result<()> {
        // See the sketch following this listing
        Ok(())
    }
}

/// Collection statistics
#[derive(Debug, Default)]
struct CollectionStats {
    scanned: usize,
    cleaned: usize,
    terminated: usize,
    investigated: usize,
    protected: usize,
    errors: usize,
}
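
The two handler stubs above reduce to wiring together the Section 4 methods. A sketch, assuming container cleanup runs before the state transition (the ordering is a design choice, not settled):

async fn handle_needs_termination(&self, rental: &RentalInfo) -> Result<()> {
    if self.config.dry_run {
        info!("GC: [DRY RUN] Would terminate rental {}", rental.rental_id);
        return Ok(());
    }

    // Best-effort container cleanup (Section 4.3); failure is logged, not fatal
    if let Err(e) = self.cleanup_container(rental).await {
        warn!("GC: Container cleanup failed for {}: {}", rental.rental_id, e);
    }

    // State transition (Section 4.2) re-checks billing safety before writing
    self.transition_to_failed(rental).await
}

async fn handle_needs_investigation(&self, rental: &RentalInfo) -> Result<()> {
    // Category C action: log loudly, transition to Failed, leave a trail for manual review
    warn!(
        "GC: Rental {} has data integrity issues (state: {:?}), flagging for manual review",
        rental.rental_id, rental.state
    );

    if self.config.dry_run {
        info!("GC: [DRY RUN] Would mark rental {} as Failed", rental.rental_id);
        return Ok(());
    }

    self.transition_to_failed(rental).await
}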

6.3 Integration with RentalManager

File: rental/mod.rs (modifications)

pub struct RentalManager {
    // ... existing fields
    garbage_collector: Option<Arc<RentalGarbageCollector>>,
}

impl RentalManager {
    pub async fn create(
        config: &ValidatorConfig,
        persistence: Arc<SimplePersistence>,
        metrics: Arc<ValidatorPrometheusMetrics>,
    ) -> Result<Self> {
        // ... existing initialization

        // Create garbage collector if enabled
        let garbage_collector = if config.garbage_collector.enabled {
            let gc = RentalGarbageCollector::new(
                persistence.clone(),
                ssh_key_manager.clone(),
                metrics.clone(),
                config.garbage_collector.clone(),
            );
            Some(Arc::new(gc))
        } else {
            None
        };

        Ok(Self {
            // ... existing fields
            garbage_collector,
        })
    }

    pub fn start(&self) {
        // ... existing monitor starts

        if let Some(gc) = &self.garbage_collector {
            gc.start();
        }
    }
}

impl Drop for RentalManager {
    fn drop(&mut self) {
        // ... existing cleanup

        if let Some(gc) = &self.garbage_collector {
            gc.stop();
        }
    }
}
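
Note that stop() only cancels the CancellationToken and never blocks, so calling it from Drop is safe; the spawned collection loop observes the cancellation at its next tokio::select! iteration.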

6.4 Database Schema Updates

New Migration: migrations/XXX_add_rental_telemetry_tracking.sql

-- Add last_telemetry_at column to track billing activity
ALTER TABLE rentals ADD COLUMN last_telemetry_at TEXT;

-- Add indexes for efficient orphan queries
CREATE INDEX IF NOT EXISTS idx_rentals_state ON rentals(state);
CREATE INDEX IF NOT EXISTS idx_rentals_created_at ON rentals(created_at);
CREATE INDEX IF NOT EXISTS idx_rentals_updated_at ON rentals(updated_at);
CREATE INDEX IF NOT EXISTS idx_rentals_node_id ON rentals(node_id);
CREATE INDEX IF NOT EXISTS idx_rentals_miner_id ON rentals(miner_id);

-- Composite index for garbage collection queries
CREATE INDEX IF NOT EXISTS idx_rentals_gc_scan
  ON rentals(state, created_at, last_telemetry_at)
  WHERE state NOT IN ('stopped', 'failed');
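
For reference, the orphan scan should be a single query served by the composite index above. A sketch of what query_non_terminated_rentals might issue (applying the batch limit in SQL is an open choice; the code in Section 6.2 currently applies it in Rust via .take(max_batch_size)):

-- Candidate rentals for GC classification: non-terminal states only.
-- Matches the predicate of idx_rentals_gc_scan.
SELECT *
FROM rentals
WHERE state NOT IN ('stopped', 'failed')
ORDER BY created_at ASC
LIMIT 100;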

6.5 Types Updates

File: rental/types.rs (additions)

impl RentalInfo {
    /// Get rental age since creation
    pub fn age(&self) -> Duration {
        Utc::now() - self.created_at
    }

    /// Get time since last update
    pub fn staleness(&self) -> Option<Duration> {
        self.updated_at.map(|updated| Utc::now() - updated)
    }

    /// Check if rental is in terminal state
    pub fn is_terminal(&self) -> bool {
        matches!(self.state, RentalState::Stopped | RentalState::Failed)
    }
}

// Add last_telemetry_at field to RentalInfo
pub struct RentalInfo {
    // ... existing fields

    /// Timestamp of last billing telemetry collection
    pub last_telemetry_at: Option<DateTime<Utc>>,
}
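
For reference, the rental states assumed throughout this document (this is the existing enum in types.rs; the variant set below is inferred from usage in this document, not a new definition):

/// Rental lifecycle states referenced by the GC classification logic
pub enum RentalState {
    Provisioning, // container deployment in progress
    Active,       // container running, rental live
    Stopping,     // termination requested, cleanup in progress
    Stopped,      // terminal: cleanly stopped
    Failed,       // terminal: deployment or runtime failure
}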

6.6 Billing Integration

File: rental/billing.rs (modification)

impl RentalBillingMonitor {
    async fn collect_rental_telemetry(&self, rental: &RentalInfo) -> Result<()> {
        // ... existing collection logic

        // Update last_telemetry_at timestamp
        let mut updated = rental.clone();
        updated.last_telemetry_at = Some(Utc::now());

        self.persistence.save_rental(&updated).await?;

        Ok(())
    }
}
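
Since collect_rental_telemetry runs on the 60-second billing loop, this adds one rental-row write per rental per minute. If that write volume proves significant, a targeted UPDATE of the last_telemetry_at column alone would be a lighter alternative to rewriting the full record.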

7. Testing Strategy

7.1 Unit Tests

File: rental/garbage_collector.rs (test module)

#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_classify_recent_rental_as_protected() {
        // Rentals < 5 minutes old should never be touched
    }

    #[tokio::test]
    async fn test_classify_rental_with_recent_billing_as_protected() {
        // Rentals with telemetry in last 48h should be protected
    }

    #[tokio::test]
    async fn test_classify_stuck_provisioning_rental() {
        // Provisioning > 30min should need termination
    }

    #[tokio::test]
    async fn test_classify_old_terminal_rental_as_cleanable() {
        // Failed/Stopped > 7 days should be cleanable
    }

    #[tokio::test]
    async fn test_classify_active_with_dead_container() {
        // Active state but container not found should need termination
    }

    #[tokio::test]
    async fn test_has_data_integrity_issues() {
        // Missing required fields should be flagged
    }

    #[tokio::test]
    async fn test_safe_deletion_in_dry_run_mode() {
        // Dry run should log but not delete
    }
}
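
Most of these tests require RentalInfo fixtures and a mock persistence layer. One check that can be written directly against the structures defined above is configuration validation (synchronous, so a plain #[test] suffices):

#[test]
fn test_config_validation_rejects_short_billing_window() {
    // 12h is below the 24h minimum enforced in Section 5.3
    let config = GarbageCollectorConfig {
        billing_safety_window_hours: 12,
        ..Default::default()
    };
    assert!(config.validate().is_err());

    // The defaults (48h window) must pass validation
    assert!(GarbageCollectorConfig::default().validate().is_ok());
}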

7.2 Integration Tests

File: tests/rental_garbage_collection_integration.rs

#[tokio::test]
async fn test_full_garbage_collection_cycle() {
    // 1. Create test rentals in various states
    // 2. Start garbage collector
    // 3. Wait for scan
    // 4. Verify correct classification and actions
}

#[tokio::test]
async fn test_gc_preserves_active_rentals() {
    // Ensure active rentals with running containers are not touched
}

#[tokio::test]
async fn test_gc_respects_billing_safety_window() {
    // Rentals with recent telemetry must not be cleaned
}

#[tokio::test]
async fn test_gc_handles_ssh_failures_gracefully() {
    // SSH failures should not cause crashes or false positives
}

#[tokio::test]
async fn test_gc_metrics_updated_correctly() {
    // Verify metrics are cleared when rentals are cleaned
}

7.3 Manual Testing Scenarios

  1. Normal Operation:

    • Create rental, let it run, verify GC doesn't touch it
  2. Stuck Provisioning:

    • Create rental, kill deployment, wait 30min, verify GC terminates it
  3. Dead Container:

    • Create rental, manually kill container, verify GC detects and terminates
  4. Billing Safety:

    • Create rental with recent telemetry, verify GC protects it
  5. Old Terminal:

    • Create failed rental, wait 7 days, verify GC deletes it
  6. Dry Run:

    • Enable dry_run mode, verify logging without actions

8. Metrics and Observability

8.1 New Prometheus Metrics

// In metrics.rs
pub struct ValidatorPrometheusMetrics {
    // ... existing metrics

    /// Garbage collection scans
    pub gc_scans_total: Counter,

    /// Rentals processed by category
    pub gc_rentals_processed: CounterVec,  // labels: category

    /// Garbage collection errors
    pub gc_errors_total: Counter,

    /// Last scan duration
    pub gc_scan_duration_seconds: Histogram,

    /// Rentals by classification
    pub gc_rental_classification: GaugeVec,  // labels: category
}
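
A sketch of metric construction and registration using the prometheus crate's standard constructors (metric names as above; the actual wiring should follow the existing pattern in metrics.rs):

use prometheus::{Counter, CounterVec, Histogram, HistogramOpts, Opts, Registry};

fn register_gc_metrics(registry: &Registry) -> prometheus::Result<CounterVec> {
    let gc_scans_total = Counter::new("gc_scans_total", "Garbage collection scans")?;
    let gc_rentals_processed = CounterVec::new(
        Opts::new("gc_rentals_processed", "Rentals processed by category"),
        &["category"],
    )?;
    let gc_scan_duration_seconds = Histogram::with_opts(HistogramOpts::new(
        "gc_scan_duration_seconds",
        "Last scan duration",
    ))?;

    registry.register(Box::new(gc_scans_total))?;
    registry.register(Box::new(gc_rentals_processed.clone()))?;
    registry.register(Box::new(gc_scan_duration_seconds))?;

    // Usage from perform_collection, e.g.:
    // gc_rentals_processed.with_label_values(&["protected"]).inc();
    Ok(gc_rentals_processed)
}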

8.2 Structured Logging

All GC operations must log with:

  • Rental ID
  • Current state
  • Action taken
  • Reason for action
  • Age of rental
  • Container state
  • Billing activity status

Example:

info!(
    rental_id = %rental.rental_id,
    state = ?rental.state,
    age_secs = rental.age().num_seconds(),
    container_state = ?container_state,
    has_billing = billing_active,
    action = "terminate",
    "GC: Terminating orphaned rental"
);

8.3 Alert Conditions

Recommended alerts:

  1. High orphan rate: More than 10% of rentals classified as orphans
  2. GC errors: More than 5 errors per hour
  3. Stuck rentals: Any rental in non-terminal state > 24 hours
  4. Data integrity issues: Any rentals flagged for investigation

9. Rollout Plan

9.1 Phase 1: Implementation (Week 1-2)

Tasks:

  1. Implement core GarbageCollector struct
  2. Implement classification logic
  3. Implement container verification
  4. Implement billing safety checks
  5. Add database migrations
  6. Update RentalInfo with last_telemetry_at
  7. Integrate with RentalManager
  8. Add configuration validation

Deliverables:

  • Fully implemented garbage_collector.rs module
  • Database migration file
  • Configuration structure and defaults
  • Unit tests with >90% coverage

9.2 Phase 2: Integration Testing (Week 2-3)

Tasks:

  1. Write integration tests
  2. Test with mock billing service
  3. Test SSH failure scenarios
  4. Test concurrent operation with health monitor
  5. Performance testing under load
  6. Dry-run testing in staging environment

Deliverables:

  • Complete integration test suite
  • Performance benchmarks
  • Staging environment validation report

9.3 Phase 3: Safe Rollout (Week 3-4)

Tasks:

  1. Deploy with dry_run=true in production
  2. Monitor logs for 48 hours
  3. Verify classification accuracy
  4. Check for false positives
  5. Enable actual cleanup if validated
  6. Monitor for 1 week with active cleanup

Deliverables:

  • Production deployment with dry-run validation
  • Classification accuracy report
  • Full production deployment with monitoring

9.4 Phase 4: Documentation and Handoff (Week 4)

Tasks:

  1. Write operational runbook
  2. Document alert response procedures
  3. Create dashboard for GC monitoring
  4. Train operations team
  5. Document manual override procedures

Deliverables:

  • Operations runbook
  • Monitoring dashboard
  • Training materials
  • Manual intervention procedures

10. Risk Assessment

10.1 Critical Risks

| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| False positive deletion of active rental | CRITICAL | Low | Billing safety window, age thresholds, dry-run testing |
| Billing data loss | CRITICAL | Low | Never delete rentals with recent telemetry, audit logging |
| SSH key compromise | HIGH | Low | Use existing SSH key manager, no new credentials |
| Container cleanup failures | MEDIUM | Medium | Graceful degradation, retry logic, manual cleanup |
| Database corruption | MEDIUM | Low | Transaction safety, comprehensive error handling |
| Performance degradation | LOW | Low | Batch limits, configurable intervals, efficient queries |

10.2 Operational Risks

| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Excessive logging volume | LOW | Medium | Appropriate log levels, sampling |
| Alert fatigue | LOW | Medium | Tuned alert thresholds, actionable alerts only |
| Configuration errors | MEDIUM | Low | Validation on startup, safe defaults |
| Monitoring blind spots | MEDIUM | Low | Comprehensive metrics, regular review |

10.3 Financial Risks

| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Incorrect miner payout | CRITICAL | Very Low | Billing safety window, conservative thresholds |
| Resource cost overruns | LOW | Low | Batch limits, efficient queries |
| Audit trail gaps | MEDIUM | Low | Comprehensive logging before actions |

11. Success Metrics

11.1 Functional Metrics

  • Orphan Detection Rate: Percentage of actual orphans correctly identified
  • False Positive Rate: Target < 0.1% (must be near zero)
  • Cleanup Success Rate: Percentage of cleanups completed without errors
  • Time to Detection: Average time from orphan creation to detection

11.2 Performance Metrics

  • Scan Duration: Target < 30 seconds for 1000 rentals
  • CPU Impact: Target < 1% average CPU usage
  • Memory Impact: Target < 50MB additional memory
  • Database Load: Target < 10 queries per scan

11.3 Reliability Metrics

  • Error Rate: Target < 1 error per 1000 rentals processed
  • Uptime: Target 100% (no GC-related crashes)
  • Recovery Time: From error to normal operation < 1 minute

12. Open Questions and Decisions

12.1 Resolved Decisions

  1. Billing Safety Window: 48 hours (conservative)
  2. Terminal Retention: 7 days (balance audit needs vs storage)
  3. Scan Interval: 1 hour (sufficient for most orphans)
  4. Dry Run Default: False (but required for initial deployment)

12.2 Questions Requiring Clarification

  1. Payment Settlement Process:

    • Where/how are miner payments processed?
    • Can we query payment status before cleanup?
    • Recommendation: Add payment service integration or extend safety window
  2. Manual Override Mechanism:

    • Should there be a way to protect specific rentals from GC?
    • Recommendation: Add gc_exempt flag to rental metadata (sketch after this list)
  3. Audit Trail Retention:

    • Should deleted rentals be backed up separately?
    • Recommendation: Add audit table or external log archival
  4. Cross-Service Coordination:

    • Does billing service need notification before deletion?
    • Recommendation: Add billing service health check before GC actions
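
If the gc_exempt flag (question 2 above) is adopted, the check is a one-line guard at the top of classify_rental. A sketch; gc_exempt is a hypothetical field, shown only to illustrate the shape:

// Hypothetical manual-override guard (gc_exempt is not yet a RentalInfo field)
if rental.gc_exempt {
    debug!("Rental {} is GC-exempt, skipping", rental.rental_id);
    return Ok(OrphanCategory::Protected);
}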

13. Implementation Checklist

Phase 1: Core Implementation

  • Create garbage_collector.rs module
  • Implement OrphanCategory enum
  • Implement ContainerState enum
  • Implement GarbageCollectorConfig struct with defaults
  • Implement RentalGarbageCollector struct
  • Implement classification logic
  • Implement container verification
  • Implement billing safety checks
  • Implement safe cleanup methods
  • Add unit tests (>90% coverage)

Phase 2: Integration

  • Add last_telemetry_at field to RentalInfo
  • Update billing monitor to track telemetry timestamps
  • Create database migration for new column and indexes
  • Add GC configuration to ValidatorConfig
  • Integrate GC into RentalManager
  • Add GC metrics to ValidatorPrometheusMetrics
  • Implement graceful shutdown in Drop
  • Add integration tests

Phase 3: Safety and Validation

  • Implement dry-run mode
  • Add configuration validation
  • Add comprehensive logging
  • Create monitoring dashboard
  • Write operational runbook
  • Conduct security review
  • Performance testing
  • Load testing

Phase 4: Deployment

  • Deploy with dry_run=true to staging
  • Monitor and validate for 48 hours
  • Review classification accuracy
  • Fix any issues identified
  • Deploy with dry_run=true to production
  • Monitor and validate for 48 hours
  • Enable actual cleanup if validated
  • Monitor for 1 week
  • Document lessons learned

14. Appendix

A. References

  • Health Monitoring: crates/basilica-validator/src/rental/monitoring.rs
  • Billing Telemetry: crates/basilica-validator/src/rental/billing.rs
  • Rental Manager: crates/basilica-validator/src/rental/mod.rs
  • Container Client: crates/basilica-validator/src/rental/container_client.rs
  • Persistence Layer: crates/basilica-validator/src/persistence/

B. Glossary

  • Orphan Rental: A rental record without valid container or stuck in non-terminal state
  • Terminal State: Stopped or Failed rental states (final states)
  • Billing Safety Window: Time period during which rentals with telemetry are protected
  • Dry Run: Mode where GC logs actions without executing them
  • Grace Period: Minimum age before rental can be considered for cleanup

C. Contact and Escalation

For questions or issues with this implementation:

  1. Review this architecture document
  2. Check operational runbook (to be created)
  3. Review code comments and tests
  4. Escalate to validator team lead if financial safety concerns arise

Document Status: Ready for Implementation Review
Next Review Date: Upon completion of Phase 1
Approval Required: Technical Lead, Financial Systems Owner

Labels: enhancement (New feature or request)