[FEATURE] Rental Orphan Garbage Collection System #217

Description

Rental Orphan Garbage Collection System Architecture

Version: 1.0
Date: 2025-10-29
Status: Design Document
Author: System Architecture Team
Classification: CRITICAL - Financial System Component

Executive Summary

This document specifies the architecture for an orphan rental garbage collection system within the Basilica validator. This is a financial-critical feature that directly affects miner payouts and billing accuracy. The system must operate with extreme reliability, conservative decision-making, and comprehensive audit logging.

Key Principles:

  • Safety First: Never delete rentals with active billing or recent telemetry
  • Conservative: When in doubt, preserve the rental record
  • Auditable: Log all decisions and actions comprehensively
  • Non-disruptive: Operate independently without affecting active rentals
  • Simple: Follow KISS, DRY, and SOLID principles

1. Problem Statement

1.1 Current Issues

The validator rental system can create orphaned rentals in several scenarios:

  1. Database Orphans: Rental records in database without corresponding containers

    • Container deployment failures that don't update state
    • Manual container deletion bypassing rental manager
    • SSH connection failures preventing container verification
  2. Container Orphans: Running containers without rental records

    • Database corruption or data loss
    • Manual rental record deletion
    • Race conditions during deployment
  3. Stuck State Rentals: Rentals in non-terminal states indefinitely

    • Rentals stuck in Provisioning after deployment failures
    • Rentals stuck in Stopping after termination failures
    • Rentals in Active with dead containers
  4. Stale References: Rentals referencing deleted miners or nodes

    • Miner deregistration without rental cleanup
    • Node removal without rental termination

1.2 Financial Impact

Critical Considerations:

  • Orphaned active rentals may generate incorrect billing telemetry
  • Stuck rentals block node availability for legitimate rentals
  • Missing container cleanup wastes miner resources
  • Incorrect state transitions affect miner scoring and rewards
  • Billing service requires accurate rental lifecycle events

1.3 Success Criteria

  1. Accuracy: Correct classification of every rental scanned, with zero false positives (a healthy rental must never be flagged as an orphan)
  2. Billing Safety: Never interfere with rentals having billing activity in last 48 hours
  3. Performance: Minimal impact on validator operations (<1% CPU/memory)
  4. Auditability: Complete log trail for all cleanup actions
  5. Reliability: Graceful handling of SSH failures and network issues

2. System Architecture

2.1 Component Overview

┌─────────────────────────────────────────────────────────────┐
│                     RentalManager                            │
│  ┌────────────────┐  ┌────────────────┐  ┌───────────────┐ │
│  │ Health Monitor │  │ Billing Monitor│  │ GC Monitor    │ │
│  │  (30s loop)    │  │  (60s loop)    │  │ (3600s loop)  │ │
│  └────────────────┘  └────────────────┘  └───────────────┘ │
└─────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
┌───────▼────────┐   ┌────────▼─────────┐  ┌───────▼──────┐
│  Persistence   │   │ ContainerClient  │  │   Metrics    │
│   (SQLite)     │   │  (SSH + Docker)  │  │ (Prometheus) │
└────────────────┘   └──────────────────┘  └──────────────┘

2.2 New Component: RentalGarbageCollector

File: crates/basilica-validator/src/rental/garbage_collector.rs

Responsibilities:

  1. Periodically scan for orphaned rental records
  2. Verify container states against database records
  3. Safely transition rentals to terminal states
  4. Clean up container resources when appropriate
  5. Record metrics and audit logs
  6. Respect billing integrity constraints

Non-Responsibilities:

  • Does NOT handle billing finalization (handled by billing service)
  • Does NOT make payment decisions
  • Does NOT interfere with health monitoring
  • Does NOT handle active rental operations

2.3 Architecture Pattern

Follows established monitoring pattern:

pub struct RentalGarbageCollector {
    persistence: Arc<SimplePersistence>,
    ssh_key_manager: Arc<ValidatorSshKeyManager>,
    metrics: Arc<ValidatorPrometheusMetrics>,
    config: GarbageCollectorConfig,
    cancellation_token: CancellationToken,
}

Key Design Decisions:

  1. Independent Loop: Separate from health and billing monitors
  2. Read-Heavy: Primarily reads database and container state
  3. Conservative Updates: Only updates obviously orphaned rentals
  4. Idempotent: Can run multiple times without adverse effects
  5. Fail-Safe: Errors in processing one rental don't affect others

3. Orphan Rental Criteria

3.1 Classification System

Orphan rentals are classified into three risk categories:

Category A: Safe to Clean (Low Risk)

Rentals meeting ALL criteria:

  • State is Failed or Stopped (terminal states)
  • Age > 7 days (terminal_retention_days; measured from creation via rental.age() in the code below)
  • No billing telemetry in last 48 hours
  • Container verified as non-existent or stopped

Action: Delete from database after audit logging

Category B: Needs Termination (Medium Risk)

Rentals meeting ANY of these state criteria, and (per Category X below) having no billing telemetry within the 48-hour billing safety window:

  • State is Provisioning AND age > 30 minutes
  • State is Stopping AND age > 15 minutes
  • State is Active AND container verified as not running

Action: Transition to Failed state, attempt container cleanup

Category C: Needs Investigation (High Risk)

Rentals meeting ANY of these criteria:

  • References non-existent miner_id or node_id
  • Missing required fields (container_id, ssh_credentials)
  • Database record corruption

Action: Log warning, transition to Failed state, manual review required

Category X: Never Touch (Protected)

Rentals with ANY of these characteristics:

  • Billing telemetry in last 48 hours
  • Age < 5 minutes (grace period for deployment)
  • State is Active and container is running
  • Currently being processed by health monitor

Action: Skip entirely, preserve existing state

3.2 Orphan Detection Algorithm

async fn classify_rental(&self, rental: &RentalInfo) -> Result<OrphanCategory> {
    // CRITICAL: Never touch recent rentals (deployment grace period)
    if rental.age() < Duration::seconds(300) {
        return Ok(OrphanCategory::Protected);
    }

    // CRITICAL: Check billing telemetry
    if self.has_recent_billing_activity(rental).await? {
        return Ok(OrphanCategory::Protected);
    }

    // Verify container state
    let container_state = self.verify_container_state(rental).await;

    match (&rental.state, container_state) {
        // Terminal states old enough to delete
        (RentalState::Failed | RentalState::Stopped, ContainerState::NotFound)
            if rental.age() > Duration::days(7) => {
                Ok(OrphanCategory::SafeToClean)
            }

        // Stuck in provisioning
        (RentalState::Provisioning, _)
            if rental.age() > Duration::seconds(1800) => {
                Ok(OrphanCategory::NeedsTermination)
            }

        // Stuck in stopping
        (RentalState::Stopping, _)
            if rental.age() > Duration::seconds(900) => {
                Ok(OrphanCategory::NeedsTermination)
            }

        // Active but container dead
        (RentalState::Active, ContainerState::NotRunning | ContainerState::NotFound) => {
            Ok(OrphanCategory::NeedsTermination)
        }

        // Data integrity issues
        _ if self.has_data_integrity_issues(rental) => {
            Ok(OrphanCategory::NeedsInvestigation)
        }

        // Everything else is protected (including Unknown container state)
        _ => Ok(OrphanCategory::Protected),
    }
}

3.3 Container State Verification

enum ContainerState {
    Running,      // Container exists and is running
    NotRunning,   // Container exists but stopped
    NotFound,     // Container does not exist
    Unknown,      // Cannot determine (SSH failure, etc.)
}

async fn verify_container_state(&self, rental: &RentalInfo) -> ContainerState {
    // Get validator's SSH key
    let validator_key_path = match self.ssh_key_manager.get_persistent_key() {
        Some((_, path)) => path,
        None => return ContainerState::Unknown,
    };

    // Create container client
    let client = match ContainerClient::new(
        rental.ssh_credentials.clone(),
        Some(validator_key_path),
    ) {
        Ok(c) => c,
        Err(e) => {
            warn!("Failed to create container client: {}", e);
            return ContainerState::Unknown;
        }
    };

    // Check container status with timeout
    match tokio::time::timeout(
        std::time::Duration::from_secs(10),
        client.get_container_status(&rental.container_id)
    ).await {
        Ok(Ok(status)) => {
            if status.state == "running" {
                ContainerState::Running
            } else {
                ContainerState::NotRunning
            }
        }
        // Command ran but the lookup failed: container assumed gone
        Ok(Err(_)) => ContainerState::NotFound,
        // Timeout: cannot determine state, fail safe
        Err(_) => ContainerState::Unknown,
    }
}

4. Billing Safety System

4.1 Billing Activity Detection

Critical Requirement: Never clean rentals with recent billing activity.

async fn has_recent_billing_activity(&self, rental_id: &str) -> Result<bool> {
    // Check if billing telemetry was collected recently
    // This requires adding last_telemetry_at field to RentalInfo
    // or querying billing service directly

    let rental = self.persistence.load_rental(rental_id).await?;

    if let Some(last_telemetry) = rental.last_telemetry_at {
        let age = Utc::now() - last_telemetry;
        if age < Duration::hours(48) {
            return Ok(true);
        }
    }

    Ok(false)
}

4.2 Safe State Transitions

All state transitions must preserve billing integrity:

async fn transition_to_failed(&self, rental: &RentalInfo) -> Result<()> {
    // CRITICAL: Only transition if no recent billing
    if self.has_recent_billing_activity(&rental.rental_id).await? {
        warn!(
            "Refusing to transition rental {} with recent billing activity",
            rental.rental_id
        );
        return Ok(());
    }

    let mut updated = rental.clone();
    updated.state = RentalState::Failed;
    updated.updated_at = Some(Utc::now());
    updated.terminated_at = Some(Utc::now());
    updated.termination_reason = Some("Garbage collected: orphaned rental".to_string());

    // Audit log BEFORE state change
    info!(
        "GC: Transitioning rental {} from {:?} to Failed (age: {:?}, reason: orphaned)",
        rental.rental_id,
        rental.state,
        rental.age()
    );

    self.persistence.save_rental(&updated).await?;

    // Clear metrics
    self.clear_rental_metrics(rental);

    Ok(())
}

4.3 Container Cleanup Protocol

async fn cleanup_container(&self, rental: &RentalInfo) -> Result<()> {
    let validator_key_path = self.ssh_key_manager
        .get_persistent_key()
        .ok_or_else(|| anyhow!("No SSH key"))?
        .1;

    let client = ContainerClient::new(
        rental.ssh_credentials.clone(),
        Some(validator_key_path),
    )?;

    // Try graceful stop first
    match tokio::time::timeout(
        std::time::Duration::from_secs(30),
        client.stop_container(&rental.container_id, false)
    ).await {
        Ok(Ok(_)) => {
            info!("GC: Gracefully stopped container {}", rental.container_id);
        }
        Ok(Err(e)) => {
            warn!("GC: Failed to stop container gracefully: {}", e);
        }
        Err(_) => {
            warn!("GC: Timeout stopping container, attempting force stop");
            // Force stop on timeout
            let _ = client.stop_container(&rental.container_id, true).await;
        }
    }

    // Remove container
    match client.remove_container(&rental.container_id).await {
        Ok(_) => {
            info!("GC: Removed container {}", rental.container_id);
        }
        Err(e) => {
            warn!("GC: Failed to remove container: {}", e);
        }
    }

    Ok(())
}

5. Configuration Specification

5.1 Configuration Structure

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct GarbageCollectorConfig {
    /// Enable garbage collection
    #[serde(default = "default_enabled")]
    pub enabled: bool,

    /// Scan interval in seconds (default: 1 hour)
    #[serde(default = "default_scan_interval_secs")]
    pub scan_interval_secs: u64,

    /// Minimum age before considering rental for cleanup (seconds)
    #[serde(default = "default_min_age_secs")]
    pub min_age_secs: u64,

    /// Age threshold for stuck provisioning rentals (seconds)
    #[serde(default = "default_provisioning_timeout_secs")]
    pub provisioning_timeout_secs: u64,

    /// Age threshold for stuck stopping rentals (seconds)
    #[serde(default = "default_stopping_timeout_secs")]
    pub stopping_timeout_secs: u64,

    /// Age threshold for deleting terminal state rentals (days)
    #[serde(default = "default_terminal_retention_days")]
    pub terminal_retention_days: u32,

    /// Billing safety window - never touch rentals with telemetry in this window (hours)
    #[serde(default = "default_billing_safety_window_hours")]
    pub billing_safety_window_hours: u32,

    /// Container verification timeout (seconds)
    #[serde(default = "default_container_check_timeout_secs")]
    pub container_check_timeout_secs: u64,

    /// Maximum rentals to process per scan
    #[serde(default = "default_max_batch_size")]
    pub max_batch_size: usize,

    /// Enable dry-run mode (log actions without executing)
    #[serde(default = "default_dry_run")]
    pub dry_run: bool,
}

// Default values
fn default_enabled() -> bool { true }
fn default_scan_interval_secs() -> u64 { 3600 }  // 1 hour
fn default_min_age_secs() -> u64 { 300 }  // 5 minutes grace period
fn default_provisioning_timeout_secs() -> u64 { 1800 }  // 30 minutes
fn default_stopping_timeout_secs() -> u64 { 900 }  // 15 minutes
fn default_terminal_retention_days() -> u32 { 7 }  // 1 week
fn default_billing_safety_window_hours() -> u32 { 48 }  // 2 days
fn default_container_check_timeout_secs() -> u64 { 10 }
fn default_max_batch_size() -> usize { 100 }
fn default_dry_run() -> bool { false }

5.2 TOML Configuration Example

[garbage_collector]
enabled = true
scan_interval_secs = 3600
min_age_secs = 300
provisioning_timeout_secs = 1800
stopping_timeout_secs = 900
terminal_retention_days = 7
billing_safety_window_hours = 48
container_check_timeout_secs = 10
max_batch_size = 100
dry_run = false

5.3 Configuration Validation

impl GarbageCollectorConfig {
    pub fn validate(&self) -> Result<(), ConfigurationError> {
        if self.scan_interval_secs == 0 {
            return Err(ConfigurationError::InvalidValue {
                key: "garbage_collector.scan_interval_secs".to_string(),
                value: "0".to_string(),
                reason: "Scan interval must be greater than 0".to_string(),
            });
        }

        if self.min_age_secs < 60 {
            return Err(ConfigurationError::InvalidValue {
                key: "garbage_collector.min_age_secs".to_string(),
                value: self.min_age_secs.to_string(),
                reason: "Minimum age must be at least 60 seconds".to_string(),
            });
        }

        if self.billing_safety_window_hours < 24 {
            return Err(ConfigurationError::InvalidValue {
                key: "garbage_collector.billing_safety_window_hours".to_string(),
                value: self.billing_safety_window_hours.to_string(),
                reason: "Billing safety window must be at least 24 hours".to_string(),
            });
        }

        Ok(())
    }
}
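
Validation should run once at startup, before any monitors are spawned. A minimal sketch, assuming ValidatorConfig exposes the [garbage_collector] section as a garbage_collector field (as in Section 6.3); validate_gc_config is a hypothetical helper:

// Hypothetical startup hook: reject invalid GC configuration before spawning monitors
fn validate_gc_config(config: &ValidatorConfig) -> Result<(), ConfigurationError> {
    config.garbage_collector.validate()?;
    if config.garbage_collector.dry_run {
        tracing::info!("GC configured in dry-run mode: actions are logged, not executed");
    }
    Ok(())
}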

6. Implementation Specification

6.1 Module Structure

crates/basilica-validator/src/rental/
├── mod.rs                    (RentalManager with GC integration)
├── garbage_collector.rs      (NEW: Core GC implementation)
├── monitoring.rs             (Existing: Health monitoring)
├── billing.rs                (Existing: Billing telemetry)
├── types.rs                  (Updated: Add OrphanCategory, last_telemetry_at)
├── container_client.rs       (Existing: Container operations)
└── deployment.rs             (Existing: Container deployment)

6.2 Core Implementation

File: garbage_collector.rs

use anyhow::{Context, Result};
use chrono::{Duration, Utc};
use std::sync::Arc;
use tokio::time::interval;
use tokio_util::sync::CancellationToken;
use tracing::{debug, error, info, warn};

use super::container_client::ContainerClient;
use super::types::{RentalInfo, RentalState};
use crate::metrics::ValidatorPrometheusMetrics;
use crate::persistence::SimplePersistence;
use crate::ssh::ValidatorSshKeyManager;

/// Orphan rental categories
#[derive(Debug, Clone, PartialEq)]
pub enum OrphanCategory {
    /// Safe to delete - terminal state, old, no billing activity
    SafeToClean,
    /// Needs state transition to Failed and container cleanup
    NeedsTermination,
    /// Data integrity issues requiring manual investigation
    NeedsInvestigation,
    /// Protected - never touch
    Protected,
}

/// Container state as determined by verification
#[derive(Debug, Clone, PartialEq)]
pub enum ContainerState {
    Running,
    NotRunning,
    NotFound,
    Unknown,
}

// GarbageCollectorConfig is specified in Section 5.1 (serde derives and field
// docs omitted here); the struct definition is not repeated, only its Default
// implementation.

impl Default for GarbageCollectorConfig {
    fn default() -> Self {
        Self {
            enabled: true,
            scan_interval_secs: 3600,
            min_age_secs: 300,
            provisioning_timeout_secs: 1800,
            stopping_timeout_secs: 900,
            terminal_retention_days: 7,
            billing_safety_window_hours: 48,
            container_check_timeout_secs: 10,
            max_batch_size: 100,
            dry_run: false,
        }
    }
}

/// Rental garbage collector
#[derive(Clone)]
pub struct RentalGarbageCollector {
    persistence: Arc<SimplePersistence>,
    ssh_key_manager: Arc<ValidatorSshKeyManager>,
    metrics: Arc<ValidatorPrometheusMetrics>,
    config: GarbageCollectorConfig,
    cancellation_token: CancellationToken,
}

impl RentalGarbageCollector {
    /// Create new garbage collector
    pub fn new(
        persistence: Arc<SimplePersistence>,
        ssh_key_manager: Arc<ValidatorSshKeyManager>,
        metrics: Arc<ValidatorPrometheusMetrics>,
        config: GarbageCollectorConfig,
    ) -> Self {
        Self {
            persistence,
            ssh_key_manager,
            metrics,
            config,
            cancellation_token: CancellationToken::new(),
        }
    }

    /// Start garbage collection loop
    pub fn start(&self) {
        let collector = self.clone();
        tokio::spawn(async move {
            collector.collection_loop().await;
        });
    }

    /// Stop garbage collection
    pub fn stop(&self) {
        self.cancellation_token.cancel();
    }

    /// Main collection loop
    async fn collection_loop(&self) {
        let mut scan_interval = interval(
            std::time::Duration::from_secs(self.config.scan_interval_secs)
        );

        info!("Rental garbage collector started (dry_run: {})", self.config.dry_run);

        loop {
            tokio::select! {
                _ = self.cancellation_token.cancelled() => {
                    info!("Rental garbage collector stopped");
                    break;
                }
                _ = scan_interval.tick() => {
                    if let Err(e) = self.perform_collection().await {
                        error!("Garbage collection error: {}", e);
                    }
                }
            }
        }
    }

    /// Perform garbage collection scan
    async fn perform_collection(&self) -> Result<()> {
        info!("Starting garbage collection scan");

        // Query all non-terminated rentals
        let rentals = self.persistence
            .query_non_terminated_rentals()
            .await
            .context("Failed to query rentals")?;

        debug!("Scanning {} non-terminated rentals", rentals.len());

        let mut stats = CollectionStats::default();

        // Process rentals one by one
        for rental in rentals.iter().take(self.config.max_batch_size) {
            match self.process_rental(rental, &mut stats).await {
                Ok(_) => {}
                Err(e) => {
                    error!(
                        "Failed to process rental {}: {}",
                        rental.rental_id, e
                    );
                    stats.errors += 1;
                }
            }
        }

        info!(
            "Garbage collection scan complete: {} scanned, {} cleaned, {} terminated, \
             {} investigated, {} protected, {} errors",
            stats.scanned,
            stats.cleaned,
            stats.terminated,
            stats.investigated,
            stats.protected,
            stats.errors
        );

        Ok(())
    }

    /// Process a single rental
    async fn process_rental(
        &self,
        rental: &RentalInfo,
        stats: &mut CollectionStats,
    ) -> Result<()> {
        stats.scanned += 1;

        // Classify the rental
        let category = self.classify_rental(rental).await?;

        match category {
            OrphanCategory::SafeToClean => {
                stats.cleaned += 1;
                self.handle_safe_to_clean(rental).await?;
            }
            OrphanCategory::NeedsTermination => {
                stats.terminated += 1;
                self.handle_needs_termination(rental).await?;
            }
            OrphanCategory::NeedsInvestigation => {
                stats.investigated += 1;
                self.handle_needs_investigation(rental).await?;
            }
            OrphanCategory::Protected => {
                stats.protected += 1;
                debug!("Rental {} is protected, skipping", rental.rental_id);
            }
        }

        Ok(())
    }

    /// Classify a rental into orphan category
    async fn classify_rental(&self, rental: &RentalInfo) -> Result<OrphanCategory> {
        // CRITICAL: Never touch recent rentals
        let age = rental.age();
        if age < Duration::seconds(self.config.min_age_secs as i64) {
            return Ok(OrphanCategory::Protected);
        }

        // CRITICAL: Check billing telemetry
        if self.has_recent_billing_activity(rental).await? {
            return Ok(OrphanCategory::Protected);
        }

        // Verify container state
        let container_state = self.verify_container_state(rental).await;

        // Classification logic
        match (&rental.state, container_state, age) {
            // Terminal states old enough to delete
            (RentalState::Failed | RentalState::Stopped, ContainerState::NotFound, age)
                if age > Duration::days(self.config.terminal_retention_days as i64) => {
                    Ok(OrphanCategory::SafeToClean)
                }

            // Stuck in provisioning
            (RentalState::Provisioning, _, age)
                if age > Duration::seconds(self.config.provisioning_timeout_secs as i64) => {
                    Ok(OrphanCategory::NeedsTermination)
                }

            // Stuck in stopping
            (RentalState::Stopping, _, age)
                if age > Duration::seconds(self.config.stopping_timeout_secs as i64) => {
                    Ok(OrphanCategory::NeedsTermination)
                }

            // Active but container dead
            (RentalState::Active, ContainerState::NotRunning | ContainerState::NotFound, _) => {
                Ok(OrphanCategory::NeedsTermination)
            }

            // Data integrity issues
            _ if self.has_data_integrity_issues(rental) => {
                Ok(OrphanCategory::NeedsInvestigation)
            }

            // Everything else is protected
            _ => Ok(OrphanCategory::Protected),
        }
    }

    /// Check if rental has recent billing activity
    async fn has_recent_billing_activity(&self, rental: &RentalInfo) -> Result<bool> {
        // Check last_telemetry_at field if available
        if let Some(last_telemetry) = rental.last_telemetry_at {
            let age = Utc::now() - last_telemetry;
            let threshold = Duration::hours(self.config.billing_safety_window_hours as i64);

            if age < threshold {
                debug!(
                    "Rental {} has recent billing activity ({:?} ago)",
                    rental.rental_id, age
                );
                return Ok(true);
            }
        }

        Ok(false)
    }

    /// Verify container state via SSH
    async fn verify_container_state(&self, rental: &RentalInfo) -> ContainerState {
        // Full implementation shown in Section 3.3; Unknown is the fail-safe placeholder
        ContainerState::Unknown
    }

    /// Check for data integrity issues
    fn has_data_integrity_issues(&self, rental: &RentalInfo) -> bool {
        rental.container_id.is_empty()
            || rental.node_id.is_empty()
            || rental.ssh_credentials.host.is_empty()
    }

    /// Handle rentals safe to clean
    async fn handle_safe_to_clean(&self, rental: &RentalInfo) -> Result<()> {
        info!(
            "GC: Cleaning rental {} (state: {:?}, age: {:?})",
            rental.rental_id,
            rental.state,
            rental.age()
        );

        if self.config.dry_run {
            info!("GC: [DRY RUN] Would delete rental {}", rental.rental_id);
            return Ok(());
        }

        // Delete from database
        self.persistence.delete_rental(&rental.rental_id).await?;

        info!("GC: Deleted rental {}", rental.rental_id);
        Ok(())
    }

    /// Handle rentals needing termination
    async fn handle_needs_termination(&self, rental: &RentalInfo) -> Result<()> {
        // See the sketch following this listing (wires Sections 4.2 and 4.3)
        Ok(())
    }

    /// Handle rentals needing investigation
    async fn handle_needs_investigation(&self, rental: &RentalInfo) -> Result<()> {
        // See the sketch following this listing
        Ok(())
    }
}

/// Collection statistics
#[derive(Debug, Default)]
struct CollectionStats {
    scanned: usize,
    cleaned: usize,
    terminated: usize,
    investigated: usize,
    protected: usize,
    errors: usize,
}
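
The two handler stubs above reduce to wiring together the Section 4 methods. A sketch, assuming container cleanup runs before the state transition (the ordering is a design choice, not settled):

async fn handle_needs_termination(&self, rental: &RentalInfo) -> Result<()> {
    if self.config.dry_run {
        info!("GC: [DRY RUN] Would terminate rental {}", rental.rental_id);
        return Ok(());
    }

    // Best-effort container cleanup (Section 4.3); failure is logged, not fatal
    if let Err(e) = self.cleanup_container(rental).await {
        warn!("GC: Container cleanup failed for {}: {}", rental.rental_id, e);
    }

    // State transition (Section 4.2) re-checks billing safety before writing
    self.transition_to_failed(rental).await
}

async fn handle_needs_investigation(&self, rental: &RentalInfo) -> Result<()> {
    // Category C action: log loudly, transition to Failed, leave a trail for manual review
    warn!(
        "GC: Rental {} has data integrity issues (state: {:?}), flagging for manual review",
        rental.rental_id, rental.state
    );

    if self.config.dry_run {
        info!("GC: [DRY RUN] Would mark rental {} as Failed", rental.rental_id);
        return Ok(());
    }

    self.transition_to_failed(rental).await
}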

6.3 Integration with RentalManager

File: rental/mod.rs (modifications)

pub struct RentalManager {
    // ... existing fields
    garbage_collector: Option<Arc<RentalGarbageCollector>>,
}

impl RentalManager {
    pub async fn create(
        config: &ValidatorConfig,
        persistence: Arc<SimplePersistence>,
        metrics: Arc<ValidatorPrometheusMetrics>,
    ) -> Result<Self> {
        // ... existing initialization

        // Create garbage collector if enabled
        let garbage_collector = if config.garbage_collector.enabled {
            let gc = RentalGarbageCollector::new(
                persistence.clone(),
                ssh_key_manager.clone(),
                metrics.clone(),
                config.garbage_collector.clone(),
            );
            Some(Arc::new(gc))
        } else {
            None
        };

        Ok(Self {
            // ... existing fields
            garbage_collector,
        })
    }

    pub fn start(&self) {
        // ... existing monitor starts

        if let Some(gc) = &self.garbage_collector {
            gc.start();
        }
    }
}

impl Drop for RentalManager {
    fn drop(&mut self) {
        // ... existing cleanup

        if let Some(gc) = &self.garbage_collector {
            gc.stop();
        }
    }
}
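
Note that stop() only cancels the CancellationToken and never blocks, so calling it from Drop is safe; the spawned collection loop observes the cancellation at its next tokio::select! iteration.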

6.4 Database Schema Updates

New Migration: migrations/XXX_add_rental_telemetry_tracking.sql

-- Add last_telemetry_at column to track billing activity
ALTER TABLE rentals ADD COLUMN last_telemetry_at TEXT;

-- Add indexes for efficient orphan queries
CREATE INDEX IF NOT EXISTS idx_rentals_state ON rentals(state);
CREATE INDEX IF NOT EXISTS idx_rentals_created_at ON rentals(created_at);
CREATE INDEX IF NOT EXISTS idx_rentals_updated_at ON rentals(updated_at);
CREATE INDEX IF NOT EXISTS idx_rentals_node_id ON rentals(node_id);
CREATE INDEX IF NOT EXISTS idx_rentals_miner_id ON rentals(miner_id);

-- Composite index for garbage collection queries
CREATE INDEX IF NOT EXISTS idx_rentals_gc_scan
  ON rentals(state, created_at, last_telemetry_at)
  WHERE state NOT IN ('stopped', 'failed');
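
For reference, the orphan scan should be a single query served by the composite index above. A sketch of what query_non_terminated_rentals might issue (applying the batch limit in SQL is an open choice; the code in Section 6.2 currently applies it in Rust via .take(max_batch_size)):

-- Candidate rentals for GC classification: non-terminal states only.
-- Matches the predicate of idx_rentals_gc_scan.
SELECT *
FROM rentals
WHERE state NOT IN ('stopped', 'failed')
ORDER BY created_at ASC
LIMIT 100;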

6.5 Types Updates

File: rental/types.rs (additions)

impl RentalInfo {
    /// Get rental age since creation
    pub fn age(&self) -> Duration {
        Utc::now() - self.created_at
    }

    /// Get time since last update
    pub fn staleness(&self) -> Option<Duration> {
        self.updated_at.map(|updated| Utc::now() - updated)
    }

    /// Check if rental is in terminal state
    pub fn is_terminal(&self) -> bool {
        matches!(self.state, RentalState::Stopped | RentalState::Failed)
    }
}

// Add last_telemetry_at field to RentalInfo
pub struct RentalInfo {
    // ... existing fields

    /// Timestamp of last billing telemetry collection
    pub last_telemetry_at: Option<DateTime<Utc>>,
}
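
For reference, the rental states assumed throughout this document (this is the existing enum in types.rs; the variant set below is inferred from usage in this document, not a new definition):

/// Rental lifecycle states referenced by the GC classification logic
pub enum RentalState {
    Provisioning, // container deployment in progress
    Active,       // container running, rental live
    Stopping,     // termination requested, cleanup in progress
    Stopped,      // terminal: cleanly stopped
    Failed,       // terminal: deployment or runtime failure
}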

6.6 Billing Integration

File: rental/billing.rs (modification)

impl RentalBillingMonitor {
    async fn collect_rental_telemetry(&self, rental: &RentalInfo) -> Result<()> {
        // ... existing collection logic

        // Update last_telemetry_at timestamp
        let mut updated = rental.clone();
        updated.last_telemetry_at = Some(Utc::now());

        self.persistence.save_rental(&updated).await?;

        Ok(())
    }
}
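
Since collect_rental_telemetry runs on the 60-second billing loop, this adds one rental-row write per rental per minute. If that write volume proves significant, a targeted UPDATE of the last_telemetry_at column alone would be a lighter alternative to rewriting the full record.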

7. Testing Strategy

7.1 Unit Tests

File: rental/garbage_collector.rs (test module)

#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_classify_recent_rental_as_protected() {
        // Rentals < 5 minutes old should never be touched
    }

    #[tokio::test]
    async fn test_classify_rental_with_recent_billing_as_protected() {
        // Rentals with telemetry in last 48h should be protected
    }

    #[tokio::test]
    async fn test_classify_stuck_provisioning_rental() {
        // Provisioning > 30min should need termination
    }

    #[tokio::test]
    async fn test_classify_old_terminal_rental_as_cleanable() {
        // Failed/Stopped > 7 days should be cleanable
    }

    #[tokio::test]
    async fn test_classify_active_with_dead_container() {
        // Active state but container not found should need termination
    }

    #[tokio::test]
    async fn test_has_data_integrity_issues() {
        // Missing required fields should be flagged
    }

    #[tokio::test]
    async fn test_safe_deletion_in_dry_run_mode() {
        // Dry run should log but not delete
    }
}
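
Most of these tests require RentalInfo fixtures and a mock persistence layer. One check that can be written directly against the structures defined above is configuration validation (synchronous, so a plain #[test] suffices):

#[test]
fn test_config_validation_rejects_short_billing_window() {
    // 12h is below the 24h minimum enforced in Section 5.3
    let config = GarbageCollectorConfig {
        billing_safety_window_hours: 12,
        ..Default::default()
    };
    assert!(config.validate().is_err());

    // The defaults (48h window) must pass validation
    assert!(GarbageCollectorConfig::default().validate().is_ok());
}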

7.2 Integration Tests

File: tests/rental_garbage_collection_integration.rs

#[tokio::test]
async fn test_full_garbage_collection_cycle() {
    // 1. Create test rentals in various states
    // 2. Start garbage collector
    // 3. Wait for scan
    // 4. Verify correct classification and actions
}

#[tokio::test]
async fn test_gc_preserves_active_rentals() {
    // Ensure active rentals with running containers are not touched
}

#[tokio::test]
async fn test_gc_respects_billing_safety_window() {
    // Rentals with recent telemetry must not be cleaned
}

#[tokio::test]
async fn test_gc_handles_ssh_failures_gracefully() {
    // SSH failures should not cause crashes or false positives
}

#[tokio::test]
async fn test_gc_metrics_updated_correctly() {
    // Verify metrics are cleared when rentals are cleaned
}

7.3 Manual Testing Scenarios

  1. Normal Operation:

    • Create rental, let it run, verify GC doesn't touch it
  2. Stuck Provisioning:

    • Create rental, kill deployment, wait 30min, verify GC terminates it
  3. Dead Container:

    • Create rental, manually kill container, verify GC detects and terminates
  4. Billing Safety:

    • Create rental with recent telemetry, verify GC protects it
  5. Old Terminal:

    • Create failed rental, wait 7 days, verify GC deletes it
  6. Dry Run:

    • Enable dry_run mode, verify logging without actions

8. Metrics and Observability

8.1 New Prometheus Metrics

// In metrics.rs
pub struct ValidatorPrometheusMetrics {
    // ... existing metrics

    /// Garbage collection scans
    pub gc_scans_total: Counter,

    /// Rentals processed by category
    pub gc_rentals_processed: CounterVec,  // labels: category

    /// Garbage collection errors
    pub gc_errors_total: Counter,

    /// Last scan duration
    pub gc_scan_duration_seconds: Histogram,

    /// Rentals by classification
    pub gc_rental_classification: GaugeVec,  // labels: category
}
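
A sketch of metric construction and registration using the prometheus crate's standard constructors (metric names as above; the actual wiring should follow the existing pattern in metrics.rs):

use prometheus::{Counter, CounterVec, Histogram, HistogramOpts, Opts, Registry};

fn register_gc_metrics(registry: &Registry) -> prometheus::Result<CounterVec> {
    let gc_scans_total = Counter::new("gc_scans_total", "Garbage collection scans")?;
    let gc_rentals_processed = CounterVec::new(
        Opts::new("gc_rentals_processed", "Rentals processed by category"),
        &["category"],
    )?;
    let gc_scan_duration_seconds = Histogram::with_opts(HistogramOpts::new(
        "gc_scan_duration_seconds",
        "Last scan duration",
    ))?;

    registry.register(Box::new(gc_scans_total))?;
    registry.register(Box::new(gc_rentals_processed.clone()))?;
    registry.register(Box::new(gc_scan_duration_seconds))?;

    // Usage from perform_collection, e.g.:
    // gc_rentals_processed.with_label_values(&["protected"]).inc();
    Ok(gc_rentals_processed)
}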

8.2 Structured Logging

All GC operations must log with:

  • Rental ID
  • Current state
  • Action taken
  • Reason for action
  • Age of rental
  • Container state
  • Billing activity status

Example:

info!(
    rental_id = %rental.rental_id,
    state = ?rental.state,
    age_secs = rental.age().num_seconds(),
    container_state = ?container_state,
    has_billing = billing_active,
    action = "terminate",
    "GC: Terminating orphaned rental"
);

8.3 Alert Conditions

Recommended alerts:

  1. High orphan rate: More than 10% of rentals classified as orphans
  2. GC errors: More than 5 errors per hour
  3. Stuck rentals: Any rental in non-terminal state > 24 hours
  4. Data integrity issues: Any rentals flagged for investigation

9. Rollout Plan

9.1 Phase 1: Implementation (Week 1-2)

Tasks:

  1. Implement core GarbageCollector struct
  2. Implement classification logic
  3. Implement container verification
  4. Implement billing safety checks
  5. Add database migrations
  6. Update RentalInfo with last_telemetry_at
  7. Integrate with RentalManager
  8. Add configuration validation

Deliverables:

  • Fully implemented garbage_collector.rs module
  • Database migration file
  • Configuration structure and defaults
  • Unit tests with >90% coverage

9.2 Phase 2: Integration Testing (Week 2-3)

Tasks:

  1. Write integration tests
  2. Test with mock billing service
  3. Test SSH failure scenarios
  4. Test concurrent operation with health monitor
  5. Performance testing under load
  6. Dry-run testing in staging environment

Deliverables:

  • Complete integration test suite
  • Performance benchmarks
  • Staging environment validation report

9.3 Phase 3: Safe Rollout (Week 3-4)

Tasks:

  1. Deploy with dry_run=true in production
  2. Monitor logs for 48 hours
  3. Verify classification accuracy
  4. Check for false positives
  5. Enable actual cleanup if validated
  6. Monitor for 1 week with active cleanup

Deliverables:

  • Production deployment with dry-run validation
  • Classification accuracy report
  • Full production deployment with monitoring

9.4 Phase 4: Documentation and Handoff (Week 4)

Tasks:

  1. Write operational runbook
  2. Document alert response procedures
  3. Create dashboard for GC monitoring
  4. Train operations team
  5. Document manual override procedures

Deliverables:

  • Operations runbook
  • Monitoring dashboard
  • Training materials
  • Manual intervention procedures

10. Risk Assessment

10.1 Critical Risks

| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| False positive deletion of active rental | CRITICAL | Low | Billing safety window, age thresholds, dry-run testing |
| Billing data loss | CRITICAL | Low | Never delete rentals with recent telemetry, audit logging |
| SSH key compromise | HIGH | Low | Use existing SSH key manager, no new credentials |
| Container cleanup failures | MEDIUM | Medium | Graceful degradation, retry logic, manual cleanup |
| Database corruption | MEDIUM | Low | Transaction safety, comprehensive error handling |
| Performance degradation | LOW | Low | Batch limits, configurable intervals, efficient queries |

10.2 Operational Risks

| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Excessive logging volume | LOW | Medium | Appropriate log levels, sampling |
| Alert fatigue | LOW | Medium | Tuned alert thresholds, actionable alerts only |
| Configuration errors | MEDIUM | Low | Validation on startup, safe defaults |
| Monitoring blind spots | MEDIUM | Low | Comprehensive metrics, regular review |

10.3 Financial Risks

| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Incorrect miner payout | CRITICAL | Very Low | Billing safety window, conservative thresholds |
| Resource cost overruns | LOW | Low | Batch limits, efficient queries |
| Audit trail gaps | MEDIUM | Low | Comprehensive logging before actions |

11. Success Metrics

11.1 Functional Metrics

  • Orphan Detection Rate: Percentage of actual orphans correctly identified
  • False Positive Rate: Target < 0.1% (must be near zero)
  • Cleanup Success Rate: Percentage of cleanups completed without errors
  • Time to Detection: Average time from orphan creation to detection

11.2 Performance Metrics

  • Scan Duration: Target < 30 seconds for 1000 rentals
  • CPU Impact: Target < 1% average CPU usage
  • Memory Impact: Target < 50MB additional memory
  • Database Load: Target < 10 queries per scan

11.3 Reliability Metrics

  • Error Rate: Target < 1 error per 1000 rentals processed
  • Uptime: Target 100% (no GC-related crashes)
  • Recovery Time: From error to normal operation < 1 minute

12. Open Questions and Decisions

12.1 Resolved Decisions

  1. Billing Safety Window: 48 hours (conservative)
  2. Terminal Retention: 7 days (balance audit needs vs storage)
  3. Scan Interval: 1 hour (sufficient for most orphans)
  4. Dry Run Default: False (but required for initial deployment)

12.2 Questions Requiring Clarification

  1. Payment Settlement Process:

    • Where/how are miner payments processed?
    • Can we query payment status before cleanup?
    • Recommendation: Add payment service integration or extend safety window
  2. Manual Override Mechanism:

    • Should there be a way to protect specific rentals from GC?
    • Recommendation: Add gc_exempt flag to rental metadata (sketch after this list)
  3. Audit Trail Retention:

    • Should deleted rentals be backed up separately?
    • Recommendation: Add audit table or external log archival
  4. Cross-Service Coordination:

    • Does billing service need notification before deletion?
    • Recommendation: Add billing service health check before GC actions
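
If the gc_exempt flag (question 2 above) is adopted, the check is a one-line guard at the top of classify_rental. A sketch; gc_exempt is a hypothetical field, shown only to illustrate the shape:

// Hypothetical manual-override guard (gc_exempt is not yet a RentalInfo field)
if rental.gc_exempt {
    debug!("Rental {} is GC-exempt, skipping", rental.rental_id);
    return Ok(OrphanCategory::Protected);
}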

13. Implementation Checklist

Phase 1: Core Implementation

  • Create garbage_collector.rs module
  • Implement OrphanCategory enum
  • Implement ContainerState enum
  • Implement GarbageCollectorConfig struct with defaults
  • Implement RentalGarbageCollector struct
  • Implement classification logic
  • Implement container verification
  • Implement billing safety checks
  • Implement safe cleanup methods
  • Add unit tests (>90% coverage)

Phase 2: Integration

  • Add last_telemetry_at field to RentalInfo
  • Update billing monitor to track telemetry timestamps
  • Create database migration for new column and indexes
  • Add GC configuration to ValidatorConfig
  • Integrate GC into RentalManager
  • Add GC metrics to ValidatorPrometheusMetrics
  • Implement graceful shutdown in Drop
  • Add integration tests

Phase 3: Safety and Validation

  • Implement dry-run mode
  • Add configuration validation
  • Add comprehensive logging
  • Create monitoring dashboard
  • Write operational runbook
  • Conduct security review
  • Performance testing
  • Load testing

Phase 4: Deployment

  • Deploy with dry_run=true to staging
  • Monitor and validate for 48 hours
  • Review classification accuracy
  • Fix any issues identified
  • Deploy with dry_run=true to production
  • Monitor and validate for 48 hours
  • Enable actual cleanup if validated
  • Monitor for 1 week
  • Document lessons learned

14. Appendix

A. References

  • Health Monitoring: crates/basilica-validator/src/rental/monitoring.rs
  • Billing Telemetry: crates/basilica-validator/src/rental/billing.rs
  • Rental Manager: crates/basilica-validator/src/rental/mod.rs
  • Container Client: crates/basilica-validator/src/rental/container_client.rs
  • Persistence Layer: crates/basilica-validator/src/persistence/

B. Glossary

  • Orphan Rental: A rental record without valid container or stuck in non-terminal state
  • Terminal State: Stopped or Failed rental states (final states)
  • Billing Safety Window: Time period during which rentals with telemetry are protected
  • Dry Run: Mode where GC logs actions without executing them
  • Grace Period: Minimum age before rental can be considered for cleanup

C. Contact and Escalation

For questions or issues with this implementation:

  1. Review this architecture document
  2. Check operational runbook (to be created)
  3. Review code comments and tests
  4. Escalate to validator team lead if financial safety concerns arise

Document Status: Ready for Implementation Review
Next Review Date: Upon completion of Phase 1
Approval Required: Technical Lead, Financial Systems Owner

Labels: enhancement (New feature or request)