bssh Architecture Documentation

Overview

bssh (Backend.AI SSH) is a high-performance parallel SSH command execution tool designed for managing Backend.AI clusters. This document describes the detailed architecture, implementation decisions, and design patterns used in the project.

System Architecture

┌─────────────────────────────────────────────────────────┐
│                     CLI Interface                       │
│                       (main.rs)                         │
└─────────────────────┬───────────────────────────────────┘
                      │
        ┌─────────────┼───────────────┐
        ▼             ▼               ▼
┌──────────────┐ ┌───────────┐ ┌─────────────┐
│   Commands   │ │  Config   │ │    Utils    │
│   Module     │ │  Manager  │ │   Module    │
│ (commands/*) │ │(config.rs)│ │  (utils/*)  │
└──────┬───────┘ └───────────┘ └─────────────┘
        │
        ▼
┌──────────────┐           ┌──────────────┐  ┌──────────┐
│   Executor   │◄──────────┤     Node     │  │    UI    │
│  (Parallel)  │           │    Parser    │  │  System  │
│(executor.rs) │           │  (node.rs)   │  │ (ui.rs)  │
└──────┬───────┘           └──────────────┘  └──────────┘
       │
       ├──────────┬────────────┐
       ▼          ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│   SSH    │ │   SSH    │ │   SSH    │
│  Client  │ │  Client  │ │  Client  │
│ (async)  │ │ (async)  │ │ (async)  │
└──────────┘ └──────────┘ └──────────┘

Modular Design (Refactored 2025-01-22)

The codebase has been restructured for better maintainability and scalability:

Minimal Entry Point (main.rs):
- Reduced from 987 lines to ~150 lines
- Only handles CLI parsing and command dispatching
- Delegates all business logic to specialized modules
Command Modules (commands/):
- exec.rs: Command execution with output management
- ping.rs: Connectivity testing
- interactive.rs: Interactive shell sessions (Phase 1 completed)
- list.rs: Cluster listing
- upload.rs: File upload operations
- download.rs: File download operations
- Each module is self-contained and independently testable
Utility Modules (utils/):
- fs.rs: File system operations (glob patterns, directory walking)
- output.rs: Command output file management
- logging.rs: Logging initialization
- Reusable across different commands

Component Details

1. CLI Interface (`cli.rs`, `main.rs`)

Design Decisions:

Uses clap v4 with derive macros for type-safe argument parsing
Subcommand pattern for different operations (exec, list, ping, upload, download)
Environment variable support via env attribute
Refactored (2025-01-22): Separated command logic from main.rs

Implementation:

// main.rs - Minimal dispatcher
async fn main() -> Result<()> {
    let cli = Cli::parse();
    match cli.command {
        Commands::Exec { .. } => exec::execute_command(params).await,
        Commands::List => list::list_clusters(&config),
        Commands::Ping => ping::ping_nodes(nodes, ...).await,
        Commands::Upload { .. } => upload::upload_file(params, ...).await,
        Commands::Download { .. } => download::download_file(params, ...).await,
    }
}

Trade-offs:

Derive macros increase compile time but provide better type safety
Subcommand pattern adds complexity but improves UX
Modular structure increases file count but improves testability

2. Configuration Management (`config.rs`)

Design Decisions:

YAML format for human readability
Hierarchical configuration with cluster → nodes structure
Default values with override capability
Full XDG Base Directory specification compliance

Configuration Loading Priority:

Backend.AI environment variables (auto-detection)
Current directory (./config.yaml)
XDG config directory ($XDG_CONFIG_HOME/bssh/config.yaml or ~/.config/bssh/config.yaml)
CLI specified path (via --config flag)

XDG Support:

Respects $XDG_CONFIG_HOME environment variable
Uses directories crate's ProjectDirs for platform-specific paths
Follows XDG Base Directory specification
Tilde expansion for paths using shellexpand

Key Features:

Lazy loading of configuration
Validation at parse time
Support for both file-based and CLI-specified nodes
✅ Environment variable expansion (Phase 1 - Completed 2025-08-21)
- Supports ${VAR} and $VAR syntax
- Expands in hostnames and usernames
- Graceful fallback for undefined variables

Data Model:

pub struct Config {
    pub clusters: HashMap<String, Cluster>,
    pub default_cluster: Option<String>,
    pub ssh_config: SshConfig,
}

pub struct Cluster {
    pub nodes: Vec<Node>,
    pub ssh_key: Option<PathBuf>,
    pub user: Option<String>,
}

3. Parallel Executor (`executor.rs`)

Design Decisions:

Tokio-based async execution for maximum concurrency
Semaphore-based concurrency limiting to prevent resource exhaustion
Progress bar visualization using indicatif
Streaming output collection for real-time feedback

Concurrency Model:

let semaphore = Arc::new(Semaphore::new(max_parallel));
let tasks: Vec<JoinHandle<Result<ExecutionResult>>> = nodes
    .into_iter()
    .map(|node| {
        let permit = semaphore.clone().acquire_owned();
        tokio::spawn(async move {
            let _permit = permit.await;
            execute_on_node(node, command).await
        })
    })
    .collect();

Performance Optimizations:

Connection reuse within same node (planned)
Buffered I/O for output collection
Early termination on critical failures

4. SSH Client (`ssh/client.rs`)

Library Choice: async-ssh2-tokio

Why not thrussh: async-ssh2-tokio provides simpler API and better OpenSSH compatibility
Why not openssh: Need fine-grained control over connections
Why not ssh2: Need async/await support for concurrent operations

Implementation Details:

Async/await pattern for non-blocking I/O
Support for both key-based and agent authentication
Configurable timeouts and retry logic

Security Implementation (Phase 1 - Completed 2025-08-21):

✅ Host key verification with three modes:
- StrictHostKeyChecking::Yes - Strict verification using known_hosts
- StrictHostKeyChecking::No - Skip all verification
- StrictHostKeyChecking::AcceptNew - TOFU mode (limited by library)
✅ CLI flag --strict-host-key-checking with default "accept-new"
✅ Uses system known_hosts file (~/.ssh/known_hosts)

Remaining Limitations:

Missing SFTP support for file operations
Accept-new mode falls back to NoCheck due to library limitations
Connection reuse not possible with async-ssh2-tokio (see Connection Pooling section)

5. Connection Pooling (`ssh/pool.rs`)

Current Status: Placeholder implementation (Phase 3, 2025-08-21)

Design Decision: After thorough analysis, connection pooling was determined to be not beneficial for bssh's current usage pattern. The implementation exists as a placeholder for future features.

Analysis Results:

Current Usage Pattern: Each CLI invocation executes exactly one operation per host then terminates
No Reuse Scenarios: There are no cases where connections would be reused within a single bssh execution
Library Limitation: async-ssh2-tokio's Client type doesn't support cloning or connection reuse
Performance Impact: Zero benefit for current one-shot command execution model

When Pooling Would Be Beneficial:

Interactive mode with persistent shell sessions
Watch mode for periodic command execution
Server mode providing an HTTP API
Batch command execution from files
Command pipelining on the same hosts

Implementation:

pub struct ConnectionPool {
    _connections: Arc<RwLock<Vec<ConnectionKey>>>,  // Placeholder
    ttl: Duration,
    enabled: bool,
    max_connections: usize,
}

Current Behavior:

Always creates new connections regardless of enabled flag
Provides API surface for future pooling implementation
No performance overhead when disabled (default)

Recommendation: Focus on more impactful optimizations like:

Connection timeout tuning
SSH compression for large outputs
Buffered I/O optimizations
Early termination on critical failures
Parallel DNS resolution

6. Node Management (`node.rs`)

Design Decisions:

Flexible parsing supporting multiple formats
Smart defaults (port 22, current user)
Validation at parse time

Supported Formats:

hostname → Simple hostname
user@hostname → With username
hostname:port → With custom port
user@hostname:port → Full specification
[ipv6::addr]:port → IPv6 support

Data Flow

Command Execution Flow

CLI Parsing → Parse arguments and load configuration
Node Resolution → Determine target nodes from config or CLI
Executor Setup → Create semaphore and progress bars
Parallel Spawn → Launch tokio tasks for each node
SSH Connection → Establish authenticated SSH session
Command Execution → Run command and collect output
Result Aggregation → Collect all results and report

Error Handling Strategy

Connection Failures: Report per-node, continue with others
Authentication Failures: Fail fast with clear error message
Command Failures: Report exit code, continue execution
Timeout Handling: Configurable per-operation timeouts

Performance Characteristics

Benchmarks (Target)

Nodes	Command	Time	Memory
10	uptime	<2s	<50MB
100	uptime	<5s	<200MB
1000	uptime	<30s	<1GB

Bottlenecks

SSH Handshake: ~200-500ms per connection
Memory: Output buffering for large responses
CPU: Minimal, mostly I/O bound

Optimization Strategies

Connection Pooling: Reuse connections for multiple commands
Pipelining: Send multiple commands in single session
Compression: Enable SSH compression for large outputs
Caching: Cache host keys and authentication

Interactive Mode Architecture

Overview

Interactive mode provides persistent shell sessions with single-node or multiplexed multi-node support. This feature enables real-time interaction with cluster nodes, maintaining stateful connections for extended operations.

Design Decisions

PTY Support:
- Full pseudo-terminal allocation for proper shell interaction
- Terminal size detection and dynamic resizing
- ANSI escape sequence support for colored output
Session Management:
- Persistent SSH connections with keep-alive
- Graceful reconnection on connection drops
- Session state tracking (working directory, environment)
Input/Output Multiplexing:
- Commands broadcast to all nodes simultaneously
- Node-prefixed output with color coding
- Visual status indicators (● connected, ○ disconnected)

Implementation Details

struct NodeSession {
    node: Node,
    client: Client,
    channel: Channel<Msg>,
    working_dir: String,
    is_connected: bool,
}

Modes of Operation

Single-Node Mode (--single-node):
- Interactive shell on one selected node
- Full terminal emulation
- Command history with rustyline
Multiplex Mode (default):
- Commands sent to all nodes
- Synchronized output display
- Node status tracking

Future Enhancements (Phase 2-3)

Node switching with !node1, !node2 commands
Session persistence and detach/reattach
Full TUI with ratatui (split panes, monitoring)
File manager integration
Performance metrics visualization

Security Model

Current Implementation

SSH key-based authentication
No password storage
Agent forwarding support

Planned Improvements

Host Key Verification:
- Known_hosts file support
- TOFU (Trust On First Use) mode
- Strict mode with pre-shared keys
Audit Logging:
- Command execution history
- Connection attempts
- Authentication failures
Secrets Management:
- Integration with system keyring
- Encrypted configuration support

User Interface System (`ui.rs`)

Design Philosophy

The UI system provides a modern, clean, and elegant command-line interface with semantic colors and Unicode symbols for better visual hierarchy and user experience.

Key Components

Color Scheme:
- Cyan: Headers, prompts, and informational elements
- Green: Success indicators and positive outcomes
- Red: Failure indicators and errors
- Yellow: Counts, numbers, and warnings
- Blue: Active/processing states
- Dimmed: Secondary information and decorative elements
Unicode Symbols:
- ● (filled circle): Status indicators (colored based on state)
- ○ (empty circle): Pending/inactive state
- ◐/◑ (partial circles): In-progress animations
- ▶ (triangle): Section headers and actions
- • (bullet): List items
- └ (corner): Error details and nested information
- ✓/✗: Success/failure checkmarks
UI Components:

NodeStatus Enum:
- Represents the current state of a node (Pending, Connecting, Executing, Success, Failed)
- Provides colored symbols and text representations
NodeGrid:
- Compact grid layout for displaying multiple node statuses
- Responsive to terminal width
- Shows real-time status updates during execution
OutputFormatter:
- Formats command output with proper indentation and wrapping
- Handles terminal width constraints
- Provides consistent formatting for headers, summaries, and results

Implementation Details

pub enum NodeStatus {
    Pending,
    Connecting,
    Executing,
    Success,
    Failed(String),
}

impl NodeStatus {
    pub fn symbol(&self) -> String {
        match self {
            NodeStatus::Pending => "○".dimmed(),
            NodeStatus::Connecting => "◐".yellow(),
            NodeStatus::Executing => "◑".blue(),
            NodeStatus::Success => "●".green(),
            NodeStatus::Failed(_) => "●".red(),
        }
    }
}

Progress Indicators

Uses indicatif for animated progress spinners during execution
Custom tick characters for smooth animation: ⣾⣽⣻⣟⣯⣷⣿
Per-node progress bars with status messages

Terminal Responsiveness

Detects terminal width using terminal_size crate
Adapts output formatting based on available space
Wraps long lines intelligently while preserving indentation

Output Examples

Command Execution:

► Executing on 3 nodes:
  echo 'test'

[node1] ⣾ Connecting...
[node2] ◑ Executing...
[node3] ● Success

✓ node1
  test output

✗ node2 - Failed
  └ Connection timeout

════════════════════════════════════════
 Summary: 3 nodes • 2 successful • 1 failed
════════════════════════════════════════

Cluster Listing:

▶ Available clusters

  ● production (5 nodes)
    • prod-1.example.com
    • prod-2.example.com
    ...
    
  ● staging (2 nodes)
    • stage-1.example.com
    • stage-2.example.com

Testing Strategy

Unit Tests

Configuration parsing edge cases
Node format parsing
Error handling paths

Integration Tests

Mock SSH server for protocol testing
Docker-based real SSH testing
Cluster simulation

Coverage Goals

Core modules: >90%
SSH client: >80%
Overall: >85%

Future Improvements

Short-term (v0.2)

Medium-term (v0.3)

SFTP support for efficient file transfers
Interactive session support (PTY)
Command templates and scripts
Result caching
Parallel file distribution

Long-term (v1.0)

Technical Debt

Host Key Verification: Currently disabled, security risk ✅ Fixed in Phase 1
Test Coverage: Integration tests missing
Error Messages: Need better context and suggestions
Documentation: API docs incomplete

Change Log

Phase 1 - Critical Fixes (2025-08-21)

Completed:

Host Key Verification: Implemented three modes of verification with CLI flag
List Command Bug: Fixed logic to allow list without host specification
Environment Variables: Added expansion support for YAML configuration

Impact:

Security improved with proper host key checking
Better UX with fixed list command
More flexible configuration with env var support

Phase 3 - Connection Pooling Analysis (2025-08-21)

Completed:

Connection Pool Module: Implemented placeholder connection pool infrastructure
Performance Analysis: Determined pooling provides no benefit for current usage pattern
Architecture Documentation: Documented design decision and rationale

Key Findings:

Current one-shot execution model doesn't benefit from connection pooling
async-ssh2-tokio Client doesn't support connection reuse or cloning
Pooling would only benefit future features like interactive mode or watch mode

Recommendation:

Keep placeholder implementation for future use
Focus on other performance optimizations with immediate impact
Revisit when implementing persistent/interactive features

Phase 4 - Code Structure Refactoring (2025-01-22)

Completed:

Modular Command Structure: Separated commands into individual modules
Utility Extraction: Created reusable utility modules for common functions
Main.rs Simplification: Reduced from 987 to ~150 lines

New Structure:

src/
├── commands/         # Command implementations
│   ├── exec.rs      # Execute command (~75 lines)
│   ├── ping.rs      # Connectivity test (~80 lines)
│   ├── list.rs      # List clusters (~50 lines)
│   ├── upload.rs    # File upload (~175 lines)
│   └── download.rs  # File download (~240 lines)
├── utils/           # Utility functions
│   ├── fs.rs        # File system utilities (~100 lines)
│   ├── output.rs    # Output management (~200 lines)
│   └── logging.rs   # Logging setup (~30 lines)
└── main.rs          # CLI dispatcher (~150 lines)

Benefits:

Improved Maintainability: Each command is self-contained
Better Testability: Individual modules can be tested in isolation
Enhanced Scalability: New commands can be added without touching main.rs
Code Reusability: Utility functions are shared across commands
Clear Separation of Concerns: Each module has a single responsibility

Metrics:

Main.rs size reduction: 84% (987 → 150 lines)
Average module size: ~100 lines
Total modules created: 9 new files
No functionality changes, only structural improvements

Dependencies and Licensing

All dependencies are compatible with Apache-2.0 licensing:

tokio: MIT
async-ssh2-tokio: MIT
clap: MIT/Apache-2.0
serde: MIT/Apache-2.0
Other dependencies: Similar permissive licenses

Appendix

A. Configuration Schema

# Full configuration example
clusters:
  production:
    nodes:
      - host: node1.example.com
        port: 22
        user: admin
    ssh_key: ~/.ssh/id_rsa
    known_hosts: ~/.ssh/known_hosts
    
default_cluster: production

ssh_config:
  connect_timeout: 10
  command_timeout: 300
  max_retries: 3

B. Error Codes

Code	Description
1	General error
2	Configuration error
3	Connection failed
4	Authentication failed
5	Command execution failed
10	Partial failure (some nodes failed)

C. Performance Tuning

Environment variables for tuning:

BSSH_MAX_PARALLEL: Maximum parallel connections
BSSH_CONNECT_TIMEOUT: Connection timeout in seconds
BSSH_BUFFER_SIZE: Output buffer size per connection
RUST_LOG: Logging level (trace/debug/info/warn/error)

Uh oh!

FilesExpand file tree

ARCHITECTURE.md

Latest commit

History

ARCHITECTURE.md

File metadata and controls

bssh Architecture Documentation

Overview

System Architecture

Modular Design (Refactored 2025-01-22)

Component Details

1. CLI Interface (cli.rs, main.rs)

2. Configuration Management (config.rs)

3. Parallel Executor (executor.rs)

4. SSH Client (ssh/client.rs)

5. Connection Pooling (ssh/pool.rs)

6. Node Management (node.rs)

Data Flow

Command Execution Flow

Error Handling Strategy

Performance Characteristics

Benchmarks (Target)

Bottlenecks

Optimization Strategies

Interactive Mode Architecture

Overview

Design Decisions

Implementation Details

Modes of Operation

Future Enhancements (Phase 2-3)

Security Model

Current Implementation

Planned Improvements

User Interface System (ui.rs)

Design Philosophy

Key Components

Implementation Details

Progress Indicators

Terminal Responsiveness

Output Examples

Testing Strategy

Unit Tests

Integration Tests

Coverage Goals

Future Improvements

Short-term (v0.2)

Medium-term (v0.3)

Long-term (v1.0)

Technical Debt

Change Log

Phase 1 - Critical Fixes (2025-08-21)

Phase 3 - Connection Pooling Analysis (2025-08-21)

Phase 4 - Code Structure Refactoring (2025-01-22)

Dependencies and Licensing

Appendix

A. Configuration Schema

B. Error Codes

C. Performance Tuning

1. CLI Interface (`cli.rs`, `main.rs`)

2. Configuration Management (`config.rs`)

3. Parallel Executor (`executor.rs`)

4. SSH Client (`ssh/client.rs`)

5. Connection Pooling (`ssh/pool.rs`)

6. Node Management (`node.rs`)

User Interface System (`ui.rs`)