bssh (Backend.AI SSH) is a high-performance parallel SSH command execution tool designed for managing Backend.AI clusters. This document describes the detailed architecture, implementation decisions, and design patterns used in the project.
```
┌─────────────────────────────────────────────────────────┐
│                      CLI Interface                      │
│                        (main.rs)                        │
└─────────────────────┬───────────────────────────────────┘
                      │
        ┌─────────────┼───────────────┐
        ▼             ▼               ▼
┌──────────────┐ ┌───────────┐ ┌─────────────┐
│   Commands   │ │  Config   │ │    Utils    │
│    Module    │ │  Manager  │ │   Module    │
│ (commands/*) │ │(config.rs)│ │  (utils/*)  │
└──────┬───────┘ └───────────┘ └─────────────┘
       │
       ▼
┌──────────────┐           ┌──────────────┐      ┌──────────┐
│   Executor   │◄──────────┤     Node     │      │    UI    │
│  (Parallel)  │           │    Parser    │      │  System  │
│(executor.rs) │           │  (node.rs)   │      │ (ui.rs)  │
└──────┬───────┘           └──────────────┘      └──────────┘
       │
       ├──────────┬────────────┐
       ▼          ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│   SSH    │ │   SSH    │ │   SSH    │
│  Client  │ │  Client  │ │  Client  │
│ (async)  │ │ (async)  │ │ (async)  │
└──────────┘ └──────────┘ └──────────┘
```
The codebase has been restructured for better maintainability and scalability:
- Minimal Entry Point (`main.rs`):
  - Reduced from 987 lines to ~150 lines
  - Only handles CLI parsing and command dispatching
  - Delegates all business logic to specialized modules
- Command Modules (`commands/`):
  - `exec.rs`: Command execution with output management
  - `ping.rs`: Connectivity testing
  - `interactive.rs`: Interactive shell sessions (Phase 1 completed)
  - `list.rs`: Cluster listing
  - `upload.rs`: File upload operations
  - `download.rs`: File download operations
  - Each module is self-contained and independently testable
- Utility Modules (`utils/`):
  - `fs.rs`: File system operations (glob patterns, directory walking)
  - `output.rs`: Command output file management
  - `logging.rs`: Logging initialization
  - Reusable across different commands
Design Decisions:
- Uses clap v4 with derive macros for type-safe argument parsing
- Subcommand pattern for different operations (exec, list, ping, upload, download)
- Environment variable support via the `env` attribute
- Refactored (2025-01-22): Separated command logic from main.rs
Implementation:
```rust
// main.rs - Minimal dispatcher (arguments are elided for brevity)
#[tokio::main] // an async `main` needs a runtime entry point
async fn main() -> Result<()> {
    let cli = Cli::parse();
    match cli.command {
        Commands::Exec { .. } => exec::execute_command(params).await,
        Commands::List => list::list_clusters(&config),
        Commands::Ping => ping::ping_nodes(nodes, /* ... */).await,
        Commands::Upload { .. } => upload::upload_file(params, /* ... */).await,
        Commands::Download { .. } => download::download_file(params, /* ... */).await,
    }
}
```
Trade-offs:
- Derive macros increase compile time but provide better type safety
- Subcommand pattern adds complexity but improves UX
- Modular structure increases file count but improves testability
Design Decisions:
- YAML format for human readability
- Hierarchical configuration with cluster → nodes structure
- Default values with override capability
- Full XDG Base Directory specification compliance
Configuration Loading Priority:
- Backend.AI environment variables (auto-detection)
- Current directory (`./config.yaml`)
- XDG config directory (`$XDG_CONFIG_HOME/bssh/config.yaml` or `~/.config/bssh/config.yaml`)
- CLI specified path (via `--config` flag)
XDG Support:
- Respects the `$XDG_CONFIG_HOME` environment variable
- Uses the `directories` crate's `ProjectDirs` for platform-specific paths
- Follows XDG Base Directory specification
- Tilde expansion for paths using `shellexpand`
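The XDG fallback rule above can be sketched with the standard library alone. This is an illustration, not the project's code (which relies on the `directories` crate): the function name `xdg_config_path` is hypothetical, and the environment values are passed in as parameters so the logic is easy to test.

```rust
use std::path::{Path, PathBuf};

/// Returns the XDG location of the bssh config file.
/// `xdg_config_home` is the value of `$XDG_CONFIG_HOME` (if set) and
/// `home` is the user's home directory. Hypothetical helper for illustration.
pub fn xdg_config_path(xdg_config_home: Option<&str>, home: &str) -> PathBuf {
    let base = match xdg_config_home {
        // The XDG spec asks implementations to ignore relative (and empty) values.
        Some(dir) if Path::new(dir).is_absolute() => PathBuf::from(dir),
        // Unset or invalid: fall back to `~/.config`.
        _ => Path::new(home).join(".config"),
    };
    base.join("bssh").join("config.yaml")
}
```

A caller would typically pass `std::env::var("XDG_CONFIG_HOME").ok().as_deref()` as the first argument.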
Key Features:
- Lazy loading of configuration
- Validation at parse time
- Support for both file-based and CLI-specified nodes
- ✅ Environment variable expansion (Phase 1 - Completed 2025-08-21)
  - Supports `${VAR}` and `$VAR` syntax
  - Expands in hostnames and usernames
  - Graceful fallback for undefined variables
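As a rough illustration of the expansion rules listed above, the sketch below handles both `${VAR}` and `$VAR`. The function name `expand_env_vars` is hypothetical, and it assumes that "graceful fallback" means an undefined or malformed reference is left in the text unchanged; the real implementation may behave differently.

```rust
/// Expands `${VAR}` and `$VAR` references in `input`.
/// `lookup` returns a variable's value, or `None` when it is undefined.
/// Assumption: undefined or malformed references are kept verbatim.
pub fn expand_env_vars(input: &str, lookup: impl Fn(&str) -> Option<String>) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < chars.len() {
        if chars[i] != '$' {
            out.push(chars[i]);
            i += 1;
            continue;
        }
        // Locate the variable name: after `${` for the braced form, after `$` otherwise.
        let braced = chars.get(i + 1) == Some(&'{');
        let start = if braced { i + 2 } else { i + 1 };
        let mut end = start;
        while end < chars.len() && (chars[end].is_ascii_alphanumeric() || chars[end] == '_') {
            end += 1;
        }
        // A braced reference must be terminated by `}`.
        let closed = !braced || chars.get(end) == Some(&'}');
        let next = if braced { end + 1 } else { end };
        let name: String = chars[start..end].iter().collect();

        let value = if name.is_empty() || !closed { None } else { lookup(&name) };
        match value {
            Some(v) => {
                out.push_str(&v);
                i = next; // skip the whole reference
            }
            None => {
                out.push('$'); // keep the text as written
                i += 1;
            }
        }
    }
    out
}
```

To expand against the real process environment, pass `|name| std::env::var(name).ok()` as `lookup`.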
Data Model:
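```rust
use std::collections::HashMap;
use std::path::PathBuf;

// `Node` and `SshConfig` are defined elsewhere in the crate.
pub struct Config {
    pub clusters: HashMap<String, Cluster>, // keyed by cluster name
    pub default_cluster: Option<String>,
    pub ssh_config: SshConfig,
}

pub struct Cluster {
    pub nodes: Vec<Node>,
    pub ssh_key: Option<PathBuf>,
    pub user: Option<String>,
}
```
Design Decisions: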
- Tokio-based async execution for maximum concurrency
- Semaphore-based concurrency limiting to prevent resource exhaustion
- Progress bar visualization using `indicatif`
- Streaming output collection for real-time feedback
Concurrency Model:
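```rust
use std::sync::Arc;
use tokio::{sync::Semaphore, task::JoinHandle};

// At most `max_parallel` nodes run at once; the remaining tasks wait for a permit.
let semaphore = Arc::new(Semaphore::new(max_parallel));
let tasks: Vec<JoinHandle<Result<ExecutionResult>>> = nodes
    .into_iter()
    .map(|node| {
        let permit = semaphore.clone().acquire_owned();
        // Each task needs its own copy (assumes `command` is an owned, cloneable value).
        let command = command.clone();
        tokio::spawn(async move {
            // Held for the lifetime of the task and released when it is dropped.
            let _permit = permit.await;
            execute_on_node(node, command).await
        })
    })
    .collect();
```
Performance Optimizations: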
- Connection reuse within same node (planned)
- Buffered I/O for output collection
- Early termination on critical failures
Library Choice: async-ssh2-tokio
- Why not thrussh: async-ssh2-tokio provides simpler API and better OpenSSH compatibility
- Why not openssh: Need fine-grained control over connections
- Why not ssh2: Need async/await support for concurrent operations
Implementation Details:
- Async/await pattern for non-blocking I/O
- Support for both key-based and agent authentication
- Configurable timeouts and retry logic
Security Implementation (Phase 1 - Completed 2025-08-21):
- ✅ Host key verification with three modes:
  - `StrictHostKeyChecking::Yes` - Strict verification using known_hosts
  - `StrictHostKeyChecking::No` - Skip all verification
  - `StrictHostKeyChecking::AcceptNew` - TOFU mode (limited by library)
- ✅ CLI flag `--strict-host-key-checking` with default "accept-new"
- ✅ Uses system known_hosts file (~/.ssh/known_hosts)
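One way the flag value could map onto the three modes is sketched below. The variant names and the "accept-new" default come from the list above; the accepted strings "yes" and "no" and the `FromStr` wiring are assumptions made for illustration.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum StrictHostKeyChecking {
    /// Strict verification against known_hosts.
    Yes,
    /// Skip all verification.
    No,
    /// Trust on first use; this is the documented default.
    #[default]
    AcceptNew,
}

impl FromStr for StrictHostKeyChecking {
    type Err = String;

    /// Parses a flag value. The spellings "yes" and "no" are assumed here.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "yes" => Ok(Self::Yes),
            "no" => Ok(Self::No),
            "accept-new" => Ok(Self::AcceptNew),
            other => Err(format!("invalid host key checking mode: {other}")),
        }
    }
}
```

Implementing `FromStr` keeps parsing in one place, so an unknown value is rejected with a readable error before any connection is attempted.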
Remaining Limitations:
- Missing SFTP support for file operations
- Accept-new mode falls back to NoCheck due to library limitations
- Connection reuse not possible with async-ssh2-tokio (see Connection Pooling section)
Current Status: Placeholder implementation (Phase 3, 2025-08-21)
Design Decision: After thorough analysis, connection pooling was determined to be not beneficial for bssh's current usage pattern. The implementation exists as a placeholder for future features.
Analysis Results:
- Current Usage Pattern: Each CLI invocation executes exactly one operation per host then terminates
- No Reuse Scenarios: There are no cases where connections would be reused within a single bssh execution
- Library Limitation: async-ssh2-tokio's `Client` type doesn't support cloning or connection reuse
- Performance Impact: Zero benefit for current one-shot command execution model
When Pooling Would Be Beneficial:
- Interactive mode with persistent shell sessions
- Watch mode for periodic command execution
- Server mode providing an HTTP API
- Batch command execution from files
- Command pipelining on the same hosts
Implementation:
```rust
pub struct ConnectionPool {
    _connections: Arc<RwLock<Vec<ConnectionKey>>>, // Placeholder
    ttl: Duration,
    enabled: bool,
    max_connections: usize,
}
```
Current Behavior:
- Always creates new connections regardless of the `enabled` flag
- Provides API surface for future pooling implementation
- No performance overhead when disabled (default)
Recommendation: Focus on more impactful optimizations like:
- Connection timeout tuning
- SSH compression for large outputs
- Buffered I/O optimizations
- Early termination on critical failures
- Parallel DNS resolution
Design Decisions:
- Flexible parsing supporting multiple formats
- Smart defaults (port 22, current user)
- Validation at parse time
Supported Formats:
- `hostname` → Simple hostname
- `user@hostname` → With username
- `hostname:port` → With custom port
- `user@hostname:port` → Full specification
- `[ipv6::addr]:port` → IPv6 support
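A minimal parser covering the formats above might look like the following. The `Node` struct and `parse_node` function are hypothetical stand-ins rather than the types in `node.rs`, and the current user is passed in as `default_user` instead of being detected.

```rust
/// Hypothetical parsed node; the real `Node` type may differ.
#[derive(Debug, PartialEq)]
pub struct Node {
    pub host: String,
    pub port: u16,
    pub user: String,
}

/// Parses `[user@]host[:port]`, with IPv6 addresses written as `[addr]`.
pub fn parse_node(spec: &str, default_user: &str) -> Result<Node, String> {
    // Split off an optional `user@` prefix.
    let (user, rest) = match spec.split_once('@') {
        Some((u, r)) if !u.is_empty() => (u, r),
        Some(_) => return Err(format!("empty user in '{spec}'")),
        None => (default_user, spec),
    };
    let (host, port) = if let Some(inner) = rest.strip_prefix('[') {
        // Bracketed IPv6: `[addr]` or `[addr]:port`.
        let (addr, tail) = inner
            .split_once(']')
            .ok_or_else(|| format!("missing ']' in '{spec}'"))?;
        match tail.strip_prefix(':') {
            Some(p) => (addr, Some(p)),
            None if tail.is_empty() => (addr, None),
            None => return Err(format!("unexpected text after ']' in '{spec}'")),
        }
    } else {
        // Plain host: the last ':' (if any) separates the port.
        match rest.rsplit_once(':') {
            Some((h, p)) => (h, Some(p)),
            None => (rest, None),
        }
    };
    if host.is_empty() {
        return Err(format!("empty host in '{spec}'"));
    }
    // Validation happens here, at parse time; the default port is 22.
    let port = match port {
        Some(p) => p.parse::<u16>().map_err(|e| format!("invalid port '{p}': {e}"))?,
        None => 22,
    };
    Ok(Node { host: host.to_string(), port, user: user.to_string() })
}
```

In this sketch an IPv6 address without brackets is not supported, because its colons cannot be told apart from the port separator.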
- CLI Parsing → Parse arguments and load configuration
- Node Resolution → Determine target nodes from config or CLI
- Executor Setup → Create semaphore and progress bars
- Parallel Spawn → Launch tokio tasks for each node
- SSH Connection → Establish authenticated SSH session
- Command Execution → Run command and collect output
- Result Aggregation → Collect all results and report
- Connection Failures: Report per-node, continue with others
- Authentication Failures: Fail fast with clear error message
- Command Failures: Report exit code, continue execution
- Timeout Handling: Configurable per-operation timeouts
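The "report per node, continue with others" policy above can be illustrated with a small aggregation step. All type and function names here are hypothetical, and the fail-fast handling of authentication errors is deliberately left out of the sketch.

```rust
use std::fmt;

/// Hypothetical per-node outcome; not the project's actual result type.
#[derive(Debug)]
pub enum NodeOutcome {
    Success,
    CommandFailed(i32),       // non-zero exit code
    ConnectionFailed(String), // error message
}

#[derive(Debug, Default, PartialEq)]
pub struct Summary {
    pub total: usize,
    pub successful: usize,
    pub failed: usize,
}

/// Tallies every node's outcome; a failure on one node never stops the count.
pub fn summarize(outcomes: &[NodeOutcome]) -> Summary {
    let mut summary = Summary::default();
    for outcome in outcomes {
        summary.total += 1;
        match outcome {
            NodeOutcome::Success => summary.successful += 1,
            NodeOutcome::CommandFailed(_) | NodeOutcome::ConnectionFailed(_) => {
                summary.failed += 1
            }
        }
    }
    summary
}

impl fmt::Display for Summary {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Summary: {} nodes • {} successful • {} failed",
            self.total, self.successful, self.failed
        )
    }
}
```

Collecting outcomes first and summarizing afterwards keeps a slow or failing node from hiding results from the others.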
| Nodes | Command | Time | Memory |
|---|---|---|---|
| 10 | uptime | <2s | <50MB |
| 100 | uptime | <5s | <200MB |
| 1000 | uptime | <30s | <1GB |
- SSH Handshake: ~200-500ms per connection
- Memory: Output buffering for large responses
- CPU: Minimal, mostly I/O bound
- Connection Pooling: Reuse connections for multiple commands
- Pipelining: Send multiple commands in single session
- Compression: Enable SSH compression for large outputs
- Caching: Cache host keys and authentication
Interactive mode provides persistent shell sessions with single-node or multiplexed multi-node support. This feature enables real-time interaction with cluster nodes, maintaining stateful connections for extended operations.
- PTY Support:
  - Full pseudo-terminal allocation for proper shell interaction
  - Terminal size detection and dynamic resizing
  - ANSI escape sequence support for colored output
- Session Management:
  - Persistent SSH connections with keep-alive
  - Graceful reconnection on connection drops
  - Session state tracking (working directory, environment)
- Input/Output Multiplexing:
  - Commands broadcast to all nodes simultaneously
  - Node-prefixed output with color coding
  - Visual status indicators (● connected, ○ disconnected)
```rust
struct NodeSession {
    node: Node,
    client: Client,
    channel: Channel<Msg>,
    working_dir: String,
    is_connected: bool,
}
```
- Single-Node Mode (`--single-node`):
  - Interactive shell on one selected node
  - Full terminal emulation
  - Command history with rustyline
- Multiplex Mode (default):
  - Commands sent to all nodes
  - Synchronized output display
  - Node status tracking
- Node switching with `!node1`, `!node2` commands
- Session persistence and detach/reattach
- Full TUI with ratatui (split panes, monitoring)
- File manager integration
- Performance metrics visualization
- SSH key-based authentication
- No password storage
- Agent forwarding support
- Host Key Verification:
  - Known_hosts file support
  - TOFU (Trust On First Use) mode
  - Strict mode with pre-shared keys
- Audit Logging:
  - Command execution history
  - Connection attempts
  - Authentication failures
- Secrets Management:
  - Integration with system keyring
  - Encrypted configuration support
The UI system provides a modern, clean, and elegant command-line interface with semantic colors and Unicode symbols for better visual hierarchy and user experience.
- Color Scheme:
  - Cyan: Headers, prompts, and informational elements
  - Green: Success indicators and positive outcomes
  - Red: Failure indicators and errors
  - Yellow: Counts, numbers, and warnings
  - Blue: Active/processing states
  - Dimmed: Secondary information and decorative elements
- Unicode Symbols:
  - `●` (filled circle): Status indicators (colored based on state)
  - `○` (empty circle): Pending/inactive state
  - `◐`/`◑` (partial circles): In-progress animations
  - `▶` (triangle): Section headers and actions
  - `•` (bullet): List items
  - `└` (corner): Error details and nested information
  - `✓`/`✗`: Success/failure checkmarks
- UI Components:
  - NodeStatus Enum:
    - Represents the current state of a node (Pending, Connecting, Executing, Success, Failed)
    - Provides colored symbols and text representations
  - NodeGrid:
    - Compact grid layout for displaying multiple node statuses
    - Responsive to terminal width
    - Shows real-time status updates during execution
  - OutputFormatter:
    - Formats command output with proper indentation and wrapping
    - Handles terminal width constraints
    - Provides consistent formatting for headers, summaries, and results
```rust
use colored::Colorize; // assumption: the color helpers come from the `colored` crate

pub enum NodeStatus {
    Pending,
    Connecting,
    Executing,
    Success,
    Failed(String), // carries the error message
}

impl NodeStatus {
    /// Returns the colored status symbol as a `String`.
    pub fn symbol(&self) -> String {
        let symbol = match self {
            NodeStatus::Pending => "○".dimmed(),
            NodeStatus::Connecting => "◐".yellow(),
            NodeStatus::Executing => "◑".blue(),
            NodeStatus::Success => "●".green(),
            NodeStatus::Failed(_) => "●".red(),
        };
        // The color helpers return a styled value, so convert it explicitly.
        symbol.to_string()
    }
}
```
- Uses `indicatif` for animated progress spinners during execution
- Custom tick characters for smooth animation: `⣾⣽⣻⣟⣯⣷⣿`
- Per-node progress bars with status messages
- Detects terminal width using the `terminal_size` crate
- Adapts output formatting based on available space
- Wraps long lines intelligently while preserving indentation
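The indentation-preserving wrap mentioned in the last item could work roughly as follows. This is a simplified sketch with a hypothetical function name: it counts characters rather than true display width, and it does not split a single word that is longer than the available width.

```rust
/// Wraps `line` to at most `width` characters per output line,
/// repeating the original leading whitespace on every continuation line.
pub fn wrap_preserving_indent(line: &str, width: usize) -> Vec<String> {
    // Everything before the first non-whitespace character is the indent.
    let indent = &line[..line.len() - line.trim_start().len()];
    let mut lines = Vec::new();
    let mut current = String::from(indent);

    for word in line.split_whitespace() {
        let has_words = current.len() > indent.len();
        if has_words {
            if current.chars().count() + 1 + word.chars().count() > width {
                // The word does not fit: finish this line and start a new one.
                lines.push(current);
                current = String::from(indent);
            } else {
                current.push(' ');
            }
        }
        current.push_str(word);
    }
    // Emit the last line; an empty input still yields one (empty) line.
    if current.len() > indent.len() || lines.is_empty() {
        lines.push(current);
    }
    lines
}
```

The `width` argument would come from the detected terminal width, with a fixed fallback when detection fails.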
Command Execution:
```
► Executing on 3 nodes:
  echo 'test'

[node1] ⣾ Connecting...
[node2] ◑ Executing...
[node3] ● Success

✓ node1
  test output

✗ node2 - Failed
  └ Connection timeout

════════════════════════════════════════
Summary: 3 nodes • 2 successful • 1 failed
════════════════════════════════════════
```
Cluster Listing:
```
▶ Available clusters

● production (5 nodes)
  • prod-1.example.com
  • prod-2.example.com
  ...

● staging (2 nodes)
  • stage-1.example.com
  • stage-2.example.com
```
- Configuration parsing edge cases
- Node format parsing
- Error handling paths
- Mock SSH server for protocol testing
- Docker-based real SSH testing
- Cluster simulation
- Core modules: >90%
- SSH client: >80%
- Overall: >85%
- Implement proper host key verification
- Add connection pooling
- Complete file copy functionality
- Add dry-run mode
- Implement output filtering
- SFTP support for efficient file transfers
- Interactive session support (PTY)
- Command templates and scripts
- Result caching
- Parallel file distribution
- Web UI dashboard
- REST API server mode
- Kubernetes operator integration
- Metrics and monitoring
- Plugin system
- Host Key Verification: Previously disabled (a security risk); ✅ fixed in Phase 1
- Test Coverage: Integration tests missing
- Error Messages: Need better context and suggestions
- Documentation: API docs incomplete
Completed:
- Host Key Verification: Implemented three modes of verification with CLI flag
- List Command Bug: Fixed logic to allow list without host specification
- Environment Variables: Added expansion support for YAML configuration
Impact:
- Security improved with proper host key checking
- Better UX with fixed list command
- More flexible configuration with env var support
Completed:
- Connection Pool Module: Implemented placeholder connection pool infrastructure
- Performance Analysis: Determined pooling provides no benefit for current usage pattern
- Architecture Documentation: Documented design decision and rationale
Key Findings:
- Current one-shot execution model doesn't benefit from connection pooling
- async-ssh2-tokio Client doesn't support connection reuse or cloning
- Pooling would only benefit future features like interactive mode or watch mode
Recommendation:
- Keep placeholder implementation for future use
- Focus on other performance optimizations with immediate impact
- Revisit when implementing persistent/interactive features
Completed:
- Modular Command Structure: Separated commands into individual modules
- Utility Extraction: Created reusable utility modules for common functions
- Main.rs Simplification: Reduced from 987 to ~150 lines
New Structure:
```
src/
├── commands/          # Command implementations
│   ├── exec.rs        # Execute command (~75 lines)
│   ├── ping.rs        # Connectivity test (~80 lines)
│   ├── list.rs        # List clusters (~50 lines)
│   ├── upload.rs      # File upload (~175 lines)
│   └── download.rs    # File download (~240 lines)
├── utils/             # Utility functions
│   ├── fs.rs          # File system utilities (~100 lines)
│   ├── output.rs      # Output management (~200 lines)
│   └── logging.rs     # Logging setup (~30 lines)
└── main.rs            # CLI dispatcher (~150 lines)
```
Benefits:
- Improved Maintainability: Each command is self-contained
- Better Testability: Individual modules can be tested in isolation
- Enhanced Scalability: New commands can be added without touching main.rs
- Code Reusability: Utility functions are shared across commands
- Clear Separation of Concerns: Each module has a single responsibility
Metrics:
- Main.rs size reduction: 84% (987 → 150 lines)
- Average module size: ~100 lines
- Total modules created: 9 new files
- No functionality changes, only structural improvements
All dependencies are compatible with Apache-2.0 licensing:
- `tokio`: MIT
- `async-ssh2-tokio`: MIT
- `clap`: MIT/Apache-2.0
- `serde`: MIT/Apache-2.0
- Other dependencies: Similar permissive licenses
```yaml
# Full configuration example
clusters:
  production:
    nodes:
      - host: node1.example.com
        port: 22
        user: admin
    ssh_key: ~/.ssh/id_rsa
    known_hosts: ~/.ssh/known_hosts

default_cluster: production

ssh_config:
  connect_timeout: 10
  command_timeout: 300
  max_retries: 3
```

| Code | Description |
|---|---|
| 1 | General error |
| 2 | Configuration error |
| 3 | Connection failed |
| 4 | Authentication failed |
| 5 | Command execution failed |
| 10 | Partial failure (some nodes failed) |
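
The table above could be mirrored in code with a small enum, which keeps the numeric values in one place. The `FailureKind` name and its variants are hypothetical; only the numbers are taken from the table.

```rust
/// Hypothetical failure categories mirroring the exit-code table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureKind {
    General,
    Configuration,
    Connection,
    Authentication,
    CommandExecution,
    PartialFailure,
}

impl FailureKind {
    /// Returns the process exit code documented for this failure.
    pub fn exit_code(self) -> i32 {
        match self {
            FailureKind::General => 1,
            FailureKind::Configuration => 2,
            FailureKind::Connection => 3,
            FailureKind::Authentication => 4,
            FailureKind::CommandExecution => 5,
            FailureKind::PartialFailure => 10,
        }
    }
}
```

A top-level error handler could then end the process with `std::process::exit(kind.exit_code())`.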
Environment variables for tuning:
- `BSSH_MAX_PARALLEL`: Maximum parallel connections
- `BSSH_CONNECT_TIMEOUT`: Connection timeout in seconds
- `BSSH_BUFFER_SIZE`: Output buffer size per connection
- `RUST_LOG`: Logging level (trace/debug/info/warn/error)
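
Numeric tuning variables like these are usually read with a fallback so that a typo does not abort the run. The helper below is a hypothetical sketch of that pattern; the fallback policy and any default value passed to it are assumptions, not documented bssh behavior.

```rust
/// Parses a positive integer setting, returning `default` when the
/// value is missing, not a number, or zero.
pub fn parse_setting(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|value| value.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(default)
}
```

For example, `parse_setting(std::env::var("BSSH_MAX_PARALLEL").ok().as_deref(), 10)` would read the variable and use an illustrative default of 10.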