# Design Proposal: Orchestrator Deployment Experience Improvement

Author(s): Charles Chan

Last updated: May 27, 2025

## Abstract

The current orchestrator installer in 3.0 suffers from a number of architectural and operational issues that impede
development velocity, cloud support, and user experience:

- The cloud installer and infrastructure provisioning are implemented as **large monolithic shell scripts**.
  This not only makes them difficult to maintain but also renders unit testing nearly impossible.
- The installer is **only tested late in the hardware integration pipeline (HIP)**, which delays feedback and makes bugs harder to trace and resolve.
- The on-prem installer was developed in parallel but shares little code or structure with the cloud installer.
  This results in **inconsistent behaviors and duplicated logic between cloud and on-prem** deployments.
- There is **no clear boundary between infrastructure provisioning and orchestrator** setup,
  making it difficult to port components to another cloud provider or isolate issues during upgrade and testing.
- **Upgrade support was added as an afterthought** and lacks proper design.
- **Error handling is poor**; raw error logs are surfaced directly to users with no actionable remediation,
  and failures require rerunning entire stages without guidance.

This proposal aims to significantly improve the deployment experience of EMF across multiple environments (AWS, Azure, and on-prem).
The new installer will prioritize user experience by offering a streamlined, zero-touch installation process after configuration,
along with clear error handling and actionable feedback.
It will also increase cloud portability through clear infrastructure abstraction and support for Azure.
Finally, by replacing monolithic shell scripts with modular Go components and adding test coverage, we will enable faster iteration and more frequent releases.

## Proposal

### Scope of work

- A **unified installer** that supports AWS, Azure, and on-prem targets.
- A **text user interface (TUI) configuration builder** that guides users through required inputs,
  with the ability to preload values from environment variables or prior installations.
  It should minimize user input. For example, scale profiles for infra and orchestrator can be automatically applied according to the user-specified target scale.
- A **well-defined abstraction between infrastructure and orchestrator** logic that enables independent testing and upgrading,
  as well as the ability to plug in new cloud providers via Go modules.
- Every module should be **independently toggleable** and have minimal external dependencies.
- The installer should support **upgrade** and **uninstall** from a previous version.
- We should be able to run the on-prem installer on the machine where EMF is being deployed.
  It should not require an additional admin machine.
- Be compatible with the upcoming Azure implementation and the ongoing replacement of kind with on-prem in Coder.

### Out of scope

- (EMF-3.2) A clear **progress visualization** showing the overall progress.
- (EMF-3.2) **Diff previews** should be available during upgrade flows, showing schema migrations or configuration changes.
- (EMF-3.2) **Wrapped and actionable error messages**. Raw logs should be saved to files, and restarts should be possible from the point of failure.
- (EMF-3.2) The installer should support **orchestrator CLI integration** (e.g. `cli deploy aws`) and parallel execution of non-dependent tasks.
- Optimizing total deployment time, as current durations are acceptable.
- Full automation of post-deployment IAM/org/user configuration (users will be guided to complete this manually).

### Design Principles

#### General

- All operations must be idempotent and safe to retry.
- Unit tests will be written for each module, avoiding reliance on slow end-to-end tests.
- `pre` and `post` hooks will be supported in both the config builder and each installer stage (maybe even steps, TBD),
  which is useful for schema migration and backup/restore during upgrade (see the sketch after this list).
- Maintain a better hierarchy of the edge-manageability-framework top-level folder:
  - Nest `pod-configs`, `terraform`, `installer`, and `on-prem-installer` under `installer`.
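
For illustration, a minimal sketch of what stage-level hooks might look like in Go. The `Stage`, `Hook`, and `RuntimeState` names are hypothetical placeholders, not part of the current code base:

```go
package installer

import "context"

// RuntimeState is a minimal stub of the runtime state described under
// Implementation Details.
type RuntimeState struct {
    Values map[string]string
}

// Hook is an optional action that runs before or after a stage or the config
// builder, e.g. a schema migration or a backup/restore step during upgrade.
type Hook func(ctx context.Context, state *RuntimeState) error

// Stage is one unit of the installation workflow (infrastructure,
// pre-orchestrator, orchestrator deployment, ...).
type Stage struct {
    Name      string
    PreHooks  []Hook
    Run       func(ctx context.Context, state *RuntimeState) error
    PostHooks []Hook
}

// Execute runs pre hooks, the stage body, and post hooks in order, stopping
// at the first error so the whole stage can be retried idempotently.
func (s *Stage) Execute(ctx context.Context, state *RuntimeState) error {
    for _, h := range s.PreHooks {
        if err := h(ctx, state); err != nil {
            return err
        }
    }
    if err := s.Run(ctx, state); err != nil {
        return err
    }
    for _, h := range s.PostHooks {
        if err := h(ctx, state); err != nil {
            return err
        }
    }
    return nil
}
```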

#### Installer

- Once a configuration file is created, the installation should require no further user interaction.
- All shell scripts (e.g., `provision.sh`) will be replaced with Go code.
- Variable duplication across platforms will be eliminated using shared Terraform variable naming and outputs.
- Use of global variables and relative paths will be minimized.
- Developers should be able to modify, build, and run the installer locally.
- The same installer should be able to handle multiple deployments.
  The context of the target cluster should be derived from the supplied user config.

#### Config Builder

- Configuration will be reduced to only the fields required for the selected environment.
- Full YAML config will be rendered for user review and advanced modification.
- Prior configurations can be loaded and migrated forward during upgrades.
- Schema validation will ensure correctness before proceeding (see the sketch after this list).
- The default mode should be interactive, but it should also have a non-interactive mode.
  - e.g. CI may want to invoke the config builder to validate configuration before kicking off the deployment.
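
As a sketch of the validation idea, assuming a simplified, hypothetical `UserConfig` schema (field names are illustrative only):

```go
package configbuilder

import "fmt"

// UserConfig is a simplified, hypothetical view of the user configuration;
// the real schema will contain more fields per environment.
type UserConfig struct {
    Provider    string `yaml:"provider"`    // "aws", "azure", or "onprem"
    ClusterName string `yaml:"clusterName"`
    Region      string `yaml:"region"`      // required for cloud providers only
    TargetScale int    `yaml:"targetScale"`
}

// Validate checks only the fields required for the selected environment,
// so CI can call it in non-interactive mode before kicking off a deployment.
func (c *UserConfig) Validate() error {
    if c.ClusterName == "" {
        return fmt.Errorf("clusterName is required")
    }
    switch c.Provider {
    case "aws", "azure":
        if c.Region == "" {
            return fmt.Errorf("region is required for provider %q", c.Provider)
        }
    case "onprem":
        // no region needed on-prem
    default:
        return fmt.Errorf("unsupported provider %q", c.Provider)
    }
    if c.TargetScale <= 0 {
        return fmt.Errorf("targetScale must be a positive number of edge nodes")
    }
    return nil
}
```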

#### Progress Visualization / Error Handling

- A text-based progress bar will display milestones, elapsed time, estimated remaining time, and current stage.
- Stage verification will occur both before (input validation) and after (desired state validation) each module runs.
- Logs will be saved to a file and only shown to users when necessary. The default view will focus on high-level progress and status.

### Installation Workflow

#### Stage 0: Configuration

This stage involves collecting all necessary user input at once using the TUI config helper.
The configuration is stored as a single YAML file (see the example below).

Input:

- Account info, region, cluster name, cert, etc.

Output:

- `User Config` – hierarchical YAML file used in all subsequent stages.
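
For illustration only, a hypothetical `User Config` might look like the following; all field names and values are placeholders, not a finalized schema:

```yaml
# Hypothetical user config produced by the TUI config builder.
version: 1
provider: aws
clusterName: demo-orch
region: us-west-2
targetScale: 100          # scale profiles for infra and orchestrator are derived from this
cert:
  tlsSecretName: orch-tls # reference to an existing TLS cert, or empty to generate one
advanced: {}              # rendered in full for review; most users never touch this
```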

#### Stage 1: Infrastructure

Provisions the raw Kubernetes environment, storage backend, and load balancer.
The infrastructure module uses provider-specific backends (e.g., AWS, Azure, or on-prem), registered via Go interfaces.

Input:

- `User Config`
- `Runtime State` (e.g., generated network info)

Output:

- Raw Kubernetes environment
- Storage class
- Load balancer setup

#### Stage 2: Pre-Orchestrator

Performs setup that must be completed before Argo CD can take over.
This includes injecting secrets, setting up namespaces with required labels, importing TLS certs, and installing Gitea and Argo CD.

Input:

- Kubernetes cluster from Stage 1
- `User Config` (e.g., TLS certificates)
- `Runtime State` (e.g., database master password)

Output:

- Cluster in a ready state for Argo CD bootstrapping

Design Constraint:

- Infrastructure modules and orchestrator modules must remain decoupled.
  Only the installer mediates the exchange of infra-specific info (see the example below) via:
  - Rendered Argo CD configuration or ConfigMap (for non-sensitive values like S3 URL)
  - Kubernetes secrets (for sensitive values like credentials)
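
An illustrative hand-off from the infrastructure stage to the orchestrator apps; resource names, namespaces, and keys below are hypothetical:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: infra-outputs
  namespace: orch-platform
data:
  s3BucketUrl: "s3://demo-orch-artifacts"    # non-sensitive value
---
apiVersion: v1
kind: Secret
metadata:
  name: infra-credentials
  namespace: orch-platform
type: Opaque
stringData:
  dbMasterPassword: "<generated at runtime>" # sensitive value, never written to the config file
```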

#### Stage 3: Orchestrator Deployment

Deploys the Argo CD root app and monitors progress until all apps are synced and healthy (see the sketch below).

Input:

- `User Config` (e.g., cluster name, target scale)
- `Runtime State` (e.g., S3 bucket name)

Output:

- All orchestrator Argo CD apps are synced and healthy
- DKAM completes the download and signing of OS profiles
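
A minimal sketch of how the installer could watch Argo CD `Application` resources via the Kubernetes dynamic client; the function name, namespace handling, and polling interval are assumptions, not final design:

```go
package orchestrator

import (
    "context"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
)

var appGVR = schema.GroupVersionResource{
    Group:    "argoproj.io",
    Version:  "v1alpha1",
    Resource: "applications",
}

// WaitForApps polls Argo CD Application resources until every app reports
// Synced/Healthy or the context is cancelled.
func WaitForApps(ctx context.Context, client dynamic.Interface, namespace string) error {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        apps, err := client.Resource(appGVR).Namespace(namespace).List(ctx, metav1.ListOptions{})
        if err != nil {
            return err
        }
        if allSyncedAndHealthy(apps.Items) {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
        }
    }
}

func allSyncedAndHealthy(apps []unstructured.Unstructured) bool {
    for _, app := range apps {
        sync, _, _ := unstructured.NestedString(app.Object, "status", "sync", "status")
        health, _, _ := unstructured.NestedString(app.Object, "status", "health", "status")
        if sync != "Synced" || health != "Healthy" {
            return false
        }
    }
    return len(apps) > 0
}
```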

#### Stage 4: Post-Orchestrator

Provides post-deployment guidance to users on setting up IAM roles, multi-tenant organizations, and user access.

Output:

- Display helpful links and CLI instructions
- (Next release: better integration with the orchestrator CLI)

### Implementation Details

- **Secrets Management:**
  Secrets required during installation runtime will be stored in memory or in a secure state file.
  Secrets needed post-deployment will be persisted as Kubernetes secrets.

- **Configuration and State Management:**
  Both `User Config` and `Runtime State` will be stored as a single structured YAML file,
  persisted locally or in the cloud, similar to Terraform state files.
  These configurations will be versioned, enabling version-specific upgrade logic such as configuration schema and/or data migration.
  The config builder will support loading previous configurations, migrating them to the latest schema,
  and prompting for any new required attributes.
  We should leverage 3rd-party libraries such as [Viper](https://github.com/spf13/viper) to handle configurations.
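
  A minimal sketch of how Viper might load a versioned config, reusing the hypothetical `UserConfig` type from the config builder sketch above; the `version` key, schema constant, and migration hook are illustrative assumptions:

  ```go
  package configbuilder

  import (
      "fmt"

      "github.com/spf13/viper"
  )

  const currentSchemaVersion = 1

  // migrate upgrades older config layouts in place; a real implementation
  // would be driven by per-version migration steps.
  func migrate(v *viper.Viper, from int) error {
      return nil // placeholder: nothing to migrate yet
  }

  // LoadUserConfig reads the YAML config, checks its schema version, and
  // migrates older versions forward before unmarshalling.
  func LoadUserConfig(path string) (*UserConfig, error) {
      v := viper.New()
      v.SetConfigFile(path)
      if err := v.ReadInConfig(); err != nil {
          return nil, fmt.Errorf("reading config: %w", err)
      }
      if ver := v.GetInt("version"); ver < currentSchemaVersion {
          if err := migrate(v, ver); err != nil {
              return nil, fmt.Errorf("migrating config from version %d: %w", ver, err)
          }
      }
      var cfg UserConfig
      if err := v.Unmarshal(&cfg); err != nil {
          return nil, fmt.Errorf("parsing config: %w", err)
      }
      return &cfg, nil
  }
  ```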

- **Configuration Consumption:**
  Each installer module will implement a config provider that parses the *User Config* and *Runtime State*, and generates module-specific configuration (e.g., Helm values, Terraform variables).
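
  One possible shape for such a config provider, reusing the hypothetical `UserConfig` and `RuntimeState` types from the earlier sketches (assumed to live in the same package here for brevity):

  ```go
  package installer

  // ModuleConfig is the module-specific rendering of the shared configuration,
  // e.g. Helm values for an Argo CD app or variables for a Terraform module.
  type ModuleConfig struct {
      HelmValues    map[string]any
      TerraformVars map[string]string
  }

  // ConfigProvider turns the shared User Config and Runtime State into the
  // configuration a single module understands. Each installer module ships one.
  type ConfigProvider interface {
      Render(user *UserConfig, state *RuntimeState) (*ModuleConfig, error)
  }
  ```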

- **Upgrade Workflow:**
  During upgrade, the installer will generate a new configuration and display a diff preview to the user before proceeding.

- **Modular Infrastructure Provider Interface:**
  Infrastructure providers (AWS, Azure, On-Prem) will implement a shared Go interface and register themselves as plug-ins. This abstraction ensures separation from orchestrator logic and allows easy extension to new cloud backends.
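
  A sketch of what the shared interface and plug-in registration could look like; method names are illustrative, not a finalized API, and the config/state types are the hypothetical ones used in the earlier sketches:

  ```go
  package installer

  import "context"

  // Provider is the contract every infrastructure backend (AWS, Azure,
  // on-prem) implements.
  type Provider interface {
      Name() string
      Provision(ctx context.Context, cfg *UserConfig, state *RuntimeState) error
      Upgrade(ctx context.Context, cfg *UserConfig, state *RuntimeState) error
      Uninstall(ctx context.Context, cfg *UserConfig, state *RuntimeState) error
  }

  var providers = map[string]Provider{}

  // Register is called from each backend's init() so the installer can select
  // a provider purely from the user config, e.g. providers["aws"].
  func Register(p Provider) {
      providers[p.Name()] = p
  }
  ```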

- **Programmatic and Orchestrator CLI Integration:**
  The installer must support both CLI usage (e.g., `cli deploy aws`) and programmatic invocation for integration with other tools like the Orch CLI.

- **Parallel Execution:**
  Dependencies between steps should be explicitly defined.
  Tasks that are independent will be executed in parallel to optimize installation time.
  This is a requirement for standalone edge node mass provisioning.
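
  A simplified scheduler sketch, assuming tasks declare dependencies by name and using `errgroup` for cancellation on the first failure; this is an illustration, not production code:

  ```go
  package installer

  import (
      "context"

      "golang.org/x/sync/errgroup"
  )

  // Task is a unit of work with explicitly declared dependencies.
  type Task struct {
      Name      string
      DependsOn []string
      Run       func(ctx context.Context) error
  }

  // RunAll starts every task in parallel; each task waits until all of its
  // dependencies have finished before running its body.
  func RunAll(ctx context.Context, tasks []Task) error {
      done := make(map[string]chan struct{}, len(tasks))
      for _, t := range tasks {
          done[t.Name] = make(chan struct{})
      }
      g, ctx := errgroup.WithContext(ctx)
      for _, t := range tasks {
          t := t // capture loop variable
          g.Go(func() error {
              for _, dep := range t.DependsOn {
                  select {
                  case <-done[dep]: // dependency finished successfully
                  case <-ctx.Done(): // another task failed; stop waiting
                      return ctx.Err()
                  }
              }
              if err := t.Run(ctx); err != nil {
                  return err
              }
              close(done[t.Name])
              return nil
          })
      }
      return g.Wait()
  }
  ```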

- **Logging and Error Handling:**
  All logs will be dumped to a file automatically.
  Modules will return standardized error codes with consistent logging behavior across the system.
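
  One way such standardized errors might look in Go; the specific codes and field names are placeholders:

  ```go
  package installer

  import "fmt"

  // ErrorCode identifies a class of failure so the UI can map it to an
  // actionable message.
  type ErrorCode string

  const (
      ErrConfigInvalid  ErrorCode = "CONFIG_INVALID"
      ErrInfraFailed    ErrorCode = "INFRA_PROVISION_FAILED"
      ErrArgoNotHealthy ErrorCode = "ARGOCD_NOT_HEALTHY"
  )

  // InstallerError wraps an underlying error with a stable code and the stage
  // where it occurred, so the log file keeps the raw detail while the user
  // sees a short, actionable summary.
  type InstallerError struct {
      Code  ErrorCode
      Stage string
      Err   error
  }

  func (e *InstallerError) Error() string {
      return fmt.Sprintf("[%s] stage %s: %v", e.Code, e.Stage, e.Err)
  }

  func (e *InstallerError) Unwrap() error { return e.Err }
  ```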

- **Output Validation:**
  We should validate the output of each step and ensure the system is in the desired state before proceeding.
  The validation logic should be shared across the different cloud provider implementations, ensuring consistent behavior across environments.

## Rationale

[A discussion of alternate approaches that have been considered and the trade-offs,
advantages, and disadvantages of the chosen approach.]

## Affected Components and Teams

- Foundational Platform Service
- CI/CD
- Documentation

## Implementation plan

- Design - interface between installer and modules, config format
- Design - Cloud upgrade
- Design - On-Prem upgrade
- Common - Implement installer framework and core logic
- Stage 0 - Interactive config helper
- Stage 1 - AWS - Reimplement as installer module
  - Implement Cloud upgrade from 3.0
- Stage 1 - On-Prem - Reimplement as installer module
  - Implement On-Prem upgrade from 3.0
- Stage 2 - Implement common pre-orch jobs (cloud)
- Stage 2 - Implement common pre-orch jobs (on-prem)
- Stage 3 - Monitor Argo CD deployment
- Nightly tests for Cloud upgrade
- Nightly tests for On-Prem upgrade
- Deployment Doc - Cloud deployment
- Deployment Doc - On-Prem deployment
- CI and release automation - installer binary

Required Resources: 7.5 FTE, 6 weeks (2 sprints)

## Open issues (if applicable)

[A discussion of issues relating to this proposal for which the author does not
know the solution. This section may be omitted if there are none.]