Commit 80ab49b

docs(design): improve orchestrator deployment experience (#290)

Co-authored-by: Andrei Palade <[email protected]>
1 parent 36a5495 commit 80ab49b

2 files changed: +247 −0 lines changed

Lines changed: 243 additions & 0 deletions

# Design Proposal: Orchestrator Deployment Experience Improvement

Author(s): Charles Chan

Last updated: May 27, 2025

## Abstract

The current orchestrator installer in 3.0 suffers from a number of architectural and operational issues that impede
development velocity, cloud support, and user experience:

- The cloud installer and infrastructure provisioning are implemented as **large monolithic shell scripts**.
  This not only makes them difficult to maintain but also renders unit testing nearly impossible.
- The installer is **only tested late in the hardware integration pipeline (HIP)**, which delays feedback and makes bugs harder to trace and resolve.
- The on-prem installer was developed in parallel but shares little code or structure with the cloud installer.
  This results in **inconsistent behaviors and duplicated logic between cloud and on-prem** deployments.
- There is **no clear boundary between infrastructure provisioning and orchestrator** setup,
  making it difficult to port components to another cloud provider or isolate issues during upgrade and testing.
- **Upgrade support was added as an afterthought** and lacks proper design.
- **Error handling is poor**; raw error logs are surfaced directly to users with no actionable remediation,
  and failures require rerunning entire stages without guidance.

This proposal aims to significantly improve the deployment experience of EMF across multiple environments (AWS, Azure, and on-prem).
The new installer will prioritize user experience by offering a streamlined, zero-touch installation process after configuration,
along with clear error handling and actionable feedback.
It will also increase cloud portability through clear infrastructure abstraction and support for Azure.
Finally, by replacing monolithic shell scripts with modular Go components and adding test coverage, we will enable faster iteration and more frequent releases.

## Proposal

### Scope of work

- A **unified installer** that supports AWS, Azure, and on-prem targets.
- A **text user interface (TUI) configuration builder** that guides users through required inputs,
  with the ability to preload values from environment variables or prior installations.
  It should minimize user input. For example, scale profiles for infra and orchestrator can be applied automatically according to the user-specified target scale.
- A **well-defined abstraction between infrastructure and orchestrator** logic that enables independent testing and upgrading,
  as well as the ability to plug in new cloud providers via Go modules.
- Every module should be able to be **toggled independently** and have minimal external dependencies.
- The installer should support **upgrade** and **uninstall** from a previous version.
- We should be able to run the on-prem installer on the machine where EMF is being deployed.
  It should not require an additional admin machine.
- Be compatible with the upcoming Azure implementation and the ongoing replacement of kind with on-prem in Coder.

### Out of scope

- (EMF-3.2) A clear **progress visualization** showing the overall progress.
- (EMF-3.2) **Diff previews** should be available during upgrade flows, showing schema migrations or configuration changes.
- (EMF-3.2) **Wrapped and actionable error messages**. Raw logs should be saved to files, and restarts should be possible from the point of failure.
- (EMF-3.2) The installer should support **orchestrator CLI integration** (e.g., `cli deploy aws`) and parallel execution of non-dependent tasks.
- Optimizing total deployment time, as current durations are acceptable.
- Full automation of post-deployment IAM/org/user configuration (users will be guided to complete this manually).

### Design Principles

#### General

- All operations must be idempotent and safe to retry.
- Unit tests will be written for each module, avoiding reliance on slow end-to-end tests.
- `pre` and `post` hooks will be supported in both the config builder and each installer stage (and possibly individual steps, TBD),
  which is useful for schema migration and backup/restore during upgrade (see the sketch after this list).
- Maintain a better hierarchy of the edge-manageability-framework top-level folder:
  - Nest `pod-configs`, `terraform`, `installer`, `on-prem-installer` under `installer`.

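The sketch below illustrates one possible Go shape for these per-stage `pre`/`post` hooks. The `Stage`, `Hook`, and `StageSpec` names, as well as the placeholder `UserConfig` and `RuntimeState` types, are assumptions made for illustration, not existing code.

```go
package installer

import "context"

// UserConfig and RuntimeState are placeholders for the shared config/state types.
type UserConfig struct{}
type RuntimeState struct{}

// Hook is an optional action that runs before or after a stage,
// e.g. a schema migration or a backup/restore step during upgrade.
type Hook func(ctx context.Context, state *RuntimeState) error

// Stage is a single unit of the installation (infrastructure, pre-orchestrator, ...).
// Run must be idempotent and safe to retry.
type Stage interface {
	Name() string
	Run(ctx context.Context, cfg *UserConfig, state *RuntimeState) error
}

// StageSpec couples a stage with its hooks.
type StageSpec struct {
	Stage     Stage
	Pre, Post []Hook
}

// runStage executes pre-hooks, the stage itself, then post-hooks, stopping at the first error.
func runStage(ctx context.Context, spec StageSpec, cfg *UserConfig, state *RuntimeState) error {
	for _, h := range spec.Pre {
		if err := h(ctx, state); err != nil {
			return err
		}
	}
	if err := spec.Stage.Run(ctx, cfg, state); err != nil {
		return err
	}
	for _, h := range spec.Post {
		if err := h(ctx, state); err != nil {
			return err
		}
	}
	return nil
}
```
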
#### Installer

- Once a configuration file is created, the installation should require no further user interaction.
- All shell scripts (e.g., `provision.sh`) will be replaced with Go code.
- Variable duplication across platforms will be eliminated using shared Terraform variable naming and outputs.
- Use of global variables and relative paths will be minimized.
- Developers should be able to modify, build, and run the installer locally.
- The same installer should be able to handle multiple deployments.
  The context of the target cluster should be derived from the supplied user config.

#### Config Builder

- Configuration will be reduced to only the fields required for the selected environment.
- The full YAML config will be rendered for user review and advanced modification.
- Prior configurations can be loaded and migrated forward during upgrades.
- Schema validation will ensure correctness before proceeding.
- The default mode should be interactive, but it should also have a non-interactive mode (see the sketch after this list).
  - e.g., CI may want to invoke the config builder to validate a configuration before kicking off the deployment.

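As an illustration of the interactive/non-interactive split, the snippet below sketches how a config-builder subcommand could be wired with cobra. The command name, flags, and helper functions (`runInteractive`, `loadAndValidate`) are assumptions for this sketch, not an existing CLI.

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func newConfigCmd() *cobra.Command {
	var (
		nonInteractive bool
		configPath     string
	)
	cmd := &cobra.Command{
		Use:   "config",
		Short: "Build or validate an installer configuration",
		RunE: func(cmd *cobra.Command, args []string) error {
			if nonInteractive {
				// CI path: load an existing file, validate against the schema, exit non-zero on error.
				return loadAndValidate(configPath)
			}
			// Default path: walk the user through the TUI and write the resulting YAML.
			return runInteractive(configPath)
		},
	}
	cmd.Flags().BoolVar(&nonInteractive, "non-interactive", false, "validate an existing config without prompting")
	cmd.Flags().StringVar(&configPath, "config", "config.yaml", "path to the user config file")
	return cmd
}

// loadAndValidate and runInteractive are placeholders for the real config-builder logic.
func loadAndValidate(path string) error { fmt.Println("validating", path); return nil }
func runInteractive(path string) error  { fmt.Println("interactive mode for", path); return nil }

func main() {
	if err := newConfigCmd().Execute(); err != nil {
		os.Exit(1)
	}
}
```
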
#### Progress Visualization / Error Handling

- A text-based progress bar will display milestones, elapsed time, estimated remaining time, and current stage.
- Stage verification will occur both before (input validation) and after (desired state validation) each module runs.
- Logs will be saved to a file and only shown to users when necessary. The default view will focus on high-level progress and status.

### Installation Workflow

![Installer Workflow](./deploy-experience-improvement.svg)

#### Stage 0: Configuration

This stage involves collecting all necessary user input at once using the TUI config helper.
The configuration is stored as a single YAML file.

Input:

- Account info, region, cluster name, cert, etc.

Output:

- `User Config` – hierarchical YAML file used in all subsequent stages.

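To make the shape of the hierarchical `User Config` more concrete, here is a minimal Go sketch of how it could be modeled; every field and type name here is an illustrative assumption rather than the final schema.

```go
package config

// UserConfig is an illustrative model of the hierarchical YAML produced by Stage 0.
type UserConfig struct {
	SchemaVersion string       `yaml:"schemaVersion"` // enables version-specific upgrade logic
	Provider      string       `yaml:"provider"`      // "aws", "azure", or "onprem"
	ClusterName   string       `yaml:"clusterName"`
	TargetScale   int          `yaml:"targetScale"` // drives automatic infra/orchestrator scale profiles
	AWS           *AWSConfig   `yaml:"aws,omitempty"`
	Azure         *AzureConfig `yaml:"azure,omitempty"`
	TLS           TLSConfig    `yaml:"tls"`
}

type AWSConfig struct {
	Region  string `yaml:"region"`
	Account string `yaml:"account"`
}

type AzureConfig struct {
	Location       string `yaml:"location"`
	SubscriptionID string `yaml:"subscriptionId"`
}

type TLSConfig struct {
	CertFile string `yaml:"certFile"`
	KeyFile  string `yaml:"keyFile"`
}
```
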
#### Stage 1: Infrastructure

Provisions the raw Kubernetes environment, storage backend, and load balancer.
The infrastructure module uses provider-specific backends (e.g., AWS, Azure, or on-prem), registered via Go interfaces.

Input:

- `User Config`
- `Runtime State` (e.g., generated network info)

Output:

- Raw Kubernetes environment
- Storage class
- Load balancer setup

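One possible shape for the provider-specific backends is a small Go interface plus a registry, as sketched below. The interface name, its methods, and the registration mechanism are assumptions made for illustration, not the final design.

```go
package infra

import (
	"context"
	"fmt"
)

// UserConfig and RuntimeState stand in for the shared config/state types.
type UserConfig struct{ Provider string }
type RuntimeState struct{ Outputs map[string]string }

// Provider abstracts a single infrastructure backend (AWS, Azure, on-prem).
// It must not know anything about orchestrator internals.
type Provider interface {
	Name() string
	// Provision creates the raw Kubernetes environment, storage class, and load balancer,
	// and records generated values (endpoints, bucket names, ...) in the runtime state.
	Provision(ctx context.Context, cfg *UserConfig, state *RuntimeState) error
	// Destroy tears the environment down; used by uninstall.
	Destroy(ctx context.Context, cfg *UserConfig, state *RuntimeState) error
}

var providers = map[string]Provider{}

// Register is called from each backend's init() so new clouds can be added as plug-in modules.
func Register(p Provider) { providers[p.Name()] = p }

// ForConfig selects the backend named in the user config.
func ForConfig(cfg *UserConfig) (Provider, error) {
	p, ok := providers[cfg.Provider]
	if !ok {
		return nil, fmt.Errorf("unknown infrastructure provider %q", cfg.Provider)
	}
	return p, nil
}
```
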
#### Stage 2: Pre-Orchestrator

Performs setup that must be completed before Argo CD can take over.
This includes injecting secrets, setting up namespaces with required labels, importing TLS certs, and installing Gitea and Argo CD.

Input:

- Kubernetes cluster from Stage 1
- `User Config` (e.g., TLS certificates)
- `Runtime State` (e.g., database master password)

Output:

- Cluster in a ready state for Argo CD bootstrapping

Design Constraint:

- Infrastructure modules and orchestrator modules must remain decoupled.
  Only the installer mediates exchange of infra-specific info via:
  - Rendered Argo CD configuration or ConfigMap (for non-sensitive values like S3 URL)
  - Kubernetes secrets (for sensitive values like credentials)

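To illustrate the handoff described in the design constraint above, the sketch below publishes a non-sensitive infra output as a ConfigMap and a sensitive one as a Kubernetes Secret using client-go. The namespace, object names, and keys are assumptions made for the example, and idempotent create-or-update handling is omitted for brevity.

```go
package preorch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// publishInfraOutputs hands infra-specific values to the orchestrator layer
// without the two modules ever importing each other.
func publishInfraOutputs(ctx context.Context, client kubernetes.Interface, s3URL string, dbPassword []byte) error {
	const ns = "orch-infra" // hypothetical namespace for installer-managed objects

	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: "infra-outputs", Namespace: ns},
		Data:       map[string]string{"s3URL": s3URL}, // non-sensitive values
	}
	if _, err := client.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{}); err != nil {
		return err
	}

	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: "infra-credentials", Namespace: ns},
		Data:       map[string][]byte{"dbMasterPassword": dbPassword}, // sensitive values
	}
	_, err := client.CoreV1().Secrets(ns).Create(ctx, secret, metav1.CreateOptions{})
	return err
}
```
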
#### Stage 3: Orchestrator Deployment

Deploys the Argo CD root app and monitors progress until all apps are synced and healthy.

Input:

- `User Config` (e.g., cluster name, target scale)
- `Runtime State` (e.g., S3 bucket name)

Output:

- All orchestrator Argo CD apps are synced and healthy
- DKAM completes the download and signing of OS profiles

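A minimal sketch of the monitoring loop for this stage is shown below. It assumes a caller-supplied helper that reads the `status.sync.status` and `status.health.status` fields of the Argo CD `Application` resources (for example via the Kubernetes dynamic client); that helper, the `AppStatus` type, and the polling interval are illustrative assumptions.

```go
package orchestrator

import (
	"context"
	"fmt"
	"time"
)

// AppStatus mirrors the two Argo CD Application status fields we care about.
type AppStatus struct {
	Name   string
	Sync   string // e.g. "Synced"
	Health string // e.g. "Healthy"
}

// waitForOrchestrator polls until every Argo CD app is Synced and Healthy, or the context expires.
// listAppStatuses is a placeholder for code that reads Application resources from the cluster.
func waitForOrchestrator(ctx context.Context, interval time.Duration,
	listAppStatuses func(context.Context) ([]AppStatus, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		apps, err := listAppStatuses(ctx)
		if err != nil {
			return err
		}
		pending := 0
		for _, a := range apps {
			if a.Sync != "Synced" || a.Health != "Healthy" {
				pending++
			}
		}
		if pending == 0 && len(apps) > 0 {
			return nil // all apps are synced and healthy
		}
		fmt.Printf("waiting for %d/%d apps to become synced and healthy\n", pending, len(apps))
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```
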
#### Stage 4: Post-Orchestrator

Provides post-deployment guidance to users on setting up IAM roles, multi-tenant organizations, and user access.

Output:

- Display helpful links and CLI instructions
- (Next release: better integration with the orchestrator CLI)

### Implementation Details

- **Secrets Management:**
  Secrets required during installation runtime will be stored in memory or in a secure state file.
  Secrets needed post-deployment will be persisted as Kubernetes secrets.

- **Configuration and State Management:**
  Both `User Config` and `Runtime State` will be stored as a single structured YAML file,
  persisted locally or in the cloud, similar to Terraform state files.
  These configurations will be versioned, enabling version-specific upgrade logic such as configuration schema and/or data migration.
  The config builder will support loading previous configurations, migrating them to the latest schema,
  and prompting for any new required attributes.
  We should leverage third-party libraries such as [Viper](https://github.com/spf13/viper) to handle configurations (see the sketch after this list).

- **Configuration Consumption:**
  Each installer module will implement a config provider that parses the *User Config* and *Runtime State* and generates module-specific configuration (e.g., Helm values, Terraform variables).

- **Upgrade Workflow:**
  During upgrade, the installer will generate a new configuration and display a diff preview to the user before proceeding.

- **Modular Infrastructure Provider Interface:**
  Infrastructure providers (AWS, Azure, On-Prem) will implement a shared Go interface and register themselves as plug-ins. This abstraction ensures separation from orchestrator logic and allows easy extension to new cloud backends.

- **Programmatic and Orchestrator CLI Integration:**
  The installer must support both CLI usage (e.g., `cli deploy aws`) and programmatic invocation for integration with other tools like the Orch CLI.

- **Parallel Execution:**
  Dependencies between steps should be explicitly defined.
  Tasks that are independent will be executed in parallel to optimize installation time (see the sketch after this list).
  This is a requirement for the standalone edge node mass provisioning.

- **Logging and Error Handling:**
  All logs will be dumped to a file automatically.
  Modules will return standardized error codes with consistent logging behavior across the system.

- **Output Validation:**
  We should validate the output of each step and ensure the system is in the desired state before proceeding.
  The validation logic should be shared across different cloud provider implementations, ensuring consistent behavior across different environments.
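
As an example of the Viper-based configuration handling mentioned above, the sketch below loads the user config, checks its schema version, and unmarshals it into a struct. The `schemaVersion` key, the version constant, and the trimmed-down `UserConfig` type are assumptions for this sketch.

```go
package config

import (
	"fmt"

	"github.com/spf13/viper"
)

// UserConfig repeats only the fields needed for this sketch; see the Stage 0 sketch
// for a fuller illustrative model. Viper matches keys to field names case-insensitively.
type UserConfig struct {
	SchemaVersion string
	Provider      string
	ClusterName   string
}

// currentSchemaVersion is the schema this installer build understands.
const currentSchemaVersion = "v1"

// Load reads a previously written user config, verifies its schema version,
// and unmarshals it for use by the installer stages.
func Load(path string) (*UserConfig, error) {
	v := viper.New()
	v.SetConfigFile(path)
	if err := v.ReadInConfig(); err != nil {
		return nil, fmt.Errorf("reading config %s: %w", path, err)
	}

	if got := v.GetString("schemaVersion"); got != currentSchemaVersion {
		// An older version would be handed to the migration logic instead of failing outright.
		return nil, fmt.Errorf("config schema %q needs migration to %q", got, currentSchemaVersion)
	}

	var cfg UserConfig
	if err := v.Unmarshal(&cfg); err != nil {
		return nil, fmt.Errorf("unmarshalling config: %w", err)
	}
	return &cfg, nil
}
```
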
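Following the parallel-execution note above, here is a minimal sketch of running independent installer tasks concurrently with `golang.org/x/sync/errgroup`; the `Task` type and the batch-based dependency handling are simplified assumptions.

```go
package installer

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Task is one unit of installation work whose dependencies have already been satisfied.
type Task struct {
	Name string
	Run  func(ctx context.Context) error
}

// runBatch executes a batch of mutually independent tasks in parallel.
// The caller is responsible for ordering batches according to the declared dependencies.
func runBatch(ctx context.Context, tasks []Task) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, t := range tasks {
		t := t // capture loop variable (pre-Go 1.22 semantics)
		g.Go(func() error {
			return t.Run(ctx)
		})
	}
	// Wait returns the first non-nil error, cancelling the shared context for the rest.
	return g.Wait()
}
```
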
## Rationale

[A discussion of alternate approaches that have been considered and the trade-offs,
advantages, and disadvantages of the chosen approach.]

## Affected components and Teams

- Foundational Platform Service
- CI/CD
- Documentation

## Implementation plan

- Design - interface between installer and modules, config format
- Design - Cloud Upgrade
- Design - On-Prem Upgrade
- Common - Implement installer framework and core logic
- Stage 0 - Interactive config helper
- Stage 1 - AWS - Reimplement as installer module
- Implement Cloud upgrade from 3.0
- Stage 1 - On-Prem - Reimplement as installer module
- Implement On-Prem upgrade from 3.0
- Stage 2 - Implement common pre-orch jobs (cloud)
- Stage 2 - Implement common pre-orch jobs (on-prem)
- Stage 3 - Monitor Argo CD deployment
- Nightly tests for Cloud upgrade
- Nightly tests for On-Prem upgrade
- Deployment Doc - Cloud deployment
- Deployment Doc - On-Prem deployment
- CI and release automation - installer binary

Required Resources: 7.5 FTE, 6 weeks (2 sprints)

## Open issues (if applicable)

[A discussion of issues relating to this proposal for which the author does not
know the solution. This section may be omitted if there are none.]
