Design proposal on converting Standalone ENs to Managed ENs #244
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,236 @@ | ||
# Design Proposal: Convert Standalone Edge Nodes to Managed Edge Nodes | ||
|
||
Author(s): Tomasz Osiński | ||
|
||
Last updated: 29.04.2025 | ||
|
||
## Abstract | ||
|
||
A Customer Journey for Open Edge Platform assumes that customers can manually deploy a set of Standalone | ||
Edge Nodes (SENs, also referred to as EMT-S nodes) that can be onboarded to the Edge Orchestrator at a later stage, once a customer is ready to scale their deployment. | ||
SENs are converted to managed Edge Nodes which, once onboarded, are fully owned by the Edge Orchestrator - customers | ||
can manage them (e.g., install clusters, applications or perform Day2 OS updates) through the Edge Orchestrator UI and API. | ||
Review comment: A cluster might already be there on the EN. It is better to leave it at "manage them".
Review comment: If we build these images, I expect that CO is able to import those clusters. Is that correct? |
||
|
||
The Customer Journey is as follows: | ||
|
||
1. A customer installs one or more Standalone Edge Nodes following the user guides. | ||
2. The customer uses the SEN to deploy K8s clusters, applications, etc. | ||
Review comment: During the first install, EMT-S already installs EMT and Kubernetes. |
||
3. The customer decides to scale out their deployment and onboard the SENs to the Edge Orchestrator. | ||
4. Once SENs are onboarded, the customer starts to use the Edge Orchestrator to manage the SENs. | ||
Review comment: It is better to create a new step: the customer installs the EMF on-prem or in the cloud to support the scale-out deployment. |
||
The customer can now use the Edge Orchestrator to manage the SENs, including installing clusters, applications, and performing Day2 OS updates. | ||
Review comment: A cluster might already be there in most cases. So we can either leave it as "manage the node" or say "import cluster and perform LCM of devices". |
||
5. The customer can now provision additional Edge Nodes via remote provisioning or manually create additional SENs and follow the same workflow to onboard them. | ||
|
||
This document describes the design of the onboarding process of Standalone Edge Nodes to managed Edge Nodes (step 3) to allow for all further steps. | ||
|
||
> NOTE: Converting standalone K8s clusters and applications to managed clusters and applications is out of scope for this document. | ||
Review comment: Better to reword as: importing clusters and existing apps on the EMT-S node is not in the scope of this design. This doc is specific to device management via EIM. |
||
|
||
## Proposal | ||
|
||
### Design goals | ||
|
||
This design aims at: | ||
|
||
- Providing a solution for onboarding Edge Nodes in a fully automated way, with minimal manual steps. | ||
- Keeping the solution OS-independent, i.e., the solution should work on any OS that is supported by the Edge Orchestrator. | ||
- Reducing the user interaction (i.e., logging into the OS, injecting USB sticks, etc.) with Edge Node machines. | ||
- Enabling the onboarding process at scale, i.e., the design should enable onboarding of multiple EMT-S nodes at once. | ||
|
||
The proposed solution is an MVP approach - some of the above goals may not be fully achieved in the first version of the design, | ||
but the design doesn't preclude achieving them in the future. | ||
|
||
### Assumptions | ||
|
||
- Customers will drive the onboarding process from a local developer machine, with monitor, keyboard and mouse. | ||
- The local developer machine will have direct access to ENs via local subnet. Customers can SSH into the ENs. | ||
- SENs being onboarded won't be equipped with monitor, keyboard or mouse. The only machine equipped with peripherals will be | ||
the local developer/admin machine. | ||
- The design will re-use the current APIs of Onboarding Manager to drive IO/NIO-based onboarding. | ||
Review comment: Better to provide links to IO and NIO.
Review comment: There is no documentation around IO/NIO. The only thing we document is the pre-registration. IO is always treated as a failure scenario, a fallback option. |
||
|
||
Review comment: Other assumptions we could add is.
Review comment: Actually I was trying to make the design OS-independent so that we could support a bring-your-own-OS model. We have too much logic that is only used for a single OS distro (like different Tink workflows for EMT and Ubuntu), and I wanted to avoid that. |
||
### Solution | ||
Review comment: Do you want this PR to capture onboarding of an Ubuntu node, where the customer can download and install EIM agents from the no-auth RS and then subsequently onboard the EN?
Review comment: Isn't that part of Standalone EN? I assumed that the day0 status is that all agents are downloaded and installed, but cannot communicate. Do you think we should cover the "standalone Ubuntu" case here as well?
Review comment: Or maybe they are just disabled and we can later create the config and then restart the agents (Ubuntu scenario). |
||
|
||
```mermaid | ||
sequenceDiagram | ||
%%{wrap}%% | ||
autonumber | ||
|
||
box LightYellow Edge Node(s) | ||
participant ctl as Standalone Onboarding CLI | ||
end | ||
|
||
participant user as User | ||
|
||
box rgb(235,255,255) Edge Orchestrator | ||
|
||
participant pa as Provisioning Nginx | ||
participant om as Onboarding Manager | ||
participant kc as KeyCloak | ||
participant inv as Inventory / API | ||
|
||
end | ||
|
||
note over ctl,user: User installs Standalone EN, decides to scale out | ||
|
||
note over pa,inv: OS profiles created beforehand, no Host/Instance pre-registration | ||
%% Review comment: Better to clarify that an OS profile is created which maps to the OS of the EMT-S nodes. |
||
|
||
user->>user: Retrieves standalone onboarding user credentials | ||
%% Review comment: Is there a separate credential for the EMT-S node? Is it not the same as for a managed EN? |
||
user->>ctl: Logs in to the EN and invokes CLI tool | ||
|
||
%% Review comment: We should also add the sequence diagram for the bulk onboarding of EMT-S nodes to the orchestrator. |
||
note over ctl: CLI triggered | ||
|
||
ctl->>ctl: Checks local OS prerequisites | ||
|
||
ctl->>ctl: Takes user inputs and validates them | ||
|
||
ctl->>+pa: Get orchestrator CA certificate | ||
pa-->>-ctl: [CA certificate] | ||
|
||
ctl->>+kc: Retrieve JWT token for standalone onboarding user | ||
%% Review comment: Why can we not use NIO registration instead of obtaining a JWT using credentials here? The NIO flow does trust the deviceinit. This should reduce user interaction. I reckon security is not a concern as we allow NIO with SB both enabled and disabled.
%% Review comment: I agree. This is something that was discussed a long time ago with PDM too, when they wanted to opt in for a lightweight FDO. |
||
kc->>-ctl: [JWT token] | ||
|
||
ctl->>+om: Invoke OnboardStandaloneService gRPC (UUID, Serial Number, OS info, ...) | ||
|
||
om->>om: Validate JWT standalone onboarding role | ||
|
||
om->>inv: Query Inventory for OS resource by OS version and distribution | ||
%% Review comment: What if the OS version does not match the OS resource?
%% Review comment: We should just return an error to the caller, so in general we can always retry. |
||
|
||
om->>om: Generate cloud-init for Standalone EN | ||
%% Review comment: Maybe we need to specifically generate only the cloud-init that is needed for the EN agents to work with the orchestrator. This ensures the cloud-init will not overwrite or undo what the user might have configured on the EN.
%% Review comment: Yeah, we should clarify what is needed for |
||
|
||
om->>inv: Create Host and Instance resource, Set EN as Onboarded and Provisioned | ||
%% Review comment: current_state. I still believe we should force a top-down registration as a prerequisite so that desired and current state always evolve together. |
||
om->>-ctl: Return cloud-init | ||
|
||
ctl->>ctl: Save cloud-init & run manually or reboot | ||
|
||
note over ctl: After OS reboot or manual cloud-init execution | ||
|
||
ctl->>ctl: Activate Edge Node Agents | ||
``` | ||
|
||
**Step 0.** A user has already provisioned a set of Standalone ENs following the user guides and decides to scale out. | ||
They don't need to have direct access to the Edge Orchestrator UI/API (could be different personas on-site vs. remote administrator). | ||
Users don't need to perform any configuration on the Edge Orchestrator beforehand, but we assume that the Edge Orchestrator supports | ||
the OS version of the Standalone ENs. The remote administrator configures a special user/role for SEN onboarding. The special user/role | ||
is associated with a single project that will determine in which tenant the SEN is imported. | ||
|
||
1. An on-site user retrieves user credentials for SEN onboarding from the remote administrator. | ||
2. An on-site user logs into the node and invokes the CLI tool that should already be installed on the EMT image. | ||
The user enters an interactive session with the CLI tool. | ||
3. The CLI tool performs initial OS prerequisites checks - for instance, it can check if the UUID and SN are properly set on the OS. | ||
Review comment: Do we plan to unify around a single CLI tool? Or is this a separate CLI? |
||
4. A user is prompted for inputs. The user will be asked for the orchestrator FQDN, proxy settings (the already set proxy settings should be presented to the user), | ||
Review comment: I was thinking that we need to provide a UX where the customer can download a payload from the Orch instance (like how we download certs today). This payload can contain certs, config and other artifacts for the EMT-S node to connect to Orch (even an OTP can be part of it). When the customer copies this payload to all ENs (e.g. bulk scp), an agent on the EN (e.g. node agent) can start to communicate with Orch and move to normal operations.
Review comment: I really like this more than IO. It somewhat mimics the FDO flow. |
||
and SEN onboarding user credentials. The CLI tool should validate that all required parameters are set and the orchestrator should be reachable at this point. | ||
Moreover, the CLI tool can ask for additional input from the user such as which Local Accounts, Site or metadata to configure for a given EN (user must select a configuration that | ||
Review comment: Minor: for any additional input that already exists in the Orchestrator, I believe it would be beneficial to present an interactive choice to the user, for example: This is very important if the persona that uses the CLI tool does not have access to the orchestrator to retrieve the
Review comment: Yeah, I've thought about that, but we would need to expose an API to Inventory on the southbound interface, or let the CLI communicate with the northbound API too, which could be feasible but would probably need a different role. Thoughts @daniele-moro @pierventre?
Review comment: We can use the NB APIs. Not sure why we would need a different role though. |
||
already exists on the orchestrator and OM should be responsible for validating if that configuration exists in the Inventory). | ||
Also, the CLI tool retrieves hardware info (UUID, Serial Number), OS info (OS version and distro from `/etc/os-release`), the current Secure Boot and Full-Disk Encryption settings, | ||
and the MAC/IP address of the management interface. | ||
5. The CLI tool downloads the CA certificate from the Edge Orchestrator. | ||
Review comment: Even to start with, we could move the certificate procurement (to the provisioning machine) to an offline mechanism and put it into the SEN rather than the SEN making an outbound call. Implicit trust has always been questionable.
Review comment: That's true, but it may require more manual steps, which the CLI automates here. |
||
Note that downloading it via the CLI may be less secure, but it is more automated. We can revisit this step and require | ||
users to provide the certificate as a user input. Also, this step is only required when the orchestrator uses a self-signed certificate. | ||
6. The CA certificate is downloaded and saved to the local filesystem for future use. | ||
7. The CLI tool communicates with the southbound Keycloak endpoint to retrieve a JWT token based on the SEN onboarding user credentials. | ||
8. The JWT token is provided to the CLI tool and used in subsequent calls to the Onboarding Manager's gRPC endpoint. | ||
9. At this point, the CLI tool has gathered and validated all user inputs and the HW/OS info. It invokes the | ||
Onboarding Manager's gRPC endpoint to trigger the onboarding process. | ||
Note that the CLI tool should retry if any step above fails on the orchestrator side and the Onboarding Manager should be able to | ||
handle partial states. | ||
10. Onboarding Manager validates the JWT role as it currently does for all gRPC APIs. | ||
11. Onboarding Manager reads the OS info and queries OS profile from Inventory based on the OS version and OS distro. | ||
The OS version should uniquely identify the OS profile. Note that this is true for EMT OS profiles, but may not be true for mutable OSes. | ||
See open issues for more considerations on the support of mutable OSes. | ||
12. The Onboarding Manager creates a dedicated cloud-init configuration for SEN. The current cloud-init library can be used. | ||
Review comment: Are the static IPs (feature planned for 3.1 and ) included in this cloud-init? Are they collected via the CLI or auto-discovered from the system?
Review comment: That's a different cloud-init. This is the default cloud-init that EIM uses for EN provisioning. The other per-EN config can be provided via another cloud-init; @niket-intc is working on the design. We can make the CLI collect what day0 configuration should be applied (local accounts, per-EN config, sites, etc.) as user inputs.
Review comment: Do we really want this? The CLI is becoming like an orchestrator. Let's focus on the requirements only. |
||
Note that the SEN will only require a subset of the current cloud-init that is generated for remote provisioning. | ||
13. Once the OS profile is found and matched, all required HW/OS info is provided, and the cloud-init is generated, the OM creates Host and Instance resources. | ||
The statuses should be set to Onboarded and Provisioned. The Host's desired and current state should be set to `ONBOARDED`. | ||
The Instance desired state should be set to `RUNNING`, while the current state should remain `UNSPECIFIED` (until BM agents are up and running). | ||
Review comment: Are you sure about this? I feel we should modify the HRM to touch only the modern status and avoid modifying the current state. I feel the current state should be owned only by the OM. |
||
14. Once the onboarding is completed, the CLI tool receives the generated cloud-init configuration. | ||
15. The CLI tool saves the cloud-init under the standard path that is used by EIM. | ||
The user can either run the cloud-init manually or reboot the system to trigger the cloud-init. | ||
Review comment: Can the CLI prompt the user to reboot the system (or invoke cloud-init directly) to complete the process?
Review comment: Yes, definitely doable. |
||
16. The cloud-init would provide all necessary configs for BM agents to start. Once they boot, they should start | ||
communicating with the orchestrator and the EN status should be changed to Running. | ||
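
The CLI-side part of steps 4 and 9 can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: the function names, the request field names, and the retry policy are all hypothetical, and the real call would go through the Onboarding Manager's gRPC API rather than a plain callable. Only the `ID` and `VERSION_ID` keys come from the standard `/etc/os-release` format.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def parse_os_release(text: str) -> Dict[str, str]:
    """Parse the KEY=value lines of /etc/os-release into a dict."""
    info: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Values may be wrapped in single or double quotes.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        info[key.strip()] = value
    return info


def build_onboarding_request(uuid: str, serial: str,
                             os_release: Dict[str, str]) -> Dict[str, str]:
    """Assemble a hypothetical request body for the onboarding call (step 9)."""
    missing = [name for name, val in (("uuid", uuid), ("serial", serial)) if not val]
    if missing:
        raise ValueError(f"missing hardware info: {', '.join(missing)}")
    return {
        "uuid": uuid,
        "serial_number": serial,
        "os_distro": os_release.get("ID", ""),
        "os_version": os_release.get("VERSION_ID", ""),
    }


def call_with_retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the user
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("attempts must be >= 1")
```

Retrying only makes sense if the orchestrator side tolerates repeated requests for the same node, which is why step 9 asks the Onboarding Manager to handle partial states.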
|
||
## Rationale | ||
|
||
### Considerations on how to trigger SEN onboarding | ||
|
||
The design choice is to use a simple CLI that will be invoked by a local on-site admin to trigger SEN onboarding. | ||
We discussed the following alternatives: | ||
|
||
1. **SEN onboarding triggered by injecting a USB stick with a configuration file.** | ||
2. **SEN onboarding triggered by copying a configuration file under a well-defined OS path** | ||
|
||
Both approaches require having a local daemon scanning for USB sticks or on-disk config files. A local daemon is a | ||
"heavier" solution as it would require special security permissions and more complex lifecycle management via systemd. | ||
It is also less interactive, so intermediate errors may be harder to debug (in contrast, the CLI tool guides users through the process). | ||
|
||
Both options are still valid for future releases, but for now we decided to go with the CLI tool as the MVP solution. | ||
|
||
### Considerations on the user workflow | ||
|
||
There are two major workflows we support - bottom-up (IO) and top-down (NIO, requires EN pre-registration). | ||
Review comment: We should not use IO/NIO. Let's substitute them with just EMT-S onboarding with/without pre-registration. |
||
|
||
In this ADR we chose the bottom-up approach as it requires fewer manual steps: the user logs in to the EN and | ||
runs the entire workflow from the EN itself, without the need to access the Edge Orchestrator UI/API. | ||
|
||
Also, the IO flow doesn't require any modifications to UI. | ||
Review comment: Can you elaborate here? |
||
|
||
### Considerations on scaling the SEN onboarding | ||
|
||
The current design assumes that the user will SSH into a Standalone EN and trigger onboarding one by one (it can still be automated by a script). | ||
Depending on the customers' requirements, we can provide a "Bulk Onboarding Tool" that will allow onboarding multiple ENs at once. | ||
Review comment: Or maybe we can automate this with Ansible? |
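
As an illustration of the "automated by a script" option, the sketch below fans the per-node onboarding out over SSH from the local admin machine. It is only a sketch: the CLI name `sen-onboard` and its `--non-interactive` flag are placeholders (the proposal names neither), and it assumes key-based SSH access and a CLI mode that takes its inputs without prompting.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def ssh_onboard(host: str) -> bool:
    """Run the (hypothetical) onboarding CLI on one node over SSH."""
    # "sen-onboard" and "--non-interactive" are placeholder names, not a real CLI.
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, "sen-onboard", "--non-interactive"],
        capture_output=True,
        text=True,
        timeout=300,
    )
    return result.returncode == 0


def bulk_onboard(hosts: Iterable[str],
                 onboard: Callable[[str], bool] = ssh_onboard,
                 max_workers: int = 8) -> Dict[str, bool]:
    """Onboard several nodes in parallel and report success per host."""
    host_list: List[str] = list(hosts)
    if not host_list:
        return {}

    def safe(host: str) -> bool:
        # A failure on one node (timeout, unreachable host) must not abort the batch.
        try:
            return onboard(host)
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(safe, host_list))
    return dict(zip(host_list, outcomes))
```

Returning a per-host result map lets the operator re-run the tool against only the nodes that failed.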
||
|
||
### How to map local OS users to Local Accounts? | ||
|
||
For now, we won't import existing OS users as local accounts. If we support NIO in the future, | ||
we can add support for defining local accounts that should be configured on the pre-provisioned SEN that is onboarded. | ||
Review comment: You can create the default local account though, if it is configured for the project. |
||
|
||
## Affected components and Teams | ||
|
||
### Impact on OS profiles | ||
|
||
The design assumes that the OS profile used by the SEN already exists in the Edge Orchestrator. | ||
Therefore, we will provide a user guide on how to create OS profiles before scaling out SEN deployment. No UI support is planned for now. | ||
|
||
However, this has the following limitations: 1) the OS profile must always be supported by the Edge Orchestrator, | ||
whereas the SEN may run a customized OS image that is not supported by the Edge Orchestrator, | ||
2) it requires an additional step for users to create OS profiles in the Edge Orchestrator before onboarding (if they don't exist). | ||
|
||
See open issues for more considerations on this topic. | ||
|
||
### Impact on Multi-Tenancy and IAM | ||
|
||
There should be a dedicated IAM role that is only allowed to onboard standalone Edge Nodes to the Edge Orchestrator. | ||
It should be a different role than the current onboarding role that is used for remote onboarding. | ||
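
A role check of this kind could look roughly like the sketch below. It assumes the role appears in Keycloak's `realm_access.roles` claim of the access token; the role name `sen-onboarding` is hypothetical. The decoder deliberately skips signature verification to keep the example short, so it shows only the claim lookup and must not be used as-is for authorization.

```python
import base64
import json
from typing import Any, Dict

SEN_ONBOARDING_ROLE = "sen-onboarding"  # hypothetical role name


def decode_claims_unverified(token: str) -> Dict[str, Any]:
    """Decode the payload of a compact JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a three-part compact JWT")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def has_role(claims: Dict[str, Any], role: str) -> bool:
    """Return True if the given realm role is present in the token claims."""
    realm = claims.get("realm_access")
    roles = realm.get("roles", []) if isinstance(realm, dict) else []
    return role in roles
```

Keeping the SEN role distinct from the remote onboarding role means a leaked SEN credential cannot be used to drive the remote provisioning flow.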
|
||
### Impact on the Edge Microvisor Toolkit | ||
|
||
The design will require a new CLI tool that will be baked into the EMT image. | ||
|
||
## Implementation plan | ||
|
||
The implementation targets 3.1. The implementation will include: | ||
- Modifications to the Onboarding Manager to support onboarding of Standalone ENs | ||
- New CLI tool that is developed as part of infra-onboarding or integrated with EMF CLI tool | ||
- New SPEC file to the Edge Microvisor Toolkit to bake the new CLI tool into the OS image | ||
|
||
The implementation will be done by the EIM team. | ||
|
||
## Open issues (if applicable) | ||
|
||
- The OS version of Standalone ENs deployed by customers may not have a corresponding OS Profile in the Edge Orchestrator. | ||
For instance, the customer may have deployed a Standalone EN with a custom EMT image that never existed in the Edge Orchestrator. | ||
Another example is a customer that has deployed a Standalone EN with an old OS image that is not supported by the Edge Orchestrator | ||
(which has already been upgraded to use newer versions). Yet another issue is that the OS version from `/etc/os-release` | ||
will not uniquely identify the OS image if that's a mutable OS (Ubuntu case), so the Onboarding Manager is unable to query or validate a mutable OS profile | ||
based on info provided from SENs. | ||
There are possible solutions to this: | ||
- Users should be able to create their own OS Profiles that use the custom OS image that was used for Standalone ENs. | ||
Review comment: Shall we put some guardrails in the OS profile? For example, prevent provisioning or A/B updates based on the type of the profile.
Review comment: Maybe add STANDALONE_OS as a type and force the user to use that. |
||
This will allow them to scale out with any custom OS image they used, but requires additional steps from the user to "prepare" | ||
the Edge Orchestrator. Also, there might be a problem of lack of compatibility of new OS Profiles with the old OS images from the same OS family, | ||
resulting in, for example, failed A/B updates. | ||
- Relax the scope of OS Profile. Currently, OS Profiles define the exact version of OS image. We could relax that | ||
and make OS Profile define "the OS family" - possible A/B updates would only be possible within the same OS family. | ||
Users may also be able to create their own OS Profiles that define the OS family. In this way, the Standalone EN will only | ||
be registered to a broad OS family, allowing for any OS version within that family. However, it won't allow | ||
users to provision new ENs with the same OS image that they used for Standalone Edge Node. | ||
- Relax the OS profile even more and make it a Day0 construct only. This means that the OS profile will be used | ||
at Day0 to provision ENs, but all OS info will be retrieved from running OS and reported to the orchestrator. | ||
This will help easily support any OS distro in the future, but requires refactoring of immutable OS A/B update workflow as | ||
it heavily relies on the OS profiles' versioning now. |
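
To make the difference between the exact-version lookup and the "OS family" relaxation concrete, here is a small sketch. The `OSProfile` type and its fields are invented for illustration and do not reflect the Inventory schema; a profile with `version=None` stands in for a family-level profile.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class OSProfile:
    name: str
    distro: str
    version: Optional[str]  # None marks a hypothetical family-level profile


def match_profile(profiles: List[OSProfile], distro: str,
                  version: str) -> Optional[OSProfile]:
    """Prefer an exact (distro, version) match, then fall back to the OS family."""
    exact = [p for p in profiles if p.distro == distro and p.version == version]
    if len(exact) > 1:
        # The version does not uniquely identify a profile (the mutable OS case).
        raise LookupError(f"ambiguous OS profile for {distro} {version}")
    if exact:
        return exact[0]
    family = [p for p in profiles if p.distro == distro and p.version is None]
    return family[0] if len(family) == 1 else None
```

With only exact matching, a node running an unknown version is rejected; with the family fallback it is accepted, at the cost of losing the precise image reference needed to provision identical nodes.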
Review comment: It is better to rename SEN to EMT-S node.