design: Scale provisioning of EMT-S nodes #250

krishnajs · 2025-04-30T14:00:46Z

Description

This PR adds a design proposal for scale provisioning of EMT-S edge nodes, supporting the OXM workflow.

Any Newly Introduced Dependencies

N/A

How Has This Been Tested?

N/A

Checklist:

I agree to use the APACHE-2.0 license for my code changes
I have not introduced any 3rd party dependency changes
I have performed a self-review of my code

github-actions

Thank you for your contribution! Please make sure to review our Contributing Guide.

design-proposals/emts-scale-provisioning.md

daniele-moro · 2025-05-07T09:27:54Z

design-proposals/emts-scale-provisioning.md

+The provisioning of EMT-S at scale will be driven by a local orchestrator instance that will be slimmed down to include only necessary components. In a nutshell,
+it will consist of:
+
+- Foundational Platform Services that are required to deploy and run the orchestrator instance.


Argo? How do we handle the deployment of the orchestrator?

Yes, let's assume for now that this is deployed via the EIM-Standalone profile you are working on, @daniele-moro, so effectively we leverage the orchestrator on-prem installer.

I strongly advise against creating a dedicated SKU for this, as we did in a previous release.

In a parallel work effort, I proposed that we make Orchestrator configurable at initial deployment time AND runtime.

In this effort, I propose that instead of a separate variant of Orchestrator, we utilize this proposed dynamically configurable Orchestrator and explicitly configure it to disable unnecessary features.

cc @charlesmcchan

Yes, no variant is intended. A profile is what we are going to use for now; further optimizations can be achieved in the future.

There is already a way to do that via Argo CD profile, but it is not as straightforward as it can be. I will make "the ability to enable/disable components easily" an explicit requirement for the new installer that we are working on.

@charlesmcchan @se-chris-thach I want to test EIM-only/standalone profile, can I start creating a profile or should I use a different approach?

design-proposals/emts-scale-provisioning.md

Andrea-Campanella · 2025-05-07T11:31:44Z

design-proposals/emts-scale-provisioning.md

+The provisioning of EMT-S at scale will be driven by a local orchestrator instance that will be slimmed down to include only necessary components. In a nutshell,
+it will consist of:
+
+- Foundational Platform Services that are required to deploy and run the orchestrator instance.


Yes, let's assume for now that this is deployed via the EIM-Standalone profile you are working on, @daniele-moro, so effectively we leverage the orchestrator on-prem installer.

design-proposals/emts-scale-provisioning.md

se-chris-thach

Can we please add detail around the security considerations for this new design?

Ram-srini · 2025-05-12T10:20:02Z

design-proposals/emts-scale-provisioning.md

+- Deploy the provisioning service on the local network that support PXE Boot (BIOS/UEFI with DHCP + TFTP) boot and iPXE with HTTPs.  
+- Have a UX to pre-register BareMetal edge nodes using Serial number or UUID or MAC address.
+- Provision different OS profiles to different edge nodes selected based on Serial number or UUID or MAC address.
+- Provision default OS when a device on the LAN boots over PXE and is not pre-registered.


What does "default OS" mean here? Are we also supporting Ubuntu installation in EMT-S?

No, but there can be multiple versions of EMT-S, and we will have a default one in this case. Also, this can be extended to other immutable ISO images.
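
As a rough illustration of the selection rule discussed in this thread (a pre-registered node gets its assigned OS profile, anything else falls back to a default EMT-S version), a minimal Python sketch might look like the following. The registry layout, the profile names, and the `select_os_profile` helper are hypothetical and are not taken from the EIM code base.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Registration:
    """Hypothetical pre-registration record for one edge node."""
    os_profile: str


def select_os_profile(
    registry: Dict[str, Registration],
    serial: Optional[str],
    uuid: Optional[str],
    default_profile: str,
) -> str:
    """Return the OS profile for a booting node.

    Identifiers are tried in order (serial number, then UUID). A node that
    matches neither is treated as not pre-registered and gets the default.
    """
    for identifier in (serial, uuid):
        if identifier is None:
            continue
        # Assumption: registry keys are stored trimmed and lower-cased.
        record = registry.get(identifier.strip().lower())
        if record is not None:
            return record.os_profile
    return default_profile


# Hypothetical example data; profile names are placeholders.
REGISTRY = {
    "sn-0001": Registration(os_profile="emt-s-profile-a"),
    "4c4c4544-0000-1010-8000-000000000001": Registration(os_profile="emt-s-profile-b"),
}
```

For example, `select_os_profile(REGISTRY, "SN-0001", None, "emt-s-default")` would resolve to the profile registered for that serial number, while an unknown serial number and UUID would resolve to `"emt-s-default"`.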

design-proposals/emts-scale-provisioning.md

Andrea-Campanella · 2025-05-12T11:34:55Z

design-proposals/emts-scale-provisioning.md

+- Deploy the provisioning service on the local network that support PXE Boot (BIOS/UEFI with DHCP + TFTP) boot and iPXE with HTTPs.  
+- Have a UX to pre-register BareMetal edge nodes using Serial number or UUID or MAC address.
+- Provision different OS profiles to different edge nodes selected based on Serial number or UUID or MAC address.
+- Provision default OS when a device on the LAN boots over PXE and is not pre-registered.


No, but there can be multiple versions of EMT-S, and we will have a default one in this case. Also, this can be extended to other immutable ISO images.

design-proposals/emts-scale-provisioning.md

Andrea-Campanella · 2025-05-12T11:39:41Z

design-proposals/emts-scale-provisioning.md

+In other words, the only difference between the new PXE-based boot and HTTP-based boot is how the OS provisioning is triggered. The subsequent workflow remains the same -
+it leverages Micro-OS, device discovery and Tinkerbell workflow to complete OS provisioning.
+
+**NOTE1**: Customers should provide their own local DHCP server for dynamic IP address assignment.


This needs ample documentation.

Also, I can't see it in the diagram below. Even though the DHCP request is intercepted by SMEE, we ought to clarify this and add it to the diagram.

design-proposals/emts-scale-provisioning.md

pierventre · 2025-05-12T17:01:28Z

design-proposals/emts-scale-provisioning.md

+
+**NOTE1**: Customers should provide their own local DHCP server for dynamic IP address assignment.
+
+**NOTE2**: The workflow assumes that the EIM Standalone is deployed locally on customers' premises and ENs have direct connectivity with EIM services.


shall we say something about the interaction with external (not managed by us) DHCP servers?

I'll add a diagram with a sample network topology, but we're not really making any changes to network requirements. We have been relying on a customer-managed DHCP server for a long time already.

I believe we need to tell the user to change a few things in their DHCP server too? Don't we need the next-server option?

next-server is provided by SMEE's DHCP server which is configured in proxy DHCP mode

If we enforce any change to DHCP servers, this design proposal wouldn't make sense to me

OK, if you believe it is not necessary, but this is what I found to make them coexist:

1. A PXE client sends a DHCP request (broadcast).
2. The external DHCP server responds with an IP address, the next-server (SMEE server IP), and the filename (iPXE binary).
3. Tinkerbell SMEE (in proxy DHCP mode) responds with PXE-specific options (e.g., boot filename and TFTP/HTTP server details).
4. The PXE client downloads the iPXE binary and executes the boot script provided by Boots.

Not sure if this is suggested for network race conditions
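
To make the coexistence scenario above more concrete, here is a minimal Python sketch of how a PXE client could combine two answers: an address lease from the customer's DHCP server and boot information from a proxy DHCP responder such as SMEE. Packets are modelled as plain dictionaries; the `DhcpRequest` model, the helper names, and the field names are illustrative assumptions, not SMEE's actual implementation. The sketch assumes the usual PXE convention that clients identify themselves with a vendor class (DHCP option 60) beginning with `PXEClient`, and it does not settle whether the external server also needs to send next-server.

```python
from dataclasses import dataclass
from typing import Dict, Optional

PXE_VENDOR_PREFIX = "PXEClient"  # prefix of DHCP option 60 sent by PXE firmware
ARCH_BIOS_X86 = 0                # DHCP option 93 value for legacy BIOS x86


@dataclass(frozen=True)
class DhcpRequest:
    """Only the request fields this sketch needs (hypothetical model)."""
    vendor_class: str  # DHCP option 60
    arch: int          # DHCP option 93 (client system architecture)


def proxy_offer(req: DhcpRequest, boot_server: str,
                bios_file: str, uefi_file: str) -> Optional[Dict[str, str]]:
    """Boot-only answer of a proxy DHCP responder; it never assigns an IP."""
    if not req.vendor_class.startswith(PXE_VENDOR_PREFIX):
        return None  # not a PXE client, so the proxy stays silent
    # Simplification: every non-BIOS architecture is treated as UEFI.
    filename = bios_file if req.arch == ARCH_BIOS_X86 else uefi_file
    return {"next_server": boot_server, "filename": filename}


def merge_offers(lease: Dict[str, str],
                 boot: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Combine the address lease with the boot information."""
    merged = dict(lease)
    if boot is not None:
        # In this sketch, boot fields from the proxy take precedence.
        merged.update(boot)
    return merged
```

With a lease of `{"ip": "192.0.2.50"}` and a BIOS client, calling `proxy_offer(req, "192.0.2.10", "undionly.kpxe", "ipxe.efi")` and then `merge_offers` yields one view containing the IP address, the boot server, and the BIOS binary name. The addresses and file names here are placeholders.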

let me do some more experiments to confirm once I have my target on-prem environment running

pierventre · 2025-05-12T18:27:49Z

design-proposals/emts-scale-provisioning.md

+
+**NOTE2**: The workflow assumes that the EIM Standalone is deployed locally on customers' premises and ENs have direct connectivity with EIM services.
+
+The high-level PXE-based provisioning workflow is as follows:


Also, in the sequence, say something about external DHCP interactions. Or create a small paragraph with the assumptions and the checklist users should comply with:

Boot works in proxy dhcp mode only

External DHCP server uses DHCP options....

pierventre · 2025-05-12T18:36:17Z

design-proposals/emts-scale-provisioning.md

+
+3. **Expose Tinkerbell SMEE's DHCP/TFTP server via K8s External IP**
+
+Tinkerbell SMEE's DHCP/TFTP server must be reachable from a local, on-prem L2 network to handle DHCP/TFTP requests. This requires modifications to FPS services.


Is it not enough to use hostNetwork for ports 67-69? Which modifications do we expect?

Perhaps. I'll work on another PoC to make SMEE work in the on-prem deployment.

pierventre · 2025-05-12T18:46:19Z

design-proposals/emts-scale-provisioning.md

+#### Alternative workflow with managed EMF
+
+In some cases it may be desirable for users to scale EMT-S provisioning via legacy PXE, but the EMF (or EIM) is deployed in a central location.
+This solution also enables the PXE boot when the entire EMF (or EIM Standalone) is deployed as remote, cloud-based, managed solution (not on-prem).
+
+In this alternative there is only a small local piece of EIM (called "EIM-local" hereinafter) deployed on site (via installation script provided by EIM team).
+The EIM-local assists in the initial PXE boot and make possible for ENs to initiate boot via PXE. Once booted into Micro-OS, the provisioning process is taken over by the cloud-based orchestrator.
+
+The EIM-local consists of the following components:
+1. **Standalone Tinkerbell SMEE** providing DHCP/TFTP server to support legacy PXE boot.
+2. **Local HTTP server** (e..g, Nginx) storing mirrors of `boot.ipxe` and Micro-OS image.
+3. (OPTIONAL) **K8s cluster with MetalLB extension** to make Standalone SMEE's DHCP/TFTP servers accessible from a local network. Only needed if EIM-local is deployed on top of Kubernetes.
+   Note that the EIM-local can also be deployed as standalone OS services or Docker containers with `--network=host`.
+
+**NOTE1:** Local HTTP server providing `boot.ipxe` and Micro-OS image is needed to overcome HTTPS issue as
+SMEE's built-in iPXE doesn't include EMF's CA certificate. This alternative design assumes no modifications to Tinkerbell SMEE, for simplicity.


Honestly, I would argue that improving DKAM is a better option here: we basically empower DKAM to curate for any arbitrary orchestrator (PROVISIONING_URL, ORCH_CA).

Of course I prefer the 2nd option below

pierventre · 2025-05-12T20:18:21Z

design-proposals/emts-scale-provisioning.md

+  - No Kyverno for policies
+- No UI deployed, the EN pre-registration is done via Bulk Import Tool, API or CLI tool.
+- Limited observability stack that should only provide logs. No alerting, SRE, Prometheus.
+- Abandon HA requirements - EIM-S is primarily a single-node, on-prem deployment. Any data loss can be backed by hardware-level redundancy. Think of EIM-S as Rancher Desktop or Virtualbox.


We don't have any HA mechanism when deployed locally.

Even in the observability stack?

Ram-srini · 2025-05-13T03:55:01Z

design-proposals/emts-scale-provisioning.md

+2. **Expose Provisioning Nginx via HTTP for on-prem EIM standalone only**
+
+Currently, Provisioning Nginx is deployed behind HTTPS and the iPXE binary must be built with orchestrator's CA certificate for successful handshake.
+Tinkerbell SMEE obviously doesn't include the EMF's CA certificate. Therefore, any communication over HTTPS is not feasible.


It would be good to check with the SAFE team before going ahead with HTTP, provided the local network is protected by a firewall.

Had a chat with @haribabug. He will provide his feedback on this ADR. I'm going to expand the network requirements section. Basically, we rely on being in a local L2 network that is isolated (without any inbound connections from outside).

Is it worth using Smee if it is going to cause a divergence in security policy and deployment process like this?

There are many other off-the-shelf DHCP servers out there (dnsmasq, udhcpd, Kea, etc.) that could fill this gap, or we could see if Smee could be modified.

zdw · 2025-05-14T15:42:14Z

design-proposals/emts-scale-provisioning.md

+- Deploy the provisioning service on the local network that support PXE Boot (BIOS/UEFI with DHCP + TFTP) boot and iPXE with HTTPs.  
+- Have a UX to pre-register BareMetal edge nodes using Serial number or UUID.
+- Provision different OS profiles to different edge nodes selected based on Serial number or UUID.
+- Provision default OS when a device on the LAN boots over PXE and is not pre-registered.


Clarify that the "default OS when not pre-registered" is different from the "bottom up" registration workflow: the EN would not be registered at the end of this process; it would just have the default OS installed.

Maybe call this "Anonymous OS install" or something to specify that it isn't registered.

Also, in this case the non-authenticated OS install would not send logs to EMF, correct?

@krishnajs is this requirement still valid with the pre-registration flow? I think we should remove it.

design-proposals/emts-scale-provisioning.md

zdw · 2025-05-14T15:50:18Z

design-proposals/emts-scale-provisioning.md

+In other words, the only difference between the new PXE-based boot and HTTP-based boot is how the OS provisioning is triggered. The subsequent workflow remains the same -
+it leverages Micro-OS, device discovery and Tinkerbell workflow to complete OS provisioning.
+
+**NOTE1**: Customers should provide their own local DHCP server for dynamic IP address assignment.


What configuration will be needed to perform this? Typically, any DHCP server requires defining a subnet (or multiple subnets, which can have different configs), so we would need the user to provide this information.

Ideally we could reconfigure this if IP ranges change, without requiring a reinstall.
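
As one possible shape for the user-provided information mentioned above, the sketch below validates a hypothetical subnet definition with Python's standard `ipaddress` module. The `SubnetConfig` fields and the `validate_subnets` helper are assumptions for illustration only; the proposal does not define a configuration schema.

```python
import ipaddress
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SubnetConfig:
    """Hypothetical user-supplied subnet definition."""
    cidr: str         # e.g. "192.168.10.0/24"
    range_start: str  # first address handed out to edge nodes
    range_end: str    # last address handed out to edge nodes


def validate_subnets(configs: List[SubnetConfig]) -> List[str]:
    """Return human-readable problems; an empty list means the input is valid."""
    problems: List[str] = []
    seen = []  # networks accepted so far, used for the overlap check
    for cfg in configs:
        try:
            net = ipaddress.ip_network(cfg.cidr, strict=True)
            start = ipaddress.ip_address(cfg.range_start)
            end = ipaddress.ip_address(cfg.range_end)
        except ValueError as exc:
            problems.append(f"{cfg.cidr}: {exc}")
            continue
        if start not in net or end not in net:
            problems.append(f"{cfg.cidr}: range is outside the subnet")
        elif start > end:
            problems.append(f"{cfg.cidr}: range start is after range end")
        for other in seen:
            if net.overlaps(other):
                problems.append(f"{cfg.cidr}: overlaps {other}")
        seen.append(net)
    return problems
```

Because the check is a pure function over the submitted list, it could in principle be run again whenever the IP ranges change, which fits the goal of reconfiguring without a reinstall.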

zdw · 2025-05-14T15:58:53Z

design-proposals/emts-scale-provisioning.md

+2. **Expose Provisioning Nginx via HTTP for on-prem EIM standalone only**
+
+Currently, Provisioning Nginx is deployed behind HTTPS and the iPXE binary must be built with orchestrator's CA certificate for successful handshake.
+Tinkerbell SMEE obviously doesn't include the EMF's CA certificate. Therefore, any communication over HTTPS is not feasible.


Is it worth using Smee if it is going to cause a divergence in security policy and deployment process like this?

There are many other off-the-shelf DHCP servers out there (dnsmasq, udhcpd, Kea, etc.) that could fill this gap, or we could see if Smee could be modified.

design-proposals/emts-scale-provisioning.md

teone · 2025-05-16T17:16:14Z

design-proposals/emts-scale-provisioning.md

+  note over microos,inv: OS provisioning continues with the standard flow, through Micro-OS and Tinkerbell workflow
+```
+
+1. Users perform standard EN registration via UI or Bulk Import Tool. For each EN we can define OS profile and additonal settings (e.g., site, local account).


Users perform standard EN registration via UI

The registration flow is not the standard one, as whether the Cluster is installed or not depends entirely on the selection of the OS profile (via cloud-init). We should make sure that the user only has the <project_id>_ir/irw roles so that the UI can hide the Cluster creation steps from the registration/provisioning flow.

cc @satya-in @soniabha-intc

Andrea-Campanella · 2025-05-16T17:23:23Z

design-proposals/emts-scale-provisioning.md

+1) **Support legacy PXE boot to scale EMT-S provisioning** - the EIM will be extended with a local DHCP/TFTP server that helps initiate OS provisioning via legacy PXE boot.
+   The legacy PXE boot uses the devices' PXE firmware to bootstrap into iPXE, which then continues the EN provisioning process.
+2) **Extend usage of Platform Bundle to be compatible with USB-based EMT-S** - to be consistent with USB-based EMT-S, the EIM will consume Platform Bundle that contains all the scripts and files to install standalone K8s cluster and other customizations.
+3) **Use EIM standalone deployment** - the proposed solution involves deploying a specific EMF profile, referred to as EIM standalone profile. 


In 3.1 we deploy the whole orchestrator in the OXM flow, reducing the resource requirements (TODO: measure the resources needed for 100 within 5 minutes, and also with unlimited resources, to get the best numbers we can). We use roles to hide App and CO elements in the UI, and the default MT user, projects and org are created automatically via a module during installation. A PoC of an Infra-only profile of EMF, with "extreme" simplification but still using the orchestrator as a whole, is not part of 3.1. The reason behind the decision to keep the whole of EMF is the limited resource/performance/time and UX gain compared to the effort we would need to invest in maintenance. Daniele's PoC will be revisited for a possible 3.2 implementation.

As per the discussion, please move this lower in the ADR and advertise it as a PoC.

Andrea-Campanella · 2025-05-16T17:24:15Z

design-proposals/emts-scale-provisioning.md

+Also, an alternative to using SMEE was considered to avoid divergence in security policies (HTTP vs. HTTPS for Provisioning Nginx).
+This direction requires more development effort, but is still left as future improvement. It's further elaborated in a separate [design proposal](https://github.com/open-edge-platform/edge-manageability-framework/pull/309).
+
+## Affected components and Teams


@daniele-moro to add the role that prevents users from accessing App and CO.

krishnajs added 2 commits April 29, 2025 11:56

Create emts-scale-provisioning.md

1aba0c2

Update emts-scale-provisioning.md

c275644

github-actions bot reviewed Apr 30, 2025

osinstom reviewed Apr 30, 2025

draft proposal

a545042

osinstom changed the title ~~Sen scale proposal~~ design: Scale provisioning of EMT-S nodes May 6, 2025

Merge branch 'main' into sen-scale-proposal

ce22269

pierventre reviewed May 6, 2025

pierventre reviewed May 6, 2025

pierventre reviewed May 6, 2025

pierventre reviewed May 6, 2025

update

71c264b

daniele-moro reviewed May 7, 2025

Andrea-Campanella reviewed May 7, 2025

se-chris-thach reviewed May 7, 2025

krishnajs added 4 commits May 8, 2025 10:22

Update emts-scale-provisioning.md

4e22e1c

Update emts-scale-provisioning.md

e17dab2

Update emts-scale-provisioning.md

1d4ac0c

Update emts-scale-provisioning.md

a2f5819

ajaythakurintel added the Proposal label May 9, 2025

osinstom added 2 commits May 9, 2025 14:54

update

f3ec5cb

update EIM-S section

63f9524

osinstom marked this pull request as ready for review May 9, 2025 13:57

osinstom requested a review from mdbalvin as a code owner May 9, 2025 13:57

Merge branch 'main' into sen-scale-proposal

5740676

osinstom requested review from rad-szulim, yi-tseng-intel, ashridatta, teone and callumnobleintel as code owners May 9, 2025 13:57

Ram-srini reviewed May 12, 2025

Andrea-Campanella reviewed May 12, 2025

pierventre reviewed May 12, 2025

Ram-srini reviewed May 13, 2025

osinstom and others added 8 commits May 13, 2025 10:39

Update design-proposals/emts-scale-provisioning.md

1adeeed

Co-authored-by: Andrea Campanella

Update design-proposals/emts-scale-provisioning.md

4a152b3

Co-authored-by: Andrea Campanella

Update design-proposals/emts-scale-provisioning.md

72bef0c

Co-authored-by: Andrea Campanella

Update design-proposals/emts-scale-provisioning.md

a483dcc

Co-authored-by: Andrea Campanella

up

0f9a41d

Merge branch 'sen-scale-proposal' of github.com:open-edge-platform/edge-manageability-framework into sen-scale-proposal

7f327f4

up

8214026

up

e4917bc

osinstom mentioned this pull request May 13, 2025

design: PXE-based provisioning with managed EMF #309

Open

3 tasks

osinstom added 3 commits May 13, 2025 11:40

cleanup

02552a9

wip

712facc

update

ddbc0c0

zdw reviewed May 14, 2025

daniele-moro reviewed May 15, 2025

osinstom added 5 commits May 15, 2025 11:23

fix typo

c8ff347

add clarifications and note about detach

5bf35ff

Merge branch 'main' into sen-scale-proposal

0398f79

clarify agents activation

a090af0

clarify infra-managers

27282eb

teone reviewed May 16, 2025

Andrea-Campanella reviewed May 16, 2025

wip

74a2b5c


