design: Scale provisioning of EMT-S nodes #250
Merged
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,142 @@ | ||
# Design Proposal: Scale provisioning of EMT-S edge node supporting OXM workflow | ||
|
||
Author(s): EIM team. | ||
|
||
Last updated: 29.04.2025 | ||
|
||
## Abstract | ||
|
||
The Edge Microvisor Toolkit Standalone (EMT-S) node is designed to enable enterprise customers and developers to evaluate Intel silicon-based platforms for Edge AI use cases. In this context, Original Equipment Manufacturers (OXMs) play a critical role by preparing edge nodes in bulk for end customers at their facilities before shipping them to the deployment locations. To support the OXM workflow for Edge AI use cases, it is essential to implement a scalable provisioning solution for multiple EMT-S nodes. This document outlines the design proposal for enabling large-scale provisioning of EMT-S edge nodes to meet the requirements of the OXM workflow. | ||
|
||
### Proposal Summary | ||
|
||
|
||
EIM leverages the Tinkerbell solution for provisioning operating systems on edge nodes. The current implementation supports OS provisioning via bootable USB or iPXE/HTTPS boot. To enable scalable provisioning of EMT-S edge nodes for the OXM workflow, this proposal suggests integrating PXE-based provisioning into EIM by utilizing the `smee` (formerly `boots`) component of Tinkerbell. Additionally, EMF and EIM will support configurations to deploy this as a standalone EIM solution (referred to below as EIM-S), tailored for OXMs to efficiently provision edge nodes at scale. OXMs will have the option of provisioning edge nodes using bootable USB, iPXE/HTTPS boot, or PXE-based provisioning. The solution will also include a user experience (UX) for pre-registering edge nodes using serial numbers, UUIDs, or MAC addresses. Furthermore, the solution will support the provisioning of different operating system profiles based on the selected identifiers. In cases where a device on the local area network (LAN) boots over PXE and is not pre-registered, the default operating system will be provisioned. | |
|
||
|
||
### MVP requirements | ||
|
||
The following are the MVP requirements for scale provisioning of EMT-S edge nodes supporting the OXM workflow: | |
|
||
- Provision multiple bare-metal edge nodes without onboarding, for standalone/singleton use. | |
- Deploy a service on the local network that can perform this provisioning at scale. | |
- The provisioning service deployed on the local network must support PXE boot (BIOS/UEFI with DHCP + TFTP) and iPXE with HTTPS. | |
|
||
- Have a UX to pre-register bare-metal edge nodes using serial number, UUID, or MAC address. | |
|
||
- Provision different OS profiles to different edge nodes, selected by serial number, UUID, or MAC address. | |
- Provision the default OS when a device on the LAN boots over PXE and is not pre-registered. | |
|
||
- Have a UX for collecting provisioning logs and the status of edge nodes. | |
|
||
|
||
> Note: It might be possible for EMF-EIM to support provisioning of the EMT-S nodes. Supporting this capability as part of the MVP depends on active customer requirements. | |
|
||
## Solution | ||
|
||
The solution assumes that we will deploy a slimmed-down version of EIM (aka EIM standalone) on a customer's premises (OXM warehouse). The slimmed-down EIM | |
should consist only of the components required to drive OS provisioning. | |
|
||
### Slimmed-down EIM | ||
|
||
|
||
The provisioning of EMT-S at scale will be driven by a local orchestrator instance that will be slimmed down to include only necessary components. In a nutshell, | ||
it will consist of: | ||
|
||
- Foundational Platform Services that are required to deploy and run the orchestrator instance. | ||
|
||
- Limited observability stack that should only provide logs. | ||
- A reduced flavor of EIM - only infra-core and infra-onboarding will be deployed (no infra-managers as they won't be used by EMT-S). | ||
- Limited Web UI with EIM only. | ||
- No cluster and application orchestrator deployed. | ||
- Abandon HA requirements: EIM-S is primarily a single-node, on-prem deployment. The risk of data loss can be mitigated by hardware-level redundancy. Think of EIM-S as Rancher Desktop or VirtualBox. | |
|
||
FPS components that are essential to EIM are: | ||
- IAM, Multi-Tenancy components | ||
|
||
- Keycloak | ||
- RS-proxy | ||
- Vault | ||
- secrets-config | ||
- MetalLB services | ||
|
||
The following FPS/Observability components should be disabled or reduced: | |
- Kyverno | |
- Prometheus and all metrics-related components (including infra-core's exporter, Mimir) | |
- SRE exporter | |
- Alerting monitors | |
- Loki (kept, but scaled down to a minimal deployment) | |
|
||
**The installation of EIM-S should become a one-line command operation.** | ||
|
||
|
||
### PXE-based provisioning workflow | ||
|
||
For the MVP, we will rely heavily on capabilities provided by Tinkerbell SMEE. We will leverage the DHCP and TFTP server implementations from SMEE and make only a few modifications to the | |
EIM stack. | |
|
||
By default, Tinkerbell SMEE relies on MAC addresses to uniquely identify PXE-booting machines and customize the iPXE script per machine. | |
In the case of EIM, we use a static iPXE script that is not customized per Edge Node. Therefore, we can avoid using the MAC address as a unique identifier, | |
which lets us avoid adding the MAC address as another EN identifier during pre-registration. | |
|
||
The flow below can be achieved with minimal modifications to the EIM stack. The following changes must be made: | |
|
||
- Tinkerbell SMEE must be enabled via infra-charts to provide the DHCP and TFTP servers | |
- Tinkerbell SMEE must be configured with the following flags to avoid lookups by MAC address and to use our EIM iPXE script: | ||
|
||
- `-dhcp-mode=auto-proxy` | ||
- `-dhcp-http-ipxe-script-url=<URL to iPXE script on Provisioning NGINX>` (e.g., `https://tinkerbell-nginx.CLUSTER_DOMAIN/tink-stack/boot.ipxe`) | ||
- `-dhcp-http-ipxe-script-prepend-mac=false` | ||
- We may need to slightly adjust the `boot.ipxe` that is produced by DKAM to meet PXE boot requirements | |
- We need to revise how we sign the iPXE script and HookOS image: in the PXE-based provisioning flow we shouldn't upload any certificates/SB keys. | |
- We may want to have a minimal Tinkerbell workflow for EMT-S (e.g., we don't need to install cloud-init, since EMT-S nodes likely won't be connected to the same orchestrator instance) | |
|
||
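As a minimal sketch, the three SMEE flags listed above can be assembled from the cluster domain. The flag names and the URL path are taken from this proposal; the helper name `smee_pxe_args` and the `example.com` domain are hypothetical, and how the arguments are passed to SMEE (for example through infra-charts values) is not shown here.

```python
from typing import List
from urllib.parse import urlunsplit


def smee_pxe_args(cluster_domain: str) -> List[str]:
    """Build the SMEE flags from this proposal for a given cluster domain.

    `smee_pxe_args` is a hypothetical helper used only for illustration.
    """
    if not cluster_domain or "/" in cluster_domain:
        raise ValueError("cluster_domain must be a bare DNS name")

    # Static iPXE script served by Provisioning Nginx (path from the proposal).
    script_url = urlunsplit(
        ("https", f"tinkerbell-nginx.{cluster_domain}", "/tink-stack/boot.ipxe", "", "")
    )
    return [
        "-dhcp-mode=auto-proxy",
        f"-dhcp-http-ipxe-script-url={script_url}",
        # Do not prepend the client MAC to the script URL: the script is static.
        "-dhcp-http-ipxe-script-prepend-mac=false",
    ]


if __name__ == "__main__":
    for arg in smee_pxe_args("example.com"):
        print(arg)
```

Keeping the script URL free of the MAC address is what allows a single `boot.ipxe` to serve every Edge Node.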
```mermaid | ||
sequenceDiagram | ||
%%{wrap}%% | ||
autonumber | ||
box LightYellow Edge Node | ||
participant bios as BIOS (PXE) | ||
participant ipxe as iPXE | ||
end | ||
participant user as User | ||
box rgb(235,255,255) Edge Orchestrator | ||
participant smee as Tinkerbell SMEE | ||
participant pa as Provisioning Nginx | ||
participant inv as Inventory / API | ||
end | ||
|
||
user->>inv: Pre-register ENs with SN/UUID | ||
note over bios,ipxe: PXE boot starts | ||
|
||
bios->>smee: DHCP Discover | ||
smee->>bios: DHCP reply with TFTP endpoint storing base EFI script | ||
|
||
bios->>ipxe: Leaves PXE context, taken over by iPXE | ||
ipxe->>smee: DHCP Discover | ||
smee->>ipxe: DHCP reply with HTTP endpoint where boot.ipxe is stored | ||
|
||
ipxe->>+pa: Download boot.ipxe | ||
pa->>-ipxe: [boot.ipxe] | ||
|
||
note over bios,inv: OS provisioning continues in a standard flow, through HookOS and Tinkerbell workflow | ||
``` | ||
|
||
Notes: | ||
- Similar to the standard flow, if the EN is not pre-registered, the process falls back to Interactive Onboarding and waits for a user to provide credentials. | |
- Similar to the standard flow, if the EN is pre-registered but no OS profile is selected, the default OS will be provisioned. | |
|
||
- In Step 3, SMEE returns a URL to the EFI script that is stored on SMEE's TFTP server. This operation is fully handled by SMEE, which supports various hardware PXE architectures: | |
https://github.com/tinkerbell/smee/blob/main/internal/dhcp/dhcp.go#L44. | |
- In Step 6, SMEE replies with the HTTP(S) URL of the iPXE script stored on Provisioning Nginx. This requires the custom SMEE configuration mentioned above. | |
- The steps after Step 6 follow the existing OS provisioning flow. In summary, the difference between HTTPS-based boot and PXE boot is how the iPXE script is triggered. | |
|
||
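The pre-registration behavior described in the notes above can be read as a three-way decision. The sketch below only illustrates that reading; the `Outcome` enum and the `next_step` function are hypothetical names and do not correspond to actual EIM code.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    """Possible continuations after boot.ipxe has been downloaded."""

    INTERACTIVE_ONBOARDING = "interactive-onboarding"
    PROVISION_DEFAULT_OS = "provision-default-os"
    PROVISION_SELECTED_PROFILE = "provision-selected-profile"


def next_step(pre_registered: bool, os_profile: Optional[str]) -> Outcome:
    """Map the pre-registration state of an Edge Node to the follow-up action."""
    if not pre_registered:
        # Unknown EN: wait for a user to provide credentials.
        return Outcome.INTERACTIVE_ONBOARDING
    if not os_profile:
        # Known EN without an explicit OS profile: use the default OS.
        return Outcome.PROVISION_DEFAULT_OS
    return Outcome.PROVISION_SELECTED_PROFILE


if __name__ == "__main__":
    print(next_step(pre_registered=True, os_profile=None).value)
```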
### Rationale | ||
|
||
The current design proposal makes it easy to support legacy PXE boot while keeping the current UX around EN pre-registration. It also lets us keep all the features that EIM currently supports, | |
including logging, KPI instrumentation, observability, etc. | |
|
||
The alternative considered was to use a standalone Tinkerbell deployment without the rest of the EIM stack. While it has the clear advantage of a more lightweight deployment, | |
it would completely change the UX, as we would need to familiarize customers with Tinkerbell APIs (or provide a custom CLI tool to help them manage Tinkerbell CRDs). | |
With the current proposal we keep the current UX, with the possibility of using the Bulk Import Tool to scale the pre-registration process and selectively choose OS profiles for Edge Nodes. | |
|
||
## Affected components and Teams | ||
|
||
|
||
- **FPS team** should work, with the help of the EIM team, on slimming down the FPS components to meet EIM-S requirements. | |
- **EIM team** is responsible for slimming down the deployment of EIM and enabling Tinkerbell SMEE with the required configuration. | |
|
||
## Open issues | ||
|
||
- By default, the provisioning flow will complete and all the agents will be installed, started and connected to the orchestrator instance that deployed them. | ||
|
||
We may need a way to stop the provisioning flow without installing and starting agents. | ||
- In this solution, PXE boot will always be allowed for any device on the local subnet that tries to initiate PXE boot. | |
|
||
By contrast, Tinkerbell allows controlling whether PXE boot is enabled or disabled per device. For now, we don't see a need to control which devices are allowed to PXE-boot. | |
- If we needed to support ISO images for EMT-D, we would need to create a new OS profile. We would also need to add a new Tinkerbell action that supports flashing ISO images to the disk. |