1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -31,6 +31,7 @@
/pkg/pillar/cmd/downloader/ @milan-zededa @rouming
/pkg/pillar/cmd/ledmanager/ @rucoder @rene
/pkg/pillar/cmd/nim/ @milan-zededa
/pkg/pillar/cmd/scepclient/ @milan-zededa
/pkg/pillar/cmd/tpmmgr/ @rucoder @shjala
/pkg/pillar/cmd/volumemgr/ @OhmSpectator @rouming @europaul
/pkg/pillar/cmd/zedagent/ @OhmSpectator @milan-zededa @rouming @uncleDecart
4 changes: 4 additions & 0 deletions docs/CONFIG-PROPERTIES.md
@@ -91,6 +91,9 @@
| diag.probe.remote.http.endpoint | string | `"http://www.google.com"` | - | - | Remote endpoint (URL, IP instead of hostname is accepted) queried over HTTP to assess the state of network connectivity whenever the controller is not reachable. Used only for diagnostics (no functional impact). Set to an empty string to disable. |
| diag.probe.remote.https.endpoint | string | `"https://www.google.com"` | - | - | Remote endpoint (URL, IP instead of hostname is NOT accepted) queried over HTTPS to assess the state of network connectivity whenever the controller is not reachable. Used only for diagnostics (no functional impact). Set to an empty string to disable. |
| app.enable.tcp.mss.clamping | bool | true | - | - | Configuration property that enables EVE to automatically adjust (clamp) the TCP MSS on forwarded application traffic to match the path MTU, preventing fragmentation and connectivity issues on lower-MTU links. |
| scep.retry.interval | timer in seconds | 300 (5 minutes) | 60 (1 minute) | 3600 (1 hour) | Interval between retry attempts for certificates that previously failed to enroll/renew or returned PENDING from the SCEP server. |
Contributor

When you get PENDING will you wait for 5 minutes by default before checking the status? That seems like a long time.

Contributor Author

PENDING is returned when certificate enrollment requires manual approval from an administrator. Given this, it’s reasonable to expect the process to take at least a few minutes.

Contributor Author

Note: the SCEP RFC does not define a recommended polling interval, thus I made it configurable.

| pnac.dhcp.reacquire.max.retries | integer | 4 | 0 | 8 | Maximum number of DHCP reacquire retries after a PNAC (802.1X) port authentication state change. When the network switch reassigns the port to a different access VLAN, EVE retries with exponential backoff (2s, 4s, 8s, ...) until the IP subnet changes or the retry limit is reached. Setting this value to 0 disables DHCP reacquire. |
| dhcp.enable.vendorclassid | bool | true | - | - | Enables sending the DHCP Vendor Class Identifier (Option 60) to identify the device as EVE OS. This allows networks or DHCP servers to apply policies such as VLAN assignment or granting access to the EVE controller. Some badly configured DHCP servers may reject unknown vendor class IDs. Setting this to false disables sending the vendor class ID. |
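
The `scep.retry.interval` row above lists a default of 300 seconds and a permitted range of 60 to 3600 seconds. A minimal sketch of how such a timer property could be resolved follows. The `timerSpec` type and `resolve` method are hypothetical, and rejecting out-of-range values (rather than clamping them) is an assumption, since the table does not say how EVE handles them.

```go
package main

import (
	"fmt"
	"time"
)

// timerSpec mirrors one row of the properties table. The type and its
// field names are hypothetical, not EVE's actual representation.
type timerSpec struct {
	name string
	def  uint32 // default, in seconds
	min  uint32 // minimum, in seconds
	max  uint32 // maximum, in seconds
}

// Values taken from the scep.retry.interval row.
var scepRetryInterval = timerSpec{name: "scep.retry.interval", def: 300, min: 60, max: 3600}

// resolve returns the interval to use for a configured value.
// Zero is treated as "not set" and yields the default. Rejecting values
// outside [min, max] is an assumption made for this sketch.
func (s timerSpec) resolve(configured uint32) (time.Duration, error) {
	if configured == 0 {
		return time.Duration(s.def) * time.Second, nil
	}
	if configured < s.min || configured > s.max {
		return 0, fmt.Errorf("%s: %d is outside [%d, %d]",
			s.name, configured, s.min, s.max)
	}
	return time.Duration(configured) * time.Second, nil
}

func main() {
	for _, v := range []uint32{0, 120, 30} {
		d, err := scepRetryInterval.resolve(v)
		fmt.Println(v, d, err)
	}
}
```

Keeping the bounds next to the default makes it easy to check a controller-supplied value in one place before it reaches the retry loop.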

## Log levels

@@ -156,3 +159,4 @@ Right now the following agents support per-agent log level settings:
* msrv
* domainmgr
* diag
* scepclient
96 changes: 96 additions & 0 deletions docs/DEVICE-CONNECTIVITY.md
@@ -604,6 +604,102 @@ There are two levels of errors:
- A particular management port could not be used to reach the controller. In that case
the `ErrorInfo` for the particular `DevicePort` is set to indicate the error and timestamp.

## Port-Based Network Access Control (802.1X) and SCEP Certificate Enrollment

EVE supports IEEE 802.1X Port-Based Network Access Control (PNAC), allowing network switches
to restrict port-level access until the device authenticates with a valid certificate.
IEEE 802.1X is a standard for port-based network access control that works at Layer 2
of the network stack. A switch port starts in an unauthorized state and only grants full
network access after the connected device (the supplicant) successfully authenticates
against an authentication server (typically a RADIUS server) via the switch (the authenticator).

To obtain the certificate required for authentication, EVE implements SCEP (Simple Certificate
Enrollment Protocol), a protocol designed for automated certificate enrollment from
a Certificate Authority (CA). SCEP allows a device to generate a key pair, submit
a Certificate Signing Request (CSR) to a SCEP server, and receive a signed certificate
in return.

The 802.1X supplicant is implemented using [wpa_supplicant](https://w1.fi/wpa_supplicant/)
with EAP-TLS as the authentication method. The SCEP client is implemented using
the [github.com/smallstep/scep](https://github.com/smallstep/scep) Go library.
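
For orientation, a wired EAP-TLS setup in wpa_supplicant is usually expressed as a single `network` block with `key_mgmt=IEEE8021X` and `eap=TLS`. The sketch below renders such a block from Go. The `eapTLSParams` type, the `renderWiredEAPTLS` function, and every file path are hypothetical, and the file that EVE actually generates may contain different or additional options.

```go
package main

import (
	"fmt"
	"strings"
)

// eapTLSParams holds the values substituted into the configuration.
// The type and all paths used in main are hypothetical.
type eapTLSParams struct {
	Identity   string
	CACert     string
	ClientCert string
	PrivateKey string
}

// renderWiredEAPTLS returns a wpa_supplicant configuration for wired
// 802.1X with EAP-TLS. %q is used for quoting, which is only adequate
// for plain ASCII values without embedded quotes (assumption).
func renderWiredEAPTLS(p eapTLSParams) string {
	lines := []string{
		"ap_scan=0", // wired ports have nothing to scan for
		"network={",
		"\tkey_mgmt=IEEE8021X",
		"\teap=TLS",
		fmt.Sprintf("\tidentity=%q", p.Identity),
		fmt.Sprintf("\tca_cert=%q", p.CACert),
		fmt.Sprintf("\tclient_cert=%q", p.ClientCert),
		fmt.Sprintf("\tprivate_key=%q", p.PrivateKey),
		"\teapol_flags=0", // no dynamic WEP keys on wired links
		"}",
	}
	return strings.Join(lines, "\n") + "\n"
}

func main() {
	fmt.Print(renderWiredEAPTLS(eapTLSParams{
		Identity:   "edge-node-1",
		CACert:     "/tmp/ca.pem",
		ClientCert: "/tmp/client.pem",
		PrivateKey: "/tmp/client.key",
	}))
}
```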

### Bootstrapping workflow

The full workflow from an unauthenticated device to an authenticated network port is:
Member

Will it be default workflow for every port, or just for the ones we marked as PNAC-required?

Contributor Author

Just for those with PNAC enabled.

Member

Do we notify if "non-PNAC enabled" port has no connectivity because it's most likely in a PNAC-enabled network?

Contributor Author

Detection of 802.1X is not included in this PR and may be added as a future enhancement, as it was not required for the current scope.
It should be feasible to implement (e.g., by running wpa_supplicant temporarily and inspecting EAP RX metrics), but I chose not to introduce additional complexity in this already sizable PR.

Member

Makes sense, was just poking around use case, thanks


1. **DHCP with vendor class identification**: The device sends a DHCP request that includes
a Vendor Class Identifier (DHCP Option 60) set to `LFEDGE-EVE`. This identifies the device
as running EVE OS to the network infrastructure.

2. **Non-authenticated VLAN access**: The network switch places the port into a non-authenticated
(bootstrap) VLAN. Because the switch detects the EVE OS vendor class identifier, it allows
the device to reach the controller and fetch the network configuration including the SCEP
enrollment profile. This step is critical for bootstrapping — the device needs connectivity
to obtain the certificate it will later use for authentication.

3. **SCEP certificate enrollment**: The device follows the SCEP profile received from
the controller to enroll a certificate. It can communicate with the SCEP server in one
of two ways:
- **Directly**: The device contacts the SCEP server URL specified in the profile.
- **Via controller proxy**: The device routes SCEP requests through a controller-provided
SCEP proxy (essentially an HTTP proxy), which is useful when the SCEP server is not
directly reachable from the bootstrap VLAN.

4. **802.1X port authentication**: Once the certificate is enrolled, the device uses it
to authenticate the port via 802.1X EAP-TLS. Upon successful authentication, the switch
moves the port to the authenticated VLAN, granting full network access.
Comment on lines +648 to +650
Contributor

Does this mean that the port will be part of the unauthenticated VLAN while 802.1X is doing multiple round trips to authenticate?

Is DHCP triggered pnac.dhcp.reacquire.delay after the 802.1X exchange succeeds?

If the port is connected to no VLAN while 802.1X is in progress, then there will be fewer concerns about needing to delay the DHCP request.

Contributor Author

First we need an IP address from the unauthenticated VLAN to access the cloud, get the config, and enroll a certificate from a SCEP server.
Then we start wpa_supplicant configured with the enrolled certificate, and once the authentication succeeds, we wait for pnac.dhcp.reacquire.delay, then obtain a new DHCP lease.

Contributor Author

I rewrote this and replaced the delay interval, which was quite fragile, with pnac.dhcp.reacquire.max.retries: After the port authentication state changes, the device retries the DHCP request with exponential backoff (2s, 4s, 8s, ...) to obtain an IP address from the authenticated VLAN. Retries continue until the IP subnet changes (indicating the VLAN transition completed) or the configured maximum number of retries is reached.


5. **DHCP reacquisition**: After the port authentication state changes, the device retries
the DHCP request with exponential backoff (2s, 4s, 8s, ...) to obtain an IP address from
the authenticated VLAN. Retries continue until the IP subnet changes (indicating the VLAN
transition completed) or the configured maximum number of retries
([`pnac.dhcp.reacquire.max.retries`](CONFIG-PROPERTIES.md), default 4) is reached.
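
Step 5 can be sketched as a bounded retry loop. This is an illustration under stated assumptions, not EVE's implementation: `renew` is a hypothetical callback that performs one DHCP request and returns the resulting subnet, subnets are compared as plain strings, and waiting before each attempt (rather than after it) is a guess about the ordering.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelays returns the waits used for the retries: 2s, 4s, 8s, ...
// One entry per retry; zero or negative maxRetries yields no retries.
func backoffDelays(maxRetries int) []time.Duration {
	var delays []time.Duration
	d := 2 * time.Second
	for i := 0; i < maxRetries; i++ {
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

// noSleep lets callers skip real waiting, for example in tests.
var noSleep = func(time.Duration) {}

// reacquire retries until the subnet differs from oldSubnet or the retry
// limit is reached. It reports the final subnet and whether it changed.
func reacquire(oldSubnet string, maxRetries int,
	renew func() string, sleep func(time.Duration)) (string, bool) {
	for _, d := range backoffDelays(maxRetries) {
		sleep(d)
		if subnet := renew(); subnet != "" && subnet != oldSubnet {
			return subnet, true // VLAN transition completed
		}
	}
	return oldSubnet, false
}

func main() {
	// Simulated DHCP answers: the authenticated VLAN shows up on the third try.
	answers := []string{"10.0.0.0/24", "10.0.0.0/24", "192.168.5.0/24"}
	i := 0
	renew := func() string { s := answers[i]; i++; return s }
	sleep := func(d time.Duration) { fmt.Println("waiting", d) }

	subnet, changed := reacquire("10.0.0.0/24", 4, renew, sleep)
	fmt.Println(subnet, changed)
}
```

With the default of 4 retries the waits add up to 2 + 4 + 8 + 16 = 30 seconds in the worst case, and a limit of 0 skips the loop entirely, which matches the documented way to disable reacquisition.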

### Configuration

PNAC and SCEP are configured through the controller using the device API:

- **SCEP profiles** are defined in `EdgeDevConfig.ScepProfiles` and specify the SCEP server URL,
whether to use the controller proxy, a challenge password (encrypted), trusted CA certificates,
and CSR parameters (subject DN, SANs, key type, hash algorithm, renewal period).

- **PNAC configurations** are defined in `EdgeDevConfig.Pnacs`, each referencing a network adapter
and a SCEP profile by logical name. They specify the EAP method (currently EAP-TLS) and
an optional EAP identity. If no EAP identity is configured, EVE derives the identity from
the enrolled certificate, preferring the subject common name (CN) and falling back to the SAN URI if the CN is absent.
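
The identity fallback described above can be sketched with the standard `crypto/x509` types. The function names are hypothetical, and using the first SAN URI when a certificate carries several is an assumption, since the text does not say which one is chosen.

```go
package main

import (
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"net/url"
)

// eapIdentity picks the EAP identity: an explicitly configured value wins,
// then the certificate's subject CN, then a SAN URI.
func eapIdentity(configured string, cert *x509.Certificate) string {
	if configured != "" {
		return configured
	}
	if cert == nil {
		return ""
	}
	if cn := cert.Subject.CommonName; cn != "" {
		return cn
	}
	if len(cert.URIs) > 0 {
		return cert.URIs[0].String() // first SAN URI (assumption)
	}
	return ""
}

// newCert builds an in-memory certificate for the example only.
func newCert(cn, sanURI string) *x509.Certificate {
	cert := &x509.Certificate{Subject: pkix.Name{CommonName: cn}}
	if sanURI != "" {
		if u, err := url.Parse(sanURI); err == nil {
			cert.URIs = append(cert.URIs, u)
		}
	}
	return cert
}

func main() {
	fmt.Println(eapIdentity("", newCert("edge-node-1", "urn:device:1234")))
	fmt.Println(eapIdentity("", newCert("", "urn:device:1234")))
	fmt.Println(eapIdentity("configured-id", newCert("edge-node-1", "")))
}
```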

Relevant [configuration properties](CONFIG-PROPERTIES.md):

| Property | Default | Description |
|---|---|---|
| `scep.retry.interval` | 300s (5 min) | Interval between retry attempts for failed or pending SCEP enrollments |
| `pnac.dhcp.reacquire.max.retries` | 4 | Max DHCP reacquire retries (with exponential backoff) after 802.1X authentication state change. Set to 0 to disable |
| `dhcp.enable.vendorclassid` | true | Enables sending DHCP Vendor Class Identifier (Option 60) as `LFEDGE-EVE` |

### Certificate lifecycle

The enrolled certificate is stored on the device along with its private key (kept in the vault
Contributor

Does "the vault" mean /persist/vault?

Can the private key be needed immediately after a reboot to (re)authenticate over 802.1X?

Contributor Author (@milan-zededa, Mar 21, 2026)

> Does "the vault" mean /persist/vault?

Yes

Port authentication is needed for application connectivity.
Controller is accessible even through the non-authenticated management ports.
So we do not need to perform 802.1X until we are about to start applications. Therefore we do not need to access the private key before the vault is unlocked.

for protection). EVE monitors the certificate's validity and automatically initiates renewal
when the configured percentage of the certificate's lifetime has elapsed
(controlled by `RenewPeriodPercent` in the CSR profile). If the SCEP server or CSR profile
configuration changes, EVE will re-enroll the certificate against the new parameters.
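
As a worked example of the renewal threshold: with `RenewPeriodPercent` set to 80, a certificate valid for 100 days becomes due for renewal 80 days after its start of validity. The sketch below computes that moment. The function names are hypothetical, and measuring the lifetime from `NotBefore` and capping the percentage at 100 are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// renewAt returns the moment at which renewal becomes due, given the
// certificate validity window and the configured percentage.
func renewAt(notBefore, notAfter time.Time, renewPeriodPercent uint32) time.Time {
	if renewPeriodPercent > 100 {
		renewPeriodPercent = 100 // cap is an assumption of this sketch
	}
	lifetime := notAfter.Sub(notBefore)
	// Divide first so that long lifetimes cannot overflow int64 nanoseconds.
	elapsed := lifetime / 100 * time.Duration(renewPeriodPercent)
	return notBefore.Add(elapsed)
}

// needsRenewal reports whether the threshold has been reached at time now.
func needsRenewal(now, notBefore, notAfter time.Time, renewPeriodPercent uint32) bool {
	return !now.Before(renewAt(notBefore, notAfter, renewPeriodPercent))
}

// day is an example helper: midnight UTC, n days after an arbitrary date.
func day(n int) time.Time {
	return time.Date(2026, time.January, 1, 0, 0, 0, 0, time.UTC).AddDate(0, 0, n)
}

func main() {
	notBefore, notAfter := day(0), day(100)
	fmt.Println("renew at:", renewAt(notBefore, notAfter, 80))
	fmt.Println("due on day 79:", needsRenewal(day(79), notBefore, notAfter, 80))
	fmt.Println("due on day 80:", needsRenewal(day(80), notBefore, notAfter, 80))
}
```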

### Status and metrics reporting

EVE publishes the following information to the controller:

- **PNAC status** (per-port): Whether 802.1X is enabled, the current supplicant state
(e.g. connecting, authenticating, authenticated, failed), the timestamp of the last
successful authentication, and any authentication errors.

- **Enrolled certificate status**: Details of the installed certificate including subject,
issuer, SANs, validity period, SHA-256 fingerprint, key type, and current certificate status
(e.g. valid, expired, pending enrollment).

- **PNAC metrics** (per-port): EAPOL frame counters including frames received/transmitted,
EAPOL-Start and EAPOL-Logoff frames, EAP-Request/Response frames, and counts of invalid
or malformed frames.

## Air-Gap Mode

Air-Gap mode allows a device to operate without connectivity to the main controller,
2 changes: 1 addition & 1 deletion pkg/edgeview/src/network.go
@@ -1415,7 +1415,7 @@ func runWireless() {
_, _ = runCmd(prog, args, true)

retbytes, err = os.ReadFile(
- fmt.Sprintf("/run/nim/wpa_supplicant.%s.conf", port.IfName))
+ fmt.Sprintf("/run/nim/wpa_supplicant-%s.conf", port.IfName))
if err != nil {
continue
}
1 change: 1 addition & 0 deletions pkg/pillar/cipher/cipher.go
@@ -31,6 +31,7 @@ func getEncryptionBlock(
decBlock.ProtectedUserData = zconfigDecBlockPtr.ProtectedUserData
decBlock.ClusterToken = zconfigDecBlockPtr.ClusterToken
decBlock.GzipRegistrationManifestYaml = zconfigDecBlockPtr.GzipRegistrationManifestYaml
decBlock.SCEPChallengePassword = zconfigDecBlockPtr.ScepChallengePassword
return decBlock
}

125 changes: 119 additions & 6 deletions pkg/pillar/cmd/nim/nim.go
@@ -13,6 +13,7 @@ import (
"strings"
"time"

eveinfo "github.com/lf-edge/eve-api/go/info"
"github.com/lf-edge/eve/pkg/pillar/agentbase"
"github.com/lf-edge/eve/pkg/pillar/agentlog"
"github.com/lf-edge/eve/pkg/pillar/base"
@@ -80,6 +81,8 @@ type nim struct {
subNetworkInstanceConfig pubsub.Subscription
subEdgeNodeClusterStatus pubsub.Subscription
subKubeUserServices pubsub.Subscription
subVaultStatus pubsub.Subscription
subEnrolledCertStatus pubsub.Subscription

// Publications
pubDummyDevicePortConfig pubsub.Publication // For logging
@@ -91,10 +94,13 @@ type nim struct {
pubCipherMetrics pubsub.Publication
pubCachedResolvedIPs pubsub.Publication
pubWwanConfig pubsub.Publication
pubPNACMetrics pubsub.Publication

// Metrics
- agentMetrics *controllerconn.AgentMetrics
- cipherMetrics *cipher.AgentMetrics
+ agentMetrics *controllerconn.AgentMetrics
+ cipherMetrics *cipher.AgentMetrics
+ metricInterval uint32 // In seconds
Member

Technically it's a limitation, but I don't think anyone would set up metrics collection less frequently than once every 136 years :)

Contributor Author

Max allowed value for timer.metric.interval is 3600 (1 hour)

publishTicker *flextimer.FlexTickerHandle

// Configuration
globalConfig types.ConfigItemValueMap
@@ -219,11 +225,12 @@ func (n *nim) run(ctx context.Context) (err error) {
stillRunning := time.NewTicker(stillRunTime)
n.PubSub.StillRunning(agentName, warningTime, errorTime)

- // Publish metrics for zedagent every 10 seconds
- interval := 10 * time.Second
+ // Publish network metrics
+ interval := time.Duration(n.metricInterval) * time.Second
max := float64(interval)
min := max * 0.3
- publishTimer := flextimer.NewRangeTicker(time.Duration(min), time.Duration(max))
+ publishTicker := flextimer.NewRangeTicker(time.Duration(min), time.Duration(max))
+ n.publishTicker = &publishTicker

// Periodically resolve the controller hostname to keep its DNS entry cached,
// reducing the need for DNS lookups on every controller API request.
@@ -243,6 +250,8 @@ func (n *nim) run(ctx context.Context) (err error) {
n.subWwanStatus,
n.subNetworkInstanceConfig,
n.subKubeUserServices,
n.subVaultStatus,
n.subEnrolledCertStatus,
}
for _, sub := range inactiveSubs {
if err = sub.Activate(); err != nil {
@@ -292,8 +301,16 @@ func (n *nim) run(ctx context.Context) (err error) {
case change := <-n.subKubeUserServices.MsgChan():
n.subKubeUserServices.ProcessChange(change)

- case <-publishTimer.C:
+ case change := <-n.subVaultStatus.MsgChan():
+ n.subVaultStatus.ProcessChange(change)

+ case change := <-n.subEnrolledCertStatus.MsgChan():
+ n.subEnrolledCertStatus.ProcessChange(change)
+ n.handleEnrolledCertUpdate()

+ case <-publishTicker.C:
start := time.Now()
n.publishPNACMetrics()
err = n.cipherMetrics.Publish(n.Log, n.pubCipherMetrics, "global")
if err != nil {
n.Log.Error(err)
@@ -408,6 +425,14 @@ func (n *nim) initPublications() (err error) {
if err != nil {
return err
}

n.pubPNACMetrics, err = n.PubSub.NewPublication(pubsub.PublicationOptions{
AgentName: agentName,
TopicType: types.PNACMetricsList{},
})
if err != nil {
return err
}
return nil
}

@@ -613,6 +638,27 @@ func (n *nim) initSubscriptions() (err error) {
if err != nil {
return err
}

n.subVaultStatus, err = n.PubSub.NewSubscription(pubsub.SubscriptionOptions{
AgentName: "vaultmgr",
MyAgentName: agentName,
TopicImpl: types.VaultStatus{},
Activate: false,
CreateHandler: n.handleVaultStatusCreate,
ModifyHandler: n.handleVaultStatusModify,
WarningTime: warningTime,
ErrorTime: errorTime,
})
// Check the error here; otherwise the next assignment overwrites it.
if err != nil {
return err
}

n.subEnrolledCertStatus, err = n.PubSub.NewSubscription(pubsub.SubscriptionOptions{
AgentName: "scepclient",
MyAgentName: agentName,
TopicImpl: types.EnrolledCertificateStatus{},
Activate: false,
Persistent: true,
WarningTime: warningTime,
ErrorTime: errorTime,
})
// Without this check the function would return nil even on failure.
if err != nil {
return err
}
return nil
}

@@ -661,6 +707,17 @@ func (n *nim) applyGlobalConfig(gcp *types.ConfigItemValueMap) {
timeout := gcp.GlobalValueInt(types.NetworkTestTimeout)
n.connTester.TestTimeout = time.Second * time.Duration(timeout)
n.connTester.DiagRemoteEndpoints = types.GetDiagRemoteEndpointURLs(n.Log, gcp)
metricInterval := gcp.GlobalValueInt(types.MetricInterval)
if metricInterval != 0 && n.metricInterval != metricInterval {
if n.publishTicker != nil {
interval := time.Duration(metricInterval) * time.Second
maxTime := float64(interval)
minTime := maxTime * 0.3
n.publishTicker.UpdateRangeTicker(
time.Duration(minTime), time.Duration(maxTime))
}
n.metricInterval = metricInterval
}
n.gcInitialized = true
}
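
Both `run` and `applyGlobalConfig` derive the ticker range the same way: the upper bound is the configured metric interval and the lower bound is 30% of it. A small helper along these lines could hold that calculation in one place. This is a sketch only: `tickerRange` is a hypothetical name, and the PR itself keeps the computation inline.

```go
package main

import (
	"fmt"
	"time"
)

// tickerRange returns the lower and upper bound of the randomized publish
// interval for a metric interval given in seconds. The 0.3 factor is the
// one used in the diff above.
func tickerRange(intervalSec uint32) (lo, hi time.Duration) {
	hi = time.Duration(intervalSec) * time.Second
	lo = time.Duration(float64(hi) * 0.3)
	return lo, hi
}

func main() {
	for _, sec := range []uint32{10, 60, 3600} {
		lo, hi := tickerRange(sec)
		fmt.Printf("interval=%ds range=[%v, %v]\n", sec, lo, hi)
	}
}
```

The lower bound goes through a floating-point multiplication, so callers should not rely on it being an exact multiple of a second.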

@@ -850,6 +907,31 @@ func (n *nim) handleKubeUserServicesDelete(_ interface{}, _ string, _ interface{
n.dpcManager.UpdateKubeUserServices(types.KubeUserServices{})
}

func (n *nim) handleVaultStatusCreate(_ interface{}, key string, statusArg interface{}) {
n.handleVaultStatusImpl(key, statusArg)
}

func (n *nim) handleVaultStatusModify(_ interface{}, key string, statusArg, _ interface{}) {
n.handleVaultStatusImpl(key, statusArg)
}

func (n *nim) handleVaultStatusImpl(_ string, statusArg interface{}) {
status := statusArg.(types.VaultStatus)
vaultIsReady := status.Name == types.DefaultVaultName &&
status.ConversionComplete &&
status.Status != eveinfo.DataSecAtRestStatus_DATASEC_AT_REST_ERROR
n.dpcManager.UpdateVaultReadiness(vaultIsReady)
}

func (n *nim) handleEnrolledCertUpdate() {
var enrolledCerts []types.EnrolledCertificateStatus
for _, item := range n.subEnrolledCertStatus.GetAll() {
certStatus := item.(types.EnrolledCertificateStatus)
enrolledCerts = append(enrolledCerts, certStatus)
}
n.dpcManager.UpdateEnrolledCerts(enrolledCerts)
}

func (n *nim) listPublishedDPCs(directory string) (dpcFilePaths []string) {
locations, err := os.ReadDir(directory)
if err != nil {
@@ -962,3 +1044,34 @@ func (n *nim) ingestDevicePortConfigFile(oldDirname string, newDirname string, n
filename, err)
}
}

func (n *nim) publishPNACMetrics() {
var pnacMetricsList types.PNACMetricsList
dnsObj, err := n.pubDeviceNetworkStatus.Get("global")
if err != nil {
return
}
dns, ok := dnsObj.(types.DeviceNetworkStatus)
if !ok {
return
}
for _, port := range dns.Ports {
if port.IfName == "" || !port.PNAC.Enabled {
continue
}
ifIndex, exists, err := n.networkMonitor.GetInterfaceIndex(port.IfName)
if !exists || err != nil {
continue
}
metrics, err := n.networkMonitor.GetPNACMetrics(ifIndex)
if err != nil {
n.Log.Error(err)
} else {
pnacMetricsList.Ports = append(pnacMetricsList.Ports, metrics)
}
}
err = n.pubPNACMetrics.Publish(pnacMetricsList.Key(), pnacMetricsList)
if err != nil {
n.Log.Error(err)
}
}