Draft ADR for modular observability pipeline #1665
# Design Proposal: Silicon Hardware Metrics Collection for Modular Observability

Author(s): Christopher Nolan

Last Updated: 2026-03-30

## Abstract

Within the Edge Manageability Framework (EMF) stack, an Edge Node Observability pipeline collects hardware
telemetry, specifically metrics and logs, from edge nodes. This pipeline consists of an agent on the edge node,
the Platform Observability Agent (POA), which collects, batches, and forwards telemetry from the edge node to the
Orchestrator, and a number of Orchestrator services that process and store the metrics received from the connected
edge node agents. Currently, this workflow retrieves only basic hardware metrics for CPU, memory, disk, etc., and
has not been configured to be deployed separately from other modular workflows or the full EMF deployment stack.
This proposal outlines how this pipeline can be modified to collect additional, silicon-specific hardware metrics
from the GPU, PMU, NPU, etc., as well as how it can be deployed as a modular workflow similar to the
[Out-Of-band Device Management](./vpro-eim-modular-decomposition.md) workflow added in the 2026.0 release.

## Background

As outlined above, the current Edge Node Observability pipeline contains two parts: the POA on the edge node and
the Orchestrator services that handle the processing and storage of the telemetry from the edge node. Both make
use of standard open-source telemetry collectors to retrieve, process, and store the telemetry data.

### Edge Node Agent Services

The POA is made up of four services:

1. **platform-observability-logging**: Collects logs from the system journal for all of the agent services
   installed on the edge node. It runs a [Fluent Bit service](https://docs.fluentbit.io/manual) that uses a
   configuration file provided by the agent at runtime.
2. **platform-observability-health-check**: Performs a periodic health check on the edge node agent services to
   confirm that they are still active on the system. It is also a [Fluent Bit service](https://docs.fluentbit.io/manual)
   that uses a configuration file provided by the agent.
3. **platform-observability-metrics**: A [Telegraf-based service](https://github.com/influxdata/telegraf) that
   runs the set of metrics collectors specified in the configuration file provided by the agent. These collectors
   gather the hardware metrics requested by the agent.
4. **platform-observability-collector**: An [OpenTelemetry Collector service](https://github.com/open-telemetry/opentelemetry-collector)
   that batches all of the metrics and logs received from the other three services and applies the required
   protocol headers, including authentication headers, before forwarding them to the Orchestrator.

For more details on the POA and the individual services it installs on the edge node, please see the
[developer guide](https://docs.openedgeplatform.intel.com/edge-manage-docs/dev/developer_guide/agents/arch/platform_observability.html#)
for the agent.

### Edge Node Observability Pipeline

In the Orchestrator, the Edge Node Observability pipeline also uses open-source components to process and store
metrics from the connected edge nodes. The services it runs include:

1. [OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector), configured to apply
   labels and metadata used for filtering edge node metrics and for supporting multi-tenant environments.
2. [Grafana Loki](https://github.com/grafana/loki) is used for the logs backend and storage.
3. [Grafana Mimir](https://github.com/grafana/mimir) is used for the metrics backend and storage.
4. [Grafana](https://github.com/grafana/grafana) provides a UI for viewing edge node logs and metrics. It is also
   used with the [edgenode-dashboards](https://github.com/open-edge-platform/o11y-charts/tree/main/charts/edgenode-dashboards)
   chart to provide a default set of edge node metrics dashboards configured for use with the edge node POA.

For more details on the Edge Node Observability pipeline and the individual services it runs in the Orchestrator,
please see the
[developer guide](https://docs.openedgeplatform.intel.com/edge-manage-docs/dev/developer_guide/observability/arch/orchestrator/edgenode-observability.html)
for the pipeline.

## Proposal

To support collection of additional hardware metrics from the GPU, PMU, cache, etc., the current POA
implementation will be expanded to include new metrics collectors for these hardware components. In addition, the
Edge Node Observability pipeline deployment in the Orchestrator will be modified so that it can be deployed as a
standalone pipeline without requiring other components from the EMF stack.

### Scope

- This proposal covers both the POA on the edge node and the Edge Node Observability pipeline in the Orchestrator
  that are currently used for metrics collection.
- It covers which metrics will be collected and which collectors will be used to gather them.
- It also covers how the pipeline and its deployment will be updated to work in a modular environment, including
  any changes to be made to the current pipeline in order to support this.
- The modularization changes outlined below are designed to work with all use cases outlined in the
  [Modular Decomposition documentation](./eim-modular-decomposition.md).

### Design

#### Metric Collectors

On the edge node, the POA metrics service currently provides a number of metrics by default, as well as some that
are configured but disabled by default. For this workflow, there are also additional metrics to be added.

##### Configured and Enabled Metrics

- **CPU Utilization and Performance Metrics**: CPU usage metrics can be retrieved using the [Telegraf cpu collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/cpu).
- **Memory Utilization and Performance Metrics**: The [Telegraf mem collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/mem)
  will provide memory utilization metrics for an edge node.
- **Storage Utilization and Performance Metrics**: For storage performance metrics, there are two collectors that
  provide a variety of metrics. The [Telegraf disk collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/disk)
  gathers utilization metrics, while the [diskio collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/diskio)
  reports the read and write counts for the edge node storage devices. The POA also
  provides a [script](https://github.com com/open-edge-platform/edge-node-agents/blob/main/platform-observability-agent/scripts/collect_disk_info.sh)
  that can be run by the [exec plugin](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec)
  in Telegraf.
- **Network Interface Utilization and Performance Metrics**: Telegraf provides the [net collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/net),
  which gives a per-interface view of the network traffic sent and received on the edge node.
- **SRIOV VF Utilization and Performance Metrics**: In Linux, SRIOV VFs created on the system are seen as network
  interfaces alongside any physical interfaces, so they also appear in the output from Telegraf's net collector.

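As a sketch, enabling the default collectors above in a Telegraf configuration could look roughly like the
following. The plugin names are real Telegraf inputs; the interval, plugin options, and script path are
illustrative assumptions, not taken from the actual `poa-telegraf.conf`:

```toml
# Illustrative Telegraf input configuration for the default POA metrics.
# Plugin names are real Telegraf inputs; intervals and paths are examples only.
[agent]
  interval = "30s"            # assumed collection interval

[[inputs.cpu]]
  percpu = true               # per-core CPU usage
  totalcpu = true             # aggregate CPU usage

[[inputs.mem]]                # memory utilization

[[inputs.disk]]               # filesystem utilization

[[inputs.diskio]]             # per-device read/write counters

[[inputs.net]]                # per-interface traffic, including SRIOV VFs

[[inputs.exec]]
  # Hypothetical path to the POA disk-info script run via the exec plugin
  commands = ["/opt/edge-node/collect_disk_info.sh"]
  data_format = "influx"
```
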
##### Configured and Disabled Metrics

- **Logical Volume Manager (LVM) Utilization and Performance Metrics**: For these metrics, the [Telegraf lvm collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/lvm)
  will provide the required metrics.
- **Storage Utilization and Performance Metrics**: Telegraf also provides the [smart collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/smart)
  which, when run on an edge node whose storage devices support it, will provide additional utilization metrics.
- **dGPU Utilization and Performance Metrics**: Currently, the POA metrics service provides a [script](https://github.com/open-edge-platform/edge-node-agents/blob/main/platform-observability-agent/scripts/collect_gpu_metrics.sh)
  that can collect metrics from dGPU devices on an edge node. It requires the
  [XPU System Management Interface](https://github.com/intel/xpumanager) package to be installed on the edge node.
- **Performance Monitoring Unit (PMU) Metrics**: These are metrics specific to Intel CPUs and can be read using the
  [intel_pmu collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/intel_pmu) in Telegraf.
- **BIOS Metrics**: One option for these metrics is to use the [Telegraf redfish collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/redfish)
  to retrieve thermal and power settings.

##### New Metrics to Configure and Enable

> **Reviewer comment:** Each of these will require additional permissions to be enabled on the device side and
> will likely increase resource utilization. Is there any QoS currently enabled to support existing and additional
> metrics collection? e.g., best-effort collection?

- **CPU Utilization and Performance Metrics**: To retrieve CPU frequency and throttling metrics, the
  [Telegraf linux_cpu collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/linux_cpu)
  can be used.
- **iGPU Utilization and Performance Metrics**: To retrieve iGPU metrics on the edge node, the XPU System
  Management Interface package needs to be installed along with the intel-level-zero-gpu package. Using these
  packages with the script currently used for dGPU metrics in the POA metrics service will allow it to also
  retrieve iGPU metrics.
- **Cache Utilization and Performance Metrics**: The primary collector for this will be the [intel_rdt collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/intel_rdt)
  in Telegraf, which uses [Intel Resource Director Technology](https://github.com/intel/intel-cmt-cat) to report
  the utilization of the L3 cache. In addition to this collector, the intel_pmu collector above also provides some
  cache performance metrics, as does the [intel_pmt collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/intel_pmt)
  in Telegraf when used with newer Intel processors.
- **BIOS Metrics**: For other BIOS settings, dmidecode can be run via the Telegraf exec collector to gather them.
- **VPU Utilization and Performance Metrics**: These will require a new collector to retrieve metrics from any
  VPUs on an edge node.

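A hedged sketch of how some of these new collectors might be enabled in the Telegraf configuration. The plugin
names are real Telegraf inputs; the core list and the dmidecode invocation are illustrative assumptions:

```toml
# Illustrative additions for the new collectors; options shown are examples.
[[inputs.linux_cpu]]          # CPU frequency and throttling metrics

[[inputs.intel_rdt]]
  cores = ["0-3"]             # assumed example set of cores to monitor
  # requires the pqos tool from intel-cmt-cat to be installed on the node

[[inputs.exec]]
  # Hypothetical dmidecode invocation for BIOS settings via the exec plugin;
  # a parser matching the command's output format would still be needed.
  commands = ["dmidecode -t bios"]
```
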
To view the current POA metrics service configuration, please see the [configuration file](https://github.com/open-edge-platform/edge-node-agents/blob/main/platform-observability-agent/configs/poa-telegraf.conf)
for the service.

#### Workflow Design

When the full EMF stack is deployed, the pipeline will remain as it currently is. The modular workflow, however,
will not deploy the Grafana UI and dashboards when it is deployed without the full stack. Instead, the
Orchestrator Command Line Interface (CLI) tool will be extended to provide commands that a user can run to query
the Mimir backend for metrics.

The CLI will receive a command containing the metric to be queried, the edge node to be checked, and any time
range required by the user. If a time range is not provided, the CLI should use a default time range, such as the
last 5 minutes. The CLI should also support retrieving both averages and sums for metrics over set time periods.

> **Reviewer comment:** To clarify this, is the query going to the orchestrator or to the edge node? Or if the
> requested data is not found on the orchestrator side, then it will be sent to the edge node?

> **Reviewer comment:** Maybe a concern that is already addressed, but how is clock synchronization ensured across
> the orchestrator and edge devices, so that the requested time range is the same across devices and there is no
> offset?

> **Reviewer comment:** Is the request made for a single node or for all the nodes? How does this scale?

| Within the CLI, it should convert the received query into the PromQL format needed for querying Mimir | ||
| and then send the PromQL query to the Mimir API. When the CLI receives the metrics back from Mimir, | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What happens if, for some reason, metrics collection fails or becomes unavailable during the requested time range? What if the data is only partially available? |
||
| it should convert it into a easy read format before returning it to the user. | ||
|
|
||
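To make the conversion step concrete, here is a minimal sketch of the query-building logic in Python. The function
name, the metric names, and the CLI shape are illustrative assumptions, not the actual Orchestrator CLI
implementation; the `hostID` label matches the label the orchestrator collector applies to incoming metrics:

```python
# Illustrative sketch of converting a CLI metrics request into PromQL.
# Function name, metric names, and defaults are assumptions based on this
# proposal, not the actual Orchestrator CLI implementation.

def build_promql(metric, host_id, agg=None, window="5m"):
    """Build a PromQL query for one edge node's metric.

    agg may be "avg" or "sum" to aggregate over the window; when no time
    range is given, the proposal's default of the last 5 minutes is used.
    """
    # Select the metric series for a single edge node via its hostID label.
    selector = f'{metric}{{hostID="{host_id}"}}'
    if agg == "avg":
        return f"avg_over_time({selector}[{window}])"
    if agg == "sum":
        return f"sum_over_time({selector}[{window}])"
    return selector

# Example: average CPU usage for one node over the default window.
query = build_promql("cpu_usage_active", "edge-node-01", agg="avg")
print(query)  # avg_over_time(cpu_usage_active{hostID="edge-node-01"}[5m])
```

A real implementation would send this query string to Mimir's Prometheus-compatible HTTP API and then format the
JSON response for the terminal.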
```mermaid
sequenceDiagram
autonumber
participant US as User
box rgb(191, 223, 255) Orchestrator Components
participant CLI as Orchestrator CLI
participant STO as Storage
participant MIM as Grafana Mimir
participant COL as OpenTelemetry Collector
end
box rgb(37, 182, 21) Platform Observability Agent
participant POC as Platform Observability Collector Service
participant MET as Platform Observability Metrics Service
end
box rgb(219, 175, 103) Edge Node Hardware
participant CPU as Edge Node CPU
participant GPU as Edge Node GPU
participant MEM as Edge Node Memory
participant PMU as Performance Monitoring Unit (PMU)
participant CHE as Edge Node Cache
participant NET as Edge Node Network Interfaces
end
alt Collect metrics from Edge Node
activate MET
MET ->> CPU: Query CPU metrics
CPU ->> MET: Return CPU metrics
MET ->> GPU: Query GPU metrics
GPU ->> MET: Return GPU metrics
MET ->> MEM: Query Memory metrics
MEM ->> MET: Return Memory metrics
MET ->> PMU: Query PMU metrics
PMU ->> MET: Return PMU metrics
MET ->> CHE: Query Cache metrics
CHE ->> MET: Return Cache metrics
MET ->> NET: Query Network Interface metrics
NET ->> MET: Return Network Interface metrics
deactivate MET
end
alt Send Edge Node metrics to Mimir
MET ->> POC: Send metrics to Collector service
activate POC
POC ->> POC: Batch metrics for forwarding
POC ->> POC: Apply required HTTP request headers
POC ->> COL: Send batched metrics to orchestrator collector service
deactivate POC
activate COL
COL ->> COL: Apply additional hostID label to all metrics
COL ->> COL: Group metrics based on host
COL ->> COL: Batch metrics based on host details
COL ->> MIM: Send metrics to Grafana Mimir
deactivate COL
MIM ->> STO: Send metrics to storage
end
alt Query metrics from Mimir
US ->> CLI: Submit metrics query to CLI
activate CLI
CLI ->> CLI: Convert CLI metrics query into PromQL query format
CLI ->> MIM: Send metrics PromQL query
activate MIM
MIM ->> MIM: Process PromQL query
MIM ->> STO: Retrieve requested metrics from storage
STO ->> MIM: Return requested metrics from storage
MIM ->> CLI: Return requested metrics
deactivate MIM
CLI ->> CLI: Convert metrics response from Mimir into user readable format
CLI ->> US: Return requested metrics in user readable format
deactivate CLI
end
```

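Similarly, a sketch of the final step, converting a Mimir query response into a user-readable form. The response
structure follows the standard Prometheus HTTP API that Mimir exposes; the function name and the sample values are
purely illustrative:

```python
# Sketch of formatting a Prometheus/Mimir query response for terminal output.
# The response structure follows the standard Prometheus HTTP API; the sample
# data below is fabricated for illustration only.

def format_response(response):
    rows = []
    for series in response.get("data", {}).get("result", []):
        labels = series.get("metric", {})
        name = labels.get("__name__", "<aggregated>")
        host = labels.get("hostID", "?")
        # Instant queries return "value"; range queries return "values".
        points = series.get("values") or [series.get("value")]
        for ts, val in points:
            rows.append(f"{name:<24} host={host:<14} value={val}")
    return "\n".join(rows) if rows else "no data for query"

sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "cpu_usage_active", "hostID": "edge-node-01"},
             "value": [1767225600, "12.5"]},
        ],
    },
}
print(format_response(sample))
```
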
## Implementation Plan

- Hardware Metrics Collection.
  - Identify the new hardware metrics collectors to be added to the current edge node metrics service.
  - Extend the current GPU metrics script to also collect iGPU metrics using the Telegraf exec plugin.
  - Develop a new collector to retrieve VPU metrics.
  - Add the additional Telegraf plugins to the metrics service configuration.
  - Test deployment of the updated metrics service on an edge node and check the metrics being retrieved.
  - Update documentation for the edge node observability agent.
- Modular observability workflow.
  - Identify the services needed for a modular observability workflow.
  - Modify the deployment profiles to include an observability-only modular workflow.
  - Test the deployment of the new modular workflow.
  - Test deployment with the updated edge node observability agent.
  - Extend the Orchestrator CLI to retrieve metrics from the observability pipeline.
  - Test the updated CLI with the modular observability pipeline and confirm that new metrics can be retrieved.
  - Provide documentation on how to install the modular observability workflow.
  - Extend the Orchestrator CLI documentation with new commands for metrics querying.

> **Reviewer comment:** What are the performance implications when additional metrics are collected, both on the
> orchestrator side and across the network?

## Opens

> **Reviewer comment:** Has support for multi-vendor environments been considered?

- Grafana dashboards will not be used in the modular flow; instead, metrics will be retrieved using the CLI. Do
  we still require the dashboards to be maintained?
- Investigate the current support in the CLI for retrieving metrics from Mimir.
- Not included in this proposal is the telemetry management pipeline that runs parallel to the observability
  pipeline and can be used to configure what metrics an edge node reports after it has been deployed, without
  requiring a full redeployment or access to the edge node. For modular deployments, should this also be included
  and used for this purpose, or should it be excluded?
- Investigate the [Intel Performance Counter Monitor (PCM)](https://github.com/intel/pcm) tool, as there may be
  overlap between what it reports and what the modular workflow will report.

> **Reviewer comment:** What happens if the hardware metrics collection fails? i.e., there is some hardware
> malfunction, or due to misconfigurations, or there is some disruption to sensors doing readings.

> **Reviewer comment:** Which are these components?