Add metrics integration for Fusion Access SAN #2438

weirdwiz · 2025-12-15T12:23:34Z

Add GPFS Prometheus queries for capacity, IOPS, throughput, and latency metrics
Create GrafanaBridge CR when SAN system is created to enable Prometheus metrics export
Label openshift-user-workload-monitoring namespace with network policy for metrics scraping
Add GrafanaBridgeModel to shared models
Use LocalCluster CR conditions for SAN system status in external-systems list
Query deduplication using max by to handle pod replicas

openshift-ci · 2025-12-15T12:23:40Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: weirdwiz
Once this PR has been reviewed and has the lgtm label, please assign gowthamshanmugam for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

packages/odf/OWNERS
packages/shared/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Madhu-1 · 2025-12-15T12:54:30Z

packages/odf/components/create-storage-system/external-systems/CreateSANSystem/payload.ts

  });
 };
+
+export const createGrafanaBridge = () => {


we should not create this one, SCALE is already exposing a configuration in cluster CR to deploy and configure grafana bridge.

Madhu-1 · 2025-12-15T12:55:31Z

packages/odf/queries/system-list.ts

+export const GPFS_QUERIES: { [key in GPFSQueries]: string } = {
+  [GPFSQueries.RAW_CAPACITY]:
+    'sum by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_diskpool_name) (gpfs_pool_total_dataKB)) * 1024',
+  [GPFSQueries.USED_CAPACITY]:
+    'sum by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_diskpool_name) (gpfs_pool_total_dataKB - gpfs_pool_free_dataKB)) * 1024',
+  [GPFSQueries.IOPS]:
+    'sum by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_fs_name) (gpfs_fs_read_ops + gpfs_fs_write_ops))',
+  [GPFSQueries.THROUGHPUT]:
+    'sum by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_fs_name) (gpfs_fs_bytes_read + gpfs_fs_bytes_written))',
+  [GPFSQueries.LATENCY]:
+    'avg by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_fs_name) (gpfs_fs_tot_disk_wait_rd + gpfs_fs_tot_disk_wait_wr))',


we need to check with scale team about this values are the right one that we are looking for

- Add GPFS Prometheus queries for capacity, IOPS, throughput, and latency - Create GrafanaBridge CR when SAN system is created to enable metrics - Label openshift-user-workload-monitoring namespace for network policy - Add GrafanaBridgeModel to shared models - Use LocalCluster CR conditions for SAN system status in external-systems list - Query deduplication using max by clause to handle HA pod replicas Signed-off-by: Divyansh Kamboj <[email protected]>

SanjalKatiyar

I know it's unrelated to your PR, but I see lot of unnecessary API calls being made in external-systems-list.tsx.

I will leave this upto you, ur call, if u think it make sense, plz remove all these calls in this PR itself, else we will send a follow-up later, that's fine too.

SanjalKatiyar · 2025-12-18T09:23:12Z

...es/odf/components/create-storage-system/external-systems/CreateSANSystem/CreateSANSystem.tsx

      if (!isLocalClusterConfigured) {
        await labelNodes(componentState.selectedNodes)();
        await createScaleLocalClusterPayload(false)();
+        await labelUserWorkloadMonitoringNamespace();


what if users deploy Scale first and then want SAN ?? will this labelling be a manual action in that case ??

should we also label while Scale creation: https://github.com/red-hat-storage/odf-console/blob/master/packages/odf/components/create-storage-system/external-systems/CreateScaleSystem/CreateScaleSystem.tsx#L155-L161 ??

We can label for Scale creation when we get the confirmation that metrics is being passed from the remote Scale system. For now let's label it regardless of local cluster is already configured or not.

SanjalKatiyar · 2025-12-18T09:28:28Z

packages/odf/components/create-storage-system/external-systems/common/payload.ts

+export const labelUserWorkloadMonitoringNamespace = () => {
+  const patch: Patch[] = [
+    {
+      op: 'add',
+      path: '/metadata/labels/scale.spectrum.ibm.com~1networkpolicy',
+      value: 'allow',
+    },
+  ];
+  return k8sPatch({
+    model: NamespaceModel,
+    resource: {
+      metadata: {
+        name: OPENSHIFT_USER_WORKLOAD_MONITORING_NAMESPACE,
+      },
+    },
+    data: patch,
+  });
+};


Just asking for my ref:

Who creates the NetworkPolicy ?? In which namespace ??

Is there any use case where we need to support FDF on ROSA ?? If so, then this patch might not work. Else, it should be fine.

we add the scale.spectrum.ibm.com/networkpolicy=allow label to the openshift user workload namespace
for grafana bridge to be able to communicate with the user workload monitorign instance of prometheus

regarding ROSA, i'm not entirely sure @Madhu-1 might be able to answer better on that, if that's a possible scenario that we need to think about at all or not

SanjalKatiyar · 2025-12-18T09:31:49Z