Conversation

@weirdwiz (Contributor) commented Dec 15, 2025

  • Add GPFS Prometheus queries for capacity, IOPS, throughput, and latency metrics
  • Create GrafanaBridge CR when SAN system is created to enable Prometheus metrics export
  • Label openshift-user-workload-monitoring namespace with network policy for metrics scraping
  • Add GrafanaBridgeModel to shared models
  • Use LocalCluster CR conditions for SAN system status in external-systems list
  • Query deduplication using max by to handle pod replicas (see the sketch below the screenshot)
[Screenshot attached: Screenshot 2025-12-15 at 5 47 20 PM]
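A short note on the max by deduplication called out in the last bullet (an illustrative sketch, not code from this PR): when the metrics exporter runs as multiple pod replicas, each replica reports the same per-pool and per-filesystem series, so summing the raw series would count them more than once. Taking max by the identifying labels first collapses those duplicates, and the outer sum then aggregates per cluster.

// Illustrative only: the inner max by (...) collapses identical series reported
// by duplicate exporter replicas; the outer sum by (...) aggregates per cluster.
const EXAMPLE_DEDUPED_IOPS_QUERY =
  'sum by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_fs_name) (gpfs_fs_read_ops + gpfs_fs_write_ops))';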

openshift-ci bot commented Dec 15, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: weirdwiz
Once this PR has been reviewed and has the lgtm label, please assign gowthamshanmugam for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

});
};

export const createGrafanaBridge = () => {

Member:

We should not create this one; Scale already exposes a configuration in the Cluster CR to deploy and configure the Grafana bridge.

Comment on lines +28 to +38
export const GPFS_QUERIES: { [key in GPFSQueries]: string } = {
  [GPFSQueries.RAW_CAPACITY]:
    'sum by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_diskpool_name) (gpfs_pool_total_dataKB)) * 1024',
  [GPFSQueries.USED_CAPACITY]:
    'sum by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_diskpool_name) (gpfs_pool_total_dataKB - gpfs_pool_free_dataKB)) * 1024',
  [GPFSQueries.IOPS]:
    'sum by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_fs_name) (gpfs_fs_read_ops + gpfs_fs_write_ops))',
  [GPFSQueries.THROUGHPUT]:
    'sum by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_fs_name) (gpfs_fs_bytes_read + gpfs_fs_bytes_written))',
  [GPFSQueries.LATENCY]:
    'avg by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_fs_name) (gpfs_fs_tot_disk_wait_rd + gpfs_fs_tot_disk_wait_wr))',

Member:

We need to check with the Scale team whether these values are the right ones that we are looking for.

- Add GPFS Prometheus queries for capacity, IOPS, throughput, and latency
- Create GrafanaBridge CR when SAN system is created to enable metrics
- Label openshift-user-workload-monitoring namespace for network policy
- Add GrafanaBridgeModel to shared models
- Use LocalCluster CR conditions for SAN system status in external-systems list
- Query deduplication using max by clause to handle HA pod replicas

Signed-off-by: Divyansh Kamboj <[email protected]>
@weirdwiz force-pushed the RHSTOR-7987-san-alerts-metrics branch from eba585f to 5b20718 on December 15, 2025 at 13:39

@SanjalKatiyar (Collaborator) left a comment:

I know it's unrelated to your PR, but I see a lot of unnecessary API calls being made in external-systems-list.tsx.

  1. https://github.com/red-hat-storage/odf-console/blob/master/packages/odf/components/system-list/external-systems-list.tsx#L411-L413
  2. https://github.com/red-hat-storage/odf-console/blob/master/packages/odf/components/system-list/external-systems-list.tsx#L394-L396
  3. https://github.com/red-hat-storage/odf-console/blob/master/packages/odf/components/system-list/external-systems-list.tsx#L390-L393

I will leave this up to you; it's your call. If you think it makes sense, please remove all these calls in this PR itself; otherwise we can send a follow-up later, which is fine too.

if (!isLocalClusterConfigured) {
  await labelNodes(componentState.selectedNodes)();
  await createScaleLocalClusterPayload(false)();
  await labelUserWorkloadMonitoringNamespace();

Collaborator:

What if users deploy Scale first and then want SAN? Will this labelling be a manual action in that case?

Should we also label during Scale creation: https://github.com/red-hat-storage/odf-console/blob/master/packages/odf/components/create-storage-system/external-systems/CreateScaleSystem/CreateScaleSystem.tsx#L155-L161 ?

Contributor:

We can label during Scale creation once we get confirmation that metrics are being passed from the remote Scale system. For now, let's apply the label regardless of whether the local cluster is already configured or not.
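
A minimal sketch of what that could look like in the creation flow, based on the call order quoted above (illustrative only, not necessarily the final shape of the change):

// Illustrative sketch: apply the namespace label unconditionally, and keep the
// rest of the local-cluster setup behind the existing check.
await labelUserWorkloadMonitoringNamespace();
if (!isLocalClusterConfigured) {
  await labelNodes(componentState.selectedNodes)();
  await createScaleLocalClusterPayload(false)();
}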

Comment on lines +70 to +87
export const labelUserWorkloadMonitoringNamespace = () => {
  const patch: Patch[] = [
    {
      op: 'add',
      path: '/metadata/labels/scale.spectrum.ibm.com~1networkpolicy',
      value: 'allow',
    },
  ];
  return k8sPatch({
    model: NamespaceModel,
    resource: {
      metadata: {
        name: OPENSHIFT_USER_WORKLOAD_MONITORING_NAMESPACE,
      },
    },
    data: patch,
  });
};

@SanjalKatiyar (Collaborator) commented Dec 18, 2025:

Just asking for my reference:

  1. Who creates the NetworkPolicy? In which namespace?
  2. Is there any use case where we need to support FDF on ROSA? If so, then this patch might not work; else, it should be fine.

@weirdwiz (Contributor, Author) commented Dec 18, 2025:

We add the scale.spectrum.ibm.com/networkpolicy=allow label to the openshift-user-workload-monitoring namespace so that the Grafana bridge can communicate with the user workload monitoring instance of Prometheus.

Regarding ROSA, I'm not entirely sure; @Madhu-1 might be able to answer better whether that's a scenario we need to think about at all.

Comment on lines +495 to +498
const [localClusters] = useK8sWatchResource<ClusterKind[]>({
  kind: referenceForModel(ClusterModel),
  isList: true,
  namespace: IBM_SCALE_NAMESPACE,

Collaborator:

I don't think it's needed; check the response of useWatchStorageSystems once. IIRC we convert all systems (Ceph/Scale/FileSystem/SAN) to the StorageSystem CR format.

Suggested change
const [localClusters] = useK8sWatchResource<ClusterKind[]>({
kind: referenceForModel(ClusterModel),
isList: true,
namespace: IBM_SCALE_NAMESPACE,

Comment on lines +469 to +493
const [gpfsLatency] = useCustomPrometheusPoll({
  endpoint: PrometheusEndpoint.QUERY,
  query: GPFS_QUERIES[GPFSQueries.LATENCY],
  basePath: prometheusBasePath,
});
const [gpfsIops] = useCustomPrometheusPoll({
  endpoint: PrometheusEndpoint.QUERY,
  query: GPFS_QUERIES[GPFSQueries.IOPS],
  basePath: prometheusBasePath,
});
const [gpfsThroughput] = useCustomPrometheusPoll({
  endpoint: PrometheusEndpoint.QUERY,
  query: GPFS_QUERIES[GPFSQueries.THROUGHPUT],
  basePath: prometheusBasePath,
});
const [gpfsRawCapacity] = useCustomPrometheusPoll({
  endpoint: PrometheusEndpoint.QUERY,
  query: GPFS_QUERIES[GPFSQueries.RAW_CAPACITY],
  basePath: prometheusBasePath,
});
const [gpfsUsedCapacity] = useCustomPrometheusPoll({
  endpoint: PrometheusEndpoint.QUERY,
  query: GPFS_QUERIES[GPFSQueries.USED_CAPACITY],
  basePath: prometheusBasePath,
});

Collaborator:

I don't really like for us to always make so many API calls when the majority of users are not even using FDF...
You only need these calls when the isFDF flag is true...
e.g.: https://github.com/red-hat-storage/odf-console/blob/master/packages/shared/src/hooks/useWatchStorageClusters.ts#L38
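
One possible shape for that gating, sketched only; it assumes useCustomPrometheusPoll treats an empty query as a no-op, which needs verifying against the hook's actual behavior (otherwise the gating would have to happen at a higher level, as in the linked hook):

// Illustrative only: gate the Scale/SAN queries on the FDF flag so non-FDF
// users do not trigger these Prometheus calls. Assumes an empty query string
// disables the poll; confirm before adopting this pattern.
const [gpfsLatency] = useCustomPrometheusPoll({
  endpoint: PrometheusEndpoint.QUERY,
  query: isFDF ? GPFS_QUERIES[GPFSQueries.LATENCY] : '',
  basePath: prometheusBasePath,
});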

Comment on lines +377 to +385
const renderStatus = () => {
  if (obj?.metadata?.deletionTimestamp) {
    return <Status status="Terminating" />;
  }
  if (isSANSystem && sanStatus) {
    return <Status status={sanStatus} />;
  }
  return <OperandStatus operand={obj} />;
};

Collaborator:

Nit: move this outside the FC...
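
A rough sketch of that extraction (illustrative only; the prop names and the K8sResourceCommon typing are assumptions, and it reuses the Status and OperandStatus components already imported in this file):

// Illustrative sketch: a small presentational component defined outside the
// row's functional component, receiving what it needs as props.
type SystemStatusProps = {
  obj: K8sResourceCommon;
  isSANSystem: boolean;
  sanStatus?: string;
};

const SystemStatus: React.FC<SystemStatusProps> = ({
  obj,
  isSANSystem,
  sanStatus,
}) => {
  if (obj?.metadata?.deletionTimestamp) {
    return <Status status="Terminating" />;
  }
  if (isSANSystem && sanStatus) {
    return <Status status={sanStatus} />;
  }
  return <OperandStatus operand={obj} />;
};

The row would then render <SystemStatus obj={obj} isSANSystem={isSANSystem} sanStatus={sanStatus} /> instead of calling renderStatus().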

Comment on lines +373 to +375
const isSANSystem = kind === ClusterModel.kind.toLowerCase();
const localCluster = isSANSystem ? localClusters?.[0] : null;
const sanStatus = getLocalClusterStatus(localCluster);

Collaborator:

This is not needed... the output of useWatchStorageSystems should already have the respective CR.
If any info like "status" etc. is missing, we can update useWatchStorageSystems to include that as well.

Directly use obj to determine all the info... no need to pass the localClusters list.
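
A possible direction for that, sketched under the assumption that useWatchStorageSystems surfaces the SAN cluster's conditions on the converted object (illustrative only; getLocalClusterStatus would have to accept that shape):

// Illustrative only: derive everything from the row's own obj instead of a
// separately watched localClusters list. Assumes the converted object carries
// the status/conditions needed by getLocalClusterStatus.
const { kind } = getGVK(obj.spec.kind);
const isSANSystem = kind === ClusterModel.kind.toLowerCase();
const sanStatus = isSANSystem ? getLocalClusterStatus(obj) : undefined;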

@SanjalKatiyar (Collaborator) commented Dec 18, 2025:

we should not have checks like localClusters?.[0] in a generic row component...

(item) => item?.metric?.managedBy === system.spec.name
)?.value?.[1];
const { kind } = getGVK(system.spec.kind);
const isGPFSSystem = kind === ClusterModel.kind.toLowerCase();

Collaborator:

What is "GPFS"?
We only use Scale or SAN terminology in this repo...

usedCapacity,
iops
odfMetrics,
gpfsMetrics

Collaborator:

What is "GPFS"?
We only use Scale or SAN terminology in this repo...

export const IBM_SCALE_OPERATOR_NAME = 'ibm-spectrum-scale-operator';
export const IBM_SCALE_LOCAL_CLUSTER_NAME = 'ibm-spectrum-scale';
export const SAN_STORAGE_SYSTEM_NAME = 'SAN_Storage';
export const OPENSHIFT_USER_WORKLOAD_MONITORING_NAMESPACE =

Collaborator:

Nit: this is more of a shared/constants thing...

Comment on lines +75 to +85
export const GrafanaBridgeModel: K8sModel = {
  apiVersion: 'v1beta1',
  apiGroup: 'scale.spectrum.ibm.com',
  kind: 'GrafanaBridge',
  plural: 'grafanabridges',
  label: 'GrafanaBridge',
  labelPlural: 'GrafanaBridges',
  crd: true,
  abbr: 'GB',
  namespaced: true,
};

Collaborator:

no need for this anymore, right ??

Suggested change
export const GrafanaBridgeModel: K8sModel = {
apiVersion: 'v1beta1',
apiGroup: 'scale.spectrum.ibm.com',
kind: 'GrafanaBridge',
plural: 'grafanabridges',
label: 'GrafanaBridge',
labelPlural: 'GrafanaBridges',
crd: true,
abbr: 'GB',
namespaced: true,
};

const patch: Patch[] = [
  {
    op: 'add',
    path: '/metadata/labels/scale.spectrum.ibm.com~1networkpolicy',

Contributor:

Is it really ~1, or should it be just networkpolicy?
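
For reference, in a JSON Patch path ~1 is the RFC 6901 escape for /, so the path above targets the single label key scale.spectrum.ibm.com/networkpolicy. A minimal sketch of that mapping (illustrative helper, not code from this PR):

// Illustrative only: escape a label key for use in a JSON Patch path (RFC 6901).
// '~' becomes '~0' first, then '/' becomes '~1', so
// 'scale.spectrum.ibm.com/networkpolicy' -> 'scale.spectrum.ibm.com~1networkpolicy'.
const escapeJsonPointerSegment = (segment: string): string =>
  segment.replace(/~/g, '~0').replace(/\//g, '~1');

const labelPatchPath = `/metadata/labels/${escapeJsonPointerSegment(
  'scale.spectrum.ibm.com/networkpolicy'
)}`;
// labelPatchPath === '/metadata/labels/scale.spectrum.ibm.com~1networkpolicy'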
