Add metrics integration for Fusion Access SAN #2438
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: weirdwiz. The full list of commands accepted by this bot can be found here.

Details: needs approval from an approver in each of these files. Approvers can indicate their approval by writing ...
  });
};

export const createGrafanaBridge = () => {
We should not create this one; SCALE is already exposing a configuration in the Cluster CR to deploy and configure the Grafana bridge.
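For context, the alternative being suggested here is to drive the Grafana bridge through the Scale Cluster CR instead of creating a separate GrafanaBridge CR. A minimal sketch of what that could look like from the console side, assuming a hypothetical spec.grafanaBridge field (the real field name and shape are defined by the Scale operator's Cluster CRD and may differ):

import { k8sPatch, Patch } from '@openshift-console/dynamic-plugin-sdk';
// ClusterModel is the Scale Cluster model already referenced elsewhere in this PR.

// Hypothetical sketch only: enable the Grafana bridge via the existing Cluster CR.
// '/spec/grafanaBridge' is an assumed path, not a confirmed field of the CRD.
export const enableGrafanaBridgeOnCluster = (name: string, namespace: string) => {
  const patch: Patch[] = [{ op: 'add', path: '/spec/grafanaBridge', value: {} }];
  return k8sPatch({
    model: ClusterModel,
    resource: { metadata: { name, namespace } },
    data: patch,
  });
};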
export const GPFS_QUERIES: { [key in GPFSQueries]: string } = {
  [GPFSQueries.RAW_CAPACITY]:
    'sum by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_diskpool_name) (gpfs_pool_total_dataKB)) * 1024',
  [GPFSQueries.USED_CAPACITY]:
    'sum by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_diskpool_name) (gpfs_pool_total_dataKB - gpfs_pool_free_dataKB)) * 1024',
  [GPFSQueries.IOPS]:
    'sum by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_fs_name) (gpfs_fs_read_ops + gpfs_fs_write_ops))',
  [GPFSQueries.THROUGHPUT]:
    'sum by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_fs_name) (gpfs_fs_bytes_read + gpfs_fs_bytes_written))',
  [GPFSQueries.LATENCY]:
    'avg by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_fs_name) (gpfs_fs_tot_disk_wait_rd + gpfs_fs_tot_disk_wait_wr))',
We need to check with the Scale team whether these values are the right ones we are looking for.
- Add GPFS Prometheus queries for capacity, IOPS, throughput, and latency
- Create GrafanaBridge CR when SAN system is created to enable metrics
- Label openshift-user-workload-monitoring namespace for network policy
- Add GrafanaBridgeModel to shared models
- Use LocalCluster CR conditions for SAN system status in external-systems list
- Query deduplication using max by clause to handle HA pod replicas

Signed-off-by: Divyansh Kamboj <[email protected]>
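On the "query deduplication using max by" point above: with HA collector pod replicas, the same metric series is presumably reported more than once, so summing directly would double count; taking max by over the identifying labels first collapses the duplicates. A small illustration using the raw-capacity query from this PR (the first constant is illustrative only, not part of the change):

// Naive form (illustrative): would add up each replica's copy of the same pool metric.
const NAIVE_RAW_CAPACITY =
  'sum by (gpfs_cluster_name) (gpfs_pool_total_dataKB) * 1024';

// Deduplicated form as used in this PR: max by collapses duplicate series per pool
// before the sum, so HA pod replicas do not inflate the result.
const DEDUPED_RAW_CAPACITY =
  'sum by (gpfs_cluster_name) (max by (gpfs_cluster_name, gpfs_diskpool_name) (gpfs_pool_total_dataKB)) * 1024';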
Force-pushed from eba585f to 5b20718
SanjalKatiyar left a comment
I know it's unrelated to your PR, but I see a lot of unnecessary API calls being made in external-systems-list.tsx:
- https://github.com/red-hat-storage/odf-console/blob/master/packages/odf/components/system-list/external-systems-list.tsx#L411-L413
- https://github.com/red-hat-storage/odf-console/blob/master/packages/odf/components/system-list/external-systems-list.tsx#L394-L396
- https://github.com/red-hat-storage/odf-console/blob/master/packages/odf/components/system-list/external-systems-list.tsx#L390-L393
I will leave this up to you; it's your call. If you think it makes sense, please remove all these calls in this PR itself; otherwise we can send a follow-up later, that's fine too.
if (!isLocalClusterConfigured) {
  await labelNodes(componentState.selectedNodes)();
  await createScaleLocalClusterPayload(false)();
  await labelUserWorkloadMonitoringNamespace();
What if users deploy Scale first and then want SAN? Will this labelling be a manual action in that case?
Should we also label during Scale creation: https://github.com/red-hat-storage/odf-console/blob/master/packages/odf/components/create-storage-system/external-systems/CreateScaleSystem/CreateScaleSystem.tsx#L155-L161 ?
We can label during Scale creation once we get confirmation that metrics are being passed from the remote Scale system. For now, let's apply the label regardless of whether the local cluster is already configured or not.
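A minimal sketch of that change, based on the snippet quoted above (only the call order changes; the surrounding submit handler is assumed):

// Always label the user-workload-monitoring namespace for the SAN flow,
// independent of whether the local Scale cluster is already configured.
await labelUserWorkloadMonitoringNamespace();

if (!isLocalClusterConfigured) {
  await labelNodes(componentState.selectedNodes)();
  await createScaleLocalClusterPayload(false)();
}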
export const labelUserWorkloadMonitoringNamespace = () => {
  const patch: Patch[] = [
    {
      op: 'add',
      path: '/metadata/labels/scale.spectrum.ibm.com~1networkpolicy',
      value: 'allow',
    },
  ];
  return k8sPatch({
    model: NamespaceModel,
    resource: {
      metadata: {
        name: OPENSHIFT_USER_WORKLOAD_MONITORING_NAMESPACE,
      },
    },
    data: patch,
  });
};
Just asking for my ref:
- Who creates the NetworkPolicy ?? In which namespace ??
- Is there any use case where we need to support FDF on ROSA ?? If so, then this patch might not work. Else, it should be fine.
We add the scale.spectrum.ibm.com/networkpolicy=allow label to the openshift-user-workload-monitoring namespace so that the Grafana bridge can communicate with the user-workload-monitoring instance of Prometheus.
Regarding ROSA, I'm not entirely sure; @Madhu-1 might be able to answer better whether that's a scenario we need to think about at all.
const [localClusters] = useK8sWatchResource<ClusterKind[]>({
  kind: referenceForModel(ClusterModel),
  isList: true,
  namespace: IBM_SCALE_NAMESPACE,
I don't think it's needed; check the response of useWatchStorageSystems once. IIRC we convert all systems (Ceph/Scale/FileSystem/SAN) to the StorageSystem CR format.
Suggested change: remove the localClusters watch shown above.
const [gpfsLatency] = useCustomPrometheusPoll({
  endpoint: PrometheusEndpoint.QUERY,
  query: GPFS_QUERIES[GPFSQueries.LATENCY],
  basePath: prometheusBasePath,
});
const [gpfsIops] = useCustomPrometheusPoll({
  endpoint: PrometheusEndpoint.QUERY,
  query: GPFS_QUERIES[GPFSQueries.IOPS],
  basePath: prometheusBasePath,
});
const [gpfsThroughput] = useCustomPrometheusPoll({
  endpoint: PrometheusEndpoint.QUERY,
  query: GPFS_QUERIES[GPFSQueries.THROUGHPUT],
  basePath: prometheusBasePath,
});
const [gpfsRawCapacity] = useCustomPrometheusPoll({
  endpoint: PrometheusEndpoint.QUERY,
  query: GPFS_QUERIES[GPFSQueries.RAW_CAPACITY],
  basePath: prometheusBasePath,
});
const [gpfsUsedCapacity] = useCustomPrometheusPoll({
  endpoint: PrometheusEndpoint.QUERY,
  query: GPFS_QUERIES[GPFSQueries.USED_CAPACITY],
  basePath: prometheusBasePath,
});
I don't really like for us to always make so many API calls when the majority of users are not even using FDF...
You only need these calls when the isFDF flag is true...
e.g.: https://github.com/red-hat-storage/odf-console/blob/master/packages/shared/src/hooks/useWatchStorageClusters.ts#L38
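One way to avoid the extra requests without assuming how useCustomPrometheusPoll handles a disabled state is to move the polling hooks into a child component that is only mounted when the FDF flag is set. A rough sketch (component and prop names are illustrative, not from this PR; the conditional-query pattern linked above may be the preferred approach):

// Illustrative only: GPFS polling hooks live in a child component, so non-FDF
// users never mount it and never issue these Prometheus queries.
const GPFSMetricsPolling: React.FC<{ basePath: string }> = ({ basePath }) => {
  const [gpfsLatency] = useCustomPrometheusPoll({
    endpoint: PrometheusEndpoint.QUERY,
    query: GPFS_QUERIES[GPFSQueries.LATENCY],
    basePath,
  });
  // ...the remaining queries from the diff above would follow the same pattern...
  return null; // results would be surfaced via context or a callback prop
};

// In the parent component:
// {isFDF && <GPFSMetricsPolling basePath={prometheusBasePath} />}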
const renderStatus = () => {
  if (obj?.metadata?.deletionTimestamp) {
    return <Status status="Terminating" />;
  }
  if (isSANSystem && sanStatus) {
    return <Status status={sanStatus} />;
  }
  return <OperandStatus operand={obj} />;
};
nits: move outside the FC...
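A possible extraction for that nit, keeping the same logic but defining it at module level so it is not recreated on every render (type and prop names are illustrative):

// Sketch: the status renderer hoisted out of the row component; inputs are
// passed explicitly instead of being closed over from the FC body.
type SystemStatusProps = {
  obj: K8sResourceCommon;
  isSANSystem: boolean;
  sanStatus?: string;
};

const SystemStatus: React.FC<SystemStatusProps> = ({ obj, isSANSystem, sanStatus }) => {
  if (obj?.metadata?.deletionTimestamp) {
    return <Status status="Terminating" />;
  }
  if (isSANSystem && sanStatus) {
    return <Status status={sanStatus} />;
  }
  return <OperandStatus operand={obj} />;
};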
const isSANSystem = kind === ClusterModel.kind.toLowerCase();
const localCluster = isSANSystem ? localClusters?.[0] : null;
const sanStatus = getLocalClusterStatus(localCluster);
This is not needed... the output of useWatchStorageSystems should already have the respective CR.
If any info like "status" etc. is missing, we can update useWatchStorageSystems to include that as well.
Directly use obj to determine all the info... no need to pass the localClusters list...
we should not have checks like localClusters?.[0] in a generic row component...
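A rough sketch of the direction being suggested, assuming useWatchStorageSystems already surfaces the underlying CR (including its status) as obj for each row; getLocalClusterStatus is the helper from this PR, and the cast is illustrative:

// Illustrative: derive the SAN status from the row's own object rather than a
// separately watched localClusters list.
const isSANSystem = kind === ClusterModel.kind.toLowerCase();
const sanStatus = isSANSystem
  ? getLocalClusterStatus(obj as ClusterKind)
  : undefined;
// <SystemStatus obj={obj} isSANSystem={isSANSystem} sanStatus={sanStatus} />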
  (item) => item?.metric?.managedBy === system.spec.name
)?.value?.[1];
const { kind } = getGVK(system.spec.kind);
const isGPFSSystem = kind === ClusterModel.kind.toLowerCase();
what is "GPFS" ??
we only use scale or san terminology in this repo...
usedCapacity,
iops
odfMetrics,
gpfsMetrics
what is "GPFS" ??
we only use scale or san terminology in this repo...
export const IBM_SCALE_OPERATOR_NAME = 'ibm-spectrum-scale-operator';
export const IBM_SCALE_LOCAL_CLUSTER_NAME = 'ibm-spectrum-scale';
export const SAN_STORAGE_SYSTEM_NAME = 'SAN_Storage';
export const OPENSHIFT_USER_WORKLOAD_MONITORING_NAMESPACE =
nits: more of a shared/constants...
export const GrafanaBridgeModel: K8sModel = {
  apiVersion: 'v1beta1',
  apiGroup: 'scale.spectrum.ibm.com',
  kind: 'GrafanaBridge',
  plural: 'grafanabridges',
  label: 'GrafanaBridge',
  labelPlural: 'GrafanaBridges',
  crd: true,
  abbr: 'GB',
  namespaced: true,
};
no need for this anymore, right ??
Suggested change: remove the GrafanaBridgeModel definition shown above.
const patch: Patch[] = [
  {
    op: 'add',
    path: '/metadata/labels/scale.spectrum.ibm.com~1networkpolicy',
path: '/metadata/labels/scale.spectrum.ibm.com~1networkpolicy',
Is it really ~1, or should it just be networkpolicy?
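For reference on the ~1 question: in a JSON Patch path, ~1 is the RFC 6901 (JSON Pointer) escape for /, so this segment addresses the single label key scale.spectrum.ibm.com/networkpolicy rather than a nested path. A small sketch of the escaping (helper name is illustrative):

// JSON Pointer (RFC 6901) escaping: '~' becomes '~0' and '/' becomes '~1'.
// Escape '~' first so the '~1' sequences produced for '/' are not re-escaped.
const escapePointerSegment = (segment: string): string =>
  segment.replace(/~/g, '~0').replace(/\//g, '~1');

const labelKey = 'scale.spectrum.ibm.com/networkpolicy';
const path = `/metadata/labels/${escapePointerSegment(labelKey)}`;
// => '/metadata/labels/scale.spectrum.ibm.com~1networkpolicy'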
Label the openshift-user-workload-monitoring namespace with a network policy for metrics scraping; use max by to handle pod replicas.