
Conversation

@H-M-Quang-Ngo (Contributor) commented Jun 19, 2025

Proposed changes to OpenStack Cloud Usage and OpenStack Compute Overview dashboards to address #154

This solution adds a template variable `ceph_without_ephemeral` to these dashboards, which checks the following conditions:

  • Ceph storage is detected
  • All compute nodes report the same local_gb storage (stddev < 1 MB)
  • That local_gb value is very close to the total Ceph storage (difference < 5 GB)

If all checks pass, querying ceph_without_ephemeral returns true (1), indicating that Ceph is fully used as the storage backend on this deployment and there is likely no ephemeral/instance storage available. In that case we need to use different expressions for disk calculation; otherwise (when it returns 0) we fall back to the existing expressions.
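For illustration only, a minimal PromQL sketch of what such a detection variable could look like is below (the actual expression in this PR may differ). ceph_cluster_total_bytes is the Ceph metric referenced later in this thread; openstack_nova_local_storage_available_bytes is an assumed name for the exporter's per-hypervisor local_gb metric:

max(
  (
      # Ceph metrics are present at all
      (count(ceph_cluster_total_bytes) > bool 0)
      # all hypervisors report (nearly) the same local storage: stddev < 1 MB
    * (stddev(openstack_nova_local_storage_available_bytes) < bool 1e6)
      # that local storage is within 5 GB of the Ceph cluster total
    * (abs(avg(openstack_nova_local_storage_available_bytes)
           - scalar(ceph_cluster_total_bytes)) < bool 5e9)
  )
  or vector(0)
)

The trailing or vector(0) keeps the variable defined as 0 when any check yields no data, and the outer max() collapses the result to a single value (see the note about the `or` operator further down in this thread).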

The panels whose expressions are proposed to change are: Free Disk by Aggregate (in OpenStack Cloud Usage), and Hypervisor Disk Subscription and its Total (in OpenStack Compute Overview).
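As a rough sketch of the conditional pattern (not the exact expressions in this PR), a free-disk style panel can switch between the two sources by multiplying each term with the variable; metric names are assumed as above, and the per-aggregate grouping of Free Disk by Aggregate is omitted for brevity:

  # Ceph-backed free space, zeroed out when the variable is 0
  (
    (sum(ceph_cluster_total_bytes) - sum(ceph_cluster_total_used_bytes))
      or vector(0)
  ) * $ceph_without_ephemeral
+
  # existing per-hypervisor calculation, used when the variable is 0
  (
    sum(openstack_nova_local_storage_available_bytes)
      - sum(openstack_nova_local_storage_used_bytes)
  ) * (1 - $ceph_without_ephemeral)

Because one of the two multipliers is always 0, the panel always renders exactly one of the two calculations.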

Some sample calculations

Case 1: Current bug (shared Ceph backend)

Input:  20 nodes × 257 TB each, currently summed to 4.91 PiB (WRONG)
Detection: TRUE (Ceph exists + identical values + values match the Ceph total)
Output: 257 TB (correct)
Result: Storage is reported correctly

Case 2: Identical local storage, no Ceph

Input:  20 nodes × 1 TB each = 20 TB
Detection: FALSE (no Ceph metrics)
Output: sum = 20 TB (UNCHANGED)
Result: No false positive

Case 3: Mixed Ceph + local storage

Input:  Ceph exists [1 TB], nodes report [520 GB, 500 GB, 480 GB]
Detection: FALSE (nodes' storage values are not identical: stddev > 1 MB)
Output: sum = 1500 GB (UNCHANGED)
Result: Mixed storage is correctly preserved

Testing

Default-configuration environment (no Ceph)

  • Jammy Yoga OpenStack with local ephemeral storage and 3 compute nodes, deployed from stsstack-bundles
  • openstack-exporter-operator with grafana-agent subordinate.
  • COS-lite on MicroK8s.
  • 3 VM instances each with 20GiB disk created.

Brief check

$ openstack hypervisor list
+----+---------------------------+-----------------+--------------+-------+
| ID | Hypervisor Hostname       | Hypervisor Type | Host IP      | State |
+----+---------------------------+-----------------+--------------+-------+
|  1 | juju-1d5215-yoga-local-9  | QEMU            | 10.149.54.33 | up    |
|  2 | juju-1d5215-yoga-local-8  | QEMU            | 10.149.54.56 | up    |
|  3 | juju-1d5215-yoga-local-10 | QEMU            | 10.149.54.10 | up    |
+----+---------------------------+-----------------+--------------+-------+
$ openstack hypervisor show -c local_gb juju-1d5215-yoga-local-9
+----------+-------+
| Field    | Value |
+----------+-------+
| local_gb | 49    |
+----------+-------+
$ openstack hypervisor show -c local_gb juju-1d5215-yoga-local-8
+----------+-------+
| Field    | Value |
+----------+-------+
| local_gb | 49    |
+----------+-------+
$ openstack hypervisor show -c local_gb juju-1d5215-yoga-local-10
+----------+-------+
| Field    | Value |
+----------+-------+
| local_gb | 49    |
+----------+-------+

From the Ceph storage panel (OpenStack Cloud Usage): [screenshot]

Result:

No Ceph shared storage backend is detected, as shown: [screenshot]

OpenStack Cloud Usage

Unchanged: [screenshot]

OpenStack Compute Overview

Unchanged: [screenshot]

Ceph shared-storage environment (the issue's case)

  • Jammy Yoga OpenStack with Ceph as the shared storage backend and 3 compute nodes, deployed from stsstack-bundles

  • openstack-exporter-operator with grafana-agent subordinate.

  • COS-lite on MicroK8s.

  • 3 VM instances each with 1GiB disk created.

  • Add additional relations to scrape Ceph metrics:
    - From the OpenStack model:
      juju offer ceph-mon:metrics-endpoint
    - From the COS model:
      juju consume yoga-ceph.ceph-mon
      juju relate ceph-mon:metrics-endpoint prometheus-k8s:metrics-endpoint

    (May need to work around this bug.)

Brief check

$ openstack volume service list -c Binary -c State -c Status -c Host
+------------------+-----------------------------+---------+-------+
| Binary           | Host                        | Status  | State |
+------------------+-----------------------------+---------+-------+
| cinder-scheduler | juju-b7dbae-yoga-ceph-4     | enabled | down  |
| cinder-volume    | juju-b7dbae-yoga-ceph-4@LVM | enabled | down  |
| cinder-volume    | cinder@cinder-ceph          | enabled | up    |
| cinder-scheduler | cinder                      | enabled | up    |
+------------------+-----------------------------+---------+-------+
$ openstack hypervisor list
+----+--------------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname      | Hypervisor Type | Host IP       | State |
+----+--------------------------+-----------------+---------------+-------+
|  1 | juju-b7dbae-yoga-ceph-12 | QEMU            | 10.149.54.52  | up    |
|  2 | juju-b7dbae-yoga-ceph-21 | QEMU            | 10.149.54.17  | up    |
|  3 | juju-b7dbae-yoga-ceph-20 | QEMU            | 10.149.54.123 | up    |
+----+--------------------------+-----------------+---------------+-------+
$ juju run ceph-mon/0 pool-statistics
CLASS     SIZE    AVAIL    USED    RAW USED    %RAW USED
hdd       30 GiB  25 GiB   5.1 GiB  5.1 GiB    17.07%
TOTAL     30 GiB  25 GiB   5.1 GiB  5.1 GiB    17.07%
$ openstack hypervisor show -c local_gb juju-b7dbae-yoga-ceph-12 
+----------+-------+
| Field    | Value |
+----------+-------+
| local_gb | 29    |
+----------+-------+
$ openstack hypervisor show -c local_gb juju-b7dbae-yoga-ceph-21
+----------+-------+
| Field    | Value |
+----------+-------+
| local_gb | 29    |
+----------+-------+
$ openstack hypervisor show -c local_gb juju-b7dbae-yoga-ceph-20
+----------+-------+
| Field    | Value |
+----------+-------+
| local_gb | 29    |
+----------+-------+

From the Ceph storage panel (OpenStack Cloud Usage): [screenshot]

Result:

Ceph shared storage backend is correctly detected for each dashboard, as shown: [screenshot]

OpenStack Cloud Usage

Before:

Free disk is incorrectly reported at 3× the actual capacity (29 × 3 - 3 allocated = 84 GiB): [screenshot]

After:

Free Disk is correctly reported as 24.8 GiB (29.9 - 5.10 used = 24.8 GiB): [screenshot]

OpenStack Compute Overview

Before:

Total Disk is incorrectly reported at 3× the actual Ceph cluster storage (29 × 3 = 87 GiB).
The Subscription percentage is correspondingly incorrect: 3/87 ≈ 3.45%.
[screenshot]

After:

Total correctly reflects the actual Ceph cluster storage (29.9 GiB).
Subscription is 5.10/29.9 ≈ 17% (matching the Ceph storage dashboard).
[screenshot]
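For completeness, a hypothetical sketch of how the Subscription ratio could be assembled so that it yields Ceph used / Ceph total (the 17% above) when the variable is 1, and the existing allocated / local ratio otherwise. Again, the openstack_nova_* metric names are assumptions, and the PR's actual expression may differ:

100 *
(
    (sum(ceph_cluster_total_used_bytes) or vector(0)) * $ceph_without_ephemeral
  + sum(openstack_nova_local_storage_used_bytes) * (1 - $ceph_without_ephemeral)
)
/
(
    (sum(ceph_cluster_total_bytes) or vector(0)) * $ceph_without_ephemeral
  + sum(openstack_nova_local_storage_available_bytes) * (1 - $ceph_without_ephemeral)
)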

Note:

The actual changes should be made in sunbeam.

@H-M-Quang-Ngo marked this pull request as ready for review on June 19, 2025 06:45
@H-M-Quang-Ngo requested a review from a team as a code owner on June 19, 2025 06:45
@jneo8 (Contributor) commented Jun 20, 2025

Hi @H-M-Quang-Ngo thanks for the PR.

It would be nicer if you could:

  • provide screenshots of the dashboards
  • (optional) add some example calculations to the PR description to show this works in both cases, when the bug exists and when it does not.

@H-M-Quang-Ngo marked this pull request as draft on June 24, 2025 01:15
@jneo8 (Contributor) commented Jun 25, 2025

Hi @H-M-Quang-Ngo,

I'm okay with the changes now—thanks for providing the calculation examples.

One thing I’d like to request is an update to the commit message. The current message makes it a bit difficult to trace the context of the change. Could you please include more details to make it clearer? Thanks!

- Add a variable to (heuristically) detect a shared-Ceph-backend condition to the compute.json & cloud.json dashboards
- Change the expression of the `Hypervisor Disk Subscription` panel in `compute.json`
- Change the expression of the `Total` panel in `compute.json`
- Change the expression of the `Free Disk by aggregate` panel in `cloud.json`
@H-M-Quang-Ngo force-pushed the update-grafana-dashboard-for-ceph-backend branch from 0de7948 to 20c166c on June 26, 2025 02:57
@H-M-Quang-Ngo force-pushed the update-grafana-dashboard-for-ceph-backend branch 2 times, most recently from e14467f to acc97e0 on July 18, 2025 09:45
@H-M-Quang-Ngo force-pushed the update-grafana-dashboard-for-ceph-backend branch from acc97e0 to 14363a9 on July 21, 2025 06:45
thejjw pushed a commit to thejjw/sunbeam-charms that referenced this pull request Jul 29, 2025
having Ceph-shared storage backend

Add logic to some Openstack-Exporter-Operator dashboards to correctly
report storage status when Ceph is used as the storage backend and
no ephemeral storage is available. The issue is reported in GitHub:
canonical/openstack-exporter-operator#154

This patch adds workaround logic to detect whether the OpenStack
storage is likely backed by Ceph and the nodes' ephemeral storage is
not used. Calculations are also updated for several dashboard
parameters, such as total storage and free storage, so that they are
reported correctly when the condition is met. If the condition is not
met, the logic simply falls back to the existing expressions.

A more detailed explanation and testing proof can be found in:
canonical/openstack-exporter-operator#157

Change-Id: Id14fd0acd3f5b9601a08be35cc078f00d8035ef8
Signed-off-by: Quang Ngo <[email protected]>
- Replace the expressions for `ceph_cluster_total_bytes` and `ceph_cluster_total_used_bytes` so that they evaluate to 0 when Ceph is not used.
@H-M-Quang-Ngo force-pushed the update-grafana-dashboard-for-ceph-backend branch from 62f0997 to d4f1059 on July 30, 2025 00:39
The previous commit had the `Ceph Shared Storage` variable return both 0 and 1 when
shared Ceph is detected, because the `or` operator in PromQL returns both operands
when they both have data. This is fixed simply by wrapping a max() around the
expression.
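
A minimal illustration of that behaviour, using ceph_detected as a stand-in for the detection expression sketched in the PR description (it is not a real metric):

# Before: if ceph_detected and vector(0) carry different label sets, `or`
# keeps both series, so the variable query can yield both 1 and 0.
ceph_detected or vector(0)

# After: max() aggregates everything into a single sample, so the variable
# resolves to exactly one value.
max(ceph_detected or vector(0))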
