
Conversation

@SrinivasaBharath
Contributor

Description

The fix includes:

  1. Automation to verify the osd_scrub_load_threshold with high CPU load generated using stress-ng.
    a. Install the stress-ng package on all RADOS nodes
    b. Start stress-ng to continuously generate CPU load
    c. Monitor the load continuously to verify that it increases, and calculate the load threshold (see the sketch after this list). The formula used is:
    load_threshold = load_average / online_cpu_count
    where
    load_average is the 1-minute load average read from /proc/loadavg
    online_cpu_count is the number of online CPUs reported by nproc
    d. Wait for the load threshold (loadavg / online CPUs) to exceed 15
    e. Set osd_scrub_load_threshold to 10 (the default)
    f. Set debug_osd and debug_mon to 20
    g. Start a user-initiated scrub and verify that the scrub does not start while the load is high
    h. Check the OSD logs for scrub_load_below_threshold messages

  2. Divided the scenarios into separate tests

  3. Uncommented the verification of the osd_scrub_load_threshold parameter with its default value of 10.0
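
A minimal sketch of the load calculation in step 1(c), assuming a cephci node object whose exec_command() returns a (stdout, stderr) pair; get_load_per_cpu is a hypothetical helper name used only for illustration:

def get_load_per_cpu(node) -> float:
    # The 1-minute load average is the first field of /proc/loadavg
    out, _ = node.exec_command(cmd="cat /proc/loadavg", sudo=True)
    load_average = float(out.split()[0])
    # Number of online CPUs reported by nproc
    out, _ = node.exec_command(cmd="nproc", sudo=True)
    online_cpu_count = int(out.strip())
    return load_average / online_cpu_count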

Please include Automation development guidelines. Source of Test case - New Feature/Regression Test/Close loop of customer BZs

Checklist:
  • Create a test case in Polarion reviewed and approved.
  • Create a design/automation approach doc. Optional for tests with similar tests already automated.
  • Review the automation design
  • Implement the test script and perform test runs
  • Submit PR for code review and approve
  • Update Polarion Test with Automation script details and update automation fields
  • If automation is part of Close loop, update BZ flag qe-test_coverage “+” and link Polarion test

@openshift-ci openshift-ci bot added RADOS Rados Core ceph-ci-review waiting review from cephci do-not-merge/hold Do not merge labels Jan 7, 2026
@SrinivasaBharath
Contributor Author

Pass logs - [screenshot]

@SrinivasaBharath
Contributor Author

Pass log - [screenshot]

Case 3: Default threshold value (10.0) - scrub enabled
- Creates inconsistent objects
- Sets osd_scrub_load_threshold to default (10.0)
Contributor

Check the load; proceed only if it is below 10.

Contributor Author

Implemented the code.
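
A minimal sketch of that gate, reusing the hypothetical get_load_per_cpu helper sketched under the PR description; primary_osd_node, log, and run_case3 are placeholders for the objects and case body already present in the test:

current_load = get_load_per_cpu(primary_osd_node)
if current_load >= 10:
    # Do not run the default-threshold case while the node is already loaded
    log.info("Skipping Case 3: per-CPU load {:.3f} is not below the default threshold 10.0".format(current_load))
else:
    # Safe to proceed with the default-threshold scrub verification
    run_case3()  # hypothetical placeholder for the Case 3 body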

Test Execution Flow:
1. Initial Setup:
- Enables file logging for OSD daemons
Contributor

Only enable debug logs on the OSDs that need them.

Contributor Author

@SrinivasaBharath SrinivasaBharath Jan 9, 2026

This was part of the initial setup; I deleted this code. Enabling logging to file is handled in the suite file.
The debug_osd level of 20 is enabled only on the required OSDs.
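
A minimal sketch of scoping the debug level to the acting set only, assuming mon_obj.set_config accepts a per-daemon section such as "osd.<id>" (the same call signature already used in this PR):

for osd_id in acting_pg_set:
    # Raise verbosity only on the OSDs that host the test PG
    mon_obj.set_config(section=f"osd.{osd_id}", name="debug_osd", value="20/20")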

1. Initial Setup:
- Enables file logging for OSD daemons
- Disables PG autoscaler
- Creates a replicated pool and writes test data (10,000 objects, 5K each)
Contributor

Extend the coverage to EC pools as well in the next PR.

Contributor Author

The following Jira ticket has been created, and the work will be taken up after the PR is merged.
https://issues.redhat.com/browse/RHCEPHQE-22851

)
# Check whether the scrubbing is in progress or not
scrub_object.wait_for_pg_scrub_state(pg_id, wait_time=wait_time)
# Get case_to_run from config, default to all cases if not specified
Contributor

Line no 162: get the acting set by passing the PG ID.

Contributor Author

Logic implemented.
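
A minimal sketch of fetching the acting set from the PG ID, assuming run_ceph_command (already used in this PR) returns the parsed JSON output of ceph pg map:

pg_map = rados_obj.run_ceph_command(cmd=f"ceph pg map {pg_id}")
acting_pg_set = pg_map["acting"]          # e.g. [3, 7, 1]
primary_osd = pg_map["acting_primary"]    # first OSD of the acting set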

# Check whether the scrubbing is in progress or not
scrub_object.wait_for_pg_scrub_state(pg_id, wait_time=wait_time)
# Get case_to_run from config, default to all cases if not specified
case_to_run = config.get("case_to_run")
Contributor

If no case is passed, add a debug line and update the code to run none of the scenarios.

Contributor Author

Done.
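
A minimal sketch of that guard, reusing the config.get("case_to_run") lookup shown in the snippet above:

case_to_run = config.get("case_to_run")
if not case_to_run:
    log.debug("No case_to_run passed in the config; running none of the scenarios")
    return 0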

mon_obj.set_config(section="osd", name="osd_scrub_load_threshold", value=0)

# Truncate the osd logs
rados_object.remove_log_file_content(osd_nodes, daemon_type="osd")
Contributor

Add a new method to rotate the logs.

        if enable_debug_logs:
            log.info("Forcing log rotation on OSD nodes...")
            try:
                fsid_result = rados_obj.run_ceph_command(cmd="ceph fsid")
                fsid = (
                    fsid_result.get("fsid")
                    if isinstance(fsid_result, dict)
                    else fsid_result
                )
                if fsid:
                    logrotate_cmd = f"logrotate -f /etc/logrotate.d/ceph-{fsid}"
                    osd_hosts = rados_obj.get_osd_hosts()
                    for host in osd_hosts:
                        try:
                            host.exec_command(
                                cmd=logrotate_cmd,
                                sudo=True,
                                check_ec=False,
                                long_running=True,
                            )
                            time.sleep(10)
                            log.debug(f"Log rotation completed on {host.hostname}")
                        except Exception:
                            pass
                    log.info(f"Log rotation triggered on {len(osd_hosts)} OSD host(s)")
            except Exception as e:
                log.warning(f"Failed to rotate logs: {e}")

Contributor Author

Created method in core_workflows.py

Case 1: Threshold value 0 (scrub blocking)
- Sets osd_scrub_load_threshold to 0
- Verifies scrub operations do not proceed
- Checks for log message: "scrub_load_below_threshold:.* = no"
Contributor

scrub_load_below_threshold should be reverted to the default after each case, and the same should be done in the finally block as well.

Contributor Author

Reverted the scrub_load_below_threshold parameter value in the finally block.
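
A minimal sketch of the per-case revert pattern, using the mon_obj.set_config call already present in this test; restoring the default of 10 is one option, removing the override (as suggested later in this review) is another:

try:
    mon_obj.set_config(section="osd", name="osd_scrub_load_threshold", value=0)
    # ... Case 1: verify that scrubs do not start ...
finally:
    # Revert so a failure in this case cannot leak the threshold into the next case
    mon_obj.set_config(section="osd", name="osd_scrub_load_threshold", value=10)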

if not chk_auto_repair_param:
    log.error(
        "The repair parameters are not set. Not executing the further tests"
    )
    configure_log_level(mon_obj, acting_pg_set, set_to_default=True)
Contributor

Enable debug logging at the beginning of the scenario itself.

Contributor

remove_log_file_content() -> also call log rotate before setting up debug logs

Contributor Author

Done.

This test verifies the functionality of the osd_scrub_load_threshold parameter
which controls when scrub operations are allowed based on system load.
Test Execution Flow:
Contributor

Keep each case in a try/except block.

Contributor Author

Done.

daemon_type="osd",
daemon_id=acting_pg_set[0],
search_string=log_line_no_scrub,
if "case3" in case_to_run:
Contributor

We're not setting up debug logging in case 3. Set it up, and then verify the log lines.

Contributor Author

Included the debug logging and the log verification.

Comment on lines 781 to 786
if initial_load is None:
    initial_load = load_avg
if load_avg > max_load_observed:
    max_load_observed = load_avg
if load_avg > initial_load * 1.5:  # Load increased by at least 50%
    load_increasing = True
Contributor

Please revise this logic.
Since line 782 initializes initial_load to load_avg, the conditional check at line 785 will never be true => load_increasing will never be set to True.

Contributor Author

Modified the logic.

load_increasing = True

# Calculate load threshold = loadavg / number of online CPUs
load_threshold = load_avg / online_cpu_count
Contributor

Please change the terminology here and everywhere below
load_threshold => load_avg_per_cpu

Contributor Author

I felt this is fine because the parameter is calculating the scrub load threshold value. Please let me know if a change is required here.

Comment on lines 819 to 830
if not load_increasing and max_load_observed < initial_load * 1.2:
    log.warning(
        "Scenario 6: Load did not increase significantly. This may indicate stress-ng is "
        "not working properly."
    )

if not load_threshold_achieved:
    log.warning(
        "Scenario 6: Load threshold did not exceed 15 within {} seconds. "
        "Last calculated threshold: {:.3f}. Continuing with test.".format(
            max_wait_time, load_threshold
        )
Contributor

Please revise the logical checks here and fail the test if necessary

Contributor Author

I modified the logic so that if the load does not increase continuously for one minute, the test case is marked as failed.
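
A minimal sketch of that revised check, assuming the baseline is sampled before stress-ng starts and reusing the hypothetical get_load_per_cpu helper sketched in the description; primary_osd_node and log are the objects already available in the test:

import time

baseline = get_load_per_cpu(primary_osd_node)   # sampled before stress-ng is started
deadline = time.time() + 60                     # one-minute window for the load to rise
load_increasing = False
while time.time() < deadline:
    load_avg_per_cpu = get_load_per_cpu(primary_osd_node)
    if load_avg_per_cpu > baseline * 1.5:       # load increased by at least 50%
        load_increasing = True
        break
    time.sleep(10)

if not load_increasing:
    log.error("Load did not increase within 60 seconds; failing the test")
    raise Exception("stress-ng did not raise the CPU load")  # or return 1, per framework convention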

Comment on lines 843 to 845
assert mon_obj.set_config(
    section="osd", name="osd_scrub_load_threshold", value=10
), "Could not set osd_scrub_load_threshold value to 10"
Contributor

This should not be necessary, as we are checking the default value earlier; if we really want to ensure it is using the default value, we should rm the config instead of setting it to 10.

Contributor Author

Removed the code.

Comment on lines 868 to 872
wait_time = 300
try:
    if rados_object.start_check_deep_scrub_complete(
        pg_id=pg_id, user_initiated=False, wait_time=wait_time
    ):
Contributor

Please ensure the scrub parameters are in place so a scheduled scrub starts within the given timeout of 300 seconds.

Contributor Author

The minimum interval is 2 minutes and the maximum is 15 minutes. I increased the wait time to 600 seconds.
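
A minimal sketch of the scheduling knobs implied here, set with the same mon_obj.set_config call used in this test; the exact second values (120 and 900) are an assumption derived from the 2-minute / 15-minute intervals mentioned above:

# 2-minute minimum / 15-minute maximum scrub interval, so a scheduled scrub
# should begin well within the enlarged 600-second wait
mon_obj.set_config(section="osd", name="osd_scrub_min_interval", value=120)
mon_obj.set_config(section="osd", name="osd_scrub_max_interval", value=900)
wait_time = 600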

)

# Stop stress-ng after Step 6 completion
log.info("Scenario 6: Stopping stress-ng process after Step 7 completion")
Contributor

nit pick: step 7 -> step 6

Contributor Author

Corrected the log information.

Comment on lines 883 to 889
primary_osd_host.exec_command(
    sudo=True, cmd="pkill -f 'stress-ng.*cpu'", check_ec=False
)
time.sleep(2)
log.info("Scenario 6: stress-ng process stopped")
except Exception as err:
    log.warning("Scenario 6: Error stopping stress-ng: {}".format(err))
Contributor

Can we please move the stress-ng kill code section into a finally block?

Contributor Author

@SrinivasaBharath SrinivasaBharath Jan 9, 2026

In all scenarios, the stress-ng process is explicitly terminated after test execution, and as part of the cleanup phase, any remaining stress-ng processes are forcibly terminated across all OSD nodes.
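
A minimal sketch of that cleanup step, reusing the get_osd_hosts() and exec_command() calls that already appear in this PR:

for host in rados_obj.get_osd_hosts():
    # Kill any stress-ng worker that survived the per-scenario teardown
    host.exec_command(sudo=True, cmd="pkill -f 'stress-ng.*cpu'", check_ec=False)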

@SrinivasaBharath
Contributor Author

SrinivasaBharath commented Jan 12, 2026

Pass logs - [screenshot]

Logs are copied at - https://ibm.box.com/s/21dmnw8ckk1pvmm3dftlyxqavpyh34tu

@openshift-ci openshift-ci bot added lgtm Add this label when the PR is good to be merged approved Override label till owners file is created labels Jan 13, 2026
@openshift-ci openshift-ci bot removed the lgtm Add this label when the PR is good to be merged label Jan 13, 2026
@openshift-ci
Contributor

openshift-ci bot commented Jan 13, 2026

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci bot added the do-not-merge/hold Do not merge label Jan 13, 2026
@openshift-ci openshift-ci bot removed the approved Override label till owners file is created label Jan 13, 2026
@openshift-ci openshift-ci bot added the RBD Rados Block Device label Jan 13, 2026
@SrinivasaBharath SrinivasaBharath removed RBD Rados Block Device do-not-merge/hold Do not merge labels Jan 13, 2026
@SrinivasaBharath SrinivasaBharath requested review from a team as code owners January 13, 2026 10:18
@openshift-ci openshift-ci bot added RBD Rados Block Device do-not-merge/hold Do not merge labels Jan 13, 2026
@openshift-ci
Contributor

openshift-ci bot commented Jan 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SrinivasaBharath
Once this PR has been reviewed and has the lgtm label, please ask for approval from pdhiran and additionally assign vamahaja for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@SrinivasaBharath
Contributor Author

Abandoning for now due to history issues


Labels

ceph-ci-review (waiting review from cephci), do-not-merge/hold (Do not merge), RADOS (Rados Core), RBD (Rados Block Device)

5 participants