
Conversation

@SrinivasaBharath
Contributor

Description

The fix includes:

  1. Automation to verify the osd_scrub_load_threshold with high CPU load generated using stress-ng.
    a. Install the stress-ng package on all RADOS nodes
    b. Start stress-ng to continuously generate CPU load
    c. Monitor the load continuously to verify that it increases, and calculate the load threshold (see the sketch after this list). The formula used is:
    load_threshold = load_average / online_cpu_count
    where
    load_average is the 1-minute load average read from /proc/loadavg
    online_cpu_count is the number of online CPUs reported by nproc
    d. Wait for the load threshold (loadavg / online CPUs) to exceed 15
    e. Set osd_scrub_load_threshold to 10 (the default)
    f. Set debug_osd and debug_mon to 20
    g. Start a user-initiated scrub and verify that the scrub does not start while the load is high
    h. Check the OSD logs for scrub_load_below_threshold messages

  2. Divided the scenarios into separate tests

  3. Uncommented the verification of the osd_scrub_load_threshold parameter with its default value of 10.0
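
A minimal sketch of the load calculation in step 1(c), assuming a cephci node object whose exec_command() returns a (stdout, stderr) pair; get_load_per_cpu is a hypothetical helper name used only for illustration:

def get_load_per_cpu(node) -> float:
    # The 1-minute load average is the first field of /proc/loadavg
    out, _ = node.exec_command(cmd="cat /proc/loadavg", sudo=True)
    load_average = float(out.split()[0])
    # Number of online CPUs reported by nproc
    out, _ = node.exec_command(cmd="nproc", sudo=True)
    online_cpu_count = int(out.strip())
    return load_average / online_cpu_count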

Please include Automation development guidelines. Source of Test case - New Feature/Regression Test/Close loop of customer BZs

Checklist:
  • Create a test case in Polarion reviewed and approved.
  • Create a design/automation approach doc. Optional for tests with similar tests already automated.
  • Review the automation design
  • Implement the test script and perform test runs
  • Submit PR for code review and approve
  • Update Polarion Test with Automation script details and update automation fields
  • If automation is part of Close loop, update BZ flag qe-test_coverage “+” and link Polarion test

@openshift-ci openshift-ci bot added RADOS Rados Core ceph-ci-review waiting review from cephci do-not-merge/hold Do not merge labels Jan 7, 2026
@SrinivasaBharath
Contributor Author

Pass logs - [screenshot]

@SrinivasaBharath
Contributor Author

Pass log - [screenshot]

Case 3: Default threshold value (10.0) - scrub enabled
- Creates inconsistent objects
- Sets osd_scrub_load_threshold to default (10.0)
Contributor

Check the load; proceed only if it is below 10.

Contributor Author

Implemented the code.
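
A minimal sketch of that gate, reusing the hypothetical get_load_per_cpu helper sketched under the PR description; primary_osd_node, log, and run_case3 are placeholders for the objects and case body already present in the test:

current_load = get_load_per_cpu(primary_osd_node)
if current_load >= 10:
    # Do not run the default-threshold case while the node is already loaded
    log.info("Skipping Case 3: per-CPU load {:.3f} is not below the default threshold 10.0".format(current_load))
else:
    # Safe to proceed with the default-threshold scrub verification
    run_case3()  # hypothetical placeholder for the Case 3 body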

Test Execution Flow:
1. Initial Setup:
- Enables file logging for OSD daemons
Contributor

Only enable debug logs on the OSDs that need them.

Contributor Author

@SrinivasaBharath SrinivasaBharath Jan 9, 2026

This was part of the initial setup; I deleted this code. Enabling logging to file is handled in the suite file.
The debug_osd level of 20 is enabled only on the required OSDs.
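
A minimal sketch of scoping the debug level to the acting set only, assuming mon_obj.set_config accepts a per-daemon section such as "osd.<id>" (the same call signature already used in this PR):

for osd_id in acting_pg_set:
    # Raise verbosity only on the OSDs that host the test PG
    mon_obj.set_config(section=f"osd.{osd_id}", name="debug_osd", value="20/20")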

1. Initial Setup:
- Enables file logging for OSD daemons
- Disables PG autoscaler
- Creates a replicated pool and writes test data (10,000 objects, 5K each)
Contributor

Extend the coverage to EC pools as well in the next PR.

Contributor Author

The following Jira ticket has been created, and the work will be taken up after the PR is merged.
https://issues.redhat.com/browse/RHCEPHQE-22851

)
# Check whether the scrubbing is in progress or not
scrub_object.wait_for_pg_scrub_state(pg_id, wait_time=wait_time)
# Get case_to_run from config, default to all cases if not specified
Contributor

Line no 162: get the acting set by passing the PG ID.

Contributor Author

Logic implemented.
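
A minimal sketch of fetching the acting set from the PG ID, assuming run_ceph_command (already used in this PR) returns the parsed JSON output of ceph pg map:

pg_map = rados_obj.run_ceph_command(cmd=f"ceph pg map {pg_id}")
acting_pg_set = pg_map["acting"]          # e.g. [3, 7, 1]
primary_osd = pg_map["acting_primary"]    # first OSD of the acting set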

# Check whether the scrubbing is in progress or not
scrub_object.wait_for_pg_scrub_state(pg_id, wait_time=wait_time)
# Get case_to_run from config, default to all cases if not specified
case_to_run = config.get("case_to_run")
Contributor

If no case is passed, add a debug line and update the code to run none of the scenarios.

Contributor Author

Done.
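
A minimal sketch of that guard, reusing the config.get("case_to_run") lookup shown in the snippet above:

case_to_run = config.get("case_to_run")
if not case_to_run:
    log.debug("No case_to_run passed in the config; running none of the scenarios")
    return 0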

mon_obj.set_config(section="osd", name="osd_scrub_load_threshold", value=0)

# Truncate the osd logs
rados_object.remove_log_file_content(osd_nodes, daemon_type="osd")
Contributor

Add a new method to rotate the logs.

        if enable_debug_logs:
            log.info("Forcing log rotation on OSD nodes...")
            try:
                fsid_result = rados_obj.run_ceph_command(cmd="ceph fsid")
                fsid = (
                    fsid_result.get("fsid")
                    if isinstance(fsid_result, dict)
                    else fsid_result
                )
                if fsid:
                    logrotate_cmd = f"logrotate -f /etc/logrotate.d/ceph-{fsid}"
                    osd_hosts = rados_obj.get_osd_hosts()
                    for host in osd_hosts:
                        try:
                            host.exec_command(
                                cmd=logrotate_cmd,
                                sudo=True,
                                check_ec=False,
                                long_running=True,
                            )
                            time.sleep(10)
                            log.debug(f"Log rotation completed on {host.hostname}")
                        except Exception:
                            pass
                    log.info(f"Log rotation triggered on {len(osd_hosts)} OSD host(s)")
            except Exception as e:
                log.warning(f"Failed to rotate logs: {e}")

Contributor Author

Created method in core_workflows.py

Case 1: Threshold value 0 (scrub blocking)
- Sets osd_scrub_load_threshold to 0
- Verifies scrub operations do not proceed
- Checks for log message: "scrub_load_below_threshold:.* = no"
Contributor

scrub_load_below_threshold should be reverted to the default after each case, and the same should be done in the finally block as well.

Contributor Author

Reverted the scrub_load_below_threshold parameter value in the finally block.
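
A minimal sketch of the per-case revert pattern, using the mon_obj.set_config call already present in this test; restoring the default of 10 is one option, removing the override (as suggested later in this review) is another:

try:
    mon_obj.set_config(section="osd", name="osd_scrub_load_threshold", value=0)
    # ... Case 1: verify that scrubs do not start ...
finally:
    # Revert so a failure in this case cannot leak the threshold into the next case
    mon_obj.set_config(section="osd", name="osd_scrub_load_threshold", value=10)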

if not chk_auto_repair_param:
    log.error(
        "The repair parameters are not set. Not executing the further tests"
    )
    configure_log_level(mon_obj, acting_pg_set, set_to_default=True)
Contributor

Enable debug logging at the beginning of the scenario itself.

Contributor

remove_log_file_content() -> also call log rotate before setting up debug logs

Contributor Author

Done.

This test verifies the functionality of the osd_scrub_load_threshold parameter
which controls when scrub operations are allowed based on system load.
Test Execution Flow:
Contributor

Keep each case in a try/except block.

Contributor Author

Done.

daemon_type="osd",
daemon_id=acting_pg_set[0],
search_string=log_line_no_scrub,
if "case3" in case_to_run:
Contributor

We're not setting up debug logging in case 3. Set it up, and then verify the log lines.

Contributor Author

Included the debug logging and the log verification.

Comment on lines 781 to 786
if initial_load is None:
    initial_load = load_avg
if load_avg > max_load_observed:
    max_load_observed = load_avg
if load_avg > initial_load * 1.5:  # Load increased by at least 50%
    load_increasing = True
Contributor

Please revise this logic.
Since line 782 initializes initial_load to load_avg, the conditional check at line 785 will never be true => load_increasing will never be set to True.

Contributor Author

Modified the logic.

load_increasing = True

# Calculate load threshold = loadavg / number of online CPUs
load_threshold = load_avg / online_cpu_count
Contributor

Please change the terminology here and everywhere below
load_threshold => load_avg_per_cpu

Contributor Author

I felt this is fine because the parameter is calculating the scrub load threshold value. Please let me know if a change is required here.

Comment on lines 819 to 830
if not load_increasing and max_load_observed < initial_load * 1.2:
    log.warning(
        "Scenario 6: Load did not increase significantly. This may indicate stress-ng is "
        "not working properly."
    )

if not load_threshold_achieved:
    log.warning(
        "Scenario 6: Load threshold did not exceed 15 within {} seconds. "
        "Last calculated threshold: {:.3f}. Continuing with test.".format(
            max_wait_time, load_threshold
        )
Contributor

Please revise the logical checks here and fail the test if necessary

Contributor Author

I modified the logic so that if the load does not increase continuously for one minute, the test case is marked as failed.
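
A minimal sketch of that revised check, assuming the baseline is sampled before stress-ng starts and reusing the hypothetical get_load_per_cpu helper sketched in the description; primary_osd_node and log are the objects already available in the test:

import time

baseline = get_load_per_cpu(primary_osd_node)   # sampled before stress-ng is started
deadline = time.time() + 60                     # one-minute window for the load to rise
load_increasing = False
while time.time() < deadline:
    load_avg_per_cpu = get_load_per_cpu(primary_osd_node)
    if load_avg_per_cpu > baseline * 1.5:       # load increased by at least 50%
        load_increasing = True
        break
    time.sleep(10)

if not load_increasing:
    log.error("Load did not increase within 60 seconds; failing the test")
    raise Exception("stress-ng did not raise the CPU load")  # or return 1, per framework convention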

Comment on lines 843 to 845
assert mon_obj.set_config(
    section="osd", name="osd_scrub_load_threshold", value=10
), "Could not set osd_scrub_load_threshold value to 10"
Contributor

This should not be necessary, as we are checking the default value earlier; if we really want to ensure it is using the default value, we should rm the config instead of setting it to 10.

Contributor Author

Removed the code.

Comment on lines 868 to 872
wait_time = 300
try:
    if rados_object.start_check_deep_scrub_complete(
        pg_id=pg_id, user_initiated=False, wait_time=wait_time
    ):
Contributor

Please ensure the scrub parameters are in place so a scheduled scrub starts within the given timeout of 300 seconds.

Contributor Author

The minimum interval is 2 minutes and the maximum is 15 minutes. I increased the wait time to 600 seconds.
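
A minimal sketch of the scheduling knobs implied here, set with the same mon_obj.set_config call used in this test; the exact second values (120 and 900) are an assumption derived from the 2-minute / 15-minute intervals mentioned above:

# 2-minute minimum / 15-minute maximum scrub interval, so a scheduled scrub
# should begin well within the enlarged 600-second wait
mon_obj.set_config(section="osd", name="osd_scrub_min_interval", value=120)
mon_obj.set_config(section="osd", name="osd_scrub_max_interval", value=900)
wait_time = 600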

)

# Stop stress-ng after Step 6 completion
log.info("Scenario 6: Stopping stress-ng process after Step 7 completion")
Contributor

nit pick: step 7 -> step 6

Contributor Author

Corrected the log information.

Comment on lines 883 to 889
primary_osd_host.exec_command(
    sudo=True, cmd="pkill -f 'stress-ng.*cpu'", check_ec=False
)
time.sleep(2)
log.info("Scenario 6: stress-ng process stopped")
except Exception as err:
    log.warning("Scenario 6: Error stopping stress-ng: {}".format(err))
Contributor

Can we please move the stress-ng kill code section into a finally block?

Contributor Author

@SrinivasaBharath SrinivasaBharath Jan 9, 2026

In all scenarios, the stress-ng process is explicitly terminated after test execution, and as part of the cleanup phase, any remaining stress-ng processes are forcibly terminated across all OSD nodes.
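
A minimal sketch of that cleanup step, reusing the get_osd_hosts() and exec_command() calls that already appear in this PR:

for host in rados_obj.get_osd_hosts():
    # Kill any stress-ng worker that survived the per-scenario teardown
    host.exec_command(sudo=True, cmd="pkill -f 'stress-ng.*cpu'", check_ec=False)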

@SrinivasaBharath
Contributor Author

SrinivasaBharath commented Jan 12, 2026

Pass logs - [screenshot]

Logs are copied at - https://ibm.box.com/s/21dmnw8ckk1pvmm3dftlyxqavpyh34tu

@openshift-ci openshift-ci bot added lgtm Add this label when the PR is good to be merged approved Override label till owners file is created labels Jan 13, 2026
@openshift-ci openshift-ci bot removed the lgtm Add this label when the PR is good to be merged label Jan 13, 2026
@openshift-ci
Contributor

openshift-ci bot commented Jan 13, 2026

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci bot added the do-not-merge/hold Do not merge label Jan 13, 2026
@openshift-ci openshift-ci bot removed the approved Override label till owners file is created label Jan 13, 2026
@openshift-ci openshift-ci bot added the RBD Rados Block Device label Jan 13, 2026
@SrinivasaBharath SrinivasaBharath removed RBD Rados Block Device do-not-merge/hold Do not merge labels Jan 13, 2026
@SrinivasaBharath SrinivasaBharath requested review from a team as code owners January 13, 2026 10:18
@openshift-ci openshift-ci bot added RBD Rados Block Device do-not-merge/hold Do not merge labels Jan 13, 2026
@openshift-ci
Contributor

openshift-ci bot commented Jan 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SrinivasaBharath
Once this PR has been reviewed and has the lgtm label, please ask for approval from pdhiran and additionally assign vamahaja for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@SrinivasaBharath
Contributor Author

Abandoning for now due to history issues


Labels

ceph-ci-review (waiting review from cephci), do-not-merge/hold (Do not merge), RADOS (Rados Core), RBD (Rados Block Device)

5 participants