verification of osd_scrub_load_threshold with high CPU load using stress-ng #5549
18791d5 to ca71b00
89206cd to 92d6002
| Case 3: Default threshold value (10.0) - scrub enabled | ||
| - Creates inconsistent objects | ||
| - Sets osd_scrub_load_threshold to default (10.0) |
Check the load first, and proceed only if it is below 10.
Implemented the code.
| Test Execution Flow: | ||
| 1. Initial Setup: | ||
| - Enables file logging for OSD daemons |
Enable debug logs only on the OSDs that need them.
This was part of the initial setup, so I deleted this code. File logging is already enabled in the suite file.
debug_osd is set to 20 only on the required OSDs.
| 1. Initial Setup: | ||
| - Enables file logging for OSD daemons | ||
| - Disables PG autoscaler | ||
| - Creates a replicated pool and writes test data (10,000 objects, 5K each) |
Extend the coverage to EC pools as well in the next PR.
The following Jira ticket has been created, and the work will be taken up after the PR is merged.
https://issues.redhat.com/browse/RHCEPHQE-22851
| ) | ||
| # Check that the scrubbing is progress or not | ||
| scrub_object.wait_for_pg_scrub_state(pg_id, wait_time=wait_time) | ||
| # Get case_to_run from config, default to all cases if not specified |
Line 162: get the acting set by passing the PG ID.
Logic implemented.
| # Check that the scrubbing is progress or not | ||
| scrub_object.wait_for_pg_scrub_state(pg_id, wait_time=wait_time) | ||
| # Get case_to_run from config, default to all cases if not specified | ||
| case_to_run = config.get("case_to_run") |
If no case is passed, add a debug line and update the code to run none of the scenarios.
Done.
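A minimal sketch of the behaviour requested above, assuming `case_to_run` holds the case names read from the test config. `select_cases` and `ALL_CASES` are hypothetical names used for illustration, not identifiers from this PR.

```python
import logging

log = logging.getLogger(__name__)

# Hypothetical case names; the PR's actual identifiers may differ.
ALL_CASES = ("case1", "case2", "case3")


def select_cases(config):
    """Return the cases to run; an absent or empty setting selects none."""
    requested = config.get("case_to_run")
    if not requested:
        log.debug("No case_to_run passed in config; skipping all scenarios")
        return []
    # Keep only known cases, in a fixed order.
    return [case for case in ALL_CASES if case in requested]
```

With this shape, each scenario can stay behind its own `if "caseN" in cases:` check, and an empty config runs nothing.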
| mon_obj.set_config(section="osd", name="osd_scrub_load_threshold", value=0) | ||
|
|
||
| # Truncate the osd logs | ||
| rados_object.remove_log_file_content(osd_nodes, daemon_type="osd") |
Add a new method to rotate the logs.
if enable_debug_logs:
    log.info("Forcing log rotation on OSD nodes...")
    try:
        # "ceph fsid" may come back as a dict or as a plain string
        fsid_result = rados_obj.run_ceph_command(cmd="ceph fsid")
        fsid = (
            fsid_result.get("fsid")
            if isinstance(fsid_result, dict)
            else fsid_result
        )
        if fsid:
            logrotate_cmd = f"logrotate -f /etc/logrotate.d/ceph-{fsid}"
            osd_hosts = rados_obj.get_osd_hosts()
            for host in osd_hosts:
                try:
                    host.exec_command(
                        cmd=logrotate_cmd,
                        sudo=True,
                        check_ec=False,
                        long_running=True,
                    )
                    time.sleep(10)
                    log.debug(f"Log rotation completed on {host.hostname}")
                except Exception as err:
                    # Keep going on other hosts, but record the failure
                    log.debug(f"Log rotation failed on {host.hostname}: {err}")
            log.info(f"Log rotation triggered on {len(osd_hosts)} OSD host(s)")
    except Exception as e:
        log.warning(f"Failed to rotate logs: {e}")
Created method in core_workflows.py
| Case 1: Threshold value 0 (scrub blocking) | ||
| - Sets osd_scrub_load_threshold to 0 | ||
| - Verifies scrub operations do not proceed | ||
| - Checks for log message: "scrub_load_below_threshold:.* = no" |
scrub_load_below_threshold should be reverted to its default after each case; add the same revert in the finally block as well.
Reverted the scrub_load_below_threshold parameter value in the finally block.
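One way the per-case revert could look, as a sketch only. The `set_config(section=..., name=..., value=...)` call mirrors the one in the diff; `remove_config` and the `FakeMon` stand-in are assumptions made so the example runs on its own.

```python
class FakeMon:
    """Stand-in for the monitor helper; stores overrides in a dict."""

    def __init__(self):
        self.config = {}

    def set_config(self, section, name, value):
        self.config[(section, name)] = value
        return True

    def remove_config(self, section, name):
        # Assumed helper: dropping the override restores the default.
        self.config.pop((section, name), None)
        return True


def run_case_with_threshold(mon_obj, value, case_fn):
    """Set osd_scrub_load_threshold, run the case, and always revert."""
    mon_obj.set_config(section="osd", name="osd_scrub_load_threshold", value=value)
    try:
        return case_fn()
    finally:
        # Runs on success and on failure, so no case leaks its setting.
        mon_obj.remove_config(section="osd", name="osd_scrub_load_threshold")
```

Removing the override rather than setting it back to a literal value avoids hard-coding the default in the test.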
| if not chk_auto_repair_param: | ||
| log.error( | ||
| "The repair parameters are not set. No executing the further tests" | ||
| configure_log_level(mon_obj, acting_pg_set, set_to_default=True) |
Enable debug logging at the beginning of the scenario itself.
remove_log_file_content(): also call log rotate before setting up debug logs.
Done.
| This test verifies the functionality of the osd_scrub_load_threshold parameter | ||
| which controls when scrub operations are allowed based on system load. | ||
| Test Execution Flow: |
Keep each case in a try/except block.
Done.
| daemon_type="osd", | ||
| daemon_id=acting_pg_set[0], | ||
| search_string=log_line_no_scrub, | ||
| if "case3" in case_to_run: |
We're not setting up debug logging in case3. Set it up, and then verify the log lines.
Added the debug logging and the log verification.
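For reference, a small sketch of the log check using the pattern quoted in the test description (`scrub_load_below_threshold:.* = no`). The helper name is hypothetical, and the sample lines in the usage note are invented for illustration; they are not copied from real OSD logs.

```python
import re

# Pattern taken from the test description for the "scrub blocked" case.
NO_SCRUB_PATTERN = re.compile(r"scrub_load_below_threshold:.* = no")


def has_blocked_scrub_line(log_lines):
    """Return True if any line reports that the load check rejected a scrub."""
    return any(NO_SCRUB_PATTERN.search(line) for line in log_lines)
```

For example, `has_blocked_scrub_line(["scrub_load_below_threshold: loadavg 16.1 >= max 10 = no"])` matches, while a line ending in `= yes` does not.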
| if initial_load is None: | ||
| initial_load = load_avg | ||
| if load_avg > max_load_observed: | ||
| max_load_observed = load_avg | ||
| if load_avg > initial_load * 1.5: # Load increased by at least 50% | ||
| load_increasing = True |
Please revise this logic.
At line 782 we initialize initial_load to load_avg, so the conditional check at line 785 will never be true and load_increasing will never be set to True.
Modified the logic.
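A sketch of one way to express the check so the baseline and the comparison cannot collapse into the same sample. It assumes the load readings are collected in order into a list; `load_increased` is a hypothetical name, and the 1.5 factor is the one from the diff.

```python
def load_increased(samples, factor=1.5):
    """Return True if any later sample exceeds the first sample by `factor`."""
    if not samples:
        return False
    # The first reading is the baseline; only later readings are compared.
    initial_load = samples[0]
    return any(load > initial_load * factor for load in samples[1:])
```

Note that a baseline of 0 makes any positive later reading count as an increase, which may or may not be what the test wants.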
| load_increasing = True | ||
|
|
||
| # Calculate load threshold = loadavg / number of online CPUs | ||
| load_threshold = load_avg / online_cpu_count |
Please change the terminology here and everywhere below:
load_threshold => load_avg_per_cpu
I felt this was fine because the variable calculates the scrub load threshold value. Please let me know if a change is required here.
| if not load_increasing and max_load_observed < initial_load * 1.2: | ||
| log.warning( | ||
| "Scenario 6: Load did not increase significantly. This may indicate stress-ng is " | ||
| "not working properly." | ||
| ) | ||
|
|
||
| if not load_threshold_achieved: | ||
| log.warning( | ||
| "Scenario 6: Load threshold did not exceed 15 within {} seconds. " | ||
| "Last calculated threshold: {:.3f}. Continuing with test.".format( | ||
| max_wait_time, load_threshold | ||
| ) |
Please revise the logical checks here and fail the test if necessary.
I modified the logic so that the test case is marked as failed if the load does not keep increasing for one minute.
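The one-minute rule can be read in more than one way. The sketch below takes one interpretation: the load is considered stalled if it fails to reach a new maximum for `window` seconds. The function name and the `(timestamp, load)` sample format are assumptions, not the PR's implementation.

```python
def load_stalled(samples, window=60.0):
    """Detect a stall in time-ordered (timestamp_seconds, load_avg) samples.

    Returns True if the load went `window` seconds without a new maximum.
    """
    if not samples:
        return False
    last_rise_time, peak = samples[0]
    for timestamp, load in samples[1:]:
        if load > peak:
            # New maximum: restart the stall timer.
            peak, last_rise_time = load, timestamp
        elif timestamp - last_rise_time >= window:
            return True
    return False
```

A caller could fail the test as soon as `load_stalled(...)` returns True instead of only logging a warning.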
| assert mon_obj.set_config( | ||
| section="osd", name="osd_scrub_load_threshold", value=10 | ||
| ), "Could not set osd_scrub_load_threshold value to 10" |
This should not be necessary, as we are checking the default value earlier. If we really want to ensure the default value is used, we should rm the config instead of setting it to 10.
Removed the code.
| wait_time = 300 | ||
| try: | ||
| if rados_object.start_check_deep_scrub_complete( | ||
| pg_id=pg_id, user_initiated=False, wait_time=wait_time | ||
| ): |
Please ensure the scrub parameters are in place so that a scheduled scrub starts within the given timeout of 300 seconds.
The minimum scrub interval is 2 minutes and the maximum is 15 minutes. I increased the wait time to 600 seconds.
| ) | ||
|
|
||
| # Stop stress-ng after Step 6 completion | ||
| log.info("Scenario 6: Stopping stress-ng process after Step 7 completion") |
Nitpick: step 7 -> step 6
Corrected the log information.
| primary_osd_host.exec_command( | ||
| sudo=True, cmd="pkill -f 'stress-ng.*cpu'", check_ec=False | ||
| ) | ||
| time.sleep(2) | ||
| log.info("Scenario 6: stress-ng process stopped") | ||
| except Exception as err: | ||
| log.warning("Scenario 6: Error stopping stress-ng: {}".format(err)) |
Can we please move the stress-ng kill code into a finally block?
In all scenarios, the stress-ng process is explicitly terminated after test execution, and as part of the cleanup phase, any remaining stress-ng processes are forcibly terminated across all OSD nodes.
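For comparison, a sketch of the finally-based cleanup the reviewer asked for, written as a context manager. The `pkill` command is the one from the diff; the start command, the `workers` parameter, and the injected `run` callable are assumptions so that the example does not depend on the test framework.

```python
from contextlib import contextmanager


@contextmanager
def cpu_stress(run, workers=4):
    """Start stress-ng via `run` and always kill it when the block exits."""
    # Assumed start command; the PR may launch stress-ng differently.
    run(f"nohup stress-ng --cpu {workers} > /dev/null 2>&1 &")
    try:
        yield
    finally:
        # Executed even if the scenario body raises.
        run("pkill -f 'stress-ng.*cpu'")
```

In the test, `run` could wrap `primary_osd_host.exec_command(sudo=True, cmd=..., check_ec=False)`, so the kill happens even when an assertion inside the scenario fails.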
e7b7bd6 to a6ba683
Logs are copied at https://ibm.box.com/s/21dmnw8ckk1pvmm3dftlyxqavpyh34tu
New changes are detected. LGTM label has been removed.
3180257 to 7cb990b
Signed-off-by: Rahul Lepakshi <[email protected]>
Signed-off-by: jprakash-redhat <[email protected]>
efb6d46 to 1d5b3b8
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: SrinivasaBharath
Abandoning for now due to history issues



Description
This PR includes:
Automation to verify osd_scrub_load_threshold under high CPU load using stress-ng.
a. Install the stress-ng package on all rados nodes.
b. Start stress-ng to generate load continuously.
c. Monitor the load continuously to verify that it increases, and calculate the load threshold. The formula used is:
load_threshold = load_average / online_cpu_count
where
load_average is read from /proc/loadavg
online_cpu_count is obtained by running the nproc command
d. Wait for the load threshold (loadavg / online CPUs) to exceed 15.
e. Set osd_scrub_load_threshold to 10 (default).
f. Set debug_osd and debug_mon to 20.
g. Start a user-initiated scrub and check that it does not start while the load is high.
h. Check the logs for scrub_load_below_threshold messages.
Divided the scenarios into separate tests.
Uncommented the verification of the osd_scrub_load_threshold parameter with the default value, which is 10.0.
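The per-CPU load formula from step c can be sketched as below. This assumes the 1-minute field of /proc/loadavg is the one used (the description does not say which) and borrows the `load_avg_per_cpu` name from the review discussion; the sample input string is made up.

```python
def parse_loadavg(text):
    """Return the 1-minute load average from /proc/loadavg content."""
    # /proc/loadavg starts with the 1-, 5- and 15-minute averages.
    return float(text.split()[0])


def load_avg_per_cpu(loadavg_text, online_cpu_count):
    """Compute load average divided by the number of online CPUs."""
    if online_cpu_count <= 0:
        raise ValueError("online_cpu_count must be positive")
    return parse_loadavg(loadavg_text) / online_cpu_count
```

As a worked example, a 1-minute load average of 64.00 on a 4-CPU node gives 64.00 / 4 = 16.0, which exceeds the value of 15 that step d waits for.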
Please include Automation development guidelines. Source of Test case - New Feature/Regression Test/Close loop of customer BZs