-
Notifications
You must be signed in to change notification settings - Fork 115
OCPBUGS-62702: AA: latency-e2e: skip tests on HT-disabled systems #1386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
6fbff19
to
ab26087
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching and addressing this.
I left a few comments.
/approve
test/e2e/performanceprofile/functests/5_latency_testing/latency_testing.go
Outdated
Show resolved
Hide resolved
test/e2e/performanceprofile/functests/5_latency_testing/latency_testing.go
Show resolved
Hide resolved
test/e2e/performanceprofile/functests/5_latency_testing/latency_testing.go
Outdated
Show resolved
Hide resolved
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: SargunNarula, shajmakh The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
ab26087
to
f77ebb7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the updates lgtm. regarding the commit (and PR title), I think it should highlight that this fix is derived from the fact that there might be different HT configurations: with HT enabled and without, rather than having the max latency missing or not.
/lgtm |
@SargunNarula: This pull request references CNF-18648 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.21." or "openshift-4.21.", but it targets "openshift-4.19" instead. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
LGTM |
/hold |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
conceptually OK, but questions about the implementation
workerNodes, err := nodes.GetByLabels(testutils.NodeSelectorLabels) | ||
if err != nil { | ||
return false, fmt.Errorf("get worker nodes: %w", err) | ||
} | ||
workerNode := &workerNodes[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to pick a random node which matches the labels? can't we just pick the node by name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By specifying index 0, we fix the node among those that have the appropriate labels. To ensures that if a performance profile has applied any kernel argument, such as nosmt, we can verify it through an actual runtime check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, but I still don't follow why we need to use `the node selector labels vs picking a specific node and checking that node
} | ||
cpuID := set.List()[0] | ||
|
||
isHTEnabled := nodes.IsHyperthreadingEnabled(ctx, cpuID, workerNode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd check the node settings (possibly /proc/cmdline
) or actually any random CPU. To put it differently, why the first isolated CPU is significant and why is it better than, say, cpu#0 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no particular significance in choosing the first isolated CPU. An ID was simply required to perform the check, so selected one from the isolated set. Do you suggest checking any random cpu ?
func IsHyperthreadingEnabled(ctx context.Context, cpuID int, node *corev1.Node) bool { | ||
smtLevel := GetSMTLevel(ctx, cpuID, node) | ||
return smtLevel > 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just inline GetSMTLevel
in the one and only calling site
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved, with latest commit.
f77ebb7
to
28fb228
Compare
28fb228
to
30c01e3
Compare
/retest |
The BZ 2094046 test cases for oslat and cyclictest were negative tests expecting to fail on HT-enabled systems, but they passed unexpectedly on HT-disabled systems because the tools executed successfully. Changes: - Add hyperthreading detection in its test execution path - Skip BZ 2094046 tests when HT is disabled to prevent false passes Signed-Off-by: Sargun Narula <[email protected]>
30c01e3
to
41ddbcf
Compare
@SargunNarula: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
LGTM @shajmakh I can now confirm the tests pass on both HT enabled and disabled environments. More specifically pass on HT enabled and gets skipped on HT-disabled ones. Note: Hyperthreading check was performed on a BM node considering more number of online CPUs needed as compared to VM node |
/verified by @SargunNarula |
@SargunNarula: This PR has been marked as verified by In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@SargunNarula: This pull request references Jira Issue OCPBUGS-62702, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/lgtm |
workerNodes, err := nodes.GetByLabels(testutils.NodeSelectorLabels) | ||
if err != nil { | ||
return false, fmt.Errorf("get worker nodes: %w", err) | ||
} | ||
workerNode := &workerNodes[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, but I still don't follow why we need to use `the node selector labels vs picking a specific node and checking that node
set, err := cpuset.Parse(string(*profile.Spec.CPU.Isolated)) | ||
if err != nil || set.Size() == 0 { | ||
return false, fmt.Errorf("failed to parse isolated CPUs from profile") | ||
} | ||
cpuID := set.List()[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't get why this code is better than just checking cpuID 0 (which is much simpler) or the kernel command line arguments (/proc/cmdline
)
This PR addresses an issue with the BZ 2094046 test cases for oslat and cyclictest.
These tests were originally negative tests, expecting to fail on Hyperthreading enabled systems. However, on HT-disabled systems, the tests executed successfully and passed unexpectedly, leading to false positives.
Changes in this PR:
Assisted-by: Cursor v1.24.2
AI Attribution: AIA HAb Ce Hin R Claude-4-sonnet v1.0