
SargunNarula
Contributor

@SargunNarula SargunNarula commented Aug 29, 2025

This PR addresses an issue with the BZ 2094046 test cases for oslat and cyclictest.

These tests were originally negative tests, expecting to fail on Hyperthreading enabled systems. However, on HT-disabled systems, the tests executed successfully and passed unexpectedly, leading to false positives.

Changes in this PR:

  • Added Hyperthreading detection in the test execution path.
  • Skip BZ 2094046 tests when HT is disabled, preventing false passes on systems without Hyperthreading.

Assisted-by: Cursor v1.24.2
AI Attribution: AIA HAb Ce Hin R Claude-4-sonnet v1.0
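
The detection-and-skip logic described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: it assumes the SMT level is derived from a CPU's `thread_siblings_list` value under `/sys/devices/system/cpu/cpuN/topology/`, and the helper names `smtLevel` and `shouldSkip` are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// smtLevel counts the CPUs listed in a thread_siblings_list value such as
// "0,32" or "0-1". A result greater than 1 means the CPU shares its core
// with at least one sibling thread, i.e. hyperthreading is active.
// Hypothetical helper: the real nodes.GetSMTLevel may work differently.
func smtLevel(siblings string) (int, error) {
	count := 0
	for _, part := range strings.Split(strings.TrimSpace(siblings), ",") {
		if part == "" {
			return 0, fmt.Errorf("empty entry in %q", siblings)
		}
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return 0, fmt.Errorf("parse %q: %w", part, err)
		}
		last := first
		if isRange {
			if last, err = strconv.Atoi(hi); err != nil {
				return 0, fmt.Errorf("parse %q: %w", part, err)
			}
		}
		if last < first {
			return 0, fmt.Errorf("descending range %q", part)
		}
		count += last - first + 1
	}
	return count, nil
}

// shouldSkip reports whether the HT-dependent negative tests should be
// skipped, which is the case when the CPU has no sibling threads.
func shouldSkip(siblings string) (bool, error) {
	level, err := smtLevel(siblings)
	if err != nil {
		return false, err
	}
	return level <= 1, nil
}

func main() {
	for _, s := range []string{"0,32", "0-1", "3"} {
		skip, err := shouldSkip(s)
		fmt.Println(s, skip, err)
	}
}
```

In a Ginkgo-based suite, a true result from a check like `shouldSkip` would typically lead to a `Skip(...)` call before the negative test runs, so that an HT-disabled node reports a skip rather than a pass.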

@openshift-ci openshift-ci bot requested review from jmencak and swatisehgal August 29, 2025 12:36
@SargunNarula SargunNarula force-pushed the latency_test branch 2 times, most recently from 6fbff19 to ab26087 Compare September 25, 2025 11:20
Contributor

@shajmakh shajmakh left a comment


Thanks for catching and addressing this.
I left a few comments.
/approve

Contributor

openshift-ci bot commented Sep 25, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SargunNarula, shajmakh
Once this PR has been reviewed and has the lgtm label, please assign jmencak for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@shajmakh shajmakh left a comment


The updates LGTM. Regarding the commit message (and PR title), I think it should highlight that this fix stems from the different possible HT configurations (HT enabled and HT disabled), rather than from the max latency value being missing.

@SargunNarula SargunNarula changed the title Fixed oslat & cyclictest failure due to missing max latency value AA: e2e: CNF:18648 Fix BZ 2094046 oslat & cyclictest HT tests to prevent false passes on HT-disabled systems Sep 26, 2025
@mrniranjan
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 26, 2025
@SargunNarula SargunNarula changed the title AA: e2e: CNF:18648 Fix BZ 2094046 oslat & cyclictest HT tests to prevent false passes on HT-disabled systems CNF-18648: AA: latency-e2e: skip tests on HT-disabled systems Sep 26, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 26, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Sep 26, 2025

@SargunNarula: This pull request references CNF-18648 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.21." or "openshift-4.21.", but it targets "openshift-4.19" instead.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@shajmakh
Contributor

LGTM
let's please confirm the tests pass on both configurations (HT enabled and disabled) before merging this

@shajmakh
Contributor

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 26, 2025
@SargunNarula
Contributor Author

/retest

Contributor

@ffromani ffromani left a comment


conceptually OK, but questions about the implementation

Comment on lines +236 to +240
workerNodes, err := nodes.GetByLabels(testutils.NodeSelectorLabels)
if err != nil {
	return false, fmt.Errorf("get worker nodes: %w", err)
}
workerNode := &workerNodes[0]
Contributor


why do we need to pick a random node which matches the labels? can't we just pick the node by name?

Contributor Author

@SargunNarula SargunNarula Sep 30, 2025


By specifying index 0, we fix the node among those that have the appropriate labels. This ensures that if a performance profile has applied any kernel argument, such as nosmt, we can verify it through an actual runtime check.

Contributor


OK, but I still don't follow why we need to use the node selector labels instead of picking a specific node and checking that node.

}
cpuID := set.List()[0]

isHTEnabled := nodes.IsHyperthreadingEnabled(ctx, cpuID, workerNode)
Contributor


I'd check the node settings (possibly /proc/cmdline) or actually any random CPU. To put it differently: why is the first isolated CPU significant, and why is it better than, say, cpu#0?
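
For comparison, the `/proc/cmdline` approach suggested here might look something like the sketch below. It assumes the command line has already been read from the node into a string, and the function name `smtDisabledByCmdline` is hypothetical. It only recognizes the `nosmt` and `nosmt=force` kernel parameters.

```go
package main

import (
	"fmt"
	"strings"
)

// smtDisabledByCmdline reports whether a kernel command line (the contents
// of /proc/cmdline) carries the nosmt parameter, either bare or as
// nosmt=force. Hypothetical helper, shown for illustration only.
func smtDisabledByCmdline(cmdline string) bool {
	for _, arg := range strings.Fields(cmdline) {
		if arg == "nosmt" || arg == "nosmt=force" {
			return true
		}
	}
	return false
}

func main() {
	// Example command lines; the values are made up for illustration.
	fmt.Println(smtDisabledByCmdline("BOOT_IMAGE=/vmlinuz root=/dev/sda1 nosmt"))
	fmt.Println(smtDisabledByCmdline("BOOT_IMAGE=/vmlinuz root=/dev/sda1"))
}
```

One trade-off of this design is that a missing `nosmt` argument does not prove hyperthreading is active, since SMT can also be turned off in firmware or be absent from the hardware. A runtime check of the CPU topology covers those cases as well.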

Contributor Author


There is no particular significance in choosing the first isolated CPU. An ID was simply required to perform the check, so I selected one from the isolated set. Do you suggest checking any random CPU?

Comment on lines 279 to 281
func IsHyperthreadingEnabled(ctx context.Context, cpuID int, node *corev1.Node) bool {
	smtLevel := GetSMTLevel(ctx, cpuID, node)
	return smtLevel > 1
Contributor


I'd just inline GetSMTLevel in the one and only calling site

Contributor Author


Resolved, with latest commit.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2025
@SargunNarula
Contributor Author

/retest

The BZ 2094046 test cases for oslat and cyclictest were negative tests
expecting to fail on HT-enabled systems, but they passed unexpectedly
on HT-disabled systems because the tools executed successfully.

Changes:
- Add hyperthreading detection in its test execution path
- Skip BZ 2094046 tests when HT is disabled to prevent false passes

Signed-Off-by: Sargun Narula <[email protected]>
Contributor

openshift-ci bot commented Oct 2, 2025

@SargunNarula: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/okd-scos-e2e-aws-ovn | 41ddbcf | link | false | /test okd-scos-e2e-aws-ovn |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@SargunNarula
Contributor Author

SargunNarula commented Oct 3, 2025

> LGTM
> let's please confirm the tests pass on both configurations (HT enabled and disabled) before merging this

@shajmakh I can now confirm the tests behave as expected in both HT-enabled and HT-disabled environments: they pass on HT-enabled systems and are skipped on HT-disabled ones.

Note: The hyperthreading check was performed on a bare-metal (BM) node, because it needs more online CPUs than a VM node provides.

@SargunNarula
Contributor Author

/verified by @SargunNarula

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Oct 3, 2025
@openshift-ci-robot
Contributor

@SargunNarula: This PR has been marked as verified by @SargunNarula.


@SargunNarula SargunNarula changed the title CNF-18648: AA: latency-e2e: skip tests on HT-disabled systems OCPBUGS-62702: AA: latency-e2e: skip tests on HT-disabled systems Oct 3, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Oct 3, 2025
@openshift-ci-robot
Contributor

@SargunNarula: This pull request references Jira Issue OCPBUGS-62702, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @mrniranjan

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci openshift-ci bot requested a review from mrniranjan October 3, 2025 11:38
@shajmakh
Contributor

shajmakh commented Oct 3, 2025

/lgtm
Thanks for the updates. I'll leave room for other reviewers in case they still have comments.
/hold

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 3, 2025

Comment on lines +243 to +247
set, err := cpuset.Parse(string(*profile.Spec.CPU.Isolated))
if err != nil || set.Size() == 0 {
	return false, fmt.Errorf("failed to parse isolated CPUs from profile")
}
cpuID := set.List()[0]
Contributor


I still don't get why this code is better than just checking cpuID 0 (which is much simpler) or the kernel command line arguments (/proc/cmdline)


Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria


5 participants