@ffromani ffromani commented Jun 26, 2025

Implement reserved CPU (aka infra + control plane) sizing using linear programming optimization (gonum/optimize).

The core idea is to model the constraints and let the optimization package compute the desired target.

These changes were AI-Assisted (hence the AA tag), then largely amended by a human (hence the HI tag - Human Intervention).

The initial penalty cost structure was suggested by Google Gemini 2.5 Flash, and then amended by human intervention.
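
For readers unfamiliar with penalty-based sizing, here is a minimal sketch of the general idea: describe each constraint as a cost term and pick the reserved CPU count with the lowest total cost. It is not the code in this PR and does not use gonum/optimize; it simply enumerates every candidate count. The type `sizingInput`, the functions `cost` and `reservedCPUs`, and all penalty weights are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// sizingInput holds hypothetical inputs for this sketch; none of these
// names come from the PR.
type sizingInput struct {
	TotalCPUs  int     // logical CPUs available on the node
	SMTLevel   int     // hardware threads per physical core
	DemandCPUs float64 // estimated infra + control plane load, in CPUs
}

// cost returns the penalty for reserving n CPUs; lower is better.
// The weights (100, 1, 5) are made up for illustration.
func cost(in sizingInput, n int) float64 {
	c := 0.0
	if short := in.DemandCPUs - float64(n); short > 0 {
		c += 100 * short // under-provisioning the control plane is expensive
	}
	c += float64(n) // every reserved CPU is taken away from workloads
	if in.SMTLevel > 1 && n%in.SMTLevel != 0 {
		c += 5 // a partially reserved core splits SMT siblings
	}
	return c
}

// reservedCPUs tries every count that leaves at least one CPU for
// workloads and returns the cheapest one.
func reservedCPUs(in sizingInput) int {
	best, bestCost := 1, cost(in, 1)
	for n := 2; n < in.TotalCPUs; n++ {
		if c := cost(in, n); c < bestCost {
			best, bestCost = n, c
		}
	}
	return best
}

func main() {
	in := sizingInput{TotalCPUs: 64, SMTLevel: 2, DemandCPUs: 4.5}
	fmt.Println(reservedCPUs(in)) // prints 6
}
```

With a demand of 4.5 CPUs and SMT level 2, reserving 5 CPUs covers the load but pays the misalignment penalty, so the sketch settles on 6. A real optimizer replaces the enumeration, but the cost structure plays the same role.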

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Jun 26, 2025
@openshift-ci openshift-ci bot requested review from Tal-or and jmencak June 26, 2025 13:07

openshift-ci bot commented Jun 26, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jun 26, 2025
@ffromani ffromani force-pushed the perfprof-creator-autosize-sched-ctrlplane branch 4 times, most recently from 6955fc6 to 99f8d42 Compare July 4, 2025 11:21

ffromani commented Jul 5, 2025

/test e2e-no-cluster

@ffromani ffromani force-pushed the perfprof-creator-autosize-sched-ctrlplane branch from 99f8d42 to 71242c5 Compare July 7, 2025 12:48
@ffromani ffromani force-pushed the perfprof-creator-autosize-sched-ctrlplane branch 2 times, most recently from be7eda2 to 6912796 Compare July 9, 2025 12:30

ffromani commented Jul 9, 2025

/cc @MarSik

@openshift-ci openshift-ci bot requested a review from MarSik July 9, 2025 12:31
@ffromani ffromani changed the title WIP: autosize control plane support for performance profile creator autosize control plane support for performance profile creator Jul 16, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Jul 16, 2025

MarSik commented Jul 16, 2025

This is an interesting approach. I wonder if we can express constraints like allocating whole L3 CCDs (or multiples of them) for reserved, or integrating the NIC queue count (no more than 16/32) and interrupt counts (224 per CPU).
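
As a rough illustration of how two of these constraints could be turned into numbers, the sketch below derives a lower bound on the reserved CPU count from an interrupt budget and then rounds it up to whole CCDs. All function names and the example figures (2000 interrupts, 8 CPUs per CCD) are hypothetical; only the 224-per-CPU limit comes from the comment above. The NIC queue cap is left out because the thread does not say how it should interact with the reserved count.

```go
package main

import "fmt"

// ceilDiv returns ceil(a/b) for a >= 0 and b > 0.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

// minForIRQs returns the fewest CPUs able to host irqs interrupts when
// each CPU can take at most perCPU of them.
func minForIRQs(irqs, perCPU int) int { return ceilDiv(irqs, perCPU) }

// roundUpToCCD rounds n up to a whole number of CCDs.
func roundUpToCCD(n, cpusPerCCD int) int {
	return ceilDiv(n, cpusPerCCD) * cpusPerCCD
}

// reservedFloor takes the larger of the load-driven demand and the
// interrupt-driven minimum, then aligns the result to CCD boundaries.
func reservedFloor(demand, irqs, irqsPerCPU, cpusPerCCD int) int {
	n := demand
	if m := minForIRQs(irqs, irqsPerCPU); m > n {
		n = m
	}
	return roundUpToCCD(n, cpusPerCCD)
}

func main() {
	// 2000 interrupts at 224 per CPU need 9 CPUs; with 8 CPUs per CCD
	// that rounds up to two whole CCDs.
	fmt.Println(reservedFloor(4, 2000, 224, 8)) // prints 16
}
```

In an optimization model these would more likely appear as constraints or penalty terms than as a closed-form calculation, but the arithmetic behind them is the same.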

@ffromani
Contributor Author

> This is an interesting approach. I wonder if we can express constraints like allocating whole L3 CCDs (or multiples of them) for reserved, or integrating the NIC queue count (no more than 16/32) and interrupt counts (224 per CPU).

We can introduce a quantitative constraint, like I did for SMT, so that we allocate in a way that minimizes the LLC count, for example. The problem, however, is that we only have the quantitative axis, because the kubelet owns the exact placement. We can say that, overall, 7 CPUs is better than 9 because everything is aligned and we can then make the best use of compute resources (completely made-up numbers, I hope it is clear enough).
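
To make the "7 is better than 9" intuition concrete, here is a small sketch of a purely quantitative alignment cost. It only looks at how many CPUs are reserved, not which ones, matching the limitation described above. The functions `llcSpan` and `alignmentCost`, the weights, and the assumption of 8 CPUs per LLC are all hypothetical.

```go
package main

import "fmt"

// llcSpan returns the minimum number of LLC domains that n CPUs must
// touch when every domain holds cpusPerLLC CPUs.
func llcSpan(n, cpusPerLLC int) int {
	return (n + cpusPerLLC - 1) / cpusPerLLC
}

// alignmentCost penalizes each LLC domain touched and, more lightly,
// a count that cannot be made of whole SMT cores. Weights are made up.
func alignmentCost(n, smtLevel, cpusPerLLC int) int {
	c := 10 * llcSpan(n, cpusPerLLC)
	if smtLevel > 1 && n%smtLevel != 0 {
		c += 3
	}
	return c
}

func main() {
	for _, n := range []int{7, 8, 9} {
		fmt.Println(n, alignmentCost(n, 2, 8))
	}
}
```

Under these assumptions 7 CPUs fit in a single LLC domain while 9 spill into a second one, so 7 scores lower than 9, and 8 scores lowest because it is also SMT-aligned.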


// Assumptions:
// 1. All the machines in the node pool have identical HW specs and need identical sizing.
// 2. We cannot distinguyish betwee infra/OS CPU requirements and control plane CPU requirement.
Contributor Author

typo: distinguish


MarSik commented Jul 16, 2025

@ffromani Well, the PPC can (and should) select the specific CPUs for the reserved/isolated split based on the capacity computation. It has all the hardware topology information it needs to do that. Or are we talking about different aspects of PPC here?
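
A minimal sketch of what "select the specific CPUs from the computed capacity" could look like, assuming the topology is available as a flat list of logical CPUs with their core IDs. This is not PPC's actual allocation logic; the `cpu` type and `pickReserved` function are hypothetical, and the sketch only keeps SMT siblings together.

```go
package main

import (
	"fmt"
	"sort"
)

// cpu is a hypothetical, simplified view of one logical CPU.
type cpu struct {
	ID   int // logical CPU ID
	Core int // physical core the CPU belongs to
}

// pickReserved returns the IDs of n CPUs, taking all SMT siblings of a
// core before moving on to the next core.
func pickReserved(topology []cpu, n int) []int {
	sorted := append([]cpu(nil), topology...) // do not mutate the input
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].Core != sorted[j].Core {
			return sorted[i].Core < sorted[j].Core
		}
		return sorted[i].ID < sorted[j].ID
	})
	if n < 0 {
		n = 0
	}
	if n > len(sorted) {
		n = len(sorted)
	}
	ids := make([]int, 0, n)
	for _, c := range sorted[:n] {
		ids = append(ids, c.ID)
	}
	sort.Ints(ids)
	return ids
}

func main() {
	// 4 cores with 2 threads each; CPU i and CPU i+4 share a core.
	topo := []cpu{{0, 0}, {1, 1}, {2, 2}, {3, 3}, {4, 0}, {5, 1}, {6, 2}, {7, 3}}
	fmt.Println(pickReserved(topo, 4)) // prints [0 1 4 5]
}
```

Extending the sort key with LLC or NUMA identifiers would let the same approach fill whole cache domains first.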

ffromani added 6 commits July 16, 2025 19:18
Implement reserved CPU (aka infra + control plane) sizing using
linear programming optimization (gonum/optimize).

The core idea is to model the constraints and let the optimization
package compute the desired target.

These changes were AI-Assisted (hence the AA tag),
then largely amended by a human (hence the HI tag - Human Intervention).

The initial penalty cost structure was suggested by Google Gemini 2.5 Flash,
and then amended by human intervention.

Assisted-by: Google Gemini
Assisted-by-model: gemini-2.5-flash
Signed-off-by: Francesco Romani <[email protected]>
TODO explain why

Signed-off-by: Francesco Romani <[email protected]>
consider the real SMT Level when doing autosize computations.

Signed-off-by: Francesco Romani <[email protected]>
Signed-off-by: Francesco Romani <[email protected]>
Signed-off-by: Francesco Romani <[email protected]>
Signed-off-by: Francesco Romani <[email protected]>
@ffromani ffromani force-pushed the perfprof-creator-autosize-sched-ctrlplane branch from 6912796 to 6005e98 Compare July 16, 2025 17:32
Copy link
Contributor

openshift-ci bot commented Jul 16, 2025

@ffromani: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/lint | 6005e98 | link | true | /test lint |
| ci/prow/e2e-aws-ovn-techpreview | 6005e98 | link | true | /test e2e-aws-ovn-techpreview |
| ci/prow/e2e-aws-ovn | 6005e98 | link | true | /test e2e-aws-ovn |
| ci/prow/okd-scos-e2e-aws-ovn | 6005e98 | link | false | /test okd-scos-e2e-aws-ovn |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ffromani
Contributor Author

> @ffromani Well, the PPC can (and should) select the specific CPUs for the reserved/isolated split based on the capacity computation. It has all the hardware topology information it needs to do that. Or are we talking about different aspects of PPC here?

Uhm, we can achieve that by rethinking the whole core allocation stage. I added an add-on step to demo the autosizing, but we would be looking at a possible full rewrite of the allocation logic, embedding some form of optimization, which is a much bigger endeavour. Do we want to go in this direction?
