autosize control plane support for performance profile creator #1349
base: main
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ffromani
6955fc6 to 99f8d42
/test e2e-no-cluster
99f8d42 to 71242c5
be7eda2 to 6912796
/cc @MarSik
This is an interesting approach. I wonder if we can express constraints such as allocating whole L3 CCDs (or multiples of them) for reserved, or integrating the NIC queue count (no more than 16/32) and interrupt counts (224 per CPU).
We can introduce a quantitative limitation like I did for SMT, so that we allocate in a way that minimizes the LLC count, for example. The problem, however, is that we only have the quantitative axis, because kubelet owns the exact placement. We can say that, overall, 7 CPUs is better than 9 because everything is aligned, and then we can make the best use of compute resources (completely made-up numbers; I hope it is clear enough).
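To make the "quantitative axis" point concrete, here is a minimal Go sketch of a count-only LLC term. It is not code from this PR: `llcSpan`, `llcPenalty`, the weight, and the 8-CPUs-per-LLC figure are all hypothetical, and it assumes every LLC domain has the same number of CPUs. It only reasons about how many CPUs are reserved, never which ones, since placement stays with kubelet.

```go
package main

import "fmt"

// llcSpan returns the minimum number of LLC domains that `reserved` CPUs
// must touch, assuming every domain holds cpusPerLLC CPUs (hypothetical model).
func llcSpan(reserved, cpusPerLLC int) int {
	if reserved <= 0 || cpusPerLLC <= 0 {
		return 0
	}
	// Integer ceiling of reserved / cpusPerLLC.
	return (reserved + cpusPerLLC - 1) / cpusPerLLC
}

// llcPenalty turns the span into a cost term: fewer LLC domains is cheaper.
// The weight is an illustrative knob, not a value taken from the PR.
func llcPenalty(reserved, cpusPerLLC int, weight float64) float64 {
	return weight * float64(llcSpan(reserved, cpusPerLLC))
}

func main() {
	const cpusPerLLC = 8 // made-up topology
	for _, reserved := range []int{7, 9} {
		fmt.Printf("reserved=%d span=%d penalty=%.1f\n",
			reserved, llcSpan(reserved, cpusPerLLC), llcPenalty(reserved, cpusPerLLC, 10))
	}
}
```

With these made-up numbers, 7 reserved CPUs fit in one LLC domain while 9 spill into a second one, so a term like this would favor 7 over 9.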
// Assumptions:
// 1. All the machines in the node pool have identical HW specs and need identical sizing.
// 2. We cannot distinguyish betwee infra/OS CPU requirements and control plane CPU requirement.
typo: distinguish
@ffromani Well, the PPC can (and should) select the specific CPUs for the reserved/isolated split based on the capacity computation. It has all the hardware topology information needed to do that. Or are we talking about different aspects of PPC here?
Implement reserved CPU (aka infra + control plane) sizing using linear programming optimization (gonum/optimize).
The core idea is to model the constraints and let the optimization package compute the desired target.
These changes were AI-Assisted (hence the AA tag), then largely amended by a human (hence the HI tag - Human Intervention).
The initial penalty cost structure was suggested by Google Gemini 2.5 Flash, and then amended by human intervention.
Assisted-by: Google Gemini
Assisted-by-model: gemini-2.5-flash
Signed-off-by: Francesco Romani <[email protected]>

TODO explain why
Signed-off-by: Francesco Romani <[email protected]>

Consider the real SMT level when doing autosize computations.
Signed-off-by: Francesco Romani <[email protected]>
Signed-off-by: Francesco Romani <[email protected]>
Signed-off-by: Francesco Romani <[email protected]>
Signed-off-by: Francesco Romani <[email protected]>
6912796 to 6005e98
@ffromani: The following tests failed, say
Uhm, we can achieve that by rethinking the whole core allocation stage. I added an add-on step to demo the autosizing, but we are looking at a possible full rewrite of the allocation logic, embedding some form of optimization; this is a much bigger endeavour. Do we want to go in this direction?
Implement reserved CPU (aka infra + control plane) sizing using linear programming optimization (gonum/optimize).
The core idea is to model the constraints and let the optimization package compute the desired target.
These changes were AI-Assisted (hence the AA tag), then largely amended by a human (hence the HI tag - Human Intervention).
The initial penalty cost structure was suggested by Google Gemini 2.5 Flash, and then amended by human intervention.
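For readers unfamiliar with the penalty-cost idea, the following is a rough, stdlib-only Go sketch of it. It is not the code in this PR and does not use gonum/optimize: it swaps in an exhaustive search over integer candidates, which is feasible because the reserved CPU count is small. The struct, the function names, the demand estimate, and all penalty weights are hypothetical.

```go
package main

import "fmt"

// sizingParams holds hypothetical inputs; none of these names come from the PR.
type sizingParams struct {
	TotalCPUs  int     // logical CPUs on the node
	DemandCPUs float64 // estimated infra + control plane demand, in CPUs
	SMTLevel   int     // hardware threads per physical core
}

// cost scores a candidate reserved CPU count; lower is better.
// All weights below are illustrative assumptions.
func cost(p sizingParams, reserved int) float64 {
	c := 0.0
	// Heavy quadratic penalty for reserving less than the estimated demand.
	if short := p.DemandCPUs - float64(reserved); short > 0 {
		c += 100 * short * short
	}
	// Mild linear penalty: every reserved CPU is taken away from workloads.
	c += float64(reserved)
	// Penalty when the count cannot be made of whole physical cores.
	if p.SMTLevel > 1 && reserved%p.SMTLevel != 0 {
		c += 5
	}
	return c
}

// autosizeReserved returns the reserved CPU count with the lowest cost,
// always leaving at least one CPU for workloads when the node has two or more.
func autosizeReserved(p sizingParams) int {
	best, bestCost := 1, cost(p, 1)
	for r := 2; r < p.TotalCPUs; r++ {
		if c := cost(p, r); c < bestCost {
			best, bestCost = r, c
		}
	}
	return best
}

func main() {
	p := sizingParams{TotalCPUs: 64, DemandCPUs: 6.5, SMTLevel: 2}
	fmt.Println(autosizeReserved(p)) // prints 8
}
```

With a made-up demand of 6.5 CPUs and SMT level 2, reserving 6 underprovisions, 7 covers the demand but splits a physical core, and 8 is the cheapest aligned choice. The same cost function could in principle be handed to a numerical optimizer instead of a loop.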