Adjust the platform/manifests charter according to the last 5 years and the state of 2025 #837

Open · wants to merge 8 commits into master

@juliusvonkohout (Member) commented Mar 18, 2025

@kimwnasptd for lgtm, since these changes have usually and historically been made by the Manifests WG leads (@kimwnasptd and @juliusvonkohout)

CC @kubeflow/kubeflow-steering-committee

#832 (comment) and #834 (comment) also provide a lot of context and argumentation.

"

WG Manifests Charter

This charter adheres to the conventions, roles and organization management
outlined in [wg-governance].

Scope

We simply (and automatically) synchronize the application and dependency manifests, and then carefully combine (configure) them to provide a full platform experience.
Providing a consistent and tested end-to-end multi-tenant experience is the most important task of the platform/manifests WG.
To achieve this, we maintain an extensive testing suite that covers most basic scenarios users would expect from a platform for ML orchestration.
We also provide documentation covering, but not limited to, installation, extension, security and architecture, to enable users to run their own ML platform on Kubernetes.
Users may choose to derive from platform/manifests to create so-called distributions, which are opinionated to satisfy individual requirements.
Users may also choose to install individual components without the benefits of the platform.

In scope

Code, Binaries and Services

  • Enable users / distributions to install, extend and maintain Kubeflow as an end-to-end multi-tenant platform for multiple users
  • This includes dependencies, security efforts and exemplary integration with popular tools and frameworks.
  • Users can also install individual components without the benefits of the platform, but in that case they could just as well fetch them directly from the WG releases.
  • Synchronize the manifests between working groups and use integration tests to make sure that the components work together end-to-end as a multi-tenant platform
  • Publish tested releases of the Kubeflow platform for downstream consumption
  • We try to be compatible with the popular Kubernetes clusters (Kind, Rancher, AKS, EKS, GKE, ...)
  • We provide hints and experimental examples of how a user / distribution could integrate non-default external authentication (e.g. a company's Identity Provider) and popular non-default services on their own
  • We generally document the installation of Kubeflow as a platform and / or as individual components, including common problems and architectural overviews.
  • An evolving, non-exhaustive list of dependencies for a proper multi-tenant platform installation: Istio, Knative, Dex, Oauth2-proxy, Cert-Manager, ...
  • An evolving, non-exhaustive list of applications: KFP, Trainer, Dashboard, Workspaces / Notebooks, KServe, Spark, ...

Cross-cutting and Externally Facing Processes

With Application Owners

  • Aid the application owner in creating manifests (Helm, Kustomize) for his application
  • Aid the application owner regarding security best practices
  • Communicate with the application owner regarding releases and versioning

With Users / Distribution Owners

  • Distributions are opinionated derivatives of Kubeflow platform/manifests, for example replacing all databases with closed-source managed databases from AWS, GKE, Azure, ...
  • A distribution can be created by any number of users / companies, privately or publicly, by deriving from Kubeflow platform/manifests (see the definition above)
  • Coordinate with "distribution owners" / users to take part in the testing of Kubeflow releases.

Out of scope

  • We do not support a specific deployment tool (e.g., ArgoCD, Flux)
  • The default installation shall not contain deep integration with external cloud services or closed-source solutions; instead, we aim for Kubernetes-native solutions and light authentication and authorization integration with external IdPs
    ...
    "

This diagram is from 2023 and might be outdated (oauth2-proxy, model-registry, Spark etc. are missing), but it captures many essentials that are also mandatory parts of the kubeflow/notebooks and kubeflow/dashboard repositories. The grey box is just one user; imagine 2 to 2000 users (Statistics Canada recently reported around 1000 users). For more details, I can provide hours of KubeCon architecture presentations with slides.
[Image: Kubeflow platform architecture diagram (2023); the grey box represents a single user]

Signed-off-by: Julius von Kohout <[email protected]>
@juliusvonkohout self-assigned this Mar 18, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from juliusvonkohout. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Julius von Kohout <[email protected]>

@andreyvelich (Member) left a comment

Thank you for working on this @juliusvonkohout.
I will take a look at these changes later this week.
/hold for WGs to review
cc @kubeflow/wg-training-leads @kubeflow/wg-notebooks-leads @kubeflow/wg-manifests-leads @kubeflow/wg-pipeline-leads @kubeflow/wg-data-leads @kubeflow/wg-automl-leads @kubeflow/kubeflow-steering-committee @kubeflow/wg-deployment-leads @kubeflow/release-team

@juliusvonkohout (Member Author) commented Mar 18, 2025

It is also interesting that part of this is even in the CNCF graduation criteria, see #834 (comment)

"
3. "The release notes MUST identify every publicly known run-time vulnerability fixed in this release that already had a CVE assignment or similar when the release was created.", "The project MUST publish the process for reporting vulnerabilities on the project site.", "If private vulnerability reports are supported, the project MUST include how to send the information in a way that is kept private.", "The project's initial response time for any vulnerability report received in the last 6 months MUST be less than or equal to 14 days." This is something I need to work on. Right now I am handling it at the platform level, which indirectly benefits all working groups (https://github.com/kubeflow/manifests?tab=readme-ov-file#cve-scanning), and we need to add an email address.
5. "The project MUST use at least one automated test suite that is publicly released as FLOSS (this test suite may be maintained as a separate FLOSS project). The project MUST clearly show or document how to run the test suite(s) (e.g., via a continuous integration (CI) script or via documentation in files such as BUILD.md, README.md, or CONTRIBUTING.md).", "The project MUST have a general policy (formal or not) that as major new functionality is added to the software produced by the project, tests of that functionality should be added to an automated test suite.", "The project MUST have evidence that the [test_policy](https://www.bestpractices.dev/en/criteria#test_policy) for adding tests has been adhered to in the most recent major changes to the software produced by the project." I think that is covered by the platform integration tests in kubeflow/manifests, which are continuously improved by me and others (GSoC).
6. "The project MUST have at least one primary developer who knows how to design secure software. (See ‘details’ for the exact requirements.) [[know_secure_design](https://www.bestpractices.dev/en/criteria#0.know_secure_design)]", "At least one of the project's primary developers MUST know of common kinds of errors that lead to vulnerabilities in this kind of software, as well as at least one method to counter or mitigate each of them." I think I am pushing quite well for that :-D. I push for container security and hard multi-tenancy, and I have written information security concepts and security blog posts, and done penetration tests for Kubeflow.
7. "The project MUST use a delivery mechanism that counters MITM attacks" is why the platform has Cert-manager, Istio and mTLS as dependencies, plus NetworkPolicies.
8. "There MUST be no unpatched vulnerabilities of medium or higher severity that have been publicly known for more than 60 days. Projects SHOULD fix all critical vulnerabilities rapidly after they are reported." That is also what I am working on in multiple WGs (Platform, Pipelines, Katib, Trainer), and during the last GSoC I added CVE scanning at the platform level, so it covers all working groups: https://github.com/kubeflow/manifests/actions/workflows/trivy.yaml
9. "At least one static code analysis tool (beyond compiler warnings and "safe" language modes) MUST be applied to any proposed major production release of the software before its release, if there is at least one FLOSS tool that implements this criterion in the selected language.", "All medium and higher severity exploitable vulnerabilities discovered with static code analysis MUST be fixed in a timely way after they are confirmed." This is something we need to improve, but at least formatting / linting is in place at the platform level: https://github.com/kubeflow/manifests/actions/workflows/linting_bash_python_yaml_files.yaml
10. I think a lot of this also supports the updated charter #837
"

@juliusvonkohout (Member Author)

#832 (comment) also provides a lot of context and argumentation.

@thesuperzapper (Member) left a comment

I have left some comments.

However I am quite concerned that we don't explicitly define what the actual code products of this new working group are. The scope has expanded over the years due to a lack of clarity on this issue.

If we're going to rewrite the charter, we should consider scoping it very explicitly to only aggregating the application manifests (as originally intended).

All other tasks listed might be delegated to other groups, or even downstream distributions, as this ensures the community is more focused on creating the AI/ML tools, which actually make up Kubeflow.

- Enable users / distributions to install, extend and maintain Kubeflow as a multi-tenant platform for multiple users
- This includes dependencies, security efforts and exemplary integration with popular tools and frameworks.
- Synchronize the manifests (Helm, Kustomize) between working groups
- We try to be compatible with the popular Kubernetes clusters (Kind, Rancher, AKS, EKS, GKE, ...)
Member

Why is this necessary for the manifest working group?

We intentionally excluded this goal from the original Manifests WG charter to prevent unnecessary focus on vendor-specific issues.

@juliusvonkohout (Member Author) Mar 18, 2025

In practice, users run different Kubernetes layers (Kind, Rancher, AKS, EKS, GKE, ...), but this only covers Kubernetes itself, not AWS managed databases or similar services. We definitely try to be compatible with the most popular ones, although we cannot guarantee it. Right now it works for me on Kind, Rancher, AKS, EKS and GKE, and this is also what most users expect. So it is a "soft goal" we pursue for our users, but we do not guarantee it.

In the end, this is done by volunteers, and this is what we want to work on; this is where we see the value in contributing to Kubeflow. If someone else wants to focus on something else, they are free to do whatever is sustainable and valuable for them. No one is forced to work on that.

Member

I thought the kubeflow/manifests were only meant to be a "minimum viable deployment" for testing purposes on Kind clusters?

Should we say that instead?

Comment on lines 27 to 28
- Distributions are strongly opinionated derivatives of Kubeflow platform/manifests, for example replacing all databases with closed source managed databases from AWS, GKE, Azure, ...
- A distribution can be created by an arbitrary amount of users / companies in private or in public by deriving from Kubeflow platform/manifests, see the definition above
Member

This is not the current definition of a distribution, there is no requirement for it to be "strongly opinionated".

If we are to include a definition here, we must agree as a community.

It might be better to just define it somewhere else rather than in this charter.

@juliusvonkohout (Member Author) Mar 18, 2025

Well, I can remove the word "strongly", but I think it is a good definition and it belongs right here.

Member

Regardless of where the term "Kubeflow Distribution" is defined, changing it will require a wider community discussion, because we already have a definition in the original KEP.

https://github.com/kubeflow/community/blob/master/proposals/434-kubeflow-distribution/README.md


### With Application Owners

- Aid the application owner in creating manifests (Helm, Kustomize) for his application
Member

I'm not sure that requiring the manifests WG to support the upstream manifests is sustainable.

But obviously, it is something that the individuals who are participating might also choose to do if they are so inclined.

@juliusvonkohout (Member Author) Mar 18, 2025

You always have to keep in mind that we are volunteers. All of this is best-effort; we try. Sometimes other working groups need help understanding, for example, the securityContext of a pod, since they focus more on the source code. Or we help them fix the Kustomize 5 warnings.

Member

I get that many contributors are volunteers, but either way, the WG charters are governance documents.

It's important for the health of the WG (and project) that we set reasonable expectations for the working group members.

I am not sure it's sustainable to include the expectation of upstream manifest maintenance, this is why the original charter focused only on "aggregating manifests".

Member Author

Over the last five years it was very sustainable and we made great progress.

Comment on lines 13 to 14
- The default installation shall not contain deep integration with external cloud services or closed source solutions, instead we aim for Kubernetes-native solutions and light authentication and authorization integration with external IDPs
- We provide hints and experimental examples how a user / distribution could integrate non-default external authentication (e.g. companies Identity Provider) and popular non-default services on his own
Member

I'm trying to understand how this relates to the core goal of the Manifests WG, which is to enable the creation of distributions, as a lot of these things seem rather distribution-specific.

@juliusvonkohout (Member Author) Mar 18, 2025

Based on "A distribution can be created by an arbitrary amount of users / companies in private or in public by deriving from Kubeflow platform/manifests, see the definition above", it covers the full user base. It can be a private one-person distribution or a large enterprise in-house distribution. It can even be a public distribution or one that is for sale. There are probably several thousand.

So we want to restrict it via "- The default installation shall not contain deep integration with external cloud services or closed source solutions, instead we aim for Kubernetes-native solutions and light authentication and authorization integration with external IDPs"

But we do not want to block people from exchanging non-default ideas and examples for basic needs: "- We provide hints and experimental examples how a user / distribution could integrate non-default external authentication (e.g. companies Identity Provider) and popular non-default services on his own". This does not mean that we support or help with such examples; it just means: "Hey, you can connect Kubeflow to your IdP with low effort. Here is the Dex and OAuth2 architecture and documentation; feel free to try it out on your own, or check out our Get Support page if you need help."

This is done by volunteers, and this is what we want to work on; this is where we see the value in contributing to Kubeflow. If someone else wants to focus on something else, they are free to do whatever is sustainable and valuable for them.

As an opinionated distribution derived from kubeflow/manifests, you can cater exactly to your user base and make such company-specific things supported and enabled by default, but that is not what we are doing here.

@juliusvonkohout (Member Author) commented Mar 18, 2025

"
I have left some comments.

However I am quite concerned that we don't explicitly define what the actual code products of this new working group are. The scope has expanded over the years due to a lack of clarity on this issue.

If we're going to rewrite the charter, we should consider scoping it very explicitly to only aggregating the application manifests (as originally intended).

All other tasks listed might be delegated to other groups, or even downstream distributions, as this ensures the community is more focused on creating the AI/ML tools, which actually make up Kubeflow.
"

"If we're going to rewrite the charter, we should consider scoping it very explicitly to only aggregating the application manifests (as originally intended)." That alone would be rather useless and could be done by a robot.

Right now, and for at least the last 5 years, our focus has been 95+ % platform, while the manifests make up less than 5 % of the work. Just copying manifests can be done by a robot and does not need its own WG; this is not what this WG is about. As a WG, we have focused on the platform for 5 years, and that is what we want to work on. This is where we see the value in contributing to Kubeflow. If someone else wants to focus on something else, they are free to do whatever is sustainable and valuable for them.
Everybody is free to use the individual components without the platform layer if they do not like it.

In the end, this is done by volunteers, and this is what we want to work on; this is where we see the value in contributing to Kubeflow. If someone else wants to focus on something else, they are free to do whatever is sustainable and valuable for them. No one is forced to work on that, and no one should be blocked from working on it.

Signed-off-by: Julius von Kohout <[email protected]>
@juliusvonkohout (Member Author)

Alright, I updated it to address the comments and to follow the old structure.

I also hope I have clarified what is owned and maintained by the WG.

"In that case, do we need to completely remove concepts of "Kubeflow Distribution"?" We could, but then users would just find a new word for derivatives of platform/manifests. I think defining it as a derivative fits quite well with how the term distribution is commonly used, also outside of Kubeflow.

@thesuperzapper (Member) left a comment

I have left some more comments.

as Subproject Owners, Tech Leads and Chairs. This is done to ensure we have a
simple enough model to start that people can understand and get used to. So for
the Manifests WG we only have Manifests WG Leads, which are the root approvers.

The following sections will aim to define the requirements for someone to become
a reviewer and an approver in the root OWNERS file (Manifests WG Lead).

### Manifests WG Lead Requirements
### Platform/Manifests WG Lead Requirements
Member

Let's keep the name of the WG the same, as discussed above.

1. Being involved with the release team, since the [release process](https://github.com/kubeflow/community/tree/master/releases) is tightly intertwined with the manifests/platform repository
2. Testing methodologies (GitHub Actions)
3. Processes regarding the [experimental](https://github.com/kubeflow/manifests/blob/master/experimental) components
4. [Platform manifests](https://github.com/kubeflow/manifests/tree/master/common) maintained irectly by Manifests/Platform WG (Istio, Knative, Cert Manager etc.)
Member

There is a typo, and also let's keep the name as Manifests WG.

5. Community and health of the project

Root approvers, or Manifests WG Leads, are expected to have expertise and be able
Root approvers, or Manifests/Platform WG Leads, are expected to have expertise and be able
Member

Let's keep the name as Manifests WG.

Comment on lines +9 to +14
We simply (automatically) synchronize the application and dependencies manifests to then elaborately combine (configure)them for full platform experience.
Providing a consistent and tested end-to-end multi-tenant experience is the most important task of the platform/manifests WG.
To achieve this we maintain an extensive testing suite that covers most basic scenarios users would expect from a Platform for ML orchestration.
We also provide the documentation regarding, but not limited to installation, extension, security and architecture to enable users to run their own ML Platform on Kubernetes.
Users may choose to derive from platform/manifests to create so called distributions, which are opinionated to satisfy individual requirements.
Users may also choose to install individual components without the benefits of the platform.
Member

Can we format this as a list of bullet points, and combine the ones that express the same idea?

This makes it easier to have discussions about each specific element of the scope, as some new elements are being proposed.

- This includes dependencies, security efforts and exemplary integration with popular tools and frameworks.
- Users can also install individual components without the benefits of the platform, but then they could also just directly fetch them from the WG releases.
- Synchronize the manifests between working groups and make sure via integration tests that the components work end-to-end together as multi-tenant platform
- Release tested releases of the Kubeflow platform for downstream consumption
Member

We need to clarify what is meant by "Kubeflow Platform", because this is not defined, or just not use that term.

Comment on lines +40 to +41
- Distributions are opinionated derivatives of Kubeflow platform/manifests, for example replacing all databases with closed source managed databases from AWS, GKE, Azure, ...
- A distribution can be created by an arbitrary amount of users / companies in private or in public by deriving from Kubeflow platform/manifests, see the definition above
Member

If we are going to define "distribution" here, let's be as generic as possible:

  • distributions are all downstream derivatives of the kubeflow manifests which are not maintained by the kubeflow community

We could also define it, and other terms at the top of the document.

- Maintain a catalog that will allow users to install Kubeflow apps and
common services easily on Kubernetes, either on the cloud or on-prem, without
depending on external cloud services or closed source solutions. Those
manifests are deployed using `kubectl` and `kustomize` and include:
Member

We also need to ensure that kustomize/kubectl remains the primary "scope" of the manifests repo, as this is what all existing distributions/users are based on.

Also, are we allowing other deployment tools beyond kubectl and kustomize (e.g. helm, argo cd, flux cd), because this is a big scope change if so?

Comment on lines +20 to +29
- Enable users / distributions to install, extend and maintain Kubeflow as a end-to-end multi-tenant platform for multiple users
- This includes dependencies, security efforts and exemplary integration with popular tools and frameworks.
- Users can also install individual components without the benefits of the platform, but then they could also just directly fetch them from the WG releases.
- Synchronize the manifests between working groups and make sure via integration tests that the components work end-to-end together as multi-tenant platform
- Release tested releases of the Kubeflow platform for downstream consumption
- We try to be compatible with the popular Kubernetes clusters (Kind, Rancher, AKS, EKS, GKE, ...)
- We provide hints and experimental examples how a user / distribution could integrate non-default external authentication (e.g. companies Identity Provider) and popular non-default services on his own
- We in general document the installation of Kubeflow as a platform and / or individual components including common problems and architectural overviews.
- There is the evolving and not exhaustive list of dependencies for a proper multi-tenant platform installation: Istio, KNative, Dex, Oauth2-proxy, Cert-Manager, ...
- There is the evolving and not exhaustive list of applications: KFP, Trainer, Dashboard, Workspaces / Noteboks, Kserve, Spark, ...
Member

Let's try to list these as a specific list of "responsibilities" (like the current ones).

Words like "enable", "hints", and "experimental examples" are not very clear.

- Decide which applications to include in Kubeflow.
- Decide which variant of an application to include (e.g., KFP Standalone vs
KFP with Istio).
- Create and maintain one or more Kubeflow distributions.
Member

(Opening a new thread because the old one was marked as resolved)

It is critical that we still explicitly exclude the creation of a distribution by the manifest working group, as this would create a massive conflict of interest.

@kimwnasptd (Member)

Thanks for taking the time to update this @juliusvonkohout! I'm taking a detailed look as well.

Skimming through it, it seems to me we have these 2 discussions going on in parallel:

  1. Existing manifests work/scope, regarding
    • tools (kustomize, helm)
    • the thin line between having a distribution and having an out-of-the-box example
    • when/if this repo should be applying extra manifests on top of the ones provided by WGs
  2. Platform terminology, which we need to clarify. To my understanding it includes:
    1. Designs for KF components to work in multi-tenant fashion (web app mechanisms, jwts etc)
    2. Components for multi-tenancy (essentially code in http://github.com/kubeflow/dashboard/)

IMO we first need to discuss whether the Manifests WG should become a Platform WG that also covers multi-tenancy, as well as what processes to have around such a WG, since it can make decisions that affect other WGs.

I will add more comments on the proposal as well, in that direction.
