Monitoring API: Add AlertmanagerMainConfig #2148

Open: marioferh wants to merge 13 commits into base: master from alertmanager_monitoring_api

Conversation

marioferh
Contributor

Each component will be submitted in a separate PR to make the review process easier.

First PR: #1929
Related: Enhancements Proposal openshift/enhancements#1627

Contributor

openshift-ci bot commented Jan 15, 2025

Hello @marioferh! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

openshift-ci bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Jan 15, 2025
openshift-ci bot requested review from bparees and JoelSpeed on January 15, 2025 14:55
marioferh force-pushed the alertmanager_monitoring_api branch from c0d2965 to c35227f on January 15, 2025 15:21
@marioferh
Contributor Author

/hold

openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jan 15, 2025
Contributor

@everettraven left a comment

A lot of the comments are centered around godoc. I'd recommend looking at https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md#write-user-readable-documentation-in-godoc for more information on what makes a good godoc that is helpful to users.

Another thing that stood out was multiple fields related to pod spec configuration - you may want to group those into a separate struct to have a single field that clearly denotes that the sub-fields in that object map directly to pod spec fields.

openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Jan 26, 2025
marioferh force-pushed the alertmanager_monitoring_api branch from c35227f to 75fbb23 on April 29, 2025 15:21
Contributor

openshift-ci bot commented Apr 29, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: marioferh
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

marioferh force-pushed the alertmanager_monitoring_api branch 3 times, most recently from 450dd7b to a6d7bc9, on April 30, 2025 08:11
openshift-merge-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 30, 2025
Contributor

@everettraven left a comment

Only got about halfway through the changes on this round, will circle back soon to review the rest.

Comment on lines 127 to 155
// userMode controls whether Alertmanager should process configurations from user-defined (non-platform)
// namespaces for AlertmanagerConfig lookups.
// When set to true, Alertmanager will search for AlertmanagerConfig resources in user-defined namespaces.
// This field is only effective when the user workload Alertmanager instance is not enabled.
// If the user workload monitoring Alertmanager is enabled, this field is ignored.
// Required: This field must be specified.
// +kubebuilder:validation:Enum="";Enabled;Disabled
// +required
UserMode UserAlertManagerMode `json:"userMode"`
Contributor

A couple things:

  • The godoc seems a bit outdated here, referencing "true" as a value that can be set.
  • Is UserMode the most appropriate name here? Maybe this would make more sense as something like ConfigurationPolicy, with options like PlatformDefined and UserDefined?

Contributor Author

https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/monitoring/about-openshift-container-platform-monitoring#default-monitoring-targets_monitoring-stack-architecture

I think it's important to highlight how the monitoring stack works. We have the default stack and the user-defined mode, and I believe these two fields make that clearer than a ConfigurationPolicy would.
This godoc line is another reason: // This field is only effective when the user workload Alertmanager instance is not enabled.

// This field is optional.
// More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
// +optional
Resources *v1.ResourceRequirements `json:"resources,omitempty"`
Contributor

IIRC we don't encourage the use of the ResourceRequirements type and instead recommend using a list of a custom type that gets translated to the ResourceRequirements you set on a Pod

Contributor Author

Contributor

I recall this conversation recently related to ResourceLists and because ResourceRequirements at least includes fields of type ResourceList the same principle applies: #2222 (comment)

@JoelSpeed would probably have more explicit knowledge of the limitations of using these types, but my understanding is that they no longer align with our API conventions.

Contributor Author

Could this work?
ee3b854

Contributor

Something along those lines may be reasonable if you need explicit fields for each resource type. For example, that change would result in a UX of writing YAML like:

...
resources:
  cpu:
    requests: ...
    limits: ...
  memory:
    requests: ...
    limits: ...
  hugepages:
    requests: ...
    limits: ...
    size: ...
...

An alternative would be to do something like:

...
resources:
  - name: cpu
    requests: ...
    limits: ...
  - name: memory
    requests: ...
    limits: ...
  - name: hugepages
    requests: ...
    limits: ...
...

Based on my understanding of the comment I linked to, I think the alternative I've shared is the preferred approach for something like this, because it can grow more easily as the need to support other resource types arises.

I'll leave @JoelSpeed to explain any further and dictate what approach is actually preferred.
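
To make the list-based alternative concrete, here is a minimal Go sketch. It is not the PR's actual API: the names ContainerResource, PodResources, and ToPodResources are hypothetical, the request/limit field names are an assumption, and quantities are kept as plain strings so the example needs no Kubernetes imports. It shows how a list keyed by name could be flattened into the requests/limits shape a pod template expects, and why the entries need unique names.

```go
package main

import "fmt"

// ContainerResource is a hypothetical list entry: one named resource with
// optional request and limit quantities, kept as strings for simplicity.
type ContainerResource struct {
	// name is the resource name, e.g. "cpu", "memory", or "hugepages-2Mi".
	Name string `json:"name"`
	// request is the minimum amount of the resource; empty means unset.
	Request string `json:"request,omitempty"`
	// limit is the maximum amount of the resource; empty means unset.
	Limit string `json:"limit,omitempty"`
}

// PodResources mirrors the requests/limits layout of a container's
// resources stanza using plain maps (an assumption for this sketch).
type PodResources struct {
	Requests map[string]string
	Limits   map[string]string
}

// ToPodResources flattens the list into requests and limits keyed by name.
// Duplicate names are rejected, matching a list that is keyed on name.
func ToPodResources(list []ContainerResource) (PodResources, error) {
	out := PodResources{Requests: map[string]string{}, Limits: map[string]string{}}
	seen := map[string]bool{}
	for _, r := range list {
		if seen[r.Name] {
			return PodResources{}, fmt.Errorf("duplicate resource %q", r.Name)
		}
		seen[r.Name] = true
		if r.Request != "" {
			out.Requests[r.Name] = r.Request
		}
		if r.Limit != "" {
			out.Limits[r.Name] = r.Limit
		}
	}
	return out, nil
}

func main() {
	res, err := ToPodResources([]ContainerResource{
		{Name: "cpu", Request: "100m", Limit: "500m"},
		{Name: "memory", Request: "64Mi"},
	})
	fmt.Println(res.Requests, res.Limits, err)
}
```

Supporting a new resource type in this shape means accepting another name value rather than adding a new field to the struct.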

Contributor

What is the use case here? Are we just passing the data through to the pod template itself?

Are there any restrictions that we would be applying, eg, are there certain types of resources that we wouldn't want to support?

Does each type of resource support the same options, for example, in the comment above, it looks as though maybe hugepages has other options that CPU and memory do not support?

Contributor Author

Yes, the CRD passes these values through to the pod template.

AFAIK this is the info we have: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

    spec.containers[].resources.limits.cpu
    spec.containers[].resources.limits.memory
    spec.containers[].resources.limits.hugepages-<size>
    spec.containers[].resources.requests.cpu
    spec.containers[].resources.requests.memory
    spec.containers[].resources.requests.hugepages-<size>

Contributor

When we aren't interpreting the content of the field, and simply copying it through, it can be ok to re-use the types from upstream.

However, in this case, there are alpha fields in the struct that we would want to not expose, so creating a mirror that meets the parts of the API that we do want to expose makes sense. It also means we can exclude other resources that we do not support users configuring

marioferh force-pushed the alertmanager_monitoring_api branch 2 times, most recently from e062ce8 to ee3b854, on May 6, 2025 10:38
// +optional
Memory *ResourceSpec `json:"memory,omitempty"`

// hugepages is a list of hugepage resource specifications by page size.
Contributor

Why might a user care to set these? What happens if they don't?

Contributor Author

Same answer as in the other comments:
If it's not set, containers have no resource limits, which can be harmful to the system. Users configuring containers in OpenShift should be aware of this.

// HugePageResource describes hugepages resources by page size (e.g. 2Mi, 1Gi).
type HugePageResource struct {
// size of the hugepage (e.g. "2Mi", "1Gi").
// +kubebuilder:validation:Pattern=`^[0-9]+(Ki|Mi|Gi)$`
Contributor

We generally try to avoid using the kubebuilder:validation:Pattern marker now in favor of writing a CEL expression that performs the regular expression evaluation. Using CEL expressions allows us to provide a more human readable error message than returned when using the pattern marker.

An example:

// +kubebuilder:validation:XValidation:rule="self.matches('^arn:aws:kms:[a-z0-9-]+:[0-9]{12}:key/[a-f0-9-]+$')",message="keyARN must follow the format `arn:aws:kms:<region>:<account_id>:key/<key_id>`. The account ID must be a 12 digit number and the region and key ID should consist only of lowercase hexadecimal characters and hyphens (-)."
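
Applied to the hugepage size field, the same idea might look like the sketch below. This is an illustration under assumptions, not the PR's code: the message wording and the ValidateSize helper are hypothetical, and the Go function only mirrors the CEL rule locally so the behavior can be checked without an API server. CEL's matches and Go's regexp package both use RE2 syntax, so the same anchored expression works in both places.

```go
package main

import (
	"fmt"
	"regexp"
)

// HugePageResource describes hugepages resources by page size.
type HugePageResource struct {
	// size of the hugepage (e.g. "2Mi", "1Gi").
	// +kubebuilder:validation:XValidation:rule="self.matches('^[0-9]+(Ki|Mi|Gi)$')",message="size must be a whole number followed by Ki, Mi, or Gi, for example 2Mi or 1Gi"
	Size string `json:"size"`
}

// sizePattern is the same expression used in the CEL rule above.
var sizePattern = regexp.MustCompile(`^[0-9]+(Ki|Mi|Gi)$`)

// ValidateSize is a hypothetical local mirror of the CEL rule. It returns
// the human-readable message instead of a raw pattern-mismatch error.
func ValidateSize(size string) error {
	if !sizePattern.MatchString(size) {
		return fmt.Errorf("size must be a whole number followed by Ki, Mi, or Gi, for example 2Mi or 1Gi (got %q)", size)
	}
	return nil
}

func main() {
	for _, s := range []string{"2Mi", "1Gi", "2MB", ""} {
		fmt.Printf("%q -> %v\n", s, ValidateSize(s))
	}
}
```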

Contributor Author

Thanks, but I'm not sure a validation pattern is needed here, so I removed it.

marioferh force-pushed the alertmanager_monitoring_api branch 2 times, most recently from b324126 to d9fba48, on May 7, 2025 11:34
Contributor

@everettraven left a comment

Another round of comments. Additionally, I would like to see tests added to ensure the API and validations you have are working as expected.

Comment on lines 78 to 79
// userDefined is optional.
// +optional`
Contributor

Why was this moved to optional?

Also, minor typo:

Suggested change
// userDefined is optional.
// +optional`
// userDefined is optional.
// +optional

Contributor

By making this optional, you now have no required fields in the spec.

Are we ok with folks creating an object like

spec: {}

What does that object mean? Should there be a required field or a minProperties?
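
As a rough illustration of the minProperties option, the sketch below uses a hypothetical ExampleSpec with a single optional userDefined field (the UserDefined struct and its mode field are placeholders, not the PR's types). The MinProperties marker would make the API server reject an empty object; the PropertyCount helper is only a local stand-in that counts the serialized properties, to show that an all-optional spec serializes to {}.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// UserDefined is a placeholder for the real userDefined configuration.
type UserDefined struct {
	Mode string `json:"mode,omitempty"`
}

// ExampleSpec is a hypothetical spec in which every field is optional.
// With the marker below, an object with no properties set would be rejected.
// +kubebuilder:validation:MinProperties=1
type ExampleSpec struct {
	// userDefined is optional.
	// +optional
	UserDefined *UserDefined `json:"userDefined,omitempty"`
}

// PropertyCount reports how many top-level properties v serializes to,
// which is the quantity a minProperties constraint is checked against.
func PropertyCount(v interface{}) (int, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return 0, err
	}
	var props map[string]json.RawMessage
	if err := json.Unmarshal(raw, &props); err != nil {
		return 0, err
	}
	return len(props), nil
}

func main() {
	empty, _ := PropertyCount(ExampleSpec{})
	set, _ := PropertyCount(ExampleSpec{UserDefined: &UserDefined{Mode: "Enabled"}})
	fmt.Println("empty spec properties:", empty)
	fmt.Println("populated spec properties:", set)
}
```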

Contributor Author

I thought that with the latest changes the userDefined field had to be optional. The existing ConfigMap field is documented with "The default value is false", so I made it optional, since it used to be a simple enable/disable flag:

	// A Boolean flag that enables or disables user-defined namespaces
	// to be selected for `AlertmanagerConfig` lookups. This setting only
	// applies if the user workload monitoring instance of Alertmanager
	// is not enabled.
	// The default value is `false`.
	EnableUserAlertManagerConfig bool `json:"enableUserAlertmanagerConfig,omitempty"`

Comment on lines +131 to +134
// deployed contains configuration options for the deployed Alertmanager instance.
// +optional
Deployed *AlertmanagerDeployedConfig `json:"deployed,omitempty"`
Contributor

For discriminated unions, this field must be set when the discriminator is set to Deployed and unset otherwise. We have a pretty standard CEL expression we use for this:

// +kubebuilder:validation:XValidation:rule="has(self.type) && self.type == 'Filters' ? has(self.filters) : !has(self.filters)",message="filters is required when type is Filters, and forbidden otherwise"
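
Adapted to the fields in this PR, the rule might look like the sketch below. The placement on an AlertmanagerConfig parent struct, the empty AlertmanagerDeployedConfig body, and the ValidateUnion helper are assumptions made for illustration; the helper only restates the CEL ternary in Go so the required/forbidden behavior can be checked locally.

```go
package main

import (
	"errors"
	"fmt"
)

// AlertmanagerDeployedConfig stands in for the real struct; fields elided.
type AlertmanagerDeployedConfig struct{}

// AlertmanagerConfig is assumed to hold both the discriminator and its member.
// +kubebuilder:validation:XValidation:rule="has(self.deploymentMode) && self.deploymentMode == 'Deployed' ? has(self.deployed) : !has(self.deployed)",message="deployed is required when deploymentMode is Deployed, and forbidden otherwise"
type AlertmanagerConfig struct {
	// +unionDiscriminator
	// +kubebuilder:validation:Enum=Deployed;NotDeployed
	DeploymentMode string `json:"deploymentMode"`

	// deployed contains configuration options for the deployed Alertmanager instance.
	// +optional
	Deployed *AlertmanagerDeployedConfig `json:"deployed,omitempty"`
}

// ValidateUnion mirrors the CEL rule: the member must be set exactly when
// the discriminator selects it.
func ValidateUnion(c AlertmanagerConfig) error {
	wantDeployed := c.DeploymentMode == "Deployed"
	hasDeployed := c.Deployed != nil
	if wantDeployed != hasDeployed {
		return errors.New("deployed is required when deploymentMode is Deployed, and forbidden otherwise")
	}
	return nil
}

func main() {
	cases := []AlertmanagerConfig{
		{DeploymentMode: "Deployed", Deployed: &AlertmanagerDeployedConfig{}},
		{DeploymentMode: "Deployed"},
		{DeploymentMode: "NotDeployed", Deployed: &AlertmanagerDeployedConfig{}},
		{DeploymentMode: "NotDeployed"},
	}
	for _, c := range cases {
		fmt.Printf("mode=%s deployedSet=%t -> %v\n", c.DeploymentMode, c.Deployed != nil, ValidateUnion(c))
	}
}
```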

Contributor Author

Is it correct now?


// AlertmanagerContainerResources defines simplified resource requirements for a container.
type AlertmanagerContainerResources struct {
// cpu defines the CPU resource limits and requests.
Contributor

Because not setting this could be harmful to the system, are there any defaults that we set on a user's behalf?

Comment on lines 309 to 310
// The list is treated as a map, using `size` as the key, which simplifies updates and replacements
// of individual entries.
Contributor

For an end-user, I believe this means that entries must be unique. I'm not sure an end-user cares about whether or not this simplifies updates and replacements of individual entries.

Contributor Author

done

// should be deployed in the `openshift-monitoring` namespace.
// alertmanagerMainConfig is optional.
// +optional`
AlertmanagerMainConfig AlertmanagerConfig `json:"alertmanagerMainConfig"`
Contributor

Why Main?

Contributor Author

Explained earlier; removed.

// +unionDiscriminator
// +kubebuilder:validation:Enum=Deployed;NotDeployed
// +kubebuilder:validation:Required
DeploymentMode string `json:"deploymentMode"`
Contributor

The parent struct is optional, so what does it mean when the parent is omitted?

The parent also does not have omitempty, nor is it a pointer. Which means it is discoverable (++ for config API), however, this field being required, is going to cause issues.

If I asked you to allow "" as a valid value for the enum, what would that mean to the controller?

Contributor Author

I think AlertmanagerConfig should be required?

//
// When omitted, this means the user has no opinion and the platform is left
// to choose reasonable defaults. These defaults are subject to change over time.
// The current default is `- operator: "Exists"` which means that all taints are tolerated.
Contributor

Is that safe? Not even an API question, but, tolerating all taints is generally not something we would do for control plane components. There are many valid taints (uninitialized for the CCM, network not ready) that I would expect this pod not to tolerate

Contributor Author

let me think about it

// SecretName is a type that represents the name of a Secret in the same namespace.
// It must be at most 256 characters in length.
// +kubebuilder:validation:XValidation:rule="!format.dns1123Subdomain().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."
// +kubebuilder:validation:MaxLength=256
Contributor

This should be 253

Contributor Author

done

// UserAlertManagerMode defines mode for user-defines namespaced
//
// Possible values:
// - "Selectable": User-defined namespaces can be selected for AlertmanagerConfig lookups.
Contributor

So, where is the selector that the user would configure to determine which namespaces to use?

Contributor Author

The selector is whether to use alertmanagerMain or the user-defined config.

// +kubebuilder:validation:MaxLength=24
// This filed is optional
// +optional
Request string `json:"request,omitempty"`
Contributor

Should this be a quantity type?

Contributor Author

done

@marioferh
Contributor Author

I'll continue tomorrow with the remaining comments.

marioferh force-pushed the alertmanager_monitoring_api branch 2 times, most recently from ea249c1 to d60f672, on May 14, 2025 17:20
marioferh added 4 commits May 15, 2025 08:50
Signed-off-by: Mario Fernandez <[email protected]>
Signed-off-by: Mario Fernandez <[email protected]>
Signed-off-by: Mario Fernandez <[email protected]>
marioferh and others added 8 commits May 15, 2025 08:50
marioferh force-pushed the alertmanager_monitoring_api branch from d60f672 to b758adf on May 15, 2025 08:22
Contributor

openshift-ci bot commented May 15, 2025

@marioferh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/okd-scos-images | c35227f | link | true | /test okd-scos-images |
| ci/prow/e2e-upgrade | 2a869cb | link | true | /test e2e-upgrade |
| ci/prow/e2e-azure | 2a869cb | link | false | /test e2e-azure |
| ci/prow/verify-crd-schema | 2a869cb | link | true | /test verify-crd-schema |
| ci/prow/e2e-aws-serial-2of2 | 2a869cb | link | true | /test e2e-aws-serial-2of2 |
| ci/prow/e2e-upgrade-out-of-change | 2a869cb | link | true | /test e2e-upgrade-out-of-change |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
4 participants