Migrate existing Google Cloud alerts from click-ops to git-ops model #1624

Open
spiffxp opened this issue Feb 8, 2021 · 15 comments
Labels
  • area/release-eng: Issues or PRs related to the Release Engineering subproject
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • sig/k8s-infra: Categorizes an issue or PR as relevant to SIG K8s Infra.
  • sig/release: Categorizes an issue or PR as relevant to SIG Release.

Comments

@spiffxp
Member

spiffxp commented Feb 8, 2021

Discussed in the k8s-infra meeting on 2021-02-03.

We have some Slack alerting set up today, but it was configured by humans clicking around in the Google Cloud console (aka "click-ops"). Ideally, we would drive that configuration automatically from files checked into git (aka "git-ops").

This likely overlaps with, or is similar to, building a git-ops-driven workflow for Google Cloud Monitoring dashboards (#1376).

/wg k8s-infra
/sig release
/area release-eng
FYI @kubernetes/release-engineering since #k8s-infra-alerts contains container image promoter alerts
/priority important-longterm

@k8s-ci-robot added the wg/k8s-infra, sig/release, area/release-eng, and priority/important-longterm labels Feb 8, 2021
@spiffxp
Member Author

spiffxp commented Feb 8, 2021

/help

@k8s-ci-robot
Contributor

@spiffxp:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the help wanted label Feb 8, 2021
@rikatz
Contributor

rikatz commented Feb 17, 2021

I have low bandwidth right now, but if there is some time (this is not urgent), I can look into how to manage the alerts and dashboards with git-ops :)

@rikatz
Contributor

rikatz commented Feb 17, 2021

/assign

@rikatz
Contributor

rikatz commented Mar 23, 2021

So far:

My thoughts on this specific part: I really like the idea of using Crossplane (k8s objects) to manage our cloud environment, but I think a lot of folks are already familiar with Terraform (although I agree with Justin that migrating between versions is sometimes... annoying).

Tomorrow I will create some simple .tf files with the same approach, trying to create notification channels and alert policies, and see how they show up in Stackdriver.

@rikatz
Contributor

rikatz commented Apr 6, 2021

@ameukam will work on this, using @thockin's tests for monitoring certificate renewal and expiration as an example.

@rikatz
Contributor

rikatz commented Apr 6, 2021

#1877: I created a PR with a really simple Terraform configuration that adds an uptime check and the current alert policy.

We can improve on this, for example by adding latency/uptime alerting for cs.k8s.io and other services.
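
To make the "uptime check plus alert policy" pairing concrete, here is a minimal Python sketch of what such a declarative definition might carry. It is not the Terraform from #1877. It only builds dictionaries shaped like the Cloud Monitoring v3 `UptimeCheckConfig` and `AlertPolicy` resources. The project ID, check ID, notification channel name, periods, and thresholds are all hypothetical placeholders chosen for illustration.

```python
import json
from typing import Dict, List


def uptime_check(project_id: str, host: str, path: str = "/") -> Dict:
    """Build an UptimeCheckConfig-shaped payload for an HTTPS endpoint."""
    return {
        "displayName": f"uptime-{host}",
        "monitoredResource": {
            "type": "uptime_url",
            "labels": {"project_id": project_id, "host": host},
        },
        "httpCheck": {"path": path, "port": 443, "useSsl": True, "validateSsl": True},
        "period": "300s",   # assumed check interval
        "timeout": "10s",   # assumed per-request timeout
    }


def uptime_alert_policy(host: str, check_id: str, channels: List[str]) -> Dict:
    """Build an AlertPolicy-shaped payload that fires when the check fails."""
    metric_filter = (
        'metric.type="monitoring.googleapis.com/uptime_check/check_passed" '
        'AND resource.type="uptime_url" '
        f'AND metric.label.check_id="{check_id}"'
    )
    return {
        "displayName": f"{host} is down",
        "combiner": "OR",
        "conditions": [
            {
                "displayName": f"Uptime check failing for {host}",
                "conditionThreshold": {
                    "filter": metric_filter,
                    # Fire when more than one checker location reports a failure.
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": 1,
                    "duration": "300s",
                    "aggregations": [
                        {
                            "alignmentPeriod": "1200s",
                            "perSeriesAligner": "ALIGN_NEXT_OLDER",
                            "crossSeriesReducer": "REDUCE_COUNT_FALSE",
                            "groupByFields": ["resource.label.host"],
                        }
                    ],
                },
            }
        ],
        "notificationChannels": list(channels),
    }


if __name__ == "__main__":
    # All identifiers below are placeholders, not real resources.
    check = uptime_check("example-project", "cs.k8s.io")
    policy = uptime_alert_policy(
        "cs.k8s.io",
        "example-check-id",
        ["projects/example-project/notificationChannels/0000"],
    )
    print(json.dumps({"uptimeCheck": check, "alertPolicy": policy}, indent=2))
```

Keeping the check and the policy as plain data makes them easy to review in a pull request, whichever tool (Terraform, Crossplane, or something else) ends up applying them.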

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jul 5, 2021
@ameukam
Member

ameukam commented Jul 5, 2021

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jul 5, 2021
@spiffxp
Member Author

spiffxp commented Jul 16, 2021

A good first step would be understanding how to export whatever existing alerts we have as part of audit/audit-gcp.sh
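
As a rough illustration of that export step, the sketch below assumes the existing policies have already been dumped as a JSON list (for example with something like `gcloud alpha monitoring policies list --format=json`, which is an assumption about tooling, not something audit/audit-gcp.sh is known to do today). It drops the server-managed `creationRecord` and `mutationRecord` fields and emits key-sorted JSON so that repeated exports only produce a diff when a policy really changed. The file naming scheme is hypothetical.

```python
import json
import re
from typing import Dict, List

# Fields the API sets on its own; they can change without anyone editing the policy.
VOLATILE_FIELDS = ("creationRecord", "mutationRecord")


def slugify(text: str) -> str:
    """Turn a display name into a lowercase, filesystem-friendly slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug or "unnamed"


def normalize_policies(policies: List[dict]) -> Dict[str, str]:
    """Map a (hypothetical) file name to stable JSON text for each policy."""
    files: Dict[str, str] = {}
    for policy in policies:
        cleaned = {k: v for k, v in policy.items() if k not in VOLATILE_FIELDS}
        stem = slugify(policy.get("displayName", ""))
        # The trailing segment of the resource name disambiguates equal display names.
        policy_id = policy.get("name", "").rsplit("/", 1)[-1]
        if policy_id:
            stem = f"{stem}.{policy_id}"
        files[f"{stem}.json"] = json.dumps(cleaned, indent=2, sort_keys=True) + "\n"
    return files


if __name__ == "__main__":
    # Placeholder input standing in for a real export.
    exported = [
        {
            "name": "projects/example-project/alertPolicies/123",
            "displayName": "Cert Expiry (prod)",
            "enabled": True,
            "mutationRecord": {"mutateTime": "2021-01-01T00:00:00Z"},
        }
    ]
    for filename, body in normalize_policies(exported).items():
        print(filename)
        print(body)
```

Writing each returned entry to disk under an audit directory would give the audit script a reviewable snapshot of the alerts as they exist today.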

@spiffxp
Member Author

spiffxp commented Aug 17, 2021

@spiffxp
Member Author

spiffxp commented Aug 17, 2021

/milestone v1.23
At a bare minimum, I think it would be really handy to use this for uptime checks on the apps we run on aaa.

@k8s-ci-robot added this to the v1.23 milestone Aug 17, 2021
@k8s-ci-robot added the sig/k8s-infra label and removed the wg/k8s-infra label Sep 29, 2021
@ameukam
Member

ameukam commented Dec 14, 2021

/milestone clear

@k8s-ci-robot removed this from the v1.23 milestone Dec 14, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Mar 14, 2022
@ameukam
Member

ameukam commented Mar 15, 2022

/remove-lifecycle stale
/lifecycle frozen
/milestone clear

@k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Mar 15, 2022
6 participants