WIP: state based alerting #392

Open: wants to merge 1 commit into base: main

moshloop (Member) commented May 6, 2025

No description provided.

@moshloop moshloop requested review from yashmehrotra and Copilot May 6, 2025 07:49

netlify bot commented May 6, 2025

Deploy Preview for flanksource-docs ready!

Name Link
🔨 Latest commit 69aa9de
🔍 Latest deploy log https://app.netlify.com/sites/flanksource-docs/deploys/6819beff459a8000083bc1ee
😎 Deploy Preview https://deploy-preview-392--flanksource-docs.netlify.app

netlify bot commented May 6, 2025

Deploy Preview for canarychecker canceled.

Name Link
🔨 Latest commit 69aa9de
🔍 Latest deploy log https://app.netlify.com/sites/canarychecker/deploys/6819beffaf0761000841fdc2

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces updates to various configuration and prompt files for Flanksource styles and Mission Control, as well as new Kubernetes resource manifests and component enhancements. Key changes include the removal and adjustment of word mappings in YAML config files, additions and corrections to developer prompt documents, and new resource definitions for state‐based alerting.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| styles/Flanksource/ComplexWords.yml | Removed the "component: part" mapping to update word swap rules. |
| styles/Flanksource/CaseSensitiveSpellingSuggestions.yml | Added a new replacement rule for the "[Nn]ginx" pattern. |
| styles/Flanksource/Acronyms.yml | Updated the acronym exceptions list; duplicates and ordering were adjusted. |
| prompts/style.md | Added writing style guidelines with minor spelling corrections. |
| prompts/blog.md | Added a detailed blog prompt with multiple content and spelling updates. |
| mission-control/blog/state-based-alerting/nginx.yaml | Introduced a new manifest for an NGINX pod, with a suspicious image tag. |
| mission-control/blog/state-based-alerting/canary.yaml | Introduced a new canary check resource manifest. |
| common/src/components/TerminalOutput.jsx | Updated the terminal output formatting logic for improved error handling. |

Comments suppressed due to low confidence (1)

styles/Flanksource/ComplexWords.yml:48

  • [nitpick] Confirm that the removal of the 'component: part' mapping is intentional, since this may affect how the term 'component' is translated across the application.
  -  component: part

"NGINX [Ii]ngress controller": NGINX Ingress Controller
"(?<!Ingress-)(?:Nginx|NGINX)(?! Ingress)": nginx
"(?<!Ingress-)(?:Nginx|NGINX)'s(?! Ingress)": nginx's
"[Nn]ginx": "nginx"

Copilot AI May 6, 2025

[nitpick] Verify that the new mapping for '[Nn]ginx' is consistent with existing rules and desired capitalization outcomes.

Suggested change
"[Nn]ginx": "nginx"
"[Nn]ginx": "Nginx"

Copilot uses AI. Check for mistakes.
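
One way to check the consistency question raised in this thread is to see which spelling each rule asks for on the same input. This is a minimal sketch that treats each mapping as a plain regular expression evaluated with Python's `re` module; that is an assumption, and the linter that consumes these YAML files may apply its rules differently. The helper name `preferred_forms` is made up for illustration.

```python
import re

# Rules as quoted in this review thread: (pattern, preferred spelling).
EXISTING_RULE = (r"(?<!Ingress-)(?:Nginx|NGINX)(?! Ingress)", "nginx")
ADDED_RULE = (r"[Nn]ginx", "nginx")      # as committed in the PR
SUGGESTED_RULE = (r"[Nn]ginx", "Nginx")  # as proposed in the suggested change


def preferred_forms(text, rules):
    """Return the set of spellings requested by every rule that matches text."""
    return {replacement for pattern, replacement in rules if re.search(pattern, text)}


sample = "Nginx serves the traffic"

# Committed rule: both matching rules agree on a single spelling.
print(preferred_forms(sample, [EXISTING_RULE, ADDED_RULE]))

# Suggested rule: the two matching rules ask for different spellings.
print(preferred_forms(sample, [EXISTING_RULE, SUGGESTED_RULE]))
```

Under these assumptions, the committed mapping agrees with the existing lowercase rule, while the suggested `"Nginx"` replacement would make the two rules disagree on the same word.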

- ECSCluster
- ECSService
- ECSTask
- ECSTask

Copilot AI May 6, 2025

The acronym 'ECSTask' appears more than once; please remove duplicate entries to keep the list concise.

Suggested change
- ECSTask
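
Duplicates like this are easy to catch mechanically before review. A minimal sketch, assuming the exceptions list has already been loaded into a Python list of strings (the loading step and the helper name `find_duplicates` are illustrative, not part of the repository):

```python
from collections import Counter

# Entries as quoted in the diff context above.
exceptions = ["ECSCluster", "ECSService", "ECSTask", "ECSTask"]


def find_duplicates(entries):
    """Return entries that appear more than once, in first-seen order."""
    counts = Counter(entries)
    return [name for name in counts if counts[name] > 1]


print(find_duplicates(exceptions))  # ['ECSTask']
```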

Comment on lines +39 to 41
- EKSCluster
- FAQ
- GCC

Copilot AI May 6, 2025

The acronym 'EKSCluster' is duplicated; consider removing one instance to prevent redundancy.

Suggested change
- EKSCluster
- FAQ
- GCC
- FAQ
- GCC
- GCC

1. Avoid adverbs and complex language

## Formating

Copilot AI May 6, 2025

Correct the heading from 'Formating' to 'Formatting'.

Suggested change
## Formating
## Formatting

1. Avoid adverbs and complex language

## Formating
- Format all output using MDX (markdowon)

Copilot AI May 6, 2025

Correct the typo 'markdowon' to 'markdown'.

Suggested change
- Format all output using MDX (markdowon)
- Format all output using MDX (markdown)

Write a blog post on Flanksource MIssion Control approach to AIops, primarily building a real-time and update to mirror state of cloud resources that can be queried rapidly, plus an advanced graph that builds relationships between resources e.g. Cloudformation -> Auto Scaling Group > EC2 instance and then layers on soft relationsyhips like ebs volumes, subnets, IAM poilcies - For Kubernetes it understands Flux and Gitops being able to build a graph of a Flux Kustomization object creating a HelmRelease CRD, which then creates a deployment -> replicset -> pod and then layeying relationships like services, nodes, PVC, ingress, etc..

State based alerting (i.e. whene resource self-report failure) and traditioanl alerts from APM tools trigger playbooks that can then proactively collect infomation in a distrubuted fashion from agents deployed closest to the data, the graph, changes to the graph resources, events and pro-acrtive playboks are then fed into the model which tan the recommend futher playbooks to execute.

Copilot AI May 6, 2025

There are multiple typos in this line: 'whene' should be 'when', 'traditioanl' should be 'traditional', 'infomation' should be 'information', 'distrubuted' should be 'distributed', 'pro-acrtive' should be 'proactive', 'playboks' should be 'playbooks', 'tan the' should be 'can then', and 'futher' should be 'further'.

Suggested change
State based alerting (i.e. whene resource self-report failure) and traditioanl alerts from APM tools trigger playbooks that can then proactively collect infomation in a distrubuted fashion from agents deployed closest to the data, the graph, changes to the graph resources, events and pro-acrtive playboks are then fed into the model which tan the recommend futher playbooks to execute.
State-based alerting (i.e. when resources self-report failure) and traditional alerts from APM tools trigger playbooks that can then proactively collect information in a distributed fashion from agents deployed closest to the data. The graph, changes to the graph resources, events, and proactive playbooks are then fed into the model, which can then recommend further playbooks to execute.

State based alerting (i.e. whene resource self-report failure) and traditioanl alerts from APM tools trigger playbooks that can then proactively collect infomation in a distrubuted fashion from agents deployed closest to the data, the graph, changes to the graph resources, events and pro-acrtive playboks are then fed into the model which tan the recommend futher playbooks to execute.

This is advantage as acess to systems is pushed down to agents who can use secrets like pod identity and service accounts to collect duta, new agent actions are use to create with YAML based playbooks.

Copilot AI May 6, 2025

Correct 'acess' to 'access' and 'duta' to 'data' to improve clarity.

Suggested change
This is advantage as acess to systems is pushed down to agents who can use secrets like pod identity and service accounts to collect duta, new agent actions are use to create with YAML based playbooks.
This is an advantage as access to systems is pushed down to agents who can use secrets like pod identity and service accounts to collect data, new agent actions are used to create YAML-based playbooks.

Write a blog post on the benefits of GitOps and the challenges of adoption - especially with mixed maturity teams (some prefer working in git and others like clickops) - Highlight the mission control approach to gitops (tracking resources and building a graph on how they map to git repository and sources), which enables "editing" kubernetes objects with the changes being submitted back to git. The benefits include a

contrasting metrics vs state driven alerting, store with concepts such as RED and USE and how they are more apropriate for monitoring transactions and staady state workloads and fall short for more platform engineering tasks such as monitoring the rollout of a new application or checking if a cluster is healthy after an upgrade. and then use examples of Prometheus and the canary-checker kubernetes check (https://flanksource.com/docs/guide/canary-checker/reference/kubernetes) which used the underlying https://github.com/flanksource/is-healthy library

Copilot AI May 6, 2025

Correct 'apropriate' to 'appropriate' and 'staady' to 'steady'.

Suggested change
contrasting metrics vs state driven alerting, store with concepts such as RED and USE and how they are more apropriate for monitoring transactions and staady state workloads and fall short for more platform engineering tasks such as monitoring the rollout of a new application or checking if a cluster is healthy after an upgrade. and then use examples of Prometheus and the canary-checker kubernetes check (https://flanksource.com/docs/guide/canary-checker/reference/kubernetes) which used the underlying https://github.com/flanksource/is-healthy library
contrasting metrics vs state driven alerting, store with concepts such as RED and USE and how they are more appropriate for monitoring transactions and steady state workloads and fall short for more platform engineering tasks such as monitoring the rollout of a new application or checking if a cluster is healthy after an upgrade. and then use examples of Prometheus and the canary-checker kubernetes check (https://flanksource.com/docs/guide/canary-checker/reference/kubernetes) which used the underlying https://github.com/flanksource/is-healthy library

* Background - Describe the context and initial challenges.
* Step by step guide
* Common Pitfalls - Highlight common mistakes and how to avoid them and add use-cases that are not a good fiit
* Conclustion - Offer final thoughts and potential future implications.

Copilot AI May 6, 2025

Correct 'Conclustion' to 'Conclusion'.

Suggested change
* Conclustion - Offer final thoughts and potential future implications.
* Conclusion - Offer final thoughts and potential future implications.

  name: test-pod
spec:
  containers:
    - image: nginx:invalid

Copilot AI May 6, 2025

The image tag 'nginx:invalid' appears to be a mistake that may cause runtime failures; please verify and update it with a valid tag.

Suggested change
- image: nginx:invalid
- image: nginx:latest

- ip: 10.0.5.78
 qosClass: BestEffort
 startTime: "2025-03-26T10:17:16Z"
</TerminalOutput>
Member

This did not render correctly


While standards exist for exposing metrics, there's no equivalent standard for exposing the thresholds or conditions that trigger alerts. This leads to fragmentation and complexity in monitoring setups.

[is-healthy](https://github.com/flanksource/is-healthy) is a tool designed to assess and report the health status of Kubernetes and other cloud resources (such as AWS) without the limitations of metric-based approaches.
Member

Does mentioning is-healthy add much value here?

Unless we want to promote its usage as well, in which case we should probably add more about it (e.g. support for Flux, Argo, Crossplane, etc.).

memory: "10Mi"
ports:
- containerPort: 80
```
Member

We should have a screenshot here to show how canary-checker reports the exact error, along with a time-based graph.

2. `Progressing` is `False` because the rollout timed out (`ProgressDeadlineExceeded`). The message provides specific details about the failure, potentially including reasons like OOM killing if the system surfaces that information here.

Mission Control captures this state and provides an alert with the error message from the `Progressing` condition (e.g., "ReplicaSet ... has timed out progressing..."). This points more directly to the root cause or the symptom reported by Kubernetes.
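
The kind of check described here can be sketched in a few lines. This is an illustration only, not Mission Control's or is-healthy's implementation: it assumes the Deployment has already been fetched and parsed into a Python dict with the standard `status.conditions` layout, and the function name `progressing_failure` is made up for the example.

```python
def progressing_failure(deployment):
    """Return 'reason: message' for a failed Progressing condition, else None."""
    for cond in deployment.get("status", {}).get("conditions", []):
        if cond.get("type") == "Progressing" and cond.get("status") == "False":
            return f'{cond.get("reason", "Unknown")}: {cond.get("message", "")}'
    return None


# Hand-written sample; the message is abbreviated exactly as in the text above.
stalled = {
    "status": {
        "conditions": [
            {
                "type": "Progressing",
                "status": "False",
                "reason": "ProgressDeadlineExceeded",
                "message": "ReplicaSet ... has timed out progressing...",
            }
        ]
    }
}

print(progressing_failure(stalled))
```

The alert text comes straight from what the resource reports about itself, so no threshold has to be defined separately.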

Member

Should we mention somewhere that Mission Control supports both? Maybe in the conclusion.

Member

It looks like we're comparing against Prometheus rules, but in the end we also write that it is best to use both.


State changes can trigger multiple alerts. To avoid this:

- Group related states into single alerts
Member

Add a few lines or a screenshot about how we allow notification grouping of related items (e.g. a pod and its deployment).
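
To make the grouping idea concrete for the post, something like the following could accompany the screenshot. It is purely illustrative and does not describe how Mission Control groups notifications: the `root` field, the resource names, and `group_alerts` are all hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Collect alerts under the top-level resource they relate to."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[alert["root"]].append(alert["resource"])
    return dict(grouped)


# Hypothetical alerts: a failing pod and its deployment share one root.
alerts = [
    {"resource": "Pod/web-1", "root": "Deployment/web"},
    {"resource": "Deployment/web", "root": "Deployment/web"},
    {"resource": "Pod/db-0", "root": "StatefulSet/db"},
]

for root, members in group_alerts(alerts).items():
    print(f"{root}: {len(members)} related alert(s)")
```

Three state changes collapse into two notifications here, one per root resource.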

This is example isn't really thay useful, as it needs to be run continously, [canary-checker](https://canarychecker.io/) is a kubernetes health-check platform with support for 30+ check types, The [`kubernetes`](https://canarychecker.io/reference/kubernetes) check uses the `is-healthy` library:

```yaml title=kubernetes.yaml file=./canary.yaml
```
Member

This file didn't render

Comment on lines +466 to +473
State-based alerting excels when:
- Resources self-report meaningful status
- Problems have descriptive error messages
- You need context for troubleshooting

It's less effective when:
- Resources don't update status fields
- You need to alert on trends over time
Member

I think we can expand on this

2 participants