WIP: state based alerting #392
✅ Deploy Preview for flanksource-docs ready!
✅ Deploy Preview for canarychecker canceled.
Pull Request Overview
This PR introduces updates to various configuration and prompt files for Flanksource styles and Mission Control, as well as new Kubernetes resource manifests and component enhancements. Key changes include the removal and adjustment of word mappings in YAML config files, additions and corrections to developer prompt documents, and new resource definitions for state‐based alerting.
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 11 comments.
Show a summary per file
File | Description |
---|---|
styles/Flanksource/ComplexWords.yml | Removed the "component: part" mapping to update word swap rules. |
styles/Flanksource/CaseSensitiveSpellingSuggestions.yml | Added a new replacement rule for the "[Nn]ginx" pattern. |
styles/Flanksource/Acronyms.yml | Updated the acronym exceptions list; duplicates and ordering were adjusted. |
prompts/style.md | Added writing style guidelines with minor spelling corrections. |
prompts/blog.md | Added a detailed blog prompt with multiple content and spelling updates. |
mission-control/blog/state-based-alerting/nginx.yaml | Introduced a new manifest for an NGINX pod, with a suspicious image tag. |
mission-control/blog/state-based-alerting/canary.yaml | Introduced a new canary check resource manifest. |
common/src/components/TerminalOutput.jsx | Updated the terminal output formatting logic for improved error handling. |
Comments suppressed due to low confidence (1)
styles/Flanksource/ComplexWords.yml:48
- [nitpick] Confirm that the removal of the 'component: part' mapping is intentional, since this may affect how the term 'component' is translated across the application.
- component: part
"NGINX [Ii]ngress controller": NGINX Ingress Controller
"(?<!Ingress-)(?:Nginx|NGINX)(?! Ingress)": nginx
"(?<!Ingress-)(?:Nginx|NGINX)'s(?! Ingress)": nginx's
"[Nn]ginx": "nginx"
[nitpick] Verify that the new mapping for '[Nn]ginx' is consistent with existing rules and desired capitalization outcomes.
Suggested change, from:
"[Nn]ginx": "nginx"
to:
"[Nn]ginx": "Nginx"
Copilot uses AI. Check for mistakes.
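To check the consistency question raised here, the following is a minimal sketch of how the rules in this hunk interact. It assumes each entry maps a regex to its preferred spelling and that a match equal to its replacement is not reported; both are assumptions about the linter, and Python's `re` stands in for its regex engine. The names `EXISTING`, `AS_COMMITTED`, `AS_SUGGESTED`, and `suggestions` are hypothetical.

```python
import re

# Patterns copied from the hunk above (the possessive variant is omitted for brevity).
EXISTING = {
    r"NGINX [Ii]ngress controller": "NGINX Ingress Controller",
    r"(?<!Ingress-)(?:Nginx|NGINX)(?! Ingress)": "nginx",
}
AS_COMMITTED = {**EXISTING, r"[Nn]ginx": "nginx"}   # the rule as added in this PR
AS_SUGGESTED = {**EXISTING, r"[Nn]ginx": "Nginx"}   # the rule as proposed in the review


def suggestions(text, rules):
    """Return (found, preferred) pairs for every match that differs from its replacement."""
    found = []
    for pattern, preferred in rules.items():
        for match in re.finditer(pattern, text):
            if match.group(0) != preferred:
                found.append((match.group(0), preferred))
    return found


for label, rules in (("committed", AS_COMMITTED), ("suggested", AS_SUGGESTED)):
    for text in ("nginx proxies the request", "Nginx proxies the request"):
        print(label, repr(text), suggestions(text, rules))
```

Under these assumptions, the committed rule only duplicates the existing lowercase preference, while the suggested value flags "nginx" as "Nginx" and the existing rule flags "Nginx" as "nginx", so neither spelling passes.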
- ECSCluster
- ECSService
- ECSTask
- ECSTask
The acronym 'ECSTask' appears more than once; please remove duplicate entries to keep the list concise.
- ECSTask
- EKSCluster
- FAQ
- GCC
The acronym 'EKSCluster' is duplicated; consider removing one instance to prevent redundancy.
- EKSCluster
- FAQ
- GCC
- FAQ
- GCC
- GCC
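Duplicate entries like these can be caught mechanically. A minimal sketch, assuming the exceptions list has already been loaded into a Python list; the helper name `duplicate_entries` is hypothetical and the sample values are taken only from the excerpts quoted in this review, not from the full file.

```python
from collections import Counter


def duplicate_entries(entries):
    """Return entries that occur more than once, in first-seen order."""
    counts = Counter(entries)
    return [name for name in dict.fromkeys(entries) if counts[name] > 1]


# Sample values from the two hunks quoted above (not the complete Acronyms.yml list).
acronyms = ["ECSCluster", "ECSService", "ECSTask", "ECSTask", "EKSCluster", "FAQ", "GCC"]
print(duplicate_entries(acronyms))
```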
1. Avoid adverbs and complex language

## Formating
Correct the heading from 'Formating' to 'Formatting'.
Suggested change, from:
## Formating
to:
## Formatting
1. Avoid adverbs and complex language

## Formating
- Format all output using MDX (markdowon)
Correct the typo 'markdowon' to 'markdown'.
Suggested change, from:
- Format all output using MDX (markdowon)
to:
- Format all output using MDX (markdown)
Write a blog post on Flanksource MIssion Control approach to AIops, primarily building a real-time and update to mirror state of cloud resources that can be queried rapidly, plus an advanced graph that builds relationships between resources e.g. Cloudformation -> Auto Scaling Group > EC2 instance and then layers on soft relationsyhips like ebs volumes, subnets, IAM poilcies - For Kubernetes it understands Flux and Gitops being able to build a graph of a Flux Kustomization object creating a HelmRelease CRD, which then creates a deployment -> replicset -> pod and then layeying relationships like services, nodes, PVC, ingress, etc..

State based alerting (i.e. whene resource self-report failure) and traditioanl alerts from APM tools trigger playbooks that can then proactively collect infomation in a distrubuted fashion from agents deployed closest to the data, the graph, changes to the graph resources, events and pro-acrtive playboks are then fed into the model which tan the recommend futher playbooks to execute.
There are multiple typos in this line: 'whene' should be 'when', 'traditioanl' should be 'traditional', 'infomation' should be 'information', 'distrubuted' should be 'distributed', 'pro-acrtive' should be 'proactive', 'playboks' should be 'playbooks', 'tan the' should be 'can then', and 'futher' should be 'further'.
Suggested change, from:
State based alerting (i.e. whene resource self-report failure) and traditioanl alerts from APM tools trigger playbooks that can then proactively collect infomation in a distrubuted fashion from agents deployed closest to the data, the graph, changes to the graph resources, events and pro-acrtive playboks are then fed into the model which tan the recommend futher playbooks to execute.
to:
State-based alerting (i.e. when resources self-report failure) and traditional alerts from APM tools trigger playbooks that can then proactively collect information in a distributed fashion from agents deployed closest to the data. The graph, changes to the graph resources, events, and proactive playbooks are then fed into the model, which can then recommend further playbooks to execute.
State based alerting (i.e. whene resource self-report failure) and traditioanl alerts from APM tools trigger playbooks that can then proactively collect infomation in a distrubuted fashion from agents deployed closest to the data, the graph, changes to the graph resources, events and pro-acrtive playboks are then fed into the model which tan the recommend futher playbooks to execute.

This is advantage as acess to systems is pushed down to agents who can use secrets like pod identity and service accounts to collect duta, new agent actions are use to create with YAML based playbooks.
Correct 'acess' to 'access' and 'duta' to 'data' to improve clarity.
Suggested change, from:
This is advantage as acess to systems is pushed down to agents who can use secrets like pod identity and service accounts to collect duta, new agent actions are use to create with YAML based playbooks.
to:
This is an advantage as access to systems is pushed down to agents who can use secrets like pod identity and service accounts to collect data, new agent actions are used to create YAML-based playbooks.
Write a blog post on the benefits of GitOps and the challenges of adoption - especially with mixed maturity teams (some prefer working in git and others like clickops) - Highlight the mission control approach to gitops (tracking resources and building a graph on how they map to git repository and sources), which enables "editing" kubernetes objects with the changes being submitted back to git. The benefits include a

contrasting metrics vs state driven alerting, store with concepts such as RED and USE and how they are more apropriate for monitoring transactions and staady state workloads and fall short for more platform engineering tasks such as monitoring the rollout of a new application or checking if a cluster is healthy after an upgrade. and then use examples of Prometheus and the canary-checker kubernetes check (https://flanksource.com/docs/guide/canary-checker/reference/kubernetes) which used the underlying https://github.com/flanksource/is-healthy library
Correct 'apropriate' to 'appropriate' and 'staady' to 'steady'.
Suggested change, from:
contrasting metrics vs state driven alerting, store with concepts such as RED and USE and how they are more apropriate for monitoring transactions and staady state workloads and fall short for more platform engineering tasks such as monitoring the rollout of a new application or checking if a cluster is healthy after an upgrade. and then use examples of Prometheus and the canary-checker kubernetes check (https://flanksource.com/docs/guide/canary-checker/reference/kubernetes) which used the underlying https://github.com/flanksource/is-healthy library
to:
contrasting metrics vs state driven alerting, store with concepts such as RED and USE and how they are more appropriate for monitoring transactions and steady state workloads and fall short for more platform engineering tasks such as monitoring the rollout of a new application or checking if a cluster is healthy after an upgrade. and then use examples of Prometheus and the canary-checker kubernetes check (https://flanksource.com/docs/guide/canary-checker/reference/kubernetes) which used the underlying https://github.com/flanksource/is-healthy library
* Background - Describe the context and initial challenges.
* Step by step guide
* Common Pitfalls - Highlight common mistakes and how to avoid them and add use-cases that are not a good fiit
* Conclustion - Offer final thoughts and potential future implications.
Correct 'Conclustion' to 'Conclusion'.
Suggested change, from:
* Conclustion - Offer final thoughts and potential future implications.
to:
* Conclusion - Offer final thoughts and potential future implications.
name: test-pod
spec:
containers:
- image: nginx:invalid
The image tag 'nginx:invalid' appears to be a mistake that may cause runtime failures; please verify and update it with a valid tag.
Suggested change, from:
- image: nginx:invalid
to:
- image: nginx:latest
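Whether or not the tag is intentional, an image that cannot be pulled shows up in the pod's own status, which is the kind of self-reported state this PR is about. A minimal sketch, assuming the pod object has already been fetched as a dict in the Kubernetes API shape; the helper name `image_pull_failures`, the container name, and the sample message are hypothetical.

```python
# Waiting reasons the kubelet reports when an image cannot be pulled.
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def image_pull_failures(pod):
    """Return (container, reason, message) for containers stuck pulling their image."""
    failures = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for status in statuses:
        waiting = status.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in IMAGE_PULL_REASONS:
            failures.append((status["name"], waiting["reason"], waiting.get("message", "")))
    return failures


# Hand-written sample status for the manifest under review (not captured from a cluster).
sample_pod = {
    "metadata": {"name": "test-pod"},
    "status": {
        "containerStatuses": [
            {
                "name": "nginx",  # assumed container name; the excerpt does not show it
                "image": "nginx:invalid",
                "state": {
                    "waiting": {
                        "reason": "ImagePullBackOff",
                        "message": "Back-off pulling image",
                    }
                },
            }
        ]
    },
}

print(image_pull_failures(sample_pod))
```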
-[36m ip[0m:[32m 10.0.5.78[0m
[32m [0m[36mqosClass[0m:[32m BestEffort[0m
[32m [0m[36mstartTime[0m:[32m "2025-03-26T10:17:16Z"[0m
</TerminalOutput>
This did not render correctly
While standards exist for exposing metrics, there's no equivalent standard for exposing the thresholds or conditions that trigger alerts. This leads to fragmentation and complexity in monitoring setups.

[is-healthy](https://github.com/flanksource/is-healthy) is a tool designed to assess and report the health status of Kubernetes and other cloud resources (such as AWS) without the limitations of metric-based approaches.
Does mentioning is-healthy add much value here?
Unless we want to promote its usage as well, in which case we should probably add more about it (e.g. support for Flux, Argo, Crossplane, etc.).
memory: "10Mi"
ports:
- containerPort: 80
``` |
We should add a screenshot here showing how canary-checker reports the exact error, along with a time-based graph.
2. `Progressing` is `False` because the rollout timed out (`ProgressDeadlineExceeded`). The message provides specific details about the failure, potentially including reasons like OOM killing if the system surfaces that information here.

Mission Control captures this state and provides an alert with the error message from the `Progressing` condition (e.g., "ReplicaSet ... has timed out progressing..."). This points more directly to the root cause or the symptom reported by Kubernetes.
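The condition check described in this excerpt can be sketched as follows. This is an illustration under stated assumptions, not Mission Control's implementation: the Deployment is assumed to be available as a dict in the Kubernetes API shape, the helper name `progressing_failure` is hypothetical, and the sample conditions (including the ReplicaSet name) are hand-written.

```python
def progressing_failure(deployment):
    """Return 'reason: message' for a failed Progressing condition, else None."""
    for cond in deployment.get("status", {}).get("conditions", []):
        if cond.get("type") == "Progressing" and cond.get("status") == "False":
            return f'{cond.get("reason", "Unknown")}: {cond.get("message", "")}'
    return None


# Hand-written sample; the ReplicaSet name is made up for illustration.
sample_deployment = {
    "status": {
        "conditions": [
            {
                "type": "Available",
                "status": "False",
                "reason": "MinimumReplicasUnavailable",
                "message": "Deployment does not have minimum availability.",
            },
            {
                "type": "Progressing",
                "status": "False",
                "reason": "ProgressDeadlineExceeded",
                "message": 'ReplicaSet "web-5d9c" has timed out progressing.',
            },
        ]
    }
}

print(progressing_failure(sample_deployment))
```

The alert text comes straight from what the resource reports about itself, so no threshold has to be defined separately.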
Should we mention somewhere that Mission Control supports both? Maybe in the conclusion.
It looks like we're comparing against Prometheus rules, but in the end we also write that it is best to use both.
State changes can trigger multiple alerts. To avoid this:

- Group related states into single alerts
Add a few lines or a screenshot on how we allow notification grouping of related items (e.g. pod and deployment).
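As a starting point for that section, grouping related states into a single alert can be sketched generically. This is not Mission Control's grouping logic: the `group` key (for example, the owning Deployment), the helper name `group_alerts`, and the sample events are all hypothetical.

```python
from collections import defaultdict


def group_alerts(events):
    """Collapse per-resource events into one list of lines per group key."""
    grouped = defaultdict(list)
    for event in events:
        line = f'{event["kind"]}/{event["name"]}: {event["message"]}'
        grouped[event["group"]].append(line)
    return dict(grouped)


# Hand-written sample events; "group" stands in for whatever relates the resources.
events = [
    {"group": "Deployment/web", "kind": "Deployment", "name": "web", "message": "ProgressDeadlineExceeded"},
    {"group": "Deployment/web", "kind": "Pod", "name": "web-abc12", "message": "ImagePullBackOff"},
    {"group": "Deployment/api", "kind": "Pod", "name": "api-xyz34", "message": "CrashLoopBackOff"},
]

for group, lines in group_alerts(events).items():
    print(f"{group}: {len(lines)} related event(s)")
    for line in lines:
        print(f"  {line}")
```

Here three state changes produce two notifications, one per group, instead of three.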
This is example isn't really thay useful, as it needs to be run continously, [canary-checker](https://canarychecker.io/) is a kubernetes health-check platform with support for 30+ check types, The [`kubernetes`](https://canarychecker.io/reference/kubernetes) check uses the `is-healthy` library:

```yaml title=kubernetes.yaml file=./canary.yaml
```
This file didn't render
State-based alerting excels when:
- Resources self-report meaningful status
- Problems have descriptive error messages
- You need context for troubleshooting

It's less effective when:
- Resources don't update status fields
- You need to alert on trends over time
I think we can expand on this
No description provided.