Grouping alarms which might be caused due to the same reason #361

@rhl-bthr

Is your feature request related to a problem? Please describe.
While debugging the alarms in FoundationDB's operator, I noticed that 266 out of 270 were raised because Acto was adding a new key named ACTOKEY under spec/processes, with a reasonably well-thought-out object as its value. In all 266 alarms, Acto was modifying values within that object, but the tests failed because the operator did not accept the name ACTOKEY and only accepted a set of predefined names.

Acto can attempt to group the failing test cases to give a better insight into the root cause of the problem.

Describe the solution you'd like
One way to do this is to view all the modifications as a large decision tree whose leaves indicate whether the value of ALARM is true or false. Acto can then do a breadth-first search over the nodes of the tree and check whether all the leaves under a node are true. If they are, all the test cases under that node can be grouped together. It should be a reasonable assumption that their alarms are likely caused by the same reason, namely the value modified at that node of the decision tree.
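
To make the idea concrete, here is a minimal sketch in Python. It is not Acto's code: `Node`, `build_tree`, `leaf_flags`, and `group_alarms` are hypothetical names, and it assumes each test case can be summarized as the path of the field it modified plus a boolean alarm flag. The field names `fieldA` and `fieldB` are placeholders.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """One field in the tree of modified paths (hypothetical structure)."""
    children: dict = field(default_factory=dict)
    results: list = field(default_factory=list)  # alarm flags of tests ending here


def build_tree(test_cases):
    """Build a prefix tree from (path, alarm) pairs."""
    root = Node()
    for path, alarm in test_cases:
        node = root
        for key in path:
            node = node.children.setdefault(key, Node())
        node.results.append(alarm)
    return root


def leaf_flags(node):
    """Collect the alarm flags of every test case at or below this node."""
    flags = list(node.results)
    for child in node.children.values():
        flags.extend(leaf_flags(child))
    return flags


def group_alarms(root):
    """Breadth-first search for the shallowest nodes whose tests all alarmed.

    Returns {path: number_of_grouped_alarms}. Alarms recorded directly on a
    node whose subtree is mixed are not reported by this simple sketch.
    """
    groups = {}
    queue = deque([((), root)])
    while queue:
        path, node = queue.popleft()
        flags = leaf_flags(node)
        if flags and all(flags):
            groups[path] = len(flags)
            continue  # everything below shares this group; do not descend
        for key, child in node.children.items():
            queue.append((path + (key,), child))
    return groups


# Placeholder data: two alarms under ACTOKEY, two passing tests elsewhere.
cases = [
    (("spec", "processes", "ACTOKEY", "fieldA"), True),
    (("spec", "processes", "ACTOKEY", "fieldB"), True),
    (("spec", "processes", "general", "fieldA"), False),
    (("spec", "version"), False),
]
print(group_alarms(build_tree(cases)))
```

With the placeholder data above, both alarms are reported as a single group rooted at `spec/processes/ACTOKEY`, which is the kind of summary that would have pointed at the rejected key directly. Recomputing `leaf_flags` at every node is quadratic in the worst case; caching the counts per node would be the obvious refinement.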

Describe alternatives you've considered
Of course, the right place to fix this would be the operator itself, which should either:

  1. Not accept an arbitrary string in spec/processes, and instead define up front all the keys it can accept, marking whichever ones it's not using as NULL, or
  2. Throw an error! Currently no error was displayed in any of the logs, which made this slightly hard to debug.

But hoping for operators to do the right thing defeats the purpose of the project ;)

Additional context
xlab-uiuc/kube-523#135
