Commit 380f4ec

Merge pull request #26 from GDC-ConsumerEdge/on-failure
Add onFailure property to ignore failures
2 parents 61d8c17 + 7a6aa28 commit 380f4ec

10 files changed: +245 -32 lines changed

README.md

Lines changed: 31 additions & 1 deletion
@@ -41,6 +41,7 @@ spec:
 Cluster Health Validator allows customization for which platform and workload health checks are performed. This is specified as part of the ConfigMap as part of the deployment.

 ```
+---
 apiVersion: v1
 kind: ConfigMap
 metadata:
@@ -87,11 +88,40 @@ Below details the health check modules available as part of the solution, with s
 | CheckGoogleGroupRBAC | Checks that Google Group RBAC has been enabled | |
 | CheckRobinCluster | Checks RobinCluster Health | |
 | CheckRootSyncs | Checks that RootSyncs are synced and have completed reconciling | |
-| CheckVMRuntime | Checks that VMruntime is Ready, without any preflight failure | |
+| CheckVMRuntime | Checks that VMRuntime is Ready, without any preflight failure | |
 | CheckVirtualMachines | Checks that the expected # of VMs are in a Running State | **namespace**: namespace to run check against <br > **count**: (Optional) expected # of VMs |
 | CheckDataVolumes | Checks that the expected # of Data Volumes are 100% imported and ready | **namespace**: namespace to run check against <br > **count**: (Optional) expected # of DVs |
 | CheckHttpEndpoints | Checks that a list of HTTP endpoints are reachable and return a successful status code | **endpoints**: A list of HTTP endpoints to check. Each endpoint has the following parameters: <ul><li> **name**: The name of the endpoint </li><li> **url**: The URL of the endpoint </li><li> **timeout**: (Optional) The timeout in seconds for the request </li><li> **method**: (Optional) The HTTP method to use (e.g. 'GET', 'POST') </li></ul> |

+### on_failure property
+
+Each health check module supports an `on_failure` property that allows you to control the behavior of the health check when it fails. The `on_failure` property can be set to one of two values:
+
+- `fail` (default): If the health check fails, the entire group of checks (platform or workload) will be considered failed.
+- `ignore`: If the health check fails, the failure will be logged and tracked in metrics, but it will not affect the overall health status of the group.
+
+This is useful for non-critical health checks that you want to monitor but not have affect the overall health status.
+
+Example:
+
+```yaml
+platform_checks:
+- name: Node Health
+  module: CheckNodes
+- name: Robin Cluster Health
+  module: CheckRobinCluster
+  on_failure: ignore
+
+workload_checks:
+- name: VM Workloads Health
+  module: CheckVirtualMachines
+  parameters:
+    namespace: vm-workloads
+  on_failure: fail
+- name: VM Disk Health
+  module: CheckVirtualMachineDisks
+  on_failure: ignore
+```

 ## Building the image
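Reviewer note: the `on_failure` semantics described in the README hunk can be sketched as a small pure function. This is an illustration only, not code from this commit; `CheckResult` and `group_failures` are hypothetical names, and the sample data reuses the check names from the README example.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Hypothetical record pairing a check's config with its outcome."""
    name: str
    healthy: bool
    on_failure: str = "fail"  # README default when the property is omitted


def group_failures(results):
    """Return the names of failed checks that count against their group."""
    return [r.name for r in results if not r.healthy and r.on_failure == "fail"]


results = [
    CheckResult("Node Health", healthy=True),
    CheckResult("Robin Cluster Health", healthy=False, on_failure="ignore"),
    CheckResult("VM Workloads Health", healthy=False, on_failure="fail"),
]
failed = group_failures(results)
print(failed)
```

Only the failing check whose `on_failure` is `fail` ends up in the list; the ignored failure is dropped from the group result.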

app/app.py

Lines changed: 31 additions & 25 deletions
@@ -83,53 +83,59 @@ def create_health_check_cr():


 def run_checks():
+    app_config = read_config()
     global health_check_cr
     if not health_check_cr:
         health_check_cr = HealthCheck()

-    platform_checks = []
-    workload_checks = []
-
-    app_config = read_config()
-
-    for check in app_config.platform_checks:
-        if "parameters" in check:
-            platform_checks.append(
-                health_check_map[check["module"]](check["parameters"])
-            )
+    platform_checks_with_configs = []
+    for check_config in app_config.platform_checks:
+        check_class = health_check_map[check_config["module"]]
+        if "parameters" in check_config:
+            instance = check_class(check_config["parameters"])
         else:
-            platform_checks.append(health_check_map[check["module"]]())
-
-    for check in app_config.workload_checks:
-        if "parameters" in check:
-            workload_checks.append(
-                health_check_map[check["module"]](check["parameters"])
-            )
+            instance = check_class()
+        platform_checks_with_configs.append((instance, check_config))
+
+    workload_checks_with_configs = []
+    for check_config in app_config.workload_checks:
+        check_class = health_check_map[check_config["module"]]
+        if "parameters" in check_config:
+            instance = check_class(check_config["parameters"])
         else:
-            workload_checks.append(health_check_map[check["module"]]())
+            instance = check_class()
+        workload_checks_with_configs.append((instance, check_config))

     with concurrent.futures.ThreadPoolExecutor(max_workers=_MAX_WORKERS) as executor:
         platform_checks_futures = {
-            executor.submit(check.is_healthy): check.__class__.__name__
-            for check in platform_checks
+            executor.submit(check.is_healthy): config
+            for check, config in platform_checks_with_configs
         }
         workload_checks_futures = {
-            executor.submit(check.is_healthy): check.__class__.__name__
-            for check in workload_checks
+            executor.submit(check.is_healthy): config
+            for check, config in workload_checks_with_configs
         }

         def wait_on_futures(futures):
             checks_failed = []
             for future in concurrent.futures.as_completed(futures):
-                name = futures[future]
+                config = futures[future]
+                name = config["name"]
+                on_failure = config.get("on_failure", "fail")
                 try:
                     if not future.result():
-                        checks_failed.append(name)
+                        if on_failure == "fail":
+                            checks_failed.append(name)
+                        else:
+                            logging.info(f"Check '{name}' failed but is set to be ignored.")
                 # Handling k8s resource not found here as it is not
                 # handled in the individual checks.
                 except ApiException as e:
                     if e.status == 404:
-                        checks_failed.append(name)
+                        if on_failure == "fail":
+                            checks_failed.append(name)
+                        else:
+                            logging.info(f"Check '{name}' failed with 404 but is set to be ignored.")
                     else:
                         raise
             return checks_failed
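Reviewer note: the change above keys each future by its config entry instead of by class name, so that `name` and `on_failure` are both available when the future completes. The following is a minimal, self-contained sketch of that pattern under stated assumptions: `run_group` is a hypothetical helper, plain callables stand in for the `is_healthy` methods, and the `ApiException` 404 handling from `app.py` is left out.

```python
import concurrent.futures
import logging


def run_group(checks_with_configs, max_workers=4):
    """Run (callable, config) pairs concurrently; return names that fail the group."""
    failed = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Futures are hashable (dicts are not), so the future is the key
        # and the config dict is the value.
        futures = {
            executor.submit(check): config for check, config in checks_with_configs
        }
        for future in concurrent.futures.as_completed(futures):
            config = futures[future]
            if future.result():
                continue  # healthy check, nothing to record
            if config.get("on_failure", "fail") == "fail":
                failed.append(config["name"])
            else:
                logging.info("Check '%s' failed but is set to be ignored.", config["name"])
    return failed


checks = [
    (lambda: True, {"name": "Node Health", "module": "CheckNodes"}),
    (lambda: False, {"name": "Robin Cluster Health", "module": "CheckRobinCluster",
                     "on_failure": "ignore"}),
    (lambda: False, {"name": "VM Workloads Health", "module": "CheckVirtualMachines"}),
]
failed_names = run_group(checks)
print(failed_names)
```

A side effect visible in the diff is that failures are now reported under the configured `name` rather than the module class name, which is why the tests assert on the `name` values.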

app/config.py

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ class HealthCheck(TypedDict):
     name: str
     module: str
     parameters: NotRequired[dict] = {}
+    on_failure: NotRequired[str] = "fail"


 class Config(BaseModel):
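Reviewer note: judging from `test_on_failure_property` in `app/test_config.py`, which asserts that the key is absent when omitted, the effective default appears to come from `config.get("on_failure", "fail")` in `app/app.py` rather than from the `= "fail"` written in the TypedDict. Because the field is typed as `str` and `app.py` treats every value other than `"fail"` as ignore, a misspelled value would silently ignore failures. The sketch below shows one possible guard; `HealthCheckEntry`, `VALID_ON_FAILURE`, and `resolve_on_failure` are hypothetical names and are not part of this commit.

```python
try:
    from typing import NotRequired, TypedDict  # Python 3.11+
except ImportError:  # older interpreters need the typing_extensions package
    from typing_extensions import NotRequired, TypedDict

VALID_ON_FAILURE = ("fail", "ignore")


class HealthCheckEntry(TypedDict):
    """Hypothetical stand-in for the HealthCheck TypedDict in app/config.py."""
    name: str
    module: str
    parameters: NotRequired[dict]
    on_failure: NotRequired[str]


def resolve_on_failure(entry):
    """Return the effective on_failure value, rejecting unknown strings."""
    value = entry.get("on_failure", "fail")
    if value not in VALID_ON_FAILURE:
        raise ValueError(
            f"{entry['name']}: on_failure must be one of {VALID_ON_FAILURE}, got {value!r}"
        )
    return value


entry = HealthCheckEntry(name="Node Health", module="CheckNodes")
print("on_failure" in entry, resolve_on_failure(entry))
```

An alternative with the same effect would be to annotate the field with `Literal["fail", "ignore"]` and let the config model validate it at load time; whether that fits the project's existing validation setup is not something this diff shows.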

app/test_app.py

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
+import unittest
+from unittest.mock import MagicMock, patch
+import sys
+
+from prometheus_client import REGISTRY
+
+class TestApp(unittest.TestCase):
+
+    def setUp(self):
+        self.load_config_patcher = patch('kubernetes.config.load_config')
+        self.mock_load_config = self.load_config_patcher.start()
+
+        self.apiextensions_v1_api_patcher = patch('kubernetes.client.ApiextensionsV1Api')
+        self.mock_apiextensions_v1_api = self.apiextensions_v1_api_patcher.start()
+
+        self.custom_objects_api_patcher = patch('kubernetes.client.CustomObjectsApi')
+        self.mock_custom_objects_api = self.custom_objects_api_patcher.start()
+
+        # import app after patch
+        import app
+        self.app = app
+
+        self.create_health_check_cr_patcher = patch('app.create_health_check_cr')
+        self.mock_create_health_check_cr = self.create_health_check_cr_patcher.start()
+
+    def tearDown(self):
+        self.load_config_patcher.stop()
+        self.apiextensions_v1_api_patcher.stop()
+        self.custom_objects_api_patcher.stop()
+        self.create_health_check_cr_patcher.stop()
+        # Unregister metrics to prevent duplicate metric error
+        for metric in ['platform_health', 'workload_health']:
+            if metric in REGISTRY._names_to_collectors:
+                REGISTRY.unregister(REGISTRY._names_to_collectors[metric])
+
+    @patch('app.read_config')
+    @patch('app.health_check_cr')
+    @patch('app.health_check_map')
+    def test_run_checks_onfailure_ignore(self, mock_health_check_map, mock_health_check_cr, mock_read_config):
+        # Mock config
+        from config import Config
+        mock_config = Config(
+            platform_checks=[
+                {
+                    "name": "CheckNodes",
+                    "module": "CheckNodes",
+                    "on_failure": "ignore"
+                },
+                {
+                    "name": "CheckRobinCluster",
+                    "module": "CheckRobinCluster",
+                    "on_failure": "fail"
+                }
+            ],
+            workload_checks=[]
+        )
+        mock_read_config.return_value = mock_config
+
+        # Mock health check modules
+        mock_check_nodes_class = MagicMock()
+        mock_check_nodes_instance = mock_check_nodes_class.return_value
+        mock_check_nodes_instance.is_healthy.return_value = False # Fails
+
+        mock_check_robin_cluster_class = MagicMock()
+        mock_check_robin_cluster_instance = mock_check_robin_cluster_class.return_value
+        mock_check_robin_cluster_instance.is_healthy.return_value = False # Fails
+
+        mock_health_check_map.__getitem__.side_effect = lambda key: {
+            "CheckNodes": mock_check_nodes_class,
+            "CheckRobinCluster": mock_check_robin_cluster_class
+        }[key]
+
+        # Run the checks
+        self.app.run_checks()
+
+        # Assertions
+        mock_health_check_cr.update_status.assert_called_once_with(
+            ["CheckRobinCluster"], []
+        )
+
+    @patch('app.read_config')
+    @patch('app.health_check_cr')
+    @patch('app.health_check_map')
+    def test_run_checks_onfailure_fail(self, mock_health_check_map, mock_health_check_cr, mock_read_config):
+        # Mock config
+        from config import Config
+        mock_config = Config(
+            platform_checks=[
+                {
+                    "name": "CheckNodes",
+                    "module": "CheckNodes",
+                    "on_failure": "fail"
+                }
+            ],
+            workload_checks=[]
+        )
+        mock_read_config.return_value = mock_config
+
+        # Mock health check modules
+        mock_check_nodes_class = MagicMock()
+        mock_check_nodes_instance = mock_check_nodes_class.return_value
+        mock_check_nodes_instance.is_healthy.return_value = False # Fails
+
+        mock_health_check_map.__getitem__.side_effect = lambda key: {
+            "CheckNodes": mock_check_nodes_class
+        }[key]
+
+        # Run the checks
+        self.app.run_checks()
+
+        # Assertions
+        mock_health_check_cr.update_status.assert_called_once_with(
+            ["CheckNodes"], []
+        )
+
+    @patch('app.read_config')
+    @patch('app.health_check_cr')
+    @patch('app.health_check_map')
+    def test_run_checks_onfailure_default(self, mock_health_check_map, mock_health_check_cr, mock_read_config):
+        # Mock config
+        from config import Config
+        mock_config = Config(
+            platform_checks=[
+                {
+                    "name": "CheckNodes",
+                    "module": "CheckNodes"
+                }
+            ],
+            workload_checks=[]
+        )
+        mock_read_config.return_value = mock_config
+
+        # Mock health check modules
+        mock_check_nodes_class = MagicMock()
+        mock_check_nodes_instance = mock_check_nodes_class.return_value
+        mock_check_nodes_instance.is_healthy.return_value = False # Fails
+
+        mock_health_check_map.__getitem__.side_effect = lambda key: {
+            "CheckNodes": mock_check_nodes_class
+        }[key]
+
+        # Run the checks
+        self.app.run_checks()
+
+        # Assertions
+        mock_health_check_cr.update_status.assert_called_once_with(
+            ["CheckNodes"], []
+        )
+
+if __name__ == "__main__":
+    unittest.main()

app/test_config.py

Lines changed: 10 additions & 1 deletion
@@ -32,4 +32,13 @@ def test_optional_module_parameters(self):
         result = read_config()
         self.assertEqual(len(result.platform_checks), 4)
         self.assertEqual(len(result.workload_checks), 2)
-
+
+    def test_on_failure_property(self):
+        os.environ["APP_CONFIG_PATH"] = "testdata/onfailure_config.yaml"
+        result = read_config()
+        self.assertEqual(result.platform_checks[1]["on_failure"], "ignore")
+        self.assertEqual(result.workload_checks[0]["on_failure"], "fail")
+        self.assertEqual(result.workload_checks[1]["on_failure"], "ignore")
+        # Check default value
+        self.assertNotIn("on_failure", result.platform_checks[0])
+

app/testdata/onfailure_config.yaml

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+platform_checks:
+- name: Node Health
+  module: CheckNodes
+- name: Robin Cluster Health
+  module: CheckRobinCluster
+  on_failure: ignore
+
+workload_checks:
+- name: VM Workloads Health
+  module: CheckVirtualMachines
+  parameters:
+    namespace: vm-workloads
+  on_failure: fail
+- name: VM Disk Health
+  module: CheckVirtualMachineDisks
+  on_failure: ignore

base/kustomization.yaml

Lines changed: 1 addition & 1 deletion
@@ -11,4 +11,4 @@ resources:

 images:
 - name: ghcr.io/gdc-consumeredge/cluster-health-validator/cluster-health-validator
-  newTag: "v1.2.0"
+  newTag: "v1.2.1"

build.sh

Lines changed: 2 additions & 2 deletions
@@ -18,9 +18,9 @@ function display_common() {

 function build_container() {
   if [[ -f ".npmrc" ]]; then
-    docker build -f "${DOCKERFILE}" -t "${APP}:${VERSION}" . --secret id=npmrc,src=./.npmrc
+    docker build -f "${DOCKERFILE}" -t "${APP}:${VERSION}" .
   else
-    docker build -f "${DOCKERFILE}" -t "${APP}:${VERSION}" . --secret id=npmrc,src=./.npmrc
+    docker build -f "${DOCKERFILE}" -t "${APP}:${VERSION}" .
   fi

   if [[ $? -ne 0 ]]; then

config/default/gdc-clusterdefault-generated.yaml

Lines changed: 1 addition & 1 deletion
@@ -217,7 +217,7 @@ spec:
           value: INFO
         - name: APP_CONFIG_PATH
           value: /config/config.yaml
-        image: ghcr.io/gdc-consumeredge/cluster-health-validator/cluster-health-validator:v1.2.0
+        image: ghcr.io/gdc-consumeredge/cluster-health-validator/cluster-health-validator:v1.2.1
         livenessProbe:
           httpGet:
             path: /health

package/gdc-cluster-health-pkg-default.yaml

Lines changed: 1 addition & 1 deletion
@@ -217,7 +217,7 @@ spec:
           value: INFO
         - name: APP_CONFIG_PATH
           value: /config/config.yaml
-        image: ghcr.io/gdc-consumeredge/cluster-health-validator/cluster-health-validator:v1.2.0
+        image: ghcr.io/gdc-consumeredge/cluster-health-validator/cluster-health-validator:v1.2.1
         livenessProbe:
           httpGet:
             path: /health
