Address cphc service_checker issue in sidecar container #25851
FengPan-Frank wants to merge 1 commit into sonic-net:master from
Conversation
/azp run Azure.sonic-buildimage

Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
This PR aims to address system-health / service_checker behavior when running in the telemetry sidecar (KubeSonic context) by syncing an updated service_checker.py from the sidecar container onto the host and adjusting container selection logic.
Changes:
- Add a KubeSonic container whitelist to bypass service/container health checks for specific containers.
- Ship service_checker.py into the telemetry-sidecar image and sync it onto the host with branch-conditional selection.
- Extend telemetry-sidecar systemd_stub.py and its CLI plugin tests to cover the new sync item.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/system-health/health_checker/service_checker.py | Adds a container whitelist and changes how running containers are discovered/filtered. |
| rules/scripts.mk | Exposes service_checker.py as a build-copyable script input. |
| rules/docker-telemetry-sidecar.mk | Includes service_checker.py in telemetry-sidecar copied files. |
| dockers/docker-telemetry-sidecar/systemd_stub.py | Syncs service_checker.py into host health_checker package; adds post-copy restart action; branch-conditional source selection. |
| dockers/docker-telemetry-sidecar/systemd_scripts/service_checker.py_202411 | Adds a branch-specific service_checker.py payload for syncing onto 202411 hosts. |
| dockers/docker-telemetry-sidecar/cli-plugin-tests/test_systemd_stub.py | Updates tests/fixtures for the new sync item and branch-conditional behavior. |
| dockers/docker-telemetry-sidecar/Dockerfile.j2 | Copies the staged service_checker.py into the image for syncing. |
Comments suppressed due to low confidence (1)
src/system-health/health_checker/service_checker.py:108
CONTAINER_K8S_WHITELIST is applied unconditionally to the FEATURE table, so containers like telemetry/restapi will never be considered "expected running" even on non-KubeSonic/non-K8s images. This also makes the existing if container_name == "telemetry" special-case unreachable. Consider gating this whitelist behind an explicit KubeSonic/K8s enablement check (or removing telemetry from the unconditional set) so classic deployments don't lose health coverage.
container_list = []
for container_name in feature_table.keys():
# Skip containers in the whitelist
if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
logger.log_debug("Skipping whitelisted kubesonic managed container '{}' from expected running check".format(container_name))
continue
# skip frr_bmp since it's not container just bmp option used by bgpd
if container_name == "frr_bmp":
continue
# slim image does not have telemetry container and corresponding docker image
if container_name == "telemetry":
ret = check_docker_image("docker-sonic-telemetry")
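A minimal sketch of the gating the reviewer suggests. The helper name `is_kubesonic_enabled` and the CONFIG_DB key it reads are assumptions for illustration; the real enablement signal would come from the device's KubeSonic configuration:

```python
# Sketch: only bypass the whitelist when KubeSonic is actually enabled.
# is_kubesonic_enabled() is a hypothetical helper; reading
# KUBERNETES_MASTER|SERVER from CONFIG_DB is an assumed signal.
CONTAINER_K8S_WHITELIST = {"telemetry", "restapi"}

def is_kubesonic_enabled(config_db):
    server = config_db.get("KUBERNETES_MASTER|SERVER", {})
    return server.get("disable", "true").lower() == "false"

def expected_containers(feature_table, config_db):
    container_list = []
    kubesonic = is_kubesonic_enabled(config_db)
    for container_name in feature_table:
        # Skip whitelisted containers only on KubeSonic-managed devices
        if kubesonic and container_name in CONTAINER_K8S_WHITELIST:
            continue
        # skip frr_bmp since it's not a container, just a bmp option used by bgpd
        if container_name == "frr_bmp":
            continue
        container_list.append(container_name)
    return container_list
```

With this shape, classic (non-K8s) images keep telemetry/restapi in the expected set, so the later telemetry/gnmi fallback logic stays reachable.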
for ctr in lst:
    # Check if this is a Kubernetes-managed container
    labels = ctr.labels or {}
    ns = labels.get("io.kubernetes.pod.namespace")
    dtype = labels.get("io.kubernetes.docker.type")
    kname = labels.get("io.kubernetes.container.name")

    if ns == "sonic":
        # Kubernetes-managed container - add service name to running containers
        # but skip critical process checking (k8s has its own health mechanisms)
        if dtype == "container" and kname and kname not in ("<no value>", "POD"):
            running_containers.add(kname)

    if ns:
        continue

    # Regular Docker container - use the container name
    # Skip kubesonic managed containers in the whitelist
    if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
        continue
    running_containers.add(ctr.name)
get_current_running_containers() now continues for any container with io.kubernetes.pod.namespace set, which drops the previously-supported behavior of mapping k8s-managed containers to their service name (e.g., io.kubernetes.container.name). This breaks existing unit tests in src/system-health/tests/test_system_health.py that assert k8s containers like snmp/database appear in running_containers, and can also cause false "not running" alerts if a FEATURE is enabled but the container is k8s-managed. Restore adding the k8s container name to running_containers (while still skipping critical process inspection), and then apply the whitelist only to the intended subset.
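The restored behavior the comment asks for can be sketched as follows. The container objects here are stand-ins for docker SDK containers (only `name` and `labels` attributes are assumed), and the helper name is hypothetical:

```python
# Sketch: map K8s-managed containers (namespace "sonic") back to their
# service name via the io.kubernetes.container.name label, while skipping
# critical-process inspection for them; regular Docker containers keep
# the existing name-based whitelist behavior.
def collect_running(containers, whitelist=frozenset()):
    running, inspect_critical = set(), set()
    for ctr in containers:
        labels = ctr.labels or {}
        ns = labels.get("io.kubernetes.pod.namespace")
        kname = labels.get("io.kubernetes.container.name")
        if ns == "sonic":
            # Add the service name; skip pause/unnamed containers
            if kname and kname not in ("<no value>", "POD") and kname not in whitelist:
                running.add(kname)
            continue
        if ctr.name in whitelist:
            continue
        running.add(ctr.name)
        inspect_critical.add(ctr.name)
    return running, inspect_critical
```

This keeps enabled K8s-managed features counted as running (avoiding false "not running" alerts) while the whitelist only removes the intended subset.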
    ["sudo", "systemctl", "daemon-reload"],
    ["sudo", "systemctl", "restart", "monit"],
],
"/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py": [
POST_COPY_ACTIONS hardcodes the service_checker destination path as a string even though HOST_SERVICE_CHECKER is defined below. Using HOST_SERVICE_CHECKER as the dict key avoids drift if the destination path ever changes (and keeps tests/logic aligned).
Suggested change:
- "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py": [
+ HOST_SERVICE_CHECKER: [
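Applying that suggestion, a minimal sketch of defining the constant first and reusing it as the dict key (surrounding entries abbreviated to the ones shown in the diff):

```python
# Define the host destination once; POST_COPY_ACTIONS reuses it as a key so
# the restart action cannot drift from the actual sync destination.
HOST_SERVICE_CHECKER = (
    "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py"
)

POST_COPY_ACTIONS = {
    "/bin/container_checker": [
        ["sudo", "systemctl", "daemon-reload"],
        ["sudo", "systemctl", "restart", "monit"],
    ],
    HOST_SERVICE_CHECKER: [
        ["sudo", "systemctl", "restart", "system-health"],
    ],
}
```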
# For 202412 and other branches, use the default container_checker
container_checker_src = "/usr/share/sonic/systemd_scripts/container_checker"
# For 202412 and other branches, use the default service_checker
The comment says "For 202412 and other branches" but _get_branch_name() can return values like master, internal, or private as well. Consider updating the comment to match the actual branching logic (i.e., "non-202411 branches").
Suggested change:
- # For 202412 and other branches, use the default container_checker
- container_checker_src = "/usr/share/sonic/systemd_scripts/container_checker"
- # For 202412 and other branches, use the default service_checker
+ # For non-202411 branches, use the default container_checker
+ container_checker_src = "/usr/share/sonic/systemd_scripts/container_checker"
+ # For non-202411 branches, use the default service_checker
def check_by_monit(self, config):
    """
    et and analyze the output of $CHECK_CMD, collect status for file system or customize checker if any.
Docstring typo: "et and analyze" looks like it should be "Get and analyze" (and this string is user-facing documentation for the method).
Suggested change:
- et and analyze the output of $CHECK_CMD, collect status for file system or customize checker if any.
+ Get and analyze the output of $CHECK_CMD, collect status for file system or customize checker if any.
Force-pushed 65597f2 to 6c6d426 (Compare)
/azp run Azure.sonic-buildimage

Azure Pipelines successfully started running 1 pipeline(s).
Force-pushed 6c6d426 to 854b78c (Compare)
/azp run Azure.sonic-buildimage

Azure Pipelines successfully started running 1 pipeline(s).
@@ -0,0 +1,238 @@
The first line of this script is blank, so the shebang is not on line 1. If this file is copied to /bin/container_checker and executed by Monit as a program, it will fail to execute (exec format error). Remove the leading blank line so the shebang is the first line of the file.
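A small check along these lines could catch this class of bug in CI; the helper name is illustrative, not part of the PR:

```python
# Sketch: verify a script's shebang is on line 1. A leading blank line means
# the kernel won't see "#!" at offset 0, and direct execution (e.g. by Monit)
# fails with "Exec format error".
def has_valid_shebang(path):
    with open(path, "rb") as f:
        return f.read(2) == b"#!"
```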
# Skip containers in the whitelist
if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
CONTAINER_K8S_WHITELIST includes "telemetry", so this early-continue makes the later telemetry-specific handling in the same loop unreachable and disables telemetry health checking even on non-KubeSonic (non-K8s) deployments. Consider applying this whitelist only when the container is actually Kubernetes-managed, or remove "telemetry" from the unconditional name-based whitelist.
Suggested change:
- # Skip containers in the whitelist
- if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
+ # Skip containers in the whitelist, except telemetry which has special handling below
+ if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST and container_name != "telemetry":
self.reset()
self.check_by_monit(config)
self.check_services(config)
swsscommon.events_deinit_publisher(self.events_handle)
This module initializes an events publisher handle in init but deinitializes it at the end of check(). ServiceChecker instances are typically long-lived (created once by HealthCheckerManager and reused), so after the first check the handle will be deinitialized and subsequent publish_events() calls may fail. Align with the in-tree ServiceChecker pattern (init/deinit the publisher inside publish_events(), or keep the handle open for the object lifetime and deinit only on shutdown).
Suggested change:
- swsscommon.events_deinit_publisher(self.events_handle)
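The lifecycle pattern the comment recommends can be sketched as below. The swsscommon calls are represented by injectable functions (assumed shapes, so the pattern is testable without the real library); the class name is hypothetical:

```python
# Sketch: open/close the events publisher inside publish_events() so a
# long-lived checker never reuses a deinitialized handle across check() calls.
class ServiceCheckerSketch:
    def __init__(self, init_pub, deinit_pub, publish):
        self._init_pub = init_pub      # stands in for swsscommon.events_init_publisher
        self._deinit_pub = deinit_pub  # stands in for swsscommon.events_deinit_publisher
        self._publish = publish        # stands in for the event publish call

    def publish_events(self, events):
        handle = self._init_pub("sonic-events-host")  # open per call
        try:
            for ev in events:
                self._publish(handle, ev)
        finally:
            self._deinit_pub(handle)                  # always release

    def check(self, config):
        # ... run checks; no deinit here, so repeated check() calls stay safe
        self.publish_events(config.get("events", []))
```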
Force-pushed 854b78c to 9f546b8 (Compare)
/azp run Azure.sonic-buildimage

Azure Pipelines successfully started running 1 pipeline(s).
Force-pushed 9f546b8 to 23a71cf (Compare)
/azp run Azure.sonic-buildimage

Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.
Comments suppressed due to low confidence (1)
src/system-health/health_checker/service_checker.py:122
- The whitelist includes telemetry, but the loop checks the whitelist before the existing if container_name == "telemetry" handling (slim-image / gnmi fallback). As written, the telemetry-specific logic is unreachable and telemetry will never be part of expected_running_containers, even on non-Kubernetes images where telemetry is a regular Docker container. Please gate the whitelist on a positive KubeSonic/K8s signal (or remove telemetry from the unconditional whitelist) so existing telemetry/gnmi checking behavior is preserved when appropriate.
for container_name in feature_table.keys():
# Skip containers in the whitelist
if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
logger.log_debug("Skipping whitelisted kubesonic managed container '{}' from expected running check".format(container_name))
continue
# skip frr_bmp since it's not container just bmp option used by bgpd
if container_name == "frr_bmp":
continue
# slim image does not have telemetry container and corresponding docker image
if container_name == "telemetry":
ret = check_docker_image("docker-sonic-telemetry")
if not ret:
# If telemetry container image is not present, check gnmi container image
# If gnmi container image is not present, ignore telemetry container check
# if gnmi container image is present, check gnmi container instead of telemetry
ret = check_docker_image("docker-sonic-gnmi")
if not ret:
logger.log_debug("Ignoring telemetry container check on image which has no corresponding docker image")
else:
container_list.append("gnmi")
continue
# Some platforms may not include the OTEL container; skip expecting it when image absent
if container_name == "otel":
if not check_docker_image("docker-sonic-otel"):
logger.log_debug("Ignoring otel container check on image which has no corresponding docker image")
| ], | ||
| "/bin/container_checker": [ | ||
| ["sudo", "systemctl", "daemon-reload"], | ||
| ["sudo", "systemctl", "restart", "monit"], | ||
| ], | ||
| "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py": [ | ||
| ["sudo", "systemctl", "restart", "system-health"], | ||
| ], | ||
| } | ||
|
|
||
|
|
||
| def _ensure_user_auth_cert() -> None: | ||
| cur = db_hget("GNMI|gnmi", "user_auth") | ||
| if cur != "cert": | ||
| if db_hset("GNMI|gnmi", "user_auth", "cert"): | ||
| logger.log_notice(f"Set GNMI|gnmi.user_auth=cert (was: {cur or '<unset>'})") | ||
| else: | ||
| logger.log_error("Failed to set GNMI|gnmi.user_auth=cert") | ||
|
|
||
|
|
||
| def _ensure_cname_present(cname: str) -> None: | ||
| if not cname: | ||
| logger.log_warning("TELEMETRY_CLIENT_CNAME not set; skip CNAME creation") | ||
| return | ||
|
|
||
| key = f"GNMI_CLIENT_CERT|{cname}" | ||
| entry = db_hgetall(key) | ||
| if not entry: | ||
| if db_hset(key, "role", GNMI_CLIENT_ROLE): | ||
| logger.log_notice(f"Created {key} with role={GNMI_CLIENT_ROLE}") | ||
| else: | ||
| logger.log_error(f"Failed to create {key}") | ||
|
|
||
|
|
||
| def _ensure_cname_absent(cname: str) -> None: | ||
| if not cname: | ||
| return | ||
| key = f"GNMI_CLIENT_CERT|{cname}" | ||
| if db_hgetall(key): | ||
| if db_del(key): | ||
| logger.log_notice(f"Removed {key}") | ||
| else: | ||
| logger.log_error(f"Failed to remove {key}") | ||
|
|
||
|
|
||
| def reconcile_config_db_once() -> None: | ||
| """ | ||
| Idempotent drift-correction for CONFIG_DB: | ||
| - When TELEMETRY_CLIENT_CERT_VERIFY_ENABLED=true: | ||
| * Ensure GNMI|gnmi.user_auth=cert | ||
| * Ensure GNMI_CLIENT_CERT|<CNAME> exists with role=<GNMI_CLIENT_ROLE> | ||
| - When false: ensure the CNAME row is absent | ||
| """ | ||
| if GNMI_VERIFY_ENABLED: | ||
| _ensure_user_auth_cert() | ||
| _ensure_cname_present(GNMI_CLIENT_CNAME) | ||
| else: | ||
| _ensure_cname_absent(GNMI_CLIENT_CNAME) | ||
|
|
||
| # Host destination for service_checker.py | ||
| HOST_SERVICE_CHECKER = "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py" | ||
|
|
POST_COPY_ACTIONS hardcodes the host destination string for service_checker.py, while HOST_SERVICE_CHECKER is defined separately a few lines later and ensure_sync() uses the constant. This duplication is easy to drift (path changes, python minor version changes, etc.). Consider defining HOST_SERVICE_CHECKER before POST_COPY_ACTIONS and using it as the dict key, so post-copy restarts always track the actual sync destination.
    "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py": [
        ["sudo", "systemctl", "restart", "system-health"],
    ],
}

# Host destination for service_checker.py
HOST_SERVICE_CHECKER = "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py"


def ensure_sync() -> bool:
    branch_name = _get_branch_name()

    if branch_name in ("202411", "202412", "202505"):
        # For 202411/202412/202505 branches, use the branch-specific container_checker
        container_checker_src = f"/usr/share/sonic/systemd_scripts/container_checker_{branch_name}"
        # For 202411/202412/202505 branches, use the branch-specific service_checker
        service_checker_src = f"/usr/share/sonic/systemd_scripts/service_checker.py_{branch_name}"
    else:
        # For other branches, use the default container_checker
        container_checker_src = "/usr/share/sonic/systemd_scripts/container_checker"
        # For other branches, use the default service_checker
        service_checker_src = "/usr/share/sonic/systemd_scripts/service_checker.py"

    items: List[SyncItem] = SYNC_ITEMS + [
        SyncItem(container_checker_src, "/bin/container_checker"),
        SyncItem(service_checker_src, HOST_SERVICE_CHECKER),
    ]
The sync destination is hardcoded to /usr/local/lib/python3.11/dist-packages/.... This will break if the host image uses a different Python minor version (e.g., bullseye python3.9) or if SONiC moves to python3.12+. Consider resolving the host site-packages path dynamically (e.g., via python3 -c 'import site; ...' under nsenter) or using a version-agnostic path if available.
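One way to resolve the destination dynamically, as the comment suggests. `sysconfig.get_path("purelib")` is a real stdlib call; running it on the host via `nsenter -t 1 -m python3 -c ...` is an assumption about the sidecar's host access, and the helper name is illustrative:

```python
import sysconfig

# Sketch: derive the package directory from the interpreter instead of
# hardcoding "python3.11". On the host, the same expression could be run as:
#   nsenter -t 1 -m python3 -c "import sysconfig; print(sysconfig.get_path('purelib'))"
def host_service_checker_dest():
    purelib = sysconfig.get_path("purelib")
    return purelib + "/health_checker/service_checker.py"
```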
"""When branch is not 202411, ensure_sync uses the default service_checker.py."""
ss_mod, container_fs, host_fs, commands, config_db = ss

# _get_branch_name already returns "202412" from fixture default
This comment is now inaccurate because the fixture was changed to make _get_branch_name() return "master" by default. Please update it so future readers don’t assume the default path is exercised via 202412.
Suggested change:
- # _get_branch_name already returns "202412" from fixture default
+ # _get_branch_name already returns "master" (a non-202411 branch) from the fixture by default
@patch('swsscommon.swsscommon.ConfigDBConnector.connect', MagicMock())
@patch('health_checker.service_checker.ServiceChecker._get_container_folder', MagicMock(return_value=test_path))
@patch('sonic_py_common.multi_asic.is_multi_asic', MagicMock(return_value=False))
@patch('docker.DockerClient')
@patch('health_checker.utils.run_command')
@patch('swsscommon.swsscommon.ConfigDBConnector')
def test_service_checker_k8s_containers(mock_config_db, mock_run, mock_docker_client):
    """Test that service checker skips Kubernetes-managed containers (namespace=sonic)"""
    setup()
    mock_db_data = MagicMock()
    mock_get_table = MagicMock()
    mock_db_data.get_table = mock_get_table
    mock_config_db.return_value = mock_db_data
    mock_get_table.return_value = {
        'snmp': {
            'state': 'enabled',
            'has_global_scope': 'True',
            'has_per_asic_scope': 'False',
        },
        'restapi': {
            'state': 'enabled',
            'has_global_scope': 'True',
            'has_per_asic_scope': 'False',
        }
    }

    # Mock Kubernetes containers with labels
    mock_containers = MagicMock()
    mock_snmp_container = MagicMock()
    mock_snmp_container.name = 'k8s_snmp_snmp-pod-test_sonic_12345678-1234-1234-1234-123456789abc_0'
    mock_snmp_container.labels = {
        'io.kubernetes.pod.namespace': 'sonic',
        'io.kubernetes.docker.type': 'container',
        'io.kubernetes.container.name': 'snmp'
    }

    mock_restapi_container = MagicMock()
    mock_restapi_container.name = 'k8s_restapi_restapi-pod-test_sonic_87654321-4321-4321-4321-cba987654321_0'
    mock_restapi_container.labels = {
        'io.kubernetes.pod.namespace': 'sonic',
        'io.kubernetes.docker.type': 'container',
        'io.kubernetes.container.name': 'restapi'
    }

    # Mock POD container (should also be skipped)
    mock_pod_container = MagicMock()
    mock_pod_container.name = 'k8s_POD_snmp-pod-test_sonic_12345678-1234-1234-1234-123456789abc_0'
    mock_pod_container.labels = {
        'io.kubernetes.pod.namespace': 'sonic',
        'io.kubernetes.docker.type': 'container',
        'io.kubernetes.container.name': 'POD'
    }

    mock_containers.list = MagicMock(return_value=[mock_snmp_container, mock_restapi_container, mock_pod_container])
    mock_docker_client_object = MagicMock()
    mock_docker_client.return_value = mock_docker_client_object
    mock_docker_client_object.containers = mock_containers

    mock_run.return_value = mock_supervisorctl_output

    checker = ServiceChecker()
    config = Config()
    checker.check(config)

    # Verify all K8s containers (namespace=sonic) are excluded from running containers
    running_containers = checker.get_current_running_containers()
    assert 'snmp' not in running_containers
    assert 'restapi' not in running_containers
    assert 'POD' not in running_containers

    # Verify k8s containers are NOT added to critical processes
    assert 'snmp' not in checker.container_critical_processes
    assert 'restapi' not in checker.container_critical_processes
This test sets FEATURE snmp as enabled but uses only Kubernetes-managed containers (namespace sonic). With the current implementation, checker.check() will still consider snmp expected but never "running" (since K8s containers are skipped), so the health result can be NOT_OK even though the test passes. Add assertions on checker._info (or adjust the code) to ensure enabled K8s containers are handled as intended and the test can’t pass while the checker is actually flagging services down.
# Check if this is a Kubernetes-managed container
labels = ctr.labels or {}
ns = labels.get("io.kubernetes.pod.namespace")
if ns == "sonic":
    continue
# Skip kubesonic managed containers in the whitelist
if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
    continue
running_containers.add(ctr.name)
if ctr.name not in self.container_critical_processes:
    self.fill_critical_process_by_container(ctr.name)
Like the master ServiceChecker, this branch-specific script skips all containers with io.kubernetes.pod.namespace == "sonic". If any enabled FEATURE is Kubernetes-managed but not in CONTAINER_K8S_WHITELIST, it will be treated as "not running". Consider adding K8s containers to running_containers via io.kubernetes.container.name (and only skipping specific whitelisted names) to avoid false negatives.
Suggested change:
- # Check if this is a Kubernetes-managed container
- labels = ctr.labels or {}
- ns = labels.get("io.kubernetes.pod.namespace")
- if ns == "sonic":
-     continue
- # Skip kubesonic managed containers in the whitelist
- if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
-     continue
- running_containers.add(ctr.name)
- if ctr.name not in self.container_critical_processes:
-     self.fill_critical_process_by_container(ctr.name)
+ # Determine if this is a Kubernetes-managed container in the 'sonic' namespace
+ labels = ctr.labels or {}
+ ns = labels.get("io.kubernetes.pod.namespace")
+ k8s_container_name = labels.get("io.kubernetes.container.name")
+ if ns == "sonic" and k8s_container_name:
+     # For sonic Kubernetes containers, use the Kubernetes container name
+     container_name = k8s_container_name
+ else:
+     # For non-sonic or non-Kubernetes containers, fall back to the Docker name
+     container_name = ctr.name
+ # Skip kubesonic managed containers in the whitelist
+ if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
+     continue
+ running_containers.add(container_name)
+ if container_name not in self.container_critical_processes:
+     self.fill_critical_process_by_container(container_name)
labels = ctr.labels or {}
ns = labels.get("io.kubernetes.pod.namespace")
if ns == "sonic":
    continue
# Skip kubesonic managed containers in the whitelist
if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
    continue
running_containers.add(ctr.name)
if ctr.name not in self.container_critical_processes:
    self.fill_critical_process_by_container(ctr.name)
Like the master ServiceChecker, this branch-specific script skips all containers with io.kubernetes.pod.namespace == "sonic". If any enabled FEATURE is Kubernetes-managed but not in CONTAINER_K8S_WHITELIST, it will be treated as "not running". Consider adding K8s containers to running_containers via io.kubernetes.container.name (and only skipping specific whitelisted names) to avoid false negatives.
Suggested change:
- labels = ctr.labels or {}
- ns = labels.get("io.kubernetes.pod.namespace")
- if ns == "sonic":
-     continue
- # Skip kubesonic managed containers in the whitelist
- if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
-     continue
- running_containers.add(ctr.name)
- if ctr.name not in self.container_critical_processes:
-     self.fill_critical_process_by_container(ctr.name)
+ labels = getattr(ctr, "labels", {}) or {}
+ ns = labels.get("io.kubernetes.pod.namespace")
+ k8s_name = labels.get("io.kubernetes.container.name")
+ # Determine the logical container name to use for service tracking
+ container_name = k8s_name if k8s_name else ctr.name
+ # Skip whitelisted Kubernetes-managed containers in the SONiC namespace
+ if ns == "sonic" and container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
+     continue
+ # For non-Kubernetes (or non-SONiC-namespace) containers, preserve existing whitelist behavior
+ if ns != "sonic" and ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
+     continue
+ running_containers.add(container_name)
+ if container_name not in self.container_critical_processes:
+     self.fill_critical_process_by_container(container_name)

#!/usr/bin/env python3
This script has a blank first line before the shebang. If /bin/container_checker is executed directly (e.g., by Monit), the kernel won’t see a valid #! interpreter line and execution can fail with Exec format error. Remove the leading blank line so the shebang is on line 1.
# Check if this is a Kubernetes-managed container
labels = ctr.labels or {}
ns = labels.get("io.kubernetes.pod.namespace")
if ns == "sonic":
    continue
# Skip kubesonic managed containers in the whitelist
if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
    continue
running_containers.add(ctr.name)
if ctr.name not in self.container_critical_processes:
    self.fill_critical_process_by_container(ctr.name)
Like the master ServiceChecker, this branch-specific script skips all containers with io.kubernetes.pod.namespace == "sonic". If any enabled FEATURE is Kubernetes-managed but not in CONTAINER_K8S_WHITELIST, it will be treated as "not running". Consider adding K8s containers to running_containers via io.kubernetes.container.name (and only skipping specific whitelisted names) to avoid false negatives.
Suggested change:
- # Check if this is a Kubernetes-managed container
- labels = ctr.labels or {}
- ns = labels.get("io.kubernetes.pod.namespace")
- if ns == "sonic":
-     continue
- # Skip kubesonic managed containers in the whitelist
- if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
-     continue
- running_containers.add(ctr.name)
- if ctr.name not in self.container_critical_processes:
-     self.fill_critical_process_by_container(ctr.name)
+ # Determine effective container name, taking Kubernetes metadata into account
+ labels = ctr.labels or {}
+ ns = labels.get("io.kubernetes.pod.namespace")
+ k8s_container_name = labels.get("io.kubernetes.container.name")
+ if ns == "sonic":
+     # For Kubernetes-managed SONiC pods, prefer the K8s container name
+     effective_name = k8s_container_name or ctr.name
+ else:
+     effective_name = ctr.name
+ # Skip kubesonic managed containers in the whitelist
+ if effective_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
+     continue
+ running_containers.add(effective_name)
+ if effective_name not in self.container_critical_processes:
+     self.fill_critical_process_by_container(effective_name)
@@ -0,0 +1,238 @@
why don't we need 202411 container_checker patch?
202411 container_checker has been added in previous PR #25598
COPY ["systemd_stub.py", "/usr/bin/"]
COPY ["systemd_scripts/", "/usr/share/sonic/systemd_scripts/"]
COPY ["files/container_checker", "/usr/share/sonic/systemd_scripts/container_checker"]
COPY ["files/service_checker.py", "/usr/share/sonic/systemd_scripts/service_checker.py"]
shouldn't we copy all the versions of container_checker and service_checker into the docker image?
Only the master change requires this copy; the other versions of container_checker/service_checker are created directly in the systemd_scripts repo.
/azpw run Azure.sonic-buildimage

/AzurePipelines run Azure.sonic-buildimage

Azure Pipelines successfully started running 1 pipeline(s).
Why I did it
Maintain both container_checker and service_checker in ASYNC mode from master.
Work item tracking
How I did it
Use telemetry-sidecar to pack corresponding script.
How to verify it
Before change CPHC returns below:
After change CPHC works well.

Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)