
Address cphc service_checker issue in sidecar container#25851

Open
FengPan-Frank wants to merge 1 commit into sonic-net:master from FengPan-Frank:fix_health_checker

Conversation

FengPan-Frank (Contributor) commented Mar 3, 2026

Why I did it

Maintain both container_checker and service_checker in ASYNC mode from master.

Work item tracking
  • Microsoft ADO (number only):

How I did it

Use telemetry-sidecar to pack corresponding script.

How to verify it

Before the change, CPHC returned the failure shown in a screenshot (not reproduced here). After the change, CPHC works well.

Which release branch to backport (provide reason below if selected)

  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Copilot AI review requested due to automatic review settings March 3, 2026 04:48
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Pull request overview

This PR aims to address system-health / service_checker behavior when running in the telemetry sidecar (KubeSonic context) by syncing an updated service_checker.py from the sidecar container onto the host and adjusting container selection logic.

Changes:

  • Add a KubeSonic container whitelist to bypass service/container health checks for specific containers.
  • Ship service_checker.py into the telemetry-sidecar image and sync it onto the host with branch-conditional selection.
  • Extend telemetry-sidecar systemd_stub.py and its CLI plugin tests to cover the new sync item.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Summary per file:
  • src/system-health/health_checker/service_checker.py — Adds a container whitelist and changes how running containers are discovered/filtered.
  • rules/scripts.mk — Exposes service_checker.py as a build-copyable script input.
  • rules/docker-telemetry-sidecar.mk — Includes service_checker.py in telemetry-sidecar copied files.
  • dockers/docker-telemetry-sidecar/systemd_stub.py — Syncs service_checker.py into the host health_checker package; adds a post-copy restart action; branch-conditional source selection.
  • dockers/docker-telemetry-sidecar/systemd_scripts/service_checker.py_202411 — Adds a branch-specific service_checker.py payload for syncing onto 202411 hosts.
  • dockers/docker-telemetry-sidecar/cli-plugin-tests/test_systemd_stub.py — Updates tests/fixtures for the new sync item and branch-conditional behavior.
  • dockers/docker-telemetry-sidecar/Dockerfile.j2 — Copies the staged service_checker.py into the image for syncing.
Comments suppressed due to low confidence (1)

src/system-health/health_checker/service_checker.py:108

  • CONTAINER_K8S_WHITELIST is applied unconditionally to the FEATURE table, so containers like telemetry/restapi will never be considered "expected running" even on non-KubeSonic/non-K8s images. This also makes the existing if container_name == "telemetry" special-case unreachable. Consider gating this whitelist behind an explicit KubeSonic/K8s enablement check (or removing telemetry from the unconditional set) so classic deployments don’t lose health coverage.
        container_list = []
        for container_name in feature_table.keys():
            # Skip containers in the whitelist
            if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                logger.log_debug("Skipping whitelisted kubesonic managed container '{}' from expected running check".format(container_name))
                continue
            # skip frr_bmp since it's not container just bmp option used by bgpd
            if container_name == "frr_bmp":
                continue
            # slim image does not have telemetry container and corresponding docker image
            if container_name == "telemetry":
                ret = check_docker_image("docker-sonic-telemetry")

Comment on lines 162 to 173
        for ctr in lst:
            # Check if this is a Kubernetes-managed container
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            dtype = labels.get("io.kubernetes.docker.type")
            kname = labels.get("io.kubernetes.container.name")

            if ns == "sonic":
                # Kubernetes-managed container - add service name to running containers
                # but skip critical process checking (k8s has its own health mechanisms)
                if dtype == "container" and kname and kname not in ("<no value>", "POD"):
                    running_containers.add(kname)

            if ns:
                continue

            # Regular Docker container - use the container name
            # Skip kubesonic managed containers in the whitelist
            if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(ctr.name)
Copilot AI Mar 3, 2026

get_current_running_containers() now continues for any container with io.kubernetes.pod.namespace set, which drops the previously-supported behavior of mapping k8s-managed containers to their service name (e.g., io.kubernetes.container.name). This breaks existing unit tests in src/system-health/tests/test_system_health.py that assert k8s containers like snmp/database appear in running_containers, and can also cause false "not running" alerts if a FEATURE is enabled but the container is k8s-managed. Restore adding the k8s container name to running_containers (while still skipping critical process inspection), and then apply the whitelist only to the intended subset.

        ["sudo", "systemctl", "daemon-reload"],
        ["sudo", "systemctl", "restart", "monit"],
    ],
    "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py": [
Copilot AI Mar 3, 2026

POST_COPY_ACTIONS hardcodes the service_checker destination path as a string even though HOST_SERVICE_CHECKER is defined below. Using HOST_SERVICE_CHECKER as the dict key avoids drift if the destination path ever changes (and keeps tests/logic aligned).

Suggested change
    "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py": [
    HOST_SERVICE_CHECKER: [

Comment on lines +197 to +199
        # For 202412 and other branches, use the default container_checker
        container_checker_src = "/usr/share/sonic/systemd_scripts/container_checker"
        # For 202412 and other branches, use the default service_checker
Copilot AI Mar 3, 2026
The comment says "For 202412 and other branches" but _get_branch_name() can return values like master, internal, or private as well. Consider updating the comment to match the actual branching logic (i.e., "non-202411 branches").

Suggested change
        # For 202412 and other branches, use the default container_checker
        container_checker_src = "/usr/share/sonic/systemd_scripts/container_checker"
        # For 202412 and other branches, use the default service_checker
        # For non-202411 branches, use the default container_checker
        container_checker_src = "/usr/share/sonic/systemd_scripts/container_checker"
        # For non-202411 branches, use the default service_checker


    def check_by_monit(self, config):
        """
        et and analyze the output of $CHECK_CMD, collect status for file system or customize checker if any.
Copilot AI Mar 3, 2026

Docstring typo: "et and analyze" looks like it should be "Get and analyze" (and this string is user-facing documentation for the method).

Suggested change
        et and analyze the output of $CHECK_CMD, collect status for file system or customize checker if any.
        Get and analyze the output of $CHECK_CMD, collect status for file system or customize checker if any.


Copilot AI review requested due to automatic review settings March 3, 2026 06:06

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

@@ -0,0 +1,238 @@

Copilot AI Mar 3, 2026

The first line of this script is blank, so the shebang is not on line 1. If this file is copied to /bin/container_checker and executed by Monit as a program, it will fail to execute (exec format error). Remove the leading blank line so the shebang is the first line of the file.

Suggested change

Comment on lines +99 to +100
            # Skip containers in the whitelist
            if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
Copilot AI Mar 3, 2026

CONTAINER_K8S_WHITELIST includes "telemetry", so this early-continue makes the later telemetry-specific handling in the same loop unreachable and disables telemetry health checking even on non-KubeSonic (non-K8s) deployments. Consider applying this whitelist only when the container is actually Kubernetes-managed, or remove "telemetry" from the unconditional name-based whitelist.

Suggested change
            # Skip containers in the whitelist
            if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
            # Skip containers in the whitelist, except telemetry which has special handling below
            if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST and container_name != "telemetry":

        self.reset()
        self.check_by_monit(config)
        self.check_services(config)
        swsscommon.events_deinit_publisher(self.events_handle)
Copilot AI Mar 3, 2026

This module initializes an events publisher handle in init but deinitializes it at the end of check(). ServiceChecker instances are typically long-lived (created once by HealthCheckerManager and reused), so after the first check the handle will be deinitialized and subsequent publish_events() calls may fail. Align with the in-tree ServiceChecker pattern (init/deinit the publisher inside publish_events(), or keep the handle open for the object lifetime and deinit only on shutdown).

Suggested change
        swsscommon.events_deinit_publisher(self.events_handle)

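A sketch of the lifecycle the review comment above proposes: scope the events publisher handle to each publish_events() call instead of tearing it down at the end of check(). The init/publish/deinit callables stand in for swsscommon's events_init_publisher / event_publish / events_deinit_publisher, whose exact signatures are assumptions here:

```python
class EventsPublisher:
    """Illustrative wrapper; not the PR's ServiceChecker."""

    def __init__(self, init, publish, deinit):
        self._init, self._publish, self._deinit = init, publish, deinit

    def publish_events(self, tag, params):
        # Open a fresh handle per call instead of holding one across checks.
        handle = self._init("sonic-events-host")
        try:
            self._publish(handle, tag, params)
        finally:
            # The handle is always released, so a long-lived checker can
            # call publish_events() any number of times safely.
            self._deinit(handle)
```

The alternative the comment also allows is keeping one handle for the object's lifetime and deinitializing only on shutdown; either way the handle must outlive every publish.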
        self.reset()
        self.check_by_monit(config)
        self.check_services(config)
        swsscommon.events_deinit_publisher(self.events_handle)
Copilot AI Mar 3, 2026

This module initializes an events publisher handle in init but deinitializes it at the end of check(). ServiceChecker instances are typically long-lived (created once by HealthCheckerManager and reused), so after the first check the handle will be deinitialized and subsequent publish_events() calls may fail. Align with the in-tree ServiceChecker pattern (init/deinit the publisher inside publish_events(), or keep the handle open for the object lifetime and deinit only on shutdown).

Suggested change
        swsscommon.events_deinit_publisher(self.events_handle)

        self.reset()
        self.check_by_monit(config)
        self.check_services(config)
        swsscommon.events_deinit_publisher(self.events_handle)
Copilot AI Mar 3, 2026

This module initializes an events publisher handle in init but deinitializes it at the end of check(). ServiceChecker instances are typically long-lived (created once by HealthCheckerManager and reused), so after the first check the handle will be deinitialized and subsequent publish_events() calls may fail. Align with the in-tree ServiceChecker pattern (init/deinit the publisher inside publish_events(), or keep the handle open for the object lifetime and deinit only on shutdown).

Suggested change
        swsscommon.events_deinit_publisher(self.events_handle)


Copilot AI review requested due to automatic review settings March 3, 2026 11:06

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (1)

src/system-health/health_checker/service_checker.py:122

  • The whitelist includes telemetry, but the loop checks the whitelist before the existing if container_name == "telemetry" handling (slim-image / gnmi fallback). As written, the telemetry-specific logic is unreachable and telemetry will never be part of expected_running_containers, even on non-Kubernetes images where telemetry is a regular Docker container. Please gate the whitelist on a positive KubeSonic/K8s signal (or remove telemetry from the unconditional whitelist) so existing telemetry/gnmi checking behavior is preserved when appropriate.
        for container_name in feature_table.keys():
            # Skip containers in the whitelist
            if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                logger.log_debug("Skipping whitelisted kubesonic managed container '{}' from expected running check".format(container_name))
                continue
            # skip frr_bmp since it's not container just bmp option used by bgpd
            if container_name == "frr_bmp":
                continue
            # slim image does not have telemetry container and corresponding docker image
            if container_name == "telemetry":
                ret = check_docker_image("docker-sonic-telemetry")
                if not ret:
                    # If telemetry container image is not present, check gnmi container image
                    # If gnmi container image is not present, ignore telemetry container check
                    # if gnmi container image is present, check gnmi container instead of telemetry
                    ret = check_docker_image("docker-sonic-gnmi")
                    if not ret:
                        logger.log_debug("Ignoring telemetry container check on image which has no corresponding docker image")
                    else:
                        container_list.append("gnmi")
                    continue
            # Some platforms may not include the OTEL container; skip expecting it when image absent
            if container_name == "otel":
                if not check_docker_image("docker-sonic-otel"):
                    logger.log_debug("Ignoring otel container check on image which has no corresponding docker image")

Comment on lines 125 to +186
    ],
    "/bin/container_checker": [
        ["sudo", "systemctl", "daemon-reload"],
        ["sudo", "systemctl", "restart", "monit"],
    ],
    "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py": [
        ["sudo", "systemctl", "restart", "system-health"],
    ],
}


def _ensure_user_auth_cert() -> None:
    cur = db_hget("GNMI|gnmi", "user_auth")
    if cur != "cert":
        if db_hset("GNMI|gnmi", "user_auth", "cert"):
            logger.log_notice(f"Set GNMI|gnmi.user_auth=cert (was: {cur or '<unset>'})")
        else:
            logger.log_error("Failed to set GNMI|gnmi.user_auth=cert")


def _ensure_cname_present(cname: str) -> None:
    if not cname:
        logger.log_warning("TELEMETRY_CLIENT_CNAME not set; skip CNAME creation")
        return

    key = f"GNMI_CLIENT_CERT|{cname}"
    entry = db_hgetall(key)
    if not entry:
        if db_hset(key, "role", GNMI_CLIENT_ROLE):
            logger.log_notice(f"Created {key} with role={GNMI_CLIENT_ROLE}")
        else:
            logger.log_error(f"Failed to create {key}")


def _ensure_cname_absent(cname: str) -> None:
    if not cname:
        return
    key = f"GNMI_CLIENT_CERT|{cname}"
    if db_hgetall(key):
        if db_del(key):
            logger.log_notice(f"Removed {key}")
        else:
            logger.log_error(f"Failed to remove {key}")


def reconcile_config_db_once() -> None:
    """
    Idempotent drift-correction for CONFIG_DB:
    - When TELEMETRY_CLIENT_CERT_VERIFY_ENABLED=true:
      * Ensure GNMI|gnmi.user_auth=cert
      * Ensure GNMI_CLIENT_CERT|<CNAME> exists with role=<GNMI_CLIENT_ROLE>
    - When false: ensure the CNAME row is absent
    """
    if GNMI_VERIFY_ENABLED:
        _ensure_user_auth_cert()
        _ensure_cname_present(GNMI_CLIENT_CNAME)
    else:
        _ensure_cname_absent(GNMI_CLIENT_CNAME)

# Host destination for service_checker.py
HOST_SERVICE_CHECKER = "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py"

Copilot AI Mar 3, 2026

POST_COPY_ACTIONS hardcodes the host destination string for service_checker.py, while HOST_SERVICE_CHECKER is defined separately a few lines later and ensure_sync() uses the constant. This duplication is easy to drift (path changes, python minor version changes, etc.). Consider defining HOST_SERVICE_CHECKER before POST_COPY_ACTIONS and using it as the dict key, so post-copy restarts always track the actual sync destination.

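A sketch of the ordering the comment above proposes, with paths and commands taken from the PR's snippet; whether systemd_stub.py can reorder its module-level definitions this way is an assumption:

```python
# Define the destination constant first, then key POST_COPY_ACTIONS on it,
# so the post-copy restart always tracks the actual sync destination.
HOST_SERVICE_CHECKER = (
    "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py"
)

POST_COPY_ACTIONS = {
    "/bin/container_checker": [
        ["sudo", "systemctl", "daemon-reload"],
        ["sudo", "systemctl", "restart", "monit"],
    ],
    # Keying on the constant avoids drift if the path ever changes.
    HOST_SERVICE_CHECKER: [
        ["sudo", "systemctl", "restart", "system-health"],
    ],
}
```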
Comment on lines +130 to 205
    "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py": [
        ["sudo", "systemctl", "restart", "system-health"],
    ],
}

# Host destination for service_checker.py
HOST_SERVICE_CHECKER = "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py"
def ensure_sync() -> bool:
    branch_name = _get_branch_name()

    if branch_name in ("202411", "202412", "202505"):
        # For 202411/202412/202505 branches, use the branch-specific container_checker
        container_checker_src = f"/usr/share/sonic/systemd_scripts/container_checker_{branch_name}"
        # For 202411/202412/202505 branches, use the branch-specific service_checker
        service_checker_src = f"/usr/share/sonic/systemd_scripts/service_checker.py_{branch_name}"
    else:
        # For other branches, use the default container_checker
        container_checker_src = "/usr/share/sonic/systemd_scripts/container_checker"
        # For other branches, use the default service_checker
        service_checker_src = "/usr/share/sonic/systemd_scripts/service_checker.py"

    items: List[SyncItem] = SYNC_ITEMS + [
        SyncItem(container_checker_src, "/bin/container_checker"),
        SyncItem(service_checker_src, HOST_SERVICE_CHECKER),
    ]
Copilot AI Mar 3, 2026

The sync destination is hardcoded to /usr/local/lib/python3.11/dist-packages/.... This will break if the host image uses a different Python minor version (e.g., bullseye python3.9) or if SONiC moves to python3.12+. Consider resolving the host site-packages path dynamically (e.g., via python3 -c 'import site; ...' under nsenter) or using a version-agnostic path if available.

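A sketch of the version-agnostic resolution the comment above suggests. `sysconfig.get_path('purelib')` is standard library; whether it resolves to the host's dist-packages depends on the image's (Debian-patched) Python, and running the query through nsenter against the host is an assumption of this sketch:

```python
import sysconfig

def service_checker_dest(purelib: str) -> str:
    # Join the resolved site/dist-packages directory with the package path,
    # instead of hardcoding python3.11 in the destination string.
    return purelib + "/health_checker/service_checker.py"

# On the host this would be resolved via something like
#   nsenter -t 1 -m python3 -c "import sysconfig; print(sysconfig.get_path('purelib'))"
# (illustrative invocation); locally we can query this interpreter directly.
local_example = service_checker_dest(sysconfig.get_path("purelib"))
```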
"""When branch is not 202411, ensure_sync uses the default service_checker.py."""
ss_mod, container_fs, host_fs, commands, config_db = ss

# _get_branch_name already returns "202412" from fixture default
Copilot AI Mar 3, 2026

This comment is now inaccurate because the fixture was changed to make _get_branch_name() return "master" by default. Please update it so future readers don’t assume the default path is exercised via 202412.

Suggested change
# _get_branch_name already returns "202412" from fixture default
# _get_branch_name already returns "master" (a non-202411 branch) from the fixture by default

Comment on lines 357 to 430
@patch('swsscommon.swsscommon.ConfigDBConnector.connect', MagicMock())
@patch('health_checker.service_checker.ServiceChecker._get_container_folder', MagicMock(return_value=test_path))
@patch('sonic_py_common.multi_asic.is_multi_asic', MagicMock(return_value=False))
@patch('docker.DockerClient')
@patch('health_checker.utils.run_command')
@patch('swsscommon.swsscommon.ConfigDBConnector')
def test_service_checker_k8s_containers(mock_config_db, mock_run, mock_docker_client):
    """Test that service checker skips Kubernetes-managed containers (namespace=sonic)"""
    setup()
    mock_db_data = MagicMock()
    mock_get_table = MagicMock()
    mock_db_data.get_table = mock_get_table
    mock_config_db.return_value = mock_db_data
    mock_get_table.return_value = {
        'snmp': {
            'state': 'enabled',
            'has_global_scope': 'True',
            'has_per_asic_scope': 'False',
        },
        'restapi': {
            'state': 'enabled',
            'has_global_scope': 'True',
            'has_per_asic_scope': 'False',
        }
    }

    # Mock Kubernetes containers with labels
    mock_containers = MagicMock()
    mock_snmp_container = MagicMock()
    mock_snmp_container.name = 'k8s_snmp_snmp-pod-test_sonic_12345678-1234-1234-1234-123456789abc_0'
    mock_snmp_container.labels = {
        'io.kubernetes.pod.namespace': 'sonic',
        'io.kubernetes.docker.type': 'container',
        'io.kubernetes.container.name': 'snmp'
    }

    mock_restapi_container = MagicMock()
    mock_restapi_container.name = 'k8s_restapi_restapi-pod-test_sonic_87654321-4321-4321-4321-cba987654321_0'
    mock_restapi_container.labels = {
        'io.kubernetes.pod.namespace': 'sonic',
        'io.kubernetes.docker.type': 'container',
        'io.kubernetes.container.name': 'restapi'
    }

    # Mock POD container (should also be skipped)
    mock_pod_container = MagicMock()
    mock_pod_container.name = 'k8s_POD_snmp-pod-test_sonic_12345678-1234-1234-1234-123456789abc_0'
    mock_pod_container.labels = {
        'io.kubernetes.pod.namespace': 'sonic',
        'io.kubernetes.docker.type': 'container',
        'io.kubernetes.container.name': 'POD'
    }

    mock_containers.list = MagicMock(return_value=[mock_snmp_container, mock_restapi_container, mock_pod_container])
    mock_docker_client_object = MagicMock()
    mock_docker_client.return_value = mock_docker_client_object
    mock_docker_client_object.containers = mock_containers

    mock_run.return_value = mock_supervisorctl_output

    checker = ServiceChecker()
    config = Config()
    checker.check(config)

    # Verify all K8s containers (namespace=sonic) are excluded from running containers
    running_containers = checker.get_current_running_containers()
    assert 'snmp' not in running_containers
    assert 'restapi' not in running_containers
    assert 'POD' not in running_containers

    # Verify k8s containers are NOT added to critical processes
    assert 'snmp' not in checker.container_critical_processes
    assert 'restapi' not in checker.container_critical_processes

Copilot AI Mar 3, 2026

This test sets FEATURE snmp as enabled but uses only Kubernetes-managed containers (namespace sonic). With the current implementation, checker.check() will still consider snmp expected but never "running" (since K8s containers are skipped), so the health result can be NOT_OK even though the test passes. Add assertions on checker._info (or adjust the code) to ensure enabled K8s containers are handled as intended and the test can’t pass while the checker is actually flagging services down.

Comment on lines +161 to +171
            # Check if this is a Kubernetes-managed container
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            if ns == "sonic":
                continue
            # Skip kubesonic managed containers in the whitelist
            if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(ctr.name)
            if ctr.name not in self.container_critical_processes:
                self.fill_critical_process_by_container(ctr.name)
Copilot AI Mar 3, 2026

Like the master ServiceChecker, this branch-specific script skips all containers with io.kubernetes.pod.namespace == "sonic". If any enabled FEATURE is Kubernetes-managed but not in CONTAINER_K8S_WHITELIST, it will be treated as "not running". Consider adding K8s containers to running_containers via io.kubernetes.container.name (and only skipping specific whitelisted names) to avoid false negatives.

Suggested change
            # Check if this is a Kubernetes-managed container
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            if ns == "sonic":
                continue
            # Skip kubesonic managed containers in the whitelist
            if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(ctr.name)
            if ctr.name not in self.container_critical_processes:
                self.fill_critical_process_by_container(ctr.name)
            # Determine if this is a Kubernetes-managed container in the 'sonic' namespace
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            k8s_container_name = labels.get("io.kubernetes.container.name")
            if ns == "sonic" and k8s_container_name:
                # For sonic Kubernetes containers, use the Kubernetes container name
                container_name = k8s_container_name
                # Skip kubesonic managed containers in the whitelist
                if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                    continue
            else:
                # For non-sonic or non-Kubernetes containers, fall back to the Docker name
                container_name = ctr.name
                # Skip kubesonic managed containers in the whitelist
                if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                    continue
            running_containers.add(container_name)
            if container_name not in self.container_critical_processes:
                self.fill_critical_process_by_container(container_name)

Comment on lines +165 to +174
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            if ns == "sonic":
                continue
            # Skip kubesonic managed containers in the whitelist
            if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(ctr.name)
            if ctr.name not in self.container_critical_processes:
                self.fill_critical_process_by_container(ctr.name)
Copilot AI Mar 3, 2026

Like the master ServiceChecker, this branch-specific script skips all containers with io.kubernetes.pod.namespace == "sonic". If any enabled FEATURE is Kubernetes-managed but not in CONTAINER_K8S_WHITELIST, it will be treated as "not running". Consider adding K8s containers to running_containers via io.kubernetes.container.name (and only skipping specific whitelisted names) to avoid false negatives.

Suggested change
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            if ns == "sonic":
                continue
            # Skip kubesonic managed containers in the whitelist
            if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(ctr.name)
            if ctr.name not in self.container_critical_processes:
                self.fill_critical_process_by_container(ctr.name)
            labels = getattr(ctr, "labels", {}) or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            k8s_name = labels.get("io.kubernetes.container.name")
            # Determine the logical container name to use for service tracking
            container_name = k8s_name if k8s_name else ctr.name
            # Skip specific Kubernetes-managed containers in the SONiC namespace that are whitelisted
            if ns == "sonic" and container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            # For non-Kubernetes (or non-SONiC-namespace) containers, preserve existing whitelist behavior
            if ns != "sonic" and ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(container_name)
            if container_name not in self.container_critical_processes:
                self.fill_critical_process_by_container(container_name)

Comment on lines +1 to +3

#!/usr/bin/env python3

Copilot AI Mar 3, 2026

This script has a blank first line before the shebang. If /bin/container_checker is executed directly (e.g., by Monit), the kernel won’t see a valid #! interpreter line and execution can fail with Exec format error. Remove the leading blank line so the shebang is on line 1.

Comment on lines +164 to +174
            # Check if this is a Kubernetes-managed container
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            if ns == "sonic":
                continue
            # Skip kubesonic managed containers in the whitelist
            if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(ctr.name)
            if ctr.name not in self.container_critical_processes:
                self.fill_critical_process_by_container(ctr.name)
Copilot AI Mar 3, 2026

Like the master ServiceChecker, this branch-specific script skips all containers with io.kubernetes.pod.namespace == "sonic". If any enabled FEATURE is Kubernetes-managed but not in CONTAINER_K8S_WHITELIST, it will be treated as "not running". Consider adding K8s containers to running_containers via io.kubernetes.container.name (and only skipping specific whitelisted names) to avoid false negatives.

Suggested change
            # Check if this is a Kubernetes-managed container
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            if ns == "sonic":
                continue
            # Skip kubesonic managed containers in the whitelist
            if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(ctr.name)
            if ctr.name not in self.container_critical_processes:
                self.fill_critical_process_by_container(ctr.name)
            # Determine effective container name, taking Kubernetes metadata into account
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            k8s_container_name = labels.get("io.kubernetes.container.name")
            if ns == "sonic":
                # For Kubernetes-managed SONiC pods, prefer the K8s container name
                effective_name = k8s_container_name or ctr.name
            else:
                effective_name = ctr.name
            # Skip kubesonic managed containers in the whitelist
            if effective_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(effective_name)
            if effective_name not in self.container_critical_processes:
                self.fill_critical_process_by_container(effective_name)

@@ -0,0 +1,238 @@


why don't we need 202411 container_checker patch?

FengPan-Frank (Contributor Author) replied:

202411 container_checker has been added in previous PR #25598

COPY ["systemd_stub.py", "/usr/bin/"]
COPY ["systemd_scripts/", "/usr/share/sonic/systemd_scripts/"]
COPY ["files/container_checker", "/usr/share/sonic/systemd_scripts/container_checker"]
COPY ["files/service_checker.py", "/usr/share/sonic/systemd_scripts/service_checker.py"]

shouldn't we copy all the versions of container_checker and service_checker into the docker image?

FengPan-Frank (Contributor Author) replied:

only master change requires this copy, other version container_checker/service_checker are created in systemd_scripts repo directly.

@FengPan-Frank FengPan-Frank requested a review from make1980 March 3, 2026 22:04
@FengPan-Frank
Contributor Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
