
Address cphc service_checker issue in sidecar container#25851

Open
FengPan-Frank wants to merge 1 commit into sonic-net:master from FengPan-Frank:fix_health_checker

Conversation

FengPan-Frank (Contributor) commented Mar 3, 2026

Why I did it

Maintain both container_checker and service_checker in ASYNC mode from master.

Work item tracking
  • Microsoft ADO (number only):

How I did it

Use telemetry-sidecar to pack corresponding script.

How to verify it

Before the change, CPHC returned the failure shown in a screenshot (not reproduced here). After the change, CPHC works well.

Which release branch to backport (provide reason below if selected)

  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Copilot AI review requested due to automatic review settings March 3, 2026 04:48
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Pull request overview

This PR aims to address system-health / service_checker behavior when running in the telemetry sidecar (KubeSonic context) by syncing an updated service_checker.py from the sidecar container onto the host and adjusting container selection logic.

Changes:

  • Add a KubeSonic container whitelist to bypass service/container health checks for specific containers.
  • Ship service_checker.py into the telemetry-sidecar image and sync it onto the host with branch-conditional selection.
  • Extend telemetry-sidecar systemd_stub.py and its CLI plugin tests to cover the new sync item.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Summary per file:
  • src/system-health/health_checker/service_checker.py — Adds a container whitelist and changes how running containers are discovered/filtered.
  • rules/scripts.mk — Exposes service_checker.py as a build-copyable script input.
  • rules/docker-telemetry-sidecar.mk — Includes service_checker.py in telemetry-sidecar copied files.
  • dockers/docker-telemetry-sidecar/systemd_stub.py — Syncs service_checker.py into the host health_checker package; adds a post-copy restart action; branch-conditional source selection.
  • dockers/docker-telemetry-sidecar/systemd_scripts/service_checker.py_202411 — Adds a branch-specific service_checker.py payload for syncing onto 202411 hosts.
  • dockers/docker-telemetry-sidecar/cli-plugin-tests/test_systemd_stub.py — Updates tests/fixtures for the new sync item and branch-conditional behavior.
  • dockers/docker-telemetry-sidecar/Dockerfile.j2 — Copies the staged service_checker.py into the image for syncing.
Comments suppressed due to low confidence (1)

src/system-health/health_checker/service_checker.py:108

  • CONTAINER_K8S_WHITELIST is applied unconditionally to the FEATURE table, so containers like telemetry/restapi will never be considered "expected running" even on non-KubeSonic/non-K8s images. This also makes the existing if container_name == "telemetry" special-case unreachable. Consider gating this whitelist behind an explicit KubeSonic/K8s enablement check (or removing telemetry from the unconditional set) so classic deployments don’t lose health coverage.
        container_list = []
        for container_name in feature_table.keys():
            # Skip containers in the whitelist
            if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                logger.log_debug("Skipping whitelisted kubesonic managed container '{}' from expected running check".format(container_name))
                continue
            # skip frr_bmp since it's not container just bmp option used by bgpd
            if container_name == "frr_bmp":
                continue
            # slim image does not have telemetry container and corresponding docker image
            if container_name == "telemetry":
                ret = check_docker_image("docker-sonic-telemetry")

Comment on lines 162 to 173
        for ctr in lst:
            # Check if this is a Kubernetes-managed container
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            dtype = labels.get("io.kubernetes.docker.type")
            kname = labels.get("io.kubernetes.container.name")

            if ns == "sonic":
                # Kubernetes-managed container - add service name to running containers
                # but skip critical process checking (k8s has its own health mechanisms)
                if dtype == "container" and kname and kname not in ("<no value>", "POD"):
                    running_containers.add(kname)

            if ns:
                continue

            # Regular Docker container - use the container name
            # Skip kubesonic managed containers in the whitelist
            if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(ctr.name)
Copilot AI Mar 3, 2026

get_current_running_containers() now continues for any container with io.kubernetes.pod.namespace set, which drops the previously-supported behavior of mapping k8s-managed containers to their service name (e.g., io.kubernetes.container.name). This breaks existing unit tests in src/system-health/tests/test_system_health.py that assert k8s containers like snmp/database appear in running_containers, and can also cause false "not running" alerts if a FEATURE is enabled but the container is k8s-managed. Restore adding the k8s container name to running_containers (while still skipping critical process inspection), and then apply the whitelist only to the intended subset.

        ["sudo", "systemctl", "daemon-reload"],
        ["sudo", "systemctl", "restart", "monit"],
    ],
    "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py": [
Copilot AI Mar 3, 2026

POST_COPY_ACTIONS hardcodes the service_checker destination path as a string even though HOST_SERVICE_CHECKER is defined below. Using HOST_SERVICE_CHECKER as the dict key avoids drift if the destination path ever changes (and keeps tests/logic aligned).

Suggested change
    "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py": [
    HOST_SERVICE_CHECKER: [

Comment on lines +197 to +199
        # For 202412 and other branches, use the default container_checker
        container_checker_src = "/usr/share/sonic/systemd_scripts/container_checker"
        # For 202412 and other branches, use the default service_checker
Copilot AI Mar 3, 2026
The comment says "For 202412 and other branches" but _get_branch_name() can return values like master, internal, or private as well. Consider updating the comment to match the actual branching logic (i.e., "non-202411 branches").

Suggested change
        # For 202412 and other branches, use the default container_checker
        container_checker_src = "/usr/share/sonic/systemd_scripts/container_checker"
        # For 202412 and other branches, use the default service_checker
        # For non-202411 branches, use the default container_checker
        container_checker_src = "/usr/share/sonic/systemd_scripts/container_checker"
        # For non-202411 branches, use the default service_checker


    def check_by_monit(self, config):
        """
        et and analyze the output of $CHECK_CMD, collect status for file system or customize checker if any.
Copilot AI Mar 3, 2026

Docstring typo: "et and analyze" looks like it should be "Get and analyze" (and this string is user-facing documentation for the method).

Suggested change
        et and analyze the output of $CHECK_CMD, collect status for file system or customize checker if any.
        Get and analyze the output of $CHECK_CMD, collect status for file system or customize checker if any.


Copilot AI review requested due to automatic review settings March 3, 2026 06:06

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

@@ -0,0 +1,238 @@

Copilot AI Mar 3, 2026

The first line of this script is blank, so the shebang is not on line 1. If this file is copied to /bin/container_checker and executed by Monit as a program, it will fail to execute (exec format error). Remove the leading blank line so the shebang is the first line of the file.

Suggested change

Comment on lines +99 to +100
            # Skip containers in the whitelist
            if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
Copilot AI Mar 3, 2026

CONTAINER_K8S_WHITELIST includes "telemetry", so this early-continue makes the later telemetry-specific handling in the same loop unreachable and disables telemetry health checking even on non-KubeSonic (non-K8s) deployments. Consider applying this whitelist only when the container is actually Kubernetes-managed, or remove "telemetry" from the unconditional name-based whitelist.

Suggested change
            # Skip containers in the whitelist
            if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
            # Skip containers in the whitelist, except telemetry which has special handling below
            if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST and container_name != "telemetry":

        self.reset()
        self.check_by_monit(config)
        self.check_services(config)
        swsscommon.events_deinit_publisher(self.events_handle)
Copilot AI Mar 3, 2026

This module initializes an events publisher handle in init but deinitializes it at the end of check(). ServiceChecker instances are typically long-lived (created once by HealthCheckerManager and reused), so after the first check the handle will be deinitialized and subsequent publish_events() calls may fail. Align with the in-tree ServiceChecker pattern (init/deinit the publisher inside publish_events(), or keep the handle open for the object lifetime and deinit only on shutdown).

Suggested change
        swsscommon.events_deinit_publisher(self.events_handle)

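A sketch of the lifecycle the review comment above proposes: scope the events publisher handle to each publish_events() call instead of tearing it down at the end of check(). The init/publish/deinit callables stand in for swsscommon's events_init_publisher / event_publish / events_deinit_publisher, whose exact signatures are assumptions here:

```python
class EventsPublisher:
    """Illustrative wrapper; not the PR's ServiceChecker."""

    def __init__(self, init, publish, deinit):
        self._init, self._publish, self._deinit = init, publish, deinit

    def publish_events(self, tag, params):
        # Open a fresh handle per call instead of holding one across checks.
        handle = self._init("sonic-events-host")
        try:
            self._publish(handle, tag, params)
        finally:
            # The handle is always released, so a long-lived checker can
            # call publish_events() any number of times safely.
            self._deinit(handle)
```

The alternative the comment also allows is keeping one handle for the object's lifetime and deinitializing only on shutdown; either way the handle must outlive every publish.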
        self.reset()
        self.check_by_monit(config)
        self.check_services(config)
        swsscommon.events_deinit_publisher(self.events_handle)
Copilot AI Mar 3, 2026

This module initializes an events publisher handle in init but deinitializes it at the end of check(). ServiceChecker instances are typically long-lived (created once by HealthCheckerManager and reused), so after the first check the handle will be deinitialized and subsequent publish_events() calls may fail. Align with the in-tree ServiceChecker pattern (init/deinit the publisher inside publish_events(), or keep the handle open for the object lifetime and deinit only on shutdown).

Suggested change
        swsscommon.events_deinit_publisher(self.events_handle)

        self.reset()
        self.check_by_monit(config)
        self.check_services(config)
        swsscommon.events_deinit_publisher(self.events_handle)
Copilot AI Mar 3, 2026

This module initializes an events publisher handle in init but deinitializes it at the end of check(). ServiceChecker instances are typically long-lived (created once by HealthCheckerManager and reused), so after the first check the handle will be deinitialized and subsequent publish_events() calls may fail. Align with the in-tree ServiceChecker pattern (init/deinit the publisher inside publish_events(), or keep the handle open for the object lifetime and deinit only on shutdown).

Suggested change
        swsscommon.events_deinit_publisher(self.events_handle)


Copilot AI review requested due to automatic review settings March 3, 2026 11:06

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (1)

src/system-health/health_checker/service_checker.py:122

  • The whitelist includes telemetry, but the loop checks the whitelist before the existing if container_name == "telemetry" handling (slim-image / gnmi fallback). As written, the telemetry-specific logic is unreachable and telemetry will never be part of expected_running_containers, even on non-Kubernetes images where telemetry is a regular Docker container. Please gate the whitelist on a positive KubeSonic/K8s signal (or remove telemetry from the unconditional whitelist) so existing telemetry/gnmi checking behavior is preserved when appropriate.
        for container_name in feature_table.keys():
            # Skip containers in the whitelist
            if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                logger.log_debug("Skipping whitelisted kubesonic managed container '{}' from expected running check".format(container_name))
                continue
            # skip frr_bmp since it's not container just bmp option used by bgpd
            if container_name == "frr_bmp":
                continue
            # slim image does not have telemetry container and corresponding docker image
            if container_name == "telemetry":
                ret = check_docker_image("docker-sonic-telemetry")
                if not ret:
                    # If telemetry container image is not present, check gnmi container image
                    # If gnmi container image is not present, ignore telemetry container check
                    # if gnmi container image is present, check gnmi container instead of telemetry
                    ret = check_docker_image("docker-sonic-gnmi")
                    if not ret:
                        logger.log_debug("Ignoring telemetry container check on image which has no corresponding docker image")
                    else:
                        container_list.append("gnmi")
                    continue
            # Some platforms may not include the OTEL container; skip expecting it when image absent
            if container_name == "otel":
                if not check_docker_image("docker-sonic-otel"):
                    logger.log_debug("Ignoring otel container check on image which has no corresponding docker image")

Comment on lines 125 to +186
    ],
    "/bin/container_checker": [
        ["sudo", "systemctl", "daemon-reload"],
        ["sudo", "systemctl", "restart", "monit"],
    ],
    "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py": [
        ["sudo", "systemctl", "restart", "system-health"],
    ],
}


def _ensure_user_auth_cert() -> None:
    cur = db_hget("GNMI|gnmi", "user_auth")
    if cur != "cert":
        if db_hset("GNMI|gnmi", "user_auth", "cert"):
            logger.log_notice(f"Set GNMI|gnmi.user_auth=cert (was: {cur or '<unset>'})")
        else:
            logger.log_error("Failed to set GNMI|gnmi.user_auth=cert")


def _ensure_cname_present(cname: str) -> None:
    if not cname:
        logger.log_warning("TELEMETRY_CLIENT_CNAME not set; skip CNAME creation")
        return

    key = f"GNMI_CLIENT_CERT|{cname}"
    entry = db_hgetall(key)
    if not entry:
        if db_hset(key, "role", GNMI_CLIENT_ROLE):
            logger.log_notice(f"Created {key} with role={GNMI_CLIENT_ROLE}")
        else:
            logger.log_error(f"Failed to create {key}")


def _ensure_cname_absent(cname: str) -> None:
    if not cname:
        return
    key = f"GNMI_CLIENT_CERT|{cname}"
    if db_hgetall(key):
        if db_del(key):
            logger.log_notice(f"Removed {key}")
        else:
            logger.log_error(f"Failed to remove {key}")


def reconcile_config_db_once() -> None:
    """
    Idempotent drift-correction for CONFIG_DB:
    - When TELEMETRY_CLIENT_CERT_VERIFY_ENABLED=true:
      * Ensure GNMI|gnmi.user_auth=cert
      * Ensure GNMI_CLIENT_CERT|<CNAME> exists with role=<GNMI_CLIENT_ROLE>
    - When false: ensure the CNAME row is absent
    """
    if GNMI_VERIFY_ENABLED:
        _ensure_user_auth_cert()
        _ensure_cname_present(GNMI_CLIENT_CNAME)
    else:
        _ensure_cname_absent(GNMI_CLIENT_CNAME)

# Host destination for service_checker.py
HOST_SERVICE_CHECKER = "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py"

Copilot AI Mar 3, 2026

POST_COPY_ACTIONS hardcodes the host destination string for service_checker.py, while HOST_SERVICE_CHECKER is defined separately a few lines later and ensure_sync() uses the constant. This duplication is easy to drift (path changes, python minor version changes, etc.). Consider defining HOST_SERVICE_CHECKER before POST_COPY_ACTIONS and using it as the dict key, so post-copy restarts always track the actual sync destination.

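A sketch of the ordering the comment above proposes, with paths and commands taken from the PR's snippet; whether systemd_stub.py can reorder its module-level definitions this way is an assumption:

```python
# Define the destination constant first, then key POST_COPY_ACTIONS on it,
# so the post-copy restart always tracks the actual sync destination.
HOST_SERVICE_CHECKER = (
    "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py"
)

POST_COPY_ACTIONS = {
    "/bin/container_checker": [
        ["sudo", "systemctl", "daemon-reload"],
        ["sudo", "systemctl", "restart", "monit"],
    ],
    # Keying on the constant avoids drift if the path ever changes.
    HOST_SERVICE_CHECKER: [
        ["sudo", "systemctl", "restart", "system-health"],
    ],
}
```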
Comment on lines +130 to 205
    "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py": [
        ["sudo", "systemctl", "restart", "system-health"],
    ],
}

# Host destination for service_checker.py
HOST_SERVICE_CHECKER = "/usr/local/lib/python3.11/dist-packages/health_checker/service_checker.py"
def ensure_sync() -> bool:
    branch_name = _get_branch_name()

    if branch_name in ("202411", "202412", "202505"):
        # For 202411/202412/202505 branches, use the branch-specific container_checker
        container_checker_src = f"/usr/share/sonic/systemd_scripts/container_checker_{branch_name}"
        # For 202411/202412/202505 branches, use the branch-specific service_checker
        service_checker_src = f"/usr/share/sonic/systemd_scripts/service_checker.py_{branch_name}"
    else:
        # For other branches, use the default container_checker
        container_checker_src = "/usr/share/sonic/systemd_scripts/container_checker"
        # For other branches, use the default service_checker
        service_checker_src = "/usr/share/sonic/systemd_scripts/service_checker.py"

    items: List[SyncItem] = SYNC_ITEMS + [
        SyncItem(container_checker_src, "/bin/container_checker"),
        SyncItem(service_checker_src, HOST_SERVICE_CHECKER),
    ]
Copilot AI Mar 3, 2026

The sync destination is hardcoded to /usr/local/lib/python3.11/dist-packages/.... This will break if the host image uses a different Python minor version (e.g., bullseye python3.9) or if SONiC moves to python3.12+. Consider resolving the host site-packages path dynamically (e.g., via python3 -c 'import site; ...' under nsenter) or using a version-agnostic path if available.

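A sketch of the version-agnostic resolution the comment above suggests. `sysconfig.get_path('purelib')` is standard library; whether it resolves to the host's dist-packages depends on the image's (Debian-patched) Python, and running the query through nsenter against the host is an assumption of this sketch:

```python
import sysconfig

def service_checker_dest(purelib: str) -> str:
    # Join the resolved site/dist-packages directory with the package path,
    # instead of hardcoding python3.11 in the destination string.
    return purelib + "/health_checker/service_checker.py"

# On the host this would be resolved via something like
#   nsenter -t 1 -m python3 -c "import sysconfig; print(sysconfig.get_path('purelib'))"
# (illustrative invocation); locally we can query this interpreter directly.
local_example = service_checker_dest(sysconfig.get_path("purelib"))
```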
"""When branch is not 202411, ensure_sync uses the default service_checker.py."""
ss_mod, container_fs, host_fs, commands, config_db = ss

# _get_branch_name already returns "202412" from fixture default
Copilot AI Mar 3, 2026

This comment is now inaccurate because the fixture was changed to make _get_branch_name() return "master" by default. Please update it so future readers don’t assume the default path is exercised via 202412.

Suggested change
# _get_branch_name already returns "202412" from fixture default
# _get_branch_name already returns "master" (a non-202411 branch) from the fixture by default

Comment on lines 357 to 430
@patch('swsscommon.swsscommon.ConfigDBConnector.connect', MagicMock())
@patch('health_checker.service_checker.ServiceChecker._get_container_folder', MagicMock(return_value=test_path))
@patch('sonic_py_common.multi_asic.is_multi_asic', MagicMock(return_value=False))
@patch('docker.DockerClient')
@patch('health_checker.utils.run_command')
@patch('swsscommon.swsscommon.ConfigDBConnector')
def test_service_checker_k8s_containers(mock_config_db, mock_run, mock_docker_client):
    """Test that service checker skips Kubernetes-managed containers (namespace=sonic)"""
    setup()
    mock_db_data = MagicMock()
    mock_get_table = MagicMock()
    mock_db_data.get_table = mock_get_table
    mock_config_db.return_value = mock_db_data
    mock_get_table.return_value = {
        'snmp': {
            'state': 'enabled',
            'has_global_scope': 'True',
            'has_per_asic_scope': 'False',
        },
        'restapi': {
            'state': 'enabled',
            'has_global_scope': 'True',
            'has_per_asic_scope': 'False',
        }
    }

    # Mock Kubernetes containers with labels
    mock_containers = MagicMock()
    mock_snmp_container = MagicMock()
    mock_snmp_container.name = 'k8s_snmp_snmp-pod-test_sonic_12345678-1234-1234-1234-123456789abc_0'
    mock_snmp_container.labels = {
        'io.kubernetes.pod.namespace': 'sonic',
        'io.kubernetes.docker.type': 'container',
        'io.kubernetes.container.name': 'snmp'
    }

    mock_restapi_container = MagicMock()
    mock_restapi_container.name = 'k8s_restapi_restapi-pod-test_sonic_87654321-4321-4321-4321-cba987654321_0'
    mock_restapi_container.labels = {
        'io.kubernetes.pod.namespace': 'sonic',
        'io.kubernetes.docker.type': 'container',
        'io.kubernetes.container.name': 'restapi'
    }

    # Mock POD container (should also be skipped)
    mock_pod_container = MagicMock()
    mock_pod_container.name = 'k8s_POD_snmp-pod-test_sonic_12345678-1234-1234-1234-123456789abc_0'
    mock_pod_container.labels = {
        'io.kubernetes.pod.namespace': 'sonic',
        'io.kubernetes.docker.type': 'container',
        'io.kubernetes.container.name': 'POD'
    }

    mock_containers.list = MagicMock(return_value=[mock_snmp_container, mock_restapi_container, mock_pod_container])
    mock_docker_client_object = MagicMock()
    mock_docker_client.return_value = mock_docker_client_object
    mock_docker_client_object.containers = mock_containers

    mock_run.return_value = mock_supervisorctl_output

    checker = ServiceChecker()
    config = Config()
    checker.check(config)

    # Verify all K8s containers (namespace=sonic) are excluded from running containers
    running_containers = checker.get_current_running_containers()
    assert 'snmp' not in running_containers
    assert 'restapi' not in running_containers
    assert 'POD' not in running_containers

    # Verify k8s containers are NOT added to critical processes
    assert 'snmp' not in checker.container_critical_processes
    assert 'restapi' not in checker.container_critical_processes

Copilot AI Mar 3, 2026

This test sets FEATURE snmp as enabled but uses only Kubernetes-managed containers (namespace sonic). With the current implementation, checker.check() will still consider snmp expected but never "running" (since K8s containers are skipped), so the health result can be NOT_OK even though the test passes. Add assertions on checker._info (or adjust the code) to ensure enabled K8s containers are handled as intended and the test can’t pass while the checker is actually flagging services down.

Comment on lines +161 to +171
            # Check if this is a Kubernetes-managed container
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            if ns == "sonic":
                continue
            # Skip kubesonic managed containers in the whitelist
            if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(ctr.name)
            if ctr.name not in self.container_critical_processes:
                self.fill_critical_process_by_container(ctr.name)
Copilot AI Mar 3, 2026

Like the master ServiceChecker, this branch-specific script skips all containers with io.kubernetes.pod.namespace == "sonic". If any enabled FEATURE is Kubernetes-managed but not in CONTAINER_K8S_WHITELIST, it will be treated as "not running". Consider adding K8s containers to running_containers via io.kubernetes.container.name (and only skipping specific whitelisted names) to avoid false negatives.

Suggested change
            # Check if this is a Kubernetes-managed container
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            if ns == "sonic":
                continue
            # Skip kubesonic managed containers in the whitelist
            if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(ctr.name)
            if ctr.name not in self.container_critical_processes:
                self.fill_critical_process_by_container(ctr.name)
            # Determine if this is a Kubernetes-managed container in the 'sonic' namespace
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            k8s_container_name = labels.get("io.kubernetes.container.name")
            if ns == "sonic" and k8s_container_name:
                # For sonic Kubernetes containers, use the Kubernetes container name
                container_name = k8s_container_name
                # Skip kubesonic managed containers in the whitelist
                if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                    continue
            else:
                # For non-sonic or non-Kubernetes containers, fall back to the Docker name
                container_name = ctr.name
                # Skip kubesonic managed containers in the whitelist
                if container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                    continue
            running_containers.add(container_name)
            if container_name not in self.container_critical_processes:
                self.fill_critical_process_by_container(container_name)

Comment on lines +165 to +174
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            if ns == "sonic":
                continue
            # Skip kubesonic managed containers in the whitelist
            if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(ctr.name)
            if ctr.name not in self.container_critical_processes:
                self.fill_critical_process_by_container(ctr.name)
Copilot AI Mar 3, 2026

Like the master ServiceChecker, this branch-specific script skips all containers with io.kubernetes.pod.namespace == "sonic". If any enabled FEATURE is Kubernetes-managed but not in CONTAINER_K8S_WHITELIST, it will be treated as "not running". Consider adding K8s containers to running_containers via io.kubernetes.container.name (and only skipping specific whitelisted names) to avoid false negatives.

Suggested change
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            if ns == "sonic":
                continue
            # Skip kubesonic managed containers in the whitelist
            if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(ctr.name)
            if ctr.name not in self.container_critical_processes:
                self.fill_critical_process_by_container(ctr.name)
            labels = getattr(ctr, "labels", {}) or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            k8s_name = labels.get("io.kubernetes.container.name")
            # Determine the logical container name to use for service tracking
            container_name = k8s_name if k8s_name else ctr.name
            # Skip specific Kubernetes-managed containers in the SONiC namespace that are whitelisted
            if ns == "sonic" and container_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            # For non-Kubernetes (or non-SONiC-namespace) containers, preserve existing whitelist behavior
            if ns != "sonic" and ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(container_name)
            if container_name not in self.container_critical_processes:
                self.fill_critical_process_by_container(container_name)

Comment on lines +1 to +3

#!/usr/bin/env python3

Copilot AI Mar 3, 2026

This script has a blank first line before the shebang. If /bin/container_checker is executed directly (e.g., by Monit), the kernel won’t see a valid #! interpreter line and execution can fail with Exec format error. Remove the leading blank line so the shebang is on line 1.

Comment on lines +164 to +174
            # Check if this is a Kubernetes-managed container
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            if ns == "sonic":
                continue
            # Skip kubesonic managed containers in the whitelist
            if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(ctr.name)
            if ctr.name not in self.container_critical_processes:
                self.fill_critical_process_by_container(ctr.name)
Copilot AI Mar 3, 2026

Like the master ServiceChecker, this branch-specific script skips all containers with io.kubernetes.pod.namespace == "sonic". If any enabled FEATURE is Kubernetes-managed but not in CONTAINER_K8S_WHITELIST, it will be treated as "not running". Consider adding K8s containers to running_containers via io.kubernetes.container.name (and only skipping specific whitelisted names) to avoid false negatives.

Suggested change
            # Check if this is a Kubernetes-managed container
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            if ns == "sonic":
                continue
            # Skip kubesonic managed containers in the whitelist
            if ctr.name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(ctr.name)
            if ctr.name not in self.container_critical_processes:
                self.fill_critical_process_by_container(ctr.name)
            # Determine effective container name, taking Kubernetes metadata into account
            labels = ctr.labels or {}
            ns = labels.get("io.kubernetes.pod.namespace")
            k8s_container_name = labels.get("io.kubernetes.container.name")
            if ns == "sonic":
                # For Kubernetes-managed SONiC pods, prefer the K8s container name
                effective_name = k8s_container_name or ctr.name
            else:
                effective_name = ctr.name
            # Skip kubesonic managed containers in the whitelist
            if effective_name in ServiceChecker.CONTAINER_K8S_WHITELIST:
                continue
            running_containers.add(effective_name)
            if effective_name not in self.container_critical_processes:
                self.fill_critical_process_by_container(effective_name)

@@ -0,0 +1,238 @@


why don't we need 202411 container_checker patch?

FengPan-Frank (Contributor Author) replied:

202411 container_checker has been added in previous PR #25598

COPY ["systemd_stub.py", "/usr/bin/"]
COPY ["systemd_scripts/", "/usr/share/sonic/systemd_scripts/"]
COPY ["files/container_checker", "/usr/share/sonic/systemd_scripts/container_checker"]
COPY ["files/service_checker.py", "/usr/share/sonic/systemd_scripts/service_checker.py"]

shouldn't we copy all the versions of container_checker and service_checker into the docker image?

FengPan-Frank (Contributor Author) replied:

only master change requires this copy, other version container_checker/service_checker are created in systemd_scripts repo directly.

@FengPan-Frank FengPan-Frank requested a review from make1980 March 3, 2026 22:04
@FengPan-Frank
Contributor Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
