Wait for monit monitor <service> operation to complete during config reload#4295

Open
tirupatihemanth wants to merge 1 commit intosonic-net:masterfrom
tirupatihemanth:fix_reload_monit
@tirupatihemanth tirupatihemanth commented Feb 20, 2026

Fixes sonic-net/sonic-buildimage#25599

What I did

Wait for the monit monitor operation to complete before running monit reload.

Why I did

There is a race condition in the implementation of config reload in the SONiC code.

During config reload -y -f, we ask monit to unmonitor container_checker to avoid errors as containers go down. When services restart as part of the reload, we ask monit to monitor container_checker again using the "monit monitor container_checker" command. This is an asynchronous operation: the command returns before monit has actually resumed monitoring. See the code below:

https://github.com/sonic-net/sonic-utilities/blob/cbb31f0d65c6768107f2089f6c75a617d8b519b4/config/main.py#L1058C1-L1068C55

    try:
        subprocess.check_call(['sudo', 'monit', 'status'], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        click.echo("Enabling container and routeCheck monitoring ...")
        clicommon.run_command(['sudo', 'monit', 'monitor', 'routeCheck'])
        clicommon.run_command(['sudo', 'monit', 'monitor', 'container_checker'])
        time.sleep(1)
    except subprocess.CalledProcessError as err:
        pass
    # Reload Monit configuration to pick up new hostname in case it changed
    click.echo("Reloading Monit configuration ...")
    clicommon.run_command(['sudo', 'monit', 'reload'])

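During monit reload, monit saves and restores the monitoring state of each service. If the reload runs before the monitor request for container_checker has taken effect, the saved state is still "not monitored", so container_checker remains unmonitored from that point onwards.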

How I did it

Wait until the monitor action is complete before running monit reload.

How to verify it

After config reload, the monit checkers should not be left in the "Not monitored" state.

Copilot AI review requested due to automatic review settings February 20, 2026 20:00
@mssonicbld

/azp run

@tirupatihemanth tirupatihemanth marked this pull request as draft February 20, 2026 20:00
@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Pull request overview

This PR updates the config reload/load_minigraph service restart flow in sonic-utilities to avoid leaving monit checkers in a Not monitored state by waiting for monit monitor <service> to actually take effect before reloading monit.

Changes:

  • Added a polling helper to wait until a monit service is no longer reported as “Not monitored”.
  • Replaced the fixed sleep(1) after monit monitor ... with explicit waits for routeCheck and container_checker.
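
The polling approach described above could look roughly like the following. This is only a sketch, not the code from this PR: the helper names (monit_status_text, wait_until_monitored), the timeout and interval defaults, and the injectable get_status/sleep parameters are all hypothetical, and it assumes that "monit status <service>" prints "Not monitored" in its monitoring status line until the monitor request has taken effect.

```python
import subprocess
import time


def monit_status_text(service):
    """Return the output of `sudo monit status <service>`, or '' on failure."""
    result = subprocess.run(
        ['sudo', 'monit', 'status', service],
        stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True,
    )
    return result.stdout if result.returncode == 0 else ''


def wait_until_monitored(service, timeout=10.0, interval=0.5,
                         get_status=monit_status_text, sleep=time.sleep):
    """Poll until `service` is no longer reported as "Not monitored".

    Returns True once the status output stops containing "Not monitored",
    or False if `timeout` seconds elapse first. `get_status` and `sleep`
    are injectable (hypothetical parameters) so the loop can be tested
    without a running monit daemon.
    """
    deadline = time.monotonic() + timeout
    while True:
        text = get_status(service)
        # Empty output means the status query itself failed; keep waiting.
        if text and 'not monitored' not in text.lower():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

With a helper like this, the fixed time.sleep(1) in the reload path would be replaced by one wait_until_monitored(...) call each for routeCheck and container_checker before monit reload runs. A bounded timeout keeps config reload from hanging if monit never reports the service as monitored.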

reload

Signed-off-by: Hemanth Kumar Tirupati <htirupati@nvidia.com>
Development

Successfully merging this pull request may close these issues.

Bug: Race condition in enabling monitoring during config reload

4 participants