[feature:gsoc26] Add health_status_changed signal handler for fast pending-upgrade wake-up via OpenWISP Monitoring #425

Description

@Eeshu-Yadav

Is your feature request related to a problem? Please describe.

Sub-issue 04's 10-minute Beat scan guarantees that pending ops are eventually retried, but once a device comes back online the retry can still lag by 15+ minutes on an early retry (5-min scan jitter + half the scan cadence + the next backoff window). For an operator watching a deployment in real time, that is a frustrating gap.

The faster wake-up signal is openwisp-monitoring's health_status_changed, which fires the moment a device transitions back to ok. Two things make this tricky:

  • openwisp-monitoring is not a dependency of the firmware upgrader, so the integration has to stay optional - deployments without monitoring should still get persistence via Beat alone.
  • A burst can hurt: when a network outage recovers and 200 devices flip from critical → ok in the same second, a naively connected handler would fire 200 retries at once and saturate the broker. The dispatch therefore needs jitter.
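
To illustrate the burst concern, here is a small stdlib-only simulation (not part of the proposed change). It assumes 200 devices recover at the same instant and compares the busiest one-second bucket with no jitter against a uniform 120-second jitter window. `busiest_second` is a hypothetical helper name, and the fixed seed is only there to make the run repeatable.

```python
import random
from collections import Counter


def busiest_second(device_count, jitter_seconds, seed=0):
    """Return the largest number of retries landing in any one-second bucket."""
    rng = random.Random(seed)
    # One randomized countdown per recovered device, as in the proposed dispatch.
    countdowns = [rng.uniform(0, jitter_seconds) for _ in range(device_count)]
    # Group countdowns by the whole second in which the retry would start.
    buckets = Counter(int(c) for c in countdowns)
    return max(buckets.values())


# With a zero-width window every countdown is 0.0, so all retries share one bucket.
no_jitter = busiest_second(200, 0)
# With a 120-second window the same retries are spread over many buckets.
with_jitter = busiest_second(200, 120)

print("busiest second without jitter:", no_jitter)
print("busiest second with jitter:", with_jitter)
```

Without jitter the busiest second carries all 200 retries; with the window the peak drops to a small fraction of that, which is the effect the randomized countdown is meant to provide.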

Describe the solution I would implement

I would like to add an optional signal-based wake-up path that complements sub-issue 04's Beat scan without becoming a hard dependency on openwisp-monitoring.

  1. Add a connect_monitoring_signals() method to FirmwareUpdaterConfig and call it from ready(). Wrap the from openwisp_monitoring.device.signals import health_status_changed import in try/except ImportError - if monitoring isn't installed, the connection silently no-ops and the rest of ready() finishes normally. Sub-issue 04's Beat-driven path keeps working either way.

  2. Implement the handler in a new signals_handlers.py (or extend signals.py). The signal signature is verified at openwisp_monitoring/device/signals.py:3 and the signal is emitted from openwisp_monitoring/device/base/models.py:377: health_status_changed.send(sender, instance, status). The handler reacts only when status == "ok" and ignores critical, unknown, problem, and deactivated. Lateral ok → ok re-emissions could trigger duplicate dispatches, but sub-issue 04's atomic compare-and-swap absorbs them (see step 5).

  3. The signal's instance is the DeviceMonitoring row that owns the health status; its related Device is instance.device (OneToOneField). For each pending op on that device, dispatch retry_pending_upgrade from sub-issue 04 with a randomized countdown:

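    import random

    # Pending operations for the device that just recovered.
    pending_pks = UpgradeOperation.objects.filter(
        device=instance.device, status="pending"
    ).values_list("pk", flat=True)
    for pk in pending_pks:
        # Spread dispatches over the jitter window to avoid a retry burst.
        countdown = random.uniform(0, PERSISTENT_RETRY_SIGNAL_JITTER)
        retry_pending_upgrade.apply_async(args=[pk], countdown=countdown)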
  4. One configurable setting for the signal-driven dispatch jitter:

    | Setting | Default | Purpose |
    |---|---|---|
    | ..._PERSISTENT_RETRY_SIGNAL_JITTER | 120 seconds (2 min) | Smaller than sub-issue 04's 5-min Beat jitter because signal wake-up is meant to feel fast |
  5. Idempotency comes for free from sub-issue 04: both the signal handler and the Beat scan call retry_pending_upgrade, which uses the atomic filter(status="pending").update(status="in-progress") compare-and-swap. If both fire for the same op in the same minute, only one worker's update returns nonzero; the other exits silently. That directly handles the edge case where the signal fires for an op that another path has already woken up: the late caller's update matches no rows, so it does nothing.

  6. Testing approach: since openwisp-monitoring isn't installed in the firmware upgrader's CI, I'd construct a mock django.dispatch.Signal() locally with the same (sender, instance, status) kwargs and call the handler directly. Tests cover: status="ok" with a matching pending op dispatches one retry with countdown in [0, jitter]; non-recovery statuses dispatch nothing; no matching pending op dispatches nothing; connect_monitoring_signals silently no-ops when the import fails; signal + Beat dispatched concurrently for the same op result in exactly one upgrade_firmware.delay call.
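
The steps above can be sketched without Django or Celery installed, which is roughly what the test setup in step 6 needs. This is a minimal illustration under stated assumptions, not the final implementation: `OPERATIONS` is a hypothetical in-memory stand-in for UpgradeOperation rows, a lock plays the role of the atomic filter(...).update(...), `make_handler` is a hypothetical factory that takes the dispatch callable as a parameter so a mock can replace retry_pending_upgrade.apply_async, and `upgrade_firmware` is a mock rather than the real task.

```python
import random
import sys
import threading
from unittest import mock

PERSISTENT_RETRY_SIGNAL_JITTER = 120  # seconds, the default proposed in step 4

# Hypothetical in-memory stand-in for UpgradeOperation rows: pk -> fields.
OPERATIONS = {1: {"device": "device-a", "status": "pending"}}
_lock = threading.Lock()
upgrade_firmware = mock.Mock()  # stands in for the real Celery task


def retry_pending_upgrade(pk):
    """Compare-and-swap: only the caller that flips pending -> in-progress proceeds."""
    with _lock:  # plays the role of the atomic filter(...).update(...)
        if OPERATIONS[pk]["status"] != "pending":
            return False  # another path already claimed the op; do nothing
        OPERATIONS[pk]["status"] = "in-progress"
    upgrade_firmware.delay(pk)
    return True


def make_handler(dispatch, jitter=PERSISTENT_RETRY_SIGNAL_JITTER):
    """Build a receiver with the (sender, instance, status) signature from step 2."""

    def handler(sender, instance, status, **kwargs):
        if status != "ok":
            return  # non-recovery statuses are ignored
        for pk, op in OPERATIONS.items():
            if op["device"] == instance.device and op["status"] == "pending":
                dispatch(pk, countdown=random.uniform(0, jitter))

    return handler


def connect_monitoring_signals(handler):
    """Connect the receiver if openwisp-monitoring is importable, else no-op."""
    try:
        from openwisp_monitoring.device.signals import health_status_changed
    except ImportError:
        return False  # monitoring not installed; Beat alone handles retries
    health_status_changed.connect(handler)
    return True


# --- Calling the handler directly, as step 6 suggests ---
dispatch = mock.Mock()
handler = make_handler(dispatch)
instance = mock.Mock(device="device-a")  # stands in for a DeviceMonitoring row

handler(sender=None, instance=instance, status="critical")
assert dispatch.call_count == 0  # non-recovery status dispatches nothing

handler(sender=None, instance=instance, status="ok")
(pk,), kwargs = dispatch.call_args
assert pk == 1 and 0 <= kwargs["countdown"] <= PERSISTENT_RETRY_SIGNAL_JITTER

# Signal path and Beat path both reach the task for the same op, back to back.
results = [retry_pending_upgrade(1), retry_pending_upgrade(1)]
assert results == [True, False]
assert upgrade_firmware.delay.call_count == 1

# A None entry in sys.modules makes the import raise ImportError.
with mock.patch.dict(sys.modules, {"openwisp_monitoring": None}):
    assert connect_monitoring_signals(handler) is False
```

Passing the dispatch callable in keeps the handler testable in CI where openwisp-monitoring and a broker are absent; in the real handler the callable would presumably wrap retry_pending_upgrade.apply_async(args=[pk], countdown=countdown). The two calls to retry_pending_upgrade run sequentially here, so this shows the compare-and-swap outcome rather than true concurrency.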

Metadata

Assignees

Labels

enhancement (New feature or request), gsoc-idea (Issues part of Google Summer of Code project)

Projects

Status
In progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
