Is your feature request related to a problem? Please describe.
Sub-issue 04's 10-minute Beat scan guarantees pending ops eventually get retried, but the latency once a device comes online can stretch to 15+ minutes on an early retry (5-min scan jitter + half the scan cadence + the next backoff window). For an operator watching a deployment in real time, that's a frustrating gap.
The faster wake-up signal is openwisp-monitoring's health_status_changed, which fires the moment a device transitions back to ok. Two things make this tricky:
- openwisp-monitoring is not a dependency of the firmware upgrader, so the integration has to stay optional - deployments without monitoring should still get persistence via Beat alone.
- A burst can hurt: when a network outage recovers and 200 devices flip from critical → ok in the same second, naively connecting a handler would fire 200 retries at once and saturate the broker. Needs jitter.
Describe the solution I would implement
I would like to add an optional signal-based wake-up path that complements sub-issue 04's Beat scan without becoming a hard dependency on openwisp-monitoring.
- Add a connect_monitoring_signals() method to FirmwareUpdaterConfig and call it from ready(). Wrap the import of health_status_changed from openwisp_monitoring.device.signals in try/except ImportError: if monitoring isn't installed, the connection silently no-ops and the rest of ready() finishes normally. Sub-issue 04's Beat-driven path keeps working either way.
- Implement the handler in a new signals_handlers.py (or extend signals.py). Signal signature is verified at openwisp_monitoring/device/signals.py:3 and emitted from openwisp_monitoring/device/base/models.py:377: health_status_changed.send(sender, instance, status). The handler reacts only when status == "ok" and ignores critical, unknown, problem, and deactivated. Lateral ok → ok re-emissions could trigger duplicate dispatches, but sub-issue 04's atomic compare-and-swap absorbs them (see bullet 5).
- The signal's instance is the DeviceMonitoring row that owns the health status; its related Device is instance.device (OneToOneField). For each pending op on that device, dispatch retry_pending_upgrade from sub-issue 04 with a randomized countdown:
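    import random

    # Primary keys of every pending operation on the recovered device.
    pending_pks = UpgradeOperation.objects.filter(
        device=instance.device, status="pending"
    ).values_list("pk", flat=True)
    for pk in pending_pks:
        # Spread the retries over the jitter window (Celery countdown is in seconds).
        countdown = random.uniform(0, PERSISTENT_RETRY_SIGNAL_JITTER)
        retry_pending_upgrade.apply_async(args=[pk], countdown=countdown)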
- One configurable setting for the signal-driven dispatch jitter:

| Setting | Default | Purpose |
| --- | --- | --- |
| ..._PERSISTENT_RETRY_SIGNAL_JITTER | 120 (2 min) | Smaller than sub-issue 04's 5-min Beat jitter because signal wake-up is meant to feel fast |
- Idempotency comes for free from sub-issue 04: both the signal handler and the Beat scan call retry_pending_upgrade, which uses the atomic filter(status="pending").update(status="in-progress") compare-and-swap. If both fire for the same op in the same minute, only one worker's update returns nonzero; the other exits silently. That directly handles the edge case where the signal fires for an op that has already been woken up: the late attempt claims nothing and does nothing.
- Testing approach: since openwisp-monitoring isn't installed in the firmware upgrader's CI, I'd construct a mock django.dispatch.Signal() locally with the same (sender, instance, status) kwargs and call the handler directly. Tests cover: status="ok" with a matching pending op dispatches one retry with countdown in [0, jitter]; non-recovery statuses dispatch nothing; no matching pending op dispatches nothing; connect_monitoring_signals silently no-ops when the import fails; signal + Beat dispatched concurrently for the same op result in exactly one upgrade_firmware.delay call.
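
The optional connection and the recovery handler from the first three bullets could look roughly like the sketch below. It is not the final implementation: make_recovery_handler, the get_pending_pks and dispatch callables, and the dispatch_uid string are hypothetical names introduced here so the logic can be exercised without Django, Celery, or openwisp-monitoring installed.

```python
import random

# Assumed default, mirroring the 120-second value proposed in the table above.
PERSISTENT_RETRY_SIGNAL_JITTER = 120


def connect_monitoring_signals(handler):
    """Connect handler to health_status_changed when monitoring is installed.

    Returns True if the receiver was connected, False if the optional
    dependency is missing (the Beat scan then remains the only retry path).
    """
    try:
        from openwisp_monitoring.device.signals import health_status_changed
    except ImportError:
        return False
    # Django holds receivers weakly by default, so the caller must keep
    # a reference to handler. The dispatch_uid value is hypothetical.
    health_status_changed.connect(
        handler, dispatch_uid="firmware_upgrader.retry_on_recovery"
    )
    return True


def make_recovery_handler(
    get_pending_pks, dispatch, jitter=PERSISTENT_RETRY_SIGNAL_JITTER
):
    """Build a receiver from two injected callables (hypothetical helpers).

    get_pending_pks(device) returns the pending operation pks for a device;
    dispatch(pk, countdown) schedules one retry after countdown seconds.
    """

    def handler(sender, instance, status, **kwargs):
        # Only a recovery to "ok" wakes pending operations up.
        if status != "ok":
            return 0
        dispatched = 0
        for pk in get_pending_pks(instance.device):
            # Random countdown in [0, jitter] spreads a recovery burst.
            dispatch(pk, random.uniform(0, jitter))
            dispatched += 1
        return dispatched

    return handler
```

In the app itself, dispatch would presumably wrap retry_pending_upgrade.apply_async(args=[pk], countdown=countdown) and get_pending_pks would run the UpgradeOperation query from bullet 3. Injecting both keeps the handler testable with plain callables, which fits the testing approach in the last bullet.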
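
The compare-and-swap that bullet 5 relies on can be illustrated outside Django with a conditional UPDATE. This is only an analogy for filter(status="pending").update(status="in-progress"): it uses the standard-library sqlite3 module, and the upgrade_operation table and claim_pending_op function are hypothetical names, not part of the firmware upgrader.

```python
import sqlite3


def claim_pending_op(conn, pk):
    """Atomically move one operation from pending to in-progress.

    Returns True only for the caller whose UPDATE actually changed the row,
    which is the same check as a nonzero return from QuerySet.update().
    """
    cur = conn.execute(
        "UPDATE upgrade_operation SET status = 'in-progress' "
        "WHERE pk = ? AND status = 'pending'",
        (pk,),
    )
    conn.commit()
    return cur.rowcount == 1


# In-memory table with a single pending operation (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE upgrade_operation (pk INTEGER PRIMARY KEY, status TEXT)"
)
conn.execute("INSERT INTO upgrade_operation VALUES (1, 'pending')")
conn.commit()

# Two wake-ups for the same operation, e.g. signal first, Beat second.
first = claim_pending_op(conn, 1)
second = claim_pending_op(conn, 1)
```

Here first is True and second is False: the second attempt matches no pending row, so that caller can return without starting another upgrade.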