
[feature] Scheduled Mass Upgrades #380

@nemesifier

Problem statement

Currently, mass firmware upgrades start immediately after the upgrade operation is created. This is problematic in production environments where devices must remain available during business hours. Network operators need a way to plan upgrades during low-usage windows, for example overnight or during scheduled maintenance periods, without manual intervention at execution time.

Describe the solution you'd like

flowchart TD
    A[User creates mass upgrade] --> B{Schedule datetime provided?}

    B -- No --> C[Execute immediately, existing behavior]
    C --> D[Mass upgrade runs via Celery tasks]
    D --> E[Upgrade completes]
    E --> F[Send generic notification]

    B -- Yes --> G[Validate scheduled datetime]
    G -- Invalid --> H[Reject with validation error]
    G -- Valid --> I[Create operation with status SCHEDULED and store UTC datetime]

    I --> J[Periodic scheduler via Celery Beat every minute]
    J --> K{Upgrade due?}

    K -- No --> J
    K -- Yes --> L[Transition status from SCHEDULED to RUNNING]

    L --> M[Runtime validation of permissions firmware and devices]

    M -- All invalid --> N[Cancel operation with FAILED status and error]
    N --> O[Send generic notification for failure]

    M -- Some or all valid --> P[Execute mass upgrade using existing logic]
    P --> Q[Upgrade completes]
    Q --> R[Send generic notification for completion]

Introduce support for scheduled mass upgrades, allowing a mass upgrade operation to start at a user-defined datetime in the future.

Scheduled upgrades must be reliable, persistent across restarts, and clearly communicate timing semantics and execution state to users.

Key requirements

UI

  • On the existing mass upgrade confirmation page, allow users to optionally schedule the upgrade for a future datetime.
  • By default, the upgrade executes immediately unless a future datetime is explicitly set.
  • The scheduled datetime must:
    • be in the future
    • respect a configurable minimum delay (e.g. 10 minutes, to avoid accidental scheduling)
    • not exceed a configurable maximum horizon (e.g. 6 months by default)
  • Invalid scheduling values must be rejected with a validation error.
  • Only upgrades in the scheduled status are editable. Once execution starts, the operation becomes read-only.
  • Timezone handling must be explicit:
    • User input should be interpreted in the user's browser timezone (or another clearly justified UX choice).
    • Datetimes must be stored in UTC.
    • The server preferred timezone should be clearly indicated in the UI.
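
The scheduling rules above can be sketched as a small validation function. This is only an illustration under stated assumptions: the names `MIN_DELAY`, `MAX_HORIZON`, `to_utc`, and `validate_scheduled_at` are hypothetical, and the default values simply mirror the examples given in the list (10 minutes, about 6 months). In the real project the check would likely live in model or serializer validation.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical defaults; the issue only requires that both limits be configurable.
MIN_DELAY = timedelta(minutes=10)
MAX_HORIZON = timedelta(days=180)  # roughly 6 months


def to_utc(local_dt):
    """Convert a timezone-aware datetime (e.g. browser local time) to UTC."""
    if local_dt.tzinfo is None:
        raise ValueError("scheduled datetime must be timezone-aware")
    return local_dt.astimezone(timezone.utc)


def validate_scheduled_at(scheduled_at, now, min_delay=MIN_DELAY, max_horizon=MAX_HORIZON):
    """Return the UTC datetime to store, or raise ValueError if it is invalid."""
    scheduled_utc = to_utc(scheduled_at)
    if scheduled_utc <= now:
        raise ValueError("scheduled datetime must be in the future")
    if scheduled_utc < now + min_delay:
        raise ValueError("scheduled datetime is earlier than the minimum delay")
    if scheduled_utc > now + max_horizon:
        raise ValueError("scheduled datetime exceeds the maximum horizon")
    return scheduled_utc
```

Passing `now` explicitly keeps the function easy to test without patching the clock; the caller would supply the current UTC time.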

Backend / REST API

  • Extend the REST API to allow creating, editing, and cancelling scheduled mass upgrade operations.
  • Full feature parity between Django admin and REST API is required.
  • The mass upgrade status model must be extended to include a scheduled state.
  • Store the scheduled execution datetime in a nullable scheduled_at field.
  • Persist any batch execution options that are currently only passed to the immediate Celery task, especially firmwareless, so scheduled upgrades can execute later with the same user choices.
  • Status transitions are expected to include at least:
    • scheduled → running
    • scheduled → cancelled
    • scheduled → failed
  • An operation is considered editable only while its status is scheduled.
  • The exact editable fields while scheduled should be defined explicitly. A reasonable default is:
    • editable: scheduled_at, group/location targeting fields, persistent, and firmwareless
    • not editable: build/firmware target and upgrade_options
  • The admin list, detail view, and REST API responses must clearly expose:
    • whether an operation is scheduled or immediate
    • the scheduled execution datetime (UTC)

Execution model

  • Firmware upgrades already run asynchronously using Celery tasks.
  • Scheduled execution must:
    • survive server restarts and worker crashes
    • avoid using Celery eta / countdown due to well-known limitations for executing tasks in the far future
    • schedule one task per mass upgrade operation, not per device
  • Using Celery Beat with a periodic task that scans and executes due scheduled upgrades is a valid and recommended approach.
  • Contributors are expected to propose and justify the final scheduling mechanism.
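
The periodic-scan approach can be sketched without any Celery dependency. The code below is an assumption-laden illustration: `Operation` is a hypothetical in-memory stand-in for a persisted mass upgrade, and `run_due_operations` is the kind of function a Celery Beat periodic task might call every minute. A real implementation would need to claim each row atomically in the database so that two concurrent scans cannot start the same operation twice.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Operation:
    # Hypothetical in-memory stand-in for a persisted mass upgrade row.
    id: int
    status: str
    scheduled_at: Optional[datetime] = None


def run_due_operations(operations, now, start):
    """Start every scheduled operation whose time has come.

    `start` is called once per mass upgrade operation (not per device)
    and would normally enqueue the existing upgrade task.
    Returns the ids of the operations that were started.
    """
    started = []
    for op in operations:
        is_due = (
            op.status == "scheduled"
            and op.scheduled_at is not None
            and op.scheduled_at <= now
        )
        if is_due:
            op.status = "running"  # claim first so a later scan skips it
            start(op)
            started.append(op.id)
    return started
```

Because the schedule lives in the stored `scheduled_at` value rather than in a queued task, a restart or worker crash only delays the next scan instead of losing the schedule.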

Runtime validation and failure handling

  • Devices, permissions, firmware availability, and other preconditions must be re-evaluated at execution time.
  • If all target devices become invalid at execution time, the upgrade must be cancelled with a visible error state.
  • Partial changes (e.g. some devices removed) are acceptable and should be handled by existing upgrader logic.
  • Conflicting mass upgrade operations must not be allowed:
    • If a scheduled mass upgrade exists, users must not be able to create another conflicting mass upgrade (scheduled or immediate).
    • Contributors should verify this using tests and adjust the implementation only if needed (TDD approach).
    • Conflict checks should also consider persistent pending per-device operations, because they are still active upgrade attempts waiting for retry.
    • We should explicitly define what constitutes a conflict. One proposed starting point is an active batch operation (idle, scheduled, or in-progress) targeting the same category/device population, plus any active per-device operation in in-progress or pending.
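
One possible reading of the rules above, sketched in plain Python. Everything here is an assumption to be refined: "same category/device population" is interpreted as overlapping sets of device ids, the status strings follow the proposed starting point, and `has_conflict` and `runtime_outcome` are hypothetical helper names. When editing an existing scheduled operation, the operation itself would have to be excluded from the conflict check.

```python
# Statuses treated as "active", following the proposed starting point.
ACTIVE_BATCH_STATUSES = {"idle", "scheduled", "in-progress"}
ACTIVE_DEVICE_STATUSES = {"in-progress", "pending"}


def has_conflict(new_device_ids, batch_operations, device_operations):
    """Return True if a new mass upgrade would overlap an active operation.

    batch_operations: iterable of (status, device_ids) pairs.
    device_operations: iterable of (status, device_id) pairs.
    """
    targets = set(new_device_ids)
    for status, device_ids in batch_operations:
        if status in ACTIVE_BATCH_STATUSES and targets & set(device_ids):
            return True
    return any(
        status in ACTIVE_DEVICE_STATUSES and device_id in targets
        for status, device_id in device_operations
    )


def runtime_outcome(target_device_ids, valid_device_ids):
    """Decide what happens when a scheduled upgrade becomes due.

    Returns ("failed", empty set) if no target is still valid, otherwise
    ("running", devices to upgrade); partial loss of targets is accepted.
    """
    valid = set(target_device_ids) & set(valid_device_ids)
    if not valid:
        return "failed", set()
    return "running", valid
```

Writing the conflict rule as a pure function first makes it straightforward to cover with tests before adjusting the actual implementation, in line with the TDD approach suggested above.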

Notifications

  • A generic_notification must be sent when a mass upgrade starts.
  • A generic_notification must be sent when any mass upgrade completes (if not already implemented).
  • A generic_notification must be sent if a scheduled upgrade fails runtime validation before execution.
  • Notifications are delivered to all organization administrators and superusers via the existing notifications module. This is already implemented and requires no further work: just fire the generic_notification when due.

Non-functional requirements

  • Test coverage must not decrease.
  • Time-dependent logic may be tested using mocking; tests currently run with CELERY_EAGER=True.
  • Tests should cover scheduled + persistent composition: a scheduled mass upgrade starts later, upgrades online devices, moves offline devices to pending, and retries them later.
  • Documentation must be updated to describe the new feature, including updating any affected screenshots.
  • Provide a short example usage video suitable for YouTube, to be showcased on the website and in the documentation.
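
As an illustration of testing time-dependent logic with mocking, the sketch below freezes the clock with `unittest.mock`. `Clock` and `is_due` are hypothetical names used only to keep the example self-contained; in a Django project the patched target would more likely be `django.utils.timezone.now` as imported by the module under test.

```python
from datetime import datetime, timedelta, timezone
from unittest import mock


class Clock:
    """Hypothetical stand-in for the project's time source."""

    @staticmethod
    def now():
        return datetime.now(timezone.utc)


def is_due(scheduled_at):
    """Return True once the scheduled datetime has been reached."""
    return scheduled_at <= Clock.now()


def check_is_due_with_frozen_time():
    """Exercise is_due() just before and exactly at the scheduled time."""
    scheduled_at = datetime(2030, 1, 1, 2, 0, tzinfo=timezone.utc)
    before = scheduled_at - timedelta(minutes=1)
    with mock.patch.object(Clock, "now", return_value=before):
        assert not is_due(scheduled_at)
    with mock.patch.object(Clock, "now", return_value=scheduled_at):
        assert is_due(scheduled_at)
    return True
```

The same pattern extends to the scheduled + persistent scenario: freeze the clock before the scheduled time, assert nothing runs, then advance it and assert that online devices are upgraded and offline ones move to pending.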

Describe alternatives you've considered

  • Using external cron jobs or manual scripts to delay upgrade creation. This is error-prone, not user friendly, and bypasses OpenWISP’s auditability and permissions model.
  • Relying solely on server-side timezone assumptions without exposing them in the UI, which risks misconfiguration.

Additional context

Scheduled execution is a common requirement in network management systems and aligns with real-world operational practices.

Constraints

  • Test coverage must not decrease.
  • Basic browser tests for UI related features are required.
  • Documentation needs to be updated to include this new feature, including updating any existing screenshots that may change after implementation.
  • We also need a short example usage video for YouTube that we can showcase on the website/documentation.

Labels

  • enhancement: New feature or request
  • gsoc-idea: Issues part of Google Summer of Code project