mdm: apple_mdm_dep_profile_assigner cron silently drops device events when context-cancelled mid-run #46235

@robbiet480

Fleet versions

  • Discovered:
  • Reproduced:

Web browser and operating system:


💥  Actual behavior

When the sync phase of the DEP sync cron returns an error, the cursor is still advanced, so some number of events are dropped.
See:

if s.callback != nil {

🛠️ Expected behavior

If the DEP sync phase returns an error, the cron should not advance the cursor (similar to what happens if the DEP fetch returns an error). Advancing the cursor means the events on the page where the error occurred will not be processed again.

Check the upstream code to see whether this behavior is unique to Fleet or is also broken upstream. If it is, open an upstream issue (and potentially a fix PR if it makes sense), or start a thread in nano's macadmins channel.

🧑‍💻  Steps to reproduce

These steps:

  • Have been confirmed to consistently lead to reproduction in multiple Fleet instances.
  • Describe the workflow that led to the error, but have not yet been reproduced in multiple Fleet instances.

Reproduction steps are uncertain. In the customer's case, they run Fleet in a runtime that pauses Fleet between requests, so the crons sometimes return errors when Fleet is unpaused, potentially several seconds later.

In any case, when this happens while we are inside this code:

if s.callback != nil {

If the syncer returns an error, we may still advance the cursor in the DB. The prior list of events may not have been processed, yet we move on to the next page of events, potentially skipping some hosts.

🕯️ More info (optional)

Original bug report below

Summary

When the apple_mdm_dep_profile_assigner cron is terminated mid-execution (context canceled), the nanodep sync cursor in nano_dep_names.syncer_cursor can advance without the corresponding device upserts being committed to host_dep_assignments. Devices that Apple emitted during the broken window are effectively dropped: they appear assigned to Fleet in Apple Business Manager but Fleet has no host record, and Apple's DEP backend will not re-emit those events without an out-of-band trigger.

The only recovery is to manually clear the cursor per docs/Contributing/.../resetting-apple-dep-sync-cursor.md.

Environment

  • Self-hosted Fleet on Google Cloud Run
  • Cloud Run config that triggered it: cpu-throttling=true, min-instances=0 (CPU frozen between requests + scale-to-zero). The cron's background goroutines get killed before completion.
  • Other Cloud Run deployments are likely vulnerable to the same pattern any time the container is terminated mid-tick. The root cause is the cron's non-transactional cursor handling, not the specific runtime.

Reproduction

  1. Deploy Fleet on Cloud Run with cpu-throttling=true and min-instances=0.
  2. Assign macOS devices to the Fleet ADE server in Apple Business Manager with a migration deadline.
  3. Observe Fleet logs: repeated unlock failed: context canceled, update cron stats: context canceled, and pending job might still be running, wait 1m0s.
  4. Fix the runtime (--no-cpu-throttling --min-instances=1) so the cron runs cleanly.
  5. Observe that the just-assigned devices never appear in Fleet. Apple's DEP API returns cursor returned all devices previously — the events were delivered against the advancing cursor but the upserts never landed.

Expected behavior

Either:

  • The cron is transactional: the cursor only advances after device upserts commit. A context-cancel mid-run results in the same events being replayed on the next tick.
  • Or: on startup, the cron detects "previous tick crashed without completing" and either replays or resets the cursor automatically. The pending job might still be running, wait 1m0s path already knows there's a stale lock; that signal could trigger recovery.
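The second option could look something like the sketch below. It assumes a hypothetical "run marker" persisted next to the cursor; `runMarker`, `tickState`, `beginTick`, and `endTick` are invented for illustration and do not exist in Fleet or nanodep. It also assumes Apple still accepts the earlier cursor and that upserts are idempotent; if the old cursor is no longer usable, a full cursor reset would be the fallback.

```go
package main

import "fmt"

// runMarker is a hypothetical record stored alongside the cursor.
type runMarker struct {
	InProgress    bool
	CursorAtStart string
}

// tickState models the persisted state the cron would read on each tick.
type tickState struct {
	Cursor string
	Marker runMarker
}

// beginTick returns the cursor to sync from. If the previous tick never
// reached endTick, it rewinds to the cursor that tick started with so the
// same events are requested again.
func beginTick(s *tickState) (cursor string, recovered bool) {
	if s.Marker.InProgress {
		fmt.Printf("previous run did not complete cleanly; rewinding cursor from %q to %q\n",
			s.Cursor, s.Marker.CursorAtStart)
		s.Cursor = s.Marker.CursorAtStart
		recovered = true
	}
	s.Marker = runMarker{InProgress: true, CursorAtStart: s.Cursor}
	return s.Cursor, recovered
}

// endTick records a clean completion and clears the marker.
func endTick(s *tickState, newCursor string) {
	s.Cursor = newCursor
	s.Marker = runMarker{}
}
```

The log line in `beginTick` doubles as the operator-visible signal: a tick that starts while the marker is still set is, by construction, one that follows an incomplete run.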

Actual behavior

Cursor advances; device upserts are lost; Apple does not re-emit the events. Operator-side recovery (cursor reset in the DB) is required.

Evidence

Representative log entries during the broken window:

err: "context canceled"
msg: "unlock failed"
cron: "apple_mdm_dep_profile_assigner"
err: "1 error occurred:\n\t* context canceled\n\n"
msg: "update cron stats apple_mdm_dep_profile_assigner"
schedule: "apple_mdm_dep_profile_assigner"
msg: "pending job might still be running, wait 1m0s"

Even after the runtime was fixed and the cron began completing cleanly each tick, multiple devices remained invisible to Fleet. ABM-side re-assignment attempts (Fleet → other MDM → Fleet) emitted only deleted events to the Fleet token; Apple's DEP API never followed up with a re-emission, so a simple re-assignment did not recover. Only a cursor reset (per the existing troubleshooting doc) brought the missing devices into Fleet.

In our case ~3 devices were affected. We confirmed by clearing the cursor and watching the next sync return a full enumeration including op_type=modified, profile_status=assigned → pushed for the previously-stuck serials.

Suggested fix

  • Wrap the cursor update + device upsert in a single transaction (in nanodep or in Fleet's wrapper layer), so context-cancel does not leave them in inconsistent states.
  • Alternatively, when the pending job might still be running path detects a stale lock from a prior crash, treat that as a signal to reset the cursor automatically (with a log line so operators are aware).
  • Either way, surfacing the "previous run did not complete cleanly" state more visibly (a counter, an alertable log line, or a Fleet UI indicator) would help operators catch this before devices go missing.

Metadata

Labels

#g-mdm (MDM product group), :product (Product Design department; shows up on 🦢 Drafting board), bug (Something isn't working as documented), customer-camelot


Projects

Status
🦤 Estimated
