Fleet versions
Web browser and operating system:
💥 Actual behavior
When the sync phase of the DEP sync cron returned an error, the cursor was still advanced, so some number of events were dropped.
See fleet/server/mdm/nanodep/sync/syncer.go, line 184 in 6a48147.
🛠️ Expected behavior
If the DEP sync phase returns an error, the cron should not advance the cursor (similar to what happens if the DEP fetch returns an error). Advancing the cursor means the events on the page where the error occurred will never be processed again.
Check the upstream code to see whether this behavior is unique to Fleet or also broken upstream. If it is also broken upstream, open an issue (and potentially a fix PR if it makes sense), or perhaps start a thread in nano's macadmins channel.
🧑💻 Steps to reproduce
These steps:
Reproduction steps are uncertain. In the customer's case, they are running Fleet in a runtime that pauses Fleet between requests, so the crons sometimes return errors when Fleet is unpaused, potentially several seconds later.
At any rate, when this happens while we are within this code (fleet/server/mdm/nanodep/sync/syncer.go, line 184 in 6a48147):
If an error is returned from the syncer, we may still advance the cursor in the DB. That means we may not have processed the prior list of events but would still move on to the next page of events, potentially skipping some hosts.
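The ordering described above can be illustrated with a minimal sketch. This is not the nanodep syncer code: `page`, `cursorStore`, `syncOnce`, and `errCanceled` are hypothetical names, and the store is an in-memory stand-in for the persisted cursor. The only point is that the cursor is written after the processing callback succeeds, so a failed page is fetched again on the next tick.

```go
package main

import (
	"errors"
	"fmt"
)

// errCanceled mimics the "context canceled" error seen in the logs.
var errCanceled = errors.New("context canceled")

// page is a hypothetical stand-in for one DEP sync response.
type page struct {
	Serials    []string
	NextCursor string
}

// cursorStore is a hypothetical in-memory stand-in for the persisted cursor.
type cursorStore struct{ cursor string }

// syncOnce fetches one page, hands it to process, and persists the new
// cursor only if both steps succeed.
func syncOnce(store *cursorStore, fetch func(cursor string) (page, error), process func(page) error) error {
	p, err := fetch(store.cursor)
	if err != nil {
		return fmt.Errorf("fetch: %w", err) // cursor untouched
	}
	if err := process(p); err != nil {
		// Cursor untouched: the same page is fetched again on the next tick.
		return fmt.Errorf("process: %w", err)
	}
	store.cursor = p.NextCursor
	return nil
}

func main() {
	store := &cursorStore{cursor: "c0"}
	fetch := func(cursor string) (page, error) {
		return page{Serials: []string{"SERIAL1"}, NextCursor: cursor + "+1"}, nil
	}
	failing := func(page) error { return errCanceled }
	succeeding := func(page) error { return nil }

	fmt.Println(syncOnce(store, fetch, failing), store.cursor)
	fmt.Println(syncOnce(store, fetch, succeeding), store.cursor)
}
```

This relies on the processing step being idempotent, since a replayed page may contain devices that were already upserted before the error.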
🕯️ More info (optional)
Original bug report below
Summary
When the apple_mdm_dep_profile_assigner cron is terminated mid-execution (context canceled), the nanodep sync cursor in nano_dep_names.syncer_cursor can advance without the corresponding device upserts being committed to host_dep_assignments. Devices that Apple emitted during the broken window are effectively dropped: they appear assigned to Fleet in Apple Business Manager but Fleet has no host record, and Apple's DEP backend will not re-emit those events without an out-of-band trigger.
The only recovery is to manually clear the cursor per docs/Contributing/.../resetting-apple-dep-sync-cursor.md.
Environment
- Self-hosted Fleet on Google Cloud Run
- Cloud Run config that triggered it: cpu-throttling=true, min-instances=0 (CPU frozen between requests + scale-to-zero). The cron's background goroutines get killed before completion.
- Other Cloud Run deployments are likely vulnerable to the same pattern any time the container is terminated mid-tick. The root cause is the cron's non-transactional cursor handling, not the specific runtime.
Reproduction
- Deploy Fleet on Cloud Run with cpu-throttling=true and min-instances=0.
- Assign macOS devices to the Fleet ADE server in Apple Business Manager with a migration deadline.
- Observe Fleet logs: repeated "unlock failed: context canceled", "update cron stats: context canceled", and "pending job might still be running, wait 1m0s".
- Fix the runtime (--no-cpu-throttling --min-instances=1) so the cron runs cleanly.
- Observe that the just-assigned devices never appear in Fleet. Apple's DEP API returns "cursor returned all devices previously": the events were delivered against the advancing cursor but the upserts never landed.
Expected behavior
Either:
- The cron is transactional: the cursor only advances after device upserts commit. A context-cancel mid-run results in the same events being replayed on the next tick.
- Or: on startup, the cron detects "previous tick crashed without completing" and either replays or resets the cursor automatically. The "pending job might still be running, wait 1m0s" path already knows there's a stale lock; that signal could trigger recovery.
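To make the first option concrete, here is a minimal sketch of the all-or-nothing behavior. It uses an in-memory copy-and-swap rather than a real database, and `state`, `commitPage`, and the `upsert` callback are hypothetical names, not Fleet or nanodep APIs. Against a SQL store, the same shape would presumably use a transaction (for example `BeginTx` and `Commit` from Go's `database/sql`) around both writes.

```go
package main

import (
	"errors"
	"fmt"
)

// errCanceled mimics a write that fails because the context was canceled.
var errCanceled = errors.New("context canceled")

// state is a hypothetical in-memory stand-in for the two pieces of data that
// must move together: the sync cursor and the device assignments.
type state struct {
	Cursor  string
	Devices map[string]bool
}

// commitPage stages the upserts and the new cursor on a copy of the state
// and swaps the copy in only if every upsert succeeded.
func commitPage(s *state, serials []string, nextCursor string, upsert func(serial string) error) error {
	staged := state{Cursor: nextCursor, Devices: make(map[string]bool, len(s.Devices)+len(serials))}
	for serial := range s.Devices {
		staged.Devices[serial] = true
	}
	for _, serial := range serials {
		if err := upsert(serial); err != nil {
			// Nothing is committed: cursor and devices keep their old values.
			return fmt.Errorf("upsert %s: %w", serial, err)
		}
		staged.Devices[serial] = true
	}
	*s = staged // commit: cursor and upserts become visible together
	return nil
}

func main() {
	s := &state{Cursor: "c0", Devices: map[string]bool{}}
	serials := []string{"SERIAL1", "SERIAL2"}

	// Simulate the runtime freezing mid-run: the second upsert fails.
	calls := 0
	flaky := func(string) error {
		calls++
		if calls == 2 {
			return errCanceled
		}
		return nil
	}
	fmt.Println(commitPage(s, serials, "c1", flaky), s.Cursor, len(s.Devices))

	// The next tick replays the same page and commits it as a unit.
	fmt.Println(commitPage(s, serials, "c1", func(string) error { return nil }), s.Cursor, len(s.Devices))
}
```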
Actual behavior
Cursor advances; device upserts are lost; Apple does not re-emit the events. Operator-side recovery (cursor reset in the DB) is required.
Evidence
Representative log entries during the broken window:
err: "context canceled"
msg: "unlock failed"
cron: "apple_mdm_dep_profile_assigner"
err: "1 error occurred:\n\t* context canceled\n\n"
msg: "update cron stats apple_mdm_dep_profile_assigner"
schedule: "apple_mdm_dep_profile_assigner"
msg: "pending job might still be running, wait 1m0s"
Even after the runtime was fixed and the cron began completing cleanly on each tick, multiple devices remained invisible to Fleet. ABM-side re-assignment attempts (Fleet → other MDM → Fleet) emitted only deleted events to the Fleet token; Apple's DEP API never followed up with a re-emission, so a simple re-assignment did not recover the devices. Only a cursor reset (per the existing troubleshooting doc) brought the missing devices into Fleet.
In our case ~3 devices were affected. We confirmed by clearing the cursor and watching the next sync return a full enumeration including op_type=modified, profile_status=assigned → pushed for the previously-stuck serials.
Suggested fix
- Wrap the cursor update + device upsert in a single transaction (in nanodep or in Fleet's wrapper layer), so context-cancel does not leave them in inconsistent states.
- Alternatively, when the "pending job might still be running" path detects a stale lock from a prior crash, treat that as a signal to reset the cursor automatically (with a log line so operators are aware).
- Either way, surfacing the "previous run did not complete cleanly" state more visibly (a counter, an alertable log line, or a Fleet UI indicator) would help operators catch this before devices go missing.
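The stale-lock idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: `runRecord`, `planTick`, and the status values are hypothetical and do not reflect how Fleet actually stores cron stats. It also assumes the caller has already decided the lock is stale, since a pending record can equally mean a run is genuinely still in progress.

```go
package main

import "fmt"

type runStatus string

const (
	statusPending   runStatus = "pending"
	statusCompleted runStatus = "completed"
)

// runRecord is a hypothetical record written when a tick starts and updated
// when it finishes. CursorAtStart is the cursor the tick began from.
type runRecord struct {
	Status        runStatus
	CursorAtStart string
}

// tickPlan says which cursor the next tick should start from and whether
// that choice is a recovery from an unclean previous run.
type tickPlan struct {
	StartCursor string
	Recovered   bool
}

// planTick replays from the previous run's starting cursor if that run never
// reached "completed"; otherwise it continues from the stored cursor.
func planTick(prev runRecord, storedCursor string) tickPlan {
	if prev.Status == statusPending {
		return tickPlan{StartCursor: prev.CursorAtStart, Recovered: true}
	}
	return tickPlan{StartCursor: storedCursor}
}

func main() {
	crashed := runRecord{Status: statusPending, CursorAtStart: "c7"}
	plan := planTick(crashed, "c8")
	if plan.Recovered {
		// An alertable log line, so operators know a replay happened.
		fmt.Printf("previous run did not complete cleanly, replaying from cursor %q\n", plan.StartCursor)
	}

	clean := runRecord{Status: statusCompleted, CursorAtStart: "c8"}
	fmt.Printf("%+v\n", planTick(clean, "c9"))
}
```

Replaying from the starting cursor is a narrower recovery than a full cursor reset; which of the two is appropriate depends on how long Apple keeps a given cursor valid, which this sketch does not address.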
Related
cpu_idle = false on the Cloud Run service). That PR mitigates the most common cause of context cancellation on the GCP path but does not fix the underlying cursor-vs-upsert ordering bug described here; this issue remains valid for any other runtime where the container can be terminated mid-tick.