mdm: `apple_mdm_dep_profile_assigner` cron silently drops device events when context-cancelled mid-run #46235

**Fleet versions** 
  - *Discovered:* 
  - *Reproduced:* 

**Web browser and operating system**: 

<hr/>

### 💥  Actual behavior
When the sync phase of the DEP sync cron returns an error, the cursor is still advanced, so some number of events are dropped.
See https://github.com/fleetdm/fleet/blob/6a481477e8ac4bef43b5826e6240b867ac3b2379/server/mdm/nanodep/sync/syncer.go#L184

### 🛠️ Expected behavior
If the DEP sync phase returns an error, the cron should not advance the cursor (similar to what happens if the DEP fetch returns an error). Advancing the cursor means the events on the page where the error occurred will not be processed again.
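
The ordering described above can be sketched in Go. This is a simplified model and not the actual nanodep or Fleet code: `Page`, `CursorStore`, `syncOnce`, and `demo` are hypothetical names used only for illustration, and the store is in memory rather than the `nano_dep_names` table.

```go
package main

import (
	"context"
	"fmt"
)

// Page is a hypothetical stand-in for one DEP sync response.
type Page struct {
	Serials    []string
	NextCursor string
}

// CursorStore is a hypothetical in-memory stand-in for the persisted cursor.
type CursorStore struct{ cursor string }

func (s *CursorStore) Load() string  { return s.cursor }
func (s *CursorStore) Save(c string) { s.cursor = c }

type fetchFunc func(ctx context.Context, cursor string) (Page, error)
type processFunc func(ctx context.Context, p Page) error

// syncOnce fetches one page, hands it to process, and saves the new cursor
// only if process returned nil. On any error the stored cursor is untouched,
// so the same page is fetched again on the next tick.
func syncOnce(ctx context.Context, store *CursorStore, fetch fetchFunc, process processFunc) error {
	page, err := fetch(ctx, store.Load())
	if err != nil {
		return fmt.Errorf("fetch page: %w", err)
	}
	if err := process(ctx, page); err != nil {
		return fmt.Errorf("process page: %w", err)
	}
	store.Save(page.NextCursor)
	return nil
}

// demo runs one failing tick and one clean tick and reports the stored
// cursor after each.
func demo() (afterFailure, afterSuccess string) {
	store := &CursorStore{cursor: "c0"}
	fetch := func(_ context.Context, _ string) (Page, error) {
		return Page{Serials: []string{"SERIAL-A"}, NextCursor: "c1"}, nil
	}
	cancelled := func(context.Context, Page) error { return context.Canceled }
	succeeded := func(context.Context, Page) error { return nil }

	_ = syncOnce(context.Background(), store, fetch, cancelled)
	afterFailure = store.Load()
	_ = syncOnce(context.Background(), store, fetch, succeeded)
	afterSuccess = store.Load()
	return afterFailure, afterSuccess
}

func main() {
	afterFailure, afterSuccess := demo()
	fmt.Println("cursor after failed tick:", afterFailure)
	fmt.Println("cursor after clean tick:", afterSuccess)
}
```

In this model, the failed tick leaves the cursor where it was, so the page containing `SERIAL-A` is offered again on the next tick.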

Check the upstream code to see whether this behavior is unique to Fleet or also broken upstream. If it is broken upstream, open an issue (and potentially a fix PR if it makes sense), or start a thread in nano's macadmins channel.

### 🧑‍💻  Steps to reproduce



These steps:

- [ ] Have been confirmed to consistently lead to reproduction in multiple Fleet instances.
- [x] Describe the workflow that led to the error, but have not yet been reproduced in multiple Fleet instances.

Reproduction steps are uncertain. In the customer's case, they are running Fleet in a runtime that pauses Fleet between requests, so the crons sometimes return errors when Fleet is unpaused, potentially several seconds later.

At any rate, suppose this happens while we are within this code: https://github.com/fleetdm/fleet/blob/6a481477e8ac4bef43b5826e6240b867ac3b2379/server/mdm/nanodep/sync/syncer.go#L184

If an error is returned from the syncer, we may still advance the cursor in the DB. We may then not have processed the prior list of events but still move on to the next page of events, potentially skipping some hosts.

### 🕯️ More info _(optional)_
Original bug report below


### Summary

When the `apple_mdm_dep_profile_assigner` cron is terminated mid-execution (`context canceled`), the nanodep sync cursor in `nano_dep_names.syncer_cursor` can advance without the corresponding device upserts being committed to `host_dep_assignments`. Devices that Apple emitted during the broken window are effectively dropped: they appear assigned to Fleet in Apple Business Manager but Fleet has no host record, and Apple's DEP backend will not re-emit those events without an out-of-band trigger.

The only recovery is to manually clear the cursor per [docs/Contributing/.../resetting-apple-dep-sync-cursor.md](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/product-groups/mdm/troubleshooting/resetting-apple-dep-sync-cursor.md).

### Environment

- Self-hosted Fleet on Google Cloud Run
- Cloud Run config that triggered it: `cpu-throttling=true`, `min-instances=0` (CPU frozen between requests + scale-to-zero). The cron's background goroutines get killed before completion.
- Other Cloud Run deployments are likely vulnerable to the same pattern any time the container is terminated mid-tick. The root cause is the cron's non-transactional cursor handling, not the specific runtime.

### Reproduction

1. Deploy Fleet on Cloud Run with `cpu-throttling=true` and `min-instances=0`.
2. Assign macOS devices to the Fleet ADE server in Apple Business Manager with a migration deadline.
3. Observe Fleet logs: repeated `unlock failed: context canceled`, `update cron stats: context canceled`, and `pending job might still be running, wait 1m0s`.
4. Fix the runtime (`--no-cpu-throttling --min-instances=1`) so the cron runs cleanly.
5. Observe that the just-assigned devices never appear in Fleet. Apple's DEP API returns `cursor returned all devices previously` — the events were delivered against the advancing cursor but the upserts never landed.

### Expected behavior

Either:
- The cron is transactional: the cursor only advances after device upserts commit. A context-cancel mid-run results in the same events being replayed on the next tick.
- Or: on startup, the cron detects "previous tick crashed without completing" and either replays or resets the cursor automatically. The `pending job might still be running, wait 1m0s` path already knows there's a stale lock; that signal could trigger recovery.
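
The second option could look roughly like the following Go sketch. It assumes a hypothetical `RunState` record stored next to the cursor; `beginTick`, `endTick`, and `demo` are illustrative names, not existing Fleet or nanodep functions. The reset to an empty cursor mirrors the manual recovery described in this report, where clearing the cursor produced a full enumeration on the next sync.

```go
package main

import "fmt"

// RunState is a hypothetical record persisted alongside the sync cursor.
type RunState struct {
	Cursor     string // last cursor saved by the syncer
	InProgress bool   // set at tick start, cleared only on clean completion
}

// beginTick marks a tick as started. If the previous tick never cleared the
// marker, the cursor may be ahead of the committed upserts, so it is reset
// to empty and the caller is told that recovery happened.
func beginTick(s *RunState) (recovered bool) {
	if s.InProgress {
		s.Cursor = ""
		recovered = true
	}
	s.InProgress = true
	return recovered
}

// endTick records the new cursor and clears the in-progress marker.
func endTick(s *RunState, newCursor string) {
	s.Cursor = newCursor
	s.InProgress = false
}

// demo simulates a tick that is killed after the cursor moved but before
// endTick ran, followed by the start of the next tick.
func demo() (recovered bool, cursor string) {
	state := &RunState{Cursor: "c0"}
	beginTick(state)
	state.Cursor = "c1" // cursor advanced, then the process was terminated
	recovered = beginTick(state)
	return recovered, state.Cursor
}

func main() {
	recovered, cursor := demo()
	if recovered {
		// An alertable log line so operators know a replay was triggered.
		fmt.Printf("previous tick did not complete; cursor reset to %q\n", cursor)
	}
}
```

A full enumeration is more expensive than replaying a single page, so this trades sync cost for not losing events; whether that is acceptable is a design decision for the maintainers.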

### Actual behavior

Cursor advances; device upserts are lost; Apple does not re-emit the events. Operator-side recovery (cursor reset in the DB) is required.

### Evidence

Representative log entries during the broken window:

```
err: "context canceled"
msg: "unlock failed"
cron: "apple_mdm_dep_profile_assigner"
```

```
err: "1 error occurred:\n\t* context canceled\n\n"
msg: "update cron stats apple_mdm_dep_profile_assigner"
schedule: "apple_mdm_dep_profile_assigner"
```

```
msg: "pending job might still be running, wait 1m0s"
```

Even after the runtime was fixed and the cron began reporting `completed` cleanly on each tick, multiple devices remained invisible to Fleet. ABM-side re-assignment attempts (Fleet → other MDM → Fleet) emitted only `deleted` events to the Fleet token; Apple's DEP API never followed up with a re-emission, so a simple re-assignment did not recover. Only a cursor reset (per the existing troubleshooting doc) brought the missing devices into Fleet.

In our case ~3 devices were affected. We confirmed by clearing the cursor and watching the next sync return a full enumeration including `op_type=modified, profile_status=assigned → pushed` for the previously-stuck serials.

### Suggested fix

- Wrap the cursor update + device upsert in a single transaction (in nanodep or in Fleet's wrapper layer), so context-cancel does not leave them in inconsistent states.
- Alternatively, when the `pending job might still be running` path detects a stale lock from a prior crash, treat that as a signal to reset the cursor automatically (with a log line so operators are aware).
- Either way, surfacing the "previous run did not complete cleanly" state more visibly (a counter, an alertable log line, or a Fleet UI indicator) would help operators catch this before devices go missing.
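
As an illustration of the first suggestion, here is a minimal Go sketch of committing the device upserts and the cursor together. It uses a hypothetical in-memory `Store` and `Tx` rather than Fleet's datastore, so every name below is an assumption made for the example.

```go
package main

import (
	"context"
	"fmt"
)

// Store is a hypothetical in-memory stand-in for the database.
type Store struct {
	Cursor string
	Hosts  map[string]bool
}

// Tx buffers writes so that nothing reaches the Store until Commit.
type Tx struct {
	store  *Store
	cursor string
	hosts  []string
}

func (s *Store) Begin() *Tx            { return &Tx{store: s, cursor: s.Cursor} }
func (t *Tx) UpsertHost(serial string) { t.hosts = append(t.hosts, serial) }
func (t *Tx) SetCursor(c string)       { t.cursor = c }

// Commit applies the buffered upserts and the cursor together, or applies
// nothing if the context is already cancelled.
func (t *Tx) Commit(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("commit aborted: %w", err)
	}
	for _, h := range t.hosts {
		t.store.Hosts[h] = true
	}
	t.store.Cursor = t.cursor
	return nil
}

// applyPage stages one page of device upserts plus the next cursor and
// commits them as a unit.
func applyPage(ctx context.Context, s *Store, serials []string, next string) error {
	tx := s.Begin()
	for _, serial := range serials {
		tx.UpsertHost(serial)
	}
	tx.SetCursor(next)
	return tx.Commit(ctx)
}

// demo applies the same page once with a cancelled context and once with a
// live one, reporting the cursor and host count after each attempt.
func demo() (cursorCancelled string, hostsCancelled int, cursorOK string, hostsOK int) {
	s := &Store{Cursor: "c0", Hosts: map[string]bool{}}
	serials := []string{"SERIAL-A", "SERIAL-B"}

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the container being stopped mid-tick
	_ = applyPage(ctx, s, serials, "c1")
	cursorCancelled, hostsCancelled = s.Cursor, len(s.Hosts)

	_ = applyPage(context.Background(), s, serials, "c1")
	cursorOK, hostsOK = s.Cursor, len(s.Hosts)
	return
}

func main() {
	fmt.Println(demo())
}
```

With a real database the same shape would likely map onto `database/sql`: `BeginTx` with the tick's context, `ExecContext` for the upserts and the cursor update, then `Commit`, relying on the driver to roll back if the context is cancelled first.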

### Related

- Existing troubleshooting doc: [docs/Contributing/product-groups/mdm/troubleshooting/resetting-apple-dep-sync-cursor.md](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/product-groups/mdm/troubleshooting/resetting-apple-dep-sync-cursor.md)
- fleetdm/fleet-terraform#242 / fleetdm/fleet-terraform#243 — tracks and patches the related GCP deployment-side trigger (the module did not set `cpu_idle = false` on the Cloud Run service). That PR mitigates the most common cause of context-cancellation on the GCP path but does **not** fix the underlying cursor-vs-upsert ordering bug described here; this issue remains valid for any other runtime where the container can be terminated mid-tick.
