| 1 | +--- |
| 2 | +id: schedule-missed-actions |
| 3 | +title: Troubleshoot missed Schedule Actions |
| 4 | +sidebar_label: Missed Schedule Actions |
| 5 | +description: Diagnose why a Schedule did not fire by alerting on the missed catchup window metric, then narrowing down to the affected Schedule with ListSchedules and DescribeSchedule. |
| 6 | +toc_max_heading_level: 4 |
| 7 | +keywords: |
| 8 | + - schedule |
| 9 | + - scheduled workflow |
| 10 | + - missed catchup window |
| 11 | + - troubleshooting |
| 12 | + - metrics |
| 13 | + - monitoring |
| 14 | +tags: |
| 15 | + - Schedules |
| 16 | + - Metrics |
| 17 | + - Observability |
| 18 | + - Troubleshooting |
| 19 | +--- |
| 20 | + |
| 21 | +When a [Schedule](/schedule) does not start a Workflow Execution at its expected time, either the Action was skipped intentionally (the Schedule was paused, the Overlap Policy prevented the start, or the end time was reached) or the Temporal Service could not take the Action within the [Catchup Window](/schedule#catchup-window). This guide covers the second case. |
| 22 | + |
| 23 | +## Alert on missed catchup window |
| 24 | + |
| 25 | +The Temporal Service emits a counter each time it skips a scheduled Action because it could not run it within the configured Catchup Window. Alert on any non-zero value. |
| 26 | + |
| 27 | +### Temporal Cloud |
| 28 | + |
| 29 | +Alert on [`temporal_cloud_v1_schedule_missed_catchup_window_count`](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_schedule_missed_catchup_window_count) grouped by `temporal_namespace`. |
| 30 | + |
| 31 | +Example PromQL: |
| 32 | + |
| 33 | +``` |
| 34 | +sum by (temporal_namespace) ( |
| 35 | + increase(temporal_cloud_v1_schedule_missed_catchup_window_count[5m]) |
| 36 | +) > 0 |
| 37 | +``` |
| 38 | + |
| 39 | +### Self-hosted |
| 40 | + |
| 41 | +Alert on [`schedule_missed_catchup_window`](/references/cluster-metrics#schedule_missed_catchup_window) grouped by `namespace`. |
| 42 | + |
| 43 | +Example PromQL: |
| 44 | + |
| 45 | +``` |
| 46 | +sum by (namespace) ( |
| 47 | + increase(schedule_missed_catchup_window[5m]) |
| 48 | +) > 0 |
| 49 | +``` |
| 50 | + |
| 51 | +On both Temporal Cloud and self-hosted deployments, the metric is scoped to the Namespace, not to individual Schedules. A non-zero value tells you that at least one Schedule in the Namespace missed an Action, but not which one. |
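
To make the alert condition concrete, the following is a minimal Python sketch of what `increase(...) > 0` evaluates per Namespace. It is a simplification, not Prometheus's implementation: it assumes one series per Namespace and ignores the extrapolation that the real `increase()` applies. The function names and sample values are hypothetical.

```python
def counter_increase(samples):
    """Sum the growth of a counter over a window, treating a drop as a reset."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the growth.
        total += curr - prev if curr >= prev else curr
    return total


def alerting_namespaces(series_by_namespace):
    """Return the Namespaces whose counter grew during the window."""
    return sorted(
        ns for ns, samples in series_by_namespace.items()
        if counter_increase(samples) > 0
    )


# Hypothetical samples scraped during one 5-minute window.
window = {"payments": [4, 4, 6], "reporting": [0, 0, 0], "ingest": [9, 1]}
print(alerting_namespaces(window))  # ['ingest', 'payments']
```

A Namespace whose counter stays flat, even at a non-zero cumulative value, does not alert; only growth inside the window does.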
| 52 | + |
| 53 | +## Investigate which Schedule missed an Action |
| 54 | + |
| 55 | +Once the alert fires, narrow down to the affected Schedule in two steps. |
| 56 | + |
| 57 | +### 1. List Schedules in the Namespace |
| 58 | + |
| 59 | +Enumerate the Schedules in the alerting Namespace: |
| 60 | + |
| 61 | +``` |
| 62 | +temporal schedule list --namespace <your-namespace> |
| 63 | +``` |
| 64 | + |
| 65 | +[`ListSchedules`](/cli/schedule#list) returns Schedule Ids and summary information. It does not return per-Schedule miss counters, so use it only to produce the set of Schedule Ids to inspect. |
| 66 | + |
| 67 | +### 2. Describe each Schedule |
| 68 | + |
| 69 | +For each Schedule Id returned, run: |
| 70 | + |
| 71 | +``` |
| 72 | +temporal schedule describe \ |
| 73 | + --schedule-id <your-schedule-id> \ |
| 74 | + --namespace <your-namespace> |
| 75 | +``` |
| 76 | + |
| 77 | +[`DescribeSchedule`](/cli/schedule#describe) returns full Schedule state, including the `info` block with cumulative counters. The relevant fields: |
| 78 | + |
| 79 | +| Field | Meaning | |
| 80 | +|-------|---------| |
| 81 | +| `missedCatchupWindow` | Actions skipped because they could not run within the Catchup Window. Non-zero here identifies the Schedule responsible for the alert. | |
| 82 | +| `overlapSkipped` | Actions skipped because the previous run was still in progress and the Overlap Policy is `Skip`. | |
| 83 | +| `bufferDropped` | Buffered Actions dropped because the buffer was full under `BufferOne` or `BufferAll`. | |
| 84 | +| `bufferSize` | Current depth of the Action buffer. | |
| 85 | +| `recentActions` | Most recent Action times and results. | |
| 86 | +| `runningWorkflows` | Workflow Executions currently running for this Schedule. | |
| 87 | + |
| 88 | +Scripting the fan-out against the JSON output (`temporal schedule describe -o json`) is usually faster than inspecting each Schedule interactively. |
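
One way to script the fan-out is sketched below in Python. It assumes you have already captured the JSON text of `temporal schedule describe -o json` for each Schedule Id, and that the counter appears as `info.missedCatchupWindow`, as in the table above. Whether the CLI omits the field when it is zero or encodes it as a string is an assumption here, so the sketch tolerates both. The function name and sample data are hypothetical.

```python
import json


def schedules_with_misses(describe_outputs):
    """Return (schedule_id, missed_count) pairs for Schedules with missed Actions.

    describe_outputs maps a Schedule Id to the JSON text captured from
    `temporal schedule describe -o json` for that Schedule.
    """
    flagged = []
    for schedule_id, raw in describe_outputs.items():
        info = json.loads(raw).get("info", {})
        # Assumption: the counter may be absent when zero, or encoded as a string.
        missed = int(info.get("missedCatchupWindow", 0) or 0)
        if missed > 0:
            flagged.append((schedule_id, missed))
    # Largest miss count first.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical captured outputs for two Schedules.
sample = {
    "nightly-report": '{"info": {"missedCatchupWindow": "3", "overlapSkipped": "0"}}',
    "hourly-sync": '{"info": {"overlapSkipped": "2"}}',
}
print(schedules_with_misses(sample))  # [('nightly-report', 3)]
```

Keeping the parsing separate from the CLI invocation makes the filter easy to test against saved output before running it across a large Namespace.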
| 89 | + |
| 90 | +## Interpret the result |
| 91 | + |
| 92 | +Once you have identified the Schedule with a non-zero `missedCatchupWindow`, use the rest of the `DescribeSchedule` output to determine impact and root cause. |
| 93 | + |
| 94 | +### Assess impact |
| 95 | + |
| 96 | +- Compare `recentActions` to the Schedule's Spec to determine how many Actions were skipped and over what time period. |
| 97 | +- If the Schedule uses the `Skip` Overlap Policy and the preceding run was long-running, the miss may reflect that run exceeding the Catchup Window, not a Service outage. |
| 98 | +- For business-critical Schedules, [Backfill](/schedule#backfill) the skipped interval once the underlying cause is resolved. |
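
The comparison of `recentActions` against the Spec can be sketched as follows. This Python example assumes a simple fixed-interval Spec and a list of Action times that you have already extracted from the `DescribeSchedule` output. Calendar or cron-style Specs would need a different expected-time generator. The function name, tolerance, and sample times are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def missing_action_times(start, end, every, actual_times,
                         tolerance=timedelta(seconds=1)):
    """Return expected fire times in [start, end) with no matching actual Action."""
    missing = []
    expected = start
    while expected < end:
        # An expected time counts as covered if any Action falls within the tolerance.
        if not any(abs(actual - expected) <= tolerance for actual in actual_times):
            missing.append(expected)
        expected += every
    return missing


# Hypothetical hourly Schedule with Actions recorded at 00:00, 01:00, and 04:00 UTC.
t0 = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
actual = [t0, t0 + timedelta(hours=1), t0 + timedelta(hours=4)]
gaps = missing_action_times(t0, t0 + timedelta(hours=5), timedelta(hours=1), actual)
print([g.isoformat() for g in gaps])
# ['2024-01-01T02:00:00+00:00', '2024-01-01T03:00:00+00:00']
```

The resulting gap list gives both the number of skipped Actions and the time range to consider for a Backfill. Because `recentActions` holds only the most recent entries, older gaps may not be visible this way.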
| 99 | + |
| 100 | +### Common root causes |
| 101 | + |
| 102 | +- **Service or Namespace outage longer than the Catchup Window.** The default Catchup Window is one year, so a miss typically means the Schedule is configured with a tighter window (minimum ten seconds) and the outage exceeded it. |
| 103 | +- **Namespace rate limiting.** If scheduled starts are throttled, Actions can queue past the Catchup Window. Cross-check [`temporal_cloud_v1_schedule_rate_limited_count`](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_schedule_rate_limited_count) (Cloud) or [`schedule_rate_limited`](/references/cluster-metrics#schedule_rate_limited) (self-hosted) in the same time range. |
| 104 | +- **Buffer overruns under `BufferAll`.** Long-running Workflow Executions under `BufferAll` can push buffered Actions past the Catchup Window. Cross-check [`temporal_cloud_v1_schedule_buffer_overruns_count`](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_schedule_buffer_overruns_count) (Cloud) or [`schedule_buffer_overruns`](/references/cluster-metrics#schedule_buffer_overruns) (self-hosted) and examine `bufferSize`. |
| 105 | + |
| 106 | +### Remediate |
| 107 | + |
| 108 | +- Widen the Catchup Window if the current value is tighter than your Service's worst-case unavailability. The trade-off is that more late Actions will fire during recovery. |
| 109 | +- Revisit the Overlap Policy if runs routinely exceed the Spec interval. Under sustained delay, `Skip` drops overlapping Actions (counted in `overlapSkipped`), whereas `BufferAll` queues them, and queued Actions can age past the Catchup Window. |
| 110 | +- Increase Namespace throughput limits if rate limiting is the contributing factor. |
| 111 | +- [Backfill](/schedule#backfill) the missed interval if the skipped Actions need to run. |
| 112 | + |
| 113 | +## Related reading |
| 114 | + |
| 115 | +- [Schedule concept](/schedule) |
| 116 | +- [Catchup Window](/schedule#catchup-window) |
| 117 | +- [Temporal CLI schedule reference](/cli/schedule) |
| 118 | +- [Temporal Cloud OpenMetrics metrics reference](/cloud/metrics/openmetrics/metrics-reference) |
| 119 | +- [Self-hosted cluster metrics reference](/references/cluster-metrics) |