Commit c806759

Add troubleshooting guide for missed Schedule Actions
Documents the workflow for diagnosing why a Schedule did not fire: alert on the missed catchup window metric (temporal_cloud_v1_schedule_missed_catchup_window_count for Cloud, schedule_missed_catchup_window for self-hosted), enumerate Schedules with ListSchedules, then inspect DescribeSchedule.info.missedCatchupWindow per Schedule to identify the affected one. Includes root-cause cross-checks against rate-limit and buffer-overrun metrics, plus remediation guidance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 84cb6e3 commit c806759

3 files changed

Lines changed: 121 additions & 0 deletions

docs/troubleshooting/index.mdx

Lines changed: 1 addition & 0 deletions
@@ -22,3 +22,4 @@ Our troubleshooting guides are designed to help you quickly identify and resolve
2222
The "Context: deadline exceeded" error occurs when requests to the Temporal Service by the Client or Worker cannot be completed.
2323
This can be due to network issues, timeouts, server overload, or Query errors.
2424
- [Troubleshoot the Failed Reaching Server Error](/troubleshooting/last-connection-error): The message "Failed reaching server: last connection error" often happens due to an expired TLS certificate or during the Server startup process when Client requests reach the Server before roles are fully initialized.
25+
- [Troubleshoot missed Schedule Actions](/troubleshooting/schedule-missed-actions): When a Schedule does not fire at its expected time, alert on the missed catchup window metric, then narrow down to the affected Schedule with `ListSchedules` and `DescribeSchedule`.
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
1+
---
2+
id: schedule-missed-actions
3+
title: Troubleshoot missed Schedule Actions
4+
sidebar_label: Missed Schedule Actions
5+
description: Diagnose why a Schedule did not fire by alerting on the missed catchup window metric, then narrowing down to the affected Schedule with ListSchedules and DescribeSchedule.
6+
toc_max_heading_level: 4
7+
keywords:
8+
- schedule
9+
- scheduled workflow
10+
- missed catchup window
11+
- troubleshooting
12+
- metrics
13+
- monitoring
14+
tags:
15+
- Schedules
16+
- Metrics
17+
- Observability
18+
- Troubleshooting
19+
---
20+
21+
When a [Schedule](/schedule) does not start a Workflow Execution at its expected time, there are two possible explanations. Either the Action was skipped intentionally (the Schedule was paused, the Overlap Policy skipped it, or the Schedule's end time was reached), or the Temporal Service could not take the Action within the [Catchup Window](/schedule#catchup-window). This guide covers the second case.
22+
23+
## Alert on missed catchup window
24+
25+
The Temporal Service emits a counter each time it skips a scheduled Action because it could not run it within the configured Catchup Window. Alert on any non-zero increase over your evaluation window.
26+
27+
### Temporal Cloud
28+
29+
Alert on [`temporal_cloud_v1_schedule_missed_catchup_window_count`](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_schedule_missed_catchup_window_count) grouped by `temporal_namespace`.
30+
31+
Example PromQL:
32+
33+
```
34+
sum by (temporal_namespace) (
35+
increase(temporal_cloud_v1_schedule_missed_catchup_window_count[5m])
36+
) > 0
37+
```
38+
39+
### Self-hosted
40+
41+
Alert on [`schedule_missed_catchup_window`](/references/cluster-metrics#schedule_missed_catchup_window) grouped by `namespace`.
42+
43+
Example PromQL:
44+
45+
```
46+
sum by (namespace) (
47+
increase(schedule_missed_catchup_window[5m])
48+
) > 0
49+
```
50+
51+
In both Temporal Cloud and self-hosted deployments, the metric is scoped to the Namespace, not to individual Schedules. A non-zero value tells you that at least one Schedule in the Namespace missed an Action, but not which one.
52+
53+
## Investigate which Schedule missed an Action
54+
55+
Once the alert fires, narrow down to the affected Schedule in two steps.
56+
57+
### 1. List Schedules in the Namespace
58+
59+
Enumerate the Schedules in the alerting Namespace:
60+
61+
```
62+
temporal schedule list --namespace <your-namespace>
63+
```
64+
65+
[`ListSchedules`](/cli/schedule#list) returns Schedule Ids and summary information. It does not return per-Schedule miss counters, so use it only to produce the set of Schedule Ids to inspect.
66+
67+
### 2. Describe each Schedule
68+
69+
For each Schedule Id returned, run:
70+
71+
```
72+
temporal schedule describe \
73+
--schedule-id <your-schedule-id> \
74+
--namespace <your-namespace>
75+
```
76+
77+
[`DescribeSchedule`](/cli/schedule#describe) returns full Schedule state, including the `info` block, which holds cumulative counters and recent run state. The relevant fields:
78+
79+
| Field | Meaning |
80+
|-------|---------|
81+
| `missedCatchupWindow` | Actions skipped because they could not run within the Catchup Window. Non-zero here identifies the Schedule responsible for the alert. |
82+
| `overlapSkipped` | Actions skipped because the previous run was still in progress and the Overlap Policy is `Skip`. |
83+
| `bufferDropped` | Buffered Actions dropped because the buffer was full under `BufferOne` or `BufferAll`. |
84+
| `bufferSize` | Current depth of the Action buffer. |
85+
| `recentActions` | Most recent Action times and results. |
86+
| `runningWorkflows` | Workflow Executions currently running for this Schedule. |
87+
88+
Scripting the fan-out against the JSON output (`temporal schedule describe -o json`) is usually faster than inspecting each Schedule interactively.
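
The fan-out can be sketched in Python as follows. This is a minimal sketch, not an official tool: it assumes the `temporal` CLI is on your `PATH`, that the JSON output exposes the counter at `info.missedCatchupWindow`, and that 64-bit counters may be serialized as strings or omitted when zero. The helper names (`missed_count`, `describe`, `find_affected`) are hypothetical, and Schedule Ids are read from standard input, one per line.

```python
import json
import subprocess
import sys


def missed_count(describe_json: str) -> int:
    """Return info.missedCatchupWindow from describe output (0 if absent)."""
    info = json.loads(describe_json).get("info") or {}
    # Assumption: the counter may arrive as a string, or be omitted when zero.
    return int(info.get("missedCatchupWindow", 0))


def describe(schedule_id: str, namespace: str) -> str:
    """Run `temporal schedule describe -o json` and return its stdout."""
    result = subprocess.run(
        ["temporal", "schedule", "describe",
         "--schedule-id", schedule_id,
         "--namespace", namespace,
         "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def find_affected(schedule_ids, namespace, describe_fn=describe):
    """Map each Schedule Id with a non-zero miss counter to that counter."""
    affected = {}
    for schedule_id in schedule_ids:
        count = missed_count(describe_fn(schedule_id, namespace))
        if count > 0:
            affected[schedule_id] = count
    return affected


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_missed.py <your-namespace> < schedule_ids.txt
    ids = [line.strip() for line in sys.stdin if line.strip()]
    for schedule_id, count in find_affected(ids, sys.argv[1]).items():
        print(f"{schedule_id}\t{count}")
```

The `describe_fn` parameter exists only so the parsing logic can be exercised without a running Temporal Service. Because the counter is cumulative, a non-zero value may predate the alert, so compare it against an earlier reading if you have one.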
89+
90+
## Interpret the result
91+
92+
Once you have identified the Schedule with a non-zero `missedCatchupWindow`, use the rest of the `DescribeSchedule` output to determine impact and root cause.
93+
94+
### Assess impact
95+
96+
- Compare `recentActions` to the Schedule's Spec to determine how many Actions were skipped and over what time period.
97+
- If the Schedule uses the `Skip` Overlap Policy and the preceding run was long-running, check `overlapSkipped` as well. Skips caused by a run that is still in progress are reported there and reflect the Overlap Policy, not a Service outage.
98+
- For business-critical Schedules, [Backfill](/schedule#backfill) the skipped interval once the underlying cause is resolved.
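
The comparison in the first bullet can be sketched as a set difference between the nominal times implied by the Spec and the times that appear in `recentActions`. This is a minimal sketch that assumes a fixed-interval Spec with no jitter and that you have already extracted the recorded nominal times into `datetime` values. The function names and sample data are hypothetical, and `recentActions` holds only the most recent Actions, so older gaps will not be visible.

```python
from datetime import datetime, timedelta, timezone


def expected_times(start, end, interval):
    """Yield every nominal Action time in [start, end) for a fixed interval."""
    t = start
    while t < end:
        yield t
        t += interval


def skipped_times(start, end, interval, recorded):
    """Return nominal times in the window that have no recorded Action."""
    seen = set(recorded)
    return [t for t in expected_times(start, end, interval) if t not in seen]


# Hypothetical data: an hourly Schedule whose 02:00 and 03:00 Actions are absent.
day = datetime(2024, 1, 1, tzinfo=timezone.utc)
recorded = [day + timedelta(hours=h) for h in (0, 1, 4, 5)]
gaps = skipped_times(day, day + timedelta(hours=6), timedelta(hours=1), recorded)
print([t.isoformat() for t in gaps])
```

The first and last entries of the result bound the interval you would pass to a Backfill.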
99+
100+
### Common root causes
101+
102+
- **Service or Namespace outage longer than the Catchup Window.** The default Catchup Window is one year, so a miss typically means the Schedule is configured with a tighter window (minimum ten seconds) and the outage exceeded it.
103+
- **Namespace rate limiting.** If scheduled starts are throttled, Actions can queue past the Catchup Window. Cross-check [`temporal_cloud_v1_schedule_rate_limited_count`](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_schedule_rate_limited_count) (Cloud) or [`schedule_rate_limited`](/references/cluster-metrics#schedule_rate_limited) (self-hosted) in the same time range.
104+
- **Buffer overruns under `BufferAll`.** Long-running Workflow Executions under `BufferAll` can push buffered Actions past the Catchup Window. Cross-check [`temporal_cloud_v1_schedule_buffer_overruns_count`](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_schedule_buffer_overruns_count) (Cloud) or [`schedule_buffer_overruns`](/references/cluster-metrics#schedule_buffer_overruns) (self-hosted) and examine `bufferSize`.
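
The first cause above can be illustrated with a small model of how an outage interacts with the Catchup Window. This is a simplified sketch, not the Service's implementation: it assumes an Action is missed when its nominal time is older than the Catchup Window at the moment the Service recovers, and the exact boundary comparison is an assumption. The function name and the sample outage are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def missed_during_outage(nominal_times, recovered_at, catchup_window):
    """Return nominal times older than the Catchup Window at recovery."""
    cutoff = recovered_at - catchup_window
    return [t for t in nominal_times if t < cutoff]


# Hypothetical outage: Actions due every 5 minutes from 12:00 to 12:25,
# with the Service unavailable until 12:30.
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
nominal = [start + timedelta(minutes=5 * i) for i in range(6)]
recovered = start + timedelta(minutes=30)

# With a 10-minute Catchup Window, only Actions due at 12:20 or later are
# still eligible to run at recovery.
missed = missed_during_outage(nominal, recovered, timedelta(minutes=10))
print(len(missed), "missed;", len(nominal) - len(missed), "caught up")

# With the default one-year window, the same outage misses nothing.
print(len(missed_during_outage(nominal, recovered, timedelta(days=365))), "missed")
```

Under this model, widening the window reduces misses at the cost of more late Actions firing once the Service is back.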
105+
106+
### Remediate
107+
108+
- Widen the Catchup Window if the current value is tighter than your Service's worst-case unavailability. The trade-off is that more late Actions will fire during recovery.
109+
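- Revisit the Overlap Policy if runs routinely exceed the Spec interval. Under sustained delay, `Skip` drops overlapping Actions (counted in `overlapSkipped`), while `BufferAll` queues them and can drop Actions once the buffer is full (counted in `bufferDropped`).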
110+
- Increase Namespace throughput limits if rate limiting is the contributing factor.
111+
- [Backfill](/schedule#backfill) the missed interval if the skipped Actions need to run.
112+
113+
## Related reading
114+
115+
- [Schedule concept](/schedule)
116+
- [Catchup Window](/schedule#catchup-window)
117+
- [Temporal CLI schedule reference](/cli/schedule)
118+
- [Temporal Cloud OpenMetrics metrics reference](/cloud/metrics/openmetrics/metrics-reference)
119+
- [Self-hosted cluster metrics reference](/references/cluster-metrics)

sidebars.js

Lines changed: 1 addition & 0 deletions
@@ -1266,6 +1266,7 @@ module.exports = {
12661266
'troubleshooting/deadline-exceeded-error',
12671267
'troubleshooting/last-connection-error',
12681268
'troubleshooting/performance-bottlenecks',
1269+
'troubleshooting/schedule-missed-actions',
12691270
],
12701271
},
12711272
{
