1 change: 1 addition & 0 deletions docs/troubleshooting/index.mdx
@@ -22,3 +22,4 @@ Our troubleshooting guides are designed to help you quickly identify and resolve
The "Context: deadline exceeded" error occurs when requests to the Temporal Service by the Client or Worker cannot be completed.
This can be due to network issues, timeouts, server overload, or Query errors.
- [Troubleshoot the Failed Reaching Server Error](/troubleshooting/last-connection-error): The message "Failed reaching server: last connection error" often happens due to an expired TLS certificate or during the Server startup process when Client requests reach the Server before roles are fully initialized.
- [Troubleshoot missed Schedule Actions](/troubleshooting/schedule-missed-actions): When a Schedule does not fire at its expected time, alert on the missed catchup window metric, then narrow down to the affected Schedule with `ListSchedules` and `DescribeSchedule`.
119 changes: 119 additions & 0 deletions docs/troubleshooting/schedule-missed-actions.mdx
@@ -0,0 +1,119 @@
---
id: schedule-missed-actions
title: Troubleshoot missed Schedule Actions
sidebar_label: Missed Schedule Actions
description: Diagnose why a Schedule did not fire by alerting on the missed catchup window metric, then narrowing down to the affected Schedule with ListSchedules and DescribeSchedule.
toc_max_heading_level: 4
keywords:
- schedule
- scheduled workflow
- missed catchup window
- troubleshooting
- metrics
- monitoring
tags:
- Schedules
- Metrics
- Observability
- Troubleshooting
---

When a [Schedule](/schedule) does not start a Workflow Execution at its expected time, either the Action was skipped intentionally (the Schedule was paused, the Overlap Policy suppressed the run, or the Schedule's end time was reached) or the Temporal Service could not take the Action within the [Catchup Window](/schedule#catchup-window). This guide covers the second case.

## Alert on missed catchup window

The Temporal Service increments a counter each time it skips a scheduled Action because it could not run it within the configured Catchup Window. Because the counter is cumulative, alert on any increase over the evaluation window rather than on the raw value.

### Temporal Cloud

Alert on [`temporal_cloud_v1_schedule_missed_catchup_window_count`](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_schedule_missed_catchup_window_count) grouped by `temporal_namespace`.

Example PromQL:

```
sum by (temporal_namespace) (
  increase(temporal_cloud_v1_schedule_missed_catchup_window_count[5m])
) > 0
```

### Self-hosted

Alert on [`schedule_missed_catchup_window`](/references/cluster-metrics#schedule_missed_catchup_window) grouped by `namespace`.

Example PromQL:

```
sum by (namespace) (
  increase(schedule_missed_catchup_window[5m])
) > 0
```

The metric is scoped to the Namespace, not to individual Schedules. A non-zero increase tells you that at least one Schedule in the Namespace missed an Action, but not which one.

## Investigate which Schedule missed an Action

Once the alert fires, narrow down to the affected Schedule in two steps.

### 1. List Schedules in the Namespace

Enumerate the Schedules in the alerting Namespace:

```
temporal schedule list --namespace <your-namespace>
```

[`ListSchedules`](/cli/schedule#list) returns Schedule Ids and summary information. It does not return per-Schedule miss counters, so use it only to produce the set of Schedule Ids to inspect.

### 2. Describe each Schedule

For each Schedule Id returned, run:

```
temporal schedule describe \
  --schedule-id <your-schedule-id> \
  --namespace <your-namespace>
```

[`DescribeSchedule`](/cli/schedule#describe) returns full Schedule state, including the `info` block with cumulative counters. The relevant fields:

| Field | Meaning |
|-------|---------|
| `missedCatchupWindow` | Actions skipped because they could not run within the Catchup Window. Non-zero here identifies the Schedule responsible for the alert. |
| `overlapSkipped` | Actions skipped because the previous run was still in progress and the Overlap Policy is `Skip`. |
| `bufferDropped` | Buffered Actions dropped because the buffer was full under `BufferOne` or `BufferAll`. |
| `bufferSize` | Current depth of the Action buffer. |
| `recentActions` | Most recent Action times and results. |
| `runningWorkflows` | Workflow Executions currently running for this Schedule. |

Scripting the fan-out against the JSON output (`temporal schedule describe -o json`) is usually faster than inspecting each Schedule interactively.
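The fan-out can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: it assumes the JSON output exposes the counter as `info.missedCatchupWindow`, possibly serialized as a numeric string, and the function names are hypothetical. Verify the field path against your CLI version's actual output before relying on it.

```python
import json
import subprocess


def missed_count(describe_doc: dict) -> int:
    """Read the missed-catchup counter from one parsed describe document.

    Assumption: the counter lives at info.missedCatchupWindow and is either
    a number or a numeric string. A missing key is treated as 0.
    """
    info = describe_doc.get("info") or {}
    return int(info.get("missedCatchupWindow") or 0)


def schedules_with_misses(docs_by_id: dict[str, dict]) -> dict[str, int]:
    """Return {schedule_id: count} for Schedules with a non-zero counter."""
    counts = {sid: missed_count(doc) for sid, doc in docs_by_id.items()}
    return {sid: n for sid, n in counts.items() if n > 0}


def describe(schedule_id: str, namespace: str) -> dict:
    """Run `temporal schedule describe -o json` and parse its output."""
    result = subprocess.run(
        ["temporal", "schedule", "describe",
         "--schedule-id", schedule_id,
         "--namespace", namespace,
         "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

With a list of Schedule Ids from step 1, `schedules_with_misses({sid: describe(sid, ns) for sid in ids})` yields only the Schedules worth inspecting further.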

## Interpret the result

Once you have identified the Schedule with a non-zero `missedCatchupWindow`, use the rest of the `DescribeSchedule` output to determine impact and root cause.

### Assess impact

- Compare `recentActions` to the Schedule's Spec to determine how many Actions were skipped and over what time period.
- If the Schedule uses the `Skip` Overlap Policy and the preceding run was long-running, the miss may reflect that run exceeding the Catchup Window, not a Service outage.
- For business-critical Schedules, [Backfill](/schedule#backfill) the skipped interval once the underlying cause is resolved.
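The comparison in the first bullet can be approximated with a short script. The sketch below assumes a simple fixed-interval Spec and that you have already extracted the scheduled times from `recentActions` into a list; the function names and the one-second tolerance are illustrative choices, not part of any Temporal API. Because `recentActions` holds only the most recent entries, restrict the comparison to the period they cover.

```python
from datetime import datetime, timedelta, timezone


def expected_times(first: datetime, interval: timedelta,
                   end: datetime) -> list[datetime]:
    """Nominal Action times from `first` up to and including `end`."""
    times = []
    t = first
    while t <= end:
        times.append(t)
        t += interval
    return times


def skipped_times(expected: list[datetime], actual: list[datetime],
                  tolerance: timedelta = timedelta(seconds=1)) -> list[datetime]:
    """Expected times that have no recorded Action within `tolerance`."""
    return [e for e in expected
            if not any(abs(a - e) <= tolerance for a in actual)]


# Example: a ten-minute interval where two consecutive Actions are absent.
start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
expected = expected_times(start, timedelta(minutes=10),
                          start + timedelta(minutes=40))
actual = [start, start + timedelta(minutes=10), start + timedelta(minutes=40)]
gaps = skipped_times(expected, actual)
```

The first and last entries of `gaps` bound the interval that a Backfill would need to cover.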

### Common root causes

- **Service or Namespace outage longer than the Catchup Window.** The default Catchup Window is one year, so a miss typically means the Schedule is configured with a tighter window (minimum ten seconds) and the outage exceeded it.
- **Namespace rate limiting.** If scheduled starts are throttled, Actions can queue past the Catchup Window. Cross-check [`temporal_cloud_v1_schedule_rate_limited_count`](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_schedule_rate_limited_count) (Cloud) or [`schedule_rate_limited`](/references/cluster-metrics#schedule_rate_limited) (self-hosted) in the same time range.
- **Buffer overruns under `BufferAll`.** Long-running Workflow Executions under `BufferAll` can push buffered Actions past the Catchup Window. Cross-check [`temporal_cloud_v1_schedule_buffer_overruns_count`](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_schedule_buffer_overruns_count) (Cloud) or [`schedule_buffer_overruns`](/references/cluster-metrics#schedule_buffer_overruns) (self-hosted) and examine `bufferSize`.
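To reason about the first cause, it can help to estimate which Actions an outage would push past the Catchup Window. The sketch below uses a deliberately simplified model, stated here as an assumption: an Action is treated as missed when the time between its nominal start and the moment the Service recovers exceeds the Catchup Window. The function name is hypothetical and the model ignores rate limiting and buffering.

```python
from datetime import datetime, timedelta, timezone


def missed_during_outage(nominal_times: list[datetime],
                         recovered_at: datetime,
                         catchup_window: timedelta) -> list[datetime]:
    """Nominal times already older than the Catchup Window at recovery.

    Simplified model (assumption): Actions due during the outage can only
    be taken at `recovered_at`, and any Action older than `catchup_window`
    at that point is skipped.
    """
    return [t for t in nominal_times
            if t <= recovered_at and recovered_at - t > catchup_window]


# Example: one-hour outage, ten-minute interval, 30-minute Catchup Window.
outage_start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
recovered = outage_start + timedelta(hours=1)
due = [outage_start + timedelta(minutes=10 * i) for i in range(7)]
missed = missed_during_outage(due, recovered, timedelta(minutes=30))
```

Under this model, widening the Catchup Window to cover the full outage duration leaves no Actions in `missed`, at the cost of more late Actions firing on recovery.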

### Remediate

- Widen the Catchup Window if the current value is tighter than your Service's worst-case unavailability. The trade-off is that more late Actions will fire during recovery.
- Revisit the Overlap Policy if runs routinely exceed the Spec interval. Under sustained delay, `Skip` drops overlapping Actions (counted in `overlapSkipped`), while `BufferAll` queues them and can push them past the Catchup Window.
- Increase Namespace throughput limits if rate limiting is the contributing factor.
- [Backfill](/schedule#backfill) the missed interval if the skipped Actions need to run.

## Related reading

- [Schedule concept](/schedule)
- [Catchup Window](/schedule#catchup-window)
- [Temporal CLI schedule reference](/cli/schedule)
- [Temporal Cloud OpenMetrics metrics reference](/cloud/metrics/openmetrics/metrics-reference)
- [Self-hosted cluster metrics reference](/references/cluster-metrics)
1 change: 1 addition & 0 deletions sidebars.js
@@ -1266,6 +1266,7 @@ module.exports = {
'troubleshooting/deadline-exceeded-error',
'troubleshooting/last-connection-error',
'troubleshooting/performance-bottlenecks',
'troubleshooting/schedule-missed-actions',
],
},
{