Commit a9f980c

troubleshooting: serverless workers (#4521)
* troubleshooting: serverless workers
* copy edits
* reorder checks
* update troubleshooting guide
* add brief wci explanation
* add link to troubleshooting docs
* address feedback
* clarify WCI role
1 parent 7fcf0a7 commit a9f980c

5 files changed

Lines changed: 199 additions & 0 deletions

docs/encyclopedia/workers/serverless-workers.mdx

Lines changed: 28 additions & 0 deletions
@@ -61,6 +61,34 @@ Temporal does not need to know anything about the Worker's infrastructure.

With Serverless Workers, Temporal starts the Worker.

### Worker Controller Instance {#worker-controller-instance}

The Worker Controller Instance (WCI) is a system Workflow that scales Serverless Workers based on Task Queue conditions.
One WCI Workflow runs per Worker Deployment Version that has a compute provider configured. The WCI runs in the same
Namespace as your Worker Deployment.

The WCI responds to two triggers. The Matching Service notifies the WCI when a sync match failure occurs, meaning a Task
arrived but no Worker was available to handle it. The WCI also polls Task Queue backlogs on a schedule to detect
accumulated work. When either trigger fires, the WCI produces a scaling action, such as invoking the configured compute
provider (for example, calling AWS Lambda's `Invoke` API) to start new Workers.

You can list WCI Workflows in your Namespace:

```bash
temporal workflow list \
  --namespace <NAMESPACE> \
  --query 'TemporalNamespaceDivision = "TemporalWorkerControllerInstance"'
```

WCI Workflow IDs follow the pattern `temporal-sys-worker-controller-instance:<deployment-name>:<build-id>`. You can
inspect a WCI Workflow's history to see its recent Activity results:

```bash
temporal workflow show \
  --namespace <NAMESPACE> \
  --workflow-id 'temporal-sys-worker-controller-instance:<DEPLOYMENT_NAME>:<BUILD_ID>'
```
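
If you script these lookups, a small shell helper can build the Workflow ID from the pattern above. This is a minimal sketch: `wci_workflow_id` is a hypothetical helper name, not part of the Temporal CLI, and it assumes you pass the deployment name and build ID exactly as they were configured.

```bash
# Hypothetical helper (not part of the Temporal CLI): builds a WCI Workflow ID
# from a deployment name ($1) and build ID ($2), following the documented pattern.
wci_workflow_id() {
  printf 'temporal-sys-worker-controller-instance:%s:%s\n' "$1" "$2"
}

# Example usage, with the same placeholders as the commands above:
#   temporal workflow show \
#     --namespace <NAMESPACE> \
#     --workflow-id "$(wci_workflow_id <DEPLOYMENT_NAME> <BUILD_ID>)"
```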

<CaptionedImage
  src="/diagrams/serverless-worker-flow.svg"
  srcDark="/diagrams/serverless-worker-flow-dark.svg"

docs/production-deployment/worker-deployments/serverless-workers/aws-lambda.mdx

Lines changed: 2 additions & 0 deletions
@@ -596,3 +596,5 @@ You can verify the invocation by checking:
- **Temporal UI:** The Workflow execution should show task completions in the event history.
- **AWS CloudWatch Logs:** The Lambda function's log group (`/aws/lambda/my-temporal-worker`) should show invocation
  logs with the Worker startup, task processing, and graceful shutdown.

If the Workflow does not progress or the Lambda is not invoked, see [Troubleshoot Serverless Workers](/troubleshooting/serverless-workers).

docs/troubleshooting/index.mdx

Lines changed: 1 addition & 0 deletions
@@ -22,3 +22,4 @@ Our troubleshooting guides are designed to help you quickly identify and resolve
The "Context: deadline exceeded" error occurs when requests to the Temporal Service by the Client or Worker cannot be completed.
This can be due to network issues, timeouts, server overload, or Query errors.
- [Troubleshoot the Failed Reaching Server Error](/troubleshooting/last-connection-error): The message "Failed reaching server: last connection error" often happens due to an expired TLS certificate or during the Server startup process when Client requests reach the Server before roles are fully initialized.
- [Troubleshoot Serverless Workers](/troubleshooting/serverless-workers): Diagnose issues with Serverless Workers on AWS Lambda by tracing the invocation flow from Task Queue to Worker execution.
Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
---
id: serverless-workers
title: Troubleshoot Serverless Workers
sidebar_label: Serverless Workers
description:
  Diagnose and fix issues with Temporal Serverless Workers on AWS Lambda by tracing the invocation flow from Task Queue
  to Worker execution.
toc_max_heading_level: 4
keywords:
  - serverless
  - lambda
  - troubleshooting
  - worker
  - invocation
tags:
  - Workers
  - Serverless
  - Troubleshooting
  - AWS Lambda
---

:::tip SUPPORT, STABILITY, and DEPENDENCY INFO

Serverless Workers are in [Pre-release](/evaluate/development-production-features/release-stages#pre-release).

APIs are experimental and may be subject to backwards-incompatible changes.

:::

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This page walks through the Serverless Worker invocation flow and helps you identify where a failure is occurring.

When a Serverless Worker invocation works correctly, the following sequence happens:

1. You deploy the Worker function on Lambda.
2. You configure a [Worker Deployment Version](/worker-versioning#worker-deployment-version) with a compute provider. This starts a [Worker Controller Instance (WCI)](/serverless-workers#how-invocation-works) Workflow and triggers a validation invocation of the Lambda function.
3. The Worker in the Lambda function polls the Temporal Service successfully, which binds the [Task Queue](/encyclopedia/task-queues) configured on the Worker to the Worker Deployment Version.
4. The WCI polls the associated Task Queue's backlog on a schedule. The [Matching Service](/clusters#matching-service) also notifies the WCI Workflow of sync match failures as soon as they happen.
5. A Task arrives on the Task Queue and the WCI detects the backlog.
6. The WCI invokes the Lambda function.
7. The Lambda function starts, and the Worker connects to Temporal and polls the Task Queue.
8. The Worker processes Tasks and shuts down gracefully.

Start by determining whether the Lambda function is being invoked at all, then narrow down from there.

## Is the Lambda function being invoked? {#is-lambda-invoked}

Check the Lambda function's CloudWatch metrics or invocation logs.

In the AWS Console, go to **Lambda > Functions > your function > Monitor**. Look for recent invocations in the
**Invocations** graph. You can also check **CloudWatch > Log groups > /aws/lambda/your-function-name** for execution
logs.
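
The same check can be run from a terminal with the AWS CLI. The sketch below assumes the AWS CLI is installed and configured for the function's account and Region. `lambda_invocation_count` is a hypothetical wrapper name, and the start and end times are ISO 8601 timestamps that you supply.

```bash
# Hypothetical wrapper: reports Lambda invocations in 5-minute buckets for
# function $1 between timestamps $2 and $3, using the AWS/Lambda Invocations metric.
lambda_invocation_count() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name Invocations \
    --dimensions "Name=FunctionName,Value=$1" \
    --start-time "$2" \
    --end-time "$3" \
    --period 300 \
    --statistics Sum
}

# Example usage:
#   lambda_invocation_count your-function-name 2025-01-01T00:00:00Z 2025-01-01T01:00:00Z
```

An empty `Datapoints` list in the response means CloudWatch recorded no invocations in that window.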

If there are no invocations, continue to [Lambda is not being invoked](#lambda-not-invoked).

If the Lambda is being invoked but Workflows are not progressing, skip to
[Lambda is invoked but Tasks are not completing](#lambda-invoked-not-completing).

## Lambda is not being invoked {#lambda-not-invoked}

Work through the following checks in order.

### Validate the connection to Lambda {#validate-connection}

Start by verifying that Temporal can reach the Lambda function. In the Temporal UI, go to **Workers > Deployments >
select your deployment**, open the **Actions** menu on the version, and click **Validate Connection**. A successful
validation confirms that the Worker Deployment Version has a compute provider configured, that Temporal can assume the
invocation role, and that the Lambda function can be invoked.

If validation fails, verify that the Lambda function ARN and invocation role ARN in the Worker Deployment Version
configuration are correct. Verify the invocation role was created using the
[CloudFormation template](/production-deployment/worker-deployments/serverless-workers/aws-lambda#create-invocation-role)
and that the External ID matches the value in the Worker Deployment Version configuration.

If the Worker Deployment Version does not have a compute provider configured, no
[Worker Controller Instance (WCI)](/serverless-workers#how-invocation-works) Workflow exists and the Lambda is never
automatically invoked. A common cause is manually invoking the Lambda function before creating the Worker Deployment
Version in the UI or CLI. When the Lambda runs, the Worker connects to Temporal and polls the Task Queue. That polling
registers the Worker Deployment Version and binds the Task Queue on the server, but the version has no compute provider.
To fix the issue, create or update the Worker Deployment Version with the compute provider flags as described in the
[deploy guide](/production-deployment/worker-deployments/serverless-workers/aws-lambda#create-worker-deployment-version).

### Check that the version is set as current {#check-version-current}

The Worker Deployment Version must be set as the current version for new Tasks to route to it. If you created the
version through the CLI, you need to
[set it as current](/production-deployment/worker-deployments/serverless-workers/aws-lambda#set-current-version).

You can verify the current version with `temporal worker deployment describe`.

### Check that the WCI is detecting Tasks {#check-wci-detecting-tasks}

If the connection validates successfully but the Lambda is still not being invoked, the
[Worker Controller Instance (WCI)](/serverless-workers#worker-controller-instance) may not be detecting Tasks on the
Task Queue.

Check which Task Queues are bound to the Worker Deployment Version and whether there is a backlog:

```bash
temporal worker deployment describe-version \
  --namespace <NAMESPACE> \
  --deployment-name <DEPLOYMENT_NAME> \
  --build-id <BUILD_ID> \
  --report-task-queue-stats
```

If no Task Queues are listed, the binding has not been established. The server binds a Task Queue to a Worker Deployment
Version when a Worker with that deployment version successfully connects and polls the Task Queue.

A common cause is a failed first invocation. When you create a Worker Deployment Version, the WCI invokes the Lambda to
validate the configuration. If that first invocation fails (for example, due to missing environment variables, incorrect
TLS configuration, or missing dependencies), the Worker never connects to Temporal and never polls. Without a successful
poll, the Task Queue binding is never created.

To diagnose a failed first invocation, invoke the Lambda function manually from the AWS Console. The console displays
the execution result and any errors directly, making it easier to identify configuration issues than searching through
CloudWatch logs. Once the Lambda runs successfully and the Worker connects to Temporal, the Task Queue binding is
established.
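
If you prefer a terminal to the console, the AWS CLI can perform the same manual invocation. This is a minimal sketch, assuming the AWS CLI is configured and that the function accepts an invocation without a payload (this guide does not specify the expected payload). `invoke_and_show_logs` is a hypothetical name, and the response file path is arbitrary.

```bash
# Hypothetical wrapper: invokes function $1 synchronously and prints the tail
# of its execution log, which Lambda returns base64-encoded in LogResult.
invoke_and_show_logs() {
  aws lambda invoke \
    --function-name "$1" \
    --log-type Tail \
    --query 'LogResult' \
    --output text \
    /tmp/lambda-response.json | base64 --decode
}

# Example usage:
#   invoke_and_show_logs your-function-name
#   cat /tmp/lambda-response.json   # the function's response payload
```

The log tail covers only the last few kilobytes of output, so fall back to the CloudWatch log group if the startup error is cut off.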

## Lambda is invoked but Tasks are not completing {#lambda-invoked-not-completing}

If CloudWatch shows Lambda invocations but Workflows are not progressing, the problem is in the Worker's execution
within the Lambda function.

### Check Lambda execution logs {#check-execution-logs}

Check CloudWatch logs for errors during Worker startup. In the AWS Console, go to **CloudWatch > Log groups >
/aws/lambda/your-function-name** and look for recent error messages.

Common errors include:

- **Connection failures**: The Worker cannot reach the Temporal Service. Check that the `TEMPORAL_ADDRESS` and
  `TEMPORAL_API_KEY` environment variables (or `temporal.toml` config file) are correctly set on the Lambda function.
  For self-hosted deployments, verify
  [network reachability](/production-deployment/worker-deployments/serverless-workers/self-hosted-setup#ensure-reachability).
- **TLS errors**: The TLS certificate or key is missing, expired, or does not match the Namespace.
- **Authentication errors**: The API key is invalid or does not have access to the Namespace.
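
To rule out a connection failure caused by a wrong address, you can read the configured value back from Lambda with the AWS CLI. A minimal sketch, assuming the Worker is configured through environment variables rather than `temporal.toml`. `lambda_temporal_address` is a hypothetical name, and the query deliberately selects only `TEMPORAL_ADDRESS` so the API key is not printed to the terminal.

```bash
# Hypothetical wrapper: prints the TEMPORAL_ADDRESS environment variable
# configured on Lambda function $1, without printing other variables.
lambda_temporal_address() {
  aws lambda get-function-configuration \
    --function-name "$1" \
    --query 'Environment.Variables.TEMPORAL_ADDRESS' \
    --output text
}

# Example usage:
#   lambda_temporal_address your-function-name
```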

### Check for Lambda timeout {#check-lambda-timeout}

If the Lambda function reaches its configured timeout before the Worker finishes processing, AWS terminates the
invocation.

The Worker begins graceful shutdown before the Lambda deadline. If Activities take longer than the available execution
window, the Activities are abandoned mid-execution and retried on the next invocation.

For long-running Activities, increase the Lambda timeout and the Worker's shutdown buffer together. See
[Tuning for long-running Activities](/serverless-workers#tuning-for-long-running-activities) for guidance on how these
values relate.
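
The Lambda side of that change can be inspected and applied with the AWS CLI. This sketch covers only the Lambda timeout; the Worker's shutdown buffer is set in your Worker code and is not shown here. The function names `lambda_timeout` and `set_lambda_timeout` are hypothetical.

```bash
# Hypothetical wrapper: prints the configured timeout, in seconds, of function $1.
lambda_timeout() {
  aws lambda get-function-configuration \
    --function-name "$1" \
    --query 'Timeout' \
    --output text
}

# Hypothetical wrapper: sets the timeout of function $1 to $2 seconds.
# AWS Lambda caps the timeout at 900 seconds (15 minutes).
set_lambda_timeout() {
  aws lambda update-function-configuration \
    --function-name "$1" \
    --timeout "$2"
}

# Example usage:
#   lambda_timeout your-function-name
#   set_lambda_timeout your-function-name 300
```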

### Check that the deployment name and build ID match {#check-deployment-match}

If CloudWatch shows rapid, repeated invocations with no Workflow progress, the deployment name or build ID in the Worker
code may not match the Worker Deployment Version configuration.

The deployment name and build ID in your Lambda function code must exactly match the values you used when creating the
Worker Deployment Version. Compare the values in your code against the WCI Workflow ID
(`temporal-sys-worker-controller-instance:<deployment-name>:<build-id>`) and the output of
`temporal worker deployment describe`.

A mismatch causes an invocation loop: the WCI invokes the Lambda, the Worker starts and polls with a different
deployment version than the WCI expects, the Task is not processed, and the WCI invokes the Lambda again.

To fix the loop, update the deployment name and build ID in the Worker code to match the Worker Deployment Version, then
redeploy the Lambda function.
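
The comparison against the WCI Workflow ID can be scripted. The sketch below is illustrative only: `check_wci_match` is a hypothetical helper, and it assumes the build ID contains no colon, so that the last colon in the Workflow ID separates the deployment name from the build ID.

```bash
# Hypothetical helper: compares the deployment name ($2) and build ID ($3) from
# your Worker code against a WCI Workflow ID ($1). Prints each mismatch and
# returns non-zero if any value differs.
check_wci_match() {
  local wci_id="$1" code_deployment="$2" code_build_id="$3"
  local prefix='temporal-sys-worker-controller-instance:'
  local rest="${wci_id#"$prefix"}"      # strip the fixed prefix
  local id_build="${rest##*:}"          # text after the last colon
  local id_deployment="${rest%:*}"      # text before the last colon
  local status=0
  if [ "$id_deployment" != "$code_deployment" ]; then
    echo "deployment name mismatch: WCI has '$id_deployment', code has '$code_deployment'"
    status=1
  fi
  if [ "$id_build" != "$code_build_id" ]; then
    echo "build ID mismatch: WCI has '$id_build', code has '$code_build_id'"
    status=1
  fi
  return "$status"
}

# Example usage:
#   check_wci_match 'temporal-sys-worker-controller-instance:<DEPLOYMENT_NAME>:<BUILD_ID>' \
#     <NAME_IN_CODE> <BUILD_ID_IN_CODE>
```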

sidebars.js

Lines changed: 1 addition & 0 deletions
@@ -1400,6 +1400,7 @@ module.exports = {
        'troubleshooting/deadline-exceeded-error',
        'troubleshooting/last-connection-error',
        'troubleshooting/performance-bottlenecks',
        'troubleshooting/serverless-workers',
      ],
    },
    {
