From 3e1a0751fecb63d085b48e82c7b921d02ae3988e Mon Sep 17 00:00:00 2001 From: Ben Echols Date: Thu, 26 Mar 2026 16:04:13 -0700 Subject: [PATCH 1/2] Add cloud auth and observability guidance --- docs/best-practices/cloud-access-control.mdx | 61 ++++++++++++++++--- docs/cloud/get-started/api-keys.mdx | 6 ++ docs/cloud/get-started/service-accounts.mdx | 5 ++ docs/cloud/metrics/index.mdx | 11 ++++ .../openmetrics/metrics-integrations.mdx | 9 +++ docs/cloud/worker-health.mdx | 4 ++ 6 files changed, 89 insertions(+), 7 deletions(-) diff --git a/docs/best-practices/cloud-access-control.mdx b/docs/best-practices/cloud-access-control.mdx index a103067183..4dc77b16af 100644 --- a/docs/best-practices/cloud-access-control.mdx +++ b/docs/best-practices/cloud-access-control.mdx @@ -22,6 +22,13 @@ Temporal Cloud supports two secure authentication methods for Workers: Both options help secure communication between workers and Temporal Cloud. Choosing the right method and managing it properly is key to maintaining security and minimizing downtime. +Use this page to define your operating model for machine access to Temporal Cloud. For setup steps and product-specific +mechanics, see [Manage API keys](/cloud/api-keys) and [Manage service accounts](/cloud/service-accounts). + +Related guidance: +- [Namespace best practices](/best-practices/managing-namespace) +- [Multi-tenant application patterns](/production-deployment/multi-tenant-patterns) + The high-level end-to-end rotation process is: 1. **Generate new credentials**: Create new certificates or API keys in Temporal Cloud before the current ones expire @@ -45,17 +52,57 @@ In the case that you are using multiple certificates signed by the same CA, and One convention is to give certificates a common name that matches the namespace. If you do this when using the same CA for dev and prod, then you can leverage Certificate Filters to prevent access to production environments. 
This is described in detail under the [authorization section](https://docs.temporal.io/cloud/certificates#control-authorization) of the documentation. -## Best practices: -#### 1. Establish clear guidelines on authentication methods: Teams should standardize on either [mTLS certificates](https://docs.temporal.io/cloud/certificates) or [API keys](https://docs.temporal.io/cloud/api-keys) for the following operations: +## Best practices + +### Establish clear guidelines on authentication methods + +Teams should standardize on either [mTLS certificates](https://docs.temporal.io/cloud/certificates) or +[API keys](https://docs.temporal.io/cloud/api-keys) for the following operations: - Connect Temporal clients to Temporal Cloud (e.g. Worker processes) - Automation (e.g. Temporal Cloud [Operations API](https://docs.temporal.io/ops), [Terraform provider](https://docs.temporal.io/cloud/terraform-provider), [Temporal CLI](https://docs.temporal.io/cli/setup-cli)) - By default, it is recommended for teams to use API keys and [service accounts](https://docs.temporal.io/cloud/service-accounts) for both operations because API keys are easier to manage and rotate for most teams. In addition, you can control account-level and namespace-level roles for service accounts. +By default, we recommend that teams use API keys and [service accounts](https://docs.temporal.io/cloud/service-accounts) for both operations, because API keys are easier for most teams to manage and rotate. Service accounts also let you control account-level and namespace-level roles.
+ +If your organization requires mutual authentication and stronger cryptographic guarantees, use mTLS certificates to authenticate Temporal clients to Temporal Cloud and API keys for automation, because the Temporal Cloud [Operations API](https://docs.temporal.io/ops) and [Terraform provider](https://docs.temporal.io/cloud/terraform-provider) support only API keys for authentication. + +### Default operating model for service accounts and API keys + +For most organizations, use the following defaults: + +- Create one Service Account per service or worker deployment, not one shared Service Account for an entire team +- Scope credentials to the smallest practical set of Namespaces +- Use account-level Service Accounts only when a service genuinely needs cross-Namespace or account-wide access +- Prefer Namespace-scoped Service Accounts when a service should only access one Namespace + +This approach gives you cleaner ownership, easier rotation, and better auditability than sharing a single machine +identity across multiple services. + +### Use access boundaries that match your Namespace boundaries + +The way you partition Namespaces should usually match the way you partition machine identities. + +- If multiple services share a Namespace, you may still want one Service Account per service so that each deployment can + rotate credentials independently. +- If you split workloads into separate Namespaces for security, capacity, or team ownership reasons, those Namespaces + should usually have separate Service Accounts and API keys as well. +- If you use Namespace-per-tenant isolation, expect your credential model and RBAC model to become correspondingly more + granular. + +For more on topology tradeoffs, see [Namespace best practices](/best-practices/managing-namespace) and +[Multi-tenant application patterns](/production-deployment/multi-tenant-patterns).
+ +### Rotate credentials without downtime + +Use the following sequence for both API keys and client certificates: - If your organization requires mutual authentication and stronger cryptographic guarantees, then it is encouraged for your teams to use mTLS certificates to authenticate Temporal clients to Temporal Cloud and use API keys for automation (because Temporal Cloud [Operations API](https://docs.temporal.io/ops) and [Terraform provider](https://docs.temporal.io/cloud/terraform-provider) only supports API key for authentication) +1. Create the replacement credential before the existing one expires. +2. Configure your secret store or deployment system so both old and new credentials can be used during the transition. +3. Roll your Workers and clients to load the new credential. +4. Validate connectivity and normal Workflow execution using the new credential. +5. Remove the old credential only after all clients and Workers have switched. -#### 2. Use Certificate Filters to restrict access when using shared CAs (e.g., `dev` vs `prod`): +### Use Certificate Filters to restrict access when using shared CAs (e.g., `dev` vs `prod`) - Certificate Filters are an additional way of validating using the client certificate presented during client authentication. Give certificates a common name that matches the namespace. This is not a requirement. +Certificate Filters add a second check on the client certificate presented during client authentication, on top of validation against the CA. One convention, which is not required, is to give each certificate a common name that matches its Namespace. - If you do this when using the same CA for dev and prod environments, then you can leverage Certificate Filters to prevent access to production. +If you follow this convention and use the same CA for dev and prod environments, you can use Certificate Filters to prevent dev certificates from accessing production Namespaces.
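The default operating model for Service Accounts can be expressed as a small lint check over a credential plan. The following is a minimal sketch and is not part of any Temporal tool. `ServiceAccountPlan`, its `scope` labels, and `lint_plan` are hypothetical names. The sketch assumes that, for each Service Account, you can list the services that use it and the Namespaces it can reach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceAccountPlan:
    """Hypothetical record for one Service Account in a credential plan."""

    name: str
    scope: str  # assumed labels: "namespace" (Namespace-scoped) or "account" (account-level)
    services: frozenset  # services or worker deployments that use this identity
    namespaces: frozenset  # Namespaces this identity can access


def lint_plan(plans, needs):
    """Return findings where a plan departs from the default operating model.

    `needs` maps each service name to the set of Namespaces it actually uses.
    """
    findings = []
    for sa in plans:
        # One Service Account per service or worker deployment.
        if len(sa.services) > 1:
            findings.append(
                f"{sa.name}: shared by {len(sa.services)} services; create one per service"
            )
        # Grant access only to the Namespaces the services actually use.
        required = set().union(*(needs.get(s, set()) for s in sa.services))
        extra = sa.namespaces - required
        if extra:
            findings.append(f"{sa.name}: access to unused Namespaces {sorted(extra)}")
        # Account-level scope only for cross-Namespace or account-wide needs.
        if sa.scope == "account" and len(required) <= 1:
            findings.append(
                f"{sa.name}: account-level scope for a single-Namespace service; "
                "prefer a Namespace-scoped Service Account"
            )
    return findings
```

Consider a single account-level Service Account that an `orders` service and a `shipping` service share, and that also has access to an unused `payments` Namespace. The check reports two findings for it: one for the sharing and one for the unused Namespace.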
diff --git a/docs/cloud/get-started/api-keys.mdx b/docs/cloud/get-started/api-keys.mdx index 99ffbc6dad..7808c4e71c 100644 --- a/docs/cloud/get-started/api-keys.mdx +++ b/docs/cloud/get-started/api-keys.mdx @@ -57,6 +57,9 @@ The authentication process follows this pathway: unexpected or unauthorized activity. - **Use a Key Management System (KMS)**: Employ a Key Management System to minimize the risk of key leaks. +For guidance on which identities should own API keys, when to use Namespace-scoped Service Accounts, and how to align +API keys with your Namespace topology, see [Managing Temporal Cloud access control](/best-practices/cloud-access-control). + ### API key use cases API keys are used for the following scenarios: @@ -223,6 +226,9 @@ Temporal API keys automatically expire based on the specified expiration time. F 1. Switch clients to load the new key and start using it. 1. Delete the old key after it is no longer in use. +For a broader machine-identity rotation strategy across API keys and Service Accounts, see +[Managing Temporal Cloud access control](/best-practices/cloud-access-control). + ## Manage API keys for Service Accounts {#serviceaccount-api-keys} Global Administrators and Account Owners can manage and generate API keys for _all_ Service Accounts in their account. diff --git a/docs/cloud/get-started/service-accounts.mdx b/docs/cloud/get-started/service-accounts.mdx index d90380b0a3..0a6230cdad 100644 --- a/docs/cloud/get-started/service-accounts.mdx +++ b/docs/cloud/get-started/service-accounts.mdx @@ -33,6 +33,11 @@ With the addition of Service Accounts, Temporal Cloud now supports 2 identity ty Service Accounts use API Keys as the authentication mechanism to connect to Temporal Cloud. You should use Service Accounts to represent a non-human identity when authenticating to Temporal Cloud for operations automation or the Temporal SDKs and the Temporal CLI for Workflow Execution and management. 
+For guidance on how to structure Service Accounts across services, Namespaces, and teams, see +[Managing Temporal Cloud access control](/best-practices/cloud-access-control). A common default is one Service Account +per service or worker deployment, with Namespace-scoped Service Accounts preferred when a service only needs access to a +single Namespace. + :::tip Namespace Admins can now manage and create [Namespace-scoped Service Accounts](/cloud/service-accounts#scoped), regardless of their Account Role. diff --git a/docs/cloud/metrics/index.mdx b/docs/cloud/metrics/index.mdx index 2eb5600d89..eed425713a 100644 --- a/docs/cloud/metrics/index.mdx +++ b/docs/cloud/metrics/index.mdx @@ -35,6 +35,17 @@ SDK metrics monitor individual workers and your code's behavior. Cloud metrics monitor Temporal behavior. When used together, Temporal Cloud and SDK metrics measure the health and performance of your full Temporal infrastructure, including the Temporal Cloud Service and user-supplied Temporal Workers. +Use the following rule of thumb when deciding which signal to rely on: + +| Question | Primary signal | +|---|---| +| Is Temporal Cloud accepting and serving work normally? | Cloud metrics | +| Are Tasks backing up in a Task Queue? | Cloud metrics plus SDK Schedule-To-Start metrics | +| Are my Workers saturated, under-provisioned, or misconfigured? | SDK metrics | +| Is my application logic, downstream dependency, or Activity behavior unhealthy? | SDK metrics and traces | + +For a Worker-focused view of how to combine these signals, see [Monitor worker health](/cloud/worker-health). + Cloud Metrics for all Namespaces in your account are available from two sources: - [OpenMetrics Endpoint](/cloud/metrics/openmetrics) - A Prometheus-compatible scrapable endpoint. 
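One way to apply the rule-of-thumb table is to record which signal sources you actually ingest, then list the questions you cannot yet answer. This is a minimal sketch. The source labels `"cloud"`, `"sdk"`, and `"traces"` and the function name `blind_spots` are illustrative assumptions. The mapping restates the table and adds nothing to it.

```python
# The metrics rule-of-thumb table, encoded as data.
# Keys are the table's questions; values are the signal sources each question needs.
SIGNALS_BY_QUESTION = {
    "Is Temporal Cloud accepting and serving work normally?": {"cloud"},
    "Are Tasks backing up in a Task Queue?": {"cloud", "sdk"},
    "Are my Workers saturated, under-provisioned, or misconfigured?": {"sdk"},
    "Is my application logic, downstream dependency, or Activity behavior unhealthy?": {"sdk", "traces"},
}


def blind_spots(ingested):
    """Return the questions that the ingested signal sources cannot fully answer."""
    have = set(ingested)
    return [q for q, needed in SIGNALS_BY_QUESTION.items() if not needed <= have]
```

If only `{"cloud"}` is ingested, three of the four questions come back as blind spots.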
diff --git a/docs/cloud/metrics/openmetrics/metrics-integrations.mdx b/docs/cloud/metrics/openmetrics/metrics-integrations.mdx index da3b92dd56..7282154995 100644 --- a/docs/cloud/metrics/openmetrics/metrics-integrations.mdx +++ b/docs/cloud/metrics/openmetrics/metrics-integrations.mdx @@ -39,6 +39,15 @@ Temporal Cloud OpenMetrics support is available in [Public Preview](/evaluate/d Datadog provides a serverless integration with the OpenMetrics endpoint. This integration will scrape metrics, store them in Datadog, and provides a default dashboard with some built in monitors. See the [integration page](https://docs.datadoghq.com/integrations/temporal-cloud-openmetrics/) for more details. +For Datadog users, treat this integration as the Cloud-side half of your observability setup: + +- Use OpenMetrics in Datadog to monitor Temporal Cloud behavior such as Task Queue backlog, poll success, and rate limiting. +- Use SDK metrics from your Workers to monitor saturation, Schedule-To-Start latency, slot availability, and sticky cache behavior. +- Use tracing separately when you need execution-path debugging through your application and Activity code. + +If you ingest only Cloud metrics, you will miss many Worker-side bottlenecks. For recommended Worker monitors, see +[Monitor worker health](/cloud/worker-health). + ### Grafana Cloud Grafana provides a serverless integration with the OpenMetrics endpoint for Grafana Cloud.
This integration will scrape metrics, store them in Grafana Cloud, and provides a default dashboard diff --git a/docs/cloud/worker-health.mdx b/docs/cloud/worker-health.mdx index 2470ee2853..5e065ec29e 100644 --- a/docs/cloud/worker-health.mdx +++ b/docs/cloud/worker-health.mdx @@ -39,6 +39,10 @@ This page is a guide to monitoring a Temporal Worker fleet and covers the follow - [How to detect misconfigured Workers](#detect-misconfigured-workers) - [How to configure Sticky cache](#configure-sticky-cache) +This page assumes you are monitoring both Worker-side SDK metrics and Cloud-side metrics. Use SDK metrics to understand +what your Workers are doing, and Cloud metrics to understand what Temporal Cloud is seeing at the Task Queue and service +level. For an overview of how these signals fit together, see [Temporal Cloud observability and metrics](/cloud/metrics). + ## Minimal Observations {#minimal-observations} These alerts should be configured and understood first to gain intelligence into your application health and behaviors. 
From 3ae963ed6e3ffba523e4c3e607d359033f4b7e92 Mon Sep 17 00:00:00 2001 From: Ben Echols Date: Thu, 26 Mar 2026 16:28:10 -0700 Subject: [PATCH 2/2] Clarify service account scope and API key rotation --- docs/best-practices/cloud-access-control.mdx | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/best-practices/cloud-access-control.mdx b/docs/best-practices/cloud-access-control.mdx index 4dc77b16af..030fe5aac2 100644 --- a/docs/best-practices/cloud-access-control.mdx +++ b/docs/best-practices/cloud-access-control.mdx @@ -70,9 +70,9 @@ If your organization requires mutual authentication and stronger cryptographic g For most organizations, use the following defaults: - Create one Service Account per service or worker deployment, not one shared Service Account for an entire team -- Scope credentials to the smallest practical set of Namespaces - Use account-level Service Accounts only when a service genuinely needs cross-Namespace or account-wide access - Prefer Namespace-scoped Service Accounts when a service should only access one Namespace +- Grant Service Accounts namespace-level access only to the specific Namespaces they need This approach gives you cleaner ownership, easier rotation, and better auditability than sharing a single machine identity across multiple services. @@ -93,11 +93,11 @@ For more on topology tradeoffs, see [Namespace best practices](/best-practices/m ### Rotate credentials without downtime -Use the following sequence for both API keys and client certificates: +Use the following sequence when rotating credentials: 1. Create the replacement credential before the existing one expires. -2. Configure your secret store or deployment system so both old and new credentials can be used during the transition. -3. Roll your Workers and clients to load the new credential. +2. For API keys, keep the old key active while you roll your Workers and clients to the new key. +3.
For client certificates, stage the new certificate alongside the old one, if your deployment process supports it, so that clients can switch before the old certificate is removed. 4. Validate connectivity and normal Workflow execution using the new credential. 5. Remove the old credential only after all clients and Workers have switched.
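The rotation sequence can be turned into two checks: when to start a rotation, and whether it is safe to remove the old credential. This is a minimal sketch. It assumes that you track each credential's expiry and, for every Worker or client deployment, which credential it loads and whether it passed validation. `Credential`, `ClientStatus`, the 14-day lead time, and both functions are hypothetical and are not Temporal APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Credential:
    key_id: str  # hypothetical identifier for an API key or client certificate
    expires_at: datetime


@dataclass(frozen=True)
class ClientStatus:
    name: str  # Worker or client deployment name
    key_id: str  # credential this deployment currently loads
    validated: bool  # connectivity and Workflow execution checked with that credential


def should_create_replacement(old, now, lead_time=timedelta(days=14)):
    """Step 1: start rotating while the old credential still has lead_time left.

    The 14-day default is an assumed policy, not a Temporal Cloud requirement.
    """
    return old.expires_at - now <= lead_time


def removal_blockers(old, new, clients, now):
    """Step 5: return the reasons the old credential cannot be removed yet.

    An empty list means every deployment has switched to the new credential
    and has been validated with it.
    """
    blockers = []
    if new.expires_at <= now:
        blockers.append(f"replacement {new.key_id} has expired")
    for c in clients:
        if c.key_id == old.key_id:
            blockers.append(f"{c.name} still loads {old.key_id}")
        elif c.key_id != new.key_id:
            blockers.append(f"{c.name} loads unexpected credential {c.key_id}")
        elif not c.validated:
            blockers.append(f"{c.name} has not been validated with {new.key_id}")
    return blockers
```

The removal check returns a list of reasons rather than a single boolean, which makes it easier to see which deployment is holding up the final step.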