Add cloud auth and observability guidance #4351
base: main
Changes from 2 commits
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
|
|
@@ -22,6 +22,13 @@ Temporal Cloud supports two secure authentication methods for Workers: | |||||||
|
|
||||||||
| Both options help secure communication between workers and Temporal Cloud. Choosing the right method and managing it properly is key to maintaining security and minimizing downtime. | ||||||||
|
|
||||||||
| Use this page to define your operating model for machine access to Temporal Cloud. For setup steps and product-specific | ||||||||
| mechanics, see [Manage API keys](/cloud/api-keys) and [Manage service accounts](/cloud/service-accounts). | ||||||||
|
|
||||||||
| Related guidance: | ||||||||
| - [Namespace best practices](/best-practices/managing-namespace) | ||||||||
| - [Multi-tenant application patterns](/production-deployment/multi-tenant-patterns) | ||||||||
|
|
||||||||
| The high-level end-to-end rotation process is: | ||||||||
|
|
||||||||
| 1. **Generate new credentials**: Create new certificates or API keys in Temporal Cloud before the current ones expire | ||||||||
|
|
@@ -45,17 +52,57 @@ In the case that you are using multiple certificates signed by the same CA, and | |||||||
|
|
||||||||
| One convention is to give certificates a common name that matches the namespace. If you do this when using the same CA for dev and prod, then you can leverage Certificate Filters to prevent access to production environments. This is described in detail under the [authorization section](https://docs.temporal.io/cloud/certificates#control-authorization) of the documentation. | ||||||||
|
|
||||||||
| ## Best practices: | ||||||||
| #### 1. Establish clear guidelines on authentication methods: Teams should standardize on either [mTLS certificates](https://docs.temporal.io/cloud/certificates) or [API keys](https://docs.temporal.io/cloud/api-keys) for the following operations: | ||||||||
| ## Best practices | ||||||||
|
|
||||||||
| ### Establish clear guidelines on authentication methods | ||||||||
|
|
||||||||
| Teams should standardize on either [mTLS certificates](https://docs.temporal.io/cloud/certificates) or | ||||||||
| [API keys](https://docs.temporal.io/cloud/api-keys) for the following operations: | ||||||||
| - Connect Temporal clients to Temporal Cloud (e.g. Worker processes) | ||||||||
| - Automation (e.g. Temporal Cloud [Operations API](https://docs.temporal.io/ops), [Terraform provider](https://docs.temporal.io/cloud/terraform-provider), [Temporal CLI](https://docs.temporal.io/cli/setup-cli)) | ||||||||
|
|
||||||||
| By default, teams should use API keys and [service accounts](https://docs.temporal.io/cloud/service-accounts) for both operations, because API keys are easier for most teams to manage and rotate. In addition, you can control account-level and namespace-level roles for service accounts. | ||||||||
|
|
||||||||
| If your organization requires mutual authentication and stronger cryptographic guarantees, your teams should use mTLS certificates to authenticate Temporal clients to Temporal Cloud and API keys for automation, because the Temporal Cloud [Operations API](https://docs.temporal.io/ops) and [Terraform provider](https://docs.temporal.io/cloud/terraform-provider) support only API keys for authentication. | ||||||||
|
|
||||||||
|
|
||||||||
| ### Default operating model for service accounts and API keys | ||||||||
|
|
||||||||
| For most organizations, use the following defaults: | ||||||||
|
|
||||||||
| - Create one Service Account per service or worker deployment, not one shared Service Account for an entire team | ||||||||
| - Use account-level Service Accounts only when a service genuinely needs cross-Namespace or account-wide access | ||||||||
| - Prefer Namespace-scoped Service Accounts when a service should only access one Namespace | ||||||||
| - Grant Service Accounts namespace-level access only to the specific Namespaces they need | ||||||||
|
|
||||||||
| This approach gives you cleaner ownership, easier rotation, and better auditability than sharing a single machine | ||||||||
| identity across multiple services. | ||||||||
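
The defaults above can be checked mechanically against an inventory of machine identities. The following is a minimal sketch, assuming a hypothetical in-house inventory format: `ServiceAccount`, `audit_service_accounts`, and the `"namespace"` / `"account"` scope labels are illustrative names, not Temporal Cloud APIs.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceAccount:
    """Hypothetical inventory record; not a Temporal Cloud API object."""
    name: str
    scope: str  # "namespace" or "account" (illustrative labels)
    services: list[str] = field(default_factory=list)    # services using this identity
    namespaces: list[str] = field(default_factory=list)  # Namespaces it can access


def audit_service_accounts(accounts: list[ServiceAccount]) -> list[str]:
    """Return human-readable findings that deviate from the defaults above."""
    findings = []
    for sa in accounts:
        # Default: one Service Account per service or worker deployment.
        if len(sa.services) > 1:
            findings.append(f"{sa.name}: shared by {len(sa.services)} services")
        # Default: account-level scope only for genuine cross-Namespace access.
        if sa.scope == "account" and len(sa.namespaces) <= 1:
            findings.append(f"{sa.name}: account-level scope but uses at most one Namespace")
        # Default: a Namespace-scoped account maps to exactly one Namespace.
        if sa.scope == "namespace" and len(sa.namespaces) != 1:
            findings.append(f"{sa.name}: Namespace-scoped but lists {len(sa.namespaces)} Namespaces")
    return findings
```

Running a check like this in CI against whatever inventory you already keep (for example, Terraform state exported to a list) may surface shared identities before they become a rotation problem.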
|
|
||||||||
| ### Use access boundaries that match your Namespace boundaries | ||||||||
|
|
||||||||
| The way you partition Namespaces should usually match the way you partition machine identities. | ||||||||
|
|
||||||||
| - If multiple services share a Namespace, you may still want one Service Account per service so that each deployment can | ||||||||
|
Contributor
Author
These are just line-wrapping artifacts; they shouldn't affect anything. |
||||||||
| rotate credentials independently. | ||||||||
|
|
||||||||
| - If you split workloads into separate Namespaces for security, capacity, or team ownership reasons, those Namespaces | ||||||||
|
|
||||||||
| should usually have separate Service Accounts and API keys as well. | ||||||||
|
|
||||||||
| - If you use Namespace-per-tenant isolation, expect your credential model and RBAC model to become correspondingly more | ||||||||
|
|
||||||||
| granular. | ||||||||
|
|
||||||||
|
|
||||||||
| For more on topology tradeoffs, see [Namespace best practices](/best-practices/managing-namespace) and | ||||||||
| [Multi-tenant application patterns](/production-deployment/multi-tenant-patterns). | ||||||||
|
|
||||||||
| ### Rotate credentials without downtime | ||||||||
|
|
||||||||
| Use the following sequence when rotating credentials: | ||||||||
|
|
||||||||
| 1. Create the replacement credential before the existing one expires. | ||||||||
| 2. For API keys, create the new key while the old key is still valid, then roll your Workers and clients to use the new key. | ||||||||
| 3. For client certificates, stage the new certificate before removing the old one when your deployment process supports that transition. | ||||||||
| 4. Validate connectivity and normal Workflow execution using the new credential. | ||||||||
| 5. Remove the old credential only after all clients and Workers have switched. | ||||||||
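
The ordering constraint in the steps above (remove the old credential only after every client has switched and been validated) can be sketched as a small state tracker. This is an illustrative model only: `RotationState` and its methods are hypothetical names, and the sketch does not call any Temporal Cloud API.

```python
from dataclasses import dataclass, field


@dataclass
class RotationState:
    """Tracks which credential each client uses during a rotation (illustrative)."""
    old_key: str
    new_key: str
    clients: dict[str, str] = field(default_factory=dict)  # client name -> key in use
    validated: set[str] = field(default_factory=set)

    def roll(self, client: str) -> None:
        # Steps 2-3: switch a client to the new credential while the old one still works.
        self.clients[client] = self.new_key

    def mark_validated(self, client: str) -> None:
        # Step 4: record success only for clients already using the new credential.
        if self.clients.get(client) != self.new_key:
            raise ValueError(f"{client} has not switched to the new credential")
        self.validated.add(client)

    def can_remove_old(self) -> bool:
        # Step 5: the old credential is safe to remove only when every known
        # client has switched to the new credential and been validated.
        return bool(self.clients) and all(
            key == self.new_key and name in self.validated
            for name, key in self.clients.items()
        )
```

A deployment script could consult `can_remove_old()` as a gate before revoking the old key or certificate, so that a partially rolled fleet never loses access.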
|
|
||||||||
| #### 2. Use Certificate Filters to restrict access when using shared CAs (e.g., `dev` vs `prod`): | ||||||||
| ### Use Certificate Filters to restrict access when using shared CAs (e.g., `dev` vs `prod`) | ||||||||
|
|
||||||||
| Certificate Filters provide an additional check on the client certificate presented during client authentication. One optional convention is to give each certificate a common name that matches the namespace. | ||||||||
|
|
||||||||
| If you do this while using the same CA for dev and prod environments, you can use Certificate Filters to prevent access to production. | ||||||||
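
To make the idea concrete, here is a simplified model of filter matching on the certificate's common name. It is a sketch under stated assumptions, not Temporal Cloud's actual matching logic: the `common_name` key, both function names, the example names `payments-prod` and `payments-dev`, and the "no filters means any certificate from the trusted CA is accepted" behavior are all assumptions made for illustration.

```python
def matches_filter(cert_subject: dict[str, str], cert_filter: dict[str, str]) -> bool:
    """Return True when every field named in the filter equals the certificate's value."""
    return all(cert_subject.get(key) == value for key, value in cert_filter.items())


def namespace_allows(cert_subject: dict[str, str], filters: list[dict[str, str]]) -> bool:
    """Model of a Namespace's filter check for a certificate already trusted via the CA."""
    # Assumption: with no filters configured, any certificate signed by the CA passes.
    if not filters:
        return True
    # Assumption: otherwise the certificate must match at least one filter.
    return any(matches_filter(cert_subject, f) for f in filters)


# Hypothetical prod Namespace that only accepts certificates named after it.
prod_filters = [{"common_name": "payments-prod"}]
print(namespace_allows({"common_name": "payments-prod"}, prod_filters))
print(namespace_allows({"common_name": "payments-dev"}, prod_filters))
```

In this model, a dev certificate signed by the shared CA is rejected by the prod Namespace because its common name does not match the filter.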
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -35,6 +35,17 @@ SDK metrics monitor individual workers and your code's behavior. | |
| Cloud metrics monitor Temporal behavior. | ||
| When used together, Temporal Cloud and SDK metrics measure the health and performance of your full Temporal infrastructure, including the Temporal Cloud Service and user-supplied Temporal Workers. | ||
|
|
||
| Use the following rule of thumb when deciding which signal to rely on: | ||
|
Contributor
Author
@dustin-temporal please review |
||
|
|
||
| | Question | Primary signal | | ||
|
Contributor
This reads like it's for an LLM to reference, but if that's what we're going for then I'm good with it. |
||
| |---|---| | ||
| | Is Temporal Cloud accepting and serving work normally? | Cloud metrics | | ||
| | Are Tasks backing up in a Task Queue? | Cloud metrics plus SDK Schedule-To-Start metrics | | ||
| | Are my Workers saturated, under-provisioned, or misconfigured? | SDK metrics | | ||
| | Is my application logic, downstream dependency, or Activity behavior unhealthy? | SDK metrics and traces | | ||
|
|
||
| For a Worker-focused view of how to combine these signals, see [Monitor worker health](/cloud/worker-health). | ||
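
If you route alerts or runbook links by signal source, the table above can be encoded directly. The following is a minimal sketch: the `Concern` enum, the `PRIMARY_SIGNALS` mapping, and the signal labels are hypothetical names chosen for illustration, not metric names from Temporal Cloud or the SDKs.

```python
from enum import Enum


class Concern(Enum):
    SERVICE_HEALTH = "Is Temporal Cloud accepting and serving work normally?"
    TASK_BACKLOG = "Are Tasks backing up in a Task Queue?"
    WORKER_CAPACITY = "Are my Workers saturated, under-provisioned, or misconfigured?"
    APPLICATION_HEALTH = (
        "Is my application logic, downstream dependency, or Activity behavior unhealthy?"
    )


# Mirrors the rule-of-thumb table: each concern maps to the signals to consult first.
PRIMARY_SIGNALS: dict[Concern, tuple[str, ...]] = {
    Concern.SERVICE_HEALTH: ("cloud_metrics",),
    Concern.TASK_BACKLOG: ("cloud_metrics", "sdk_schedule_to_start"),
    Concern.WORKER_CAPACITY: ("sdk_metrics",),
    Concern.APPLICATION_HEALTH: ("sdk_metrics", "traces"),
}


def signals_for(concern: Concern) -> tuple[str, ...]:
    """Return the signal sources to check first for a given concern."""
    return PRIMARY_SIGNALS[concern]
```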
|
|
||
| Cloud Metrics for all Namespaces in your account are available from two sources: | ||
|
|
||
| - [OpenMetrics Endpoint](/cloud/metrics/openmetrics) - A Prometheus-compatible scrapable endpoint. | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -39,6 +39,15 @@ Temporal Cloud OpenMetrics support is available in [Public Preview](/evaluate/d | |||||
|
|
||||||
| Datadog provides a serverless integration with the OpenMetrics endpoint. This integration will scrape metrics, store them in Datadog, and provide a default dashboard with some built-in monitors. See the [integration page](https://docs.datadoghq.com/integrations/temporal-cloud-openmetrics/) for more details. | ||||||
|
|
||||||
| For Datadog users, treat this integration as the Cloud-side half of your observability setup: | ||||||
|
Contributor
Author
@dustin-temporal please review |
||||||
|
|
||||||
| - Use OpenMetrics in Datadog to monitor Temporal Cloud behavior such as Task Queue backlog, poll success, and rate limiting. | ||||||
| - Use SDK metrics from your Workers to monitor saturation, Schedule-To-Start latency, slot availability, and sticky cache behavior. | ||||||
|
|
||||||
| - Use tracing separately when you need execution-path debugging through your application and Activity code. | ||||||
|
Contributor
We don't have a good Datadog tracing integration, so I think this is misleading
Contributor
Datadog supports ingesting OTLP traces directly into its backend, but the feature is still in private preview (private documentation: https://docs.datadoghq.com/opentelemetry/setup/otlp_ingest/traces/?tab=javascript). We can have a quick chat with them to get Temporal "whitelisted" for trace ingestion; they just need to add an HTTP header for Temporal, which is a very quick thing. Here we can just say: contact the Datadog OpenTelemetry team for instructions on ingesting traces into Datadog. (The trace endpoint has been available for more than a year; the blocker for Datadog announcing public availability is pricing. I heard they are looking to announce Datadog as an OTel-native backend at DASH this year, and this will very likely be part of that announcement.) |
||||||
|
|
||||||
| If you only ingest Cloud metrics, you will miss many worker-side bottlenecks. For recommended Worker monitors, see | ||||||
| [Monitor worker health](/cloud/worker-health). | ||||||
|
|
||||||
| ### Grafana Cloud | ||||||
|
|
||||||
| Grafana provides a serverless integration with the OpenMetrics endpoint for Grafana Cloud. This integration will scrape metrics, store them in Grafana Cloud, and provide a default dashboard | ||||||
|
|
||||||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -39,6 +39,10 @@ This page is a guide to monitoring a Temporal Worker fleet and covers the follow | |||||
| - [How to detect misconfigured Workers](#detect-misconfigured-workers) | ||||||
| - [How to configure Sticky cache](#configure-sticky-cache) | ||||||
|
|
||||||
| This page assumes you are monitoring both Worker-side SDK metrics and Cloud-side metrics. Use SDK metrics to understand | ||||||
|
Contributor
Author
@dustin-temporal please review |
||||||
| what your Workers are doing, and Cloud metrics to understand what Temporal Cloud is seeing at the Task Queue and service | ||||||
| level. For an overview of how these signals fit together, see [Temporal Cloud observability and metrics](/cloud/metrics). | ||||||
|
|
||||||
|
|
||||||
| ## Minimal Observations {#minimal-observations} | ||||||
|
|
||||||
| Configure and understand these alerts first to gain insight into your application's health and behavior. | ||||||
|
|
||||||