Commit 35b8e68

Fix demo readme and grafana dashboard for autoscaling demo (#251)
## What was changed
Fix demo readme and grafana dashboard for autoscaling demo

## Why?
So that autoscaling demo grafana works out of the box

## Checklist
1. Closes
2. How was this tested:
3. Any docs updates needed?
1 parent 1123b6b commit 35b8e68

6 files changed

Lines changed: 24 additions & 21 deletions

api/v1alpha1/workerresourcetemplate_types.go

Lines changed: 2 additions & 2 deletions
@@ -34,10 +34,10 @@ type WorkerResourceTemplateSpec struct {
 // PodDisruptionBudgets and other resources that select pods.
 //
 // spec.metrics[*].external.metric.selector.matchLabels: {} (or with user labels)
-// The controller appends temporal_temporal_worker_deployment_name, temporal_worker_build_id, and
+// The controller appends temporal_worker_deployment_name, temporal_worker_build_id, and
 // temporal_namespace to any External metric selector where matchLabels is present.
 // User labels (e.g. task_type: "Activity") coexist alongside the injected keys.
-// Do not set temporal_temporal_worker_deployment_name, temporal_worker_build_id, or
+// Do not set temporal_worker_deployment_name, temporal_worker_build_id, or
 // temporal_namespace manually — the webhook will reject them.
 // +kubebuilder:validation:Required
 // +kubebuilder:pruning:PreserveUnknownFields

internal/demo/README.md

Lines changed: 6 additions & 5 deletions
@@ -263,11 +263,12 @@ Stop the load generator (`Ctrl-C`) and watch the HPA scale back down as in-fligh
 `approximate_backlog_count` measures tasks queued in Temporal but not yet started on a worker. Adding it as a second HPA metric means the HPA scales up on *arriving* work even before slots are full — important for bursty traffic.

 > **Note:** Temporal Cloud emits `temporal_approximate_backlog_count` with a combined
-> `version="namespace/twd-name:build-id"` label that contains characters invalid in
-> Kubernetes label values (`/` and `:`). The recording rule in
-> `prometheus-stack-values.yaml` uses `label_replace` to extract `twd_name` and
-> `build_id` as separate k8s-compatible labels, producing `temporal_backlog_count_by_version`.
-> The HPA then selects on those labels — the same pair used by Phase 1.
+> `worker_version="<worker-deployment-name>_<build-id>"` label that easily exceeds Kubernetes max label
+> length of 63 characters. The recording rule in `prometheus-stack-values.yaml` uses `label_replace`
+> to extract `temporal_worker_deployment_name` and `temporal_worker_build_id` as separate k8s-compatible
+> labels, producing `temporal_backlog_count_by_version`. The HPA then selects on those labels — the same
+> pair used by Phase 1. Temporal Cloud is in the process of rolling out the new separate labels, so this
+> workaround is required until then.

 **Step 1 — Create the Temporal Cloud credentials secret.**

internal/demo/k8s/grafana-dashboard.json

Lines changed: 4 additions & 4 deletions
@@ -95,8 +95,8 @@
 },
 "targets": [
 {
-"expr": "temporal_slot_utilization",
-"legendFormat": "{{temporal_worker_deployment_name}} / {{temporal_worker_build_id}}",
+"expr": "temporal_slot_utilization{temporal_worker_deployment_name=\"default_helloworld\"}",
+"legendFormat": "{{worker_type}} - {{temporal_worker_deployment_name}} / {{temporal_worker_build_id}}",
 "refId": "A"
 }
 ],
@@ -127,7 +127,7 @@
 },
 "targets": [
 {
-"expr": "temporal_backlog_count_by_version{task_type=\"Workflow\"}",
+"expr": "temporal_backlog_count_by_version{task_type=\"Workflow\", temporal_worker_deployment_name=\"default_helloworld\"}",
 "legendFormat": "{{temporal_worker_deployment_name}} / {{temporal_worker_build_id}}",
 "refId": "A"
 }
@@ -152,7 +152,7 @@
 },
 "targets": [
 {
-"expr": "temporal_backlog_count_by_version{task_type=\"Activity\"}",
+"expr": "temporal_backlog_count_by_version{task_type=\"Activity\", temporal_worker_deployment_name=\"default_helloworld\"}",
 "legendFormat": "{{temporal_worker_deployment_name}} / {{temporal_worker_build_id}}",
 "refId": "A"
 }

internal/demo/k8s/prometheus-adapter-values.yaml

Lines changed: 7 additions & 6 deletions
@@ -21,21 +21,22 @@ prometheus:
 rules:
 external:
 # Phase 1: slot utilization per worker version.
-# HPA selector: metric.name=temporal_slot_utilization, matchLabels: twd_name + build_id.
-# The worker emits twd_name and build_id as separate Prometheus labels (both valid
-# Kubernetes label values), and the recording rule in prometheus-stack-values.yaml
-# aggregates them into temporal_slot_utilization.
+# HPA selector: metric.name=temporal_slot_utilization,
+# matchLabels: temporal_worker_deployment_name + temporal_worker_build_id + temporal_namespace.
+# The worker emits those as separate Prometheus labels (all valid Kubernetes label values), and
+# the recording rule in prometheus-stack-values.yaml aggregates them into temporal_slot_utilization.
 - seriesQuery: 'temporal_slot_utilization{}'
 metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>})'
 name:
 as: "temporal_slot_utilization"
 resources:
-namespaced: false # cluster-scoped: HPAs in any namespace can consume this metric
+namespaced: false # cluster-scoped: HPAs in any k8s namespace can consume this metric

 # Phase 2: approximate backlog count per worker version (from Temporal Cloud).
 # Uses the temporal_backlog_count_by_version recording rule.
 # cluster-scoped so HPAs in any namespace can consume it; temporal_worker_deployment_name
-# + build_id matchLabels in the HPA are sufficient to select the right series.
+# + temporal_worker_build_id + temporal_namespace matchLabels in the HPA are sufficient to
+# select the right series.
 - seriesQuery: 'temporal_backlog_count_by_version{}'
 metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})'
 name:

internal/demo/k8s/prometheus-stack-values.yaml

Lines changed: 4 additions & 3 deletions
@@ -61,8 +61,8 @@ additionalPrometheusRulesMap:
 # Slot utilization ratio per worker version, filtered to activity workers only.
 # Range: 0.0 (idle) to 1.0 (fully saturated).
 #
-# The worker emits twd_name and build_id as separate labels (both valid
-# Kubernetes label values), so no label_replace is needed here.
+# The worker emits temporal_worker_deployment_name, temporal_worker_build_id, temporal_namespace, and worker_type
+# as separate labels (both valid Kubernetes label values), so no label_replace is needed here.
 - record: temporal_slot_utilization
 expr: |
 sum by (temporal_worker_deployment_name, temporal_worker_build_id, temporal_namespace, worker_type) (
@@ -86,7 +86,8 @@ additionalPrometheusRulesMap:
 # Backlog count per worker version, shaped to match the label format that
 # Temporal Cloud will emit natively in a future release. This recording rule
 # is a temporary shim: once Temporal Cloud emits temporal_worker_deployment_name and
-# build_id as separate labels, this rule can be deleted with no other changes.
+# temporal_worker_build_id as separate labels, this rule can be deleted with no
+# other changes. Note: this rule only works with Build IDs that don't have underscores.
 #
 # Current Temporal Cloud label:
 # worker_version="{k8s-namespace}_{twd-name}_{build-id}"
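The underscore caveat in the comment above can be illustrated with a small Go sketch. The recording rule's actual `label_replace` regex is not shown in this diff, so the pattern below is an assumption: it splits the combined `worker_version` value at its last underscore, which is consistent with the `default_helloworld` deployment name used in the dashboard queries. The function name `splitWorkerVersion` is hypothetical. Go's `regexp` package and Prometheus both use RE2 syntax, so the pattern should behave the same way in either place.

```go
package main

import (
	"fmt"
	"regexp"
)

// workerVersionRE is an assumed pattern, not the one from the recording rule.
// The greedy first group takes everything up to the last underscore; the
// second group takes the underscore-free remainder as the build ID.
var workerVersionRE = regexp.MustCompile(`^(.*)_([^_]+)$`)

// splitWorkerVersion is a hypothetical helper that splits a combined
// worker_version value into a deployment name and a build ID.
func splitWorkerVersion(v string) (deployment, buildID string, ok bool) {
	m := workerVersionRE.FindStringSubmatch(v)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	// Build ID without underscores: the split is as intended.
	d, b, _ := splitWorkerVersion("default_helloworld_v1")
	fmt.Printf("%s | %s\n", d, b) // default_helloworld | v1

	// Build ID "v1_2" contains an underscore: part of it leaks into the
	// deployment name, which is the limitation the comment warns about.
	d, b, _ = splitWorkerVersion("default_helloworld_v1_2")
	fmt.Printf("%s | %s\n", d, b) // default_helloworld_v1 | 2
}
```

Under this assumption, any split on a shared delimiter is ambiguous when the build ID itself contains that delimiter, which is why the shim is only needed until the labels are emitted separately.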

internal/demo/util/observability.go

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ func configureObservability(deploymentName, buildID, temporalNamespace string, m
 m = opentelemetry.NewMetricsHandler(opentelemetry.MetricsHandlerOptions{
 Meter: metric.NewMeterProvider(metric.WithReader(exporter)).Meter("worker"),
 InitialAttributes: attribute.NewSet(
-attribute.String("temporal_temporal_worker_deployment_name", deploymentNameCleanForLabel),
+attribute.String("temporal_worker_deployment_name", deploymentNameCleanForLabel),
 attribute.String("temporal_worker_build_id", buildID),
 attribute.String("temporal_namespace", temporalNamespace),
 ),
