Commit f260801

ci(deploy): serialize deploys against the shared Tokyo stack
Today an OtelCollector deploy on chore/e2e-required-merge-gate failed with "ECS rolled back. PRIMARY=...:7 expected ...:6". Two parallel Deploy App Services runs (push + dispatch on the same SHA) both registered task definitions, and one's UpdateService rolled the other back.

Add per-workflow concurrency groups against the shared Tokyo stack to prevent the race. cancel-in-progress: false because a half-applied ECS rolling update / EC2 userdata swap is worse than waiting.

- deploy-api.yml: group=deploy-api-shared
- deploy-app-services.yml: group=deploy-app-services-shared
- deploy-runner.yml: group=deploy-runner-shared
- _deploy-single-service.yml: group=deploy-single-service-<service_name> (parameterized so different services parallelize but same-service callers serialize; belt-and-suspenders against the outer group)
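The race described above needs only two triggers that can fire for the same SHA. A minimal sketch of the fix, assuming a workflow with both `push` and `workflow_dispatch` triggers (the workflow name, branch filter, and job body are hypothetical placeholders, not the repository's actual deploy-app-services.yml):

```yaml
# Hypothetical minimal workflow; name, branch filter, and steps are
# placeholders for illustration only.
name: Deploy App Services (sketch)

on:
  push:
    branches: [main]
  workflow_dispatch:

# One fixed group for every run of this workflow, regardless of trigger
# or ref, so a push run and a dispatch run on the same SHA cannot overlap.
concurrency:
  group: deploy-app-services-shared
  cancel-in-progress: false   # let the in-flight deploy finish; later runs wait

permissions:
  contents: read
  id-token: write

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Register task definition and update service
        run: echo "deploy steps go here"
```

Because the group name is a constant rather than something derived from the ref, runs from any branch serialize against each other, which matches a single shared stack. Note that GitHub Actions keeps at most one pending run per group: if a third run is queued while one is running and one is pending, the older pending run is canceled even with cancel-in-progress: false.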
1 parent c22216d commit f260801

4 files changed

Lines changed: 36 additions & 0 deletions

.github/workflows/_deploy-single-service.yml

Lines changed: 8 additions & 0 deletions
@@ -29,6 +29,14 @@ on:
         type: string
         required: true
 
+# Belt-and-suspenders against the outer-workflow concurrency: also
+# serialize per-service here, so any caller (deploy-app-services or
+# a future direct dispatch) racing on the same ECS service queues
+# instead of fighting over TD revisions. Different services parallelize.
+concurrency:
+  group: deploy-single-service-${{ inputs.service_name }}
+  cancel-in-progress: false
+
 permissions:
   contents: read
   id-token: write
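To see how the parameterized group resolves, here is a hedged sketch of a caller job. The job id, the matrix, and the exact service name strings are assumptions for illustration; the reusable workflow may also require inputs beyond service_name that are not shown here.

```yaml
# Hypothetical caller job; the job id and matrix values are illustrative.
jobs:
  deploy:
    strategy:
      fail-fast: false
      matrix:
        service_name: [Proxy, OtelCollector, SshGateway]
    uses: ./.github/workflows/_deploy-single-service.yml
    with:
      service_name: ${{ matrix.service_name }}
    # Groups resolved inside the called workflow, one per matrix entry:
    #   deploy-single-service-Proxy
    #   deploy-single-service-OtelCollector
    #   deploy-single-service-SshGateway
```

Each matrix entry lands in its own group, so the three services can deploy in parallel, while a second caller targeting the same service waits for the first to finish.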

.github/workflows/deploy-api.yml

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -64,6 +64,15 @@ on:
6464
- 'apps/api-client-go/**'
6565
- '.github/workflows/deploy-api.yml'
6666

67+
# Serialize against the shared Tokyo ECS service. Concurrent Api
68+
# deploys race for task-definition revision numbers and one's
69+
# UpdateService rolls back the other (see #724 OtelCollector race).
70+
# cancel-in-progress: false because a half-applied ECS rolling update
71+
# is worse than waiting.
72+
concurrency:
73+
group: deploy-api-shared
74+
cancel-in-progress: false
75+
6776
permissions:
6877
contents: read
6978

.github/workflows/deploy-app-services.yml

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -40,6 +40,16 @@ on:
4040
- 'apps/ssh-gateway/**'
4141
- '.github/workflows/deploy-app-services.yml'
4242

43+
# Serialize against the shared Tokyo ECS services. Concurrent
44+
# Proxy/OtelCollector/SshGateway deploys race for task-definition
45+
# revisions — observed live: two parallel runs both registered TDs,
46+
# one's UpdateService rolled the other back with "ECS rolled back.
47+
# PRIMARY=...:7 expected ...:6". cancel-in-progress: false because a
48+
# half-applied rolling update is worse than waiting.
49+
concurrency:
50+
group: deploy-app-services-shared
51+
cancel-in-progress: false
52+
4353
permissions:
4454
contents: read
4555

.github/workflows/deploy-runner.yml

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -67,6 +67,15 @@ on:
6767
- '.github/workflows/build-c.yml'
6868
- '.github/workflows/build-runner-binary.yml'
6969

70+
# Serialize against the shared Tokyo runner binary in S3 + EC2 user
71+
# data. Concurrent runner deploys race on the same artifact path and
72+
# the same userdata-poll loop on the EC2 host. cancel-in-progress:
73+
# false because killing a deploy mid-rollout leaves the EC2 host on
74+
# an unknown binary.
75+
concurrency:
76+
group: deploy-runner-shared
77+
cancel-in-progress: false
78+
7079
permissions:
7180
contents: read
7281
