Skip to content

Create SolidQueue worker ECS task#1592

Merged
sarahseewhy merged 5 commits into
mainfrom
create-solidqueue-worker-ecs-task
Jun 6, 2025
Merged

Create SolidQueue worker ECS task#1592
sarahseewhy merged 5 commits into
mainfrom
create-solidqueue-worker-ecs-task

Conversation

@sarahseewhy
Copy link
Copy Markdown
Contributor

@sarahseewhy sarahseewhy commented Jun 2, 2025

What problem does this pull request solve?

Trello card: https://trello.com/c/f7js82Cm/635-run-solid-queue-as-its-own-ecs-task

Depends on this PR being merged first: #1591

This pull request creates new ECS task, service, and relevant configuration to run Solid Queue as a separate worker task in our ECS cluster. It contains the same ENV variables as the forms-runner web app (including a specific worker Sentry DSN).

Where possible I reused code, for example the task_container_definition, with a few overrides (similar to the mailchimp-sync approach).

The new ECS task should have egress access to the VPC, RDS, and the internet, but no ingress access (which happens by default unless otherwise specified in Terraform). The new ECS task should also exist in the same VPC and private subnet group as the other forms-runner tasks/services.

I used the same security group rules configured in modules/ecs-service/security-groups.tf since they should match the existing forms-runner ECS task with the exception of the restricted ingress access.

I've added a HealthCheck which depends on a PR in forms-runner (some Rails code to create the healthcheck file, see here).

I can also confirm that logs from the new ECS task are shipped to Splunk, use query index="gds_dsp_dev_forms" log_stream="forms-runner-dev-queue-worker*" to view.

I've tested the changes in dev (and locally) and have attached some screen shots:

Screenshot 2025-05-07 at 16 50 12 Screenshot 2025-05-07 at 16 44 59

Note on autoscaling

After conversations with Andy and Sean, we agreed to hold off on autoscaling for now. We want to see how the worker task handles existing load and we can adjust as part of a follow up.

Happy to talk through this decision and would love to hear thoughts on the desired_count = 3.

For context, the queue length hasn't exceeded 1 in production in the last three months.

Checklist

  • ECS task that runs bin/jobs
  • egress to VPC, RDS, and internet; no ingress
  • Healthcheck enabled
  • Logs go to Splunk
  • Configured to alert Sentry

Things to consider when reviewing

  • Naming. I'm always looking for improved naming.
  • File and resource location.
  • The network configuration is as expected and matches what's described in the Trello card.

cadmiumcat
cadmiumcat previously approved these changes Jun 3, 2025
Copy link
Copy Markdown
Contributor

@cadmiumcat cadmiumcat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙌 Super clear PR
And it looks sensible. I agree that we should start with a count of 3 for now and see how we go/enable autoscaling when we're ready.

My only (non-blocking) question was whether we'd want to set the desired count to a different number in each environment?

@sarahseewhy sarahseewhy marked this pull request as ready for review June 3, 2025 13:24
@sarahseewhy
Copy link
Copy Markdown
Contributor Author

🙌 Super clear PR And it looks sensible. I agree that we should start with a count of 3 for now and see how we go/enable autoscaling when we're ready.

My only (non-blocking) question was whether we'd want to set the desired count to a different number in each environment?

I think this is a good question and a good suggestion. I've just added a commit to make the desired_count match the minimum_capacity we already set for each environment.

@sarahseewhy sarahseewhy force-pushed the create-solidqueue-worker-ecs-task branch from 1e1c42f to 9ce2839 Compare June 3, 2025 15:21
cadmiumcat
cadmiumcat previously approved these changes Jun 3, 2025
Copy link
Copy Markdown
Contributor

@cadmiumcat cadmiumcat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥳

* Create ECS service, ECS task definition, security group, and relevant security group rules. I used a pattern I noticed in mailchimp sync where I took the exported ECS task container definition and override some parts of it
* Provide the worker task access to a subset of secrets (I checked with the devs and it will only need access to these secrets, the worker won't need access to NOTIFY_API_KEY or SUBMISSION_STATUS_API)
* Create a specific IAM task execution role for the queue worker with permissions to read relevant Parameter Store values (again, I don't think the queue worker needs access to all the secrets forms-runner has access to).
* This code somewhat duplicates what's in `ecs-service/iam.tf` and the policies declared in `forms-runner/main.tf` but I think the duplication is simpler than forcing the ECS task into the ecs-service module format.
* The worker will use forms-runner task role (`module.ecs_service.task_definition.task_role_arn`) to ensure it has access to necessary forms-runner related resources (I believe this is required because the worker errors if I use a custom role is since it requires access to the submission_email_ses_bounces_and_complaints SQS queue).
* Make the `desired_count` have a minimum capacity for each environment (for production this is 6, for all other environments it's 1).
* The capacity is set relatively low for environments (except production) because so far the demand on the queue hasn't been high (queue length rarely exceeds 1) -- Cat raised the good point that we don't want to over-provision
@sarahseewhy sarahseewhy force-pushed the create-solidqueue-worker-ecs-task branch from 9ce2839 to 526b2d8 Compare June 4, 2025 13:37
* Add SSM parameter created from PR #1591
* I rebased PR #1591 on to this branch in the GitHub UI (not a good idea), and it absorbed the queue_worker.tf file from PR #1591 as a separate file instead of merging the files together.
* This happened because I didn't take into account the lack of shared file history from git's perspective.
* The queue_worker.tf on the `create-solidqueue-worker-ecs-task` branch was never branched off from the queue_worker.tf in main (i.e., it was created independently rather than modified from the original). Therefore, Git doesn’t know they’re related. It treats them as two different files that just happen to have the same name.
@sarahseewhy sarahseewhy merged commit 4cb1165 into main Jun 6, 2025
3 checks passed
@sarahseewhy sarahseewhy deleted the create-solidqueue-worker-ecs-task branch June 6, 2025 09:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants