-
Notifications
You must be signed in to change notification settings - Fork 5
feat(auto-cancel): event driven migration #3281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
d05c501 to
01e2bf6
Compare
graham228221
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good I think! Just a couple of minor comments/questions.
It is possible to use @johnduffell's newer "SrCDK" constructs for this though, although not sure if that's only applicable for Typescript lambdas?
| Code: | ||
| S3Bucket: support-service-lambdas-dist | ||
| S3Key: !Sub membership/${Stage}/zuora-callout-apis/zuora-callout-apis.jar | ||
| Handler: com.gu.autoCancel.AutoCancelHandler::handleRequest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be deleting the AutoCancelHandler file, or is it useful to keep around for some purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved with the SrCDK migration - cfn.yaml is gone and AutoCancelHandler is replaced by AutoCancelSqsHandler.
great question and firstly I want to say amazing work picking this very long term problem up to fix! 👏 I would say convert CFN to basic GuCDK if doing any major work if at all possible, just because we get a fair bit of free devex stuff from it, and it is more maintainable. Regarding specifically the SrCDK queue driven lambda, I think it SrCDK would possibly need a few tweaks but certainly wouldn't be impossible as most things can be overridden in terms of lambda runtime and memory etc. In fact we should maybe have a SrScalaLambda construct in due course. Actually auto cancel was one of the use cases I had in mind for the api gateway sqs construct in conjunction with the sqs lambda construct. Happy to meet to talk through it and also do a bit of pairing, as I am sure there would be a few extensions needed to SrCDK stuff. Of course if this work is ready and keen to get it out, it's down to your judgemenet about how much extra migration work to do (in the extreme case you could consider to rewrite it freshly as typescript at the same time) |
handlers/zuora-callout-apis/cfn.yaml
Outdated
| Code: | ||
| S3Bucket: support-service-lambdas-dist | ||
| S3Key: !Sub membership/${Stage}/zuora-callout-apis/zuora-callout-apis.jar | ||
| Handler: com.gu.autoCancel.AutoCancelQueueWriter::handleRequest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can actually avoid the lambda completely by getting API gatway to write directly to the queue.
To make matters simpler, this incantation [1] replicates the API gatway event in the message payload, reducing the amount of refactoring needed to turn an api gatway lambda into an api gateway->sqs lambda
[1] https://github.com/guardian/support-service-lambdas/blob/main/cdk/lib/cdk/ApiGatewayToSqs.ts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @johnduffell , you're right! I initially thought we'd lose the authentication capability by going API Gateway → SQS directly, but I see now that ApiGatewayToSqs preserves
queryStringParameters in the message payload.
So I could:
- Use ApiGatewayToSqs to eliminate the AutoCancelQueueWriter Lambda entirely
- Move the Auth validation into AutoCancelSqsHandler (the SQS processor) by extracting apiToken from the message's queryStringParameters
The only trade-off is that invalid requests (bad auth) would be queued before being rejected, but they'd fail fast in the processor and go to DLQ. Given the simplicity
gain, this seems acceptable.
I will try todo the SrCDK migration, could help after i'v done it with the review ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @johnduffell , its done, could you check if i did everything in the right way please ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's a good point that all requests are accepted, as API gateway won't do any checking at all.
In actual fact, that's probably a good thing for this - if we reject it, zuora will just retry with the same invalid key several (3?) times. If we fix it quickly, it might be ok, but if we don't, we're better to preserve it in our DLQ and decide what to do with it/requeue in due course.
Thanks for doing the updates, I will take a look now I've got about 10 minutes.
01e2bf6 to
75cba52
Compare
…g intermediate QueueWriter lambda
0e58e26 to
7192424
Compare
| export GITHUB_RUN_NUMBER=$(( $GITHUB_RUN_NUMBER + $LAST_TEAMCITY_BUILD )) | ||
| sbt "project ${{ matrix.subproject }}" riffRaffUpload | ||
| zuora-callout-apis: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's a matrix up the top for CDK'ed scala lambdas
| gu-cdk-build: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The matrix assumes subproject name matches CDK template name. Here zuora-callout-apis builds zuora-auto-cancel-*.template.json and needs an extra S3 upload step, so the standard pattern doesn't fit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok fair point, might be worth adding a comment in the GHA to explain.
Extra info::
Since this is the original lambda, the auto cancel and one or two others (stripe card updated and something else) were combined together because I didn't know how to do split sbt builds at that time, and I think maybe there wasn't any cloudformation originally. So that is why it's not consistent.
Now it looks like it really is only the payment failure lambda as the others have been separated or retired, so if you fancy a separate health task to rename it to be consistent with the others, then it would tidy it up!
| new Policy(this, 'SQSSendToEmailQueue', { | ||
| statements: [ | ||
| new PolicyStatement({ | ||
| actions: ['sqs:SendMessage', 'sqs:GetQueueUrl'], | ||
| resources: [ | ||
| `arn:aws:sqs:${this.region}:${this.account}:comms-${this.stage}-EmailQueue`, | ||
| ], | ||
| }), | ||
| ], | ||
| }), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
normally I'd say use AllowSqsSendPolicy [1] but this queue has the stage in the middle of the queue name which is not supported. I assume it's an existing queue. If so I would say leave it as-is.
[1]
| export class AllowSqsSendPolicy extends GuAllowPolicy { |
cdk/lib/zuora-auto-cancel.ts
Outdated
| // Add maxConcurrency to the event source mapping to limit concurrent Zuora API calls | ||
| // SrSqsLambda creates an event source with batchSize: 1, we need to add ScalingConfig | ||
| lambda.node.findAll().forEach((child) => { | ||
| const cfnResource = child.node.defaultChild; | ||
| if (cfnResource instanceof CfnEventSourceMapping) { | ||
| cfnResource.scalingConfig = { maximumConcurrency: 5 }; | ||
| } | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what this is but I assume you were having trouble with it not processing the queue in order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for rate-limiting concurrent Lambda invocations to avoid Zuora 429 errors. SrSqsLambda doesn't expose maxConcurrency, hence the escape hatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it ok and I noted your other comment about doing a followup PR to expose it, that's a good idea and thanks!
| } | ||
| }); | ||
|
|
||
| // IAM Policies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // IAM Policies |
| # Upload JAR to S3 first (needed before CloudFormation can create the Lambda) | ||
| zuora-auto-cancel-upload: | ||
| type: aws-s3 | ||
| app: zuora-auto-cancel | ||
| parameters: | ||
| bucketSsmLookup: true | ||
| prefixStack: true | ||
| prefixApp: true | ||
| prefixStage: true | ||
| publicReadAcl: false | ||
| cacheControl: private |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is only needed on the first ever deployment, normally you can get around it by doing a partial deploy
https://github.com/guardian/support-service-lambdas/blob/main/handlers/HOWTO-create-lambda.md#important-deploying-a-lambda-with-riffraff-for-the-first-time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed because this creates a new stack (support-CODE-zuora-auto-cancel) - the old one was membership-CODE-zuora-auto-cancel. Will remove after PROD deployment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could but I would recommend leaving it out and following the steps in the doc above, just to save going back later for cleanup.
| cloudFormationStackByTags: false | ||
| prependStackToCloudFormationStackName: false | ||
| createStackIfAbsent: false | ||
| createStackIfAbsent: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wasn't aware of this option, presumably it defaults to true (so can be omitted?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deployment failed without it. Not sure if there's a default - seemed required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very odd as I haven't seen it elsewhere and the docs say it defaults to true
https://riffraff.gutools.co.uk/docs/magenta-lib/types#:~:text=%3CpackageName%3E-,createStackIfAbsent,true,-manageStackPolicy
Not a major issue to leave it in though!
| ); | ||
|
|
||
| // API Gateway -> SQS integration (replaces the AutoCancelQueueWriter lambda) | ||
| new ApiGatewayToSqs(this, 'ApiGatewayToSqs', { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the existing hooks in zuora are set up as something like https://sdmfghsjkdhf.execute.api-gateway.amazonaws.com/PROD/hook?key=ASDF, after this is deployed they can use the proper https://zuora-auto-cancel.membership.guardianapis.com/hook?key=ASDF which is easier to verify and find. Then the old stack can be dropped once it has no traffic.
handlers/zuora-callout-apis/src/main/scala/com/gu/autoCancel/AutoCancelSqsHandler.scala
Outdated
Show resolved
Hide resolved
|
|
||
| // Add maxConcurrency to the event source mapping to limit concurrent Zuora API calls | ||
| // SrSqsLambda creates an event source with batchSize: 1, we need to add ScalingConfig | ||
| // Rate-limit concurrent Zuora API calls to avoid 429 errors (escape hatch - SrSqsLambda doesn't expose maxConcurrency) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feel free to edit SrSqsLambda to expose it as an optional value. The code is just in this repo, and it seems like a common use case to support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea - I'll do that in a follow-up PR.
handlers/zuora-callout-apis/src/main/scala/com/gu/autoCancel/AutoCancelSqsHandler.scala
Outdated
Show resolved
Hide resolved
handlers/zuora-callout-apis/src/main/scala/com/gu/autoCancel/AutoCancelSqsHandler.scala
Outdated
Show resolved
Hide resolved
| @@ -0,0 +1,208 @@ | |||
| package com.gu.autoCancel | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just wondering - do we need/is it worth adding any tests for this file? I assume some of it is based on existing code from another handler, which may? not have tests either!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original handler had minimal tests. I can add some basic unit tests for the message parsing if you think it's worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be worth at least one, partly because it gives an example payload. I am happy to go either way on it though as you've made such a big improvement so far!
handlers/zuora-callout-apis/src/main/scala/com/gu/autoCancel/AutoCancelSqsHandler.scala
Outdated
Show resolved
Hide resolved
|
going back and forth with piecemeal suggestions would have been frustating on both sides, hopefully it's helpful for me to make some of my suggestions in a separate PR, see what you think? #3328 |
|
Thanks for putting this together! I merged your branch - the refactoring makes the code much cleaner, especially the separation of ProcessCalloutSteps and the ZuoraEmailSteps as a class. |
|
Just to add: I've deployed to CODE via Riff-Raff and everything looks good. |
What does this change?
Migrates the auto-cancel lambda from synchronous API Gateway processing to an event-driven SQS-based architecture. This addresses the Zuora 429 rate limit errors that occur
when thousands of callouts arrive simultaneously at invoice due dates.
The new architecture:
Why?
When invoices become due, Zuora triggers thousands of auto-cancel callouts simultaneously. The current synchronous architecture causes:
The SQS-based approach decouples ingestion from processing, allowing controlled throughput to Zuora.
How to test
Deployment
Standard Riff-Raff deployment. The CFN changes will create new resources before updating the API Gateway integration, so there should be no downtime.