#1578 Retry After Pinpoint SMS Temporary Failure #2244
Args:
    sms_status (SmsStatusRecord): The status record update
    event_timestamp (str | None, optional): Timestamp the Pinpoint event came in. Defaults to None.
    event_in_seconds (int, optional): How many seconds Twilio updates have retried. Defaults to 300
The ticket doesn't ask for anything to do with Twilio. Why are we handling Twilio events?
I had copied the existing sms_status_update method and left the default for that parameter as-is. It could be removed, and the call to _get_notification could then use a hard-coded default value instead.
I'd run this past Kyle.
(60, 6),  # 60 seconds +/- 6 seconds (10%)
(900, 90),  # 15 minutes +/- 90 seconds (10%)
(3600, 360),  # 1 hour +/- 360 seconds (10%)
(21600, 2160),  # 6 hours +/- 2160 seconds (10%)
Should these be constants?
Proposing an initial limit of two retry attempts, with delays of 60 seconds and 10 minutes. This keeps each delay within the 900-second limit that AWS SQS imposes when it is used as the Celery broker and tasks are scheduled with countdown or eta via apply_async. 1st retry: 60 seconds +/- 10%. 2nd retry: 10 minutes (600 seconds) +/- 10%. Will create a follow-up ticket to investigate longer delays.
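The 900-second ceiling can be checked with simple arithmetic. The sketch below is illustrative only: `SQS_MAX_DELAY_SECONDS`, `worst_case_delay`, and `fits_sqs_limit` are hypothetical names, and the (delay, jitter) tuples mirror the form used in the diff.

```python
# Hypothetical helper names; not taken from the PR's code.
SQS_MAX_DELAY_SECONDS = 900  # limit cited in the comment above


def worst_case_delay(delay_seconds: int, jitter_seconds: int) -> int:
    """Longest countdown a (delay, jitter) pair can produce."""
    return delay_seconds + jitter_seconds


def fits_sqs_limit(schedule: list[tuple[int, int]]) -> bool:
    """True when every retry in the schedule stays within the SQS limit."""
    return all(worst_case_delay(d, j) <= SQS_MAX_DELAY_SECONDS for d, j in schedule)


# Schedule originally in the diff: the 15 minute step alone can reach 990 s.
original_schedule = [(60, 6), (900, 90), (3600, 360), (21600, 2160)]
# Proposed schedule: the worst case is 600 + 60 = 660 s.
proposed_schedule = [(60, 6), (600, 60)]
```

Under these assumptions the proposed schedule passes the check and the original one does not.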
Description
issue #1578
SMS requests resulting in _SMS.FAILURE events with retriable status are retried
Temporary/Retriable failure is not reported to the client
_SMS.FAILURE STATUS_REASON_RETRYABLE
UNREACHABLE, UNKNOWN, CARRIER_UNREACHABLE, TTL_EXPIRED
Retry a maximum of 2 times within a 3 day window.
Retries are delayed using increasing delays with each attempt.
A +/- 10% jitter is applied to the delay.
1st retry: 60 seconds +/- 10%
2nd retry: 10 minutes (600 seconds) +/- 10%
This retry schedule should address transient errors as well as short term service disruptions or unavailability.
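As a rough illustration of the schedule described above, a jittered delay could be computed along these lines. This is a sketch under assumptions: `RETRY_DELAYS_SECONDS`, `JITTER_FRACTION`, and `get_retry_delay` are hypothetical names, not necessarily those used in this PR.

```python
import random
from typing import Optional

# Hypothetical constants; the values follow the schedule in this description.
RETRY_DELAYS_SECONDS = (60, 600)  # 1st retry, 2nd retry
JITTER_FRACTION = 0.10  # +/- 10%


def get_retry_delay(retry_count: int, rng: Optional[random.Random] = None) -> int:
    """Return the delay in seconds before the next retry.

    retry_count is the number of retries already attempted (0 for the first
    retry). Raises IndexError once the schedule is exhausted.
    """
    generator = rng if rng is not None else random.Random()
    base = RETRY_DELAYS_SECONDS[retry_count]
    jitter = int(base * JITTER_FRACTION)
    # randint is inclusive on both ends, so the result lies in [base - jitter, base + jitter].
    return base + generator.randint(-jitter, jitter)
```

With these values the first retry lands between 54 and 66 seconds and the second between 540 and 660 seconds.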
Retriable failures that exceed the retry count or the retry window are marked as permanent failures and reported to the client
NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE
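The retry-or-fail decision can be sketched as follows. The function names are hypothetical, the string constants are stand-ins for the real values in the codebase, and measuring the 3 day window from the notification's creation time is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Stand-in values; the real constants live in the application code.
NOTIFICATION_PERMANENT_FAILURE = "NOTIFICATION_PERMANENT_FAILURE"
STATUS_REASON_UNDELIVERABLE = "STATUS_REASON_UNDELIVERABLE"

MAX_RETRIES = 2
RETRY_WINDOW = timedelta(days=3)


def can_retry(retry_count: int, created_at: datetime, now: datetime) -> bool:
    """True while both the retry count and the retry window allow another attempt."""
    return retry_count < MAX_RETRIES and (now - created_at) < RETRY_WINDOW


def final_status_if_exhausted(
    retry_count: int, created_at: datetime, now: datetime
) -> Optional[Tuple[str, str]]:
    """Return the (status, status_reason) to report, or None if a retry should be scheduled."""
    if can_retry(retry_count, created_at, now):
        return None  # temporary failure: retry, nothing reported to the client
    return (NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE)


created = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(final_status_if_exhausted(0, created, created + timedelta(minutes=1)))
print(final_status_if_exhausted(2, created, created + timedelta(minutes=15)))
```

The first call prints `None` (a retry is still allowed); the second prints the permanent-failure pair because the retry count is exhausted.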
Cost computation (price_millicents) has been updated to sum successive retries and/or the final SUCCESS/FAILURE.
_SMS.BUFFERED is ignored for price updates since it should have a reported price_millicents according to AWS docs.
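A minimal sketch of the cost accumulation described above, assuming a hypothetical helper `accumulate_price_millicents` that is called once per Pinpoint event; the actual implementation in this PR may be structured differently.

```python
# Hypothetical helper; event type strings match those in this description.
IGNORED_FOR_PRICE = {"_SMS.BUFFERED"}


def accumulate_price_millicents(current_total: float, event_type: str, event_price: float) -> float:
    """Add an event's price to the running total, skipping ignored event types."""
    if event_type in IGNORED_FOR_PRICE:
        return current_total
    return current_total + event_price


# Example: one failed attempt followed by a successful retry (prices are made up).
total = 0.0
for event_type, price in [
    ("_SMS.BUFFERED", 0.0),
    ("_SMS.FAILURE", 645.0),
    ("_SMS.BUFFERED", 0.0),
    ("_SMS.SUCCESS", 645.0),
]:
    total = accumulate_price_millicents(total, event_type, price)
```

Here `total` ends up as the sum of the FAILURE and SUCCESS prices, with both BUFFERED events skipped.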
How Has This Been Tested?
Unit testing
tests/app/celery/test_process_pinpoint_receipt_tasks.py
==== 43 passed in 84.13s (0:01:24) ====
tests/app/celery/test_process_delivery_status_result_tasks.py
==== 86 passed in 32.81s ====
tests/app/notifications/test_process_notifications.py
==== 69 passed, 1 skipped in 137.49s (0:02:17) ====
Note:
tests/app/celery/test_process_pinpoint_receipt_tasks.py :: test_process_pinpoint_results_notification_final_status
These record statuses were previously reported as STATUS_REASON_RETRYABLE.
They now reflect the final status once no further retries can be attempted:
('_SMS.FAILURE', 'UNREACHABLE', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'UNKNOWN', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'CARRIER_UNREACHABLE', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'TTL_EXPIRED', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
Dev/QA Testing:
Expected logs through retry logic:
_SMS.FAILURE, UNREACHABLE -> STATUS_REASON_RETRYABLE
on STATUS_REASON_RETRYABLE:
3.a. '''Attempt retry %s logic | reference: %s | notification_id: %s | retry_delay: %s | retry_count: %s'''
3.b. '''Retry attempted %s logic | reference: %s | notification_id: %s | notification_status: %s | notification_status_reason: %s | sms_status: %s | sms_status_reason: %s | retry_count: %s'''
Checklist