#1578 Retry After Pinpoint SMS Temporary Failure #2244

Draft
wants to merge 21 commits into
base: main
from

Conversation


@ChrisJohnson-CDJ ChrisJohnson-CDJ commented Jan 15, 2025

Description

issue #1578

SMS requests that result in _SMS.FAILURE events with a retryable status are retried.
Temporary (retryable) failures are not reported to the client.

_SMS.FAILURE record statuses treated as STATUS_REASON_RETRYABLE:
UNREACHABLE, UNKNOWN, CARRIER_UNREACHABLE, TTL_EXPIRED

Retry a maximum of 2 times within a 3-day window.

Retries are delayed using increasing delays with each attempt.
A +/- 10% jitter is applied to the delay.

1st retry: 60 seconds +/- 10%
2nd retry: 10 minutes (600 seconds) +/- 10%

This retry schedule should address transient errors as well as short-term service disruptions or unavailability.
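
A minimal sketch of the delay computation described above, not the code in this PR. The names `RETRY_DELAYS_SECONDS`, `JITTER_FRACTION`, and `retry_delay_with_jitter` are hypothetical; only the 60 s / 600 s schedule and the +/- 10% jitter come from the description.

```python
import random
from typing import Optional

# Hypothetical constants; the values mirror the schedule in the description.
RETRY_DELAYS_SECONDS = (60, 600)  # 1st retry, 2nd retry
JITTER_FRACTION = 0.10            # +/- 10%


def retry_delay_with_jitter(retry_count: int, rng: Optional[random.Random] = None) -> float:
    """Return the delay in seconds before retry number `retry_count` (1-based)."""
    if not 1 <= retry_count <= len(RETRY_DELAYS_SECONDS):
        raise ValueError(f"retry_count must be between 1 and {len(RETRY_DELAYS_SECONDS)}")
    rng = rng if rng is not None else random.Random()
    base = RETRY_DELAYS_SECONDS[retry_count - 1]
    jitter = base * JITTER_FRACTION
    # Uniform jitter keeps the delay within base +/- 10%.
    return base + rng.uniform(-jitter, jitter)
```

Under these assumptions the first retry lands between 54 and 66 seconds and the second between 540 and 660 seconds.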

Retryable failures that exceed the retry count or retry window are marked as permanent failures and reported to the client as NOTIFICATION_PERMANENT_FAILURE with STATUS_REASON_UNDELIVERABLE.
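
The retry condition could look roughly like the sketch below. `can_retry_sms_request`, `is_retryable_count`, and `is_retryable_window` appear in the log messages listed under Dev/QA Testing, but the signature, the constants, and the exact comparison operators here are assumptions, not the implementation in this PR.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical constants taken from the limits in the description.
MAX_RETRIES = 2
RETRY_WINDOW = timedelta(days=3)


def can_retry_sms_request(
    retry_attempts: int,
    sent_at: datetime,
    now: Optional[datetime] = None,
) -> bool:
    """Return True while both the retry count and the retry window allow another attempt."""
    now = now if now is not None else datetime.now(timezone.utc)
    # Assumption: "a maximum of 2 times" means attempts 0 and 1 may still retry.
    is_retryable_count = retry_attempts < MAX_RETRIES
    # Assumption: the window is measured from the time the notification was sent.
    is_retryable_window = (now - sent_at) < RETRY_WINDOW
    return is_retryable_count and is_retryable_window
```

When this returns False, the failure would be treated as permanent and reported to the client.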

Cost computation (price_millicents) has been updated to sum successive retries and/or the final SUCCESS/FAILURE.
_SMS.BUFFERED is ignored for price updates since it should have a reported price_millicents according to AWS docs.
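
A sketch of the price summation as described, assuming each event is reduced to an `(event_type, price_millicents)` pair. The function name and the event tuples are hypothetical, and the prices in the usage note are made-up illustrative values.

```python
from typing import Iterable, Tuple


def total_price_millicents(events: Iterable[Tuple[str, int]]) -> int:
    """Sum price_millicents across attempts, skipping _SMS.BUFFERED events."""
    return sum(price for event_type, price in events if event_type != "_SMS.BUFFERED")
```

For example, one failed attempt followed by a successful retry, each priced at 500 millicents, would total 1000 millicents regardless of any _SMS.BUFFERED events in between.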

How Has This Been Tested?

Unit testing

tests/app/celery/test_process_pinpoint_receipt_tasks.py
==== 43 passed in 84.13s (0:01:24) ====

tests/app/celery/test_process_delivery_status_result_tasks.py
==== 86 passed in 32.81s ====

tests/app/notifications/test_process_notifications.py
==== 69 passed, 1 skipped in 137.49s (0:02:17) ====

Note:

tests/app/celery/test_process_pinpoint_receipt_tasks.py :: test_process_pinpoint_results_notification_final_status

These record statuses were previously reported as STATUS_REASON_RETRYABLE.
The test now expects the final status reported once no further retries can be attempted:

('_SMS.FAILURE', 'UNREACHABLE', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'UNKNOWN', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'CARRIER_UNREACHABLE', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'TTL_EXPIRED', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),

Dev/QA Testing:

Expected logs through retry logic:

_SMS.FAILURE, UNREACHABLE -> STATUS_REASON_RETRYABLE

on STATUS_REASON_RETRYABLE:

  1. '''Initial retry %s logic | reference: %s | notification_id: %s | notification_status: %s | notification_status_reason: %s | sms_status: %s | sms_status_reason: %s'''
  2. '''Retry SMS conditional | retry_attempts: %s | max_retries: %s | sent_at: %s | elapsed: %s | is_retryable_count %s | is_retryable_window %s'''
  3. if can_retry_sms_request
    3.a. '''Attempt retry %s logic | reference: %s | notification_id: %s | retry_delay: %s | retry_count: %s'''
    3.b. '''Retry attempted %s logic | reference: %s | notification_id: %s | notification_status: %s | notification_status_reason: %s | sms_status: %s | sms_status_reason: %s | retry_count: %s'''
  4. '''Final retry %s logic | reference: %s | notification_id: %s | notification_status: %s | notification_status_reason: %s | sms_status: %s | sms_status_reason: %s'''

Checklist

  • I have assigned myself to this PR
  • PR has an appropriate title: #9999 - What the thing does
  • PR has a detailed description, including links to specific documentation
  • I have added the appropriate labels to the PR.
  • I did not remove any parts of the template, such as checkboxes even if they are not used
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to any documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works. Testing guidelines
  • I have ensured the latest main is merged into my branch and all checks are green prior to review
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • The ticket was moved into the DEV test column when I began testing this change

Args:
sms_status (SmsStatusRecord): The status record update
event_timestamp (str | None, optional): Timestamp the Pinpoint event came in. Defaults to None.
event_in_seconds (int, optional): How many seconds Twilio updates have retried. Defaults to 300

The ticket doesn't ask for anything to do with Twilio. Why are we handling Twilio events?

Author

I had copied the existing sms_status_update method and left the default for that parameter as-is. It could be removed and then the call to _get_notification could use a hard coded default value instead.

I'd run this past Kyle.

Comment on lines 329 to 332
(60, 6), # 60 seconds +/- 6 seconds (10%)
(900, 90), # 15 minutes +/- 90 seconds (10%)
(3600, 360), # 1 hour +/- 360 seconds (10%)
(21600, 2160), # 6 hours +/- 2160 seconds (10%)

should these be constants?

Author

ChrisJohnson-CDJ commented Jan 16, 2025

Proposing an initial limit of two retry attempts, with delays of 60 seconds and 10 minutes. This is to work within the 900 s limit imposed by AWS SQS when SQS is the Celery broker and a task is scheduled with countdown or eta via apply_async.

1st retry: 60 seconds +/- 10%
2nd retry: 10 minutes (600 seconds) +/- 10%

Will create a follow-up ticket to investigate longer delays.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-queues.html
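
As a rough illustration of staying within that limit, the sketch below caps a computed delay before it is passed as `countdown`. `SQS_MAX_DELAY_SECONDS` and `countdown_for_sqs` are hypothetical names, and the task in the usage comment is a placeholder, not a task from this PR.

```python
# Hypothetical constant: the 900 s limit cited above.
SQS_MAX_DELAY_SECONDS = 900


def countdown_for_sqs(delay_seconds: float) -> int:
    """Round a retry delay to whole seconds and cap it at the SQS limit."""
    if delay_seconds < 0:
        raise ValueError("delay_seconds must not be negative")
    return min(int(round(delay_seconds)), SQS_MAX_DELAY_SECONDS)


# Usage with a Celery task (the task name is a placeholder):
# some_retry_task.apply_async(args=[notification_id], countdown=countdown_for_sqs(660.0))
```

With the proposed schedule, even the largest jittered second delay (660 seconds) stays under the cap, so the clamp only matters if longer delays are introduced later.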
