Skip to content

#1578 - Retry After Pinpoint SMS Temporary Failure #2244

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 58 commits into from
Feb 4, 2025

Conversation

ChrisJohnson-CDJ
Copy link

@ChrisJohnson-CDJ ChrisJohnson-CDJ commented Jan 15, 2025

Description

issue #1578

SMS requests resulting in _SMS.FAILURE events with retriable status are retried
Temporary/Retriable failure is not reported to the client

_SMS.FAILURE STATUS_REASON_RETRYABLE
UNREACHABLE, UNKNOWN, CARRIER_UNREACHABLE, TTL_EXPIRED

Retry a maximum of 2 times within a 3 day window.

Retries are delayed using increasing delays with each attempt.
A +/- 10% jitter is applied to the delay.

1st retry: 60 seconds +/- 10%
2nd retry: 10 minutes (600 seconds) +/- 10%

This retry schedule should address transient errors as well as short term service disruptions or unavailability.

Retriable failures that exceed the retry count or retry window are marked as permanent failures and reported to client
NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE

Cost computation (price_millicents) has been updated to sum successive retries and/or the final SUCCESS/FAILURE.
_SMS.BUFFERED is ignored for price updates since it should have a reported price_millicents according to AWS docs.

How Has This Been Tested?

Unit testing

tests/app/notifications/test_process_notifications.py
==== 73 passed, 1 skipped in 121.73s (0:02:01) ====

tests/app/dao/notification_dao/test_notification_dao.py
==== 216 passed, 4 skipped in 90.20s (0:01:30) ====

tests/app/celery/test_process_pinpoint_receipt_tasks.py
==== 70 passed in 20.96s ====

tests/app/celery/test_process_delivery_status_result_tasks.py
==== 117 passed in 28.59s ====

First retry
image

Second retry
image

Third retry -> Permanent Failure
image

Database updates for all attempts
image

Note:

tests/app/celery/test_process_pinpoint_receipt_tasks.py :: test_process_pinpoint_results_notification_final_status

These record status's were previously reported as STATUS_REASON_RETRYABLE
This now reflects the final status after inability to attempt further retries

('_SMS.FAILURE', 'UNREACHABLE', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'UNKNOWN', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'CARRIER_UNREACHABLE', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'TTL_EXPIRED', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),

Dev/QA Testing:

Initial Notification

  1. Log processing details:
    LOG 'Processing pinpoint result. | reference: %s | event_type: %s | record_status: %s | message_parts: %s | price_millicents: %s | provider_updated_at: %s'

  2. Log initial logic:
    LOG 'Initial %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s'

  3. Log final logic:
    LOG 'Final %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s | cost_in_millicents: %s'

  4. Verify notification status:
    notification.status == 'delivered'

  5. Extract notification_id and reference from the "Final %s logic" message.


First Retry Attempt

  1. Update notification status:
    Update notification by notification.id, set notification.status = 'sending'

  2. Create and send a Kinesis event payload:

    • Reference: initial notification reference
    • Event: _SMS.FAILURE, UNREACHABLE
  3. Log processing details:
    LOG 'Processing pinpoint result. | reference: %s | event_type: %s | record_status: %s | message_parts: %s | price_millicents: %s | provider_updated_at: %s'

  4. Log retry logic:
    LOG 'Entering %s retryable failure logic to process event | reference: %s | notification_id: %s | current_status: %s | current_status_reason: %s | event_sms_status: %s | event_sms_status_reason: %s'

  5. Log requeue attempt:
    LOG 'Notification updated prior to requeue attempt | notification_id: %s | notification_status: %s | cost_in_milicents %s'
    LOG 'Attempting %s requeue | notification_id: %s | retry_delay: %s seconds | retry_count: %s'

  6. Log requeue result:
    LOG 'Requeued notification for delayed %s delivery | notification_id: %s | retry_delay: %s seconds'

  7. Log SMS delivery:
    LOG 'Start sending SMS for notification id: %s'
    LOG 'Successfully sent sms for notification id: %s'
    LOG 'Saved provider reference: {reference} for notification id: {notification.id}'

  8. Verify notification is delivered after a 60-second delay.

  9. Log final logic:
    LOG 'Processing pinpoint result. | reference: %s | event_type: %s | record_status: %s | message_parts: %s | price_millicents: %s | provider_updated_at: %s'
    LOG 'Initial %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s'
    LOG 'Final %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s | cost_in_millicents: %s'

  10. Note the new reference (1st retry notification).


Second Retry Attempt

  1. Update notification status:
    Update notification by notification.id, set notification.status = 'sending'

  2. Create and send a Kinesis event payload:

    • Reference: 1st retry notification reference
    • Event: _SMS.FAILURE, UNREACHABLE
  3. Repeat logging steps from the first retry attempt.

  4. Verify notification is delivered after a 10-minute delay.

  5. Log final logic and note the new reference (2nd retry notification).


Third Retry Attempt (Retry Limit Exceeded)

  1. Update notification status:
    Update notification by notification.id, set notification.status = 'sending'

  2. Create and send a Kinesis event payload:

    • Reference: 2nd retry notification reference
    • Event: _SMS.FAILURE, UNREACHABLE
  3. Log processing details:
    LOG 'Processing pinpoint result. | reference: %s | event_type: %s | record_status: %s | message_parts: %s | price_millicents: %s | provider_updated_at: %s'

  4. Log retry failure logic:
    LOG 'Entering %s retryable failure logic to process event | reference: %s | notification_id: %s | current_status: %s | current_status_reason: %s | event_sms_status: %s | event_sms_status_reason: %s'

  5. Log final logic:
    LOG 'Final %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s | cost_in_millicents: %s'

  6. Verify notification is updated to PERMANENT_FAILURE, UNDELIVERABLE.

Checklist

  • I have assigned myself to this PR
  • PR has an appropriate title: #9999 - What the thing does
  • PR has a detailed description, including links to specific documentation
  • I have added the appropriate labels to the PR.
  • I did not remove any parts of the template, such as checkboxes even if they are not used
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to any documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works. Testing guidelines
  • I have ensured the latest main is merged into my branch and all checks are green prior to review
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • The ticket was moved into the DEV test column when I began testing this change

@ChrisJohnson-CDJ
Copy link
Author

ChrisJohnson-CDJ commented Jan 16, 2025

Proposing an initial limit on the retry attempts to two attempts with delays of 60 seconds and 10 minutes. This is to work within a 900s limit imposed by AWS SQS when using SQS as a broker for Celery and using countdown or eta with async_apply.

1st retry: 60 seconds +/- 10%
2nd retry: 10 minutes (600 seconds) +/- 10%

Will create a follow-up ticket to investigate longer delays.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-queues.html

@ChrisJohnson-CDJ
Copy link
Author

Test notes from initial qa/dev testing are pending but were successful on Perf deployment

statsd_client.incr(f'clients.sms.{sms_status.provider}.status_update.success')
except Exception:
current_app.logger.exception(
'Failed to check_and_queue_callback_task for notification: %s', notification.id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be either function? I have a feeling this was copy/pasted and the original is wrong.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the second callback function and success into nested try/catch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This slipped by your updates.

Comment on lines 276 to 281
current_app.logger.critical(
'SQS resource failed to queue message for sqs queue "%s". notification_id: %s | Exception: %s',
prefixed_queue_name,
notification.id,
e,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use exception and you don't need the e.

def test_update_sms_retry_count_redis_exception(mocker):
mocker.patch(
'app.celery.process_delivery_status_result_tasks.redis_store.set',
side_effect=Exception,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have been an exception from their docs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameterized to use common redis errors


def test_update_sms_retry_count_value_error(mocker):
mocker.patch('app.celery.process_delivery_status_result_tasks.redis_store.set')
mocker.patch('app.celery.process_delivery_status_result_tasks.redis_store.incr', return_value='not an integer')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with redis exceptions.

raise ValueError(
f"Expected an integer value for id '{notification_retry_id}', but got: {value} (type: {type(value)})"
)
except Exception:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the redis base exception if you are not going to be specific.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed this.

(3, 600),
],
)
def test_ut_get_sms_retry_delay_returns_within_delay_range(retry_count, expected_base_delay):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is a risk. The jitter may return values correct in one test and incorrect in another. We are also using a test-set jitter rather than what the app code uses.

Copy link
Author

@ChrisJohnson-CDJ ChrisJohnson-CDJ Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated test to mock randint to validate the just the base delay

)


def test_ut_sms_attempt_retry_queued_if_retryable(mocker, sample_notification):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably parameterize this with the expected values. e.g.

...parameterize("retry_count", [value for value in range(CARRIER_SMS_MAX_RETRIES)])

Or something to that effect.
image

sms_attempt_retry(sms_status)

updated_notification = get_notification_by_id(notification.id)
assert updated_notification.cost_in_millicents == 10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
instead of hardcoding 10 it should be notification.cost_in_millicents + sms_status.price_millicents

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

saved off the orig cost and added to sms_status price since notification is updated at that point

Comment on lines 673 to 674
status_reason=None,
cost_in_millicents=5,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
These two fields shouldn't be necessary. We don't want to put more than is necessary because it can complicate things and cause issues if things change in the future.

Same below with SmsStatusRecord and any fields that have defaults and are not necessary for this test.

sms_status = SmsStatusRecord(
None, notification.reference, NOTIFICATION_TEMPORARY_FAILURE, STATUS_REASON_RETRYABLE, PINPOINT_PROVIDER
)
mocker.patch('app.celery.process_delivery_status_result_tasks.update_sms_retry_count', side_effect=Exception)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That code can throw type, value, or a generic exception, all need to be tested (parametrize).

def test_ut_sms_attempt_retry_not_queued_if_exception_on_retry_count(mocker, sample_notification):
notification = sample_notification(
status=NOTIFICATION_SENDING,
status_reason=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment regarding unnecessary fields.

sample_notification,
current_status,
):
status_reason = None if (current_status == NOTIFICATION_DELIVERED) else 'Because I said so!'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status_reason should only be set in the case of permanent failure now, if I understand the flow correctly.

# MaxNumberOfMessages ranges from 1-10, want to ensure only one message was added to the queue
messages = sqs_client.receive_message(QueueUrl=q_url, MaxNumberOfMessages=10)
assert len(messages.get('Messages')) == 1
assert (monotonic() - start) < 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably want to use something like (time calc) < (delay_seconds + 1) and/or add a comment regarding why we're testing for 1.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, make the logic clearer on what we expect and why

while messages.get('Messages') is None:
messages = sqs_client.receive_message(QueueUrl=q_url, MaxNumberOfMessages=10)
# prevent infinite loop if nothing is added to the queue
if monotonic() - start > 2:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, use delay_seconds in the equation.

Comment on lines 1133 to 1135
# verify it took at least 1 sec to get message
delay = monotonic() - start
assert delay > 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention delay_seconds in the comment and/or modify the code to use delay_seconds.

nit:
delay can be calculated and asserted on the same line

assert task_body.get('task') == 'deliver_sms'

task_args = task_body.get('args')
assert task_args == [str(notification.id), str(None)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None should not be cast to a string. It should not even be necessary? I am fairly certain this would make the sms_sender_id come in as 'None' rather than None.

@MackHalliday MackHalliday self-requested a review January 31, 2025 17:56
# Our clients are not prepared to deal with pinpoint payloads
if not _get_include_payload_status(notification):
sms_status.payload = None

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added second callback task as nested try/catch to ensure accurate error logging

side_effect=exception,
)

with pytest.raises(Exception) as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we're looking for the specific exception here, but I won't block it for that.

Copy link
Member

@kalbfled kalbfled left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't insist that you completely revamp your testing strategy now. In the future, please focus on testing behavior (i.e. given this input, observe this output) rather than implementation (i.e. the function call raises an exception or calls this other function). Mock less.

@@ -538,6 +550,392 @@ def test_get_include_payload_status_exception(notify_api, mocker, sample_notific
assert not _get_include_payload_status(sample_notification())


@pytest.mark.parametrize('exception', [ConnectionError, RedisError, ResponseError, TimeoutError])
def test_update_sms_retry_count_redis_set_exception(mocker, exception):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a test of implementation. What is the desired behavior in response to the exception? What input might raise this exception?



@pytest.mark.parametrize('exception', [ConnectionError, RedisError, ResponseError, TimeoutError])
def test_update_sms_retry_count_redis_incr_exception(mocker, exception):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. What is the significance of this in terms of the observed behavior of the application?

try:
check_and_queue_va_profile_notification_status_callback(notification)
statsd_client.incr(f'clients.sms.{sms_status.provider}.status_update.success')
except Exception:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto specifics

try:
retry_count_redis_ttl = int(CARRIER_SMS_MAX_RETRY_WINDOW.total_seconds())
retry_count = update_sms_retry_count(notification_retry_id, ttl=retry_count_redis_ttl)
except Exception:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more specific

sms_sender_id=notification.sms_sender_id,
delay_seconds=retry_delay,
)
except Exception:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more specific to avoid hiding fixable problems

Copy link
Member

@kalbfled kalbfled left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not insisting now on catching more specific exceptions, but please do that in the future. Catching "Exception" can hide problems.

@EvanParish EvanParish merged commit 092fd37 into main Feb 4, 2025
13 checks passed
@EvanParish EvanParish deleted the 1578-retry-after-pinpoint-sms-temp-failure branch February 4, 2025 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants