#1578 - Retry After Pinpoint SMS Temporary Failure #2244

ChrisJohnson-CDJ · 2025-01-15T14:16:49Z

Description

SMS requests that result in _SMS.FAILURE events with a retryable status are retried.
A temporary (retryable) failure is not reported to the client.

_SMS.FAILURE STATUS_REASON_RETRYABLE
UNREACHABLE, UNKNOWN, CARRIER_UNREACHABLE, TTL_EXPIRED

Retry a maximum of 2 times within a 3 day window.

Retries are delayed using increasing delays with each attempt.
A +/- 10% jitter is applied to the delay.

1st retry: 60 seconds +/- 10%
2nd retry: 10 minutes (600 seconds) +/- 10%

This retry schedule should address transient errors as well as short term service disruptions or unavailability.
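As a rough illustration of the schedule above, the sketch below computes a jittered delay. The names sketch_retry_delay, RETRY_BASE_DELAYS and JITTER_PERCENT are hypothetical and not taken from this PR, and the 1-based retry numbering is an assumption.

```python
import random

# Base delays in seconds for the 1st and 2nd retry (from the schedule above).
RETRY_BASE_DELAYS = (60, 600)
JITTER_PERCENT = 10  # +/- 10%


def sketch_retry_delay(retry_number):
    """Return a jittered delay in seconds for a 1-based retry number (hypothetical helper)."""
    if not 1 <= retry_number <= len(RETRY_BASE_DELAYS):
        raise ValueError(f'retry_number must be between 1 and {len(RETRY_BASE_DELAYS)}')
    base = RETRY_BASE_DELAYS[retry_number - 1]
    jitter = base * JITTER_PERCENT // 100  # integer arithmetic avoids float rounding
    return base + random.randint(-jitter, jitter)
```

Under these assumptions the first retry lands between 54 and 66 seconds and the second between 540 and 660 seconds.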

Retryable failures that exceed the retry count or the retry window are marked as permanent failures and reported to the client:
NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE
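The count-and-window rule can be sketched as a single predicate. This is a minimal sketch, not the PR's implementation: sketch_can_retry, MAX_RETRIES and RETRY_WINDOW are hypothetical names, and tracking the first attempt as a timestamp is an assumption made for illustration.

```python
from datetime import datetime, timedelta, timezone

MAX_RETRIES = 2                   # maximum number of retries
RETRY_WINDOW = timedelta(days=3)  # retries must happen within this window


def sketch_can_retry(retries_so_far, first_attempt_at, now=None):
    """Return True while both the retry count and the retry window allow another attempt."""
    now = now or datetime.now(timezone.utc)
    within_count = retries_so_far < MAX_RETRIES
    within_window = now - first_attempt_at <= RETRY_WINDOW
    return within_count and within_window
```

When the predicate is False, the caller would record the permanent failure status instead of requeueing.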

Cost computation (price_millicents) has been updated to sum successive retries and/or the final SUCCESS/FAILURE.
_SMS.BUFFERED is ignored for price updates since it should have a reported price_millicents according to AWS docs.
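A minimal sketch of that summation, assuming each event is reduced to an (event_type, price_millicents) pair. The helper name and the prices are made up for illustration.

```python
def sketch_total_cost(events):
    """Sum price_millicents over (event_type, price_millicents) pairs, skipping _SMS.BUFFERED."""
    return sum(price for event_type, price in events if event_type != '_SMS.BUFFERED')


# One buffered event, one failed attempt, then a successful retry (made-up prices).
events = [('_SMS.BUFFERED', 5), ('_SMS.FAILURE', 5), ('_SMS.SUCCESS', 5)]
total = sketch_total_cost(events)
```

Here only the failed attempt and the successful retry contribute to the total.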

How Has This Been Tested?

Unit testing

tests/app/notifications/test_process_notifications.py
==== 73 passed, 1 skipped in 121.73s (0:02:01) ====

tests/app/dao/notification_dao/test_notification_dao.py
==== 216 passed, 4 skipped in 90.20s (0:01:30) ====

tests/app/celery/test_process_pinpoint_receipt_tasks.py
==== 70 passed in 20.96s ====

tests/app/celery/test_process_delivery_status_result_tasks.py
==== 117 passed in 28.59s ====

First retry

Second retry

Third retry -> Permanent Failure

Database updates for all attempts

Note:

tests/app/celery/test_process_pinpoint_receipt_tasks.py :: test_process_pinpoint_results_notification_final_status

These record statuses were previously reported as STATUS_REASON_RETRYABLE.
They now reflect the final status once no further retries can be attempted.

('_SMS.FAILURE', 'UNREACHABLE', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'UNKNOWN', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'CARRIER_UNREACHABLE', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'TTL_EXPIRED', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),

Dev/QA Testing:

Initial Notification

Log processing details:
LOG 'Processing pinpoint result. | reference: %s | event_type: %s | record_status: %s | message_parts: %s | price_millicents: %s | provider_updated_at: %s'
Log initial logic:
LOG 'Initial %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s'
Log final logic:
LOG 'Final %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s | cost_in_millicents: %s'
Verify notification status:
notification.status == 'delivered'
Extract notification_id and reference from the "Final %s logic" message.

First Retry Attempt

Update notification status:
Update notification by notification.id, set notification.status = 'sending'
Create and send a Kinesis event payload:
- Reference: initial notification reference
- Event: _SMS.FAILURE, UNREACHABLE
Log processing details:
LOG 'Processing pinpoint result. | reference: %s | event_type: %s | record_status: %s | message_parts: %s | price_millicents: %s | provider_updated_at: %s'
Log retry logic:
LOG 'Entering %s retryable failure logic to process event | reference: %s | notification_id: %s | current_status: %s | current_status_reason: %s | event_sms_status: %s | event_sms_status_reason: %s'
Log requeue attempt:
LOG 'Notification updated prior to requeue attempt | notification_id: %s | notification_status: %s | cost_in_milicents %s'
LOG 'Attempting %s requeue | notification_id: %s | retry_delay: %s seconds | retry_count: %s'
Log requeue result:
LOG 'Requeued notification for delayed %s delivery | notification_id: %s | retry_delay: %s seconds'
Log SMS delivery:
LOG 'Start sending SMS for notification id: %s'
LOG 'Successfully sent sms for notification id: %s'
LOG 'Saved provider reference: {reference} for notification id: {notification.id}'
Verify notification is delivered after a 60-second delay.
Log final logic:
LOG 'Processing pinpoint result. | reference: %s | event_type: %s | record_status: %s | message_parts: %s | price_millicents: %s | provider_updated_at: %s'
LOG 'Initial %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s'
LOG 'Final %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s | cost_in_millicents: %s'
Note the new reference (1st retry notification).

Second Retry Attempt

Update notification status:
Update notification by notification.id, set notification.status = 'sending'
Create and send a Kinesis event payload:
- Reference: 1st retry notification reference
- Event: _SMS.FAILURE, UNREACHABLE
Repeat logging steps from the first retry attempt.
Verify notification is delivered after a 10-minute delay.
Log final logic and note the new reference (2nd retry notification).

Third Retry Attempt (Retry Limit Exceeded)

Update notification status:
Update notification by notification.id, set notification.status = 'sending'
Create and send a Kinesis event payload:
- Reference: 2nd retry notification reference
- Event: _SMS.FAILURE, UNREACHABLE
Log processing details:
LOG 'Processing pinpoint result. | reference: %s | event_type: %s | record_status: %s | message_parts: %s | price_millicents: %s | provider_updated_at: %s'
Log retry failure logic:
LOG 'Entering %s retryable failure logic to process event | reference: %s | notification_id: %s | current_status: %s | current_status_reason: %s | event_sms_status: %s | event_sms_status_reason: %s'
Log final logic:
LOG 'Final %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s | cost_in_millicents: %s'
Verify notification is updated to PERMANENT_FAILURE, UNDELIVERABLE.

Checklist

app/celery/process_delivery_status_result_tasks.py

ChrisJohnson-CDJ · 2025-01-16T16:13:40Z

Proposing an initial limit of two retry attempts, with delays of 60 seconds and 10 minutes. This keeps each delay within the 900-second limit imposed by AWS SQS when SQS is the Celery broker and countdown or eta is used with apply_async.

1st retry: 60 seconds +/- 10%
2nd retry: 10 minutes (600 seconds) +/- 10%

Will create a follow-up ticket to investigate longer delays.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-queues.html
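The arithmetic behind that choice can be checked in a few lines. This is only a sketch: the constant and function names are hypothetical, and the 900-second base is included purely as a counterexample.

```python
SQS_MAX_DELAY_SECONDS = 900  # maximum per-message delay SQS accepts (15 minutes)
JITTER_PERCENT = 10


def max_jittered_delay(base_seconds):
    """Largest delay the +/- 10% jitter can produce for a given base delay."""
    return base_seconds + base_seconds * JITTER_PERCENT // 100


# Map each candidate base delay to whether its worst case still fits under the limit.
fits = {base: max_jittered_delay(base) <= SQS_MAX_DELAY_SECONDS for base in (60, 600, 900)}
```

A 600-second base tops out at 660 seconds, so it fits; a 900-second base could jitter up to 990 seconds and would not.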

ChrisJohnson-CDJ · 2025-01-30T19:38:15Z

Test notes from the initial QA/dev testing are pending, but testing was successful on the Perf deployment.

app/celery/process_delivery_status_result_tasks.py

k-macmillan · 2025-01-30T21:43:38Z

app/celery/process_delivery_status_result_tasks.py

+            statsd_client.incr(f'clients.sms.{sms_status.provider}.status_update.success')
+        except Exception:
+            current_app.logger.exception(
+                'Failed to check_and_queue_callback_task for notification: %s', notification.id


It could be either function? I have a feeling this was copy/pasted and the original is wrong.

Moving the second callback function and the success counter into a nested try/except

This slipped by your updates.

k-macmillan · 2025-01-30T21:50:11Z

app/notifications/process_notifications.py

+        current_app.logger.critical(
+            'SQS resource failed to queue message for sqs queue "%s". notification_id: %s | Exception: %s',
+            prefixed_queue_name,
+            notification.id,
+            e,
+        )


Just use logger.exception and you don't need the e.

k-macmillan · 2025-01-30T21:54:24Z

tests/app/celery/test_process_delivery_status_result_tasks.py

+def test_update_sms_retry_count_redis_exception(mocker):
+    mocker.patch(
+        'app.celery.process_delivery_status_result_tasks.redis_store.set',
+        side_effect=Exception,


This should have been an exception from the Redis docs.

Parameterized to use common redis errors

k-macmillan · 2025-01-30T21:56:01Z

tests/app/celery/test_process_delivery_status_result_tasks.py

+
+def test_update_sms_retry_count_value_error(mocker):
+    mocker.patch('app.celery.process_delivery_status_result_tasks.redis_store.set')
+    mocker.patch('app.celery.process_delivery_status_result_tasks.redis_store.incr', return_value='not an integer')


Same with redis exceptions.

k-macmillan · 2025-01-30T22:06:51Z

app/celery/process_delivery_status_result_tasks.py

+        raise ValueError(
+            f"Expected an integer value for id '{notification_retry_id}', but got: {value} (type: {type(value)})"
+        )
+    except Exception:


Please use the redis base exception if you are not going to be specific.

Missed this.

k-macmillan · 2025-01-30T22:10:26Z

tests/app/celery/test_process_delivery_status_result_tasks.py

+        (3, 600),
+    ],
+)
+def test_ut_get_sms_retry_delay_returns_within_delay_range(retry_count, expected_base_delay):


This test is a risk. The jitter may return values correct in one test and incorrect in another. We are also using a test-set jitter rather than what the app code uses.

Updated the test to mock randint so that it validates just the base delay
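The approach can be sketched with unittest.mock. The jittered_delay helper below is a hypothetical stand-in for the app code, assumed to draw its jitter from random.randint.

```python
import random
from unittest.mock import patch


def jittered_delay(base_seconds, jitter_percent=10):
    """Hypothetical stand-in: base delay plus a random jitter of +/- jitter_percent."""
    jitter = base_seconds * jitter_percent // 100
    return base_seconds + random.randint(-jitter, jitter)


def test_jittered_delay_uses_base_delay():
    # With randint pinned to 0 the jitter vanishes, so the result is exactly the base delay.
    with patch('random.randint', return_value=0) as mock_randint:
        assert jittered_delay(60) == 60
        assert jittered_delay(600) == 600
    # The bounds passed on the last call are still checked.
    mock_randint.assert_called_with(-60, 60)
```

Pinning the random source makes the test deterministic, and asserting on the arguments keeps the jitter range itself under test.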

k-macmillan · 2025-01-31T14:27:53Z

tests/app/celery/test_process_delivery_status_result_tasks.py

+    )
+
+
+def test_ut_sms_attempt_retry_queued_if_retryable(mocker, sample_notification):


Should probably parameterize this with the expected values. e.g.

...parametrize("retry_count", [value for value in range(CARRIER_SMS_MAX_RETRIES)])

Or something to that effect.

k-macmillan · 2025-01-31T14:34:09Z

tests/app/celery/test_process_delivery_status_result_tasks.py

+    sms_attempt_retry(sms_status)
+
+    updated_notification = get_notification_by_id(notification.id)
+    assert updated_notification.cost_in_millicents == 10


nit:
instead of hardcoding 10 it should be notification.cost_in_millicents + sms_status.price_millicents

Saved off the original cost and added it to the sms_status price, since the notification has already been updated at that point

k-macmillan · 2025-01-31T14:36:30Z

tests/app/celery/test_process_delivery_status_result_tasks.py

+        status_reason=None,
+        cost_in_millicents=5,


nit:
These two fields shouldn't be necessary. We don't want to put more than is necessary because it can complicate things and cause issues if things change in the future.

Same below with SmsStatusRecord and any fields that have defaults and are not necessary for this test.

k-macmillan · 2025-01-31T14:38:41Z

tests/app/celery/test_process_delivery_status_result_tasks.py

+    sms_status = SmsStatusRecord(
+        None, notification.reference, NOTIFICATION_TEMPORARY_FAILURE, STATUS_REASON_RETRYABLE, PINPOINT_PROVIDER
+    )
+    mocker.patch('app.celery.process_delivery_status_result_tasks.update_sms_retry_count', side_effect=Exception)


That code can throw type, value, or a generic exception, all need to be tested (parametrize).

k-macmillan · 2025-01-31T14:39:13Z

tests/app/celery/test_process_delivery_status_result_tasks.py

+def test_ut_sms_attempt_retry_not_queued_if_exception_on_retry_count(mocker, sample_notification):
+    notification = sample_notification(
+        status=NOTIFICATION_SENDING,
+        status_reason=None,


Same comment regarding unnecessary fields.

k-macmillan · 2025-01-31T15:14:26Z

tests/app/dao/notification_dao/test_notification_dao.py

+    sample_notification,
+    current_status,
+):
+    status_reason = None if (current_status == NOTIFICATION_DELIVERED) else 'Because I said so!'


status_reason should only be set in the case of permanent failure now, if I understand the flow correctly.

k-macmillan · 2025-01-31T15:21:19Z

tests/app/notifications/test_process_notifications.py

+    # MaxNumberOfMessages ranges from 1-10, want to ensure only one message was added to the queue
+    messages = sqs_client.receive_message(QueueUrl=q_url, MaxNumberOfMessages=10)
+    assert len(messages.get('Messages')) == 1
+    assert (monotonic() - start) < 1


Probably want to use something like (time calc) < (delay_seconds + 1) and/or add a comment regarding why we're testing for 1.

Good point, make the logic clearer on what we expect and why

k-macmillan · 2025-01-31T15:23:03Z

tests/app/notifications/test_process_notifications.py

+    while messages.get('Messages') is None:
+        messages = sqs_client.receive_message(QueueUrl=q_url, MaxNumberOfMessages=10)
+        # prevent infinite loop if nothing is added to the queue
+        if monotonic() - start > 2:


Same, use delay_seconds in the equation.

k-macmillan · 2025-01-31T15:24:18Z

tests/app/notifications/test_process_notifications.py

+    # verify it took at least 1 sec to get message
+    delay = monotonic() - start
+    assert delay > 1


Please mention delay_seconds in the comment and/or modify the code to use delay_seconds.

nit:
delay can be calculated and asserted on the same line
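The timing comments above could be addressed along these lines. This is a sketch only: wait_until is a hypothetical helper, and the queue is simulated with a timer so the snippet runs without SQS.

```python
from time import monotonic, sleep


def wait_until(predicate, timeout_seconds, poll_seconds=0.01):
    """Poll predicate until it returns True, or raise TimeoutError after timeout_seconds."""
    deadline = monotonic() + timeout_seconds
    while not predicate():
        if monotonic() > deadline:
            raise TimeoutError(f'condition not met within {timeout_seconds} seconds')
        sleep(poll_seconds)


delay_seconds = 0.2  # hypothetical queue delay, kept short for the sketch
start = monotonic()

# Stand-in for "message is visible": true once delay_seconds have elapsed.
# The timeout is expressed in terms of delay_seconds rather than a bare constant.
wait_until(lambda: monotonic() - start >= delay_seconds, timeout_seconds=delay_seconds + 1)

# The message must not appear before delay_seconds, nor much later than it.
elapsed = monotonic() - start
```

Expressing both bounds in terms of delay_seconds documents why the numbers are what they are and keeps the test valid if the delay changes.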

k-macmillan · 2025-01-31T15:29:01Z

tests/app/notifications/test_process_notifications.py

+    assert task_body.get('task') == 'deliver_sms'
+
+    task_args = task_body.get('args')
+    assert task_args == [str(notification.id), str(None)]


None should not be cast to a string. It should not even be necessary? I am fairly certain this would make the sms_sender_id come in as 'None' rather than None.
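The concern is easy to demonstrate in isolation. The sketch assumes the task arguments travel as JSON (Celery's default serializer); the variable names are illustrative.

```python
import json

notification_id = 'hypothetical-notification-id'
sms_sender_id = None

# Casting None to a string turns it into the four-character text 'None'.
args_cast = json.loads(json.dumps([str(notification_id), str(sms_sender_id)]))

# Passing None through unchanged serializes to JSON null and comes back as None.
args_plain = json.loads(json.dumps([str(notification_id), sms_sender_id]))
```

In the first case a receiver checking `sms_sender_id is None` would see a truthy string instead.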

ChrisJohnson-CDJ · 2025-02-03T17:55:16Z

app/celery/process_delivery_status_result_tasks.py

+    # Our clients are not prepared to deal with pinpoint payloads
+    if not _get_include_payload_status(notification):
+        sms_status.payload = None
+


Added the second callback task as a nested try/except to ensure accurate error logging

k-macmillan · 2025-02-03T19:21:05Z

tests/app/celery/test_process_delivery_status_result_tasks.py

+        side_effect=exception,
+    )
+
+    with pytest.raises(Exception) as e:


Usually we're looking for the specific exception here, but I won't block it for that.

kalbfled

I don't insist that you completely revamp your testing strategy now. In the future, please focus on testing behavior (i.e. given this input, observe this output) rather than implementation (i.e. the function call raises an exception or calls this other function). Mock less.

kalbfled · 2025-02-03T19:32:25Z

tests/app/celery/test_process_delivery_status_result_tasks.py

@@ -538,6 +550,392 @@ def test_get_include_payload_status_exception(notify_api, mocker, sample_notific
    assert not _get_include_payload_status(sample_notification())


+@pytest.mark.parametrize('exception', [ConnectionError, RedisError, ResponseError, TimeoutError])
+def test_update_sms_retry_count_redis_set_exception(mocker, exception):


This is a test of implementation. What is the desired behavior in response to the exception? What input might raise this exception?

kalbfled · 2025-02-03T19:40:53Z

tests/app/celery/test_process_delivery_status_result_tasks.py

+
+
+@pytest.mark.parametrize('exception', [ConnectionError, RedisError, ResponseError, TimeoutError])
+def test_update_sms_retry_count_redis_incr_exception(mocker, exception):


Ditto. What is the significance of this in terms of the observed behavior of the application?

tests/app/celery/test_process_delivery_status_result_tasks.py

kalbfled · 2025-02-03T21:50:09Z

app/celery/process_delivery_status_result_tasks.py

+        try:
+            check_and_queue_va_profile_notification_status_callback(notification)
+            statsd_client.incr(f'clients.sms.{sms_status.provider}.status_update.success')
+        except Exception:


ditto specifics

app/celery/process_delivery_status_result_tasks.py

kalbfled · 2025-02-03T21:51:25Z

app/celery/process_delivery_status_result_tasks.py

+    try:
+        retry_count_redis_ttl = int(CARRIER_SMS_MAX_RETRY_WINDOW.total_seconds())
+        retry_count = update_sms_retry_count(notification_retry_id, ttl=retry_count_redis_ttl)
+    except Exception:


more specific

app/celery/process_delivery_status_result_tasks.py

kalbfled · 2025-02-03T21:53:52Z

app/celery/process_delivery_status_result_tasks.py

+                sms_sender_id=notification.sms_sender_id,
+                delay_seconds=retry_delay,
+            )
+        except Exception:


more specific to avoid hiding fixable problems

kalbfled

I'm not insisting now on catching more specific exceptions, but please do that in the future. Catching "Exception" can hide problems.
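A small illustration of how a broad except can hide a fixable problem. Both helpers are hypothetical and unrelated to the PR's code.

```python
def retry_count_broad(value):
    """Parse a retry count, swallowing every error."""
    try:
        return int(value.strip())
    except Exception:
        # Also swallows AttributeError when the caller passes None by mistake.
        return 0


def retry_count_specific(value):
    """Parse a retry count, handling only the expected failure."""
    try:
        return int(value.strip())
    except ValueError:
        # Only malformed text is expected here; other errors propagate.
        return 0
```

Both return 0 for 'abc', but only the specific version surfaces the AttributeError raised when None is passed, which points at a bug in the caller.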

Chris Johnson added 6 commits January 14, 2025 14:22

initial retry functionality

2dc5977

fix dependency and start testing

bc4cb13

bugfix and additional tests

8f29a0d

bugfix

ac3ca39

additional unit tests

e5cd5ce

cleanup and additional unit tests

419f94a

cris-oddball reviewed Jan 15, 2025

ChrisJohnson-CDJ added the patch label Jan 15, 2025

ChrisJohnson-CDJ self-assigned this Jan 15, 2025

cris-oddball reviewed Jan 15, 2025

additional unit tests

083c803

ChrisJohnson-CDJ added minor and removed patch labels Jan 15, 2025

Chris Johnson added 7 commits January 15, 2025 12:59

update max retries

8f52355

code review address feedback

10ff76d

code review updates

7f4ee44

code review updates

bc6b6bb

Merge branch 'main' into 1578-retry-after-pinpoint-sms-temp-failure

f76561b

update logic flow

e73f749

update retry limit

32f7e8f

add unit test

552e23d

ChrisJohnson-CDJ temporarily deployed to dev January 16, 2025 17:47 — with GitHub Actions Inactive

Chris Johnson added 2 commits January 16, 2025 15:06

add logging

86520f9

add logging

ff54187

ChrisJohnson-CDJ temporarily deployed to dev January 16, 2025 20:33 — with GitHub Actions Inactive

fix

cc53c6e

ChrisJohnson-CDJ temporarily deployed to dev January 16, 2025 20:54 — with GitHub Actions Inactive

Chris Johnson added 2 commits January 17, 2025 11:35

bugfix and additional unit testing

c4e74cc

Merge branch 'main' into 1578-retry-after-pinpoint-sms-temp-failure

7fbd4fc

ChrisJohnson-CDJ and others added 9 commits January 30, 2025 07:35

Merge branch 'main' into 1578-retry-after-pinpoint-sms-temp-failure

3d809d0

update unittests

d0facad

update unittests

4847da4

update unittests

2e7c4a7

update unittests

6d78729

add task message unit test

99156d4

added more tests for send_notification_to_queue_delayed

78c1ed7

unit test cleanup

9f741f2

add rate limit test back in

74723e0

ChrisJohnson-CDJ temporarily deployed to perf January 30, 2025 18:54 — with GitHub Actions Inactive

cris-oddball reviewed Jan 30, 2025

pr review update

486981c

k-macmillan reviewed Jan 30, 2025

k-macmillan requested changes Jan 30, 2025

k-macmillan requested changes Jan 31, 2025

MackHalliday self-requested a review January 31, 2025 17:56

Chris Johnson added 2 commits January 31, 2025 18:42

pr tech review update

cf673ed

refactor sms_attempt_retry and tests to reduce complexity

c3d99cb

ChrisJohnson-CDJ commented Feb 3, 2025

k-macmillan reviewed Feb 3, 2025

pr review fix exception logging

fe6fd3d

kalbfled requested changes Feb 3, 2025

Chris Johnson added 2 commits February 4, 2025 10:29

pr review remove ut/it test labels

89f5fd7

pr review style

62c1323

ChrisJohnson-CDJ temporarily deployed to dev February 4, 2025 16:20 — with GitHub Actions Inactive

k-macmillan approved these changes Feb 4, 2025

kalbfled approved these changes Feb 4, 2025

EvanParish merged commit 092fd37 into main Feb 4, 2025
13 checks passed

EvanParish deleted the 1578-retry-after-pinpoint-sms-temp-failure branch February 4, 2025 17:14
