#1578 - Retry After Pinpoint SMS Temporary Failure #2244
Conversation
Proposing an initial limit of two retry attempts, with delays of 60 seconds and 10 minutes (each +/- 10% jitter). This is to work within the 900-second limit imposed by AWS SQS when using SQS as a broker for Celery and using countdown or eta with apply_async. Will create a follow-up ticket to investigate longer delays.
Test notes from initial QA/dev testing are pending, but the tests were successful on the Perf deployment.
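For reference, a minimal sketch of the proposed schedule; the constants and helper name below are illustrative, not the actual app code:

import random

CARRIER_SMS_MAX_RETRIES = 2
RETRY_BASE_DELAYS = {1: 60, 2: 600}  # seconds: 1st retry 60s, 2nd retry 10 minutes


def get_sms_retry_delay(retry_count: int) -> int:
    """Return the base delay for this attempt with +/- 10% jitter applied."""
    base = RETRY_BASE_DELAYS[min(retry_count, CARRIER_SMS_MAX_RETRIES)]
    jitter = random.randint(-base // 10, base // 10)
    return base + jitter  # stays well under the 900s ceiling SQS imposes on delayed delivery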
    statsd_client.incr(f'clients.sms.{sms_status.provider}.status_update.success')
except Exception:
    current_app.logger.exception(
        'Failed to check_and_queue_callback_task for notification: %s', notification.id
It could be either function? I have a feeling this was copy/pasted and the original is wrong.
Moving the second callback function and the success metric into a nested try/except.
This slipped by your updates.
current_app.logger.critical(
    'SQS resource failed to queue message for sqs queue "%s". notification_id: %s | Exception: %s',
    prefixed_queue_name,
    notification.id,
    e,
)
Just use logger.exception and then you don't need the e.
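A minimal sketch of the suggestion using the stdlib logger (function and argument names are placeholders):

import logging

logger = logging.getLogger(__name__)


def queue_message(send, message, queue_name, notification_id):
    try:
        send(message, queue_name)
    except Exception:
        # logger.exception records the active traceback automatically,
        # so the exception object no longer needs to be interpolated.
        logger.exception(
            'SQS resource failed to queue message for sqs queue "%s". notification_id: %s',
            queue_name,
            notification_id,
        )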
def test_update_sms_retry_count_redis_exception(mocker):
    mocker.patch(
        'app.celery.process_delivery_status_result_tasks.redis_store.set',
        side_effect=Exception,
Should have been a specific exception from the redis docs.
Parameterized to use common redis errors
def test_update_sms_retry_count_value_error(mocker):
    mocker.patch('app.celery.process_delivery_status_result_tasks.redis_store.set')
    mocker.patch('app.celery.process_delivery_status_result_tasks.redis_store.incr', return_value='not an integer')
Same with redis exceptions.
    raise ValueError(
        f"Expected an integer value for id '{notification_retry_id}', but got: {value} (type: {type(value)})"
    )
except Exception:
Please use the redis base exception if you are not going to be specific.
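For example, a hypothetical helper that catches the redis-py base class rather than Exception (function and argument names are illustrative):

from redis.exceptions import RedisError


def safe_retry_count(redis_store, key: str, ttl: int):
    """Return the incremented retry count, or None if redis is unavailable."""
    try:
        redis_store.set(key, 0, ex=ttl, nx=True)  # initialize the counter only if missing
        return int(redis_store.incr(key))
    except RedisError:
        # RedisError is the library's base exception, so ConnectionError, TimeoutError,
        # and ResponseError are all covered without swallowing unrelated bugs.
        return None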
Missed this.
        (3, 600),
    ],
)
def test_ut_get_sms_retry_delay_returns_within_delay_range(retry_count, expected_base_delay):
This test is a risk: the jitter may produce values that pass in one run and fail in another. We are also using a test-set jitter rather than what the app code uses.
Updated the test to mock randint so it validates just the base delay.
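Something along these lines, where the import path and function name are assumptions about the app code:

import pytest

from app.celery.process_delivery_status_result_tasks import get_sms_retry_delay  # hypothetical import


@pytest.mark.parametrize(
    'retry_count, expected_base_delay',
    [
        (1, 60),
        (2, 600),
        (3, 600),
    ],
)
def test_ut_get_sms_retry_delay_returns_base_delay(mocker, retry_count, expected_base_delay):
    # Pin the jitter to zero so only the deterministic base delay is asserted.
    mocker.patch('app.celery.process_delivery_status_result_tasks.randint', return_value=0)
    assert get_sms_retry_delay(retry_count) == expected_base_delay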
)


def test_ut_sms_attempt_retry_queued_if_retryable(mocker, sample_notification):
    sms_attempt_retry(sms_status)

    updated_notification = get_notification_by_id(notification.id)
    assert updated_notification.cost_in_millicents == 10
nit:
instead of hardcoding 10 it should be notification.cost_in_millicents + sms_status.price_millicents
Saved off the original cost and added the sms_status price, since the notification has already been updated at that point.
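Roughly, reusing the names from the snippet above:

original_cost = notification.cost_in_millicents  # capture before sms_attempt_retry mutates it

sms_attempt_retry(sms_status)

updated_notification = get_notification_by_id(notification.id)
assert updated_notification.cost_in_millicents == original_cost + sms_status.price_millicents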
        status_reason=None,
        cost_in_millicents=5,
nit:
These two fields shouldn't be necessary. We don't want to set more than is necessary, because it can complicate things and cause issues if things change in the future. Same below with SmsStatusRecord and any fields that have defaults and are not necessary for this test.
    sms_status = SmsStatusRecord(
        None, notification.reference, NOTIFICATION_TEMPORARY_FAILURE, STATUS_REASON_RETRYABLE, PINPOINT_PROVIDER
    )
    mocker.patch('app.celery.process_delivery_status_result_tasks.update_sms_retry_count', side_effect=Exception)
That code can throw a TypeError, a ValueError, or a generic Exception; all of them need to be tested (parametrize).
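For instance (the test name is an assumption):

import pytest


@pytest.mark.parametrize('exception', [TypeError, ValueError, Exception])
def test_ut_sms_attempt_retry_not_queued_on_retry_count_error(mocker, sample_notification, exception):
    mocker.patch(
        'app.celery.process_delivery_status_result_tasks.update_sms_retry_count',
        side_effect=exception,
    )
    # ...then exercise sms_attempt_retry and assert the notification is not requeued.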
def test_ut_sms_attempt_retry_not_queued_if_exception_on_retry_count(mocker, sample_notification):
    notification = sample_notification(
        status=NOTIFICATION_SENDING,
        status_reason=None,
Same comment regarding unnecessary fields.
    sample_notification,
    current_status,
):
    status_reason = None if (current_status == NOTIFICATION_DELIVERED) else 'Because I said so!'
status_reason should only be set in the case of permanent failure now, if I understand the flow correctly.
    # MaxNumberOfMessages ranges from 1-10, want to ensure only one message was added to the queue
    messages = sqs_client.receive_message(QueueUrl=q_url, MaxNumberOfMessages=10)
    assert len(messages.get('Messages')) == 1
    assert (monotonic() - start) < 1
Probably want to use something like (time calc) < (delay_seconds + 1)
and/or add a comment regarding why we're testing for 1.
Good point, I'll make the logic clearer about what we expect and why.
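A sketch of how the assertions could read when expressed in terms of delay_seconds; the value and the sqs_client/q_url fixtures mirror the snippets here and are assumptions:

from time import monotonic

delay_seconds = 1  # the delay used when the message was queued

start = monotonic()
messages = sqs_client.receive_message(QueueUrl=q_url, MaxNumberOfMessages=10)
while messages.get('Messages') is None:
    messages = sqs_client.receive_message(QueueUrl=q_url, MaxNumberOfMessages=10)
    # bail out relative to the configured delay rather than a bare magic number
    if monotonic() - start > delay_seconds + 1:
        break

# the delayed message should not become visible before delay_seconds has elapsed
assert monotonic() - start > delay_seconds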
    while messages.get('Messages') is None:
        messages = sqs_client.receive_message(QueueUrl=q_url, MaxNumberOfMessages=10)
        # prevent infinite loop if nothing is added to the queue
        if monotonic() - start > 2:
Same, use delay_seconds in the equation.
    # verify it took at least 1 sec to get message
    delay = monotonic() - start
    assert delay > 1
Please mention delay_seconds in the comment and/or modify the code to use delay_seconds.
nit:
delay can be calculated and asserted on the same line
    assert task_body.get('task') == 'deliver_sms'

    task_args = task_body.get('args')
    assert task_args == [str(notification.id), str(None)]
None should not be cast to a string. It should not even be necessary? I am fairly certain this would make the sms_sender_id come in as 'None' rather than None.
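That is, if the task is queued without wrapping sms_sender_id in str(), the assertion would look like:

task_args = task_body.get('args')
assert task_args == [str(notification.id), None]  # a real None, not the string 'None'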
    # Our clients are not prepared to deal with pinpoint payloads
    if not _get_include_payload_status(notification):
        sms_status.payload = None
Added the second callback task in a nested try/except to ensure accurate error logging.
        side_effect=exception,
    )

    with pytest.raises(Exception) as e:
Usually we're looking for the specific exception here, but I won't block it for that.
I don't insist that you completely revamp your testing strategy now. In the future, please focus on testing behavior (i.e. given this input, observe this output) rather than implementation (i.e. the function call raises an exception or calls this other function). Mock less.
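As an illustrative (not drop-in) example of the behavior-first style, under the assumption that the retry budget has already been exhausted for this notification:

def test_sms_attempt_retry_exhausted_budget_marks_permanent_failure(sample_notification):
    notification = sample_notification(status=NOTIFICATION_SENDING)
    sms_status = SmsStatusRecord(
        None, notification.reference, NOTIFICATION_TEMPORARY_FAILURE, STATUS_REASON_RETRYABLE, PINPOINT_PROVIDER
    )
    # ...arrange the retry counter so the max retry count is already reached...

    sms_attempt_retry(sms_status)

    # Assert on the observable outcome rather than on which helpers were called.
    updated = get_notification_by_id(notification.id)
    assert updated.status == NOTIFICATION_PERMANENT_FAILURE
    assert updated.status_reason == STATUS_REASON_UNDELIVERABLE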
@@ -538,6 +550,392 @@ def test_get_include_payload_status_exception(notify_api, mocker, sample_notific
    assert not _get_include_payload_status(sample_notification())


@pytest.mark.parametrize('exception', [ConnectionError, RedisError, ResponseError, TimeoutError])
def test_update_sms_retry_count_redis_set_exception(mocker, exception):
This is a test of implementation. What is the desired behavior in response to the exception? What input might raise this exception?
@pytest.mark.parametrize('exception', [ConnectionError, RedisError, ResponseError, TimeoutError])
def test_update_sms_retry_count_redis_incr_exception(mocker, exception):
Ditto. What is the significance of this in terms of the observed behavior of the application?
try:
    check_and_queue_va_profile_notification_status_callback(notification)
    statsd_client.incr(f'clients.sms.{sms_status.provider}.status_update.success')
except Exception:
ditto specifics
try:
    retry_count_redis_ttl = int(CARRIER_SMS_MAX_RETRY_WINDOW.total_seconds())
    retry_count = update_sms_retry_count(notification_retry_id, ttl=retry_count_redis_ttl)
except Exception:
more specific
        sms_sender_id=notification.sms_sender_id,
        delay_seconds=retry_delay,
    )
except Exception:
more specific to avoid hiding fixable problems
I'm not insisting now on catching more specific exceptions, but please do that in the future. Catching "Exception" can hide problems.
Description
Issue #1578
SMS requests resulting in _SMS.FAILURE events with a retryable record status are retried.
A temporary/retryable failure is not reported to the client.
_SMS.FAILURE record statuses treated as STATUS_REASON_RETRYABLE:
UNREACHABLE, UNKNOWN, CARRIER_UNREACHABLE, TTL_EXPIRED
Retry a maximum of 2 times within a 3-day window.
Retries are delayed using increasing delays with each attempt.
A +/- 10% jitter is applied to each delay.
1st retry: 60 seconds +/- 10%
2nd retry: 10 minutes (600 seconds) +/- 10%
This retry schedule should address transient errors as well as short-term service disruptions or unavailability.
Retryable failures that exceed the retry count or retry window are marked as permanent failures and reported to the client:
NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE
Cost computation (price_millicents) has been updated to sum successive retries and/or the final SUCCESS/FAILURE.
_SMS.BUFFERED is ignored for price updates since it should have a reported price_millicents according to AWS docs.
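In sketch form, the retry decision the description implies looks roughly like the following; the constant values match the description, but the function and argument names are illustrative, not the PR's identifiers:

from datetime import datetime, timedelta, timezone

CARRIER_SMS_MAX_RETRIES = 2
CARRIER_SMS_MAX_RETRY_WINDOW = timedelta(days=3)


def should_retry(retry_count: int, first_attempt_at: datetime) -> bool:
    """Retry only while both the attempt budget and the 3-day window allow it."""
    within_window = datetime.now(timezone.utc) - first_attempt_at <= CARRIER_SMS_MAX_RETRY_WINDOW
    return retry_count <= CARRIER_SMS_MAX_RETRIES and within_window

Anything failing this check is marked NOTIFICATION_PERMANENT_FAILURE with STATUS_REASON_UNDELIVERABLE and reported to the client.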
How Has This Been Tested?
Unit testing
tests/app/notifications/test_process_notifications.py
==== 73 passed, 1 skipped in 121.73s (0:02:01) ====
tests/app/dao/notification_dao/test_notification_dao.py
==== 216 passed, 4 skipped in 90.20s (0:01:30) ====
tests/app/celery/test_process_pinpoint_receipt_tasks.py
==== 70 passed in 20.96s ====
tests/app/celery/test_process_delivery_status_result_tasks.py
==== 117 passed in 28.59s ====
First retry (screenshot)
Second retry (screenshot)
Third retry -> Permanent Failure (screenshot)
Database updates for all attempts (screenshot)
Note:
tests/app/celery/test_process_pinpoint_receipt_tasks.py :: test_process_pinpoint_results_notification_final_status
These record statuses were previously reported as STATUS_REASON_RETRYABLE.
This now reflects the final status after no further retries can be attempted:
('_SMS.FAILURE', 'UNREACHABLE', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'UNKNOWN', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'CARRIER_UNREACHABLE', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
('_SMS.FAILURE', 'TTL_EXPIRED', NOTIFICATION_PERMANENT_FAILURE, STATUS_REASON_UNDELIVERABLE),
Dev/QA Testing:
Initial Notification
Log processing details:
LOG 'Processing pinpoint result. | reference: %s | event_type: %s | record_status: %s | message_parts: %s | price_millicents: %s | provider_updated_at: %s'
Log initial logic:
LOG 'Initial %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s'
Log final logic:
LOG 'Final %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s | cost_in_millicents: %s'
Verify notification status:
notification.status == 'delivered'
Extract notification_id and reference from the "Final %s logic" message.
First Retry Attempt
Update notification status:
Update notification by notification.id, set notification.status = 'sending'
Create and send a Kinesis event payload:
_SMS.FAILURE, UNREACHABLE
Log processing details:
LOG 'Processing pinpoint result. | reference: %s | event_type: %s | record_status: %s | message_parts: %s | price_millicents: %s | provider_updated_at: %s'
Log retry logic:
LOG 'Entering %s retryable failure logic to process event | reference: %s | notification_id: %s | current_status: %s | current_status_reason: %s | event_sms_status: %s | event_sms_status_reason: %s'
Log requeue attempt:
LOG 'Notification updated prior to requeue attempt | notification_id: %s | notification_status: %s | cost_in_milicents %s'
LOG 'Attempting %s requeue | notification_id: %s | retry_delay: %s seconds | retry_count: %s'
Log requeue result:
LOG 'Requeued notification for delayed %s delivery | notification_id: %s | retry_delay: %s seconds'
Log SMS delivery:
LOG 'Start sending SMS for notification id: %s'
LOG 'Successfully sent sms for notification id: %s'
LOG 'Saved provider reference: {reference} for notification id: {notification.id}'
Verify notification is delivered after a 60-second delay.
Log final logic:
LOG 'Processing pinpoint result. | reference: %s | event_type: %s | record_status: %s | message_parts: %s | price_millicents: %s | provider_updated_at: %s'
LOG 'Initial %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s'
LOG 'Final %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s | cost_in_millicents: %s'
Note the new reference (1st retry notification).
Second Retry Attempt
Update notification status:
Update notification by notification.id, set notification.status = 'sending'
Create and send a Kinesis event payload:
_SMS.FAILURE, UNREACHABLE
Repeat logging steps from the first retry attempt.
Verify notification is delivered after a 10-minute delay.
Log final logic and note the new reference (2nd retry notification).
Third Retry Attempt (Retry Limit Exceeded)
Update notification status:
Update notification by notification.id, set notification.status = 'sending'
Create and send a Kinesis event payload:
_SMS.FAILURE, UNREACHABLE
Log processing details:
LOG 'Processing pinpoint result. | reference: %s | event_type: %s | record_status: %s | message_parts: %s | price_millicents: %s | provider_updated_at: %s'
Log retry failure logic:
LOG 'Entering %s retryable failure logic to process event | reference: %s | notification_id: %s | current_status: %s | current_status_reason: %s | event_sms_status: %s | event_sms_status_reason: %s'
Log final logic:
LOG 'Final %s logic | reference: %s | notification_id: %s | status: %s | status_reason: %s | cost_in_millicents: %s'
Verify notification is updated to PERMANENT_FAILURE, UNDELIVERABLE.
Checklist