Add notification duration metrics (#4772)#4788
Conversation
…ature/4772-add-notification-duration-metrics_21_05
|
|
||
| if ((this->type != MqttNotification) && (this->type != KafkaNotification)) | ||
| { | ||
| if ((this->lastNotification > 0) && (this->lastNotificationDuration >= 0)) |
There was a problem hiding this comment.
Why the condition on lastNotification is > and the condition in lastNotificationDuration is >= ?
There was a problem hiding this comment.
lastNotification > 0 is used because 0 means no notification has ever been sent.
For lastNotificationDuration, 0 ms is a valid duration, so the condition must be >= 0. Using > 0 caused flaky tests because very fast notifications could be measured as either 0 ms or at least 1 ms.
When the field is missing in MongoDB, it is initialized to -1, so >= 0 ensures that only real lastNotificationDuration are serialized.
There was a problem hiding this comment.
So if the notification is too fast (less than 1ms) then it doesn't count for duration counters (and the lastNotificationDuration is not printed)? That's fine (just to confirm)
However, note this situation:
- Notification 1: 10ms
- Notification 2: 20ms
- Notification 3: 0ms (too fast)
It's ok not printing lastNotificationDuration this time but accumulatedNotificationDuration would be printed with 30 value.
Thus, I understand each field should be checked individually.
There was a problem hiding this comment.
If a notification takes 0 ms (less than 1 ms), it does count and both fields will be printed.
The serialization code checks:
if ((this->lastNotification > 0) && (this->lastNotificationDuration >= 0))
As 0 is >= 0, they are included in the JSON response. In fact, if we didn't print these metrics when they are 0, they gave me flaky tests. 0 is a valid measurement.
The fields are only omitted when lastNotificationDuration < 0 . The default value is -1 when no notification has been sent yet.
| | `lastNotificationDuration` | Only on retrieval | number | Not editable, only present in GET operations. Duration of the last notification attempt in milliseconds. | | ||
| | `accumulatedNotificationDuration` | Only on retrieval | number | Not editable, only present in GET operations. Sum of durations of all notification attempts in milliseconds. | |
There was a problem hiding this comment.
They are incluced only in HTTP notifications or also in MQTT/KAFKA?
There was a problem hiding this comment.
The duration metrics are only applicable to HTTP notifications. I’ve clarified this in the documentation.
| if (cssP != NULL) | ||
| { | ||
| if (cssP->lastNotificationTime > cSubP->lastNotificationTime) | ||
| if (cssP->lastNotificationTime >= cSubP->lastNotificationTime) |
There was a problem hiding this comment.
Seems to be duplicated in L1288... weird
There was a problem hiding this comment.
It is not a duplicate line. The >= condition (cssP->lastNotificationTime >= cSubP->lastNotificationTime) is used to synchronize lastNotificationDuration . I have documented this in the code, and it is tested by 4772_notification_duration_metrics/duration_metrics_slow_sync.test and
4772_notification_duration_metrics/refresh_cache_notif_equal_timestamps.test (we can remove the latter if it seems too forced). The >= condition also handles the case where two notifications occur in the same second.
| "description": "HTTP sub", | ||
| "id": "REGEX([0-9a-f\-]{24})", | ||
| "notification": { | ||
| "accumulatedNotificationDuration": REGEX(\d+), |
There was a problem hiding this comment.
In all places we have REGEX(\d+) for the new counters
I'd suggest to have some "real" tests using some delay endpoint in accumulator-server.py using the /givemeDelay endpoint (which imposes T seconds delay).
Thus, we could trigger three notifications so:
- After the first notification lastNotificationDuration is T*1000 and accumulatedNotificationDuration is T
- After the first notification lastNotificationDuration is T*1000 and accumulatedNotificationDuration is
2*T*1000 - After the first notification lastNotificationDuration is T*1000 and accumulatedNotificationDuration is
3*T*1000
The miliseconds introduce some variability, but we would do some rouding based in REGEX(). Eg T=10
2*T*1000 = 2 * 10 * 10000 = 20000 => REGEX(20\d\d\d) (range 20000 to 20099, i.e. 20-21s)
There was a problem hiding this comment.
Currently the /givemeDelay is like this:
@app.route("/givemeDelay", methods=['POST'])
def givemeDelay():
sleep(60)
return Response(status=200)
T = 60 seconds is probably too long for this kind of test. I'd suggest to make this configurable in accumulator-server.py, then use a shorter T (eg. T = 10)
There was a problem hiding this comment.
Test added: 4772_notification_duration_metrics/duration_metrics_accumulator_delay.test
| "description": "HTTP sub", | ||
| "id": "REGEX([0-9a-f\-]{24})", | ||
| "notification": { | ||
| "accumulatedNotificationDuration": REGEX(\d+), |
There was a problem hiding this comment.
If a subscription originally uses HTTP, then changed to MQTT/KAFKA (where I understand these new counters are not used but I have asked for confirmation in another comment), then changed back to HTTP, what happens with the counters? Are reseted or preserved?
I thinks it's better to reset (and specify it in documentation)
(This is a kind of corner case, as a subscription rarely changes its notification type, but we should have it covered)
There was a problem hiding this comment.
I’ve added a test and the necessary changes to cover PATCH transitions from HTTP to MQTT/Kafka and back to HTTP. The changes will be included in the next commit : 0c5d6b1
There was a problem hiding this comment.
Looking to the tests it seems that when we change HTTP -> MQTT/KAFKA -> HTTP the metrics are reseted (ie. lastDuration to 0 and accumulatedDuration to 0). Is my understanding correct?
There was a problem hiding this comment.
When the protocol is switched from HTTP to MQTT/Kafka, the database fields lastNotificationDuration and accumulatedNotificationDuration are removed from the MongoDB using $unset .
fiware-orion/src/lib/mongoBackend/mongoUpdateSubscription.cpp
Lines 464 to 469 in 9edc6f4
In cache, lastNotificationDuration is reset to -1 and notificationDurationDelta to 0 .
fiware-orion/src/lib/mongoBackend/mongoUpdateSubscription.cpp
Lines 332 to 337 in 9edc6f4
As lastNotificationDuration is less than 0 , Orion does not include either of the two fields in the subscription response, and they will not appear again unless the subscription is switched back to HTTP and a new notification is sent.
| A `condition` contains the following subfields: | ||
|
|
||
| | Parameter | Optional | Type | Description | | ||
| |--------------|----------|-------|-------------------------------------------------------------------------------------------------------------------------------| |
There was a problem hiding this comment.
CNR entry about the changes in this PR should be included
| "lastNotificationDuration": REGEX((9|10|11)\d\d\d), | ||
| "lastSuccess": REGEX(1[6-9]\d+), |
There was a problem hiding this comment.
lastNotificationDuration": REGEX((9|10|11)\d\d\d) the duration is expected to be around 10,000 milliseconds. the actual measured duration might fluctuate slightly (e.g., 9,988 ms or 10,018 ms ). Using (9|10|11)\d\d\d , preventing flaky test failures.
"lastSuccess": REGEX(1[6-9]\d+) In MongoDB this field is stored as a Unix timestamp (seconds since epoch). We use this expression in tests like these, e.g:
Co-authored-by: Fermín Galán Márquez <fgalan@users.noreply.github.com>
|
This commit adds the duration metrics to the existing MongoDB update: 9edc6f4 |
Issue #4772