Add notification duration metrics (#4772) #4788

Open
orianar wants to merge 20 commits into master from feature/4772-add-notification-duration-metrics_21_05

orianar (Collaborator) commented May 23, 2026

Issue #4772

orianar changed the title "[WIP] Add notification duration metrics (#4772)" to "Add notification duration metrics (#4772)" on Jun 1, 2026
orianar requested a review from fgalan on June 1, 2026 03:38
…ature/4772-add-notification-duration-metrics_21_05

```cpp
if ((this->type != MqttNotification) && (this->type != KafkaNotification))
{
  if ((this->lastNotification > 0) && (this->lastNotificationDuration >= 0))
```

Member

Why is the condition on lastNotification > while the condition on lastNotificationDuration is >=?

Collaborator Author

lastNotification > 0 is used because 0 means no notification has ever been sent.

For lastNotificationDuration, 0 ms is a valid duration, so the condition must be >= 0. Using > 0 caused flaky tests because very fast notifications could be measured as either 0 ms or at least 1 ms.

When the field is missing in MongoDB, it is initialized to -1, so >= 0 ensures that only real lastNotificationDuration values are serialized.
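To illustrate why a 0 ms value shows up at all, here is a minimal C++ sketch. It assumes the duration is measured with std::chrono and truncated to whole milliseconds (the excerpts in this PR do not show how it is measured); toWholeMillis, isRealDuration and kDurationNotSet are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical sentinel for "never measured", matching the -1 default
// described above for a field missing in MongoDB.
constexpr int64_t kDurationNotSet = -1;

// Converts an elapsed interval to whole milliseconds (assumed measurement
// scheme). duration_cast truncates toward zero, so anything under 1 ms is 0.
int64_t toWholeMillis(std::chrono::nanoseconds elapsed)
{
  assert(elapsed.count() >= 0);  // elapsed time from a steady clock is never negative
  return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}

// Every non-negative value, including 0, is a real measurement; only the
// sentinel is excluded.
bool isRealDuration(int64_t durationMs)
{
  return durationMs >= 0;
}
```

Under that assumption, a 500 µs notification is stored as 0 ms and a 1.5 ms one as 1 ms, which is why a > 0 check would flip between the two cases while >= 0 accepts both.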

Member

So if the notification is too fast (less than 1 ms), it doesn't count for the duration counters (and lastNotificationDuration is not printed)? That's fine (just to confirm).

However, note this situation:

  • Notification 1: 10ms
  • Notification 2: 20ms
  • Notification 3: 0ms (too fast)

It's OK not to print lastNotificationDuration this time, but accumulatedNotificationDuration would be printed with a value of 30.

Thus, I understand each field should be checked individually.

Collaborator Author

If a notification takes 0 ms (less than 1 ms), it does count and both fields will be printed.

The serialization code checks:

```cpp
if ((this->lastNotification > 0) && (this->lastNotificationDuration >= 0))
```

Since 0 >= 0, both fields are included in the JSON response. In fact, when these metrics were omitted for 0 values, the tests were flaky: 0 is a valid measurement.

The fields are only omitted when lastNotificationDuration < 0. Its default value is -1 when no notification has been sent yet.
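The reviewer's three-notification scenario can be replayed with a small stand-in for the cached fields. DurationMetrics is hypothetical, not Orion's class: it only reproduces the serialization guard quoted above, assuming both fields are rendered together as the author describes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-in for the subscription fields discussed here.
struct DurationMetrics
{
  int64_t lastNotification                = 0;   // 0 = never notified
  int64_t lastNotificationDuration        = -1;  // -1 = never measured
  int64_t accumulatedNotificationDuration = 0;

  // Records one notification sent at 'timestamp' (seconds) that took 'durationMs'.
  void record(int64_t timestamp, int64_t durationMs)
  {
    assert(durationMs >= 0);
    lastNotification                 = timestamp;
    lastNotificationDuration         = durationMs;
    accumulatedNotificationDuration += durationMs;
  }

  // Same condition as the serialization code: a 0 ms duration still qualifies.
  bool rendered() const
  {
    return (lastNotification > 0) && (lastNotificationDuration >= 0);
  }

  // Renders both fields as a JSON fragment, or an empty string when omitted.
  std::string toJsonFragment() const
  {
    if (!rendered())
    {
      return "";
    }
    return "\"lastNotificationDuration\":" + std::to_string(lastNotificationDuration) +
           ",\"accumulatedNotificationDuration\":" + std::to_string(accumulatedNotificationDuration);
  }
};
```

After notifications of 10 ms, 20 ms and 0 ms, lastNotificationDuration is 0 and accumulatedNotificationDuration is 30, and both appear in the output because 0 passes the >= 0 check.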

Comment thread doc/manuals/orion-api.md (outdated)
Comment on lines +5070 to +5071
| `lastNotificationDuration` | Only on retrieval | number | Not editable, only present in GET operations. Duration of the last notification attempt in milliseconds. |
| `accumulatedNotificationDuration` | Only on retrieval | number | Not editable, only present in GET operations. Sum of durations of all notification attempts in milliseconds. |

Member

Are they included only in HTTP notifications, or also in MQTT/Kafka ones?

Collaborator Author

The duration metrics are only applicable to HTTP notifications. I’ve clarified this in the documentation.
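A minimal sketch of how the HTTP-only restriction combines with the duration guard, modelled on the nested conditions quoted at the start of this review. The enum and shouldRenderDurationMetrics are hypothetical simplifications, not Orion's actual types or functions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the notification type values quoted in the review.
enum NotificationType { HttpNotification, MqttNotification, KafkaNotification };

// Duration metrics are rendered only for non-MQTT/Kafka (i.e. HTTP)
// subscriptions that have notified at least once and have a real
// (>= 0) last duration.
bool shouldRenderDurationMetrics(NotificationType type,
                                 int64_t          lastNotification,
                                 int64_t          lastNotificationDuration)
{
  assert(lastNotificationDuration >= -1);  // -1 is the only negative value expected

  if ((type == MqttNotification) || (type == KafkaNotification))
  {
    return false;  // MQTT/Kafka notifications are not timed
  }
  return (lastNotification > 0) && (lastNotificationDuration >= 0);
}
```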

Comment thread src/lib/cache/subCache.cpp (outdated)
```diff
 if (cssP != NULL)
 {
-  if (cssP->lastNotificationTime > cSubP->lastNotificationTime)
+  if (cssP->lastNotificationTime >= cSubP->lastNotificationTime)
```

Member

This seems to be duplicated in L1288... weird.

Collaborator Author

It is not a duplicate line. The >= condition (cssP->lastNotificationTime >= cSubP->lastNotificationTime) is used to synchronize lastNotificationDuration. I have documented this in the code, and it is tested by 4772_notification_duration_metrics/duration_metrics_slow_sync.test and 4772_notification_duration_metrics/refresh_cache_notif_equal_timestamps.test (we can remove the latter if it seems too forced). The >= condition also handles the case where two notifications occur in the same second.
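A simplified C++ sketch of why >= matters during a cache refresh. It assumes that cssP holds the status saved from the in-memory cache before the refresh, that cSubP is the entry reloaded from the database, and that lastNotificationTime has one-second resolution; SavedStatus, CachedSub and syncFromSaved are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical status saved from the cache before the refresh (role of cssP).
struct SavedStatus
{
  int64_t lastNotificationTime;      // seconds since epoch
  int64_t lastNotificationDuration;  // ms, -1 if never measured
};

// Hypothetical entry reloaded from the database (role of cSubP).
struct CachedSub
{
  int64_t lastNotificationTime;
  int64_t lastNotificationDuration;
};

// Restores the saved status when it is at least as recent as the reloaded
// entry. With '>', a notification in the same second as the stored one is
// skipped and the reloaded (older) duration wins.
void syncFromSaved(const SavedStatus& saved, CachedSub* entry, bool useGreaterOrEqual = true)
{
  assert(entry != nullptr);

  bool atLeastAsRecent = useGreaterOrEqual
    ? (saved.lastNotificationTime >= entry->lastNotificationTime)
    : (saved.lastNotificationTime >  entry->lastNotificationTime);

  if (atLeastAsRecent)
  {
    entry->lastNotificationTime     = saved.lastNotificationTime;
    entry->lastNotificationDuration = saved.lastNotificationDuration;
  }
}
```

With equal timestamps, the strict comparison keeps the 7 ms read back from the database, while >= restores the 42 ms measured more recently in memory within the same second.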

Comment thread src/lib/mongoBackend/mongoUpdateSubscription.cpp
"description": "HTTP sub",
"id": "REGEX([0-9a-f\-]{24})",
"notification": {
"accumulatedNotificationDuration": REGEX(\d+),

fgalan (Member), Jun 4, 2026

In all places we have REGEX(\d+) for the new counters.

I'd suggest adding some "real" tests that use the /givemeDelay endpoint in accumulator-server.py (which imposes a delay of T seconds).

Thus, we could trigger three notifications so that:

  • After the first notification, lastNotificationDuration is T*1000 and accumulatedNotificationDuration is T*1000
  • After the second notification, lastNotificationDuration is T*1000 and accumulatedNotificationDuration is 2*T*1000
  • After the third notification, lastNotificationDuration is T*1000 and accumulatedNotificationDuration is 3*T*1000

The milliseconds introduce some variability, but we could round using REGEX(). E.g. T=10:

2*T*1000 = 2 * 10 * 1000 = 20000 => REGEX(20\d\d\d) (range 20000 to 20999, i.e. 20-21 s)
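The arithmetic above and the rounding trick can be checked with a short C++ sketch. expectedAccumulatedMs and matchesWhole are hypothetical helpers, and std::regex_match is used as a stand-in for the test harness REGEX(), assuming the harness matches the whole value.

```cpp
#include <cassert>
#include <cstdint>
#include <regex>
#include <string>

// Expected accumulatedNotificationDuration (ms) after k notifications to an
// endpoint that delays each response by T seconds: k * T * 1000.
int64_t expectedAccumulatedMs(int64_t k, int64_t delaySeconds)
{
  assert(k >= 0 && delaySeconds >= 0);
  return k * delaySeconds * 1000;
}

// True when the decimal form of 'value' matches 'pattern' in full.
// For k = 2, T = 10 the pattern "20\\d\\d\\d" accepts 20000..20999,
// i.e. 20 s plus up to 999 ms of overhead.
bool matchesWhole(const std::string& pattern, int64_t value)
{
  return std::regex_match(std::to_string(value), std::regex(pattern));
}
```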

Member

Currently the /givemeDelay is like this:

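```python
from time import sleep
from flask import Flask, Response
app = Flask(__name__)

@app.route("/givemeDelay", methods=['POST'])
def givemeDelay():
    sleep(60)  # fixed 60-second delay before responding
    return Response(status=200)
```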

T = 60 seconds is probably too long for this kind of test. I'd suggest making this configurable in accumulator-server.py, then using a shorter T (e.g. T = 10).

Collaborator Author

Test added: 4772_notification_duration_metrics/duration_metrics_accumulator_delay.test

"description": "HTTP sub",
"id": "REGEX([0-9a-f\-]{24})",
"notification": {
"accumulatedNotificationDuration": REGEX(\d+),

Member

If a subscription originally uses HTTP, is then changed to MQTT/Kafka (where I understand these new counters are not used, but I have asked for confirmation in another comment), and is then changed back to HTTP, what happens with the counters? Are they reset or preserved?

I think it's better to reset them (and to specify this in the documentation).

(This is a kind of corner case, as a subscription rarely changes its notification type, but we should have it covered)

Collaborator Author

I’ve added a test and the necessary changes to cover PATCH transitions from HTTP to MQTT/Kafka and back to HTTP. The changes will be included in the next commit: 0c5d6b1

Member

Looking at the tests, it seems that when we change HTTP -> MQTT/Kafka -> HTTP the metrics are reset (i.e. lastDuration to 0 and accumulatedDuration to 0). Is my understanding correct?

orianar (Collaborator Author), Jun 13, 2026

When the protocol is switched from HTTP to MQTT/Kafka, the database fields lastNotificationDuration and accumulatedNotificationDuration are removed from MongoDB using $unset.

```cpp
if (subUp.notificationProvided)
{
  if (subUp.notification.type != ngsiv2::HttpNotification)
  {
    unsetHttpNotificationDurationMetrics(&unsetB);
  }
  // ...
}
```

In the cache, lastNotificationDuration is reset to -1 and notificationDurationDelta to 0.

```cpp
if (subUp.notificationProvided &&
    subUp.notification.type != ngsiv2::HttpNotification)
{
  lastNotificationDuration  = -1;
  notificationDurationDelta = 0;
}
```

As lastNotificationDuration is less than 0, Orion includes neither of the two fields in the subscription response, and they will not appear again unless the subscription is switched back to HTTP and a new notification is sent.
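The HTTP -> MQTT/Kafka -> HTTP sequence described above can be modelled as follows. This is a hypothetical sketch: SubState, patchType and persistedAccumulated are illustrative names, the $unset is modelled by zeroing the persisted value, and it assumes a missing accumulatedNotificationDuration is read back as 0. The rendering check is simplified to the -1 sentinel only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the notification type values quoted in the review.
enum NotificationType { HttpNotification, MqttNotification, KafkaNotification };

// Hypothetical cache-side view of one subscription. notificationDurationDelta
// is assumed to hold milliseconds not yet flushed to MongoDB.
struct SubState
{
  NotificationType type                      = HttpNotification;
  int64_t          lastNotificationDuration  = -1;
  int64_t          notificationDurationDelta = 0;
  int64_t          persistedAccumulated      = 0;  // value stored in MongoDB

  void patchType(NotificationType newType)
  {
    if (newType != HttpNotification)
    {
      persistedAccumulated      = 0;   // models the $unset of the DB fields
      lastNotificationDuration  = -1;  // cache reset, as quoted above
      notificationDurationDelta = 0;
    }
    type = newType;
  }

  void recordHttpNotification(int64_t durationMs)
  {
    assert(type == HttpNotification && durationMs >= 0);
    lastNotificationDuration   = durationMs;
    notificationDurationDelta += durationMs;
  }

  int64_t accumulated() const { return persistedAccumulated + notificationDurationDelta; }
  bool    rendered() const    { return lastNotificationDuration >= 0; }
};
```

In this model the counters restart from scratch after the round trip: the first HTTP notification after switching back sets both lastNotificationDuration and the accumulated value to that notification's duration.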

Comment thread doc/manuals/orion-api.md
A `condition` contains the following subfields:

| Parameter | Optional | Type | Description |
|--------------|----------|-------|-------------------------------------------------------------------------------------------------------------------------------|

Member

A CNR entry about the changes in this PR should be included.

Collaborator Author

Added CNR entry.

orianar requested a review from fgalan on June 10, 2026 21:57
Comment thread src/lib/cache/subCache.cpp (outdated)
Comment on lines +344 to +345
"lastNotificationDuration": REGEX((9|10|11)\d\d\d),
"lastSuccess": REGEX(1[6-9]\d+),

Member

Can you elaborate on this, please?

Collaborator Author

"lastNotificationDuration": REGEX((9|10|11)\d\d\d): the duration is expected to be around 10,000 milliseconds, but the actual measured duration might fluctuate slightly (e.g., 9,988 ms or 10,018 ms). Using (9|10|11)\d\d\d accepts any value from 9000 to 11999 ms, which prevents flaky test failures.
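A quick way to confirm which values the tolerance pattern accepts, using std::regex_match as a stand-in for the harness REGEX() (assuming a whole-value match); acceptedAsTenSeconds is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <regex>
#include <string>

// The pattern from the test expectation, written as a C++ string literal.
const std::regex kAroundTenSeconds("(9|10|11)\\d\\d\\d");

// True when the decimal form of the duration matches the pattern in full.
bool acceptedAsTenSeconds(int64_t durationMs)
{
  assert(durationMs >= 0);  // durations are never negative once measured
  return std::regex_match(std::to_string(durationMs), kAroundTenSeconds);
}
```

Under that assumption, both 9,988 ms and 10,018 ms pass, while 8,999 ms and 12,000 ms fall outside the 9000 to 11999 ms window.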

"lastSuccess": REGEX(1[6-9]\d+) In MongoDB this field is stored as a Unix timestamp (seconds since epoch). We use this expression in tests like these, e.g:

orianar and others added 2 commits June 12, 2026 05:44
Co-authored-by: Fermín Galán Márquez <fgalan@users.noreply.github.com>
orianar commented Jun 12, 2026

Collaborator Author

This commit adds the duration metrics to the existing MongoDB update: 9edc6f4

orianar requested a review from fgalan on June 12, 2026 23:26