Skip to content

Create TelemetryWdmTrait definition #15

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

SurekhaMuralidharan777
Copy link

WDM telemetry trait

@SurekhaMuralidharan777
Copy link
Author

@SurekhaMuralidharan777
Copy link
Author

@pidarped adding to this review ( unable to add as reviewer since I am not owner)

Copy link
Member

@robszewczyk robszewczyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would very much request we fix the clerical issues -- copyright etc.

I have questions about the design and suggestions for how to improve it. If the design proposed makes sense to you, I would very much like to see it changed. If this was something that was already considered and rejected, please post a rebuttal as to why the "last failure" is a preferred way to log the event.

@@ -0,0 +1,94 @@
//
// Copyright (c) 2016 Nest Labs, Inc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the copyright header: 2020, Google LLC, Apache license.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


// This field holds the count of successful updates since the
// last restart.
uint32 successful_updates_count = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these metrics aggregated across different subscriptions?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is accumulated across multiple subscriptions.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this looks like an aggregate of all Updates(presumably unconditional and conditional). It might be good to separate conditional vs unconditional updates also?



// This field holds the error code of the last update failure.
int32 last_update_failure_reason = 7;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the last failure update, it would also make some sense to note the address of the subscriber.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

};

// Events associated with WDM updates
message TelemetryWdmUpdateStatsEvent {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to log the update and subscription failures as first class events rather than remembering the last event in a different event? The semantics of most of the field in this trait make quite a lot of sense as properties. As events I would suggest we expose 3 different events: subscription failure (fields: reason), update failure (fields: reason, profile, status_code), and counts (fields: successful_updates, unsuccessful_updates, successful_subscriptions, subscription_attempts). If desired, the last subscription failure and last update failure could also be exposed as properties, so they are available for inspection on connectivity recovery

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think the last failures would better serve as trait properties. That makes sense especially if these events are not uploaded periodically and visibility of the last failures might fade unless preserved as properties.
If we break these into multiple events, then the subscription and update failures could be triggered on the edges while the counter is a periodic one with a relatively long cadence(~20-30 mins)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified to what was discussed offline , update on edge events for failure with properties in case the events are dropped lost due to lack of buffer space.

@SurekhaMuralidharan777
Copy link
Author

Updated latest version based on review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants