Skip to content

Conversation

AndreKurait
Copy link
Member

@AndreKurait AndreKurait commented Apr 29, 2025

Description

Add stage name to metrics with trace sampling.

Fix dependency ordering with otel collector and service container, fixes otel sdk connection error during task shutdown / startup when the service is running without the collector.

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-1949

Testing

Verified metrics with dimension and dashboards in jenkins environment

C&R Dashboard Example:
image

Check List

  • New functionality includes testing
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@AndreKurait AndreKurait marked this pull request as ready for review April 29, 2025 14:15
otlp:
protocols:
grpc:
endpoint: :4317
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears the default here changed from :4317 to localhost:4317 which broke some of our inter-container communication without moving this back.

@@ -1,5 +1,5 @@
exporters:
logging:
debug:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to upgrade otel, logging is deprecated/removed

@@ -1,3 +1,3 @@
FROM grafana/grafana:11.1.0
FROM grafana/grafana:11.6.1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't needed, i can revert if we want

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all tests are passing with this version update, then let's keep it.

@@ -1,4 +1,4 @@
FROM public.ecr.aws/aws-observability/aws-otel-collector:v0.38.0
FROM public.ecr.aws/aws-observability/aws-otel-collector:v0.43.3
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, does this version update have any breaking changes impact our codebase?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, see earlier comment which i adjusted to fix, as well as logging -> debug

It appears the default here changed from :4317 to localhost:4317 which broke some of our inter-container communication without moving this back.

action: delete
extensions:
health_check:
endpoint: :13133
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will having an endpoint for health checks help us differentiate between an actual request vs health check requests in our kafka logs? If yes, then I think that Capture and Replay Dashboard widget for request/response mismatches should exclude health check requests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is otel healthcheck endpoint, will not appear on any dashboards or kafka

'REGION': props.region,
'MA_STAGE': props.stage
'MA_STAGE': props.stage,
'MA_QUALIFIER': props.stage
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we change this to props.qualifier ?? props.stage ? in case in the future we want to diverge the use-cases of qualifier from stage

exporters:
awsemf:
namespace: 'OpenSearchMigrations'
dimension_rollup_option: NoDimensionRollup # Reduce number of metrics by only publishing with all dimensions
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on this ? Our dashboards will make an assumption that dimensions are guaranteed to be present.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only publishes with the most granular dimension, if this was excluded, as an example, we would be publishing the metrics as stage-specific metrics as well as not stage specific

Copy link
Collaborator

@jugal-chauhan jugal-chauhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving this PR as the changes are consistent and all tests appear to be passing.

@AndreKurait AndreKurait merged commit 6b663c1 into opensearch-project:main May 30, 2025
58 checks passed
Copy link

codecov bot commented May 30, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (3db3f15) to head (57401a4).
Report is 148 commits behind head on main.

Additional details and impacted files
@@     Coverage Diff      @@
##   main   #1488   +/-   ##
============================
============================

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants