Issue #2689: Unsatisfactory behavior of snapshot selector #2697

Closed
PeterF778 wants to merge 4 commits into signalfx:main from PeterF778:2689_Unsatisfactory_behavior_of_snapshot_selector

Conversation

@PeterF778

Change the snapshot profiling selection algorithm to be based exclusively on the trace ID. Remove the concepts of snapshot Volume, SnapshotVolumePropagator, and ProbabilisticSnapshotSelector. Update unit tests.

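For readers unfamiliar with the idea, here is a minimal sketch of what a selection decision based only on the trace ID could look like. It is not the PR's `TraceIdBasedSnapshotSelector`; the class name, the `shouldSnapshot` helper, and the threshold comparison on the low 64 bits of the trace ID are all assumptions made for illustration.

```java
public class TraceIdSelectionSketch {

    // Hypothetical helper (not from the PR): decides from the trace ID alone.
    // traceIdHex is assumed to be a 32-character lowercase hex trace ID.
    static boolean shouldSnapshot(String traceIdHex, double selectionProbability) {
        if (selectionProbability <= 0.0) {
            return false;
        }
        if (selectionProbability >= 1.0) {
            return true;
        }
        // Take the low 64 bits of the 128-bit trace ID (last 16 hex characters).
        long lowBits = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        // Clear the sign bit so the value lies in [0, Long.MAX_VALUE].
        long value = lowBits & Long.MAX_VALUE;
        // Select the trace when its value falls below the probability threshold.
        long threshold = (long) (selectionProbability * Long.MAX_VALUE);
        return value < threshold;
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        // The decision depends only on the inputs, so repeated calls agree.
        boolean first = shouldSnapshot(traceId, 0.25);
        boolean second = shouldSnapshot(traceId, 0.25);
        System.out.println(first == second);
    }
}
```

Because the decision is a pure function of the trace ID and the configured probability, no extra state has to be carried between services for them to agree, provided they share the same algorithm and probability.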
@PeterF778 PeterF778 requested review from a team as code owners March 10, 2026 21:16
@github-actions

github-actions bot commented Mar 10, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@PeterF778
Author

recheck

Contributor

@breedx-splk breedx-splk left a comment

Wow thanks, this is sooooo much nicer! 🏆

PeterF778 and others added 2 commits March 12, 2026 11:48
…hot/TraceIdBasedSnapshotSelector.java

Co-authored-by: jason plumb <75337021+breedx-splk@users.noreply.github.com>
…hot/TraceIdBasedSnapshotSelector.java

Co-authored-by: jason plumb <75337021+breedx-splk@users.noreply.github.com>
* capable agents can make the same snapshotting decision, if necessary.
*/
@Override
public <C> Context extract(Context context, C carrier, TextMapGetter<C> getter) {
Collaborator

My understanding is that previously the decision whether to profile or not was propagated from the upstream service using the volume baggage entry. So the first called service decided in some way whether to profile or not, and the other services followed that decision.
After these changes, every service will decide independently whether to profile or not. If they use the same algorithm with the same probability, they'll reach the same decision. If they use a different probability, they may reach a different decision. Do we need to confirm with someone that it is OK to remove the behavior where the first service decides whether to profile?
Secondly, if we remove it now and replace the selection algorithm, services running old and new code will not reach the same profiling decision. Is this OK?
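
To make the "same probability, same decision" point concrete, the following is a rough simulation of two services that each decide on their own from the trace ID. The `select` function is a hypothetical threshold-style selector assumed for illustration, not the algorithm in this PR, and the probabilities, trace count, and seed are arbitrary.

```java
import java.util.Random;

public class IndependentDecisionSketch {

    // Hypothetical threshold-style selector, assumed for illustration only.
    static boolean select(long traceIdLowBits, double probability) {
        long value = traceIdLowBits & Long.MAX_VALUE; // clear the sign bit
        return value < (long) (probability * Long.MAX_VALUE);
    }

    // Counts the trace IDs on which two services, each deciding alone, disagree.
    static int countDisagreements(
            double upstreamProbability, double downstreamProbability, int traces, long seed) {
        Random random = new Random(seed);
        int disagreements = 0;
        for (int i = 0; i < traces; i++) {
            long id = random.nextLong(); // stands in for the low 64 bits of a trace ID
            if (select(id, upstreamProbability) != select(id, downstreamProbability)) {
                disagreements++;
            }
        }
        return disagreements;
    }

    public static void main(String[] args) {
        // Same probability on both sides: the decisions always match.
        System.out.println(countDisagreements(0.1, 0.1, 10_000, 42L)); // prints 0
        // Different probabilities: some traces are profiled by only one service.
        System.out.println(countDisagreements(0.1, 0.5, 10_000, 42L));
    }
}
```

Under this assumed threshold scheme, a trace selected at the lower probability is also selected at the higher one, so with mismatched probabilities the disagreement is one-sided. Whether that holds for the selector in this PR depends on its actual implementation.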

Author

> My understanding is that previously the decision whether to profile or not was propagated from the upstream service using the volume baggage entry. So the first called service decided in some way whether to profile or not, and the other services followed that decision. After these changes, every service will decide independently whether to profile or not. If they use the same algorithm with the same probability, they'll reach the same decision. If they use a different probability, they may reach a different decision. Do we need to confirm with someone that it is OK to remove the behavior where the first service decides whether to profile? Secondly, if we remove it now and replace the selection algorithm, services running old and new code will not reach the same profiling decision. Is this OK?

One important detail is that while the original design intended to propagate the profiling decision from upstream, it did not actually work. See the results of my testing quoted in the ticket. So I think that we do not really have to worry about "breaking" that behavior, because it was already broken.

Furthermore, I do not think that the intention of the old design was appropriate. I do not see a use case for profiling downstream services just because the upstream service said so. It looks to me like that design was copied from the AppD agent, which does send a correlation token downstream asking for a snapshot to be taken. But that was different. The reason for taking such snapshots in AppD was to preserve information about a particular transaction instance (request); normally the AppD agent sends summaries only. In OTel, there's no need for that, as this functionality comes for free with every request/trace.

Collaborator

@laurit laurit Mar 17, 2026

Thanks for the explanation.

…ector

# Conflicts:
#	profiler/src/test/java/com/splunk/opentelemetry/profiler/snapshot/SnapshotProfilingConfigurationCustomizerProviderTest.java
#	profiler/src/test/java/com/splunk/opentelemetry/profiler/snapshot/SnapshotVolumePropagatorComponentProviderTest.java
@robsunday
Contributor

This PR cannot be merged because one of the commits is not signed.

@robsunday robsunday closed this Mar 19, 2026
@github-actions github-actions bot locked and limited conversation to collaborators Mar 19, 2026
4 participants