Stratified Sampling Policy for Tailsampling processor #41877
Conversation
Please check the CLA and mark ready for review again.

This PR was marked stale due to lack of activity. It will be closed in 14 days.
Welcome, contributor! Thank you for your contribution to opentelemetry-collector-contrib. Important reminders:
A maintainer will review your pull request soon. Thank you for helping make OpenTelemetry better!
@atoulme, the suggested changes have been made. We'd appreciate your insights or suggestions on this. Also, if there's anything further we can clarify or update in the PR to move it forward, please let us know.
Please take a look at the lint issues in CI. These are minor and can be addressed quickly. @portertech please review?
// HashSalt allows one to configure the hashing salt. This is important in scenarios where multiple layers of collectors
// have different sampling rates: if they use the same salt, all traces passing one layer may pass the other even if they
// have different sampling rates; configuring different salts avoids that.
HashSalt string `mapstructure:"hash_salt"`
Please consider using the pkg/sampling support in this repository instead of a hash-based approach. OpenTelemetry systems are expected to observe the W3C TraceContext Level 2 specification, which means there are 56 bits of randomness available in one of two ways implemented by that library. We do not encourage hash-based sampling; see the approach we've taken to upgrade the probabilisticsampler processor, which is also the subject of this (current) blog post draft: open-telemetry/opentelemetry.io#7735.

Moreover, there are other probability samplers in this component's configuration: I would expect them all to use the same approach, whatever it is, and would prefer to keep this code as simple as possible.
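For context, threshold sampling over the 56 random bits of a W3C-random trace ID works roughly as sketched below. This is a stdlib-only illustration of the mechanism; the real helpers live in pkg/sampling, and the function names here are assumptions made up for the sketch, not that library's API.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// W3C TraceContext Level 2 treats the low 7 bytes (56 bits) of the trace ID
// as random. Threshold sampling keeps a trace when its randomness value is
// at or above a rejection threshold derived from the sampling probability,
// so no hash or salt is needed and decisions are consistent across layers.

const maxRandomness = uint64(1) << 56

// probabilityToThreshold maps p in (0, 1] to a 56-bit rejection threshold.
func probabilityToThreshold(p float64) uint64 {
	return maxRandomness - uint64(p*float64(maxRandomness))
}

// randomnessFromTraceID extracts the low 56 bits of a 16-byte trace ID.
func randomnessFromTraceID(id [16]byte) uint64 {
	return binary.BigEndian.Uint64(id[8:]) & (maxRandomness - 1)
}

func shouldSample(p float64, id [16]byte) bool {
	return randomnessFromTraceID(id) >= probabilityToThreshold(p)
}

func main() {
	var id [16]byte
	for i := 9; i < 16; i++ {
		id[i] = 0xff // low 56 bits all ones: passes any p > 0
	}
	fmt.Println(shouldSample(0.25, id)) // true
	var zero [16]byte
	fmt.Println(shouldSample(0.25, zero)) // false: randomness 0 is below the threshold
}
```

Because every layer computes the same randomness from the same trace ID, two collectors with different probabilities nest correctly: the lower-probability threshold simply rejects a superset of what the higher one rejects.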
// StratifiedProbabilisticCfg holds the configurable settings to create a stratified probabilistic
// sampling policy evaluator.
type StratifiedProbabilisticCfg struct {
From the description I read, I am not sure the term "Stratified" has quite been earned, though from the PR description of ("at least once") sampling, there's something useful here. Note that the composite sampling policy of this component is similar to what you're proposing, too, except (IIUC) you're adding an at-least-once fallback instead of a default-bucket approach.

If, as I take it, what you're trying to achieve is not based specifically on this at-least-once principle, but instead you are aiming just to achieve good coverage across all values in a key-space, then I support it, but it leaves me with questions for this configuration struct. I would imagine wanting a rate-limited sampler that tries to achieve balance, which means estimating the most-frequent values in the set and assigning (somehow) the percentage to use for the remaining bunch. (This is what the composite sampler policy in this component does.)

From looking into this problem, I believe the best answer would be to configure only a rate limit and nothing else: let the component figure out what sampling probabilities to use for which strata, and also let the component control the relative weight of the "other bunch", which is to say how much weight of the distribution falls into the default bucket vs. how much is explicitly managed with a fixed-size lookup table used to calculate the probability that will achieve the intended rate.
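One way to read the "configure only a rate limit" suggestion is sketched below. All names and numbers here are illustrative assumptions, not anything in the component: each stratum gets an equal share of the budget, strata whose whole traffic fits in a share are kept at probability 1, and the leftover budget is split across the high-rate strata.

```go
package main

import "fmt"

// balanceProbabilities derives per-stratum sampling probabilities from a
// single total budget (sampled items/sec), given estimated per-stratum
// arrival rates (items/sec). Low-rate strata are fully sampled; the budget
// they leave unused is redistributed evenly across the remaining strata.
// A real implementation would estimate rates online (e.g. a fixed-size
// top-k table plus a default bucket) and recompute periodically.
func balanceProbabilities(rates map[string]float64, limit float64) map[string]float64 {
	probs := make(map[string]float64, len(rates))
	remaining := limit
	unsaturated := len(rates)
	// First pass: strata whose entire traffic fits in an equal share keep p=1.
	for name, r := range rates {
		if r <= limit/float64(len(rates)) {
			probs[name] = 1
			remaining -= r
			unsaturated--
		}
	}
	// Second pass: split the leftover budget evenly across high-rate strata.
	for name, r := range rates {
		if _, done := probs[name]; !done {
			probs[name] = remaining / float64(unsaturated) / r
		}
	}
	return probs
}

func main() {
	// 62 sampled spans/sec budget across three strata of very different rates.
	rates := map[string]float64{"checkout": 1000, "search": 100, "health": 2}
	fmt.Println(balanceProbabilities(rates, 62))
}
```

With these numbers the low-rate stratum is kept whole (p=1 for the 2/sec stratum), and each high-rate stratum contributes 30 sampled spans/sec, so the expected total matches the 62/sec budget.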
See also https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/probabilisticsamplerprocessor/README.md, which describes two modes of sampling that do not use any "salt".
@jmacd Thank you for your response! We'll explore the pkg/sampling package for possible alternatives to hash-based sampling. We couldn't find an existing policy that allows sampling traces based on specific use-case semantics. That's why we've been exploring a stratified probabilistic sampling approach, which ensures at least one trace is sampled per distinct use case. This method provides a more accurate reflection of how the application is actually used, and significantly improves the usability of the sampled trace set for downstream analysis. If we rely solely on automated components to determine sampling probabilities (e.g., purely based on frequency or volume), the resulting trace set often lacks the diversity and completeness required for effective observability and diagnostics. We'd appreciate any suggestions or guidance you may have on achieving such sampling.

@jmacd, just a gentle reminder to let us know your suggestions.
@dhanyarmathews Thank you for bringing this topic to the Sampling SIG today! I will be glad to help review and resolve the concerns I had. Thank you for your patience.
👋 Would this be a good candidate to add as an extension instead of a core sampling strategy? Extensions were added after you opened this PR (see #42573), but they have been working well for us, and could provide a good location for more experimental sampling strategies.
Agree with @csmarchbanks that an extension might be a good start. The demonstrated CPU increase is around 30%, but from experience I'm guessing the TSP is actually very little of the baseline CPU. Typically that is taken up by protobuf and garbage collection, so a 30% overall increase could actually be 2x or more on the actual TSP CPU. I'd recommend using pprof to check the CPU usage before and after instead.
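For the before/after comparison, one common setup is to enable the collector's pprof extension (assuming the build includes it; localhost:1777 is that extension's documented default endpoint):

```yaml
extensions:
  pprof:
    endpoint: localhost:1777

service:
  extensions: [pprof]
```

A 30-second CPU profile can then be captured with `go tool pprof -seconds 30 http://localhost:1777/debug/pprof/profile`, once with the new policy disabled and once enabled, and the two profiles compared to isolate the TSP's own cost from protobuf and GC overhead.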
@csmarchbanks or @Logiraptor, would you briefly explain what it means for us to create these sampling policies as extensions? Do you mean to create extensions registered somewhere within processor/tailsamplingprocessor? How would this be built and deployed for a user in that form?
They would be registered as normal extensions rather than added to the processor's own code. I am fairly new to the collector, so please let me know if any of that is incorrect, but that is what I want to support with extensions!
Very cool! I want to know more about this process. @csmarchbanks would you be willing to help document this as a way to help @dhanyarmathews in this effort? 😁
Yep, gladly. I can look at creating an example next week as well.
Thank you @csmarchbanks, @Logiraptor, @jmacd. I would like to learn more about the process. As I'm relatively new to this repository, any assistance or pointers would be very helpful. Additionally, are there any similar extensions we could review as a starting point?
Good timing :). I literally just opened #43972 to demonstrate how to create an extension for the tail sampling processor. Happy to help out if you have any questions!
Thank you @csmarchbanks. I'll review the material and follow up with any questions I may have.
@csmarchbanks, @jmacd, I am refactoring the sampling policy to be an extension. A major problem I am facing is that I am not able to register this extension as a sampling policy: while deploying the collector in a Kubernetes setup, I am getting an error. From the logs, I can see that the extension is getting started, but the policy is not made effective. The config yaml snippet that I have covers the processors and service sections. Any help on how to correctly register the extension as a sampling policy and make it work alongside other sampling policies in the tail sampling processor is greatly appreciated. Thanks!
At a glance that config looks reasonable. If you are seeing the extension start, then I would recommend adding a bit of debug logging around here to see whether the extension is being added to the tail sampling processor correctly. My initial guess is that either it isn't present in the list for some reason, or the cast to the interface isn't working.
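The two failure modes described above can be illustrated with the sketch below. The types are simplified stand-ins (the real collector uses `component.Host.GetExtensions()` returning a map of `component.Component`, and a policy interface defined by the processor); the names here are assumptions for illustration only.

```go
package main

import "fmt"

// The tail sampling processor looks an extension up in the host's extension
// map and type-asserts it to a policy interface, so there are two distinct
// failure modes worth logging separately: the ID is missing from the map,
// or the cast fails because the extension does not implement the interface.

type extension interface{ Start() error }

type policyEvaluator interface {
	extension
	Evaluate(traceID string) (bool, error)
}

// myExtension implements extension but NOT policyEvaluator, so it starts
// fine yet the cast below fails: exactly the "started but not effective"
// symptom seen in the logs.
type myExtension struct{}

func (myExtension) Start() error { return nil }

func lookupPolicy(exts map[string]extension, id string) (policyEvaluator, error) {
	ext, found := exts[id]
	if !found {
		return nil, fmt.Errorf("extension %q is not in the host's extension list", id)
	}
	p, ok := ext.(policyEvaluator)
	if !ok {
		return nil, fmt.Errorf("extension %q does not implement the policy interface", id)
	}
	return p, nil
}

func main() {
	exts := map[string]extension{"stratified": myExtension{}}
	if _, err := lookupPolicy(exts, "stratified"); err != nil {
		fmt.Println(err) // prints: extension "stratified" does not implement the policy interface
	}
}
```

Logging which of the two errors fires narrows the bug immediately: a missing ID points at the service/extensions wiring, while a failed cast points at a mismatch between the extension's methods and the interface the processor expects.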
Changes are being made and testing is in progress. Will re-submit the changes for review ASAP.
Moved to draft while things are being tested. Please mark as ready for review once testing is complete and conflicts are resolved.
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
Description
Adding stratified sampling policy to the tailsampling processor
Link to tracking issue
Fixes #40917
Testing
The following local testing has been performed. The binary has also been deployed to a Kubernetes setup to verify the changes.
Documentation
This new sampling policy, called the stratified sampling policy, samples a new trajectory whenever it is encountered for the first time within a sampling interval. If a trajectory has already been observed within that interval, the policy will revert to a probabilistic sampling approach, where trajectories are selected based on predefined probabilities. This ensures that newly encountered trajectories are prioritized for sampling while maintaining flexibility for previously seen trajectories.
The sampling policy can be used as follows:
tail_sampling:
  decision_wait: <decision_wait>
  num_traces: <num_traces>
  expected_new_traces_per_sec: <expected_traces>
  policies:
    - name: stratifiedprob-sample
      type: stratified
      stratified:
        sampling_percentage: <desired_sampling_percentage>