Skip to content

Conversation

@dhanyarmathews
Copy link

Description

Adding stratified sampling policy to the tailsampling processor

Link to tracking issue

Fixes 40917

Testing

Following local testing has been performed:

  1. make otelcontribcol
  2. make golint
  3. make gotest
    Also have deployed the binary to a Kubernetes setup to verify the changes

Documentation

This new sampling policy, called the stratified sampling policy, samples a new trajectory whenever it is encountered for the first time within a sampling interval. If a trajectory has already been observed within that interval, the policy will revert to a probabilistic sampling approach, where trajectories are selected based on predefined probabilities. This ensures that newly encountered trajectories are prioritized for sampling while maintaining flexibility for previously seen trajectories.

The sampling policy can be used as follows:
tail_sampling:
decision_wait: <decision_wait>
num_traces: <num_traces>
expected_new_traces_per_sec: <expected_traces>
policies:
- name: stratifiedprob-sample
type: stratified
stratified:
sampling_percentage: <desired_sampling_percentage>

@dhanyarmathews dhanyarmathews requested a review from a team as a code owner August 9, 2025 13:33
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Aug 9, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@atoulme
Copy link
Contributor

atoulme commented Aug 9, 2025

Please check the CLA and mark ready for review again.

@dhanyarmathews dhanyarmathews requested a review from atoulme August 11, 2025 11:18
@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Aug 26, 2025
@dhanyarmathews dhanyarmathews marked this pull request as ready for review August 28, 2025 04:42
@github-actions github-actions bot removed the Stale label Aug 28, 2025
@github-actions github-actions bot added the first-time contributor PRs made by new contributors label Sep 2, 2025
@github-actions
Copy link
Contributor

github-actions bot commented Sep 2, 2025

Welcome, contributor! Thank you for your contribution to opentelemetry-collector-contrib.

Important reminders:

A maintainer will review your pull request soon. Thank you for helping make OpenTelemetry better!

@dhanyarmathews
Copy link
Author

@atoulme , The suggested changed have been made. We appreciate your insights or suggestions on this. Also, if there’s anything further we can clarify or update in the PR to move it forward, please let us know

@atoulme
Copy link
Contributor

atoulme commented Sep 5, 2025

Please take a look at the lint issues with the CI. These are minor and can be addressed quickly. @portertech please review?

Comment on lines +180 to +183
// HashSalt allows one to configure the hashing salts. This is important in scenarios where multiple layers of collectors
// have different sampling rates: if they use the same salt all passing one layer may pass the other even if they have
// different sampling rates, configuring different salts avoids that.
HashSalt string `mapstructure:"hash_salt"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider using the pkg/sampling support in this repository instead of a hash-based approach. OpenTelemetry systems are expected to observe the W3C TraceContext Level 2 specification, which means there are 56 bits of randomness available in one of two ways implemented by that library. We do not encourage hash-based sampling, see the approach we've taken to upgrade in probabilisticsampling processor, which is also the subject of this (current) blog post draft: open-telemetry/opentelemetry.io#7735.

Moreover, there are other probability samplers in this component's configuration: I would expect them all to use the same approach, whatever it is, and would prefer to keep this code as simple as possible.


// StratifiedProbabilisticCfg holds the configurable settings to create a stratified probabilistic
// sampling policy evaluator.
type StratifiedProbabilisticCfg struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the description I read, I am not sure the term "Stratified" has been quite earned, though from the PR description of ("at least once") sampling, there's something useful here. Note that the composite sampling policy of this component is similar to what you're proposing, too, except (IIUC) you're adding an at-least-once fallback instead of a default-bucket approach.

If, as I take it, what you're trying to achieve is not based specifically on this at-least-once principal, but instead you are aiming just to achieve good coverage across all values in a key-space, then I support, but it leaves me with questions for this configuration struct. I would imagine wanting a rate-limited sampler that tries to achieve balance, which means estimating the most-frequent values in the set and assigning (somehow) the percentage to use for the remaining bunch. (This is what the composite sampler policy in this component does.) I believe from looking into this problem, that the best answer would be only to configure a rate limit and nothing else; let the component figure out what sampling probabilities to use for which strata and also let the component control the relative weight of the "other bunch", which is to say: how much weight of the distribution falls into the default bucket vs how much is explicitly managed with a fixed-size lookup table used to calculate the probability that will achieve the intended rate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhanyarmathews
Copy link
Author

@jmacd Thank you for your response! We'll explore the pkg/sampling package for possible alternatives to implement hash-based sampling.
To provide some context: in application services, each use case corresponds to a distinct trajectory of microservices involved in fulfilling that use case. During our experiments with various tail sampling policies, we observed that the sampled traces often do not accurately reflect actual application usage patterns. In particular, commonly used use cases tend to dominate the sampled set, especially when using policies like rate_limiting or composite. Conversely, less frequent but important use cases are often underrepresented or entirely missing. This poses challenges for downstream tasks such as anomaly detection and root cause analysis, which rely on a more representative and balanced trace set.

We couldn’t find an existing policy that allows sampling traces based on specific use-case semantics. That’s why we’ve been exploring a stratified probabilistic sampling approach, which ensures at least one trace is sampled per distinct use case. This method provides a more accurate reflection of how the application is actually used, and significantly improves the usability of the sampled trace set for downstream analysis. If we rely solely on automated components to determine sampling probabilities (e.g., purely based on frequency or volume), the resulting trace set often lacks the diversity and completeness required for effective observability and diagnostics.

We’d appreciate any suggestions or guidance you may have on achieving such a sampling.

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Sep 28, 2025
@dhanyarmathews
Copy link
Author

@jmacd , just a gentle reminder to let us know your suggestions.

@github-actions github-actions bot removed the Stale label Oct 11, 2025
@jmacd
Copy link
Contributor

jmacd commented Oct 23, 2025

@dhanyarmathews Thank you for bringing this topic to the Sampling SIG today! I will be glad to help review and resolve the concerns I had, and thank you for your patience.

@csmarchbanks
Copy link
Contributor

👋 Would this be a good candidate to add as an extension instead of a core sampling strategy? Extensions were added after you opened this PR (see #42573), but have been working well for us, and could provide a good location for more experimental sampling strategies.

@Logiraptor
Copy link
Contributor

Agree with @csmarchbanks that an extension might be a good start. The demonstrated cpu increase is around 30%, but from experience I'm guessing the TSP is actually very little of the baseline cpu. Typically that is taken up by protobuf and garbage collection, so a 30% increase could actually be 2x or more on the actual tsp cpu. I'd recommend using pprof to check the cpu usage before / after instead.

@jmacd
Copy link
Contributor

jmacd commented Oct 23, 2025

@csmarchbanks or @Logiraptor would you briefly explain what it means for us to create these sampling policies as extensions? Do you mean to create extensions registered somewhere within the processor/tailsamplingprocessor? How would this be built and deployed for a user in that form?

@csmarchbanks
Copy link
Contributor

csmarchbanks commented Oct 23, 2025

They would be registered as normal extensions, so instead of added to processor/tailsamplingprocessor the new sampler would be added to extension/tailsampling/stratified. I believe this would then need to be added to manifest.yaml to include the extension in the otelcol-contrib distribution, at which point a user could elect to use it like any other extension:

processors:
  tail_sampling:
    policies:
      - name: stratified
        type: stratified
        stratified:
          sampling_percentage: <desired_sampling_percentage>
service:
  extensions: [stratified]

I am fairly new to the collector, so please let me know if any of that is incorrect, but that is what I want to support with extensions!

@jmacd
Copy link
Contributor

jmacd commented Oct 23, 2025

Very cool! I want to know more about this process. @csmarchbanks would you be willing to help document this as a way to help @dhanyarmathews in this effort? 😁

@csmarchbanks
Copy link
Contributor

Yep, gladly. I can look at creating an example next week as well.

@dhanyarmathews
Copy link
Author

Thank you @csmarchbanks , @Logiraptor , @jmacd . I would like to learn more about the process. As I’m relatively new to this repository, any assistance or pointers would be very helpful. Additionally, are there any similar extensions we could review as a starting point?

@csmarchbanks
Copy link
Contributor

Good timing :). I literally just opened #43972 to demonstrate how to create an extension for the tail sampling processor. Happy to help out if you have any questions!

@dhanyarmathews
Copy link
Author

Thank you @csmarchbanks. I’ll review the material and follow up with any questions I may have.

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 17, 2025
@dhanyarmathews
Copy link
Author

dhanyarmathews commented Nov 19, 2025

@csmarchbanks , @jmacd , I am refactoring the sampling policy to be an extension. A major problem that I am facing is that I am not able to register this extension as a sampling policy. While deploying the collector in a Kubernetes setup, I am getting the following error (from logs):
"collector server run finished with error: cannot start pipelines: failed to start "tail_sampling" processor: failed to create policy evaluator for "stratified": unknown sampling policy type stratified_probabilistic_sampler"

From logs, I can see that the extension is getting started but the policy is not made effective.

The config yaml snippet that I have is:
extensions:
health_check: {}
stratified_probabilistic_sampler:
sampling_percentage: 10

processors:
tail_sampling:
decision_wait: 5s
num_traces: 50000
expected_new_traces_per_sec: 140
policies:
- name: stratified
type: stratified_probabilistic_sampler
stratified:
sampling_percentage: 10

service:
extensions: [health_check, stratified_probabilistic_sampler]

Any help on how to correctly register the extension as a sampling policy an make it work alongside other sampling policies in tail sampling processor is greatly appreciated. Thanks!

@csmarchbanks
Copy link
Contributor

At a glance that config looks reasonable, and if you are seeing the extension start then I would recommend adding a bit of debug logging around here to see if the extension is not being added to the tail sampling processor correct. My initial guess is that it either isn't present in the list for some reason, or the cast to the interface isn't working.

@github-actions github-actions bot removed the Stale label Nov 20, 2025
@github-actions
Copy link
Contributor

github-actions bot commented Dec 4, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 4, 2025
@dhanyarmathews
Copy link
Author

Changes are being made and the testing is in progress. Will re-submit the changes for review asap.

@github-actions github-actions bot removed the Stale label Dec 16, 2025
@crobert-1 crobert-1 marked this pull request as draft December 18, 2025 00:12
@crobert-1
Copy link
Member

Moved to draft while things are being tested. Please mark as ready for review once testing is complete and conflicts are resolved.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 1, 2026

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 1, 2026
@github-actions
Copy link
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants