Stratified Sampling Policy for Tailsampling processor #41877
Conversation
Please check the CLA and mark ready for review again.

This PR was marked stale due to lack of activity. It will be closed in 14 days.
Welcome, contributor! Thank you for your contribution to opentelemetry-collector-contrib. Important reminders:
A maintainer will review your pull request soon. Thank you for helping make OpenTelemetry better!
@atoulme, the suggested changes have been made. We'd appreciate your insights or suggestions on this. Also, if there's anything further we can clarify or update in the PR to move it forward, please let us know.
Please take a look at the lint issues in CI. These are minor and can be addressed quickly. @portertech please review?
// HashSalt allows one to configure the hashing salt. This is important in scenarios where multiple layers of collectors
// have different sampling rates: if they use the same salt, all traces passing one layer may pass the other even if they
// have different sampling rates; configuring different salts avoids that.
HashSalt string `mapstructure:"hash_salt"`
Please consider using the pkg/sampling support in this repository instead of a hash-based approach. OpenTelemetry systems are expected to observe the W3C TraceContext Level 2 specification, which means there are 56 bits of randomness available in one of two ways implemented by that library. We do not encourage hash-based sampling; see the approach we've taken to upgrade the probabilisticsampler processor, which is also the subject of this (current) blog post draft: open-telemetry/opentelemetry.io#7735.

Moreover, there are other probability samplers in this component's configuration: I would expect them all to use the same approach, whatever it is, and would prefer to keep this code as simple as possible.
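For context, threshold sampling over the 56 random bits of a W3C-random trace ID works roughly as sketched below. This is a stdlib-only illustration of the mechanism; the real helpers live in pkg/sampling, and the function names here are assumptions made up for the sketch, not that library's API.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// W3C TraceContext Level 2 treats the low 7 bytes (56 bits) of the trace ID
// as random. Threshold sampling keeps a trace when its randomness value is
// at or above a rejection threshold derived from the sampling probability,
// so no hash or salt is needed and decisions are consistent across layers.

const maxRandomness = uint64(1) << 56

// probabilityToThreshold maps p in (0, 1] to a 56-bit rejection threshold.
func probabilityToThreshold(p float64) uint64 {
	return maxRandomness - uint64(p*float64(maxRandomness))
}

// randomnessFromTraceID extracts the low 56 bits of a 16-byte trace ID.
func randomnessFromTraceID(id [16]byte) uint64 {
	return binary.BigEndian.Uint64(id[8:]) & (maxRandomness - 1)
}

func shouldSample(p float64, id [16]byte) bool {
	return randomnessFromTraceID(id) >= probabilityToThreshold(p)
}

func main() {
	var id [16]byte
	for i := 9; i < 16; i++ {
		id[i] = 0xff // low 56 bits all ones: passes any p > 0
	}
	fmt.Println(shouldSample(0.25, id)) // true
	var zero [16]byte
	fmt.Println(shouldSample(0.25, zero)) // false: randomness 0 is below the threshold
}
```

Because every layer computes the same randomness from the same trace ID, two collectors with different probabilities nest correctly: the lower-probability threshold simply rejects a superset of what the higher one rejects.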
// StratifiedProbabilisticCfg holds the configurable settings to create a stratified probabilistic
// sampling policy evaluator.
type StratifiedProbabilisticCfg struct {
From the description I read, I am not sure the term "Stratified" has quite been earned, though from the PR description of ("at least once") sampling, there's something useful here. Note that the composite sampling policy of this component is similar to what you're proposing, too, except (IIUC) you're adding an at-least-once fallback instead of a default-bucket approach.

If, as I take it, what you're trying to achieve is not based specifically on this at-least-once principle, but instead you are aiming just to achieve good coverage across all values in a key-space, then I support it, but it leaves me with questions for this configuration struct. I would imagine wanting a rate-limited sampler that tries to achieve balance, which means estimating the most-frequent values in the set and assigning (somehow) the percentage to use for the remaining bunch. (This is what the composite sampler policy in this component does.)

From looking into this problem, I believe the best answer would be to configure only a rate limit and nothing else: let the component figure out what sampling probabilities to use for which strata, and also let the component control the relative weight of the "other bunch", which is to say how much weight of the distribution falls into the default bucket vs. how much is explicitly managed with a fixed-size lookup table used to calculate the probability that will achieve the intended rate.
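One way to read the "configure only a rate limit" suggestion is sketched below. All names and numbers here are illustrative assumptions, not anything in the component: each stratum gets an equal share of the budget, strata whose whole traffic fits in a share are kept at probability 1, and the leftover budget is split across the high-rate strata.

```go
package main

import "fmt"

// balanceProbabilities derives per-stratum sampling probabilities from a
// single total budget (sampled items/sec), given estimated per-stratum
// arrival rates (items/sec). Low-rate strata are fully sampled; the budget
// they leave unused is redistributed evenly across the remaining strata.
// A real implementation would estimate rates online (e.g. a fixed-size
// top-k table plus a default bucket) and recompute periodically.
func balanceProbabilities(rates map[string]float64, limit float64) map[string]float64 {
	probs := make(map[string]float64, len(rates))
	remaining := limit
	unsaturated := len(rates)
	// First pass: strata whose entire traffic fits in an equal share keep p=1.
	for name, r := range rates {
		if r <= limit/float64(len(rates)) {
			probs[name] = 1
			remaining -= r
			unsaturated--
		}
	}
	// Second pass: split the leftover budget evenly across high-rate strata.
	for name, r := range rates {
		if _, done := probs[name]; !done {
			probs[name] = remaining / float64(unsaturated) / r
		}
	}
	return probs
}

func main() {
	// 62 sampled spans/sec budget across three strata of very different rates.
	rates := map[string]float64{"checkout": 1000, "search": 100, "health": 2}
	fmt.Println(balanceProbabilities(rates, 62))
}
```

With these numbers the low-rate stratum is kept whole (p=1 for the 2/sec stratum), and each high-rate stratum contributes 30 sampled spans/sec, so the expected total matches the 62/sec budget.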
See also https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/probabilisticsamplerprocessor/README.md, which describes two modes of sampling that do not use any "salt".
@jmacd Thank you for your response! We'll explore the pkg/sampling package for possible alternatives to hash-based sampling. We couldn't find an existing policy that allows sampling traces based on specific use-case semantics. That's why we've been exploring a stratified probabilistic sampling approach, which ensures at least one trace is sampled per distinct use case. This method provides a more accurate reflection of how the application is actually used, and significantly improves the usability of the sampled trace set for downstream analysis. If we rely solely on automated components to determine sampling probabilities (e.g., purely based on frequency or volume), the resulting trace set often lacks the diversity and completeness required for effective observability and diagnostics. We'd appreciate any suggestions or guidance you may have on achieving such sampling.

@jmacd, just a gentle reminder to let us know your suggestions.
@dhanyarmathews Thank you for bringing this topic to the Sampling SIG today! I will be glad to help review and resolve the concerns I had. Thank you for your patience.
👋 Would this be a good candidate to add as an extension instead of a core sampling strategy? Extensions were added after you opened this PR (see #42573), but they have been working well for us, and could provide a good location for more experimental sampling strategies.
Agree with @csmarchbanks that an extension might be a good start. The demonstrated CPU increase is around 30%, but from experience I'm guessing the TSP is actually very little of the baseline CPU. Typically that is taken up by protobuf and garbage collection, so a 30% overall increase could actually be 2x or more on the actual TSP CPU. I'd recommend using pprof to check the CPU usage before and after instead.
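For the before/after comparison, one common setup is to enable the collector's pprof extension (assuming the build includes it; localhost:1777 is that extension's documented default endpoint):

```yaml
extensions:
  pprof:
    endpoint: localhost:1777

service:
  extensions: [pprof]
```

A 30-second CPU profile can then be captured with `go tool pprof -seconds 30 http://localhost:1777/debug/pprof/profile`, once with the new policy disabled and once enabled, and the two profiles compared to isolate the TSP's own cost from protobuf and GC overhead.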
@csmarchbanks or @Logiraptor, would you briefly explain what it means for us to create these sampling policies as extensions? Do you mean to create extensions registered somewhere within processor/tailsamplingprocessor? How would this be built and deployed for a user in that form?
They would be registered as normal extensions rather than added to the processor's own code. I am fairly new to the collector, so please let me know if any of that is incorrect, but that is what I want to support with extensions!
Very cool! I want to know more about this process. @csmarchbanks would you be willing to help document this as a way to help @dhanyarmathews in this effort? 😁
Yep, gladly. I can look at creating an example next week as well.
Thank you @csmarchbanks, @Logiraptor, @jmacd. I would like to learn more about the process. As I'm relatively new to this repository, any assistance or pointers would be very helpful. Additionally, are there any similar extensions we could review as a starting point?
Good timing :). I literally just opened #43972 to demonstrate how to create an extension for the tail sampling processor. Happy to help out if you have any questions!
Thank you @csmarchbanks. I'll review the material and follow up with any questions I may have.
@csmarchbanks, @jmacd, I am refactoring the sampling policy to be an extension. A major problem I am facing is that I am not able to register this extension as a sampling policy: while deploying the collector in a Kubernetes setup, I am getting an error. From the logs, I can see that the extension is getting started, but the policy is not made effective. The config yaml snippet that I have covers the processors and service sections. Any help on how to correctly register the extension as a sampling policy and make it work alongside other sampling policies in the tail sampling processor is greatly appreciated. Thanks!
At a glance that config looks reasonable. If you are seeing the extension start, then I would recommend adding a bit of debug logging around here to see whether the extension is being added to the tail sampling processor correctly. My initial guess is that either it isn't present in the list for some reason, or the cast to the interface isn't working.
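The two failure modes described above can be illustrated with the sketch below. The types are simplified stand-ins (the real collector uses `component.Host.GetExtensions()` returning a map of `component.Component`, and a policy interface defined by the processor); the names here are assumptions for illustration only.

```go
package main

import "fmt"

// The tail sampling processor looks an extension up in the host's extension
// map and type-asserts it to a policy interface, so there are two distinct
// failure modes worth logging separately: the ID is missing from the map,
// or the cast fails because the extension does not implement the interface.

type extension interface{ Start() error }

type policyEvaluator interface {
	extension
	Evaluate(traceID string) (bool, error)
}

// myExtension implements extension but NOT policyEvaluator, so it starts
// fine yet the cast below fails: exactly the "started but not effective"
// symptom seen in the logs.
type myExtension struct{}

func (myExtension) Start() error { return nil }

func lookupPolicy(exts map[string]extension, id string) (policyEvaluator, error) {
	ext, found := exts[id]
	if !found {
		return nil, fmt.Errorf("extension %q is not in the host's extension list", id)
	}
	p, ok := ext.(policyEvaluator)
	if !ok {
		return nil, fmt.Errorf("extension %q does not implement the policy interface", id)
	}
	return p, nil
}

func main() {
	exts := map[string]extension{"stratified": myExtension{}}
	if _, err := lookupPolicy(exts, "stratified"); err != nil {
		fmt.Println(err) // prints: extension "stratified" does not implement the policy interface
	}
}
```

Logging which of the two errors fires narrows the bug immediately: a missing ID points at the service/extensions wiring, while a failed cast points at a mismatch between the extension's methods and the interface the processor expects.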
Changes are being made and testing is in progress. Will re-submit the changes for review ASAP.
Moved to draft while things are being tested. Please mark as ready for review once testing is complete and conflicts are resolved.
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
Description
Adding stratified sampling policy to the tailsampling processor
Link to tracking issue
Fixes #40917
Testing
The following local testing has been performed. The binary has also been deployed to a Kubernetes setup to verify the changes.
Documentation
This new sampling policy, called the stratified sampling policy, samples a new trajectory whenever it is encountered for the first time within a sampling interval. If a trajectory has already been observed within that interval, the policy will revert to a probabilistic sampling approach, where trajectories are selected based on predefined probabilities. This ensures that newly encountered trajectories are prioritized for sampling while maintaining flexibility for previously seen trajectories.
The sampling policy can be used as follows:
tail_sampling:
  decision_wait: <decision_wait>
  num_traces: <num_traces>
  expected_new_traces_per_sec: <expected_traces>
  policies:
    - name: stratifiedprob-sample
      type: stratified
      stratified:
        sampling_percentage: <desired_sampling_percentage>