Description
This is a meta-issue on tail-based sampling.
Tail-based sampling comes up frequently in bug reports, as there is minimal documentation and guidance on TBS configuration. It is not clear to users how TBS works, which leads to misconfigured TBS storage sizes and, consequently, apm-server and ES issues.
When TBS local storage (badger) fills up, writing traces fails (apm-server logs report `error writing sampled trace: configured storage limit reached (current: 127210377485, limit: 126000000000)`) and TBS is bypassed, so the effective sampling rate jumps to 100%. This causes a performance cliff and downstream effects: a surprising, significant increase in writes to ES, which either slows ES down and creates backpressure on apm-server, or leads to unexpectedly high storage usage in ES.
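For context, here is a minimal sketch of where the storage limit sits in a standalone apm-server configuration. The keys reflect my understanding of the `apm-server.sampling.tail.*` settings; the values are illustrative, not recommendations:

```yaml
apm-server:
  sampling:
    tail:
      # Enable tail-based sampling.
      enabled: true
      # How often sampling decisions are made for trace groups.
      interval: 1m
      # Size limit of the local (badger) storage that buffers trace events
      # until a sampling decision is made. When this limit is reached, writes
      # fail and events bypass TBS (effectively 100% sampling).
      storage_limit: 3GB
      policies:
        # Catch-all policy: keep 10% of trace groups not matched by any other policy.
        - sample_rate: 0.1
```

If the storage limit is sized too small for the event volume and TTL, the failure mode described above is what users hit first.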
The task list below contains tasks to document TBS properly, to investigate and fix bugs, and to provide escape hatches as compromises.
Impact: TBS is a popular feature among heavy apm-server users, who rely on it to reduce ES storage requirements while retaining the value of the sampled traces. We need to ensure, and demonstrate, that TBS handles high load well, like the rest of apm-server.
Tasks
- docs: Benchmark and document tail-based sampling performance #11346
- Configurable option to handle events failed to be processed by TBS #11127
- TBS: Expose TTL config via integration policy #13525
- TBS: apm-server never recovers from storage limit exceeded in rare cases #14923
- Update badger to latest version #11546
- Revisit default TBS storage size limit `sampling.tail.storage_limit` and storage limit handling #14933
- TBS: Document monitoring of disk space used by Tail Based Sampling in public docs. #14996
- TBS: Expired entries stay much longer than TTL and consume disk space #15121
- TBS: Explore replacing badger with pebble #15246
- TBS automatic migration #15500
- monitoring: apm-server not shipping tbs monitoring metrics #14247
- TBS: Document discard_on_write_failure + expose it to the APM Integration #15330