Conversation

@albertlockett
Contributor

An issue was identified where otel-arrow performed worse than OTLP in the TestMetricsMultipart size test. This PR resolves the issue by changing where compression is applied to the results.

otel-arrow can apply compression to telemetry batches in two places:

  • the record within the Arrow IPC stream
  • the protobuf serialized BatchArrowRecord message

This test was originally compressing just the IPC stream, but otel-arrow actually achieves a better compression ratio when the serialized proto messages are compressed.
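For illustration, a minimal sketch of the whole-message approach, assuming the github.com/klauspost/compress/zstd package; compressWholeMessage is a hypothetical helper, not part of the otel-arrow API (the alternative, record-level compression, would instead be enabled on the Arrow IPC writer):

package main

import (
	"fmt"

	"github.com/klauspost/compress/zstd"
)

// compressWholeMessage zstd-compresses an already-serialized protobuf
// payload, e.g. a BatchArrowRecord message whose embedded IPC records
// were left uncompressed. Compressing at this outer level lets zstd see
// the Arrow IPC bytes and the protobuf framing together, which is where
// the better ratio comes from.
func compressWholeMessage(serialized []byte) ([]byte, error) {
	enc, err := zstd.NewWriter(nil) // nil writer: stateless use via EncodeAll
	if err != nil {
		return nil, err
	}
	defer enc.Close()
	return enc.EncodeAll(serialized, nil), nil
}

func main() {
	// Stand-in payload; in the benchmark this would be the bytes of a
	// proto-marshaled BatchArrowRecord.
	payload := []byte("example serialized BatchArrowRecord bytes")

	compressed, err := compressWholeMessage(payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("in=%d bytes, out=%d bytes\n", len(payload), len(compressed))
}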

This PR also upgrades otel-arrow to the latest version.

Results from the original TestMetricsMultipart size test:

go test -timeout 90s -count=1 -v -run ^TestMetricsMultipart$ github.com/splunk/stef/benchmarks  
=== RUN   TestMetricsMultipart
hostandcollector-otelmetrics   Comp     Bytes Ratio
OTLP                           none  22219873 x 1.00
STEF                           none   1493558 x 14.88
STEFU                          none   1559921 x 14.24
Otel ARROW                     none  13571951 x 1.64
astronomy-otelmetrics          Comp     Bytes Ratio
OTLP                           none 145844039 x 1.00
STEF                           none  10343709 x 14.10
STEFU                          none  10793486 x 13.51
Otel ARROW                     none  92128187 x 1.58
hostandcollector-otelmetrics   Comp     Bytes Ratio
OTLP                           zstd   2675305 x 1.00
STEF                           zstd    246609 x 10.85
STEFU                          zstd    371536 x 7.20
Otel ARROW                     zstd   1316311 x 2.03
astronomy-otelmetrics          Comp     Bytes Ratio
OTLP                           zstd  19596366 x 1.00
STEF                           zstd   3381914 x 5.79
STEFU                          zstd   4171492 x 4.70
Otel ARROW                     zstd  11690516 x 1.68
--- PASS: TestMetricsMultipart (10.18s)
PASS
ok      github.com/splunk/stef/benchmarks       10.562s
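(Reading the tables: OTLP is the x 1.00 baseline, so Ratio is the OTLP byte count divided by the format's byte count. For example, for hostandcollector uncompressed, 22219873 / 1493558 ≈ 14.88 for STEF and 22219873 / 13571951 ≈ 1.64 for Otel ARROW.)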

Interpretation of the results

The fact that STEF systematically achieves better compression than OTAP is expected. The tradeoffs between OTAP and STEF are radically different. OTAP seeks to optimize across multiple dimensions: zero deserialization, low memory allocations, data processing speed (SIMD support), compression rate, and a better impedance match with modern telemetry backends, which are columnar-oriented. By contrast, STEF is optimized mainly for compression rate (the inter-data-center use case); the other dimensions mentioned above are not optimized.

We are considering a second round of optimization on the compression rate for OTAP (there are still some ideas left to explore), which should reduce the gap with STEF. However, for the reasons mentioned earlier, it is unlikely that OTAP will achieve a better compression rate than STEF, except perhaps for very large batches. These two protocols have complementary use cases.

@albertlockett albertlockett changed the title from "Albert/otel arrow bench update" to "update size test metrics stream implementation for otel-arrow" on Oct 3, 2025
@albertlockett albertlockett changed the title from "update size test metrics stream implementation for otel-arrow" to "update size test metrics stream implementation for otel-arrow encoding" on Oct 3, 2025
@tigrannajaryan
Collaborator

Thanks for the PR @albertlockett

It is good to see Otel/Arrow performing better when used correctly. I will take a more detailed look at the PR and will get back to you.

I have briefly run the tests with your changes, and one thing that is puzzling is that the zstd-compressed OTLP sizes have improved as well.

Before:

===== Encoded sizes
astronomy-otelmetrics.zst               Uncompressed           Zstd Compressed
                                     Bytes Ratio By/pt        Bytes Ratio By/pt
OTLP                             145844039  1.00 185.4      6304039  1.00   8.0

hipstershop-otelmetrics.zst             Uncompressed           Zstd Compressed
                                     Bytes Ratio By/pt        Bytes Ratio By/pt
OTLP                              21148675  1.00 316.5       549012  1.00   8.2

hostandcollector-otelmetrics.zst        Uncompressed           Zstd Compressed
                                     Bytes Ratio By/pt        Bytes Ratio By/pt
OTLP                              22219873  1.00 106.6       846035  1.00   4.1

After:

===== Encoded sizes
astronomy-otelmetrics.zst               Uncompressed           Zstd Compressed
                                     Bytes Ratio By/pt        Bytes Ratio By/pt
OTLP                             145844039  1.00 185.4      5987068  1.00   7.6

hipstershop-otelmetrics.zst             Uncompressed           Zstd Compressed
                                     Bytes Ratio By/pt        Bytes Ratio By/pt
OTLP                              21148675  1.00 316.5       469789  1.00   7.0

hostandcollector-otelmetrics.zst        Uncompressed           Zstd Compressed
                                     Bytes Ratio By/pt        Bytes Ratio By/pt
OTLP                              22219873  1.00 106.6       773929  1.00   3.7
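(In these tables, By/pt is bytes per data point. For example, astronomy-otelmetrics has roughly 145844039 / 185.4 ≈ 787k data points, so the 5987068 compressed bytes work out to ≈ 7.6 By/pt.)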

I may need to split this PR into two parts: 1) update the otlp/pdata dependencies, to isolate and understand their impact first, and 2) apply the changes you made for Otel Arrow.

I don't think anything is wrong with your change, but I want to do it just to have a clearer understanding of what's going on.

@tigrannajaryan
Collaborator

I have isolated the changes in OTLP compressed size to pdata dependency change from v1.38 to v1.39 where they introduced new marshalers which seem to produce more zstd-friendly payloads. This is unrelated to your change, so nothing to worry about.

Collaborator

@tigrannajaryan tigrannajaryan left a comment


Blocking temporarily since I see performance degradation in seemingly unrelated parts. Can be another effect of newer dependencies. I will need to look into it.

pkg: github.com/splunk/stef/benchmarks
cpu: Apple M2 Pro
                                 │ bench_base.txt │         bench_current.txt          │
                                 │     sec/op     │   sec/op     vs base               │
DeserializeNative/STEF/deser-10       1.412m ± 0%   1.548m ± 0%  +9.63% (p=0.000 n=15)
DeserializeNative/STEFU/deser-10      4.206m ± 0%   4.330m ± 0%  +2.95% (p=0.000 n=15)
geomean                               2.437m        2.589m       +6.24%

                                 │ bench_base.txt │         bench_current.txt          │
                                 │   sec/point    │  sec/point   vs base               │
DeserializeNative/STEF/deser-10       21.12n ± 0%   23.15n ± 0%  +9.61% (p=0.000 n=15)
DeserializeNative/STEFU/deser-10      62.91n ± 0%   64.77n ± 0%  +2.96% (p=0.000 n=15)
geomean                               36.45n        38.72n       +6.23%

                                 │ bench_base.txt │          bench_current.txt          │
                                 │      B/op      │     B/op      vs base               │
DeserializeNative/STEF/deser-10      934.4Ki ± 0%   934.2Ki ± 0%  -0.03% (p=0.000 n=15)
DeserializeNative/STEFU/deser-10     1.471Mi ± 0%   1.470Mi ± 0%  -0.02% (p=0.000 n=15)
geomean                              1.158Mi        1.158Mi       -0.02%

                                 │ bench_base.txt │          bench_current.txt          │
                                 │   allocs/op    │ allocs/op   vs base                 │
DeserializeNative/STEF/deser-10        465.0 ± 0%   465.0 ± 0%       ~ (p=1.000 n=15) ¹
DeserializeNative/STEFU/deser-10       469.0 ± 0%   469.0 ± 0%       ~ (p=1.000 n=15) ¹
geomean                                467.0        467.0       +0.00%
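The tables above are in benchstat's comparison format. A plausible way to reproduce such a comparison (file names taken from the table headers; the exact test flags are an assumption):

go test -run=^$ -bench=DeserializeNative -count=15 ./... > bench_current.txt
benchstat bench_base.txt bench_current.txt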

@tigrannajaryan
Collaborator

Blocking temporarily since I see performance degradation in seemingly unrelated parts. Can be another effect of newer dependencies. I will need to look into it.

OK, this is a non-issue. It was because this PR was branched from an older commit and main has been improved since. Rebasing fixed the performance.

@tigrannajaryan tigrannajaryan force-pushed the albert/otel-arrow-bench-update branch from bedc334 to dea89f8 on October 3, 2025 21:43
@tigrannajaryan tigrannajaryan dismissed their stale review October 3, 2025 21:44

Perf degradation fixed by rebasing.

@tigrannajaryan
Collaborator

I am also posting the size improvements for Otel/Arrow for posterity.

Before this change

Otel Arrow is larger than OTLP:

[image: encoded size comparison before the change]

After this change

Otel Arrow is smaller than OTLP:

[image: encoded size comparison after the change]

@tigrannajaryan
Collaborator

I rebased, fixed go.mod and re-created benchmarks.html. Everything builds and runs correctly.

Collaborator

@tigrannajaryan tigrannajaryan left a comment


Thank you for the PR @albertlockett

LGTM.

@tigrannajaryan tigrannajaryan changed the title from "update size test metrics stream implementation for otel-arrow encoding" to "Update size test metrics stream implementation for otel-arrow encoding" on Oct 3, 2025
@tigrannajaryan tigrannajaryan merged commit 9e55ec4 into splunk:main Oct 3, 2025
10 checks passed