Internal metrics telemetry pipeline and SDK design #2623

jmacd wants to merge 14 commits into open-telemetry:main
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #2623      +/-   ##
==========================================
- Coverage   88.17%   88.16%   -0.01%
==========================================
  Files         633      633
  Lines      238646   238646
==========================================
- Hits       210428   210410      -18
- Misses      27694    27712      +18
  Partials      524      524
```
I haven't understood everything yet. I'll probably need to have a discussion with you to better understand the details of your plan :-)
What I had in mind was something like:
- Support for bounded dynamic attributes (e.g. outcome, signal type) in the current metric set system
- Converting the `TelemetryRegistry` into an internal metric receiver
- Removing the OpenTelemetry SDK
- Creating the semantic conventions
- Replacing the code generated by the `metric_set` macro with code generated by Weaver
```yaml
x-otap-levels:
  basic:
    disabled: true
  normal:
    dimensions: [outcome]
  detailed:
    dimensions: [outcome, signal]
```
This won’t be recognized by Weaver as is. For this kind of thing, I think the best approach is to use the annotation mechanism.
```yaml
groups:
  - id: metric.consumer.items
    type: metric
    metric_name: consumer.items
    instrument: counter
    unit: "{item}"
    brief: "Items consumed by a node."
    attributes:
      - ref: outcome
        requirement_level: recommended
      - ref: signal
        requirement_level: optional
    annotations:
      x-otap-levels:
        basic:
          disabled: true
        normal:
          dimensions: [outcome]
        detailed:
          dimensions: [outcome, signal]
```

```rust
/// Consumed items by outcome after
#[metric_set(name = "otap.consumer")]
#[derive(Debug, Default, Clone)]
pub struct ConsumerMetrics {
    #[metric(unit = "{item}")]
    consumed_success: Counter<u64>,

    #[metric(unit = "{item}")]
    consumed_failed: Counter<u64>,

    #[metric(unit = "{item}")]
    consumed_refused: Counter<u64>,

    // ... and more (e.g., duration)
}
```
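To make the shape of a metric set concrete, here is a hedged usage sketch. The `Counter` type below is a stand-in with an assumed `add`/`get` API, not the project's real instrument type, and the macro attributes are omitted since the stand-in is plain Rust.

```rust
use std::cell::Cell;

/// Stand-in for the project's real Counter type (an assumption for this sketch).
#[derive(Debug, Default, Clone)]
struct Counter(Cell<u64>);

impl Counter {
    fn add(&self, n: u64) {
        self.0.set(self.0.get() + n);
    }
    fn get(&self) -> u64 {
        self.0.get()
    }
}

/// Mirrors the ConsumerMetrics struct from the PR, minus the macro attributes.
#[derive(Debug, Default, Clone)]
struct ConsumerMetrics {
    consumed_success: Counter,
    consumed_failed: Counter,
    consumed_refused: Counter,
}

fn main() {
    let m = ConsumerMetrics::default();
    m.consumed_success.add(10);
    m.consumed_failed.add(1);
    // The refused counter stays at its default of zero.
    println!(
        "success={} failed={} refused={}",
        m.consumed_success.get(),
        m.consumed_failed.get(),
        m.consumed_refused.get()
    );
    // prints: success=10 failed=1 refused=0
}
```

The point of the sketch is that each (metric, dimension value) pair is a separate struct field, so there is no attribute lookup on the hot path.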
To cover all aspects of metric sets, I suggest adding (as a complement) another typical kind of metric set, like the one below. In this example, the metrics in the set cannot be distinguished by only one or two dimensions such as outcome and signal type.
```rust
/// Engine-wide metrics emitted once per engine instance.
#[metric_set(name = "engine.metrics")]
#[derive(Debug, Default, Clone)]
pub struct EngineMetrics {
    /// Process-wide Resident Set Size — physical RAM currently used by the process.
    /// Matches what external tools report (e.g. `kubectl top pod`, `htop`, `ps rss`).
    #[metric(unit = "By")]
    pub memory_rss: ObserveUpDownCounter<u64>,

    /// Process-wide CPU utilization as a ratio in [0, 1], normalized across all
    /// logical CPU cores on the system (not just engine-assigned cores).
    /// Aligned with the OTel semantic convention `process.cpu.utilization`.
    ///
    /// The `cpu.mode` attribute is not set; this reports combined user + system time.
    #[metric(unit = "1")]
    pub cpu_utilization: Gauge<f64>,

    /// Process-wide memory limiter state encoded as `0=normal`, `1=soft`, `2=hard`.
    #[metric(unit = "{state}")]
    pub memory_pressure_state: Gauge<u64>,

    /// Most recent process-wide memory limiter sample, in bytes.
    #[metric(unit = "By")]
    pub process_memory_usage_bytes: Gauge<u64>,

    /// Effective process-wide memory limiter soft limit, in bytes.
    #[metric(unit = "By")]
    pub process_memory_soft_limit_bytes: Gauge<u64>,

    /// Effective process-wide memory limiter hard limit, in bytes.
    #[metric(unit = "By")]
    pub process_memory_hard_limit_bytes: Gauge<u64>,
}
```

```rust
#[derive(Debug, Default, Clone)]
pub struct ConsumerMetrics {
    consumed_items: ConsumedItemsByOutcomeAndSignal,

    // ... and more (e.g., duration)
}

#[derive(Debug, Default, Clone)]
pub enum ConsumedItemsByOutcomeAndSignal {
    #[default]
    Basic,                       // disabled
    Normal(Box<[Counter; 3]>),   // 1 dimension: outcome
    Detailed(Box<[Counter; 9]>), // 2 dimensions: outcome x signal
}
```
I think I need to see the code as a whole to understand.
What I'm wondering is: why don't we have more or less the same code as before, except that instead of being generated by a macro, it would be generated by Weaver from the semantic conventions?
Note that the use of `Box<_>` in the enum variants ensures that the metric level actually controls how much memory the instrumentation uses.

In another file, we will define the translation from the `v1` to the `v0` schema, which we will eventually deprecate. Separately, we will define how to generate either the `v1` or the `v0` schema from the current instrumentation.
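The memory claim can be checked with a small sketch. This is not code from the PR; `Counter` here is a stand-in type, and the enum mirrors the `ConsumedItemsByOutcomeAndSignal` shape above.

```rust
#![allow(dead_code)]

/// Stand-in for the real Counter type (assumption for this sketch).
struct Counter(u64);

#[derive(Default)]
enum ConsumedItems {
    #[default]
    Basic,                       // disabled: no heap allocation at all
    Normal(Box<[Counter; 3]>),   // 1 dimension: outcome
    Detailed(Box<[Counter; 9]>), // 2 dimensions: outcome x signal
}

fn main() {
    // The enum itself is just a tag plus one pointer (16 bytes on a 64-bit
    // target); the counter storage is heap-allocated only when a level
    // actually enables it, so the disabled/basic level stays cheap.
    println!("enum size      = {}", std::mem::size_of::<ConsumedItems>());
    println!("detailed array = {}", std::mem::size_of::<[Counter; 9]>());
}
```

Without the `Box`, the enum would be as large as its biggest variant (the 9-counter array plus a tag) even when instrumentation is disabled.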
I’m confused, but probably I just don't have the full picture yet.
```yaml
scopes:
  metrics:
    metrics.otap.consumer:
      schema_url: otap-dataflow/consumer@v1
```
I don't understand. Are we supposed to list all the metric sets in the configuration? That seems strange to me.
With all `#[metric_set]` definitions replaced by generated code, we remove the OpenTelemetry SDK from the OTAP Dataflow dependencies. Users configure either the Builtin Prometheus support or an ITS internal telemetry pipeline (e.g., batch processor, OTLP exporter).
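Purely as an illustration, a user-facing configuration for these two options might look something like the sketch below. Every key name here is hypothetical; the PR does not define a configuration schema.

```yaml
# Hypothetical sketch only; none of these keys are defined by the PR.
service:
  telemetry:
    metrics:
      # Option 1: builtin Prometheus support
      prometheus:
        endpoint: "0.0.0.0:9464"
      # Option 2: an ITS internal telemetry pipeline
      # pipeline:
      #   processors: [batch]
      #   exporters: [otlp]
```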
I'm not saying this is necessarily a better plan, but note that removing the OpenTelemetry SDK could be done as early as phase 1, even before introducing semantic conventions and code generation with Weaver.
I will build more prototypes as I incorporate the feedback.
Change Summary
Part of #1950; a design only.
Part of #2405
Fixes #1378
Part of #2411
Part of #2507
Proposes an OTAP-direct metrics SDK in four phases.