Skip to content

Internal metrics telemetry pipeline and SDK design#2623

Draft
jmacd wants to merge 14 commits intoopen-telemetry:mainfrom
jmacd:jmacd/metrics_sdk_design
Draft

Internal metrics telemetry pipeline and SDK design#2623
jmacd wants to merge 14 commits intoopen-telemetry:mainfrom
jmacd:jmacd/metrics_sdk_design

Conversation

@jmacd
Copy link
Copy Markdown
Contributor

@jmacd jmacd commented Apr 10, 2026

Change Summary

Part of #1950; a design only.

Part of #2405

Fixes #1378

Part of #2411

Part of #2507

Proposes an OTAP-direct metrics SDK in four phases.

@github-actions github-actions Bot added the rust Pull requests that update Rust code label Apr 10, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.16%. Comparing base (495588e) to head (2239a14).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2623      +/-   ##
==========================================
- Coverage   88.17%   88.16%   -0.01%     
==========================================
  Files         633      633              
  Lines      238646   238646              
==========================================
- Hits       210428   210410      -18     
- Misses      27694    27712      +18     
  Partials      524      524              
Components Coverage Δ
otap-dataflow 89.85% <ø> (-0.01%) ⬇️
query_abstraction 80.61% <ø> (ø)
query_engine 90.74% <ø> (ø)
otel-arrow-go 52.45% <ø> (ø)
quiver 92.27% <ø> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@jmacd jmacd marked this pull request as ready for review April 10, 2026 18:59
@jmacd jmacd requested a review from a team as a code owner April 10, 2026 18:59
Copy link
Copy Markdown
Contributor

@lquerel lquerel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't understood everything yet. I'll probably need to have a discussion with you to better understand the details of your plan :-)

What I had in mind was something like:

  • Support for bounded dynamic attributes (e.g. outcome, signal type) in the current metric set system
  • Converting the TelemetryRegistry into an internal metric receiver
  • Removing the OpenTelemetry SDK
  • Creating the semantic conventions
  • Replacing the code generated by the metric_set macro with code generated by Weaver

Comment on lines +122 to +128
x-otap-levels:
basic:
disabled: true
normal:
dimensions: [outcome]
detailed:
dimensions: [outcome, signal]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won’t be recognized by Weaver as is. For this kind of thing, I think the best approach is to use the annotation mechanism.

groups:
  - id: metric.consumer.items
    type: metric
    metric_name: consumer.items
    instrument: counter
    unit: "{item}"
    brief: "Items consumed by a node."
    attributes:
      - ref: outcome
        requirement_level: recommended
      - ref: signal
        requirement_level: optional
    annotations:
      x-otap-levels:
        basic:
          disabled: true
        normal:
          dimensions: [outcome]
        detailed:
          dimensions: [outcome, signal]

Comment on lines +75 to +91
```rust
/// Consumed items by outcome after
#[metric_set(name = "otap.consumer")]
#[derive(Debug, Default, Clone)]
pub struct ConsumerMetrics {
#[metric(unit = "{item}")]
consumed_success: Counter<u64>,

#[metric(unit = "{item}")]
consumed_failed: Counter<u64>,

#[metric(unit = "{item}")]
consumed_refused: Counter<u64>,

// ... and more (e.g., duration)
}
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To cover all aspects of metric sets, I suggest adding another typical kind of metric set like the one below (in complement). In this example, the metrics in the set cannot only be distinguished by one or two dimensions such as outcome and signal type.

/// Engine-wide metrics emitted once per engine instance.
#[metric_set(name = "engine.metrics")]
#[derive(Debug, Default, Clone)]
pub struct EngineMetrics {
    /// Process-wide Resident Set Size — physical RAM currently used by the process.
    /// Matches what external tools report (e.g. `kubectl top pod`, `htop`, `ps rss`).
    #[metric(unit = "{By}")]
    pub memory_rss: ObserveUpDownCounter<u64>,

    /// Process-wide CPU utilization as a ratio in [0, 1], normalized across all
    /// logical CPU cores on the system (not just engine-assigned cores).
    /// Aligned with the OTel semantic convention `process.cpu.utilization`.
    ///
    /// The `cpu.mode` attribute is not set; this reports combined user + system time.
    #[metric(unit = "{1}")]
    pub cpu_utilization: Gauge<f64>,

    /// Process-wide memory limiter state encoded as `0=normal`, `1=soft`, `2=hard`.
    #[metric(unit = "{state}")]
    pub memory_pressure_state: Gauge<u64>,

    /// Most recent process-wide memory limiter sample, in bytes.
    #[metric(unit = "{By}")]
    pub process_memory_usage_bytes: Gauge<u64>,

    /// Effective process-wide memory limiter soft limit, in bytes.
    #[metric(unit = "{By}")]
    pub process_memory_soft_limit_bytes: Gauge<u64>,

    /// Effective process-wide memory limiter hard limit, in bytes.
    #[metric(unit = "{By}")]
    pub process_memory_hard_limit_bytes: Gauge<u64>,
}

Comment on lines +140 to +153
#[derive(Debug, Default, Clone)]
pub struct ConsumerMetrics {
consumed_items: ConsumedItemsByOutcomeAndSignal,

// ... and more (e.g., duration)
}

#[derive(Debug, Default, Clone)]
pub enum ConsumedItemsByOutcomeAndSignal {
#[default]
Basic, // disabled
Normal(Box<[Counter; 3]>), // 1 dimension: outcome
Detailed(Box<[Counter; 9]>), // 2 dimensions: outcome x signal
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need to see the code as a whole to understand.

What I'm wondering is: why don't we have more or less the same code as before, except that instead of being generated by a macro, it would be generated by Weaver from the semantic conventions?

Comment on lines +167 to +173
Note the use of `Box<_>` in the enum variant ensures that metric level
actually controls how much memory is used by instrumentation.

In another file, we will define the translation from the `v1` to `v0`
schema, which we will eventually deprecate. Separately, we will
define how to generate either the `v1` or the `v0` schema from the
current instrumentation.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m confused, but probably I just don't have the full picture yet.

Comment on lines +295 to +298
scopes:
metrics:
metrics.otap.consumer:
schema_url: otap-dataflow/consumer@v1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. Are we supposed to list all the metric sets in the configuration? That seems strange to me.

Comment on lines +406 to +409
With all `#[metric_set]` definitions replaced by generated code, we
remove the OpenTelemetry SDK from the OTAP Dataflow dependencies.
Users configure either the Builtin Prometheus support or an ITS
internal telemetry pipeline (e.g., batch processor, OTLP exporter).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying this is necessarily a better plan, but note that removing the OpenTelemetry SDK could be done as early as phase 1, even before introducing semantic conventions and code generation with Weaver.

@jmacd jmacd marked this pull request as draft April 14, 2026 15:30
@jmacd
Copy link
Copy Markdown
Contributor Author

jmacd commented Apr 14, 2026

I will build more prototype as I incorporate the feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rust Pull requests that update Rust code

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Internal telemetry metrics dispatcher

2 participants