Internal metrics telemetry pipeline and SDK design #2623

jmacd wants to merge 14 commits into open-telemetry:main
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #2623      +/-   ##
==========================================
- Coverage   88.17%   88.16%   -0.01%
==========================================
  Files         633      633
  Lines      238646   238646
==========================================
- Hits       210428   210410      -18
- Misses      27694    27712      +18
  Partials      524      524
```
I haven't understood everything yet. I'll probably need to have a discussion with you to better understand the details of your plan :-)
What I had in mind was something like:
- Support for bounded dynamic attributes (e.g. outcome, signal type) in the current metric set system
- Converting the `TelemetryRegistry` into an internal metric receiver
- Removing the OpenTelemetry SDK
- Creating the semantic conventions
- Replacing the code generated by the `metric_set` macro with code generated by Weaver
```yaml
x-otap-levels:
  basic:
    disabled: true
  normal:
    dimensions: [outcome]
  detailed:
    dimensions: [outcome, signal]
```
This won’t be recognized by Weaver as is. For this kind of thing, I think the best approach is to use the annotation mechanism.
```yaml
groups:
  - id: metric.consumer.items
    type: metric
    metric_name: consumer.items
    instrument: counter
    unit: "{item}"
    brief: "Items consumed by a node."
    attributes:
      - ref: outcome
        requirement_level: recommended
      - ref: signal
        requirement_level: optional
    annotations:
      x-otap-levels:
        basic:
          disabled: true
        normal:
          dimensions: [outcome]
        detailed:
          dimensions: [outcome, signal]
```

```rust
/// Consumed items by outcome after
#[metric_set(name = "otap.consumer")]
#[derive(Debug, Default, Clone)]
pub struct ConsumerMetrics {
    #[metric(unit = "{item}")]
    consumed_success: Counter<u64>,

    #[metric(unit = "{item}")]
    consumed_failed: Counter<u64>,

    #[metric(unit = "{item}")]
    consumed_refused: Counter<u64>,

    // ... and more (e.g., duration)
}
```
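To make the shape of a metric set concrete, here is a hedged usage sketch. The `Counter` type below is a stand-in with an assumed `add`/`get` API, not the project's real instrument type, and the macro attributes are omitted since the stand-in is plain Rust.

```rust
use std::cell::Cell;

/// Stand-in for the project's real Counter type (an assumption for this sketch).
#[derive(Debug, Default, Clone)]
struct Counter(Cell<u64>);

impl Counter {
    fn add(&self, n: u64) {
        self.0.set(self.0.get() + n);
    }
    fn get(&self) -> u64 {
        self.0.get()
    }
}

/// Mirrors the ConsumerMetrics struct from the PR, minus the macro attributes.
#[derive(Debug, Default, Clone)]
struct ConsumerMetrics {
    consumed_success: Counter,
    consumed_failed: Counter,
    consumed_refused: Counter,
}

fn main() {
    let m = ConsumerMetrics::default();
    m.consumed_success.add(10);
    m.consumed_failed.add(1);
    // The refused counter stays at its default of zero.
    println!(
        "success={} failed={} refused={}",
        m.consumed_success.get(),
        m.consumed_failed.get(),
        m.consumed_refused.get()
    );
    // prints: success=10 failed=1 refused=0
}
```

The point of the sketch is that each (metric, dimension value) pair is a separate struct field, so there is no attribute lookup on the hot path.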
To cover all aspects of metric sets, I suggest adding (as a complement) another typical kind of metric set, like the one below. In this example, the metrics in the set cannot be distinguished by only one or two dimensions such as outcome and signal type.
```rust
/// Engine-wide metrics emitted once per engine instance.
#[metric_set(name = "engine.metrics")]
#[derive(Debug, Default, Clone)]
pub struct EngineMetrics {
    /// Process-wide Resident Set Size — physical RAM currently used by the process.
    /// Matches what external tools report (e.g. `kubectl top pod`, `htop`, `ps rss`).
    #[metric(unit = "By")]
    pub memory_rss: ObserveUpDownCounter<u64>,

    /// Process-wide CPU utilization as a ratio in [0, 1], normalized across all
    /// logical CPU cores on the system (not just engine-assigned cores).
    /// Aligned with the OTel semantic convention `process.cpu.utilization`.
    ///
    /// The `cpu.mode` attribute is not set; this reports combined user + system time.
    #[metric(unit = "1")]
    pub cpu_utilization: Gauge<f64>,

    /// Process-wide memory limiter state encoded as `0=normal`, `1=soft`, `2=hard`.
    #[metric(unit = "{state}")]
    pub memory_pressure_state: Gauge<u64>,

    /// Most recent process-wide memory limiter sample, in bytes.
    #[metric(unit = "By")]
    pub process_memory_usage_bytes: Gauge<u64>,

    /// Effective process-wide memory limiter soft limit, in bytes.
    #[metric(unit = "By")]
    pub process_memory_soft_limit_bytes: Gauge<u64>,

    /// Effective process-wide memory limiter hard limit, in bytes.
    #[metric(unit = "By")]
    pub process_memory_hard_limit_bytes: Gauge<u64>,
}
```

```rust
#[derive(Debug, Default, Clone)]
pub struct ConsumerMetrics {
    consumed_items: ConsumedItemsByOutcomeAndSignal,

    // ... and more (e.g., duration)
}

#[derive(Debug, Default, Clone)]
pub enum ConsumedItemsByOutcomeAndSignal {
    #[default]
    Basic,                       // disabled
    Normal(Box<[Counter; 3]>),   // 1 dimension: outcome
    Detailed(Box<[Counter; 9]>), // 2 dimensions: outcome x signal
}
```
I think I need to see the code as a whole to understand.
What I'm wondering is: why don't we have more or less the same code as before, except that instead of being generated by a macro, it would be generated by Weaver from the semantic conventions?
Note that the use of `Box<_>` in the enum variants ensures that the metric level actually controls how much memory the instrumentation uses.

In another file, we will define the translation from the `v1` to the `v0` schema, which we will eventually deprecate. Separately, we will define how to generate either the `v1` or the `v0` schema from the current instrumentation.
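The memory claim can be checked with a small sketch. This is not code from the PR; `Counter` here is a stand-in type, and the enum mirrors the `ConsumedItemsByOutcomeAndSignal` shape above.

```rust
#![allow(dead_code)]

/// Stand-in for the real Counter type (assumption for this sketch).
struct Counter(u64);

#[derive(Default)]
enum ConsumedItems {
    #[default]
    Basic,                       // disabled: no heap allocation at all
    Normal(Box<[Counter; 3]>),   // 1 dimension: outcome
    Detailed(Box<[Counter; 9]>), // 2 dimensions: outcome x signal
}

fn main() {
    // The enum itself is just a tag plus one pointer (16 bytes on a 64-bit
    // target); the counter storage is heap-allocated only when a level
    // actually enables it, so the disabled/basic level stays cheap.
    println!("enum size      = {}", std::mem::size_of::<ConsumedItems>());
    println!("detailed array = {}", std::mem::size_of::<[Counter; 9]>());
}
```

Without the `Box`, the enum would be as large as its biggest variant (the 9-counter array plus a tag) even when instrumentation is disabled.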
I’m confused, but probably I just don't have the full picture yet.
```yaml
scopes:
  metrics:
    metrics.otap.consumer:
      schema_url: otap-dataflow/consumer@v1
```
I don't understand. Are we supposed to list all the metric sets in the configuration? That seems strange to me.
With all `#[metric_set]` definitions replaced by generated code, we remove the OpenTelemetry SDK from the OTAP Dataflow dependencies. Users configure either the Builtin Prometheus support or an ITS internal telemetry pipeline (e.g., batch processor, OTLP exporter).
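Purely as an illustration, a user-facing configuration for these two options might look something like the sketch below. Every key name here is hypothetical; the PR does not define a configuration schema.

```yaml
# Hypothetical sketch only; none of these keys are defined by the PR.
service:
  telemetry:
    metrics:
      # Option 1: builtin Prometheus support
      prometheus:
        endpoint: "0.0.0.0:9464"
      # Option 2: an ITS internal telemetry pipeline
      # pipeline:
      #   processors: [batch]
      #   exporters: [otlp]
```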
I'm not saying this is necessarily a better plan, but note that removing the OpenTelemetry SDK could be done as early as phase 1, even before introducing semantic conventions and code generation with Weaver.
I will build more prototypes as I incorporate the feedback.
Change Summary
Part of #1950; a design only.
Part of #2405
Fixes #1378
Part of #2411
Part of #2507
Proposes an OTAP-direct metrics SDK in four phases.