Skip to content

feat: add apollo.router.operations.rhai.duration histogram metric (copy #9072)#9168

Open
mergify[bot] wants to merge 4 commits intodevfrom
mergify/copy/dev/pr-9072
Open

feat: add apollo.router.operations.rhai.duration histogram metric (copy #9072)#9168
mergify[bot] wants to merge 4 commits intodevfrom
mergify/copy/dev/pr-9072

Conversation

@mergify
Copy link
Copy Markdown
Contributor

@mergify mergify bot commented Apr 10, 2026

Closes #9072

Summary

  • Emits a new `apollo.router.operations.rhai.duration` (`f64` histogram, unit: `s`) for every Rhai script callback execution across all pipeline stages
  • Attributes: `rhai.stage` (e.g. `RouterRequest`, `SubgraphResponse`), `rhai.succeeded` (bool), `rhai.is_deferred` (bool — `true` only for `@defer` stream chunks)
  • The two commented-out `execute()` call sites in the `define request and response body access at the router service level for rhai #3642` TODO blocks are also instrumented, so deferred-chunk metrics for Router stage will activate automatically when that issue is resolved

Design notes

No separate operation count metric: `dev-docs/metrics.md` calls for an `apollo.router.operations.rhai` counter alongside the histogram, following the coprocessor pattern. This PR intentionally omits it — the histogram datapoints are directly countable per attribute combination (`rhai.stage`, `rhai.succeeded`), so a parallel counter would be redundant. Consumers can derive counts from the histogram without the cardinality overhead of a separate instrument.

`rhai.succeeded` attribute: Uses a positive boolean rather than OTel's `error.type` convention, intentionally mirroring the `coprocessor.succeeded` attribute on `apollo.router.operations.coprocessor` for symmetry.

`f64_histogram_with_unit!` vs coprocessor: The coprocessor duration metric uses the deprecated `f64_histogram!` (no unit). This PR uses `f64_histogram_with_unit!` per the guidance in `dev-docs/metrics.md`. Migrating the coprocessor metric is a separate task (would be a breaking rename in Prometheus output).

Test plan

  • `cargo xtask lint --fmt` — passes clean
  • `cargo nextest run --lib -E 'test(rhai)'` — 7 new tests:
    • `test_rhai_metric_router_request` — RouterRequest stage, succeeded=true
    • `test_rhai_metric_supergraph_request` — SupergraphRequest, succeeded=true
    • `test_rhai_metric_subgraph_request` — SubgraphRequest, succeeded=true
    • `test_rhai_metric_subgraph_response` — SubgraphResponse (map_response path), succeeded=true
    • `test_rhai_metric_deferred_response` — SupergraphResponse, both is_deferred=false (first chunk) and is_deferred=true (deferred chunk)
    • `test_rhai_metric_failed_callback` — SupergraphRequest, succeeded=false
    • `test_rhai_metric_no_callback_no_emission` — no callback registered = no metric emitted
  • `cargo nextest run --lib -E 'test(coprocessor)'` — existing coprocessor tests unaffected
  • Manual smoke test: run router with a Rhai script that registers a `supergraph_service` callback; confirm `apollo.router.operations.rhai.duration` appears in OTLP output with correct attributes

🤖 Generated with Claude Code


This is an automatic copy of pull request #9072 done by Mergify.

Emits a new `apollo.router.operations.rhai.duration` (f64 histogram,
unit: s) for every Rhai script callback execution across all pipeline
stages.

Attributes:
- `rhai.stage`: RouterRequest, RouterResponse, SupergraphRequest,
  SupergraphResponse, ExecutionRequest, ExecutionResponse,
  SubgraphRequest, SubgraphResponse
- `rhai.succeeded`: bool — false if the script threw an EvalAltResult
- `rhai.is_deferred`: bool — true for @defer stream-chunk callbacks
  (Supergraph/Execution response stages only; Router-stage deferred
  path activates automatically when #3642 is resolved)

A dedicated RhaiStage enum is defined in rhai/mod.rs rather than
reusing PipelineStep from the coprocessor protocol, avoiding semantic
coupling to coprocessor-only variants (ConnectorRequest/Response).

Uses f64_histogram_with_unit! per dev-docs/metrics.md guidance
(coprocessor uses the deprecated f64_histogram! — migration is a
separate task). The rhai.succeeded attribute follows the coprocessor
convention rather than OTel error.type, intentionally, for symmetry.

No separate operation count metric is added; the histogram datapoints
are directly countable per attribute combination.

Also documents the new metric in standard-instruments.mdx.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 12ed389)
@apollo-cla
Copy link
Copy Markdown

@mergify[bot]: Thank you for submitting a pull request! Before we can merge it, you'll need to sign the Apollo Contributor License Agreement here: https://contribute.apollographql.com/

@mergify mergify bot requested a review from a team as a code owner April 10, 2026 12:56
@apollo-librarian
Copy link
Copy Markdown
Contributor

apollo-librarian bot commented Apr 10, 2026

✅ Docs preview ready

The preview is ready to be viewed. View the preview

File Changes

0 new, 1 changed, 0 removed
* graphos/routing/(latest)/observability/router-telemetry-otel/enabling-telemetry/standard-instruments.mdx

Build ID: 36a23a14747cf435eebcb9a9
Build Logs: View logs

URL: https://www.apollographql.com/docs/deploy-preview/36a23a14747cf435eebcb9a9


⚠️ AI Style Review — 2 Issues Found

Summary

The documentation has been updated to align with the style guide. In the 'structural-elements' section, rules now require lists to be introduced with a sentence or fragment ending in a colon. In the 'word-and-symbol-usage' section, guidelines specify the use of 'String' when defining field or argument types within API references.

Duration: 3191ms
Review Log: View detailed log

This review is AI-generated. Please use common sense when accepting these suggestions, as they may not always be accurate or appropriate for your specific context.

So we don't have to do it at every callsite.

Also, I removed the `rhai.is_deferred` property from the request side,
where it's always `false` and doesn't really make sense. There is no
such thing as deferred data on the request side.
Copy link
Copy Markdown
Member

@goto-bus-stop goto-bus-stop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might follow up with some cleanup of the macros here. They don't look necessary and they're confusing to a girl like me

- `apollo.router.operations.rhai.duration` - Time spent executing a Rhai script callback, in seconds
- `rhai.stage`: string (`RouterRequest`, `RouterResponse`, `SupergraphRequest`, `SupergraphResponse`, `ExecutionRequest`, `ExecutionResponse`, `SubgraphRequest`, `SubgraphResponse`)
- `rhai.succeeded`: bool
- `rhai.is_deferred`: bool — present on response stages. `true` for `@defer` and subscription data chunks, `false` for the primary or initial response.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be flagged as null for a response which was not part of a deferred query

let result = execute(
&$rhai_service,
$stage,
Some(ResponseChunk::Primary),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is for the deferred response, shouldn't this be ResponseChunk::Stream?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the macro name means the response can have deferred chunks, but it doesn't have to. An @defer response is made up of a primary response + streaming deferred chunks.

duration,
"rhai.stage" = stage,
"rhai.succeeded" = succeeded,
"rhai.is_deferred" = is_deferred
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious what the utility of the rhai.is_deferred attribute is; I wouldn't expect a significant difference and am not sure what you'd do if it were different?

Either way, if you start seeing an increase in duration, I'd think you'd have to dig through spans to find the outliers.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair question, and we don't have it for the coprocessor metrics. @theJC do you miss the absence of .is_deferred on coprocessors?

Copy link
Copy Markdown
Contributor

@theJC theJC Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet 😄 - only a couple of our clients have started to use @defer and we have not had cases where we needed to diagnose and distinguish and have metrics specifically surrounding those yet.

It wouldn't break my heart y'all wanted to simplify and eject that complexity/concern at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants