fix(sdk/metric): avoid deadlock when registering instruments in callb…#8054

Open
garyatsierra wants to merge 2 commits into open-telemetry:main from
garyatsierra:fix-deadlock-when-registering-counter-in-callback

Conversation

@garyatsierra

Summary

  • Fix deadlock in pipeline.produce() that occurs when an observable callback registers a new instrument (e.g. creates a
    counter via meter.Int64Counter())
  • produce() held the pipeline mutex while executing callbacks; if a callback called addSync() (via instrument creation),
    it tried to re-acquire the same mutex → deadlock
  • Fix: snapshot callbacks under the lock, release it, execute callbacks lock-free, then re-acquire for the aggregation phase
  • Add regression test TestPipelineNoDeadlockOnInstrumentCreationDuringCallback with a 5s timeout to catch deadlocks
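The locking change and the timeout-guarded test described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: the `pipeline` type here is a hypothetical stand-in that models only the locking shape, and `produceWithin` and `runScenario` are made-up helper names. Only `addSync`, `addCallback`, and `produce` echo names from the PR.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pipeline is a hypothetical stand-in for the SDK pipeline.
// Only the locking behavior is modeled.
type pipeline struct {
	mu        sync.Mutex
	callbacks []func()
	syncs     []string // names of registered synchronous instruments
}

// addSync registers an instrument; it needs the pipeline mutex.
func (p *pipeline) addSync(name string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.syncs = append(p.syncs, name)
}

// addCallback registers an observable callback under the mutex.
func (p *pipeline) addCallback(cb func()) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.callbacks = append(p.callbacks, cb)
}

// produce snapshots the callbacks under the lock, runs them with the
// lock released, then re-acquires the lock for the aggregation phase.
func (p *pipeline) produce() int {
	p.mu.Lock()
	snapshot := make([]func(), len(p.callbacks))
	copy(snapshot, p.callbacks)
	p.mu.Unlock()

	for _, cb := range snapshot {
		cb() // may call addSync without blocking on p.mu
	}

	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.syncs) // placeholder for the real aggregation work
}

// produceWithin runs produce in a goroutine and reports whether it
// finished before the timeout, so a deadlock fails fast instead of hanging.
func produceWithin(p *pipeline, d time.Duration) (int, bool) {
	done := make(chan int, 1) // buffered so the goroutine never blocks on send
	go func() { done <- p.produce() }()
	select {
	case n := <-done:
		return n, true
	case <-time.After(d):
		return 0, false
	}
}

// runScenario registers a callback that creates an instrument, then collects.
func runScenario() (int, bool) {
	p := &pipeline{}
	p.addCallback(func() { p.addSync("threshold.exceeded") })
	return produceWithin(p, 5*time.Second)
}

func main() {
	n, ok := runScenario()
	fmt.Println(n, ok)
}
```

If `produce` instead held `p.mu` across the callback loop, `produceWithin` would report a timeout rather than a result, which is the failure mode the regression test is meant to catch.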

Details

We ran into this and were surprised that it is possible to deadlock using only SDK APIs. The test reproduces our scenario: during an observable callback we increment a counter if the measurement meets a certain threshold, and the deadlock occurs only if that counter is not already cached.

Looking at the doc we do see:

Callback functions SHOULD be reentrant safe. The SDK expects to evaluate callbacks for each MetricReader independently.

In our case we weren't acquiring an outside mutex; rather, the pipeline itself was trying to re-acquire the pipeline mutex, which caused the deadlock.

The deadlock call chain was:

produce() [holds p.Lock()]
  → callback execution
    → meter.Int64Counter()
      → cachedAggregator()
        → pipeline.addSync() [tries p.Lock()]
          → DEADLOCK
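The chain above deadlocks because Go's `sync.Mutex` is not reentrant: a second `Lock` from the goroutine that already holds the mutex blocks forever. A small sketch of that property follows. The function name `nestedLockWouldBlock` is made up for illustration, and `TryLock` (available since Go 1.18) is used in place of the inner `Lock` so the example returns instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// nestedLockWouldBlock reports whether a nested acquisition of a mutex
// would block while the same goroutine already holds it.
func nestedLockWouldBlock() bool {
	var mu sync.Mutex

	mu.Lock() // outer acquisition, analogous to produce()
	defer mu.Unlock()

	// Inner acquisition, analogous to addSync(). A plain Lock here
	// would never return; TryLock fails immediately instead.
	if mu.TryLock() {
		mu.Unlock()
		return false
	}
	return true
}

func main() {
	fmt.Println(nestedLockWouldBlock())
}
```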

Executing callbacks without the lock is safe because:

  • Callbacks operate on aggregate.Measure functions which use atomic operations internally
  • The callbacks and multiCallbacks data structures are snapshotted under the lock before release
  • Concurrent addCallback/addMultiCallback calls are still properly synchronized
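The three points above can be illustrated with a small sketch. It assumes, as the description states, that measure functions rely on atomic operations. The `sum` and `registry` types and `runConcurrently` are hypothetical names for this example and do not correspond to the SDK's actual types.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// sum is a hypothetical aggregator whose measure function is safe to
// call without holding any external lock.
type sum struct{ total atomic.Int64 }

func (s *sum) measure(v int64) { s.total.Add(v) }

// registry guards the callback slice with a mutex.
type registry struct {
	mu        sync.Mutex
	callbacks []func()
}

func (r *registry) add(cb func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.callbacks = append(r.callbacks, cb)
}

// snapshot copies the callbacks under the lock so callers can iterate
// over the copy after the lock is released.
func (r *registry) snapshot() []func() {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make([]func(), len(r.callbacks))
	copy(out, r.callbacks)
	return out
}

// runConcurrently executes a snapshot lock-free from several goroutines
// while each of them also registers a new callback.
func runConcurrently(workers int) (total int64, registered int) {
	s := &sum{}
	r := &registry{}
	r.add(func() { s.measure(1) })
	cbs := r.snapshot() // taken once, before any concurrent registration

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for _, cb := range cbs {
				cb() // lock-free execution over the snapshot
			}
			r.add(func() { s.measure(1) }) // synchronized registration
		}()
	}
	wg.Wait()
	return s.total.Load(), len(r.snapshot())
}

func main() {
	total, registered := runConcurrently(8)
	fmt.Println(total, registered)
}
```

In this sketch, callbacks registered while the snapshot is being executed are not part of that snapshot, so they first run on the next snapshot.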

@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 12, 2026

CLA Missing ID CLA Not Signed

One or more co-authors of this pull request were not found. You must specify co-authors in commit message trailer via:

Co-authored-by: name <email>

Supported Co-authored-by: formats include:

  1. Anything <id+login@users.noreply.github.com> - it will locate your GitHub user by id part.
  2. Anything <login@users.noreply.github.com> - it will locate your GitHub user by login part.
  3. Anything <public-email> - it will locate your GitHub user by public-email part. Note that this email must be made public on Github.
  4. Anything <other-email> - it will locate your GitHub user by other-email part but only if that email was used before for any other CLA as a main commit author.
  5. login <any-valid-email> - it will locate your GitHub user by login part, note that login part must be at least 3 characters long.

Please update your commit message(s) by doing git commit --amend and then git push [--force] and then request re-running CLA check via commenting on this pull request:

/easycla

MrAlias added the blocked:CLA label (Waiting on CLA to be signed before progress can be made) on Mar 12, 2026
…acks

produce() held the pipeline mutex while executing observable callbacks.
If a callback created a new instrument (e.g. a counter), the registration
path called addSync() which tried to acquire the same mutex, causing a
deadlock.

Fix by snapshotting the callbacks under the lock, releasing it before
executing them, then re-acquiring for the aggregation phase. This is
safe because callbacks operate on internally-synchronized aggregation
primitives and don't need the pipeline lock for execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
garyatsierra force-pushed the fix-deadlock-when-registering-counter-in-callback branch from 855c99b to 06835cf on March 14, 2026 08:57