Conversation

agagniere commented Jan 6, 2026

Hi,
I encountered a deadlock in a situation where a callback, called by the metric pipeline's produce function, was trying to acquire a mutex held by another goroutine that was itself stuck waiting to acquire the pipeline's mutex.

It would seem to me that the callbacks, having no access to the pipeline's members, do not need to hold its mutex while they run.
However, a counterargument is that not holding the mutex allows multiCallbacks to be unregistered concurrently, so my PR now allows a callback to be invoked after it has been unregistered (if it was unregistered after the produce function started executing but before the callback is actually called).
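
To make the cycle concrete, here is a minimal sketch of the two goroutines involved (all names are invented; pipeLock stands in for the pipeline's mutex, userMu for the mutex external to this repo):

```go
package main

import "sync"

var (
	pipeLock sync.Mutex // stands in for the pipeline's mutex
	userMu   sync.Mutex // stands in for a mutex external to this repo
)

// Goroutine 1: produce holds the pipeline's mutex while running callbacks.
func produce() {
	pipeLock.Lock()
	defer pipeLock.Unlock()
	callback() // blocks on userMu, which goroutine 2 holds
}

func callback() {
	userMu.Lock()
	defer userMu.Unlock()
	// ... observe some value ...
}

// Goroutine 2: holds userMu and then needs the pipeline's mutex
// (for example to unregister a callback), closing the cycle.
func other() {
	userMu.Lock()
	defer userMu.Unlock()
	pipeLock.Lock() // blocks: produce still holds it -> deadlock
	defer pipeLock.Unlock()
}

func main() {
	go other()
	produce()
}
```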

Because I saw the open issue #3034, I decided to try to kill two birds with one stone and made this first attempt.

Feedback is welcome; please tell me if a different approach is preferred.


linux-foundation-easycla bot commented Jan 6, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.


codecov bot commented Jan 6, 2026

Codecov Report

❌ Patch coverage is 93.54839% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.2%. Comparing base (3dc4ccc) to head (b83e01d).
⚠️ Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
sdk/metric/pipeline.go | 93.5% | 2 Missing ⚠️
Additional details and impacted files

@@          Coverage Diff          @@
##            main   #7755   +/-   ##
=====================================
  Coverage   86.2%   86.2%           
=====================================
  Files        302     302           
  Lines      21991   22011   +20     
=====================================
+ Hits       18968   18986   +18     
- Misses      2642    2645    +3     
+ Partials     381     380    -1     
Files with missing lines | Coverage Δ
sdk/metric/pipeline.go | 90.7% <93.5%> (+0.6%) ⬆️

... and 3 files with indirect coverage changes


dashpole (Contributor) commented Jan 6, 2026

It looks like this probably introduced a race. Take a look at test-race and test-concurrent-safe, and let me know if you need help.

To make sure I understand the issue:

I encountered a deadlock in a situation where a callback, called by the metric pipeline's produce function, was trying to acquire a mutex held by another goroutine that was itself stuck waiting to acquire the pipeline's mutex.

Is this something that depends on particular user behavior (e.g. writing a callback that tries to acquire a mutex)? Or is this something that can simply happen with a "normal" callback implementation? If it requires users to do something, can you provide a reproduction? If it is non-trivial, it might be best to put that, and the description of the problem into an issue.

agagniere (Author)

Is this something that depends on particular user behavior (e.g. writing a callback that tries to acquire a mutex)?

Precisely, it was a situation where a callback wants to acquire some mutex (external to this repo).
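
A rough sketch of what a reproduction could look like; the channel choreography is only there to force the interleaving, and it assumes that Unregister needs the pipeline's mutex while produce holds it:

```go
package main

import (
	"context"
	"sync"

	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func main() {
	var userMu sync.Mutex // the mutex external to this repo

	reader := sdkmetric.NewManualReader()
	meter := sdkmetric.NewMeterProvider(sdkmetric.WithReader(reader)).Meter("repro")

	gauge, _ := meter.Int64ObservableGauge("gauge")

	holdingUserMu := make(chan struct{})
	inCallback := make(chan struct{})

	reg, _ := meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		close(inCallback) // produce now holds the pipeline's mutex
		userMu.Lock()     // blocks: the goroutine below holds userMu
		defer userMu.Unlock()
		o.ObserveInt64(gauge, 1)
		return nil
	}, gauge)

	go func() {
		userMu.Lock()
		defer userMu.Unlock()
		close(holdingUserMu)
		<-inCallback
		_ = reg.Unregister() // blocks waiting for the pipeline's mutex -> deadlock
	}()

	<-holdingUserMu
	var rm metricdata.ResourceMetrics
	_ = reader.Collect(context.Background(), &rm) // never returns
}
```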

dmathieu (Member) commented Jan 7, 2026

Precisely, it was a situation where a callback wants to acquire some mutex (external to this repo).

Could you provide a small reproduction then?

agagniere (Author) commented Jan 7, 2026

@dashpole

It looks like this probably introduced a race. Take a look at test-race and test-concurrent-safe, and let me know if you need help.

Indeed, and you even anticipated it; the offending code is:

	// Access to r.pipe.int64Measures is already guarded b a lock in pipeline.produce.
	// TODO (#5946): Refactor pipeline and observable measures.
	measures := r.pipe.int64Measures[oImpl.observableID]

which was introduced in #5900 (relevant discussion)

So I guess I will modify ObserveFloat64 to acquire the pipeline's lock before accessing its members? Or do you have another recommendation?
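
For illustration, the idea would be something like the following, with invented types (the real observer and pipeline types in sdk/metric look different, and whether the measures should also be invoked under the lock is an open question):

```go
package metricsketch

import "sync"

type observableID int

type measure func(value int64)

type pipeline struct {
	sync.Mutex
	int64Measures map[observableID][]measure
}

type observer struct {
	pipe *pipeline
	id   observableID
}

func (o *observer) ObserveInt64(v int64) {
	// Take the pipeline's mutex for the map read, since callbacks would
	// no longer run under produce's lock with this PR.
	o.pipe.Lock()
	measures := o.pipe.int64Measures[o.id]
	o.pipe.Unlock()

	// Record outside the critical section.
	for _, m := range measures {
		m(v)
	}
}
```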

pellared (Member) commented Jan 7, 2026

run callbacks asynchronously

@open-telemetry/go-maintainers
Wouldn't this be a behavioral change that may be breaking for some users who assume they run synchronously?
Shouldn't this be opt-in (configurable) behavior?

I also posted a comment here: #3034 (comment)

flc1125 (Member) commented Jan 7, 2026

I tend to address the deadlock problem and the asynchrony separately, as they are two different issues.

For the deadlock part, I think we might first copy the data guarded by the lock into a temporary variable, then immediately unlock, and finally handle the relevant callback processing.

agagniere (Author)

@flc1125

For the deadlock part, I think we might first copy the data guarded by the lock into a temporary variable, then immediately unlock, and finally handle the relevant callback processing.

Yes, this is exactly the approach I went with (a rough sketch follows the list):

  • acquire the mutex
  • copy the list of callbacks
  • release the mutex
  • run the callbacks (concurrently or not, it doesn't matter to me)
  • re-acquire the mutex
  • fill the scope metrics
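
A rough sketch of that sequence, with invented types (the real produce also iterates instruments and multiCallbacks):

```go
package metricsketch

import (
	"context"
	"slices"
	"sync"
)

type pipeline struct {
	sync.Mutex
	callbacks []func(context.Context) error
}

func (p *pipeline) produce(ctx context.Context) error {
	// Acquire the mutex and copy the list of callbacks.
	p.Lock()
	callbacks := slices.Clone(p.callbacks)
	p.Unlock()

	// Run the callbacks without holding the pipeline's mutex, so a
	// callback blocking on an external mutex cannot deadlock with a
	// goroutine waiting for the pipeline's mutex.
	var firstErr error
	for _, cb := range callbacks {
		if err := cb(ctx); err != nil && firstErr == nil {
			firstErr = err
		}
	}

	// Re-acquire the mutex and fill the scope metrics.
	p.Lock()
	defer p.Unlock()
	// ... aggregate measurements into metricdata.ScopeMetrics here ...

	return firstErr
}
```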

I tend to address the deadlock problem and the asynchrony separately, as they are two different issues.

Sure, it seemed from #3034 that there was demand for asynchrony, but if opinions diverge let's just focus on the locking part and leave asynchrony for a later PR.

- do not own the mutex when calling callbacks
- at the end, return the first callback error if any
