fix: Add recursion guard in LoggingHandler.emit to prevent deadlock#4302
Conversation
|
|
...n/opentelemetry-instrumentation-logging/src/opentelemetry/instrumentation/logging/handler.py
Outdated
Show resolved
Hide resolved
|
@DylanRussell Fixed the import sorting per ruff lint rules. Could you approve the CI workflow run? Thanks! |
...n/opentelemetry-instrumentation-logging/src/opentelemetry/instrumentation/logging/handler.py
Outdated
Show resolved
Hide resolved
|
@deepanshuclickpost Please run |
219944b to
2cc9e2f
Compare
|
I'm thinking we might want to explicitly add a |
| _propagate_false_logger = logging.getLogger(__name__ + ".propagate.false") | ||
| _propagate_false_logger.propagate = False |
There was a problem hiding this comment.
What do we think about
| _propagate_false_logger = logging.getLogger(__name__ + ".propagate.false") | |
| _propagate_false_logger.propagate = False | |
| _internal_logger = logging.getLogger(__name__ + ".internal") | |
| _internal_logger.propagate = False |
as an alternate name?
There was a problem hiding this comment.
@herin049 Great Catch !. I have made the changes.
Thanks
LoggingHandler.emit() can deadlock when _translate() triggers internal logging (e.g., _clean_extended_attribute calling _logger.warning() for invalid attribute types). The warning re-enters the handler chain on the same thread, creating infinite recursion that prevents the handler lock from ever being released, blocking all other threads. This adds a ContextVar-based guard that detects re-entrant calls to emit() on the same thread and skips the recursive call, preventing the deadlock while preserving all application logs. Fixes open-telemetry/opentelemetry-python#3858 Relates to open-telemetry/opentelemetry-python#3560 Relates to open-telemetry/opentelemetry-python#4585
2cc9e2f to
e84b623
Compare
|
@herin049 CI is awaiting workflow approval.Could someone please trigger it? All review feedback has been addressed. Thanks! |
|
@xrmx This PR has 3 approvals and all review feedback has been addressed. Could you please approve the CI workflows so the checks can run? Thanks! |
|
@xrmx 784 checks have passed with 3 approvals. The remaining 94 are awaiting workflow approval — likely triggered because the branch is out of date (can be cleanly merged). Could you please merge when you get a chance? Thanks! |
LoggingHandler.emit() can deadlock when _translate() triggers internal logging (e.g., _clean_extended_attribute calling _logger.warning() for invalid attribute types). The warning re-enters the handler chain on the same thread, which tries to acquire the same handler lock, causing a permanent deadlock that freezes all worker threads.
This adds a threading.local() guard that detects re-entrant calls to emit() on the same thread and skips the recursive call, preventing the deadlock while preserving all application logs.
Fixes open-telemetry/opentelemetry-python#3858
Relates to open-telemetry/opentelemetry-python#3560
Relates to open-telemetry/opentelemetry-python#4585
Description
As suggested by @xrmx in opentelemetry-python#4954, opening this fix against the contrib repo since
LoggingHandlerinopentelemetry-sdkis deprecated.LoggingHandler.emit()can deadlock when_translate()triggers internal logging. The call chain is:emit()→_translate()→BoundedAttributes()→_clean_extended_attribute()_clean_extended_attribute()finds a non-primitive attribute and calls_logger.warning("Invalid type...")Once one thread deadlocks, every other thread that attempts to log blocks behind the same lock. In production (Gunicorn with 4 workers × 5 threads), this caused 100% request failure (Nginx 499/504) and complete service outage.
Thread dump captured via
py-spyduring production outage confirmed all 20 request threads stuck atacquire (logging/__init__.py:927):While #4808 (in 1.39.0) reduces the most common trigger by converting unknown types to
str(), the deadlock can still occur via the invalid key path (_logger.warning("invalid key...")) and whenstr(value)itself raises. This recursion guard protects against all current and future_logger.warning()calls in the pipeline as a defense-in-depth measure.Fixes open-telemetry/opentelemetry-python#3858
Type of change
How Has This Been Tested?
4 new unit tests added in
test_logging_handler_recursion.py:test_recursive_emit_is_skipped— verifies guard blocks recursive callstest_normal_emit_works— verifies normal logging is not affectedtest_guard_is_reset_after_emit— verifies guard resets after emit completestest_guard_is_thread_local— verifies guard on one thread doesn't block other threadsDoes This PR Require a Core Repo Change?
Checklist: