
fix: Add recursion guard in LoggingHandler.emit to prevent deadlock #4302

Open
deepanshuclickpost wants to merge 1 commit into open-telemetry:main from deepanshuclickpost:fix/logging-handler-recursion-deadlock

Conversation

@deepanshuclickpost

@deepanshuclickpost deepanshuclickpost commented Mar 5, 2026

LoggingHandler.emit() can deadlock when _translate() triggers internal logging (e.g., _clean_extended_attribute calling _logger.warning() for invalid attribute types). The warning re-enters the handler chain on the same thread, which tries to acquire the same handler lock, causing a permanent deadlock that freezes all worker threads.

This adds a threading.local() guard that detects re-entrant calls to emit() on the same thread and skips the recursive call, preventing the deadlock while preserving all application logs.

Fixes open-telemetry/opentelemetry-python#3858
Relates to open-telemetry/opentelemetry-python#3560
Relates to open-telemetry/opentelemetry-python#4585

Description

As suggested by @xrmx in opentelemetry-python#4954, opening this fix against the contrib repo since LoggingHandler in opentelemetry-sdk is deprecated.

LoggingHandler.emit() can deadlock when _translate() triggers internal logging. The call chain is:

  1. emit() → _translate() → BoundedAttributes() → _clean_extended_attribute()
  2. _clean_extended_attribute() finds a non-primitive attribute and calls _logger.warning("Invalid type...")
  3. This warning re-enters the logging handler chain on the same thread
  4. The handler tries to acquire the same lock already held by step 1
  5. Deadlock — the thread waits for itself, lock is never released
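
The re-entrancy in steps 2–4 can be reproduced with a toy handler whose emit() itself logs (names below are illustrative, not the SDK's actual code; the cap just keeps the demo from re-entering forever):

```python
import logging

hits = []

class ReentrantHandler(logging.Handler):
    """Toy handler whose emit() itself logs, mimicking _translate()
    triggering an internal _logger.warning()."""

    def emit(self, record):
        hits.append(record.getMessage())
        if len(hits) < 5:  # cap so the demo terminates instead of re-entering forever
            logging.getLogger("demo").warning("re-entered %d", len(hits))

logger = logging.getLogger("demo")
logger.addHandler(ReentrantHandler())
logger.warning("first")
print(len(hits))  # 5: each emit re-entered the handler until the cap
```

Each warning issued inside emit() flows back into the same handler on the same thread, which is exactly the re-entry the guard needs to break.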

Once one thread deadlocks, every other thread that attempts to log blocks behind the same lock. In production (Gunicorn with 4 workers × 5 threads), this caused 100% request failure (Nginx 499/504) and complete service outage.

Thread dump captured via py-spy during production outage confirmed all 20 request threads stuck at acquire (logging/__init__.py:927):

acquire (logging/__init__.py:927)                       ← STUCK
warning (logging/__init__.py:1501)                      ← OTel internal warning
_clean_extended_attribute (attributes/__init__.py:209)
_translate (_logs/_internal/__init__.py:637)
emit (_logs/_internal/__init__.py:656)                  ← OTel LoggingHandler
handle (logging/__init__.py:978)                        ← Lock first acquired here

While #4808 (in 1.39.0) reduces the most common trigger by converting unknown types to str(), the deadlock can still occur via the invalid key path (_logger.warning("invalid key...")) and when str(value) itself raises. This recursion guard protects against all current and future _logger.warning() calls in the pipeline as a defense-in-depth measure.
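
A minimal sketch of the guard described above, using threading.local() (names and the emit() body are illustrative; the real handler does translation and export where the stand-in comment is):

```python
import logging
import threading

class GuardedHandler(logging.Handler):
    _guard = threading.local()

    def __init__(self):
        super().__init__()
        self.seen = []

    def emit(self, record):
        # Re-entrant call on the same thread: skip instead of recursing.
        if getattr(self._guard, "emitting", False):
            return
        self._guard.emitting = True
        try:
            self.seen.append(record.getMessage())
            # Stand-in for _translate(); may itself log a warning.
            logging.getLogger("otel.guard.demo").warning("Invalid type ...")
        finally:
            self._guard.emitting = False

log = logging.getLogger("otel.guard.demo")
handler = GuardedHandler()
log.addHandler(handler)
log.warning("app message")
print(handler.seen)  # ['app message']: the internal warning was skipped
```

The try/finally reset is what keeps the guard from permanently suppressing logging after an exception inside emit().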

Fixes open-telemetry/opentelemetry-python#3858

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

4 new unit tests added in test_logging_handler_recursion.py:

  • test_recursive_emit_is_skipped — verifies guard blocks recursive calls
  • test_normal_emit_works — verifies normal logging is not affected
  • test_guard_is_reset_after_emit — verifies guard resets after emit completes
  • test_guard_is_thread_local — verifies guard on one thread doesn't block other threads
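
The thread-local property the last test relies on can be checked in isolation (a sketch only; the actual test file is not shown here):

```python
import threading

guard = threading.local()
guard.emitting = True  # main thread is "inside emit"

seen_on_worker = []

def worker():
    # A fresh thread sees its own threading.local state, not the main thread's.
    seen_on_worker.append(getattr(guard, "emitting", False))

t = threading.Thread(target=worker)
t.start()
t.join()
print(seen_on_worker)  # [False]: the guard on one thread doesn't block another
```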

Does This PR Require a Core Repo Change?

  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 5, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: deepanshu406 / name: Deepanshu Kartikey (e84b623)

@deepanshu406

@DylanRussell Fixed the import sorting per ruff lint rules. Could you approve the CI workflow run? Thanks!

@xrmx xrmx moved this to Ready for review in Python PR digest Mar 5, 2026
@deepanshuclickpost deepanshuclickpost requested a review from a team as a code owner March 5, 2026 16:30
@xrmx
Contributor

xrmx commented Mar 6, 2026

@deepanshuclickpost Please run tox -e precommit

@tammy-baylis-swi tammy-baylis-swi moved this from Ready for review to Approved PRs in Python PR digest Mar 6, 2026
@deepanshuclickpost deepanshuclickpost force-pushed the fix/logging-handler-recursion-deadlock branch 2 times, most recently from 219944b to 2cc9e2f Compare March 7, 2026 04:48
@herin049
Contributor

herin049 commented Mar 9, 2026

I'm thinking we might want to explicitly add a StreamHandler to the internal logger instead of leaving it with no handlers, just in case a user unknowingly sets logging.lastResort to the OpenTelemetry handler (which would trigger the deadlock again). Let me know what you think.
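
A sketch of that suggestion, with illustrative names: a dedicated internal logger with propagate=False and its own StreamHandler, so its warnings go to a stream directly and never re-enter the root handler chain:

```python
import io
import logging

stream = io.StringIO()
_internal_logger = logging.getLogger("opentelemetry._internal.demo")
_internal_logger.propagate = False
_internal_logger.addHandler(logging.StreamHandler(stream))

root_seen = []

class RootHandler(logging.Handler):
    def emit(self, record):
        root_seen.append(record.getMessage())

logging.getLogger().addHandler(RootHandler())

_internal_logger.warning("invalid key")
print(root_seen)          # []: nothing propagated to the root handlers
print(stream.getvalue())  # the warning went only to the dedicated stream
```

Because the internal logger has its own handler, lastResort is never consulted for it, so repointing lastResort cannot route these warnings back into the OTel handler.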

Comment on lines +39 to +40
_propagate_false_logger = logging.getLogger(__name__ + ".propagate.false")
_propagate_false_logger.propagate = False
Contributor


What do we think about

Suggested change
_propagate_false_logger = logging.getLogger(__name__ + ".propagate.false")
_propagate_false_logger.propagate = False
_internal_logger = logging.getLogger(__name__ + ".internal")
_internal_logger.propagate = False

as an alternate name?


@herin049 Great catch! I have made the changes.

Thanks

LoggingHandler.emit() can deadlock when _translate() triggers internal
logging (e.g., _clean_extended_attribute calling _logger.warning() for
invalid attribute types). The warning re-enters the handler chain on the
same thread, creating infinite recursion that prevents the handler lock
from ever being released, blocking all other threads.

This adds a ContextVar-based guard that detects re-entrant calls to
emit() on the same thread and skips the recursive call, preventing the
deadlock while preserving all application logs.

Fixes open-telemetry/opentelemetry-python#3858
Relates to open-telemetry/opentelemetry-python#3560
Relates to open-telemetry/opentelemetry-python#4585
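
A sketch of the ContextVar-based guard named in the commit message (illustrative names; unlike threading.local, a ContextVar is also isolated per asyncio task):

```python
import contextvars
import logging

_emitting = contextvars.ContextVar("otel_emitting", default=False)

class GuardedHandler(logging.Handler):
    def __init__(self):
        super().__init__()
        self.seen = []

    def emit(self, record):
        if _emitting.get():
            return  # re-entrant call: skip instead of recursing
        token = _emitting.set(True)
        try:
            self.seen.append(record.getMessage())
            # Stand-in for _translate(); may itself log a warning.
            logging.getLogger("otel.ctx.demo").warning("Invalid type ...")
        finally:
            _emitting.reset(token)

log = logging.getLogger("otel.ctx.demo")
handler = GuardedHandler()
log.addHandler(handler)
log.warning("app message")
print(handler.seen)  # ['app message']: the re-entrant warning was skipped
```
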
@deepanshuclickpost deepanshuclickpost force-pushed the fix/logging-handler-recursion-deadlock branch from 2cc9e2f to e84b623 Compare March 9, 2026 02:02
@deepanshu406

@herin049 CI is awaiting workflow approval. Could someone please trigger it? All review feedback has been addressed.

Thanks!

@deepanshu406

@xrmx This PR has 3 approvals and all review feedback has been addressed. Could you please approve the CI workflows so the checks can run? Thanks!

@deepanshu406

@xrmx 784 checks have passed with 3 approvals. The remaining 94 are awaiting workflow approval, likely because the branch is out of date (it can still be cleanly merged). Could you please merge when you get a chance? Thanks!



Projects

Status: Approved PRs

Development

Successfully merging this pull request may close these issues.

HTTP Worker Thread Blocked and Invalid Type Warning from Opentelemetry _logs

6 participants