

@frankie567

A more complete and reasonable fix for the problem I described in #802.

Problem description

In situations with a lot of tasks being executed, some of them raising exceptions with large tracebacks, we can encounter huge spikes in memory consumption that may lead to OOM issues. This is mainly because the garbage collector may not run soon enough under high pressure. In a sense, this is similar to the old issue #351.

The problem has two sources (a standalone sketch of the retention pattern follows the two snippets below):

  1. The coroutine retains its exception context until it's garbage collected:

async def wrapped_coro() -> R:
    try:
        return await coro
    finally:
        done.set()

  2. The future retains the exception context and traceback raised by the coroutine until it's garbage collected:

future = asyncio.run_coroutine_threadsafe(wrapped_coro(), self.loop)
try:
    while True:
        try:
            # Use a timeout to be able to catch asynchronously
            # raised dramatiq exceptions (Interrupt).
            return future.result(timeout=self.interrupt_check_ival)
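
Both snippets boil down to the same retention pattern. As a minimal, standalone illustration (no dramatiq involved; failing_task and the 50 MB payload are made up), an exception parked on a Future keeps every local of the failing frame alive until the Future itself goes away:

import gc
from concurrent.futures import Future

def failing_task():
    big_payload = bytearray(50 * 1024 * 1024)  # stand-in for a large local
    raise RuntimeError("boom")

future = Future()
try:
    failing_task()
except RuntimeError as exc:
    future.set_exception(exc)

# gc.collect() frees nothing here: the Future holds the exception, the
# exception holds its __traceback__, and the traceback holds failing_task's
# frame together with all of its locals, including the 50 MB payload.
gc.collect()
tb = future.exception().__traceback__
assert len(tb.tb_next.tb_frame.f_locals["big_payload"]) == 50 * 1024 * 1024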

Proposed solution

The solution is to detach the exception context from the coroutine and the future so we can manage its lifecycle manually and force garbage collection. Basically, we use local variables to store the coroutine's result and exception instead of relying on future.result() directly.

Similar to #351, we manually call gc.collect() after handling the exception to make sure it is cleared right away.
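
As a rough sketch of the idea (standalone and simplified, not the actual diff; run_on_loop, the interrupt_check_ival default, and the exact cleanup details are illustrative), the coroutine's outcome lives in plain locals, the future never stores the exception, and every reference is dropped explicitly before forcing a collection:

import asyncio
import concurrent.futures
import gc

def run_on_loop(loop: asyncio.AbstractEventLoop, coro, interrupt_check_ival: float = 1.0):
    result = None
    exception = None

    async def wrapped_coro():
        nonlocal result, exception
        try:
            result = await coro
        except BaseException as exc:
            # Detach the exception from the coroutine/future: the future
            # completes "successfully" and never stores the traceback.
            exception = exc

    future = asyncio.run_coroutine_threadsafe(wrapped_coro(), loop)
    try:
        while True:
            try:
                # Keep the timeout so asynchronously raised dramatiq
                # exceptions (Interrupt) can still be caught in this thread.
                future.result(timeout=interrupt_check_ival)
                break
            except concurrent.futures.TimeoutError:
                continue
        if exception is not None:
            raise exception
        return result
    finally:
        had_exception = exception is not None
        # Drop every local reference to the future, the exception and its
        # traceback before leaving this frame.
        del future
        result = exception = None
        if had_exception:
            # As in #351, force a collection right away instead of waiting
            # for the GC to run on its own.
            gc.collect()

Because wrapped_coro swallows the exception itself, future.result() only signals completion; the traceback is reachable solely through the local exception variable, which is explicitly cleared before the forced collection.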

I also added a unit test for this. Admittedly, it's a bit complex since it's hard to replicate this behaviour inside pytest with the stub broker.

Side note

A happy side effect of this change is that it makes the TimeoutError fix (#791) a bit more readable, since we now handle the TimeoutError raised inside the coroutine directly.

@LincolnPuzey
Collaborator

  1. The coroutine retains its exception context until it's garbage collected:
  2. The future retains the exception context and traceback raised by the coroutine until it's garbage collected:

Is there actually a memory leak? Or is garbage collection just happening too infrequently?

@frankie567
Author

I'm not sure. I've run my test script in two scenarios: high pressure (one task every 100 ms) and low pressure (one task every 1 second). Here are the results:

High pressure

[Figure_fast: memory consumption under high pressure, one task every 100 ms]

Low pressure

[Figure_slow: memory consumption under low pressure, one task every 1 second]

In both cases, memory grows rapidly and then stabilises at some point, but at a very high value. In the low pressure scenario, we do see some drops, which shows that memory is freed eventually, but we stay on that plateau. My reading is that the EventLoopThread might retain the objects from the last run until a new run overwrites them.


With the fix in this PR, here is the memory consumption with the same script (one task every 100 ms):

[Figure_fast_fix: memory consumption with the fix, one task every 100 ms]
